PySpark - Installation and configuration on Idea (PyCharm)


Installation and configuration of a PySpark (Spark Python) environment on Idea (PyCharm)


Prerequisite: a Spark distribution is already installed locally. See Spark - Local Installation


Install Python

  • Install Anaconda with Python 2.7 (Python 3.7 is also supported)
  • Add it as an interpreter in IDEA

  • Add Python as a framework

Install PySpark

cd venv\Scripts
pip install "pyspark==2.3.0"
Collecting pyspark
Collecting py4j==0.10.6 (from pyspark)
  Using cached
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.6 pyspark-2.3.0


Install third-party packages

  • Install psutil for better spilling support
cd venv\Scripts
pip install psutil
Collecting psutil
  Downloading (219kB)
    100% |################################| 225kB 2.4MB/s
Installing collected packages: psutil
Successfully installed psutil-5.4.5
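
To confirm that the packages landed in the interpreter that IDEA uses, a small check such as the following can be run from the project. This is a minimal sketch: the helper name installed_version is hypothetical, and it assumes that each package exposes a __version__ attribute (it falls back to "unknown" otherwise).

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Not every module defines __version__, so fall back to a placeholder.
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    for name in ("pyspark", "py4j", "psutil"):
        print("{}: {}".format(name, installed_version(name)))
```

A value of None for any of the three names suggests that pip installed into a different environment than the one selected as interpreter.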

Default Run Configuration

  • Change the default run parameters for Python.
  • Add HADOOP_HOME as an environment variable (if it is not set at the OS level) and set the working directory to your project home.

Do not add SPARK_HOME. Otherwise, PySpark calls the spark-submit.cmd script and the PYTHONPATH is not set.

If you want to set SPARK_HOME, you also need to set PYTHONPATH (you can see how in pyspark2.cmd).
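
If SPARK_HOME is kept, the PYTHONPATH value can be derived from it. The sketch below assumes the usual layout of a Spark distribution, where the Python sources live in the python folder and the py4j sources in a zip under python/lib; the helper name spark_python_path is hypothetical.

```python
import glob
import os


def spark_python_path(spark_home):
    """Build the PYTHONPATH entries for a Spark distribution rooted at spark_home."""
    python_dir = os.path.join(spark_home, "python")
    # The py4j version in the zip name varies by Spark release, so match it with a glob.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return os.pathsep.join([python_dir] + py4j_zips)


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home:
        print("PYTHONPATH=" + spark_python_path(spark_home))
    else:
        print("SPARK_HOME is not set; the pip-installed pyspark is used")
```

The printed value can be pasted into the environment variables of the run configuration, next to SPARK_HOME.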


Run a test script

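from pyspark import SparkContext, SparkConf

# Configure and start the Spark application
conf = SparkConf().setAppName('MyFirstStandaloneApp')
sc = SparkContext(conf=conf)

# The path is relative to the working directory set in the run configuration
text_file = sc.textFile("./src/main/resources/shakespeare.txt")

# Split each line into words, pair each word with 1, then sum the pairs per word
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

# counts holds one (word, total) pair per distinct word
print("Number of elements: " + str(counts.count()))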
  • Run the script

  • Output
2018-06-04 22:48:32 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Number of elements: 67109
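
The number printed is the count of distinct words, not the total number of words. As a cross-check on a small file, the same logic can be sketched without Spark. The function name count_distinct_words is hypothetical, and the input is assumed to be a list of lines without their trailing newline.

```python
from collections import Counter


def count_distinct_words(lines):
    """Count distinct tokens the way the Spark job does: split on a single space."""
    counts = Counter()
    for line in lines:
        # split(" ") keeps empty strings for consecutive spaces, as in the Spark job.
        counts.update(line.split(" "))
    return len(counts)


if __name__ == "__main__":
    sample = ["to be or not to be", "that is the question"]
    print("Number of elements: " + str(count_distinct_words(sample)))
```

Because the split is on a single space, the empty string counts as one element whenever a line contains consecutive spaces or is empty.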



The following kind of error means that SPARK_HOME is set but PYTHONPATH is not.

Method showString does not exist


py4j.Py4JException: Method showString([class java.lang.Integer, class java.lang.Integer, class java.lang.Boolean]) does not exist

SPARK_HOME is set somewhere, but PYTHONPATH is not.

Two resolutions are possible:

  • 1 - Remove SPARK_HOME
  • 2 - or add PYTHONPATH
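
When SPARK_HOME comes from the operating system rather than from the run configuration, the first resolution can also be applied from the script itself. This is a minimal sketch under one assumption: that PySpark reads SPARK_HOME from the environment when the SparkContext is created, so the variable has to be removed before that point. The helper name drop_spark_home is hypothetical.

```python
import os


def drop_spark_home(environ):
    """Remove SPARK_HOME from the given environment mapping.

    Returns the removed value, or None if the variable was not set.
    """
    return environ.pop("SPARK_HOME", None)


if __name__ == "__main__":
    # Call this before creating the SparkContext.
    removed = drop_spark_home(os.environ)
    print("Removed SPARK_HOME: {}".format(removed))
```

The change only affects the current Python process; the variable stays defined at the OS level.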



