PySpark - Installation and configuration on Idea (PyCharm)

About

This page describes how to install and configure a PySpark (Spark Python) environment on Idea (PyCharm).

Prerequisites

You have already installed a Spark distribution locally. See Spark - Local Installation.

Steps

Install Python

  • Install Anaconda with Python 2.7 (Python 3.7 is also supported)
  • Add it as the interpreter in IDEA

Idea Python Interpreter Venv

  • Add Python as a framework

Idea Add Python Framework

Install Spark

cd venv\Scripts
pip install "pyspark==2.3.0"
Collecting pyspark
Collecting py4j==0.10.6 (from pyspark)
  Using cached https://files.pythonhosted.org/packages/4a/08/162710786239aa72bd72bb46c64f2b02f54250412ba928cb373b30699139/py4j-0.10.6-py2.py3-none-any.whl
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.6 pyspark-2.3.0

Alternatively, install the pyspark package from the IDEA interface (screenshot: Idea Pyspark Install).

Install third-party packages

  • Install psutil for better support of spilling
cd venv\Scripts
pip install psutil
Collecting psutil
  Downloading https://files.pythonhosted.org/packages/b6/ca/2d23b37e9b30908174d2cb596f60f06b3858856a2e595c931f7d4d640c03/psutil-5.4.5-cp27-none-win_amd64.whl (219kB)
    100% |################################| 225kB 2.4MB/s
Installing collected packages: psutil
Successfully installed psutil-5.4.5
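
To confirm that the packages installed above are visible to the interpreter configured in IDEA, a short check such as the following can be run from the IDE. It is a minimal sketch that assumes Python 3.8 or later (for `importlib.metadata`), which is newer than the interpreters mentioned above; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Packages installed in the steps above (py4j comes in as a pyspark dependency).
for name in ("pyspark", "py4j", "psutil"):
    print(name, installed_version(name) or "not installed")
```

If a package shows as "not installed" here although pip reported success, the run configuration is probably using a different interpreter than the venv where pip was run.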

Default Run Configuration

  • Change the default run parameters for Python.
  • Add HADOOP_HOME as an environment variable (if it is not set at the OS level) and set the working directory to your project home.

Do not add SPARK_HOME. Otherwise the spark-submit.cmd script is called and the PYTHONPATH is not set.

If you want to set SPARK_HOME, you also need to add the PYTHONPATH (you can see it in pyspark2.cmd):

PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.4-src.zip;%PYTHONPATH%
PYTHONPATH=C:\spark-2.2.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip;C:\spark-2.2.0-bin-hadoop2.7\python;
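
The py4j zip name changes with the Spark release (0.10.4 here, 0.10.6 with pyspark 2.3.0), so the value can be derived from SPARK_HOME instead of being typed by hand. The following is a minimal sketch of that idea. The function name `spark_python_path` is hypothetical, and the layout `python\lib\py4j-*-src.zip` is assumed from the example above.

```python
import glob
import os


def spark_python_path(spark_home):
    """Build PYTHONPATH entries for a Spark home: the python dir plus the py4j source zip."""
    python_dir = os.path.join(spark_home, "python")
    # Look the py4j zip up instead of hard-coding its version.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return os.pathsep.join([python_dir] + py4j_zips)


# Falls back to the example path from this page when SPARK_HOME is not set.
print(spark_python_path(os.environ.get("SPARK_HOME", r"C:\spark-2.2.0-bin-hadoop2.7")))
```

The printed value can be pasted into the PYTHONPATH field of the run configuration.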

Idea Default Spark Python

Run a test script

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('MyFirstStandaloneApp')
sc = SparkContext(conf=conf)

text_file = sc.textFile("./src/main/resources/shakespeare.txt")

counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

print ("Number of elements: " + str(counts.count()))
  • Run the script

Idea Local Pyspark Run

  • Output
2018-06-04 22:48:32 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Number of elements: 67109
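
To see what the test script counts without starting Spark, the same pipeline can be sketched in plain Python: split every line on a single space, count each word, and report the number of distinct words. This is an illustrative equivalent, not part of the original script, and the function name `count_words` is hypothetical.

```python
from collections import Counter


def count_words(lines):
    """Mirror flatMap(split) -> map((word, 1)) -> reduceByKey(add) on an iterable of lines."""
    counts = Counter()
    for line in lines:
        # Strip the line terminator, then split on a single space like the Spark lambda does.
        counts.update(line.rstrip("\r\n").split(" "))
    return counts


sample = ["to be or not to be", "that is the question"]
counts = count_words(sample)
# The number of distinct keys corresponds to counts.count() in the Spark script.
print("Number of elements: " + str(len(counts)))
```

As in the Spark version, splitting on a single space means that consecutive spaces produce empty strings, which are counted as a word of their own.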

Support

_PYSPARK_DRIVER_CALLBACK_HOST

With this kind of error, SPARK_HOME is set but PYTHONPATH is not.

Method showString does not exist

Error:

py4j.Py4JException: Method showString([class java.lang.Integer, class java.lang.Integer, class java.lang.Boolean]) does not exist

SPARK_HOME is set somewhere but PYTHONPATH is not.

Two resolutions are possible:

  • 1 - Remove SPARK_HOME
  • 2 - or add PYTHONPATH

Example:

set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.4-src.zip;%PYTHONPATH%
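
Both errors above come from the same combination of variables, so it can help to check the environment that the run configuration actually passes to the script. The following is a minimal sketch, with a hypothetical helper `diagnose_spark_env`. It only checks that the Spark `python` directory is on PYTHONPATH and does not verify the py4j zip.

```python
import os


def _norm(path):
    """Normalize a path so that comparisons ignore case (on Windows) and separators."""
    return os.path.normcase(os.path.normpath(path))


def diagnose_spark_env(env):
    """Report whether SPARK_HOME is set without the matching PYTHONPATH entry."""
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return "ok: SPARK_HOME is not set"
    expected = _norm(os.path.join(spark_home, "python"))
    entries = [_norm(p) for p in env.get("PYTHONPATH", "").split(os.pathsep) if p]
    if expected in entries:
        return "ok: SPARK_HOME is set and PYTHONPATH contains its python directory"
    return "problem: SPARK_HOME is set but PYTHONPATH does not contain " + expected


print(diagnose_spark_env(os.environ))
```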
