PySpark - SparkSession


The Spark session (SparkSession, or SQLContext before Spark 2.0) in PySpark.

In the PySpark shell, the session is available as the variable spark.


If SPARK_HOME is set

If SPARK_HOME is set, getting a SparkSession makes the Python script call SPARK_HOME\bin\spark-submit, which in turn calls SPARK_HOME\bin\spark-class2.

Example: the following SparkSession builder code:

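from pyspark.sql import SparkSession

# Build (or reuse) a session; each .config() pair is passed on as a Spark property
spark = SparkSession \
    .builder \
    .appName("nico app") \
    .config("spark.debug.maxToStringFields", "50") \
    .getOrCreate()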

This results in the following command:

java ^
  -cp "C:\spark-2.2.0-bin-hadoop2.7\bin\..\conf\;C:\spark-2.2.0-bin-hadoop2.7\bin\..\jars\*" ^
  -Xmx1g org.apache.spark.deploy.SparkSubmit ^
  --conf "spark.debug.maxToStringFields=50" ^
  --conf " app" ^
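
The mapping from builder settings to --conf arguments can be sketched as follows. This is a minimal illustration, not PySpark's actual launcher code: the function build_submit_command and its parameters are hypothetical, and it assumes that appName is stored as the spark.app.name property.

```python
def build_submit_command(spark_home, conf, on_windows=True):
    """Hypothetical sketch: assemble a spark-submit command line.

    spark_home: value of the SPARK_HOME environment variable
    conf:       dict of Spark properties set on the builder
    """
    script = "spark-submit.cmd" if on_windows else "spark-submit"
    sep = "\\" if on_windows else "/"
    # The launcher script lives in the bin directory under SPARK_HOME
    command = [sep.join([spark_home.rstrip("\\/"), "bin", script])]
    # Each property becomes one "--conf key=value" pair
    for key, value in conf.items():
        command += ["--conf", f"{key}={value}"]
    return command


cmd = build_submit_command(
    r"C:\spark-2.2.0-bin-hadoop2.7",
    {
        "spark.app.name": "nico app",  # assumed to be what appName("nico app") sets
        "spark.debug.maxToStringFields": "50",
    },
)
print(cmd)
```

Passing the arguments as a list rather than a single string avoids most quoting problems when a value contains spaces, such as the application name here.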

If SPARK_HOME is not set

It is not clear what happens in this case, but PySpark seems to call Java directly. The following warning is logged:

2018-07-02 12:46:11 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
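
To find out which of the two cases applies, the environment can be inspected before the session is created. A minimal sketch, where the helper name describe_spark_home is hypothetical:

```python
import os


def describe_spark_home(env=None):
    """Report whether SPARK_HOME is defined in the given environment mapping."""
    # Default to the real process environment when no mapping is passed
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return "SPARK_HOME is not set"
    return f"SPARK_HOME is set to {spark_home}"


print(describe_spark_home())
```

Running this in the same interpreter that creates the SparkSession shows the value that Python process actually sees, which may differ from the one in another shell.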

Support: "is not recognized as an internal or external command"

'C:\spark-2.2.0-bin-hadoop2.7\bin\spark-submit2.cmd" --conf "' is not recognized as an internal or external command

Possible resolution: check your spark-submit.cmd script and remove the quotes around the spark-submit2.cmd path.

  • Bad:
cmd /V /E /C "%~dp0spark-submit2.cmd" %*
  • Good:
cmd /V /E /C %~dp0spark-submit2.cmd %*
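
The check above can be automated with a small script that scans the text of spark-submit.cmd for a quoted call. This is only a sketch: the helper has_quoted_call and the regular expression are illustrative assumptions, not part of Spark.

```python
import re

# Matches a double-quoted path ending in spark-submit2.cmd,
# for example "%~dp0spark-submit2.cmd"
QUOTED_CALL = re.compile(r'"[^"]*spark-submit2\.cmd"')


def has_quoted_call(script_text):
    """Return True if any line calls spark-submit2.cmd through a quoted path."""
    return any(QUOTED_CALL.search(line) for line in script_text.splitlines())


bad = 'cmd /V /E /C "%~dp0spark-submit2.cmd" %*'
good = 'cmd /V /E /C %~dp0spark-submit2.cmd %*'
print(has_quoted_call(bad), has_quoted_call(good))
```

Note that an unquoted path will presumably fail if Spark is installed in a directory whose name contains spaces, so this workaround fits installs such as C:\spark-2.2.0-bin-hadoop2.7.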

