Jupyter - SparkMagic

About

Sparkmagic is a set of kernels and IPython magics for working with Spark clusters through Livy in Jupyter notebooks.

Installation Steps

Package Installation

Start a shell with administrator rights (the Anaconda shell if you installed Jupyter with Anaconda)

pip install sparkmagic

Show the installed package information

pip show sparkmagic

Name: sparkmagic
Version: 0.12.5
Summary: SparkMagic: Spark execution via Livy
Home-page: https://github.com/jupyter-incubator/sparkmagic
Author: Jupyter Development Team
Author-email: [email protected]
License: BSD 3-clause
Location: c:\anaconda\lib\site-packages
Requires: autovizwidget, pandas, nose, requests, tornado, hdijupyterutils, numpy, ipython, ipykernel, notebook, ipywidgets, requests-kerberos, mock
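
The same check can be made from Python, for example in a notebook cell. This is a minimal sketch using the standard-library importlib.metadata module (Python 3.8 or later); the helper name installed_version is illustrative and not part of Sparkmagic.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string, or None when sparkmagic is not installed.
print(installed_version("sparkmagic"))
```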

Enable Extensions

Make sure that the ipywidgets extension is enabled

jupyter nbextension enable --py --sys-prefix widgetsnbextension

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: ok

Install the wrapper kernels.

# Location taken from the sparkmagic package info (pip show sparkmagic)
cd c:\anaconda\lib\site-packages
jupyter-kernelspec install sparkmagic/kernels/sparkkernel
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
jupyter-kernelspec install sparkmagic/kernels/pyspark3kernel
jupyter-kernelspec install sparkmagic/kernels/sparkrkernel

[InstallKernelSpec] Installed kernelspec sparkkernel in C:\ProgramData\jupyter\kernels\sparkkernel
[InstallKernelSpec] Installed kernelspec pysparkkernel in C:\ProgramData\jupyter\kernels\pysparkkernel
[InstallKernelSpec] Installed kernelspec pyspark3kernel in C:\ProgramData\jupyter\kernels\pyspark3kernel
[InstallKernelSpec] Installed kernelspec sparkrkernel in C:\ProgramData\jupyter\kernels\sparkrkernel
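
To confirm afterwards that all four kernels are registered, something like the following sketch could be used. The helper names are hypothetical; installed_kernel_names assumes the jupyter_client package (installed with Jupyter) is importable, while missing_kernels is plain Python.

```python
# Kernel names taken from the install commands above.
EXPECTED_KERNELS = ("sparkkernel", "pysparkkernel", "pyspark3kernel", "sparkrkernel")


def missing_kernels(installed_names, expected=EXPECTED_KERNELS):
    """Return the expected kernel names that are absent from installed_names."""
    installed = {name.lower() for name in installed_names}
    return [name for name in expected if name not in installed]


def installed_kernel_names():
    """Ask Jupyter for the installed kernelspec names (requires jupyter_client)."""
    from jupyter_client.kernelspec import KernelSpecManager

    # find_kernel_specs() returns a dict mapping kernel name to its resource directory.
    return list(KernelSpecManager().find_kernel_specs())


# In an environment with Jupyter installed:
# print(missing_kernels(installed_kernel_names()))
```

An empty list means every wrapper kernel was found.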

Enable the sparkmagic extension

Enable the server extension so that clusters can be changed programmatically

jupyter serverextension enable --py sparkmagic

Enabling: sparkmagic
- Writing config: C:\Users\gerardn\.jupyter
    - Validating...
      sparkmagic  ok

Configure (config.json)

If you create or modify this file, you need to restart the notebook server.

Create the config home

On Windows:

mkdir %USERPROFILE%\.sparkmagic

On Linux or macOS:

mkdir ~/.sparkmagic

Create in it the config.json configuration file.
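
If you prefer to script this step, the sketch below creates the directory and writes a minimal config.json in one go. It only uses key names that appear in the example configuration on this page; the function name write_minimal_config and its parameters are illustrative assumptions, not part of Sparkmagic.

```python
import json
from pathlib import Path


def write_minimal_config(home, livy_url, username="", password="", auth="None"):
    """Create <home>/.sparkmagic/config.json with one credentials block per kernel."""
    config_dir = Path(home) / ".sparkmagic"
    config_dir.mkdir(parents=True, exist_ok=True)
    credentials = {"username": username, "password": password,
                   "url": livy_url, "auth": auth}
    config = {
        "kernel_python_credentials": dict(credentials),
        "kernel_scala_credentials": dict(credentials),
        "kernel_r_credentials": dict(credentials),
    }
    config_path = config_dir / "config.json"
    # Note: this overwrites an existing config.json.
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config_path


# Example (replace the URL with your own Livy endpoint):
# write_minimal_config(Path.home(), "http://10.10.6.30:8998")
```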

Example for an unsecured cluster (from config.json)

Endpoint:

The authentication method (auth) may be:
- None
- Kerberos
- Basic_Access

{
  "kernel_python_credentials" : {
    "username": "nico",
    "password": "pwd",
    "url": "http://10.10.6.30:8998",
    "auth": "Basic_Access"
  },

  "kernel_scala_credentials" : {
    "username": "nico",
    "password": "",
    "url": "http://10.10.6.30:8998",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "nico",
    "password": "",
    "url": "http://10.10.6.30:8998"
  },

  "logging_config": {
    "version": 1,
    "formatters": {
      "magicsFormatter": { 
        "format": "%(asctime)s\t%(levelname)s\t%(message)s",
        "datefmt": ""
      }
    },
    "handlers": {
      "magicsHandler": { 
        "class": "hdijupyterutils.filehandler.MagicsFileHandler",
        "formatter": "magicsFormatter",
        "home_path": "~/.sparkmagic"
      }
    },
    "loggers": {
      "magicsLogger": { 
        "handlers": ["magicsHandler"],
        "level": "DEBUG",
        "propagate": 0
      }
    }
  },

  "wait_for_idle_timeout_seconds": 15,
  "livy_session_startup_timeout_seconds": 60,

  "fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",

  "ignore_ssl_errors": false,

  "session_configs": {
    "driverMemory": "1000M",
    "executorCores": 2
  },

  "use_auto_viz": true,
  "coerce_dataframe": true,
  "max_results_sql": 2500,
  "pyspark_dataframe_encoding": "utf-8",
  
  "heartbeat_refresh_seconds": 30,
  "livy_server_heartbeat_timeout_seconds": 0,
  "heartbeat_retry_seconds": 10,

  "server_extension_default_kernel_name": "pysparkkernel",
  "custom_headers": {
      "X-Requested-By": "admin"
  },
  "retry_policy": "configurable",
  "retry_seconds_to_sleep_list": [0.2, 0.5, 1, 3, 5],
  "configurable_retry_policy_max_retries": 8
}
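
A typo in auth or an empty url is easy to miss in a file of this size. The following sketch checks the three credentials blocks against the values listed above. It is a hypothetical helper, not something Sparkmagic ships, and it skips blocks that are absent from the file.

```python
import json

# Allowed values as listed above.
ALLOWED_AUTH = {"None", "Kerberos", "Basic_Access"}
CREDENTIAL_KEYS = (
    "kernel_python_credentials",
    "kernel_scala_credentials",
    "kernel_r_credentials",
)


def check_credentials(config_text):
    """Return a list of human-readable problems found in a config.json string."""
    config = json.loads(config_text)
    problems = []
    for key in CREDENTIAL_KEYS:
        block = config.get(key)
        if block is None:
            continue  # block not present in the file: nothing to check
        if not block.get("url"):
            problems.append(f"{key}: no url")
        auth = block.get("auth")
        if auth is not None and auth not in ALLOWED_AUTH:
            problems.append(f"{key}: unknown auth {auth!r}")
    return problems


# Usage:
# from pathlib import Path
# print(check_credentials((Path.home() / ".sparkmagic" / "config.json").read_text()))
```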

Validate

When the notebook server starts, its log shows that Sparkmagic is enabled:

jupyter notebook

[I 17:39:43.691 NotebookApp] [nb_conda_kernels] enabled, 4 kernels found
[I 17:39:43.696 NotebookApp] Writing notebook server cookie secret to C:\Users\gerardn\AppData\Roaming\jupyter\runtime\notebook_cookie_secret
[I 17:39:47.055 NotebookApp] [nb_anacondacloud] enabled
[I 17:39:47.091 NotebookApp] [nb_conda] enabled
[I 17:39:47.605 NotebookApp] ✓ nbpresent HTML export ENABLED
[W 17:39:47.606 NotebookApp] ✗ nbpresent PDF export DISABLED: No module named 'nbbrowserpdf'
[I 17:39:48.112 NotebookApp] sparkmagic extension enabled!

Start a driver

print("Hello")
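
Sparkmagic talks to Livy for you, so the following is never needed in a notebook. For illustration only, this sketch builds the kind of request that the Livy REST API expects when a new interactive session is created (POST /sessions with a session kind). The function name is hypothetical, and the exact body that Sparkmagic sends (for example with session_configs merged in) is not shown here.

```python
import json
import urllib.request


def build_session_request(livy_url, kind="pyspark", headers=None):
    """Build (but do not send) a POST /sessions request for a new Livy session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    all_headers = {"Content-Type": "application/json"}
    all_headers.update(headers or {})
    return urllib.request.Request(
        livy_url.rstrip("/") + "/sessions",
        data=body,
        headers=all_headers,
        method="POST",
    )


# URL and custom header taken from the example config.json.
request = build_session_request(
    "http://10.10.6.30:8998", headers={"X-Requested-By": "admin"}
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```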

Magics By Kernel

IPython - Magic Function by kernel

Magics are special commands that you call with % (line magics) or %% (cell magics):

%%MAGIC <args>

IPython

From an IPython kernel

Example from Magics in IPython Kernel.ipynb

Load the Sparkmagic extension

%load_ext sparkmagic.magics

The %manage_spark line magic lets you manage Livy endpoints and Spark sessions.

%manage_spark

%spark?
%spark logs -s testsession

%%spark -c sql
SHOW TABLES

%%spark -c sql -o df_hvac --maxrows 10
SELECT * FROM hivesampletable

Use the Pandas dataframe df_hvac created above via the -o option

df_hvac.head()

PySpark

https://github.com/jupyter-incubator/sparkmagic/blob/master/examples/Pyspark%20Kernel.ipynb

Context

The contexts are created automatically, so there is no need to create them yourself. In other words, code such as the following is not needed:

sc = SparkContext('yarn-client')
sqlContext = HiveContext(sc)
spark = SparkSession \
    .builder.appName("yarn-client") \
    .getOrCreate()

Depending on the version, the following variables may be available:

spark for the SparkSession
sc for the SparkContext
sqlContext for a HiveContext (SQLContext with Hive support)
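
To see which of these variables your session actually defines, a small check like the one below can be run in a PySpark cell. The helper name available_contexts is illustrative only.

```python
def available_contexts(namespace, names=("spark", "sc", "sqlContext")):
    """Map each expected variable name to True or False depending on whether it is defined."""
    return {name: name in namespace for name in names}


# In a notebook cell:
# print(available_contexts(globals()))
```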

Configure

Session configuration

%%configure -f 
{"name":"remotesparkmagics-sample", "executorMemory": "4G", "executorCores":4 }

where:

-f forces the change: a running session is dropped and recreated with the new configuration
name is the application name. It should start with remotesparkmagics so that sessions are cleaned up automatically if an error occurs.
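
If you generate notebooks or cells from a script, the cell text can be built so that the name always carries the expected prefix. This is a sketch with a hypothetical helper, configure_cell; the option names are the ones from the example above.

```python
import json

NAME_PREFIX = "remotesparkmagics"


def configure_cell(name_suffix, force=True, **session_options):
    """Return the text of a %%configure cell whose session name has the cleanup prefix."""
    payload = {"name": f"{NAME_PREFIX}-{name_suffix}"}
    payload.update(session_options)
    magic = "%%configure -f" if force else "%%configure"
    return magic + "\n" + json.dumps(payload)


print(configure_cell("sample", executorMemory="4G", executorCores=4))
# %%configure -f
# {"name": "remotesparkmagics-sample", "executorMemory": "4G", "executorCores": 4}
```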

SQL

%%sql
use myDatabase

Then…

%%sql
select * from hivesampletable

Info

Help

%%help

Support

Log

See HOME/.sparkmagic/log

You need to have at least 1 client created to execute commands.

Your kernel has crashed. Restart it ?

Add a jar

With the configure magic:

Set the jars parameter
or set the spark.jars.packages parameter in conf with the Maven coordinates

%%configure
{ "conf": {"spark.jars.packages": "com.databricks:spark-csv_2.10:1.4.0" }}
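
With several packages, spark.jars.packages takes a comma-separated list of Maven coordinates in the groupId:artifactId:version form. The sketch below builds the JSON body for the configure magic and rejects strings that do not have three colon-separated parts. The helper packages_conf is hypothetical, and the check is only a shape check: it does not verify that the artifact exists.

```python
import json
import re

# groupId:artifactId:version, the coordinate form used in the example above.
COORDINATE = re.compile(r"^[^:\s,]+:[^:\s,]+:[^:\s,]+$")


def packages_conf(coordinates):
    """Return the configure JSON body that sets spark.jars.packages."""
    for coordinate in coordinates:
        if not COORDINATE.match(coordinate):
            raise ValueError(
                f"not a groupId:artifactId:version coordinate: {coordinate!r}"
            )
    return json.dumps({"conf": {"spark.jars.packages": ",".join(coordinates)}})


print(packages_conf(["com.databricks:spark-csv_2.10:1.4.0"]))
# {"conf": {"spark.jars.packages": "com.databricks:spark-csv_2.10:1.4.0"}}
```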

For HDInsight, see apache-spark-jupyter-notebook-use-external-packages