Setting up Apache (Py)Spark with Jupyter Notebook in Arch Linux

Blog / Rodney Pilgrim / September 14, 2016

This blog post details the process I took to install Apache Spark on Arch Linux and its subsequent integration with Jupyter Notebook. In what follows, I describe an adaptation for Arch Linux of the Spark installation guide for Mac OS X[5] written by rocket-ron.

What Is Apache Spark?

Fig.1 – Apache Spark logo

Apache Spark is a fast and general engine for large-scale data processing. As the website states, Spark is an open-source platform for processing data on a large scale, with support for cluster computing. It supports the Java, Python, R, and Scala languages, but (as you can probably guess from this blog post's title) I will focus on the Python API – PySpark.

Some features of Spark:

  • MLlib – Machine Learning Library
  • Spark SQL Context – Interface for executing SQL queries on a dataset (see the short example below)
  • Support for Multiple Data Formats – HDFS, SQL, compressed GZIP text files, plain text files and more…

More information can be found on the Apache Spark website[2].
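
To give a flavour of the Spark SQL interface listed above, here is a minimal standalone sketch of querying an in-memory dataset from PySpark. It assumes the Spark 1.6-era API, where SQLContext is the SQL entry point, and that no SparkContext is already running (the pyspark shell creates sc and sqlContext for you); the table and column names are purely illustrative:

import pyspark
from pyspark.sql import SQLContext

# Create a SparkContext and a SQL context on top of it (Spark 1.6-style API)
sc = pyspark.SparkContext()
sql_context = SQLContext(sc)

# Build a tiny DataFrame from an in-memory list and query it with SQL
df = sql_context.createDataFrame([(1, "spark"), (2, "jupyter")], ["id", "name"])
df.registerTempTable("tools")
sql_context.sql("SELECT name FROM tools WHERE id = 1").show()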

Installing Apache Spark

Arch and OS X use different package managers. On Arch I used the AUR helper packer[4], which allows the installation of Spark and its dependencies (e.g. Scala, Hadoop, Python) from the AUR package[3] with a single command:

packer -S apache-spark
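
To sanity-check the installation, you can ask Spark for its version. Depending on how the package sets up your PATH, you may need the full path to the launcher script (e.g. /opt/apache-spark/bin/spark-submit):

spark-submit --version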

Setting Up the Jupyter PySpark Kernel

Even though the packer command installs Spark and all its dependencies, Spark still requires the environment to be set up correctly. Following rocket-ron’s example, I set up a Jupyter Notebook kernel to provide an interactive coding environment. This process involves two tasks: installing the JSON Kernel Configuration file, and setting up the startup script for the PySpark profile.

Of course, this requires Jupyter Notebook[6] to be installed on the system. This can be done by running:

sudo pacman -S jupyter-notebook  

and typing the user password if requested.

Kernel Configuration

The content of the kernel.json file used to install the PySpark kernel is:

{
    "language": "python",
    "argv": [
        "python",
        "-m",
        "ipykernel",
        "--profile=pyspark",
        "-f",
        "{connection_file}"
        ],
    "display_name": "pySpark (Spark 1 .6.1)"
}

I put this kernel.json in its own directory (e.g. ~/pyspark/) so that I could install it via the Jupyter command line:

sudo jupyter kernelspec install ~/pyspark/
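
To confirm the kernel was registered, you can list the installed kernel specs; the exact output depends on your system, but a pyspark entry should appear:

jupyter kernelspec list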

PySpark Profile Startup Script

In order to create the startup script for the PySpark profile, we need to create the file ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py with the following content:

# Configure the necessary Spark environment
import os
import sys

# Make sure spark-submit starts a plain PySpark shell
pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if "pyspark-shell" not in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Locate the Spark installation and add its Python bindings to the path
spark_home = os.environ.get("SPARK_HOME", "/opt/apache-spark")
sys.path.insert(0, spark_home + "/python")

# Add py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.9-src.zip"))

# Initialize PySpark by running Spark's own shell setup script
major_version = sys.version_info.major
pyspark_shell_file = os.path.join(spark_home, "python/pyspark/shell.py")
if major_version == 2:
    execfile(pyspark_shell_file)
elif major_version == 3:
    with open(pyspark_shell_file) as f:
        code = compile(f.read(), pyspark_shell_file, "exec")
        exec(code)
else:
    error_str = "Unrecognised Python Version: {0}".format(major_version)
    raise EnvironmentError(1, error_str, "00-pyspark-setup.py")

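If the profile_pyspark directory does not exist yet, the profile (and its startup directory) can be created first with IPython's standard profile command:

ipython profile create pyspark
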
This script runs the initial code required to set up a SparkContext object ready for use. This object tells Spark how to access a cluster[8] and acts as the interface for initial Spark operations, such as loading a dataset into the Spark environment.

Note: At the time of writing, the apache-spark AUR package installed Spark in /opt/apache-spark. If this changes, or you manually install Spark elsewhere, be sure to set the environment variable SPARK_HOME to point to the installation, e.g. add export SPARK_HOME=/path/to/spark to your .bashrc file.

Checking It Works

Now, with Jupyter Notebook running, you should be able to create a new notebook using the pySpark (Spark 1.6.1) kernel and test it with the word count example below in a cell:

import pyspark
spark_context = pyspark.SparkContext()
lines = spark_context.textFile("filename")
words = lines.flatMap(lambda line: line.split())
count = words.count()
print("Word Count: " + str(count))

Here, filename is the path to a text file.
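
As a slightly richer check, the same notebook session can also compute per-word frequencies with the standard RDD API. This continues from the cell above, so words must still be defined:

# Count how often each distinct word appears and show the ten most frequent
word_counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
print(word_counts.takeOrdered(10, key=lambda pair: -pair[1]))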

ALAS, You Could Use Vagrant

In search of more power during an assignment, I decided to set Spark up on my gaming PC, which runs Windows. Predicting that installing Spark on Windows would undoubtedly have been a headache, I instead prepared a Vagrantfile[9] and scripts to provision an Arch Linux VM. These files have been pushed to the ALAS (Arch Linux Apache Spark) GitHub repository[1].

Debian/Ubuntu Based Systems

For those using a derivative of the Debian Linux distribution (e.g. Ubuntu), an in-depth guide is provided by Kristian Holsheimer on GitHub[7].

References

  1. ALAS – Arch Linux Apache Spark Repository
  2. Apache Spark Website
  3. AUR – Apache Spark Package
  4. AUR Helpers
  5. Configuring Spark 1.6.1 to work with Jupyter 4.x Notebooks on Mac OS X with Homebrew
  6. Jupyter Website
  7. Spark + PySpark setup guide
  8. Spark Programming Guide – Initialising Spark section
  9. Vagrant Website