Setting up Apache (Py)Spark with Jupyter Notebook in Arch Linux
This blog post details the process I took to install Apache Spark on Arch Linux and its subsequent integration with Jupyter Notebook. In what follows, I describe an Arch Linux adaptation of the Spark installation guide for Mac OS X[3] written by rocket-ron.
What Is Apache Spark?
As the website states, Apache Spark is a fast and general engine for large-scale data processing: an open source platform for processing data at scale with support for cluster computing. It offers APIs in Java, Python, R, and Scala, but (as you can probably guess from this blog post's title) I will focus on the Python API – PySpark.
Some features of Spark:
- MLlib – Machine Learning Library
- Spark SQL Context – an interface for executing SQL queries on a dataset (see the short PySpark sketch after this list)
- Support for Multiple Data Formats – HDFS, SQL, gzip-compressed text files, plain text files, and more…
More information can be found on the Apache Spark Website.
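As a quick taste of the Spark SQL interface, here is a minimal sketch of querying a DataFrame through an SQLContext. It assumes a SparkContext called spark_context is already available (the word count example at the end of this post creates one) and uses a hypothetical people.json file with one JSON object per line:

from pyspark.sql import SQLContext

# Build an SQL interface on top of an existing SparkContext
sql_context = SQLContext(spark_context)

# "people.json" is a hypothetical input file with one JSON object per line,
# e.g. {"name": "Ada", "age": 36}
people = sql_context.read.json("people.json")

# Register the DataFrame as a temporary table and query it with SQL
people.registerTempTable("people")
adults = sql_context.sql("SELECT name FROM people WHERE age >= 18")
adults.show()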
Installing Apache Spark
Arch & OS X use different package managers. On Arch I used the AUR helper packer, which allows the installation of Spark and its dependencies (e.g. Scala, Hadoop, Python, etc.) with a single command:
packer -S apache-spark
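To sanity-check the installation, a small Python sketch like the one below confirms the package landed where expected. The /opt/apache-spark path is an assumption based on where the AUR package installed Spark at the time of writing (see the note further down); adjust it if your install lives elsewhere:

import os

# /opt/apache-spark is where the AUR package installed Spark at the time of
# writing; adjust this (or set SPARK_HOME) if your install lives elsewhere.
spark_home = os.environ.get("SPARK_HOME", "/opt/apache-spark")

for required in ("bin/spark-submit", "python/pyspark"):
    path = os.path.join(spark_home, required)
    print("{0}: {1}".format(path, "found" if os.path.exists(path) else "MISSING"))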
Setup Jupyter PySpark Kernel
Even though the packer command installs Spark and all its dependencies, Spark still requires the environment to be set up correctly. Following rocket-ron’s example, I set up a Jupyter Notebook kernel to provide an interactive coding environment. This process involves two tasks: installing the JSON Kernel Configuration file, and setting up the startup script for the PySpark profile.
Of course, this requires Jupyter Notebook to be installed on the system. This can be done by running:
sudo pacman -S jupyter-notebook
and typing the user password if requested.
Kernel Configuration
The content of the kernel.json file, which is used to install the PySpark kernel, is:
{
    "language": "python",
    "argv": [
        "python",
        "-m",
        "ipykernel",
        "--profile=pyspark",
        "-f",
        "{connection_file}"
    ],
    "display_name": "pySpark (Spark 1.6.1)"
}
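If you prefer to create the file from a script rather than by hand, a minimal Python 3 sketch like the following writes the spec into ~/pyspark/, the directory used for the install command below:

import json
import os

# The same kernel spec as above, written to ~/pyspark/kernel.json
kernel_spec = {
    "language": "python",
    "argv": ["python", "-m", "ipykernel", "--profile=pyspark",
             "-f", "{connection_file}"],
    "display_name": "pySpark (Spark 1.6.1)",
}

kernel_dir = os.path.expanduser("~/pyspark")
os.makedirs(kernel_dir, exist_ok=True)  # exist_ok requires Python 3
with open(os.path.join(kernel_dir, "kernel.json"), "w") as f:
    json.dump(kernel_spec, f, indent=4)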
I put this kernel.json in its own directory (e.g. ~/pyspark/) so that I could install it via the Jupyter command line:
sudo jupyter kernelspec install ~/pyspark/
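To check that the kernel spec was picked up, you can list the kernels Jupyter knows about; the new kernel should appear under the name of the directory it was installed from (here, pyspark). A quick sketch:

import subprocess

# Lists the installed kernel specs; look for an entry named "pyspark"
print(subprocess.check_output(["jupyter", "kernelspec", "list"]).decode())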
PySpark Profile Startup Script
In order to create a PySpark profile at startup, we need to create the file ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py with the following content:
# Configure the necessary Spark environment
import os
import sys

pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if "pyspark-shell" not in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

spark_home = os.environ.get("SPARK_HOME", "/opt/apache-spark")
sys.path.insert(0, spark_home + "/python")

# Add py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.9-src.zip"))

# Initialize PySpark
major_version = sys.version_info.major
pyspark_shell_file = os.path.join(spark_home, "python/pyspark/shell.py")
if major_version == 2:
    execfile(pyspark_shell_file)
elif major_version == 3:
    with open(pyspark_shell_file) as f:
        code = compile(f.read(), pyspark_shell_file, "exec")
        exec(code)
else:
    error_str = "Unrecognised Python Version: {0}".format(major_version)
    raise EnvironmentError(1, error_str, "00-pyspark-setup.py")
This script runs the initial code required to set up a SparkContext object ready for use. This object tells Spark how to access a cluster [8] and acts as the interface for initial Spark operations, such as loading a dataset into the Spark environment.
Note: At the time of writing, the apache-spark AUR package installed Spark in /opt/apache-spark. If this changes, or you manually install Spark elsewhere, be sure to set the environment variable SPARK_HOME to point to the installation, e.g. add export SPARK_HOME=/path/to/spark to your .bashrc file.
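On a related note, the startup script hard-codes the py4j-0.9-src.zip filename, which changes between Spark releases. One possible alternative is to discover the zip at runtime; the sketch below is one way to do that, assuming the archive lives under $SPARK_HOME/python/lib/ as it does in the 1.6.x packages:

import glob
import os
import sys

spark_home = os.environ.get("SPARK_HOME", "/opt/apache-spark")

# Pick up whichever py4j zip ships with this Spark release instead of
# hard-coding the version number
py4j_zips = glob.glob(os.path.join(spark_home, "python/lib/py4j-*-src.zip"))
if not py4j_zips:
    raise EnvironmentError("No py4j zip found under " + spark_home)
sys.path.insert(0, py4j_zips[0])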
Checking It Works
Now, with Jupyter Notebook running, you should be able to create a new notebook with the pySpark (Spark 1.6.1) kernel, which can be tested with the word count example below in a cell:
import pyspark

spark_context = pyspark.SparkContext()

# Load the file as an RDD of lines, split each line into words,
# and count the total number of words
lines = spark_context.textFile("filename")
words = lines.flatMap(lambda line: line.split())
count = words.count()
print("Word Count: " + str(count))
where filename is the path to a text file.
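Counting the total number of words is the simplest check; a natural next step is counting occurrences of each word with the classic map/reduceByKey pattern. Here is a small sketch reusing the same spark_context and lines as above:

# Map each word to a (word, 1) pair, then sum the counts per word
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))

# Show the ten most frequent words
for word, count in word_counts.takeOrdered(10, key=lambda pair: -pair[1]):
    print("{0}: {1}".format(word, count))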
ALAS, You Could Use Vagrant
In search of more power during an assignment, I decided to set Spark up on my gaming PC, which runs Windows. Predicting that installing Spark on Windows would undoubtedly have been a headache, I prepared a Vagrantfile and scripts to provision an Arch VM. These files have been pushed to the ALAS[1] (Arch Linux Apache Spark) GitHub repository.
Debian/Ubuntu Based Systems
For those using a derivative of the Debian Linux distribution (e.g. Ubuntu), an in-depth guide is provided by Kristian Holsheimer on GitHub[7].