Data Mining using Spark and Python

Recently, I was working on a project with a relatively high volume of data, which made me curious about the possibility of using Spark to process data in a distributed setup. Accordingly, I decided to set up Spark and get my hands dirty with PySpark (the Python interface for Spark). In this post, I will delve into the steps to set up Spark on Windows and the basic concepts of the MapReduce paradigm, a model introduced by Google to process large distributed datasets in an efficient, scalable, and parallel way.

Setting Up Apache Spark on Windows

First of all, we should install the requirements: Python, Conda, and a JDK (Java Development Kit).

Create a new environment variable in the system settings with JAVA_HOME as the name and the JDK installation path as the value (in my case: JAVA_HOME=C:\Program Files\Java\jdk-19).

Download the latest Spark binaries along with Winutils and extract them into a folder of your choice.

Create an environment variable SPARK_HOME pointing to the Spark binaries. In my case it is SPARK_HOME = “C:\bigdatalocal\spark”.

Create an environment variable HADOOP_HOME pointing to the Winutils folder, for example HADOOP_HOME = “C:\bigdatalocal\hadoop”. Note that you should create a folder named bin inside it and copy winutils.exe there.

Then add “%SPARK_HOME%\bin” and “%HADOOP_HOME%\bin” to the PATH.
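Before moving on, you can quickly verify from a freshly opened Python prompt that the variables are visible (open a new prompt so the changes are picked up). This is just a small sanity-check sketch:

import os

# The three variables should point at the folders configured above.
for name in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
    print(name, "=", os.environ.get(name))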

Now, to confirm the installation, open a command prompt and enter spark-shell. The following result should be shown:

C:\Windows\System32>spark-shell

Spark context Web UI available at http://host.docker.internal:4040
Spark context available as 'sc' (master = local[*], app id = local-1677192682852).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/

To be able to run PySpark code in a Jupyter notebook, we need to take some extra steps.

First, create a new Conda environment and install the necessary Python packages.

conda create --name py39 python==3.9
conda activate py39

Note that to avoid weird errors and exceptions, it’s better to run all the commands in an Anaconda Command Prompt with administrator permissions. Also, it’s better to check the PySpark library and choose a compatible Python version for the environment. Otherwise, you might end up struggling with an error indicating that the Python version running the Jupyter notebook is not the same as the one PySpark expects.
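A quick way to confirm, once the notebook is up, which interpreter the kernel is actually running is the following small sanity-check sketch:

import sys

# Should print the python.exe from the py39 environment and a 3.9.x version;
# anything else is a hint that the version-mismatch error mentioned above is coming.
print(sys.executable)
print(sys.version)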

Next, we should install PySpark, Py4J, Notebook, and FindSpark by running the following commands.

pip install pyspark
pip install findspark
pip install py4j
pip install notebook
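FindSpark comes in handy if you start a plain Jupyter notebook yourself (instead of going through the pyspark launcher described below) and want it to pick up the Spark installation pointed to by SPARK_HOME. A minimal sketch of its use:

import findspark
findspark.init()  # reads SPARK_HOME and makes the pyspark package importable

import pyspark
print(pyspark.__version__)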

Next, we should create another environment variable, PYSPARK_PYTHON, and point it to the Python executable of the Conda environment we just created, e.g. PYSPARK_PYTHON=C:\ProgramData\Anaconda3\envs\py39\python.exe
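If you prefer not to touch the system settings, the same variable can also be set from Python itself, as long as this happens before the SparkContext is created. A minimal sketch, reusing the example path above:

import os

# Must run before the SparkContext is created.
# The path is the example path from this post; adjust it to your environment.
os.environ["PYSPARK_PYTHON"] = r"C:\ProgramData\Anaconda3\envs\py39\python.exe"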

Now, if you open a command prompt and enter pyspark, the following result will be shown:

 Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/

Using Python version 3.9.16 (main, Jan 11 2023 16:16:36)
Spark context Web UI available at http://host.docker.internal:4040
Spark context available as 'sc' (master = local[*], app id = local-1677191027145).
SparkSession available as 'spark'.

The last step is to add the following two environment variables, which make the pyspark command launch a Jupyter notebook on port 4050.

PYSPARK_DRIVER_PYTHON = jupyter

PYSPARK_DRIVER_PYTHON_OPTS=notebook --no-browser --port=4050

Finally, running the pyspark command will start a Jupyter notebook on the indicated port 4050.

(py39) C:\Users\amiri>pyspark
[W 18:37:58.491 NotebookApp] Loading JupyterLab as a classic notebook (v6) extension.

[I 18:37:58.494 NotebookApp] Serving notebooks from local directory: C:\Users\amiri
[I 18:37:58.494 NotebookApp] Jupyter Notebook 6.5.2 is running at:
[I 18:37:58.494 NotebookApp] http://localhost:4050/?token=2f203d7e89781e656d1283c0e7a26c9fe45b3ca848468e11
[I 18:37:58.495 NotebookApp]  or http://127.0.0.1:4050/?token=2f203d7e89781e656d1283c0e7a26c9fe45b3ca848468e11
    Or copy and paste one of these URLs:
        http://localhost:4050/?token=2f203d7e89781e656d1283c0e7a26c9fe45b3ca848468e11
     or http://127.0.0.1:4050/?token=2f203d7e89781e656d1283c0e7a26c9fe45b3ca848468e11

Open the provided URL in your browser and run the following lines of code:

from pyspark import SparkContext

# Reuse the SparkContext created by the pyspark launcher (or create a new one).
sc = SparkContext.getOrCreate()

# Distribute a small range of numbers and collect it back to the driver.
data = sc.parallelize(range(10))
print(data.collect())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
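As a first taste of the MapReduce paradigm mentioned in the introduction, here is a minimal word-count sketch on top of the same SparkContext; the input sentences are just made-up examples:

# Map phase: split each line into words and emit (word, 1) pairs.
# Reduce phase: sum the counts per word with reduceByKey.
lines = sc.parallelize([
    "spark makes distributed processing simple",
    "spark runs map and reduce steps in parallel",
])

counts = (lines
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

print(counts.collect())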
