Create an AWS EMR Notebook with Spark and Attach It to a Git Repo for Version Control
AWS EMR is a powerful service for big data and complex computation, and pairing an EMR cluster running Spark with a Jupyter notebook makes data analysis even smoother. In this article, I am going to show you how to create an EMR Notebook and attach it to a Git repo for version control. Along the way, I will point out some problems you may encounter and their solutions.
Introduction
This article is divided into several parts, from creating the cluster to performing Git operations. All the segments are listed below; if you are interested in a certain part, feel free to navigate there by clicking its title.
- Create VPC
- Create Cluster
- Create Notebook
- Start Using the EMR Notebook
- Perform Git Operation
- Conclusion
- Reference
Create VPC
To attach a notebook to a Git repo, the cluster must be created in a private subnet, which is not present in the default VPC. Therefore, you have to create a new VPC that contains a private subnet. You will also have to configure security groups for that VPC, because security groups are attached to a VPC and VPCs are region specific.
There is an official document about creating a VPC with private subnets for EMR clusters, and I found it detailed enough; please visit it for more information.
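If you prefer to script this step, below is a minimal boto3 sketch of a VPC with one public and one private subnet. The region, CIDR ranges, and resource names are placeholder assumptions, not values from the official document, which remains the authoritative walkthrough.

```python
import boto3

# Assumed region and CIDR ranges; adjust them to your environment.
ec2 = boto3.client("ec2", region_name="us-east-1")

vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
public_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.0.0/24")["Subnet"]["SubnetId"]
private_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")["Subnet"]["SubnetId"]

# Internet gateway so the public subnet can reach the internet.
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

# NAT gateway in the public subnet; the private subnet sends outbound
# traffic through it, which is what lets the notebook reach your Git host.
eip = ec2.allocate_address(Domain="vpc")["AllocationId"]
nat_id = ec2.create_nat_gateway(SubnetId=public_id, AllocationId=eip)["NatGateway"]["NatGatewayId"]
ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])

# Route table for the private subnet: default route via the NAT gateway.
# (A real setup also needs a route from the public subnet to the internet
# gateway; the official document covers the full topology.)
rtb_id = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=rtb_id, DestinationCidrBlock="0.0.0.0/0", NatGatewayId=nat_id)
ec2.associate_route_table(RouteTableId=rtb_id, SubnetId=private_id)
```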
Create Cluster
Visit the EMR page in the AWS console and click Clusters → Create cluster. At first you will see the quick options, but it is recommended that you choose “Go to advanced options”, since there are many things to configure.
There are four parts to configure in the advanced options: Software and Steps, Hardware, General Cluster Settings, and Security.
- Software and Steps: In this step, you configure what software to install on your cluster. This article uses the Spark framework as an example; adjust the settings according to your needs. After configuration, click Next.
- Remember to choose JupyterEnterpriseGateway if you wish to use Jupyter Notebook or JupyterHub.
- Remember to choose Livy so that we can interact with the EMR cluster running Spark through Apache Livy.
- Remember to choose Hadoop and Spark (of course, since we want to use the Spark framework).
- Hardware: In this section, you can adjust the type and number of instances you want for your EMR application, and you can also change the networking. If you want to attach a Git repository to your EMR notebook, remember to choose a private subnet from the VPC with a private subnet that you created in the first step.
- General Cluster Settings: You can modify the cluster name, the logging folder, and some other basic settings here.
- Security: This is a very important part if you want to link to a Git repository.
- EC2 key pair: The key pair used to connect to the EC2 instances.
- EC2 security groups: Security groups define the inbound and outbound rules. If you did not create security groups when creating the VPC, you can simply click Create a security group to make one attached to the VPC. For the master node and the core & task nodes, allow SSH inbound connections on port 22 and allow all outbound traffic. Other required rules may be added automatically upon cluster creation (but do not allow unrestricted inbound traffic; it will cause cluster creation to fail).
- For the service access security group, if you did not create one when creating the VPC, simply click Create a security group to make one attached to the VPC. Allow inbound traffic from your master and core & task security groups on port 9443, as well as outbound traffic to your master and core & task security groups on port 8443. A boto3 sketch of these rules follows this list.
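For reference, the rules above can also be created with boto3. This is only a sketch: the group names are made up, the VPC ID is a placeholder for the one you created earlier, and `make_sg` is a hypothetical helper.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region
vpc_id = "vpc-0123456789abcdef0"                    # placeholder: your new VPC

def make_sg(name):
    # Hypothetical helper that creates a security group in the VPC.
    return ec2.create_security_group(
        GroupName=name, Description=name, VpcId=vpc_id)["GroupId"]

master_sg = make_sg("emr-master")
core_task_sg = make_sg("emr-core-task")
service_sg = make_sg("emr-service-access")

# Master and core & task groups: allow SSH in on port 22.
# (New security groups already allow all outbound traffic by default.)
for sg in (master_sg, core_task_sg):
    ec2.authorize_security_group_ingress(
        GroupId=sg,
        IpPermissions=[{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
                        "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}],  # tighten to your IP
    )

# Service access group: inbound 9443 from the master and core & task
# groups, and outbound 8443 back to the same groups.
for sg in (master_sg, core_task_sg):
    ec2.authorize_security_group_ingress(
        GroupId=service_sg,
        IpPermissions=[{"IpProtocol": "tcp", "FromPort": 9443, "ToPort": 9443,
                        "UserIdGroupPairs": [{"GroupId": sg}]}],
    )
    ec2.authorize_security_group_egress(
        GroupId=service_sg,
        IpPermissions=[{"IpProtocol": "tcp", "FromPort": 8443, "ToPort": 8443,
                        "UserIdGroupPairs": [{"GroupId": sg}]}],
    )
```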
Now you have finished the cluster creation part. Simply click Create cluster and wait till the status turns to Waiting. When the status is Waiting, the cluster is ready and you can move on to creating the notebook.
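The same cluster can also be launched programmatically. Here is a hedged boto3 sketch of the configuration described above; the release label, instance types, subnet and security group IDs, key pair, and S3 log path are all example values you would replace with your own.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

response = emr.run_job_flow(
    Name="spark-notebook-cluster",                  # General Cluster Settings: name
    ReleaseLabel="emr-6.2.0",                       # pick a current EMR release
    LogUri="s3://your-bucket/emr-logs/",            # logging folder (placeholder)
    Applications=[                                  # Software and Steps
        {"Name": "Hadoop"},
        {"Name": "Spark"},
        {"Name": "Livy"},
        {"Name": "JupyterEnterpriseGateway"},
    ],
    Instances={                                     # Hardware
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2SubnetId": "subnet-0123456789abcdef0",  # the private subnet from step 1
        "Ec2KeyName": "your-key-pair",              # Security: EC2 key pair
        "EmrManagedMasterSecurityGroup": "sg-master-placeholder",
        "EmrManagedSlaveSecurityGroup": "sg-core-task-placeholder",
        "ServiceAccessSecurityGroup": "sg-service-placeholder",
        "KeepJobFlowAliveWhenNoSteps": True,        # keep the cluster up for the notebook
    },
    JobFlowRole="EMR_EC2_DefaultRole",              # default EMR roles
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])
```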
Create Notebook
Here are the steps, with hints to check, when creating an EMR notebook.
1. Choose a cluster. It’s recommended that you create one with a private subnet beforehand.
2. Choose security groups. You have to configure these security groups with certain inbound and outbound permissions in the VPC management console (see the boto3 sketch after this list):
- Security group for the master instance: allow a custom inbound TCP connection from the security group of the notebook instance on port 18888.
- Security group for the notebook instance: allow a custom outbound TCP connection to the security group of the master instance on port 18888.
3. AWS service role. You need to grant the role permissions that include the Secrets Manager policy; this is used for reading the Personal Access Token of the Git repo.
4. Git repository. Click Choose repository and, assuming you don’t already have one, click Add new repository when the popup appears.
- Enter the repository name, URL, and branch (you need to create the repository on GitHub first).
- Click Create a new secret (if you don’t already have one). Type the name of the secret and the Personal Access Token. This proves to AWS that you own the repository so that it can push commits to the repo.
- Click Add repository.
- Go back to notebook creation and click Choose repository again. You should now see the repo.
5. Finally, click Create notebook.
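Two of these prerequisites can be scripted as well: the port-18888 rules from step 2 and the Secrets Manager secret from step 4. Below is a sketch assuming placeholder group IDs, a placeholder secret name, and an obviously fake token.

```python
import boto3

region = "us-east-1"                       # assumed region
ec2 = boto3.client("ec2", region_name=region)
sm = boto3.client("secretsmanager", region_name=region)

master_sg = "sg-0123456789abcdef0"         # placeholder: master instance SG
notebook_sg = "sg-0fedcba9876543210"       # placeholder: notebook instance SG

# Step 2: inbound TCP 18888 on the master SG from the notebook SG ...
ec2.authorize_security_group_ingress(
    GroupId=master_sg,
    IpPermissions=[{"IpProtocol": "tcp", "FromPort": 18888, "ToPort": 18888,
                    "UserIdGroupPairs": [{"GroupId": notebook_sg}]}],
)
# ... and outbound TCP 18888 on the notebook SG to the master SG.
ec2.authorize_security_group_egress(
    GroupId=notebook_sg,
    IpPermissions=[{"IpProtocol": "tcp", "FromPort": 18888, "ToPort": 18888,
                    "UserIdGroupPairs": [{"GroupId": master_sg}]}],
)

# Step 4: store the GitHub Personal Access Token in Secrets Manager.
# Never hard-code a real token; read it from a safe place instead.
sm.create_secret(
    Name="emr-notebook-git-pat",           # placeholder secret name
    SecretString="ghp_your-token-goes-here",
)
```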
Start Using the EMR Notebook
On the notebook page, choose the EMR cluster and start the notebook. If it succeeds, you should see the green Linked and Ready statuses. Click Open in JupyterLab to enter the notebook.
Now you will land in JupyterLab. Remember to work on your project inside the folder named after your Git repo; this is where Git tracks your project.
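As a quick smoke test inside the notebook: with the PySpark kernel, a `spark` session is already provided through Livy, so you can run something like the following in the first cell (the toy data is just an example).

```python
# The PySpark kernel in an EMR notebook exposes a ready-made `spark` session
# (wired to the cluster through Livy), so no SparkSession boilerplate is needed.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45)],  # toy rows, just to confirm the cluster answers
    ["name", "age"],
)
df.show()
print("Spark version:", spark.version)
```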
Perform Git Operation
There are at least two ways to perform Git operations.
Using JupyterLab
You should see a Git logo on the toolbar. The UI shows staged, changed, and untracked files, and lets you stage and commit. You can then use the Git menu on the top toolbar to push, pull, and so on.
Using the terminal
Open a terminal from the JupyterLab landing page. The Git directory lies within work → Repo_Name. Then simply perform Git operations (git status, git add, git commit, git push, and so on) as usual.
Conclusion
EMR with Spark is an ideal framework for analyzing large amounts of data, and it gets even better once JupyterLab and Git are introduced. However, setting up this environment takes some practice and attention, which is the reason for this article. I hope anyone visiting it can successfully start their data exploration journey without difficulty.
Reference
- AWS official document
- Stack Overflow