AoM Big Data Workshop

If you are attending the workshop, I look forward to meeting you at the conference!

At the AoM Conference on Big Data (University of Surrey, April 18-20, 2018), I will be presenting a hands-on workshop: Wrangling Big Data for Data-Driven Research: Hands-on with Apache Spark and Jupyter Notebooks.  In this workshop, we will be using Jupyter notebooks to run Apache Spark and analyze semi-structured data.

Below are instructions on how to sign up for the cloud-based community edition of the Databricks analytics platform, which includes Apache Spark and Jupyter notebooks.

We will be using a dataset from USASpending.gov that includes data on U.S. government contracts dating back over a decade.  We will be working with both JSON and CSV data downloaded from their website.  Some of the JSON data you will download directly into your notebook (running under Databricks' community edition on AWS) through the USASpending API; other data has already been downloaded, compressed, and staged on AWS, so your notebook will be able to load it directly from there.  This will allow us to analyze data across agencies and compare contract data from 2015 and 2017 (we will use approximately 1 million contracts from the 4th quarter of each year).  We will still be using the Wi-Fi at the conference, but this approach eliminates the need to download and upload large files if the network is under a heavy load from all of the workshops and attendees.
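
To give a sense of those two loading paths, here is a rough sketch of what a couple of notebook cells might look like.  The API endpoint, request payload, and S3 path below are hypothetical placeholders rather than the exact ones we will use in the workshop.

```python
# Rough sketch of the two loading paths (placeholders, not the workshop code).
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Databricks notebooks provide `spark` already

# 1) Pull a small slice of award JSON directly from the USASpending API.
#    The endpoint and payload here are assumptions for illustration.
resp = requests.post(
    "https://api.usaspending.gov/api/v2/search/spending_by_award/",
    json={
        "filters": {"award_type_codes": ["A", "B", "C", "D"]},
        "fields": ["Award ID", "Recipient Name", "Award Amount"],
        "limit": 100,
    },
)
api_df = spark.createDataFrame(resp.json()["results"])

# 2) Read the larger, pre-staged contract extracts straight from AWS (S3).
#    The bucket and file name are hypothetical.
contracts = spark.read.csv(
    "s3a://example-workshop-bucket/contracts_2017_q4.csv.gz",
    header=True,
    inferSchema=True,
)
print(api_df.count(), contracts.count())
```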

At the end of the workshop we will discuss another dataset, available from Yelp as part of their Dataset Challenge, that I use in an information systems course for business students.

In the coming week I will be posting instructions on downloading the pre-populated Jupyter notebooks we will use as a starting point for the workshop.

Databricks Community Edition of Apache Spark and Jupyter Notebooks - Please sign up before the workshop:

Databricks community edition: https://databricks.com/ce

The sign-up process is straightforward.  Databricks will send you an email confirming your account, and when you click on the link in the email, it will take you into your Databricks account.  Feel free to use it or poke around in it, and when you are done, log out.

Written instructions on how to sign up: Signing up for Databricks

Jupyter Notebook:

Notebook for the conference: workshop.zip

Download the workshop.zip file at the above link and unzip it.  On most Macs the file will be unzipped automatically.  On a PC, open the folder and drag the file out.  After unzipping, the notebook file is named workshop.ipynb.

Jupyter notebooks are JSON documents, so this is a small file.  Since the notebook is pre-populated with code, you do not need any prior experience with JSON, Python, SQL, or Spark.  If you do, that's great, but this workshop only requires a laptop and Internet access (wireless is provided at the conference).  In the workshop we will walk through what the notebook is doing, iterating on and changing some of the code and visualizations.  You will be running your notebook on a Databricks community account, so be sure to sign up for a free account as described above.
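
If you are curious about the JSON point above, here is a small sketch that opens the notebook file with Python's standard json module and peeks at its structure.  It assumes the file is named workshop.ipynb and sits in your current directory.

```python
# A quick look at the notebook file to confirm it is plain JSON.
import json

with open("workshop.ipynb") as f:
    nb = json.load(f)

print(nb["nbformat"])               # notebook format version
print(len(nb["cells"]), "cells")    # code and markdown cells
print(nb["cells"][0]["cell_type"])  # e.g. 'markdown' or 'code'
```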

We will be making some changes as you run your notebook in the workshop, but if you want to see what one completed version looks like, click on this link.  One feature of Jupyter notebooks is that you can share a read-only version of your work, and you decide when to update the shared version.

Even if you don't download the notebook early, it only contains code and markdown in JSON format, so it is a small file that should transfer well even over the conference Wi-Fi.

Yelp Dataset Challenge (round 11): https://www.yelp.com/dataset

If you are interested in a dataset to use in a classroom setting, I recommend checking out the Yelp Dataset Challenge.  It is currently on round 11, and they release a new version near the start of the Spring and Fall semesters (at least based on San Jose State's academic calendar).