Intensive Data and Analytics Summer Workshop
At the Intensive Data and Analytics Summer Workshop (Orlando, June 4-7, 2018), Esperanza Huerta and I will be presenting Data Wrangling in Spark with Python, an intermediate-level workshop. This workshop will use cloud-based Jupyter notebooks and Apache Spark, with a focus on data wrangling: the profiling and transformation activities that are commonly estimated to make up about 80% of the work in data science. The workshop is geared towards faculty teaching business students who have a basic programming background.
In the workshop we will be using the cloud-based community edition of the Databricks analytics platform, which includes Apache Spark and Jupyter notebooks. This is a web-based platform that can be used for free in the classroom to teach Big Data and data analytics. Jupyter notebooks and Apache Spark are two of the hottest tools in data science today. Notebooks combine code, output, and documentation in a single format that allows for easy experimentation and iteration when exploring data, and Spark provides the engine for processing large datasets.
We will be using a dataset from USASpending.gov, a government website that hosts over a decade of data on transactions related to U.S. government contracts. We will be using both JSON and CSV data downloaded from the USASpending website. Workshop participants will download JSON data directly to their notebooks through the USASpending API. The contract data (available in CSV format through the API) will have been downloaded, compressed, and staged on AWS, so participants will be able to load the data directly from AWS into their notebooks. This will allow us to analyze contract data from 2015 and 2017 (we will use approximately 1 million contracts from the 4th quarter of each year). While we will still be using the Wi-Fi at the conference to connect to the Jupyter notebooks, this approach eliminates the need to download and upload data, which matters if the network is under a heavy load due to all of the workshops and attendees. A similar approach can be used in a classroom setting if connectivity is limited.
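As a rough illustration of the "compress, stage, then load and profile" idea, here is a minimal plain-Python sketch. It is not the workshop notebook, which does this work in Spark. The column names and sample rows are hypothetical and do not reflect the actual USASpending schema, and the staged file is simulated in memory rather than read from AWS.

```python
import csv
import gzip
import io

# Hypothetical sample rows; the real USASpending columns and values differ.
SAMPLE_CSV = (
    "award_id,agency,obligated_amount\n"
    "A-001,Agency X,1500.00\n"
    "A-002,Agency Y,\n"
    "A-003,Agency X,250.50\n"
)


def stage_compressed(text):
    """Compress CSV text, standing in for a file staged ahead of time."""
    return gzip.compress(text.encode("utf-8"))


def load_rows(blob):
    """Decompress the staged bytes and parse them into one dict per row."""
    text = gzip.decompress(blob).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


def profile_missing(rows):
    """Profiling step: count empty values in each column."""
    counts = {name: 0 for name in rows[0]} if rows else {}
    for row in rows:
        for name, value in row.items():
            if value == "":
                counts[name] += 1
    return counts


def total_by_agency(rows):
    """Transformation step: sum amounts per agency, treating blanks as 0."""
    totals = {}
    for row in rows:
        amount = float(row["obligated_amount"] or 0)
        totals[row["agency"]] = totals.get(row["agency"], 0.0) + amount
    return totals


staged = stage_compressed(SAMPLE_CSV)
rows = load_rows(staged)
missing = profile_missing(rows)
totals = total_by_agency(rows)
print(len(rows), missing, totals)
```

Profiling first (counting blanks) tells you which columns need cleaning before a transformation such as the per-agency total can be trusted.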
Below are instructions on how to sign up for a Databricks community account. Please sign up for an account prior to the workshop. A Jupyter notebook containing the code and markdown (documentation) will also be posted below before the workshop.
Databricks Community Edition of Apache Spark and Jupyter Notebooks - Please sign up before the workshop
Databricks community edition: https://databricks.com/ce
The sign-up process is straightforward. Databricks will send you an email confirming your account, and when you click on the link in the email, it will take you into your Databricks account. Feel free to use it or poke around in it, and when you are done,
Written instructions on how to sign up for an account and import the notebook we will be using in the workshop: Signing up for a Databricks account
Workshop Jupyter Notebook - Please Download
Notebook for the conference: workshop.zip
Download the workshop.zip file at the above link and unzip it, or download the notebook from the Dropbox folder for this workshop session (if you are attending the workshop, you received an email). On most Macs the file will be unzipped automatically. On a PC, open the folder and drag the file out. After unzipping the file, the notebook is named workshop.ipynb.
The above instructions for creating a Databricks account also cover importing the notebook. The same instructions are also in the Dropbox folder.
Jupyter notebooks are JSON documents, so this is a small file. Since the notebook is pre-populated with code, you do not need to have any prior experience with JSON, Python, SQL, or Spark. If you do, that's great, but this workshop only requires you to have a laptop and Internet access (wireless is provided at the conference). In the workshop we will be walking through what the notebook is doing, iterating and changing some of the code and creating visualizations. You will be running your notebook on a Databricks community account, so be sure to sign up for a free account as described above.
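To make the "notebooks are JSON documents" point concrete, here is a minimal sketch that builds a tiny notebook in the nbformat 4 layout, serializes it, and reads the cells back using only Python's standard json module. The cell contents are hypothetical and are not taken from the workshop notebook.

```python
import json

# A minimal notebook in the nbformat 4 layout. The cell contents are
# made up for illustration; they are not the workshop notebook's cells.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Hypothetical title\n"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello')\n"],
        },
    ],
}


def summarize_cells(nb):
    """Return (cell_type, source_text) for every cell in a notebook dict."""
    summary = []
    for cell in nb["cells"]:
        source = cell["source"]
        # nbformat allows source to be one string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        summary.append((cell["cell_type"], source))
    return summary


# Round-trip through JSON text, which is what an .ipynb file stores.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)
summary = summarize_cells(loaded)
for cell_type, source in summary:
    print(cell_type, repr(source))
```

The same summarize_cells function could be applied to a real file by loading it with json.load, since an .ipynb file is just this JSON saved to disk.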
We will be making some changes as you run your notebook in the workshop, but if you want to see what one version would look like, click on this link. One feature of Jupyter notebooks is that you can share a read-only version of your work, and you decide when to update the shared version. You can also publish your notebook as an HTML file that you can load as a webpage.