Spark & Jupyter Notebooks Seminar

Instructor: Dr. Scott Jensen

This seminar was first presented at San Jose State University in the Spring 2019 semester.

Why Attend this Seminar?

Apache Spark and Jupyter notebooks are currently two of the hottest tools in data science, and this seminar provides the opportunity to work hands-on with these tools even if you have no prior experience in programming or data science! You don’t even need your own computer!

Jupyter notebooks and Apache Spark are being used by data scientists at some of the largest web-based companies in Silicon Valley. Apache Spark allows data scientists to explore large datasets in varied formats to quickly identify patterns in the data. Jupyter notebooks allow them to not only visualize and document their results, but also easily share their research with colleagues and even generate publications, webpages, and presentations. Together, through a web-based interface, these tools allow you to explore and experiment with large datasets, quickly ask questions about your data, generate visualizations, and share your work, all without extensive coding! With a couple of clicks you can even publish your notebook to the web and share a link with family, friends, or recruiters, or include it on your LinkedIn profile.

Seminar Objectives

After participating in the seminar and completing the post-seminar assessment, you will be able to:

  • Load data into Spark DataFrames and ask basic questions of your data using PySpark
  • Understand the importance of documenting your work and using markdown in Jupyter notebooks
  • Create basic visualizations in Jupyter
  • Share and publish your results

Seminar Materials

To complete the exercises in this seminar, sign up for a FREE Databricks community account. Databricks was started by the creators of Apache Spark (at the UC Berkeley AMPLab, just down the road if you are at SJSU). Your web-based community account combines Apache Spark and a notebook interface and is hosted by Databricks at no cost to you on Amazon's AWS cloud. You can keep using your account after the seminar for class projects or side projects, even after you graduate.

Once you have signed up, if you forget the login URL, follow this link to log into your Databricks community account.

A short exercise on Apache Spark and Jupyter notebooks. See this document for an introduction to loading a notebook from Databricks into your account to explore the types of charts you can create in your notebook. Although the notebook runs in Databricks, the focus is more on possible data visualizations than on Apache Spark (we’ll get into Spark more in the seminar).

Seminar slides. This is a PDF file containing the slides for the seminar. Feel free to look at them beforehand, but if you don’t understand them before the seminar, that’s fine! We will walk through the topics covered in the slides together.

The completed seminar notebook. As discussed above, in the seminar you will import a notebook that contains some calculations and markdown documenting what you are doing, and we will walk through wrangling the data, adding queries, and visualizing the data. Your notebook will look like this at the end of the seminar.

The data we will be using

The data we will use is from a Federal website named USASpending.gov. This site has information on all of the payments on U.S. Government contracts by Federal agencies large and small. We will be analyzing nearly 20 million transactions covering the period 2014–2018. This period spans multiple years of two different administrations, led by presidents from two different political parties with different world views. Is spending different across the two administrations? Does the spending by agency reflect different priorities? Are the vendors used located in different states? Does the government’s spending follow any annual patterns across the months of each year? The notebook we will be using will load the data for you, but if you want to create the dataset or change what is included, follow the instructions below.
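To make these questions concrete, here is a minimal sketch of the kind of grouping and summing involved. In the seminar, questions like these are asked with PySpark on the full dataset; this sketch uses only the Python standard library so it runs anywhere. The rows, the field names (`action_date`, `agency`, `amount`), and the helper `total_by` are all hypothetical placeholders for illustration, not the seminar's actual data or code.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample transactions (made up for illustration, NOT real
# USASpending data; the field names are assumptions as well).
transactions = [
    {"action_date": date(2016, 9, 30), "agency": "Agency A", "amount": 500.0},
    {"action_date": date(2016, 9, 12), "agency": "Agency B", "amount": 250.0},
    {"action_date": date(2017, 1, 15), "agency": "Agency A", "amount": 100.0},
    {"action_date": date(2017, 9, 1), "agency": "Agency A", "amount": 300.0},
]


def total_by(rows, key_func):
    """Sum the 'amount' field of each row, grouped by key_func(row)."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_func(row)] += row["amount"]
    return dict(totals)


# Does spending follow a pattern across the months of the year?
by_month = total_by(transactions, lambda r: r["action_date"].month)

# Does spending by agency differ from year to year?
by_year_agency = total_by(
    transactions, lambda r: (r["action_date"].year, r["agency"])
)

print(by_month)
print(by_year_agency)
```

Each question comes down to choosing a grouping key (a month, a year, an agency, a state) and then summing or counting within each group. That same idea carries over to the much larger dataset used in the seminar.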

Creating the data files and loading them manually (totally optional). The data files we will be using in the seminar are based on a download from the USASpending website and contain data from 2014-2018. Some wrangling of the data will have already been done (though you will do some more in the seminar), and the data will be compressed and staged on Amazon’s S3 so the notebook can pull it directly into your Databricks account (avoiding having us all use the network to load large data files at the same time). However, if you want to see the sausage making and get your hands into the data, see these documents for more of the details. The second document covers loading data into Databricks, so you can also follow that if you want to explore further with a different dataset of your own.

The data files. If you want to use or load the data files used in the seminar, but you do not want to create them, download this zip file and follow the above instructions for loading the unzipped files. (Inside the zip file, each year's data is still compressed in the bzip2 format; leave those files compressed.) The zip file is 670 MB, so you may want to download it on a faster connection.
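If you are wondering why the yearly files can stay compressed: bzip2 files can be read as a stream and decompressed on the fly, and Spark can generally read bzip2-compressed text files directly. The sketch below shows the same idea with only the Python standard library. The file name, the column names, and the rows are hypothetical stand-ins written to a temporary folder, not the seminar's actual files.

```python
import bz2
import csv
import os
import tempfile


def write_sample(path):
    """Write a tiny bzip2-compressed CSV (hypothetical rows, not real data)."""
    rows = [
        ["fiscal_year", "agency", "amount"],
        ["2014", "Agency A", "100.50"],
        ["2014", "Agency B", "200.00"],
    ]
    # Mode "wt" opens the compressed file as text, so csv can write to it.
    with bz2.open(path, "wt", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def read_rows(path):
    """Read rows from a .bz2 CSV, decompressing on the fly."""
    with bz2.open(path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Create the sample in a temporary folder, then read it back
# without ever writing an uncompressed copy to disk.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_2014.csv.bz2")
write_sample(sample_path)

rows = read_rows(sample_path)
print(len(rows), "rows read; first agency:", rows[0]["agency"])
```

Leaving the files compressed keeps uploads smaller, and nothing is lost because the reader decompresses the data as it goes.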

Faculty Materials & Community Colleges

  • If you are a faculty member at SJSU or any university or community college, and you would like to host a seminar at your school or use the materials in your course, see the Teaching Materials page to request the additional materials available.
  • If you are a Dean or faculty member at a Bay Area community college, we would like to hear from you! We are working with community college faculty in the Bay Area and provide stipends to attend the seminar and assist in presenting it at your school.
  • Are you a Bay Area community college student? Ask your professors if they could incorporate the seminar into your current class or host a student event.