Data Wrangling Seminar

Why should I take this seminar?

This seminar gives you the foundations for an essential skill in data science: data wrangling. Data scientists analyze large amounts of data that have been gathered from multiple sources and prepared for analysis. They write computer programs to acquire and transform that data.

This seminar allows you to gain hands-on experience as a data wrangler. A data wrangler is a person who can conduct the first two steps of the data science system: data input (step 1), and data cleaning and transformation (step 2). These two steps together are also known as Extract, Transform, and Load (ETL).
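
As a rough illustration of these two steps, the sketch below reads a small inline CSV sample with Python's standard csv module, cleans it, and writes the result back out. The sample data, the column names, and the helper name clean_rows are hypothetical and are not taken from the seminar materials.

```python
import csv
import io

# Hypothetical raw data; in practice this would come from a file or an API.
RAW = """city,stations
San Jose, 12
 Oakland,7
Fresno,n/a
"""

def clean_rows(text):
    """Step 1: read the rows. Step 2: strip whitespace and convert types."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        city = row["city"].strip()
        count = row["stations"].strip()
        # Skip rows whose count is not a whole number (one simple cleaning rule).
        if not count.isdigit():
            continue
        cleaned.append({"city": city, "stations": int(count)})
    return cleaned

rows = clean_rows(RAW)

# Load: write the cleaned rows to an in-memory CSV (a stand-in for a real file).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["city", "stations"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Dropping the row with a non-numeric count is only one possible policy; a real project might flag such rows for review instead.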

Being a data wrangler is no small thing. By a commonly cited estimate, data scientists spend about 80% of their time wrangling data. The data science system depends on the good work of data wranglers.

This seminar, like the other seminars, uses the Python programming language. We will use basic Python and pandas (a Python package).

Objectives

After completing this seminar, you should be able to:

  1. List different sources of data and data classifications
  2. Describe the data science system and data wrangling
  3. Interpret, modify, and create basic Python programs to wrangle data using pandas

Seminar Structure

This seminar has three parts: pre-seminar, live seminar, and post-seminar. You may complete only some of them (for example, only the live seminar, or only the pre-seminar or post-seminar work), but to earn a digital badge you must complete all three parts.

Seminar Description

During the live seminar we will interpret, execute, and trace basic Python programs using pandas in Google Colaboratory. After completing the live seminar, you should be able to interpret, modify, create, and execute basic Python programs that:

  1. receive data input from the keyboard and from text files and output data to the screen and to text files
  2. use basic data types (string, integer, float, and boolean)
  3. use lists
  4. import csv and pandas
  5. use pandas to identify and correct simple data anomalies
  6. wrangle data from Spotify 
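
To give a flavor of item 5, here is a minimal sketch of identifying and correcting simple anomalies with pandas. The sample table, its column names, and the choice to fill a missing value with 0 are assumptions made for illustration; they are not the seminar's own data or method.

```python
import pandas as pd

# Hypothetical sample with three common anomalies:
# a duplicated row, a missing value, and numbers stored as text.
df = pd.DataFrame({
    "track": ["Song A", "Song B", "Song B", "Song C"],
    "plays": ["100", "250", "250", None],
})

# Identify: count missing values and fully duplicated rows.
missing = int(df["plays"].isna().sum())
duplicates = int(df.duplicated().sum())

# Correct: drop duplicates, convert text to numbers,
# and fill the gap with 0 (an assumed policy, not the only option).
clean = df.drop_duplicates().copy()
clean["plays"] = pd.to_numeric(clean["plays"], errors="coerce").fillna(0).astype(int)
clean = clean.reset_index(drop=True)

print(missing, duplicates)
print(clean)
```

Whether to fill, drop, or investigate a missing value depends on the data set, so treat the correction step as a starting point.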

Seminar materials

The materials listed on this page are for students. If you are faculty, please contact leslie.albert@sjsu.edu to request access to faculty materials.

The pre-seminar module contains:

  1. A note discussing data sources and classifications (beta version)
  2. A note discussing data science and wrangling (beta version)
  3. A note on using Google Colaboratory

Live seminar materials

The live seminar module contains:

  1. A PDF of the Data wrangling presentation
  2. A PDF of the Wrangling Spotify data presentation
  3. Wrangling budget data: guide, data, and programs (in Jupyter notebooks) used in the examples during the live seminar.
  4. Wrangling charging stations data: guide and data for creating Python programs in Jupyter notebooks to clean data on electric charging stations. The solutions to these programs are provided in the post-seminar module.
  5. Wrangling Spotify data: narrative and programs in Jupyter notebooks to wrangle data from Spotify

The post-seminar module contains:

  1. Solutions to the electric charging stations programs assigned during the live seminar
  2. Resources to reinforce your knowledge of pandas functionality
  3. Resources to reinforce your knowledge of how APIs work
  4. Other resources to learn and practice pandas and wrangling data from Spotify
  5. Practice test
  6. Final test to earn your digital badge

How long does it take to complete the post-seminar work?

The post-seminar work is designed to take approximately 3 hours and serves as practice for the final test. However, you are encouraged to practice more. The more you practice, the more comfortable you will feel programming.

You should allocate an additional hour for the final test. The test can be completed in less than 30 minutes, but you may take up to an hour.