BUS4 118D - Big Data

This course counts as an MIS elective for purposes of the MIS concentration. In the past it has also counted as an elective in the Business Analytics concentration, but be sure to check with your advisor in the Jack Holland Student Success Center to ensure it will count towards your concentration.  For Fall 2021 there are two sections offered:

Section 1:

Course number:  43278
Class Time:  Tuesday & Thursday 12:30pm - 1:45pm
Hybrid:  Online (Tuesdays) and In-Person (Thursdays)
Location (Thursdays): BBC 103

Section 2:

Course number:  43279
Class Time:  Tuesday & Thursday 2:15pm - 3:30pm
Hybrid:  Online (Tuesdays) and In-Person (Thursdays)
Location (Thursdays): BBC 103

For the detailed syllabus and weekly schedule, please see Canvas.

Course Format: The course is very hands-on: you will use Big Data tools to wrangle, analyze, and visualize a Yelp dataset.  We will discuss how companies approach data wrangling (working with Big Data), a framework for data projects, how companies are using Big Data, issues of ethics and privacy (the CCPA went into effect in 2020), and data visualization.

Since this class is hands-on, you will need a computer. If you don't have a laptop or desktop, you can check one out from the University (details will be posted in Canvas; other students have done this in the past).

Since we are hybrid this semester, some of the hands-on exercises will have videos posted that walk you through the exercise.  When we were meeting in person, we did these in class. The exercises and labs are designed to get everyone up to speed and comfortable with the tools, since you will use them on a team project. If you have taken BUS4 92 (Introduction to Business Programming, which uses Python) or BUS4 112 (Database Management Systems), those courses will be helpful.

During the scheduled class times we will have discussions, exercises, and team meetings. The class lectures will be recorded, and you will be expected to watch them before attending class.  After the first few weeks, each team will select a project question, and we will alternate between weeks of class discussion and weeks where I meet with each team to discuss its progress.  Attendance at the class discussions is required.  For the team meetings, you are required to be online with your team on the day we are meeting (Tuesday or Thursday) and to contribute to the team discussion.

We will form teams early in the semester.  Your team will pose a potential business question about a real-world dataset provided by Yelp and then apply the framework we learn in class, along with the Big Data tools, to answer that question.  Since you do not know the answer to your question at the start, you are graded on how you apply the process, how you document your work, how you identify issues in the data, and whether you are curious about your data - not on getting a specific result.  The team will be responsible for deliverables throughout the semester, and each team presents its results at the end of the semester. To encourage everyone to contribute to their team, part of your team deliverable grade is based on your contribution to your team. Each deliverable requires a team discussion of each member's contribution.  Based on the team's assessment, the instructor will allocate the team points - depending on your contribution, you may earn more or less than what the team earned overall.

Course Goals and Description:  Data Science is currently a hot topic in industry, and Big Data is the fuel of data science.  In the early years, data scientists were often Ph.D.s from the hard sciences (such as astrophysics), but increasingly data science is a team sport.  The aim of this course is to prepare you for the aspects of data science that consume most of the team's effort and to give you skills that can help you enter this exciting field.

Across many industries, an often-cited estimate is that 80% of a data scientist's day is spent wrangling data.  This includes getting data, formatting it, transforming it, and profiling it - asking questions of the data to understand it.  The "sexy" aspect of developing complex models is a small part of the job, and being able to visualize and communicate the results to upper management is required for businesses to get any value out of the analysis.  In this course we will focus on the data wrangling aspect using a dataset provided by Yelp. You will ask questions of the data and do your data wrangling in Apache Spark using a notebook interface. The tools are web-based, and both Spark and Jupyter notebooks are among the hottest tools in Big Data and data science. Your team will then create a visualization using Tableau that is embedded back into your notebook.

The importance of data wrangling was summed up by DJ Patil, the first Chief Data Scientist for the U.S. Government (in the Obama administration), who stated that: "Good data scientists understand, in a deep way, that the heavy lifting of cleanup and preparation isn't something that gets in the way of solving the problem: it is the problem."
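
To make "profiling" a little more concrete, here is a minimal sketch in plain Python of the kind of questions you might ask of a dataset: how many rows are there, which fields have missing values, and how are the ratings distributed? This is only an illustration, not the course's actual exercise. In class we will use Apache Spark rather than plain Python, and the records and field names below (business_id, stars, date) are made up for this example and are assumptions, not the real dataset schema.

```python
import json
from collections import Counter

# Hypothetical, made-up records in a JSON-lines layout (one JSON object per line).
# The field names are assumptions for illustration, not the actual dataset schema.
RAW_LINES = [
    '{"business_id": "b1", "stars": 5, "date": "2020-01-15"}',
    '{"business_id": "b1", "stars": 4, "date": "2020-02-03"}',
    '{"business_id": "b2", "stars": null, "date": "2020-02-10"}',
    '{"business_id": "b2", "stars": 2, "date": ""}',
]


def load_records(lines):
    """Getting and formatting: parse each JSON line into a Python dict."""
    return [json.loads(line) for line in lines]


def profile(records):
    """Profiling: count rows, missing values per field, and the star distribution."""
    missing = Counter()
    stars = Counter()
    for rec in records:
        for field, value in rec.items():
            # Treat both null and empty strings as missing values.
            if value is None or value == "":
                missing[field] += 1
        if rec.get("stars") is not None:
            stars[rec["stars"]] += 1
    return {"rows": len(records), "missing": dict(missing), "stars": dict(stars)}


records = load_records(RAW_LINES)
summary = profile(records)
print(summary)
```

Even on four rows, the summary surfaces the kind of issues you would need to document and decide how to handle (a missing rating, an empty date) before any analysis or visualization.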

The goal is for every team to create a notebook that the team members can discuss and show to recruiters if they are interested in becoming a data analyst or getting into a data-related career. The notebook and visualizations your team creates can be published, shown, and shared with recruiters, and you can include a link on your website or LinkedIn profile. Keep in mind that you need to do the exciting but hard work of producing a team project you understand and are proud of. If you do not understand your team project, a recruiter will not be impressed no matter how well your team did.

The Yelp data is available for academic use, and a new dataset was posted in March of 2021.  Last summer, Yelp also published a supplemental dataset related to how businesses are reacting to the COVID crisis, and we may also use that data. The current dataset contains data on reviews, businesses, users, tips (mini-reviews), and user check-ins.  Each time Yelp publishes a new version of the dataset, it grows.  The version we will be using covers 160K businesses in 8 metro areas in the United States and contains over 8 million reviews in total.

The tools in this field continue to evolve, but we will be using the following:
Apache Spark: Currently one of the fastest-growing Big Data tools. We will use a version hosted by Databricks on Amazon Web Services (AWS) with Jupyter notebooks, which are currently popular with data scientists.  Databricks was founded by members of Berkeley's AMPLab who developed and open-sourced Apache Spark.
Tableau: One of the most popular visualization tools and a skill that prior students have found to be in demand by recruiters.

Textbooks and Materials:  We will be using chapters from a number of books that are available online from the MLK Library through the Safari Online database.  This allows us to use selected chapters from a number of excellent books for free thanks to the MLK Library.  The cost for books and materials in this class is $0, but there is a heavy time commitment expected.

Prerequisites: BUS4 92, Introduction to Business Programming, and BUS4 112, Database Management Systems, are both helpful for this course but are not required. The exercises will walk you through the tools step by step.  We will also have a couple of sessions where we do a hands-on review of topics you may have covered in more detail in those courses.

Curiosity is a greater asset than specific technical skills.