This course counts as an MIS elective for the MIS concentration and can be repeated (provided a different seminar course is offered). As discussed in the course catalog, special topics courses augment the regularly scheduled electives; this one covers the emerging topic of Big Data. For Fall 2017, two sections are offered:
Course number: 48769
Class Time: Tuesday & Thursday 12:00pm - 1:15pm
Location: BBC 103
Course number: 48770
Class Time: Tuesday & Thursday 1:30pm - 2:45pm
Location: BBC 103
For the detailed syllabus and weekly schedule, please see Canvas.
Course Format: This is a very hands-on course in which you use Big Data tools to wrangle, analyze, and visualize a social media dataset. Toward the end of the semester, there will be some lecture and discussion on the use of Big Data in different industries, issues of ethics and privacy, organizational approaches to Big Data, and organizational issues.
Most classes will be hands-on, so if you plan to use your own laptop, please bring it to class. If you don't have a laptop, you can check one out either from the Jack Holland Success Center here in the BBC or from the MLK Library. According to library staff, their laptops are in high demand, so it may be easier to check one out from the Jack Holland Center. You will run a few of the tools locally on your laptop, but you will use Apache Spark through a web-based interface that requires only a browser.
During class we will do hands-on exercises, each of which should be completed in class. There will also be a few take-home lab assignments in which you apply the same tools to answer a question about the dataset we are working with. The exercises and labs are designed to get everyone up to speed and comfortable with the tools, since you will use them on a team project.
We will form 3-person teams early in the semester. Your team will work together to pose a potential business question about a real-world social media dataset and then apply the framework we learn in class, along with the Big Data tools, to answer that question. Since you do not know the answer to your question at the start, you are graded on how you apply the process, how you document your work, how well you identify issues in the data, and whether you are curious about your data - not on getting a specific result. Each team will prepare a progress report during the semester and present its results at the end of the semester.
Course Goals and Description: Data Science is currently a hot topic in industry, and Big Data is the fuel for data science. In the early years, data scientists were often Ph.D.s from the hard sciences (such as astrophysics), but increasingly data science is a team effort. The aim of this course is to prepare you for the aspects of data science that consume most of the team's effort and to give you skills that can help you enter this exciting field.
Across many industries, a commonly cited estimate is that about 80% of a data scientist's day is spent wrangling data. This includes getting data formatted, transforming it, and profiling it - asking questions of the data to learn about it. The "sexy" aspect of developing complex models is a small part of the job, and businesses get value out of the analysis only when the results can be visualized and communicated to upper management. In this course we will focus on the data wrangling aspect using a dataset provided by Yelp. You will ask questions of the data and then create a visualization to present at the end of the semester.
The Yelp data is available as part of their Dataset Challenge, a data competition for students. A new dataset is released at the start of each semester and contains reviews, business data, and user data. The dataset grows each semester; last semester it contained over 4 million reviews for businesses in 11 cities.
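To make "profiling" a little more concrete, here is a minimal sketch in plain Python of the kind of questions you might ask of review data. It is an illustration only, not one of the course exercises or tools. The sample records and the field names (`review_id`, `business_id`, `stars`, `text`) are assumptions made up for this example and may not match the actual Yelp files.

```python
import json
from collections import Counter

# Hypothetical sample records, one JSON object per line.
# Field names are assumptions for illustration, not the confirmed Yelp schema.
SAMPLE_LINES = [
    '{"review_id": "r1", "business_id": "b1", "stars": 5, "text": "Great tacos"}',
    '{"review_id": "r2", "business_id": "b1", "stars": 4, "text": ""}',
    '{"review_id": "r3", "business_id": "b2", "stars": 1, "text": "Slow service"}',
    '{"review_id": "r4", "business_id": "b2", "stars": null, "text": "No rating given"}',
]


def profile_reviews(lines):
    """Return simple profiling counts for an iterable of JSON-encoded reviews."""
    star_counts = Counter()
    businesses = set()
    total = 0
    missing_stars = 0
    empty_text = 0

    for line in lines:
        record = json.loads(line)
        total += 1
        businesses.add(record.get("business_id"))

        # How are the ratings distributed, and how many are missing?
        stars = record.get("stars")
        if stars is None:
            missing_stars += 1
        else:
            star_counts[stars] += 1

        # How many reviews have no usable text?
        if not (record.get("text") or "").strip():
            empty_text += 1

    return {
        "total_reviews": total,
        "distinct_businesses": len(businesses),
        "star_counts": dict(star_counts),
        "missing_stars": missing_stars,
        "empty_text": empty_text,
    }


profile = profile_reviews(SAMPLE_LINES)
print(profile)
```

Even counts this simple surface data issues (missing ratings, blank review text) that you would want to document before trying to answer a business question.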
The tools in this field continue to evolve, but we will be using the following:
- OpenRefine: A data wrangling tool open-sourced by Google
- Apache Spark: Currently one of the fastest-growing Big Data tools, hosted on the web using Jupyter notebooks, which are currently popular with data scientists
- Tableau: One of the most popular visualization tools and a skill that prior students have found in demand by recruiters
- Neo4j: A graph database for visualizing social networks (graph databases are one of the hottest NoSQL database tools)
Textbooks: We will be using chapters from a number of books that are free to you, either online through the MLK Library as a student or directly from some of the tool vendors. For thinking about how to frame the questions you ask of the data, we will be using Thinking with Data: How to Turn Information into Insights, by Max Shron. You can download this book to a smartphone or tablet from the MLK Library for free.
Prerequisites: None. Both BUS4 92, Introduction to Business Programming, and BUS4 112, Database Management Systems, provide helpful background for this course, but neither is required. The exercises will walk you through the tools step by step. We will also have a couple of sessions where we do a hands-on review of topics you may have covered in more detail in those courses. Curiosity is a greater asset than specific technical skills.