Statistical Foundations Seminar

Instructor: Dr. Subhankar Dhar

 

Seminar description:

The seminar introduces data science as an interdisciplinary field and delves into different steps needed to become a data scientist. It also covers important statistical concepts along with relevant computer programming fundamentals, in conjunction with hands-on analysis of real-world datasets. It consists of several statistical modules developed using Jupyter Notebook and Python.

 

Why has this seminar been designed?

Data Science is an important field with a growing number of career opportunities, and it has applications in almost every industry. Examples include recommendations for movies and restaurants, improving customer loyalty and retention, hiring the right people, loan approval, measuring brand exposure, detecting credit card fraud, predictive maintenance, and early detection of supply chain disruption, to name a few. This seminar is designed to introduce students to various problems and use cases arising in industry and to the statistical concepts necessary to deal with these problems. These modules are meant to introduce students to Data Science early in their academic careers. No prior knowledge of Data Science is required.

 

Why is this topic relevant?

A good knowledge of statistics is necessary to solve problems in Data Science. Statistical tools and techniques are useful for exploratory data analysis and decision making. Hence, this topic was chosen to introduce statistical concepts that are relevant to data scientists.

 

After completing the seminar, you should be able to:

  • Understand basic statistical principles often used by data scientists
  • Apply common statistical tools and techniques used in Data Science
  • Use Python and Jupyter Notebook to analyze large datasets
  • Visualize and interpret results for decision making

 

After participating in the seminar and completing the post-seminar assessment, you should be able to:

  • Work with Jupyter Notebook on your computer
  • Use various Python toolkits and related statistical packages most commonly used in data science
  • Run statistical applications using Python 
  • Understand the landscape of data science tools and their applications, and how to identify and dig into new technologies and algorithms needed for the job at hand
  • Analyze large datasets for visualization 
  • Analyze large datasets to get insights and make business decisions

 

What you will be doing during the seminar:

You will work with open datasets made available by Kaggle, focusing on housing prices. We will analyze various features and try to predict prices based on them. No prior experience is needed, but to get the most out of the seminar, please do the following:

How to get started:

  • Register for the seminar – it’s 100% free, and registering will give you access to a Canvas course with all of the seminar materials, optional pre-seminar exercises, and additional materials (some of these are included below, but they are more convenient in Canvas).
  • Take the survey.
  • Get familiar with the pre-seminar review material in Canvas. This includes basic concepts in probability and statistics, as well as documentation on the Anaconda Distribution, the world’s most popular open-source Python/R Data Science platform. It is the easiest way to perform Python/R data science and machine learning on Linux, Windows, and Mac OS X. With over 11 million users worldwide, it is the industry standard for developing, testing, and training on a single machine.

 

Seminar materials

Note: Additional materials are available in Canvas after you register.

  • Datasets 
    • Download the dataset (kc_house_data.csv) from here
      • The dataset contains house sale prices between May 2014 and May 2015 for King County.
      • It has 21,613 observations and includes 19 house features plus the price and id columns.
      • The features include the number of bedrooms and bathrooms, square footage, year built, etc.
    • Download the data (googleplaystore.csv) from here:
      • This dataset contains information about apps from the Google Play store.
      • It has 13 columns describing various features, such as the name of the app, its rating, its category, and whether it’s free or paid.
      • The reviews/ratings column can be used to estimate how many people use the app.
  • Pre-seminar module: Basic probability review (Khan Academy). In addition to the datasets and websites with examples using Python, there is review material in the Canvas module that covers fundamental concepts of probability and statistics widely used in data science. If you want to learn more, there are PDFs and links to other resources that discuss in detail various applications of statistics in data science for decision making.
  • Introductory Statistics: A free online book. Read chapters 1 through 6 to get an overview of the material that will be covered in this seminar. URL: https://openstax.org/details/books/introductory-statistics
  • Seminar Slides are provided on Canvas.
  • We will be using Google Colaboratory, a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. All you need is Internet access and a browser. With Colaboratory you can write and execute code, save and share your analysis, and access powerful computing resources, all for free from your browser.
  • Faculty: If you are a faculty member at SJSU or any university or community college and you would like to host a seminar at your school or use the materials in your course, or if you are a Dean looking to offer data science instruction for your students, we would like to hear from you! Please contact leslie.albert@sjsu.edu for more information.
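
To give a feel for the kind of analysis described above (summarizing house prices and predicting price from a feature), here is a minimal sketch. It is not the seminar's actual notebook code. It uses only the Python standard library and a few made-up rows so that it runs anywhere, and the column names `price` and `sqft_living` are assumptions about how kc_house_data.csv is laid out.

```python
import csv
import io
from statistics import mean, median

# Made-up rows for illustration only; they are NOT taken from kc_house_data.csv.
# The column names "price" and "sqft_living" are assumed, not confirmed.
SAMPLE_CSV = """id,price,bedrooms,sqft_living
1,300000,2,1000
2,450000,3,1500
3,600000,4,2000
4,750000,4,2500
"""


def load_rows(handle):
    """Read CSV rows and convert the two columns we need to floats."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "price": float(row["price"]),
            "sqft_living": float(row["sqft_living"]),
        })
    return rows


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


rows = load_rows(io.StringIO(SAMPLE_CSV))
prices = [r["price"] for r in rows]
areas = [r["sqft_living"] for r in rows]

# Descriptive statistics, then a one-feature linear prediction.
slope, intercept = fit_line(areas, prices)
print("mean price:", mean(prices))
print("median price:", median(prices))
print("predicted price for 1800 sq ft:", slope * 1800 + intercept)
```

To try this on the downloaded file, you could pass `open("kc_house_data.csv", newline="")` to `load_rows` instead of the in-memory sample, provided the file uses the assumed column names.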