Additional resources for students and faculty

Each seminar page links to materials directly relevant to that seminar. If you want to explore further, you may be looking for websites, blogs, or podcasts to follow, or for books and videos on related topics. You may also be wondering whether there are data sets you can explore, software or web-based resources you can get access to, or credentials or badges you can earn. This page collects such materials, grouped by type. We will update it as we learn of new resources, so please bookmark it and come back later to explore some more.

Our Playlist: Data Science for All Playlist on Safari Online. This is a public link to a Safari Online playlist of some books we have used and recommend. It's a tiny sample from the sea of books available through Safari. You will need a Safari account to access it. See below for how SJSU students, staff, and faculty can access Safari Online for free through the MLK Library.

Websites, Newsletters, Blogs, and Podcasts

Title Topic Description
Data Science for All blog Data Science

Yes, this is a bit self-serving, but we figured we would list our blog first, even if we have not been blogging consistently through the pandemic. For those who participated in any of the workshops, we will be posting on a semi-regular basis about events, software, datasets, and other data science related topics and things we have seen on the web that hopefully will be of interest to you.

Introduction to Pandas Tutorial Data Analytics

This zip file includes an original case study (Dr. Leslie Albert), a modified version of the Tableau Sample-Superstore dataset (original available here), and a detailed tutorial that builds on the case to teach basic Pandas functions. This tutorial is also available on our Presentations and Publications page.

Datanami Data

Articles about companies, products, and issues related to data such as Big Data, databases, and visualization. (Some articles are written by vendors from their viewpoint, so read the byline.)

Data Science Central Data Science Articles about companies, tools, and techniques in data science. (Some articles are written by vendors from their viewpoint, so read the byline.)
Tech Crunch Technology industry What's going on in the tech industry, particularly with tech start-ups.
KD Nuggets Machine Learning & Big Data Focus is more on techniques and learning - often has links to good materials on learning a topic. Often has links to free books on programming and stats topics.
Data Elixir Data Science Newsletter with articles on data science and visualization - brings together articles from other sources too.
O'Reilly Tech & Business O'Reilly is a publisher, but the link here is to free newsletters you can sign up for on various tech topics.  See also the separate link for Safari Online.
Storytelling With Data Visualization Blog on visualization techniques and what makes a good visualization. Often has good articles on when to use different types of visualizations.
Data Science Weekly Data Science Weekly newsletter that brings together stories about data science.
Safari Online Tech Books and Videos

Safari is an online source with a lot of technology topics.  I use a number of their books in class and create a reading list for my class.

If you are an SJSU student or employee, you have free access.  Go to this link at the SJSU library and log in.

If you are not an SJSU student, but at another school, you can go to the Safari Online website, click the "Sign In" button, enter your school ID, and try to sign in with SSO (single sign-on). Alternatively, if you are a student, sign up for the ACM (Association for Computing Machinery), and for $19/yr you have access to educational resources including Safari Online.

Stack Overflow Programming

Stack Overflow is best known as the help site where you go to get programming questions answered (a community), but they also have a newsletter.

The Batch Deep Learning

Newsletter by DeepLearning.AI which is an educational initiative started by Andrew Ng.  Much of the content is on the deep end (pardon the pun), but even for newbies there are some interesting and less technical articles.

Tableau Training Videos

This is a set of hands-on training videos that Tableau makes available. Most are very short and digestible. If you have not already signed up for Tableau through the software download, you may be asked to sign up first (it's free).

Neo4j Connections Videos Graph Databases

Neo4j hosts free monthly virtual half-day conferences on graph topics. You can register to participate live, or you can visit the Connections page and watch the recorded sessions.

MBAStack Data Science

Visit this site to learn more about data science resources and potential careers!

Analytics

Visit this site to learn more about different analytics degree options.

Software for Students and Faculty

The following software is available to faculty and students at any or most schools in the United States and often overseas (not just SJSU - see SJSUOne for software available to SJSU students). Some software is available for faculty to request for their students, but not directly to students. Other software is only available to students. In almost all cases you are limited to non-commercial, educational, and/or research use; be sure to read the license if that's an issue for you.

Title Cost Student or Faculty Description
Databricks $0 both We use the community edition of Databricks in the Spark and Jupyter Notebook seminar.  This is an awesome tool with a great interface and it's always up-to-date on the latest Spark features.  The link is to the community edition.
Databricks University Alliance $0 different features for each If you are using Databricks, check this site out.  If you are faculty, you can sign up for additional resources and they are very responsive. If you are a student, check out some of the self-paced training materials available.
Tableau $0 different options for each Tableau makes licenses available for students and faculty, with faculty also able to request licenses for their students and a Tableau Online license to use in their class. There is also a Tableau Public option, where you can publish your visualizations and share them with the world.
NOTE: if you're a student, click on the "free student license" button in the upper right-hand corner.
Neo4j $0 anyone This is the graph database used in the Exploring Relationships in Graphs seminar. For the seminar, we are using the community edition, and for the digital badge quiz we are using the Sandbox (click on the "Get Started with Sandbox" button). The sandbox is their web-based implementation, so there is no setup on your part. However, the sandbox is temporary (3 days, renewable for another week) and limited in the resources available. The community edition is limited by the resources on your laptop and how you configure it (see the notes for the seminar for guidance on configuring).
Google Colaboratory $0 anyone The link takes you to a Jupyter notebook on Google Colaboratory, which is a site Google hosts that allows you to run Jupyter notebooks and even use TensorFlow. You can save notebooks you create to your Google Drive account and you can share notebooks you create (and others can share with you), or you can use notebooks shared on GitHub. Multiple seminars in the Data Science for All series use Google Colab.
OpenRefine $0 anyone This is an open-source desktop data wrangling tool for transforming your data. There is a large community of users and it's particularly popular in the information retrieval / library sciences community. This used to be known as Google Refine.
Anaconda $0 (individual version) anyone Anaconda (individual edition) is a free and easy-to-install desktop tool for data science. It includes a number of tools that are useful for working in Python, including a desktop version of Jupyter that will run locally in your web browser. We use this to do some of the data wrangling for the seminar data sets (and provide the notebooks as optional materials on the seminar webpages).
Safari Online Sandboxes See above (free for SJSU students & employees) both Safari Online is mostly about books and videos on technical and business topics (see above), but they started adding sandboxes on their main page that allow you to play with different technologies.

Data Sets You Can Use

Title Description
Yelp Open Dataset We use part of this dataset in the Exploring Relationships in Graphs seminar. Yelp updates this dataset approximately each year. When you download the data, it will have all of the JSON data files zipped up in a tarball (if you are not sure what that means, see the additional materials for the graph seminar - it walks through how we built the data file for that seminar).
Update: In Spring 2022 Yelp updated the dataset to include 2021, and it now covers a different set of 11 metro areas than the prior datasets. 10 of the metro areas are in the United States and 1 is in Canada. Although more metro areas are included, the dataset itself is actually slightly smaller.
Yelp Dataset on Kaggle This is the same dataset as at the above link, but you can download individual data files in JSON whereas the dataset above is all one zipped tarball (more compact if you are downloading all of it). It does not include the photos, but if you are on Kaggle, you can create a notebook from it, so it is easy to start playing with it. A few prior versions are also available on Kaggle (click on the version link when looking at the data).
Social Security Baby Names We use these data sets in the digital badge quiz for the Spark and Jupyter Notebooks seminar. If you want a notebook that loads the data for you, see the notebook for the quiz in the post-seminar module for that seminar. This is a data set that the Social Security Administration updates each spring with the number of people born in each year who applied for a Social Security card with a given first name. They have counts by year and gender going back to 1880 (not a typo), or by state back to 1910.
USA Spending In the Spark and Jupyter Notebooks seminar we use a pre-wrangled subset of the data from this site. The notebook for that seminar will load the data, and if you want the data we use in the seminar, you can also download the data files from the seminar page. If you want to use the USA Spending API to download a different set of years or accounts, see the optional materials for the seminar - we have some Jupyter notebooks (using Anaconda from your desktop) that use the API to download the data. If you start with those and modify them for your needs, you can save yourself some headaches. USA Spending has data on all of the contracts and grants that the government spends money on, so you can see where the money goes. Recently they also added data on COVID spending.
College Scorecard This is a data set made available by the Department of Education that contains data about universities - costs, completion rates, etc.
Centers for Medicare & Medicaid Services (CMS) CMS makes data available (by year) on medical payments. This was used in this project on Data4Good for looking at payments from drug companies to doctors and hospitals. The payment files are fairly large zip files (300 MB to 800 MB). You may want to write code to directly access the files from the platform you are using if it's cloud-based. That project also has links to other data files they used in tracking opioid prescribing.
COVID-19 data The linked site is the Tableau COVID-19 Data Hub which pulls data from Johns Hopkins on a daily basis and makes it available. They also have some starter workbooks for analyzing and visualizing the data. The data is also part of the AWS Marketplace (click on the "Data Products" tile). The AWS Marketplace also contains other COVID-19 datasets that may be of interest.
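
If the "tarball" mentioned in the Yelp Open Dataset entry is new to you, the following is a minimal Python sketch of one way to read JSON files straight out of a tar archive without unpacking it to disk. It is not the method used to build the seminar data file. It assumes each JSON file holds one record per line (an assumption about the file layout, not something stated on this page), and the function names, the archive member name `demo_business.json`, and the sample records are all hypothetical and made up for illustration.

```python
import io
import json
import os
import tarfile


def iter_json_records(tar_source, member_suffix=".json", limit=None):
    """Yield (member_name, record) pairs from line-delimited JSON files in a tarball.

    tar_source is a file path or a binary file object. Records are read
    one line at a time, so nothing is unpacked to disk.
    limit caps the number of records read from each file (None = no cap).
    """
    if isinstance(tar_source, (str, os.PathLike)):
        open_kwargs = {"name": tar_source}
    else:
        open_kwargs = {"fileobj": tar_source}

    # "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none).
    with tarfile.open(mode="r:*", **open_kwargs) as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(member_suffix):
                continue
            handle = archive.extractfile(member)
            count = 0
            for raw_line in handle:
                line = raw_line.strip()
                if not line:
                    continue  # skip blank lines
                yield member.name, json.loads(line)
                count += 1
                if limit is not None and count >= limit:
                    break


def build_demo_tarball():
    """Create a tiny in-memory .tar.gz with made-up records (demonstration only)."""
    payload = (
        b'{"name": "Cafe A", "stars": 4.5}\n'
        b'{"name": "Cafe B", "stars": 3.0}\n'
    )
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="demo_business.json")  # hypothetical file name
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, record in iter_json_records(build_demo_tarball()):
        print(name, record["name"], record["stars"])
```

To try this on a real download, pass the path of the archive you saved in place of `build_demo_tarball()`, and use `limit` to peek at the first few records of each file before loading everything.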