Preface from the Second Edition

by Jan Pearce and Jacqueline Boggs

We are excited to bring you this enhanced edition of the book. As we were planning to teach a course in data analytics, one that is cross-listed in computer science and business at our institution, we found it quite challenging to identify a book with appropriate content for this type of interdisciplinary course. We were delighted to find this open-source book because of its clear focus on the data. We both believe that curiosity is exactly what drives data science and data analytics. When we encounter a set of data, it leads us to ask provocative questions that can often be answered by the data techniques covered in this book.

As professors, we believe it is crucially important that students build lifelong learning skills. We have found that students sometimes find it difficult to transfer what they have learned to another area, topic, or dataset. For these reasons, we wanted to add datasets to this book so that we could help students learn to better apply and transfer their knowledge.

Some of the key changes from the First Edition include:

  • Learning Goals, Learning Objectives, and Glossaries added to each chapter.

  • Chapter titles that identify the data technique to be utilized while still letting curiosity about each of the datasets drive the exploration.

  • The fourth chapter has been significantly expanded to include a targeted introduction/review of Python.

  • The option to use either Google Colaboratory notebooks or an Anaconda installation with Jupyter notebooks.

  • Additional datasets, presented as case studies that focus on business applications, alongside the existing case studies on other interesting topics.

One can find data science offered by departments such as computer science, mathematics, or statistics, as well as business, so this edition strives to appeal to the interests of students in each of these disciplines. Of course, the applications of data science reach even further, across the entire curriculum. Our hope is that the second edition of this text can be used for courses in Data Science, Data Analytics, Business Analytics, and possibly beyond!

We hope you like it and would love to hear from you!

Preface from the First Edition

by Brad Miller

It is said that the most important characteristic of a data scientist is curiosity. Curiosity has certainly led me on a path of discovery throughout the world of data science and the many fascinating data sets that I have encountered. So the premise of this book is to let the data sets lead you to learning. The best and most interesting way to learn is to find some data and then begin to ask questions about it, analyze it, visualize it, and then write down the new questions that have occurred to you during your initial analysis.

This is how I organized the first two data science courses I ever taught, and surprisingly, it worked. In fact, it worked so well that I would never want to teach it any other way. Nevertheless, it may not be clear from a high-level look at the table of contents what this course covers and what learning goals it strives to achieve. So let me lay it out for you in a different organization.

Learning Objectives

  • Articulate the data science processing pipeline

  • Extract data using SQL

  • Gather data from the Internet using web APIs and screen scraping

  • Combine data from different sources

  • Clean the data

  • Handle missing data, find outliers, and fix data

  • Normalize and rescale data

  • Visualize the data

  • Translate questions to analysis and analysis to interesting stories

  • Analyze data

    • Single variable regression, logistic regression

    • Market basket analysis

    • Cohort analysis

    • Sentiment analysis, exposure to Bayes' Theorem

    • Time series

    • Geographic analysis

    • Simulations, Monte Carlo

  • Understand statistical significance and how to test for it using practical simulation techniques

More Traditional Topic Outline

  • Data Gathering

    • Using Web APIs

    • Reading CSV files

    • Screen Scraping

    • Reading data from relational databases with SQL

  • Data Munging

    • dealing with missing data

    • string processing

    • regular expressions

    • re-encoding data (one-hot)

    • re-scaling data

  • Data Querying

    • filter

    • group by and aggregation

    • joining

    • sorting

    • reshaping

    • pivoting

  • Analytical techniques

    • Linear Regression

    • Sentiment analysis

    • Market basket analysis

    • Cohort analysis

    • Time series

  • Visualization

    • Understanding Distributions

      • Histogram

      • Box and whisker plot

      • Violin plot

    • Understanding relationships

      • scatter plot

      • bubble plot

      • heat map

      • Network diagrams

      • chord charts

    • Making Comparisons

      • bar chart / stacked bar chart

      • line chart

      • spider plot

    • Geographic analysis

      • Choropleth maps
