Preface

It is said that the most important characteristic of a data scientist is curiosity. Curiosity has certainly led me on a path of discovery through the world of data science and the many fascinating data sets I have encountered. So the premise of this book is to let the data sets lead you to learning. The best and most interesting way to learn is to find some data, ask questions about it, analyze it, visualize it, and then write down the new questions that occurred to you during your initial analysis.

This is how I organized the first two data science courses I ever taught, and surprisingly it worked. In fact, it worked so well that I would never want to teach the subject any other way. Nevertheless, a high-level look at the table of contents may not make clear what this course covers and which learning goals it strives to achieve. So let me lay it out for you in a different organization.

Learning Objectives

  • Articulate the data science processing pipeline
  • Extract data using SQL
  • Gather data from the Internet using web APIs and screen scraping
  • Combine data from different sources
  • Clean the data
  • Handle missing data, find outliers, and fix data
  • Normalize and rescale data
  • Visualize the data
  • Translate questions to analysis and analysis to interesting stories
  • Analyze data
    • Single-variable regression, logistic regression
    • Market basket analysis
    • Cohort analysis
    • Sentiment analysis, exposure to Bayes' Theorem
    • Time series
    • Geographic analysis
    • Simulations, Monte Carlo
  • Understand statistical significance and how to test for it using practical simulation techniques

More Traditional Topic Outline

  • Data Gathering
    • Using Web APIs
    • Reading CSV files
    • Screen Scraping
    • Reading data from relational databases with SQL
  • Data Munging
    • Dealing with missing data
    • String processing
    • Regular expressions
    • Re-encoding data (one-hot)
    • Re-scaling data
  • Data Querying
    • Filtering
    • Group by and aggregation
    • Joining
    • Sorting
    • Reshaping
    • Pivoting
  • Analytical techniques
    • Linear Regression
    • Sentiment analysis
    • Market basket analysis
    • Cohort analysis
    • Time series
  • Visualization
    • Understanding Distributions
      • Histogram
      • Box and whisker plot
      • Violin plot
    • Understanding relationships
      • Scatter plot
      • Bubble plot
      • Heat map
      • Network diagrams
      • Chord charts
    • Making Comparisons
      • Bar chart / stacked bar chart
      • Line chart
      • Spider plot
    • Geographic analysis
      • Choropleth maps