Chapter 3: Tables of Data¶
In this class, we started out talking about how characters were represented on the computer, and how that choice (e.g., between ASCII and Unicode) can be a barrier to some people’s access to computing. We considered how computers might recognize and generate sentences, and about the challenges of understanding different dialects of English with the same language model. We then moved to the Internet and the Web, tracing its history and how grounded it was in English and male culture (e.g., the history of BASIC at Dartmouth). We wrote programs to pull data out of the Web (e.g., pulling out URLs and web-scraping) and to generate HTML for the Web. In this latest section of the course, we have written programs to analyze data from the Web.
Section 1: Billionaires¶
In this class, we have been writing Snap programs to take apart and analyze CSV files. In this ebook, we’ll be using the databases and blocks found in the Database Microworld with Titanic project. For example, here’s a script that selects the billionaires from 2014, sorts those billionaires in terms of their rank (in column B), then selects just the names.
The list is long – 16543 names. We can get Snap to show us all of them. Here’s what the top of that list looks like.
We can also describe the same script as a pipe where the billionaires are first selected, then sorted, and then the names selected out.
The SQL Version¶
In the database world, these kinds of queries are typically written in SQL, which stands for Structured Query Language. SQL is a domain-specific languages for managing data in tables. A database table is different than a CSV file. One difference is that columns in a database table usually have a type. You might store numbers in one column, dates in another, and text in yet another. Once set up, you may not put text in a column that can only hold numbers. Click ‘Run’ to generate a random sentence.
Database tables can be set up to have indices. An index (which can be one variable or a set of variables) should be unique for each row of a database. For example, an ID number associated with a customer or office could be an index. Access via an index is especially fast, much faster than searching the CSV table for information.
Here is the SQL code that does the same thing as the Snap programs above. Click the Run button to execute it.
Try answering these questions about the code above.
Section 2: Titanic¶
You may remember that we also did several analyses with the list of passengers on the Titanic. For example, here is a grouping by passenger class. This is how many there were in each class:
In this second example, we can first select by whether or not the passengers survived, and then group by class.
Computing the average for each class is left for you to do.
Here are each of these analyses, in SQL. Press Run to execute each of them.
The SQL Versions¶
Here are some questions about each of these:
Names on the Titanic¶
One of the more interesting analyses that we did with the Titanic data was to look at the most common first names of men vs. women.
This turns out to be much more difficult in SQL. SQL can generate substrings, but it can’t easiliy split by word. The Snap version is splitting by wording, then grouping and sorting the result.
Here’s an example that’s getting close – picking out the first three characters from all male names on the Titanic.
Now, we can see adults (“Mr.”) vs children (“Mas” for “Master”). But there’s not an easy way of getting the second word.