9.2. Exploring Bike Rental Data with SQL



A lot of the data that we interact with today is stored in databases. For example:

All these bits of information are stored in various kinds of databases. Some of these are stored in relational databases that are available as open source tools like Postgresql, MySQL and SQLite, as well as commercial databases such as Google BigQuery, Oracle, Microsoft SQL Server, or Amazon Aurora

Others are stored in proprietary systems like Google’s BigTable or Facebook’s Haystack Object Store.

9.2.1. Query Language

Whatever the database might, there needs to be a way to extract data from it and a lot of these systems have agreed on a shared language for accessing data. For relational database, this language is called SQL (Structured Query Language, pronounced like “sequel”).

Before you stress out about learning a new language, lets take a minute and review the things you have already learned how to do with Pandas.

  • You can change the shape of a DataFrame by selecting the columns you want or computing new columns.
  • You can filter a DataFrame by using conditions to select just the rows you want.
  • You can reorder a DataFrame by sorting on one or more columns.
  • You can group by one or more columns and compute aggregate summaries of other columns in the group.
  • You can join two dataframes together using the merge function.

The operations just described comprise a basic set of tools that any data manipulation language should support, and SQL supports these operations very well, in a very natural way. You are not going to have to learn any new concepts in this chapter you are just learning some new query syntax that will open up whole new worlds of data access for you. Most businesses run on a relational database of some kind, so it follows that a lot of real world data analysis requires you to get data from one. In this section we will teach you how to get started.

9.3. Getting Started with the Bike Data

In this Lesson, we will be hands on and try out SQL with the Capital bike sharing dataset, hosted on a SQLLite database.

There are just two lines we need to execute at the top of our notebook, One line loads an extension so that we can write SQL in the cells of our notebook. The second ‘connects’ to our SQLLite database. You may need to install the ipython-sql module using conda install ipython-sql

%load_ext sql

Now connect to the bikeshare database

%sql sqlite:///bikeshare.db
'Connected: @bikeshare.db'

9.3.1. Verify access to the dataset

Let’s verify that you have access to the dataset by running a simple SQL query.

The code snippet contains a few lines: * The first line of that code block is just a magic invocation that lets Jupyter know that this cell contains SQL and not Python. * The second line introduces SQL syntax for the first time. To help you understand the SQL commands we are using, the SQL syntax words are listed in CAPITAL letters, the lowercase words are the names of tables or columns. The SQL statement translates to: grab (SELECT) all the values (*) in the table called trip_data but only show me the first ten (LIMIT 10).

 * sqlite:///bikeshare.db
index duration start_date end_date start_station end_station bike_number member_type
0 3548 2011-01-01 00:01:29.000000 2011-01-01 01:00:37.000000 31620 31620 W00247 Member
1 346 2011-01-01 00:02:46.000000 2011-01-01 00:08:32.000000 31105 31101 W00675 Casual
2 562 2011-01-01 00:06:13.000000 2011-01-01 00:15:36.000000 31400 31104 W00357 Member
3 434 2011-01-01 00:09:21.000000 2011-01-01 00:16:36.000000 31111 31503 W00970 Member
4 233 2011-01-01 00:28:26.000000 2011-01-01 00:32:19.000000 31104 31106 W00346 Casual
5 158 2011-01-01 00:32:33.000000 2011-01-01 00:35:11.000000 31605 31618 W01033 Member
6 560 2011-01-01 00:35:48.000000 2011-01-01 00:45:09.000000 31203 31201 W00766 Member
7 503 2011-01-01 00:36:42.000000 2011-01-01 00:45:05.000000 31203 31201 W00506 Member
8 449 2011-01-01 00:45:55.000000 2011-01-01 00:53:24.000000 31201 31202 W00506 Member
9 442 2011-01-01 00:46:06.000000 2011-01-01 00:53:28.000000 31201 31202 W00766 Member

The trip_data table is composed of several columns:

index BIGINT,
duration BIGINT,
start_date DATETIME,
end_date DATETIME,
start_station BIGINT,
end_station BIGINT,
bike_number TEXT,
member_type TEXT

We don’t always want to read all the columns in a table. For example, if we just want the subscriber type, start time, and duration in minutes columns we could select:


  member_type, start_date, duration
 * sqlite:///bikeshare.db
member_type start_date duration
Member 2011-01-01 00:01:29.000000 3548
Casual 2011-01-01 00:02:46.000000 346
Member 2011-01-01 00:06:13.000000 562
Member 2011-01-01 00:09:21.000000 434
Casual 2011-01-01 00:28:26.000000 233
Member 2011-01-01 00:32:33.000000 158
Member 2011-01-01 00:35:48.000000 560
Member 2011-01-01 00:36:42.000000 503
Member 2011-01-01 00:45:55.000000 449
Member 2011-01-01 00:46:06.000000 442

Tips: SQL doesn’t care about line breaks so we can spread a SQL query over multiple lines just to make it easier to read.

Next Section - 9.4. Filtering