11.6. Joining¶

It is frequently the case that the data we need is spread across multiple tables in our database. For example, we might want to store additional information about the starting and ending location of the ride beside their IDs in a table called bikeshare_stations.

Here’s the columns in table bikeshare_stations

Field name	Type	Description
station_id	INTEGER	Unique identifier of a station
name	STRING	Public name of the station
status	STRING	Status of the station, either open or closed
latitude	FLOAT	latitude of the station
longitude	FLOAT	longitude of the station

This means that we now have the data to answer questions like “How many bike trips originated from bike station that’s at Van Ness Metro / UDC?” but the data are spread across two tables?

We could imagine storing the name column in our trip_data table since we list the start and end stations IDs for each trip but there are a few important reasons why that’s a bad idea:

We would waste space by duplicating data (This is not a big deal for this example but a real concern for large systems)
Updating data (for example status of station from active to closed) would require updating each row in trip_data that refers to that station ID. This is time-consuming and error-prone.

Instead we leave the data in two separate tables and need a way to ‘join’ the values together. We can do that by just listing multiple table names but the result is a mess:

If you look carefully you might notice that the rows are identical for the first few columns and then start to differ after duration. That’s because SQL joins each row in the first table with each row in the second table. With 144 rows in bikeshare_stations and 1,226,767 rows in trips_data, we end up with a table of 176,654,448 rows.

This rarely if ever is what we want. In most cases, we want to match up some aspect of the rows in the first table with some aspect of the rows in the second table. In most cases, we want to match up based on some column being equal.

In our bike sharing example, the station_id column of bikeshare_stations matches up with the start_station or end_station column of trip_data. To force this match, we filter out the ones that don’t have the same value for both of these columns:

Notice that the result looks more sensical: we end up with one row from trip_data and the corresponding row from bikeshare_stations (copied multiple times since there were only 144 rows in bikeshare_stations).

We can check the size of the resulting table by running:

You might also see some cases where the comma between the table names is replaced with the keyword JOIN and WHERE is replaced with ON. This is synonymous but sometimes preferred to make it clear that you are joining two tables and that your filters are there to specify how those tables are to be joined:

We can now use all the SQL tools that we’ve learned on this combined table. For example, to find out which open bike station which has the highest bike trip counts so we can ensure there is always plenty of bikes available, we can run:

11.6.1. Practice Exercises¶

Use JOIN to show the station IDs of active stations and what’s the average duration of bike trip originating and ending at the same station with member type Member.

What is the name of the station where the most rides start?

What is the name of the station where the most rides end?

What is the name of the station where most rides both start and end?

What is the name of the most popular ending station for rides that begin at Massachusetts Ave & Dupont Circle NW (Station: 31200)?

You have attempted of activities on this page