9.7. Joining

It is frequently the case that the data we need is spread across multiple tables in our database. For example, we might want to store additional information about the starting and ending location of the ride beside their IDs in a table called bikeshare_stations.

Here’s the columns in table bikeshare_stations

Field name

Type

Description

station_id

INTEGER

Unique identifier of a station.

name

STRING

Public name of the station.

status

STRING

Status of the station, either open or closed

latitude

FLOAT

latitude of the station

longitude

FLOAT

longitude of the station

This means that we now have the data to answer questions like “How many bike trips originated from bike station that’s at Van Ness Metro / UDC?” but the data are spread across two tables?

We could imagine storing the name column in our trip_data table since we list the start and end stations IDs for each trip but there are a few important reasons why that’s a bad idea:

  1. We would waste space by duplicating data (not a big deal for this example but a real concern for large systems)

  2. Updating data (for example status of station from active to closed) would require updating each row in trip_data that refers to that station ID. This is time-consuming and error-prone.

Instead we leave the data in two separate tables and need a way to ‘join’ the values together. We can do that by just listing multiple table names but the result is a mess:

If you look carefully you might notice that the rows are identical for the first few columns and then start to differ after duration. That’s because SQL joins each row in the first table with each row in the second table. With 144 rows in bikeshare_stations and 1,226,767 rows in trips_data, we end up with a table of 176,654,448 rows.

This rarely if ever is what we want. In most cases, we want to match up some aspect of the rows in the first table with some aspect of the rows in the second table. In most cases, we want to match up based on some column being equal.

In our bike sharing example, the station_id column of bikeshare_stations matches up with the start_station or end_station column of trip_data. To force this match, we filter out the ones that don’t have the same value for both of these columns:

Notice that the result looks more sensical: we end up with one row from trip_data and the corresponding row from bikeshare_stations (copied multiple times since there were only 144 rows in bikeshare_stations).

We can check the size of the resulting table by running:

You might also see some cases where the comma between the table names is replaced with the keyword JOIN and WHERE is replaced with ON. This is synonymous but sometimes preferred to make it clear that you are joining two tables and that your filters are there to specify how those tables are to be joined:

We can now use all the SQL tools that we’ve learned on this combined table. For example, to find out which open bike station which has the highest bike trip counts so we can ensure there is always plenty of bikes available, we can run:

9.7.1. Practice Exercises

Use JOIN to show the station IDs of active stations and what’s the average duration of bike trip originating and ending at the same station with member type Member. For station 31000 what is the average duration from above?

What is the name of the station where the most rides start?

What is the name of the station where the most rides end?

What is the name of the station where most rides both start and end?

What is the name of the most popular ending station for rides that begin at Massachusetts Ave & Dupont Circle NW?

Lesson Feedback

    During this lesson I was primarily in my...
  • Comfort Zone
  • Learning Zone
  • Panic Zone
    Completing this lesson took...
  • Very little time
  • A reasonable amount of time
  • More time than is reasonable
    Based on my own interests and needs, the things taught in this lesson...
  • Don't seem worth learning
  • May be worth learning
  • Are definitely worth learning
    For me to master the things taught in this lesson feels...
  • Definitely within reach
  • Within reach if I try my hardest
  • Out of reach no matter how hard I try
You have attempted of activities on this page
Next Section - 9.8. Getting SQL Data into a DataFrame