9.7. Joining

It is frequently the case that the data we need is spread across multiple tables in our database. For example, we might want to store additional information about the starting and ending location of the ride beside their IDs in a table called bikeshare_stations.

Here’s the columns in table bikeshare_stations

Field name Type Description
station_id INTEGER Unique identifier of a station.
name STRING Public name of the station.
status STRING Status of the station, either open or closed
latitude FLOAT latitude of the station
longitude FLOAT longitude of the station
%%sql

SELECT
  *
FROM
  bikeshare_stations
LIMIT
  10
 * sqlite:///bikeshare.db
Done.
index station_id name status latitude longitude
0 31620 5th & F St NW open 38.8976372907417 -77.0181261878149
1 31105 14th & Harvard St NW open 38.9266377729228 -77.0321262099169
2 31400 Georgia & New Hampshire Ave NW closed 38.9356379239833 -77.0241261968097
3 31111 10th & U St NW open 38.9176376162918 -77.0251261993328
4 31104 Adams Mill & Columbia Rd NW open 38.9226377090252 -77.0421262266777
5 31605 3rd & D St SE open 38.8856370929868 -77.0021126160281
6 31203 14th & Rhode Island Ave NW open 38.9086374822707 -77.0311262091468
8 31201 15th & P St NW open 38.909637497344 -77.0341262134231
10 31300 Van Ness Metro / UDC open 38.9438591003638 -77.0633468627126
12 31007 Crystal City Metro / 18th & Bell St open 38.8903694152637 -77.0319595336126

This means that we now have the data to answer questions like “How many bike trips originated from bike station that’s at Van Ness Metro / UDC?” but the data are spread across two tables?

We could imagine storing the name column in our trip_data table since we list the start and end stations IDs for each trip but there are a few important reasons why that’s a bad idea:

  1. We would waste space by duplicating data (not a big deal for this example but a real concern for large systems)
  2. Updating data (for example status of station from active to closed) would require updating each row in trip_data that refers to that station ID. This is time-consuming and error-prone.

Instead we leave the data in two separate tables and need a way to ‘join’ the values together. We can do that by just listing multiple table names but the result is a mess:

%%sql

SELECT
  *
FROM
  trip_data, bikeshare_stations

LIMIT
  10
 * sqlite:///bikeshare.db
Done.
index duration start_date end_date start_station end_station bike_number member_type index_1 station_id name status latitude longitude
0 3548 2011-01-01 00:01:29.000000 2011-01-01 01:00:37.000000 31620 31620 W00247 Member 0 31620 5th & F St NW open 38.8976372907417 -77.0181261878149
0 3548 2011-01-01 00:01:29.000000 2011-01-01 01:00:37.000000 31620 31620 W00247 Member 1 31105 14th & Harvard St NW open 38.9266377729228 -77.0321262099169
0 3548 2011-01-01 00:01:29.000000 2011-01-01 01:00:37.000000 31620 31620 W00247 Member 2 31400 Georgia & New Hampshire Ave NW closed 38.9356379239833 -77.0241261968097
0 3548 2011-01-01 00:01:29.000000 2011-01-01 01:00:37.000000 31620 31620 W00247 Member 3 31111 10th & U St NW open 38.9176376162918 -77.0251261993328
0 3548 2011-01-01 00:01:29.000000 2011-01-01 01:00:37.000000 31620 31620 W00247 Member 4 31104 Adams Mill & Columbia Rd NW open 38.9226377090252 -77.0421262266777
0 3548 2011-01-01 00:01:29.000000 2011-01-01 01:00:37.000000 31620 31620 W00247 Member 5 31605 3rd & D St SE open 38.8856370929868 -77.0021126160281
0 3548 2011-01-01 00:01:29.000000 2011-01-01 01:00:37.000000 31620 31620 W00247 Member 6 31203 14th & Rhode Island Ave NW open 38.9086374822707 -77.0311262091468
0 3548 2011-01-01 00:01:29.000000 2011-01-01 01:00:37.000000 31620 31620 W00247 Member 8 31201 15th & P St NW open 38.909637497344 -77.0341262134231
0 3548 2011-01-01 00:01:29.000000 2011-01-01 01:00:37.000000 31620 31620 W00247 Member 10 31300 Van Ness Metro / UDC open 38.9438591003638 -77.0633468627126
0 3548 2011-01-01 00:01:29.000000 2011-01-01 01:00:37.000000 31620 31620 W00247 Member 12 31007 Crystal City Metro / 18th & Bell St open 38.8903694152637 -77.0319595336126

If you look carefully you might notice that the rows are identical for the first few columns and then start to differ after duration. That’s because SQL joins each row in the first table with each row in the second table. With 144 rows in bikeshare_stations and 1,226,767 rows in trips_data, we end up with a table of 176,654,448 rows.

This rarely if ever is what we want. In most cases, we want to match up some aspect of the rows in the first table with some aspect of the rows in the second table. In most cases, we want to match up based on some column being equal.

In our bike sharing example, the station_id column of bikeshare_stations matches up with the start_station or end_station column of trip_data. To force this match, we filter out the ones that don’t have the same value for both of these columns:

%%sql

SELECT
  *
FROM
  trip_data, bikeshare_stations
WHERE
  start_station = station_id
LIMIT
  10
 * sqlite:///bikeshare.db
Done.
index duration start_date end_date start_station end_station bike_number member_type index_1 station_id name status latitude longitude
0 3548 2011-01-01 00:01:29.000000 2011-01-01 01:00:37.000000 31620 31620 W00247 Member 0 31620 5th & F St NW open 38.8976372907417 -77.0181261878149
1 346 2011-01-01 00:02:46.000000 2011-01-01 00:08:32.000000 31105 31101 W00675 Casual 1 31105 14th & Harvard St NW open 38.9266377729228 -77.0321262099169
2 562 2011-01-01 00:06:13.000000 2011-01-01 00:15:36.000000 31400 31104 W00357 Member 2 31400 Georgia & New Hampshire Ave NW closed 38.9356379239833 -77.0241261968097
3 434 2011-01-01 00:09:21.000000 2011-01-01 00:16:36.000000 31111 31503 W00970 Member 3 31111 10th & U St NW open 38.9176376162918 -77.0251261993328
4 233 2011-01-01 00:28:26.000000 2011-01-01 00:32:19.000000 31104 31106 W00346 Casual 4 31104 Adams Mill & Columbia Rd NW open 38.9226377090252 -77.0421262266777
5 158 2011-01-01 00:32:33.000000 2011-01-01 00:35:11.000000 31605 31618 W01033 Member 5 31605 3rd & D St SE open 38.8856370929868 -77.0021126160281
6 560 2011-01-01 00:35:48.000000 2011-01-01 00:45:09.000000 31203 31201 W00766 Member 6 31203 14th & Rhode Island Ave NW open 38.9086374822707 -77.0311262091468
7 503 2011-01-01 00:36:42.000000 2011-01-01 00:45:05.000000 31203 31201 W00506 Member 6 31203 14th & Rhode Island Ave NW open 38.9086374822707 -77.0311262091468
8 449 2011-01-01 00:45:55.000000 2011-01-01 00:53:24.000000 31201 31202 W00506 Member 8 31201 15th & P St NW open 38.909637497344 -77.0341262134231
9 442 2011-01-01 00:46:06.000000 2011-01-01 00:53:28.000000 31201 31202 W00766 Member 8 31201 15th & P St NW open 38.909637497344 -77.0341262134231

Notice that the result looks more sensical: we end up with one row from trip_data and the corresponding row from bikeshare_stations (copied multiple times since there were only 144 rows in bikeshare_stations).

We can check the size of the resulting table by running:

%%sql

SELECT
  COUNT(*)
FROM
  trip_data, bikeshare_stations
WHERE
  start_station = station_id
 * sqlite:///bikeshare.db
Done.
COUNT(*)
1226767

You might also see some cases where the comma between the table names is replaced with the keyword JOIN and WHERE is replaced with ON. This is synonymous but sometimes preferred to make it clear that you are joining two tables and that your filters are there to specify how those tables are to be joined:

%%sql

SELECT
  COUNT(*)
FROM
  trip_data JOIN bikeshare_stations ON start_station = station_id
 * sqlite:///bikeshare.db
Done.
COUNT(*)
1226767

We can now use all the SQL tools that we’ve learned on this combined table. For example, to find out which open bike station which has the highest bike trip counts so we can ensure there is always plenty of bikes available, we can run:

%%sql

SELECT
  station_id, COUNT(*) AS trip_count
FROM
  trip_data join bikeshare_stations
ON
  start_station = station_id
WHERE
  duration >= 3600
  AND status = 'open'
GROUP BY
  station_id
ORDER BY
  trip_count DESC
LIMIT
  10
 * sqlite:///bikeshare.db
Done.
station_id trip_count
31217 4114
31219 2677
31235 2669
31200 2572
31225 2093
31222 1969
31215 1786
31228 1572
31218 1488
31232 1378

9.7.1. Practice Exercises

Q-1: Use JOIN to show the station IDs of active stations and what’s the average duration of bike trip originating and ending at the same station with member type Member. For station 31000 what is the average duration from above?

Q-2: What is the name of the station where the most rides start?

Q-3: What is the name of the station where the most rides end?

Q-4: What is the name of the station where most rides both start and end?

Q-5: What is the name of the most popular ending station for rides that begin at Massachusetts Ave & Dupont Circle NW?

Lesson Feedback

    During this lesson I was primarily in my...
  • Comfort Zone
  • Learning Zone
  • Panic Zone
    Completing this lesson took...
  • Very little time
  • A reasonable amount of time
  • More time than is reasonable
    Based on my own interests and needs, the things taught in this lesson...
  • Don't seem worth learning
  • May be worth learning
  • Are definitely worth learning
    For me to master the things taught in this lesson feels...
  • Definitely within reach
  • Within reach if I try my hardest
  • Out of reach no matter how hard I try
Next Section - 9.8. Getting SQL Data into a DataFrame