9.8. Joining¶
It is frequently the case that the data we need is spread across
multiple tables in our database. For example, we might want to store
additional information about the starting and ending location of the
ride beside their IDs in a table called bikeshare_stations
.
Here are the columns in table bikeshare_stations
Field name |
Type |
Description |
---|---|---|
station_id |
INTEGER |
Unique identifier of a station |
name |
STRING |
Public name of the station |
status |
STRING |
Status of the station, either open or closed |
latitude |
FLOAT |
latitude of the station |
longitude |
FLOAT |
longitude of the station |
This means that we now have the data to answer questions like “How many bike trips originated from bike station that’s at Van Ness Metro / UDC?” but the data are spread across two tables?
We could imagine storing the name
column in our trip_data
table
since we list the start and end stations IDs for each trip but there are
a few important reasons why that’s a bad idea:
We would waste space by duplicating data (This is not a big deal for this example but a real concern for large systems)
Updating data (for example status of station from active to closed) would require updating each row in
trip_data
that refers to that station ID. This is time-consuming and error-prone.
Instead, we leave the data in two separate tables and need a way to ‘join’ the values together. We can do that by just listing multiple table names but the result is a mess:
If you look carefully you might notice that the rows are identical for
the first few columns and then start to differ after duration
.
That’s because SQL joins each row in the first table with each row in
the second table. With 144 rows in bikeshare_stations
and 1,226,767
rows in trips_data
, we end up with a table of 176,654,448 rows.
This rarely if ever is what we want. In most cases, we want to match up some aspect of the rows in the first table with some aspect of the rows in the second table. In most cases, we want to match up based on some column being equal.
In our bike sharing example, the station_id
column of
bikeshare_stations
matches up with the start_station
or
end_station
column of trip_data
. To force this match, we filter
out the ones that don’t have the same value for both of these columns:
Notice that the result looks more sensical: we end up with one row from
trip_data
and the corresponding row from bikeshare_stations
(copied multiple times since there were only 144 rows in
bikeshare_stations
).
We can check the size of the resulting table by running:
You might also see some cases where the comma between the table names is
replaced with the keyword JOIN
and WHERE
is replaced with
ON
. This is synonymous but sometimes preferred to make it clear that
you are joining two tables and that your filters are there to specify
how those tables are to be joined:
We can now use all the SQL tools that we’ve learned on this combined table. For example, to find out which open bike station which has the highest bike trip counts so we can ensure there is always plenty of bikes available, we can run:
9.8.1. Practice Exercises¶
Lesson Feedback
-
During this lesson I was primarily in my...
- 1. Comfort Zone
- 2. Learning Zone
- 3. Panic Zone
-
Completing this lesson took...
- 1. Very little time
- 2. A reasonable amount of time
- 3. More time than is reasonable
-
Based on my own interests and needs, the things taught in this lesson...
- 1. Don't seem worth learning
- 2. May be worth learning
- 3. Are definitely worth learning
-
For me to master the things taught in this lesson feels...
- 1. Definitely within reach
- 2. Within reach if I try my hardest
- 3. Out of reach no matter how hard I try