It is frequently the case that the data we need is spread across
multiple tables in our database. For example, we might want to store
additional information about the starting and ending location of the
ride beside their IDs in a table called
Here are the columns in table
Unique identifier of a station
Public name of the station
Status of the station, either open or closed
latitude of the station
longitude of the station
This means that we now have the data to answer questions like “How many bike trips originated from bike station that’s at Van Ness Metro / UDC?” but the data are spread across two tables?
We could imagine storing the
name column in our
since we list the start and end stations IDs for each trip but there are
a few important reasons why that’s a bad idea:
We would waste space by duplicating data (This is not a big deal for this example but a real concern for large systems)
Updating data (for example status of station from active to closed) would require updating each row in
trip_datathat refers to that station ID. This is time-consuming and error-prone.
Instead, we leave the data in two separate tables and need a way to ‘join’ the values together. We can do that by just listing multiple table names but the result is a mess:
If you look carefully you might notice that the rows are identical for
the first few columns and then start to differ after
That’s because SQL joins each row in the first table with each row in
the second table. With 144 rows in
bikeshare_stations and 1,226,767
trips_data, we end up with a table of 176,654,448 rows.
This rarely if ever is what we want. In most cases, we want to match up some aspect of the rows in the first table with some aspect of the rows in the second table. In most cases, we want to match up based on some column being equal.
In our bike sharing example, the
station_id column of
bikeshare_stations matches up with the
end_station column of
trip_data. To force this match, we filter
out the ones that don’t have the same value for both of these columns:
Notice that the result looks more sensical: we end up with one row from
trip_data and the corresponding row from
(copied multiple times since there were only 144 rows in
We can check the size of the resulting table by running:
You might also see some cases where the comma between the table names is
replaced with the keyword
WHERE is replaced with
ON. This is synonymous but sometimes preferred to make it clear that
you are joining two tables and that your filters are there to specify
how those tables are to be joined:
We can now use all the SQL tools that we’ve learned on this combined table. For example, to find out which open bike station which has the highest bike trip counts so we can ensure there is always plenty of bikes available, we can run:
9.8.1. Practice Exercises¶
During this lesson I was primarily in my...
- 1. Comfort Zone
- 2. Learning Zone
- 3. Panic Zone
Completing this lesson took...
- 1. Very little time
- 2. A reasonable amount of time
- 3. More time than is reasonable
Based on my own interests and needs, the things taught in this lesson...
- 1. Don't seem worth learning
- 2. May be worth learning
- 3. Are definitely worth learning
For me to master the things taught in this lesson feels...
- 1. Definitely within reach
- 2. Within reach if I try my hardest
- 3. Out of reach no matter how hard I try