18.4. Joining¶
It is frequently the case that the data we need is spread across multiple tables in a database.
Imagine that we write a query on the trip_data
table to find
the station with the longest overall ride:
That query gives us the station id (31104), but what if we want its name?
The name of each station is in the bikeshare_stations table. To get the name
of station 31104, we could query it (notice the FROM
clause):
We need the name that is a part of that table, but to find the right station, we
need to use the duration that is a part of the other trip_data
table. Rather
than do the first query and then write the second based on the answer we get, we would
like to write one query that works on both tables. In SQL, when you need to join data
across multiple spreadsheets, you use the JOIN
keyword.
JOIN
is used to merge the columns in one table with the columns in another table.
However, when we want to join records from two tables, we don’t want to join them randomly.
We want to make sure that records that talk about station 31104 in trip_data
are
joined with the records that talk about that same station number in bikeshare_stations
.
To specify how to JOIN
tables, we must use an ON
clause that specifies which column
in the first table to match which which column in the second table. To name the columns, we
use dot syntax: table_name.column_name
. The full syntax for a JOIN
looks like:
SELECT
* -- or whatever
FROM first_table
JOIN second_table
ON
-- look for records where the value in column_A in the first_table
-- match the value in column_B in the second table
first_table.column_A == second_table.column_B
-- WHERE or ORDER BY if we want
So to merge the two queries we did earlier, we will have trip_data
as the first table
and bikeshare_stations
as the second. In trip_data
, the station number is stored as
start_station
. In bikeshare_stations
those same values are stored in the station_id
column.
Try the query - notice that it has all the columns from both tables. Which is really
more than we want. We just want the the station name. So change the *
to name
so that we just get that column. If we want to be really clear about what column we
are talking about (or if both tables had a column called name
so we had to be
more specific), we could write it as bikeshare_stations.name
because it is the
name
column from bikeshare_stations
.
As another example, suppose you want to find the number of bikeshare trips that end at the
Van Ness Metro / UDC station. You can join the end_station
field from
the trip_data
table to the station_id
field in the bikeshare_stations
table.
Again, in the WHERE clause, we are being very specific. Because there is only
one name
column in the two tables, we could just say name
. But it never hurts
to be clear about which table we are talking about once we start joining data.