It is frequently the case that the data we need is spread across multiple tables
in a database. For example, along with the
trip_data table you have already
seen with data on each trip, there might be a table storing additional
information the locations of the start and end stations, called
Unique identifier of the station
Public name of the station
Status of the station (either ‘open’ or ‘closed’)
latitude of the station
longitude of the station
In Sheets, when you need to join data across multiple spreadsheets, you use
VLOOKUP. In SQL, you use the
JOIN is used to join
multiple tables on a common column. You specify the common column on which you
wish to join using the
Suppose you want to find the number of bikeshare trips that originated at the
Van Ness Metro / UDC station. You can join the
start_station field from
trip_data table to the
station_id field in the
There is a lot of new material to unpack in this query.
Both tables are aliased using the
trip_datatable is named
bikeshare_stationstable is named
stations.) Whenever you use
JOIN, it is good practice to alias all tables used in the join. That way, columns can always be according to which table they appear in.
INNER JOINreturns only rows for which the join key appears in both tables. For example, if a row in the
tripstable has station 12345 as the
start_stationbut it does not appear as a
stationstable, this row will not be returned. The converse is also true. You can read more about other types of joins here.
ONclause uses the table names to reference the columns. SQL looks for rows in the
stationstable where the
station_idmatches each of the
start_stationvalues in the
WHEREclause limits the query to count rows where the start station is Van Ness Metro / UDC. Note that the table alias
tripsis used here again to reference the column.
It is sometimes the case that you can’t accomplish everything you want to within one query. In these cases, in the same way that you can wrap one Sheets function within another, you can wrap one query within another. The inner query is often referred to as a subquery.
One use case for a subquery is within the
WHERE clause. This can be used
instead of joining, and can often be more efficient. For example, the following
query calculates the mean duration of trips that start at an open station.
This could equivalently be done by joining the
trips table to the
stations table on
trips.start_station = stations.station_id, then
WHERE stations.status = 'open'. However, this involves joining
every row of both tables. While using the subquery means having two distinct
SELECT steps, each step involves less data, since each
filters each table down to much fewer rows.
Another common use case for subqueries involves aggregating functions. For
example, suppose you want to calculate the percentage of trips that start at
each station. While you can calculate the number of trips that start at each
start station in a single query, and the the total number of trips in a single
query, it is complex to calculate both in a single
However, a subquery can be used to simply and logically accomplish this.
The above query contains three
SELECT statements. The “inner query” counts
the number of trips for each start station. There is also a one-line subquery
that sums the total number of trips. The “outer query” combines the two
subqueries to calculate the percentage of total trips that start at each start