Joining

It is frequently the case that the data we need is spread across multiple tables in a database. For example, along with the trip_data table you have already seen with data on each trip, there might be a table storing additional information the locations of the start and end stations, called bikeshare_stations.

Field Name

Description

station_id

Unique identifier of the station

name

Public name of the station

status

Status of the station (either ‘open’ or ‘closed’)

latitude

latitude of the station

longitude

longitude of the station

In Sheets, when you need to join data across multiple spreadsheets, you use VLOOKUP. In SQL, you use the JOIN keyword. JOIN is used to join multiple tables on a common column. You specify the common column on which you wish to join using the ON keyword.

Suppose you want to find the number of bikeshare trips that originated at the Van Ness Metro / UDC station. You can join the start_station field from the trip_data table to the station_id field in the bikeshare_stations table.

There is a lot of new material to unpack in this query.

Q-3: What question is the following query answering?

SELECT
  trips.start_station,
  AVG(duration) AS mean_duration
FROM
  trips_data AS trips
INNER JOIN
  bikeshare_stations AS stations
ON
  trips.start_station = stations.station_id
WHERE
  stations.status = 'closed'

How many trips were started at Adams Mill & Columbia Rd NW?

Write a query to display the mean duration of trip for each start station name. For example, one row could read as White House Station | 12345. sort the result by the average duration in ascending order.

What is the name of the station has the most trips by riders with member type as casual?

Write a query to compare, in terms of number of trips and mean duration, stations that are listed as open and closed.

Extension: Subqueries

It is sometimes the case that you can’t accomplish everything you want to within one query. In these cases, in the same way that you can wrap one Sheets function within another, you can wrap one query within another. The inner query is often referred to as a subquery.

One use case for a subquery is within the WHERE clause. This can be used instead of joining, and can often be more efficient. For example, the following query calculates the mean duration of trips that start at an open station.

This could equivalently be done by joining the trips table to the stations table on trips.start_station = stations.station_id, then filtering on WHERE stations.status = 'open'. However, this involves joining every row of both tables. While using the subquery means having two distinct SELECT steps, each step involves less data, since each WHERE clause filters each table down to much fewer rows.

Another common use case for subqueries involves aggregating functions. For example, suppose you want to calculate the percentage of trips that start at each station. While you can calculate the number of trips that start at each start station in a single query, and the the total number of trips in a single query, it is complex to calculate both in a single SELECT statement. However, a subquery can be used to simply and logically accomplish this.

The above query contains three SELECT statements. The “inner query” counts the number of trips for each start station. There is also a one-line subquery that sums the total number of trips. The “outer query” combines the two subqueries to calculate the percentage of total trips that start at each start station.

You have attempted of activities on this page