Aggregating

Sheets provides aggregating/grouping functionality via pivot tables, which you have seen in this section. Aggregating data means to collect multiple units into one. Again, SQL provides the same functionality as Sheets through keywords. In this case, the GROUP BY statement allows you to group and aggregate data. For example, you can use GROUP BY to count how many bike trips were taken for each subscriber type, or to calculate the mean duration for trips starting from each station.

The following query counts the number of trips (COUNT(*)) for each member type (GROUP BY member_type).

The result of this query has one rows for each distinct member_type, and the column named n_trips counts the number of rows for each type.

Whenever you use GROUP BY, SQL expects you to use an aggregating function. This is equivalent to which function is used to summarize data in a Sheets pivot table. The most common examples are listed below.

For example, the query below calculates the minimum and maximum trip duration for each start station.

Write a query to find start station with the earliest start date.

You must include the GROUP BY column(s) in your SELECT clause. All columns that are not in the GROUP BY clause must contain an aggreating function.

If you want to calculate a summary statistic for the entire population, you can use an aggregating function without the GROUP BY clause. For example, the query below calculates the mean duration of all trips.

You can of course use filtering and ordering with aggregating functions. The query below calculates the mean trip duration just for trips with a casual member type.

What is the total duration of all trips taken on bike W01274?

You can then combine this with the GROUP BY clause as you like. Filtering with the GROUP BY clause can serve multiple purposes. Either way, you must put the WHERE clause before the GROUP BY clause.

If you filter by a column that is not grouped by, you can filter the aggregated results to a subset of the data. For example, you can calculate the sum of all trip durations, by start station, for a specific end station.

If you filter by a column that is being grouped by, you can filter the aggregated results to just show the row you are interested in. For example, you can display the total trip duration for each pair of start station and end station, just for trips that end at station 31111.

It can also be useful to combine aggregating functions with the ORDER BY clause. This can allow you to sort by an aggregated column to find the minimum or maximum.

Q-10: Explain what the following query will return.

SELECT
  start_station,
  AVG(duration) AS mean_duration
FROM
  trip_data
WHERE
  duration >= 3600
GROUP BY
  start_station
ORDER BY
  AVG(duration) DESC

Note that you can also use the column alias in the ORDER BY clause. The above query would have been identical (in almost all SQL servers) if it had been ORDER BY mean_duration DESC.

what is the bike number and trip count of the bike number with the most trips?

For the station with the most trips that started and ended at the same station, find the mean duration (in seconds) of all trips, rounded to the nearest whole number.

Extension: Counting Unique Values

It is often helpful to not only count the number of rows, but count the number of unique values of a column. You can do this using the DISTINCT keyword.

To count the distinct values of a column, you can simply use COUNT along with DISTINCT. For example, the query below counts the number of bike numbers used.

This can also be used alongside to count the unique values for each group. For example, the following query counts the unique bike numbers used for each start station.

How many unique start stations are in the table?

Write a query to display the number of unique end stations per each start station.

It is sometimes also useful to select the distinct rows of a table. You can do that by using DISTINCT without an aggregating function.

Note that selecting distinct values is equivalent to selecting grouped values.

Extension: The HAVING Clause

You can never include an aggregating function within the WHERE clause. For example, suppose you are trying to show all start stations and the number of trips, but only for start stations with over 100 trips. The following query produces an error, since the WHERE condition includes the COUNT function.

The HAVING clause can be used instead here. The query below executes successfully.

As with the ORDER BY clause, HAVING is an operation on the results. While WHERE filters the data being queried, HAVING filters the results based on the value of an aggregating function. The HAVING clause can only be used immediately following the GROUP BY clause.

Write a query to show the bikes that have been used for over 1 hour (3600 seconds), and the total duration they have been ridden. Order the result from shortest total duration to longest

Extension: Numbered Column Aliases

This section covers no new concepts, but introduces a convenient shorthand notation. When using GROUP BY and ORDER BY, you can often (this is supported in almost all SQL servers, but not all) reference columns using the number in which they are selected. For example, the following two queries are identical, listing member types by descending number of trips.

The first query uses the column names in the GROUP BY and ORDER BY.

SELECT
  member_type,
  COUNT(*) AS n_trips
FROM
  trip_data
GROUP BY
  member_type
ORDER BY
  n_trips DESC

The second query uses the column number in the GROUP BY and ORDER BY.

SELECT
  member_type,
  COUNT(*) AS n_trips
FROM
  trip_data
GROUP BY
  1
ORDER BY
  2 DESC

Note that 1 refers to the first column being selected, member_type, and 2 refers to the second column being selected, n_trips.

You can always use the column names in these clauses, but the numbered aliasing can be extremely useful, especially if you have several columns. Note that you can use this notation only with the GROUP BY and ORDER BY clauses. Using such notation anywhere else will produce an error.

Q-22: Explain what question the following query is answering.

SELECT
  start_station,
  end_station,
  AVG(duration) AS mean_duration
FROM
  trip_data
GROUP BY
  1, 2
ORDER BY
  3 DESC
You have attempted of activities on this page