# Aggregating¶

Sheets provides aggregating/grouping functionality via pivot tables, which you
have seen in this section. Aggregating data means to
collect multiple units into one. Again, SQL provides the same
functionality as Sheets through keywords. In this case, the `GROUP BY`

statement allows you to group and aggregate data. For example, you can use
`GROUP BY`

to count how many bike trips were taken for each subscriber type,
or to calculate the mean duration for trips starting from each station.

The following query counts the number of trips (`COUNT(*)`

) for each member
type (`GROUP BY member_type`

).

The result of this query has one rows for each distinct `member_type`

, and the
column named `n_trips`

counts the number of rows for each type.

Whenever you use `GROUP BY`

, SQL expects you to use an aggregating function.
This is equivalent to which function is used to summarize data in a Sheets pivot
table. The most common examples are listed below.

`COUNT`

counts the rows`SUM`

adds the values of a numeric column`MIN`

calculates the minimum of a numeric column`MAX`

calculates the maximum of a numeric column`AVG`

calculates the mean of a numeric column`MEDIAN`

calculates the median of a numeric column (only available in some SQL servers)

For example, the query below calculates the minimum and maximum trip duration for each start station.

You must include the `GROUP BY`

column(s) in your `SELECT`

clause. All
columns that are not in the `GROUP BY`

clause must contain an aggreating
function.

If you want to calculate a summary statistic for the entire population, you can
use an aggregating function without the `GROUP BY`

clause. For example, the
query below calculates the mean duration of all trips.

You can of course use filtering and ordering with aggregating functions. The query below calculates the mean trip duration just for trips with a casual member type.

You can then combine this with the `GROUP BY`

clause as you like. Filtering
with the `GROUP BY`

clause can serve multiple purposes. Either way, you must
put the `WHERE`

clause before the `GROUP BY`

clause.

If you filter by a column that is not grouped by, you can filter the aggregated results to a subset of the data. For example, you can calculate the sum of all trip durations, by start station, for a specific end station.

If you filter by a column that is being grouped by, you can filter the aggregated results to just show the row you are interested in. For example, you can display the total trip duration for each pair of start station and end station, just for trips that end at station 31111.

It can also be useful to combine aggregating functions with the `ORDER BY`

clause. This can allow you to sort by an aggregated column to find the minimum
or maximum.

Note that you can also use the column alias in the `ORDER BY`

clause. The
above query would have been identical (in almost all SQL servers) if it had been
`ORDER BY mean_duration DESC`

.

## Extension: Counting Unique Values¶

It is often helpful to not only count the number of rows, but count the number
of unique values of a column. You can do this using the `DISTINCT`

keyword.

To count the distinct values of a column, you can simply use `COUNT`

along
with `DISTINCT`

. For example, the query below counts the number of bike
numbers used.

This can also be used alongside to count the unique values for each group. For example, the following query counts the unique bike numbers used for each start station.

It is sometimes also useful to select the distinct rows of a table. You can do
that by using `DISTINCT`

without an aggregating function.

Note that selecting distinct values is equivalent to selecting grouped values.

## Extension: The HAVING Clause¶

You can never include an aggregating function within the `WHERE`

clause. For
example, suppose you are trying to show all start stations and the number of
trips, but only for start stations with over 100 trips. The following query
produces an error, since the `WHERE`

condition includes the `COUNT`

function.

The `HAVING`

clause can be used instead here. The query below executes
successfully.

As with the `ORDER BY`

clause, `HAVING`

is an operation on the results.
While `WHERE`

filters the data being queried, `HAVING`

filters the results
based on the value of an aggregating function. The `HAVING`

clause can only be
used immediately following the `GROUP BY`

clause.

## Extension: Numbered Column Aliases¶

This section covers no new concepts, but introduces a convenient shorthand
notation. When using `GROUP BY`

and `ORDER BY`

, you can often (this is
supported in almost all SQL servers, but not all) reference columns using the
number in which they are selected. For example, the following two queries are
identical, listing member types by descending number of trips.

The first query uses the column names in the `GROUP BY`

and `ORDER BY`

.

```
SELECT
member_type,
COUNT(*) AS n_trips
FROM
trip_data
GROUP BY
member_type
ORDER BY
n_trips DESC
```

The second query uses the column number in the `GROUP BY`

and `ORDER BY`

.

```
SELECT
member_type,
COUNT(*) AS n_trips
FROM
trip_data
GROUP BY
1
ORDER BY
2 DESC
```

Note that `1`

refers to the first column being selected, `member_type`

, and
`2`

refers to the second column being selected, `n_trips`

.

You can always use the column names in these clauses, but the numbered aliasing
can be extremely useful, especially if you have several columns. Note that you
can use this notation only with the `GROUP BY`

and `ORDER BY`

clauses. Using
such notation anywhere else will produce an error.