9.6. Aggregation or Group By¶

One very powerful features of SQL is that it allows us to create summary information by grouping rows together. For example, we could ask ourselves how many bike trips were taken for each subscriber type, and which subscriber type has the most bike trips?

%%sql
SELECT
member_type, COUNT(*)
FROM
trip_data
GROUP BY
member_type
ORDER BY
COUNT(*) DESC
LIMIT
10

 * sqlite:///bikeshare.db
Done.

member_type COUNT(*)
Member 979814
Casual 246949
Unknown 4

GROUP BY subscriber_type takes all the rows with a given subscriber_type and produces a single row in the result. This means that we need to tell SQL how we want to combine the other columns’ values into a single row. The above example uses COUNT(*) which reports of the number of rows that were combined.

Aggregating the values for subscriber_type is not hard, since they’re all the same, SQL just gives us a single copy of the publisher name. Other columns, we need to either ignore (causing them to be omitted from the output) or specify a way to aggregate them.

We must specify an aggregate function for any column that we SELECT in our query (except the column that we’re grouping by) in order for the command to succeed. If we don’t specify a way to aggregate the value, SQL will complain. For example, the following query should fail:

%%sql

SELECT
member_type, duration
FROM
trip_data
GROUP BY
subscriber_type
ORDER BY
COUNT(*) DESC

 * sqlite:///bikeshare.db
(sqlite3.OperationalError) no such column: subscriber_type [SQL: 'SELECTn  member_type, durationnFROMn  trip_datanGROUP BYn  subscriber_typenORDER BYn  COUNT(*) DESC'] (Background on this error at: http://sqlalche.me/e/e3q8)


Let’s go back briefly to the first query in the Aggregation section. The top result was the count of bike trips for member_type Member:

member_type COUNT(*)
Member 979814

If you’d like to get a more granular break down of the count, you may specify multiple columns to aggregate within the GROUP BY clause, for example: further breakdown the aggregate count by the start station IDs:

%%sql

SELECT
member_type, start_station, count(*)
FROM
trip_data
WHERE
member_type = 'Casual'
GROUP BY
member_type, start_station
ORDER BY
COUNT(*) DESC
LIMIT
20

 * sqlite:///bikeshare.db
Done.

member_type start_station count(*)
Casual 31200 10922
Casual 31217 10912
Casual 31235 9829
Casual 31219 8736
Casual 31225 7180
Casual 31228 6111
Casual 31222 5943
Casual 31215 5224
Casual 31201 4991
Casual 31218 4960
Casual 31237 4906
Casual 31232 4905
Casual 31623 4853
Casual 31205 4751
Casual 31613 4162
Casual 31212 4029
Casual 31238 3920
Casual 31104 3908
Casual 31203 3772
Casual 31204 3675

Great! Now that you’re familiar with how to aggregate data using SQL query by using COUNT() as your aggregation function, let’ take a look at other aggregation functions.

There are many such functions. Some common ones include:

• SUM: To add the values together
• AVG: To compute the mean of the values
• MIN or MAX: To compute the minimum and maximum respectively

So we could for example compute the total number of minutes of all bike trips for all subscriber types

%%sql

SELECT
member_type, SUM(duration)
FROM
trip_data
GROUP BY
member_type
LIMIT
10

 * sqlite:///bikeshare.db
Done.

member_type SUM(duration)
Casual 687530197
Member 759503541
Unknown 3434

9.6.1. Exercise 3¶

Compute the average duration of bike trips for each starting station id and list the results in order of highest average to lowest average.

start_station AVG(duration)
31806 40669.5
31052 4325.0
31705 3787.787878787879
31262 3563.8636363636365
31704 3550.0
31703 3134.6492146596856
31266 2906.0
31217 2431.043944420405
31016 2414.4292185730465
31235 2368.5348916450866
• What is the userid of the Member with the most rides?
• what is the userid of the Casual rider with the most rides?
• What is the station that has the most rides that start and end at the same station?

Lesson Feedback

During this lesson I was primarily in my...
• Comfort Zone
• Learning Zone
• Panic Zone
Completing this lesson took...
• Very little time
• A reasonable amount of time
• More time than is reasonable
Based on my own interests and needs, the things taught in this lesson...
• Don't seem worth learning
• May be worth learning
• Are definitely worth learning
For me to master the things taught in this lesson feels...
• Definitely within reach
• Within reach if I try my hardest
• Out of reach no matter how hard I try
Next Section - 9.7. Joining