.. Copyright (C) Google, Runestone Interactive LLC
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0
International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/4.0/.
.. _grouping_data:
Grouping Data
=============
Filtering data allows you to subset data to a set of rows based on some
conditions. This allows for analysis on that subset of data. Finding summary
statistics for a specific group of observations is called **grouping**. For
example, if you have income data for the entire USA but want to find the mean
income just for California, you would group the data by state.
Grouping can be done in Sheets using one of three “conditional summary”
functions.
.. admonition:: Grouping Functions in Sheets
**The COUNTIF function counts the number of cells that satisfy a condition.**
The syntax looks like ``=COUNTIF(cell range to be filtered, filter
condition)``.
Earlier in the introduction you saw an example of the ``COUNTIF`` function
used on a table of standard information about people. That example utilizes
the formula ``=COUNTIF(B1:B26, “=Los Angeles”)``. This counts the number of
cells in the range B1:B26 that match the word “Los Angeles”. The **cell range
to be filtered** is ``B1:B26``, and the **filter condition** is
``“=Los Angeles”``.
.. image:: figures/table_countif_example.png
:align: center
:alt: A screenshot from Sheets of a table of information using the COUNTIF grouping function to count how many people are from Los Angeles.
**The SUMIF function sums the values of cells that satisfy a condition.** The
syntax looks like ``=SUMIF(cell range to be filtered, filter condition, cell
range to be summed)``.
Given the same table of standard information about people, you can find the
total income of those from New York City. Consider the formula
``=SUMIF(B2:B26, "=New York City", E2:E26)``. This sums the values in the
cells ``E2:E26``, but only using rows for which the cell in that row in
column B matches the word "New York City". The **cell range** to be filtered
is ``B2:B26``, the **filter condition** is ``“=New York City”``, and the
**cell range to be summed** is ``E2:E26``.
Pictured is a screenshot of part of the dataset in Sheets with the grouping
function ``SUMIF``.
.. image:: figures/table_sumif_example.png
:align: center
:alt: A screenshot from Sheets of a table of information using the SUMIF grouping function to sum the total income of people from New York City.
**The AVERAGEIF function finds the mean of cells that satisfy a condition.**
The syntax looks like ``=AVERAGEIF(cell range to be filtered, filter
condition, cell range to be averaged)``.
Given a table of `the top 50 songs on Spotify in 2019`_, you can calculate
the mean popularity of songs by a certain artist. For example, consider the
formula ``=AVERAGEIF(C4:C53, "=Ed Sheeran", E4:E53)``. This finds the mean
of cells in the range ``E4:E53``, but only using rows for which the cell in
that row in column C matches the word "Ed Sheeran." The **cell range** to be
filtered is ``C4:C53``, the **filter condition** is ``“=Ed Sheeran”``, and
the **cell range to be averaged** is ``E4:E53``.
.. image:: figures/spotify_averageif_example.png
:align: center
:alt: A screenshot from Sheets of a table of Spotify's top 50 songs in 2019 using the AVERAGEIF function to average the total popularity of Ed Sheeran's songs.
Note that for ``SUMIF`` and ``AVERAGEIF``, if you want to sum or average the
rows in column B or C themselves, you can enter ``B2:B26`` or ``C2:C53`` as the
third argument or leave the third argument empty.
Example: Painters
-----------------
Grouping functions are most useful when finding summary statistics for a
specific group. Consider the painters dataset, one last time. Suppose you want
to count the number of French painters in this dataset, and find the total
number of paintings made by those painters. This can be done using filtering or
using grouping.
Create a filter to limit the dataset to only French painters. (For this example,
assume “French” means the only nationality is French.) Copy the filtered table
to a new sheet.
.. fillintheblank:: number_of_french_painters
Use ``COUNT`` on the filtered table to find the number of French painters.
|blank|
- :13: Correct
:15: Incorrect: In this example, “French” means exclusively French.
:x: Incorrect
.. fillintheblank:: number_of_paintings_by_french_painters
Use ``SUM`` on the filtered table to find the number of paintings by French
painters. |blank|
- :2120: Correct
:2618: Incorrect: In this example, “French” means exclusively French.
:x: Incorrect
**Why was it necessary to copy-paste the filtered data to new rows?** Cell
ranges do not behave well in filtered tables. This is best illustrated via an
example. Consider the filter applied above, to limit the painters dataset to
French painters.
.. image:: figures/filtered_index.png
:align: center
:alt: A screenshot from Sheets of a painters dataset filtered by nationality and counting the number of French painters.
Notice that when trying to count the number of rows, selecting the cells selects
all cells between the first and the last, not just the filtered cells. Note that
the selected range is ``H5:H48`. While there are only 13 filtered rows, the
range selects all rows between 5 and 48. This returns a count of 44 French
artists, which is incorrect. The same is true for cell ranges when using ``SUM``
and ``AVERAGE``.
Instead of copy-pasting the filtered table to a new sheet and then using summary
functions, you can just use grouping functions. For example, to count the number
of French painters, you can use ``COUNTIF``. The range to be counted is the
“nationality” column, and the column condition checks whether the value is
“French”.
.. image:: figures/french_painters_using_countif.png
:align: center
:alt: A screenshot from Sheets of a painters dataset grouped to count the number of painters whose nationality is French.
.. fillintheblank:: number_of_paintings_by_french_painters_using_countif
Use ``SUMIF`` to find the number of paintings by French painters. |blank|
- :2120: Correct
:x: Incorrect
Note that if you wanted to use the more general definition of “French” (any
painter who has French as one of their nationalities), you would need to use a
different filter condition. In general, the filter condition for checking if
"word" appears anywhere in the text looks like ``"*word*"``. `This forum
discussion goes into more detail.`_ This can apply to any grouping function.
Use grouping functions (``COUNTIF``, ``SUMIF``, ``AVERAGEIF``) when answering
the following questions.
.. fillintheblank:: number_of_italian_painters
How many Italian (only nationality is Italian) painters are in the list?
|blank|
- :8: Correct
:x: Incorrect
.. fillintheblank:: mean_number_of_paintings_by_italian_painters
What is the mean number of paintings by Italian painters? (Round your answer
to the nearest whole number.) |blank|
- :136: Correct
:x: Incorrect
.. mchoice:: impressionism_vs_romanticism
Which genre produced more paintings: impressionism or romanticism? (You may
simplify this by only looking at painters whose only genre is impressionism
or romanticism.)
- Impressionism
+ Correct
- Romanticism
- Incorrect
.. fillintheblank:: painters_with_multiple_genres
How many painters were associated with multiple genres? (Hint: Look for
painters whose "genre" field contains a comma.) |blank|
- :12: Correct
:x: Incorrect
Example: Titanic
----------------
The `Titanic`_ was a passenger ship that sank on its journey from Southampton
(England) to New York (USA) in 1912, `killing over 1,500 people`_. This example
uses passenger data from the tragedy. Each row records a passenger on the ship.
The purpose of this example is to find out whether some groups, for example,
women and children who had priority access to life rafts in case of emergency,
had a higher survival rate than others. For example, did women and children have
a higher survival rate than men? This can be done very conveniently using
grouping functions.
In its raw state, the survival of each passenger is encoded as “Dead” or “Alive”
in column B. These words are hard to deal with numerically, so you should
probably first transform these values to numbers. For example, the following
formula maps “Dead” to 0 and “Alive” to 1. (The example is for cell B2, but it
can be copy-pasted for the other rows.) You can insert a column on the left of
column C and use this column for the formula.
.. code-block:: none
=IF(B2=“Alive”, 1, 0)
.. image:: figures/titanic_adding_survived_column.png
:align: center
:alt: A screenshot from Sheets of a titanic passenger dataset using the IF function to label dead as a 0 and alive as a 1 in a new column called Survived number.
.. fillintheblank:: titanic_survival_rate
What is the survival rate on the Titanic? (Give your answers as a
percentage, to two decimal places.) |blank| %
- :32.25: Correct
:0.32: Remember to give your answer as a percentage.
:x: Incorrect
This survival rate you just calculated is the overall survival rate for all
passengers. What if you want to know the survival rate just for men, or just for
women, or just for children?
To calculate the survival rate just for men, you need to find the mean of column
C, but only if column E is equal to “Man”. This is a perfect use case for
``AVERAGEIF``.
.. code-block:: none
=AVERAGEIF(E$2:E$2209, "=Man", C$2:C$2209)
.. image:: figures/titanic_men_survival_rate.png
:align: center
:alt: A screenshot from Sheets of a titanic passenger dataset using the average function to find the men's survival rate.
.. fillintheblank:: titanic_women_survival_rate
What is the survival rate for women? (Give your answers as a percentage, to
two decimal places.) |blank| %
- :75.69: Correct
:0.76: Remember to give your answer as a percentage.
:x: Incorrect
.. fillintheblank:: titanic_children_survival_rate
What is the survival rate for children? (Give your answers as a percentage,
to two decimal places.) |blank| %
- :51.61: Correct
:0.52: Remember to give your answer as a percentage.
:x: Incorrect
.. shortanswer:: titanic_survival_rate_by_class
Compare the survival rate across the classes: Crew, Class 3, Class 2, Class
1. Is this in line with what you expected?
.. _This forum discussion goes into more detail.: https://stackoverflow.com/questions/17152704/google-spreadsheet-count-if-contains-a-string
.. _Titanic: https://en.wikipedia.org/wiki/RMS_Titanic
.. _killing over 1,500 people: https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic
.. _the top 50 songs on Spotify in 2019: https://www.kaggle.com/leonardopena/top50spotify2019/data