2.3. Part II Adding More Data

2.3.1. Happiness by Region

Now let’s create a table where we compute the average happiness score for all of the countries in the region. To get started we need to answer a couple of questions: What are the unique region names? How can we compute an average for the countries that are in the same region?

  1. We can get a table of the unique region names by using the UNIQUE function. In Cell A180 the UNIQUE function takes the range that contains all of the region names and will populate a few rows with just the unique names.

Q-1: How many unique regions are there?

  1. Now lets make that list of countries look a little nicer by sorting it. Select the countries and then from the menu select Data -> Sort range. Then click on the Sort button. Hmmm what happened here?
  2. The problem is that UNIQUE is a special kind of function that returns an array of values. We need to make the values in the columns permanent before we sort them. To do this we can select the names again, then copy them to the clipboard. Now immediately choose Edit -> Paste Special -> Paste values only and this will eliminate the formula and leave us with cells containing only the names of the regions.
  3. With the table of Regions we can use a combination of SUMIF and COUNTIF to compute the average happiness. Lets do this incrementally to start. Lets create a column right next to the region names that contains the number of countries in the region. The COUNTIF function takes a range of cells and a condition for those cells to match. In our case the range is B2:B141 that is all of the regions. The condition is the name of the region which we can get from a cell in our newly created table of region names.

Q-2: There are countries in Western Europe and countries in Sub-Saharan Africa.

Now lets create a column that sums the happiness score for each region using the SUMIF function. The SUMIF function is a bit more complicated than COUNTIF in that it takes a separate (parallel) range for us to SUM. Once again the first parameter will be the range containing the regions, the second parameter will be the name of a region to match, but the third parameter is the range of cells containing the happiness scores. When a row in the Region column matches the given Region the function includes the value from the Happiness score column in the sum.

Q-3: The sum of all happiness in South Asia is and the sum of all happiness in Western Europe is (out to three decimal places)

With a column for count and a column for the sum you can now calculate the average by dividing our two columns.

Q-4: The average happiness score for Latin America and Caribbean is

  1. We can combine the work we did above using a single function called AVERAGEIF. Lets use it and compare our answers – They should be identical. By now you should be feeling some respect for the spreadsheet jockeys of the world. This is definitely not a toy!
  2. OK, last but not least let us sort the happiness scores so we can see the regions from most to least happy.
  3. Select the table and then from the Data menu select Sort Range choose column B and check the box for Z->A then sort
  4. Nice - North America and ANZ is the happiest (Aussie Aussie Aussie) and Sub-Saharan Africa is the least.
  5. Add another column to our little table that tells us how many countries are in each region (COUNTIF)
  6. Now Using MAXIFS, MINIFS, MATCH and INDEX lets find the most and least happy country in each region. MAXIFS and MINIFS work alot like AVERAGEIF and COUNTIF but allow for more conditions. We still need only one. If you read the popup you will know what to do.

Q-5: What is the happiest country in East Asia?

2.3.2. Joining Data from other Sources

So far we have limited our analysis to the data provided for us in the original happiness spreadsheet. But what if we wanted to look at other factors for happiness such as cell phone ownership? Internet access? birth rates? or anything else we can think of? Seldom does one file contain all the data you need to answer the questions you may have. In this part of the project we will import a spreadsheet that has a lot more data about each country including its continent (see question 5). This is an important lesson as it sets the stage nicely for what we will learn about later when using SQL to “join” two tables of data.

  1. The first thing we need to do is to import the countries of the world spreadsheet. This has a huge amount of data about each country and you may wish to explore some of the other data provided later. For now we are interested in how we can use the information on this new spreadsheet to give us the continent of each country.
  2. You can start by either copy/pasting the whole sheet into a new tab or importing the csv file into a new tab.
  3. Next we will want to add a column to the happiness spreadsheet that contains the Population for each country. The way we do this is to use the VLOOKUP function. Pay attention to this as it is one of the most powerful functions you will learn about for doing high powered calculations on a spreadsheet. The main idea behind this is also widely used in the database world so it is worth learning in detail.

The idea goes like this. On our happiness spreadsheet we have a column that contains the name of each country. It has a bunch of happiness related data about each country in other columns. On our countries of the world sheet we have a column of country names and a bunch of other information about countries (including their population) in other columns. The two sheets do not have the countries in the same order, nor do they necessarily have the same list of countries – they do have most of the same but not all.

When we use VLOOKUP our goal is to join together these two sheets adding columns to the happiness sheet using values from the row in the countries of the world sheet from the row where the country names match. For example in our happiness sheet Ireland is on row 15 but in the countries of the world sheet it is on row 101. What we want to do is take (at least) column B row 101 from the countries sheet and add it to the happiness sheet on row 15 column M.

With VLOOKUP with do this by allowing the function to search for the value in one cell in another column, and then return the value from a different cell in the same row but in some other column. To find the continent of Israel we would use VLOOKUP(A15, Sheet1!$A$6:$F$229, 5, FALSE)

Let’s unpack that:

A15 is the cell containing Ireland

Sheet1!$A$6:$F$229 the range of cells we can search in as well as get values from.

2 tells Sheets that when we find a match for Israel we want the value from the same row but in column 2 of our range. (Sigh – sheets and Python use different counting systems)

Notice that column 5 of our range is the continent/Region column! Nice You may have noticed that VLOOKUP is a bit like using match and index together, but it is a little less flexible as the column you search in must always be on the far left side of the range.

To add a whole new column to fill in the region for each country we would type the following into O2 =vlookup(A2,Sheet1!$A$6:$F$229,5,FALSE) Now if you double click on the blue square in the lower right corner when you have M2 selected you will see that sheets will automatically copy/paste the formula down the column. It will do this until it finds a blank cell to the left and then it will stop. If your spreadsheet has some missing data this can lead to some unexpected results, so it’s always a good idea to make sure it has pasted all the way down.

Q-6: What does your spreadsheet show for the population of the United States? what does the countries of the world sheet show for the united states?

As you found out there are some rows that have a value of #N/A in them. This is because one spreadsheet has the name “United States” and the other spreadsheet has “United States of America”, We know these are the same but the computer does not make the match. You will need to clean up this data manually by making the names match where they don’t already. This is also why the countries of the world spreadsheet contains the column that has a three letter code for each country. These codes are internationally agreed upon and are always the same for each country. This avoids the kind of problems we have where there is more than on common spelling.

Any time you are introducing data from another source you are likely to run into inconsistencies and missing data. That is just a simple fact of life for a data scientist. You will need to either search further to fill in the missing pieces, or learn to live without some pieces of data.

    Q-7: Which of the following countries are NOT in the world countries spreadsheet?

  • Kosovo
  • correct
  • Palestine
  • correct
  • Palau
  • No, Palau is there
  • Ivory Coast
  • Technically this one is there but you need to translate it to Côte d’Ivoire

Now that you have country names unified, and the population data in place you can practice some calculations on this new piece of data.

  1. Calculate the average population for each region?

Q-8: The average population is for East Asia.

  1. Find the name of the country in each region with the largest population

Q-9: has the largest population in Latin America and Caribbean

  1. What is the country in each region with the smallest population?

Q-10: has the smallest population in the Middle East and North Africa region.

Q-11: Write down two questions of your own, that you can explore with the combined data set.

Now, using your new spreadsheet skills answer your own questions.

Q-12: Use this space to provide answers to the questions above, explaining briefly how you arrived at the answers.

2.3.3. Introducing Pivot Tables

We can make all of this a bit easier using a Pivot Table ! This is a really useful tool to have in your toolbox and many other tools you use will support the creation of pivot tables as well.

  1. The idea behind a pivot table is to take the unique values from some column and make them the titles of a bunch of columns, while summarizing the data for those columns from a number of rows. For example suppose you had a three of columns: user, movie, rating What would be more easy to look at would be to have a column for each movie and a row for each user with the rating in the cell corresponding to a user and a movie. This is exactly the use case for pivot tables. You can see an example of transforming the orginal data to the pivot table view below:
../_images/pivot_example.png

Start a new worksheet, and recreate the data and pivot table you see in the example above.

  1. Another good use case is to replace what we have just done. We want to use the unique values for each continent as the row, and then calculate a number of summary statistics for each. For practice you should redo the exercise of finding the average population for each region that you did above.

Q-13: Using a Pivot table find the median value of the Life Ladder column for each region. The median value for South Asia is to 3 decimal places

Pivot tables can be treated like any other part of your spreadsheet. Use a pivot table to find the least happy country in each region. Then using MATCH and index add an additional column that contains the name of that country.

Q-14: The least happy country in Southeast Asia is

Q-15: Without adding another column you can quickly change the function to summarize and find that the most happy country in Southeast Asia is

You will find that understanding Pivot tables and when to use them to be a very powerful tool to have in your toolbox. Many other systems including Pandas and Relational databases, like Postgresql and others also support making pivot tables. The interface in the Spreadsheet is the simplest so it is a good one to learn on.

Lesson Feedback

    During this lesson I was primarily in my...
  • Comfort Zone
  • Learning Zone
  • Panic Zone
    Completing this lesson took...
  • Very little time
  • A reasonable amount of time
  • More time than is reasonable
    Based on my own interests and needs, the things taught in this lesson...
  • Don't seem worth learning
  • May be worth learning
  • Are definitely worth learning
    For me to master the things taught in this lesson feels...
  • Definitely within reach
  • Within reach if I try my hardest
  • Out of reach no matter how hard I try
Next Section - 2.4. Part III Comparing Happiness Data across years