8.3. Merging and Tidying Data

In this section, we will learn how to merge two data frames that contain information we wish to put together. Merging creates a new data frame with the information from both, as long as the two data frames share a key column: a primary key in one and a matching foreign key in the other. Then we will learn the basic principles of tidy data and the kinds of relationships that can exist between data sets.

Now that we know how the file is encoded, we can read it easily.

c_codes = pd.read_csv('Data/country_codes.csv', encoding='iso-8859-1')
c_codes.head()
country code_2 code_3 country_code iso_3166_2 continent sub_region region_code sub_region_code
0 Afghanistan AF AFG 4 ISO 3166-2:AF Asia Southern Asia 142.0 34.0
1 Åland Islands AX ALA 248 ISO 3166-2:AX Europe Northern Europe 150.0 154.0
2 Albania AL ALB 8 ISO 3166-2:AL Europe Southern Europe 150.0 39.0
3 Algeria DZ DZA 12 ISO 3166-2:DZ Africa Northern Africa 2.0 15.0
4 American Samoa AS ASM 16 ISO 3166-2:AS Oceania Polynesia 9.0 61.0

This DataFrame has a lot of information, and we can add all or just a bit of it to our United Nations DataFrame using the merge method of Pandas.

Before we merge, let’s clean up the column names on the undf data frame and rename country to code_3 to be consistent with the above.

undf.columns = ['session', 'year', 'code_3', 'text']
undf.head()
session year code_3 text
0 44 1989 MDV It is indeed a pleasure for me and the member...
1 44 1989 FIN \nMay I begin by congratulating you. Sir, on ...
2 44 1989 NER \nMr. President, it is a particular pleasure ...
3 44 1989 URY \nDuring the debate at the fortieth session o...
4 44 1989 ZWE I should like at the outset to express my del...

Now, we can merge our two data frames. We will keep all the columns from the original undf DataFrame and add country, continent, and subregion from the c_codes DataFrame. We will merge the two data frames on the code_3 column. That is, for every row in undf, we will look for a row in the c_codes DataFrame where the values for code_3 match. Pandas will then add the rest of the columns from the matching row in c_codes to the current row in undf.

In the c_codes data frame, code_3 is the “primary key”, as no two rows have the same value for code_3. In the undf data frame, code_3 is a “foreign key”, as we use it to look up additional information in a table where code_3 is a primary key. More on this when we study SQL queries.
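
To see the mechanics on a tiny example before we apply them to the real data, here is a sketch with two made-up data frames; the owners and pets below are invented purely for illustration.

import pandas as pd

# owners: oid is the primary key, so no two rows share an oid.
owners = pd.DataFrame({'oid': [1, 2, 3],
                       'name': ['Ana', 'Ben', 'Caro']})

# pets: oid is a foreign key that points into the owners table.
pets = pd.DataFrame({'pet': ['cat', 'dog', 'parrot'],
                     'oid': [2, 1, 2]})

# For every row in pets, find the row in owners with the matching oid
# and append its remaining columns.
pets.merge(owners, on='oid')
pet oid name
0 cat 2 Ben
1 dog 1 Ana
2 parrot 2 Ben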

undfe = undf.merge(c_codes[['code_3', 'country', 'continent', 'sub_region']])
undfe.head()
session year code_3 text country continent sub_region
0 44 1989 MDV It is indeed a pleasure for me and the member... Maldives Asia Southern Asia
1 68 2013 MDV I wish to begin by \nextending my heartfelt co... Maldives Asia Southern Asia
2 63 2008 MDV I am delivering this \nstatement on behalf of ... Maldives Asia Southern Asia
3 46 1991 MDV Allow me at the outset on behalf of the deleg... Maldives Asia Southern Asia
4 41 1986 MDV It is indeed a pleasure for me and all the mem... Maldives Asia Southern Asia
undfe[undfe.code_3 == 'EU']
session year code_3 text country continent sub_region

Wait! What? What happened to the EU?! Why did it disappear after the merge? What else may have disappeared? The reason the EU disappeared is that it is not in the c_codes data frame, and as you may recall, the default merge does the equivalent of a set intersection (an inner join). That is, the key must be in BOTH data frames for a row to appear in the result. We can quantify what was lost by comparing the unique country codes before and after the merge.

len(undfe)
7406
len(undf.code_3.unique())
199
len(undfe.code_3.unique())
194
set(undf.code_3.unique()) - set(undfe.code_3.unique())
{'CSK', 'DDR', 'EU', 'YDYE', 'YUG'}

Can you figure out what each of these codes stands for? Why do you think they are missing from the c_codes data frame?

At this point, you may want to edit the CSV file and add rows for these codes, then rerun the whole notebook so that we do not lose as much data. Alternatively, we can redo our merge using an outer join to preserve the data, then see which countries have no text and which texts have no country name.

undfe = undf.merge(c_codes[['code_3', 'country', 'continent', 'sub_region']],
                   how='outer')
undfe.head()
session year code_3 text country continent sub_region
0 44.0 1989.0 MDV It is indeed a pleasure for me and the member... Maldives Asia Southern Asia
1 68.0 2013.0 MDV I wish to begin by \nextending my heartfelt co... Maldives Asia Southern Asia
2 63.0 2008.0 MDV I am delivering this \nstatement on behalf of ... Maldives Asia Southern Asia
3 46.0 1991.0 MDV Allow me at the outset on behalf of the deleg... Maldives Asia Southern Asia
4 41.0 1986.0 MDV It is indeed a pleasure for me and all the mem... Maldives Asia Southern Asia

Now let’s see which country names are not filled in.

undfe[undfe.country.isna()].code_3.unique()
array(['YDYE', 'CSK', 'YUG', 'DDR', 'EU'], dtype=object)
undfe[undfe.text.isna()].code_3.unique()
array(['ALA', 'ASM', 'AIA', 'ATA', 'ABW', 'BMU', 'BES', 'BVT', 'IOT',
       'CYM', 'CXR', 'CCK', 'COK', 'CUW', 'FLK', 'FRO', 'GUF', 'PYF',
       'ATF', 'GIB', 'GRL', 'GLP', 'GUM', 'GGY', 'HMD', 'HKG', 'IMN',
       'JEY', 'MAC', 'MTQ', 'MYT', 'MSR', 'NCL', 'NIU', 'NFK', 'MNP',
       'PCN', 'PRI', 'REU', 'BLM', 'SHN', 'MAF', 'SPM', 'SRB', 'SXM',
       'SGS', 'SJM', 'TWN', 'TKL', 'TCA', 'UMI', 'VGB', 'VIR', 'WLF',
       'ESH'], dtype=object)
undfe[undfe.text.isna()].country.unique()
array(['Åland Islands', 'American Samoa', 'Anguilla', 'Antarctica',
       'Aruba', 'Bermuda', 'Bonaire, Sint Eustatius and Saba',
       'Bouvet Island', 'British Indian Ocean Territory',
       'Cayman Islands', 'Christmas Island', 'Cocos (Keeling) Islands',
       'Cook Islands', 'Curaçao', 'Falkland Islands (Malvinas)',
       'Faroe Islands', 'French Guiana', 'French Polynesia',
       'French Southern Territories', 'Gibraltar', 'Greenland',
       'Guadeloupe', 'Guam', 'Guernsey',
       'Heard Island and McDonald Islands', 'Hong Kong', 'Isle of Man',
       'Jersey', 'Macao', 'Martinique', 'Mayotte', 'Montserrat',
       'New Caledonia', 'Niue', 'Norfolk Island',
       'Northern Mariana Islands', 'Pitcairn', 'Puerto Rico', 'Réunion',
       'Saint Barthélemy', 'Saint Helena, Ascension and Tristan da Cunha',
       'Saint Martin (French part)', 'Saint Pierre and Miquelon',
       'Serbia', 'Sint Maarten (Dutch part)',
       'South Georgia and the South Sandwich Islands',
       'Svalbard and Jan Mayen', 'Taiwan, Province of China', 'Tokelau',
       'Turks and Caicos Islands', 'United States Minor Outlying Islands',
       'Virgin Islands (British)', 'Virgin Islands (U.S.)',
       'Wallis and Futuna', 'Western Sahara'], dtype=object)

Fill in the country names for YDYE, CSK, YUG, DDR, and EU by hand.
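
For the first four of these codes, here is one possible way to do it. The names below are a reasonable set of labels for these historical entities, though you could choose different ones; we will deal with EU separately in a moment.

# One reasonable set of labels for codes that refer to entities that no longer exist.
historic_names = {'YDYE': "Yemen (People's Democratic Republic)",
                  'CSK': 'Czechoslovakia',
                  'YUG': 'Yugoslavia',
                  'DDR': 'German Democratic Republic'}
for code, name in historic_names.items():
    undfe.loc[undfe.code_3 == code, 'country'] = name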

That leaves EU. Is it perhaps hiding in c_codes under the two-letter codes?

c_codes[c_codes.code_2 == 'EU']
country code_2 code_3 country_code iso_3166_2 continent sub_region region_code sub_region_code

No. I suspect that EU indicates the European Union, which has a place in the UN but is not a country, so we fill in its name ourselves.

undfe.loc[undfe.code_3 == 'EU', 'country'] = 'European Union'

As a quick check, let's count how many speeches each country has given and look at the smallest count.

by_country = undfe.groupby('country', as_index=False)['text'].count()
by_country.loc[by_country.text.idxmin()]
country    South Sudan
text                 5
Name: 161, dtype: object

South Sudan has spoken only 5 times. Why is that? There is a very logical explanation, but it makes you want to check out the 5 or 10 countries that have spoken the least.


8.3.1. Tidy Data

Much of the work in data science revolves around getting data into the proper format for analysis. Data arrives in messy formats for many different reasons. But if we apply some basic principles from database design and data modeling, along with some common sense (as outlined in Hadley Wickham's paper "Tidy Data"), we can whip our data into shape. Wickham says that tidy data has the following attributes.

  • Each variable belongs in a column and contains values.

  • Each observation forms a row.

  • Each type of observational unit forms a table.

How does our United Nations data stack up? Pretty well. We have four columns: session, year, code_3 (the country), and text. If we think of the text of the speech as the thing we observe, then each row forms an observation, and session, year, and country are attributes that identify this particular observation.

Some of the common kinds of messiness that Wickham identifies include the following.

  • Column headers are values, not variable names. Imagine this table if we had one row for each year and a column for each country's text! Now that would not be tidy! (The sketch after this list shows what that would look like and how to fix it.)

  • Multiple variables are stored in one column. We saw this untidiness in the movie data a couple of chapters ago, and we will revisit it very soon to deal with it correctly.

  • Variables are stored in both rows and columns.

  • Multiple types of observational units are stored in the same table.

  • A single observational unit is stored in multiple tables.
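
To make the first kind of messiness concrete, here is a small sketch of the "one column per country" layout mentioned above and how the pandas melt method gathers it back into tidy form. The tiny table is invented just for illustration.

import pandas as pd

# Untidy: country codes appear as column headers, so values are masquerading as variables.
wide = pd.DataFrame({'year': [1989, 1990],
                     'USA': ['speech a', 'speech b'],
                     'FRA': ['speech c', 'speech d']})

# Tidy: melt gathers the country columns into one row per (year, country) observation.
tidy = wide.melt(id_vars='year', var_name='code_3', value_name='text')
tidy
year code_3 text
0 1989 USA speech a
1 1990 USA speech b
2 1989 FRA speech c
3 1990 FRA speech d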

Many of the problems with untidy data stem from not knowing how to handle relationships between multiple entities. Most of the time, things that we want to observe interact with other things we can observe, and when we try to combine them into a single data frame, that causes trouble. There are three kinds of relationships that we should consider.

  • one-to-one relationships

  • one-to-many relationships

  • many-to-many relationships

An example of a one-to-one relationship is a person and their passport. A person can have one passport, and a given passport belongs to only one person. There is data we can collect about a person that could be stored in one DataFrame, and there is data we can collect about a passport, such as where it was issued and the countries its holder has visited, that could be stored in another DataFrame.

An example of a one-to-many relationship is a customer and the things they have ordered from Amazon. A particular customer may have ordered many things, but a given order can only belong to a single customer.

An example of a many-to-many relationship is a student and a class. A student can be enrolled in more than one class, and a class can have many students who are enrolled in it.
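
In a data frame setting, a many-to-many relationship is usually represented with a third, two-column "linking" table, which is exactly what we will build for movies and genres in the next section. The students and classes below are made up for illustration.

import pandas as pd

students = pd.DataFrame({'student_id': [1, 2], 'name': ['Aisha', 'Boris']})
classes = pd.DataFrame({'class_id': [10, 20], 'title': ['Statistics', 'Databases']})

# The linking table: one row per enrollment of a student in a class.
enrollments = pd.DataFrame({'student_id': [1, 1, 2], 'class_id': [10, 20, 10]})

# Merging through the linking table answers "who is enrolled in what?"
enrollments.merge(students).merge(classes)
student_id class_id name title
0 1 10 Aisha Statistics
1 1 20 Aisha Databases
2 2 10 Boris Statistics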

Whenever you see a DataFrame that has a column containing a list or a dictionary, that is a sure sign of untidiness. It is also something that can be fixed, and fixing it will make your analysis easier in the end.

8.3.2. Tidying the Movie Genres

Let’s look at the genres column of the movies dataset. You may recall that it looks odd. Here is the result of df.iloc[0].genres.

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

It looks like a list of dictionary literals, except that it is wrapped in double quotes like a string. Let's first figure out how to turn it into an actual list of dictionaries; then we'll figure out what to do with it. Python has a nifty function called eval that allows you to evaluate a string as a Python expression.

eval(df.iloc[0].genres)
[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]
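
A quick aside: eval will run any Python code it is given, so it can be risky on data you do not control. Python's ast.literal_eval does the same job for our purpose but only accepts literals such as lists, dictionaries, strings, and numbers. Here is a sketch of that safer alternative; the rest of the chapter sticks with eval.

import ast

glist = ast.literal_eval(df.iloc[0].genres)
glist[0]['name']
'Animation'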

Even better, we can assign the result of eval to a variable, and then we can use the list and dictionary index syntax to access parts of the result, just like we learned about when we discussed JSON in an earlier chapter.

glist = eval(df.iloc[0].genres)
glist[1]['name']
'Comedy'

One way we could solve this is to duplicate all of the rows for as many genres as the movie has, storing one genre on each line, but that would mean we would have to needlessly duplicate all of the other information on our first movie three times.

A better strategy is to create a new DataFrame with just two columns: one containing the movie's unique id number and a second containing a genre. You can then merge the two data frames whenever you need to know the genres of a particular movie, without permanently duplicating the movie information.

[Figure: two data frames, one containing movie information and the other containing a movie id and its genre, are merged on the movie id into a single data frame that combines the movie information with the genre.]

To construct this table, we need to iterate over all the rows of the DataFrame and gather the genres for each movie. For each genre of a movie, we will append the movie's imdb_id to one list and the genre's name to a second list. The two lists stay in sync, so the i-th element of each list refers to the same movie.

Here is some code you can use to construct the two lists.

movie_list = []
genre_list = []

def map_genres(row):
    # Convert the genres string into an actual list of dictionaries.
    try:
        glist = eval(row.genres)
    except Exception:
        # Some rows have missing or malformed genre data; skip them.
        glist = []
        print(f"bad data for {row.title}")
    # Record one (movie, genre) pair for every genre of this movie.
    for g in glist:
        movie_list.append(row.imdb_id)
        genre_list.append(g['name'])

_ = df.apply(map_genres, axis=1)

Using these two lists, construct a new DataFrame with a column for imdb_id and genre.
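
If you get stuck, here is a minimal sketch of that step; genre_df is just one possible name for the new data frame.

genre_df = pd.DataFrame({'imdb_id': movie_list, 'genre': genre_list})
genre_df.head()

If your version of pandas has the explode method, turning the genres column into a column of lists and exploding it is another way to reach the same shape, but the two-list approach above keeps every step explicit.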

Now let's calculate the average revenue for the Comedy genre. We'll do this in a couple of steps; a sketch after the list puts the steps together.

  1. We will reduce the genre DataFrame so it only has the Comedies left.

  2. Then we will merge the movie data frame with the genres DataFrame using the imdb_id column.

  3. We will be left with a DataFrame that only contains the rows for the movies that are comedies. You can think of a merge like this as being the intersection of the set of comedies and the set of all movies.
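
Putting the three steps together might look like the following sketch. It assumes the two-column data frame from above is named genre_df and that the movie data frame stores earnings in a revenue column.

comedies = genre_df[genre_df.genre == 'Comedy']     # 1. keep only the Comedy rows
comedy_movies = df.merge(comedies, on='imdb_id')    # 2. merge with the movie data frame
comedy_movies.revenue.mean()                        # 3. average revenue for the comedies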

Problems to work on

  1. What is the total revenue for each genre?

  2. What is the average vote_average for each genre?

  3. What genre has the most votes?

  4. Use a similar process to create a data frame of collections and their movies. Which collection has the most movies?

  5. Again, a similar process can be used for spoken_languages. How many movies are there for each language? Is English the most popular movie language?
