8.2. Merging and Tidying Data

Now that we know how the file is encoded we can read it easily

c_codes = pd.read_csv('Data/country_codes.csv', encoding='iso-8859-1')
country code_2 code_3 country_code iso_3166_2 continent sub_region region_code sub_region_code
0 Afghanistan AF AFG 4 ISO 3166-2:AF Asia Southern Asia 142.0 34.0
1 Åland Islands AX ALA 248 ISO 3166-2:AX Europe Northern Europe 150.0 154.0
2 Albania AL ALB 8 ISO 3166-2:AL Europe Southern Europe 150.0 39.0
3 Algeria DZ DZA 12 ISO 3166-2:DZ Africa Northern Africa 2.0 15.0
4 American Samoa AS ASM 16 ISO 3166-2:AS Oceania Polynesia 9.0 61.0

This data frame has a lot of information and we can add all or just a bit of it to our united nations DataFrame using Pandas’ merge method.

Before we merge lets clean up the column names on the undf data frame and rename country to code_3 to be consistent with the above.

undf.columns = ['session', 'year', 'code_3', 'text']
session year code_3 text
0 44 1989 MDV It is indeed a pleasure for me and the member...
1 44 1989 FIN \nMay I begin by congratulating you. Sir, on ...
2 44 1989 NER \nMr. President, it is a particular pleasure ...
3 44 1989 URY \nDuring the debate at the fortieth session o...
4 44 1989 ZWE I should like at the outset to express my del...

Now we can merge our two data frames! We will keep all the columns from the original undf data frame and add country, continent, and subregion from the c_codes data frame. We will merge the two data frames on the code_3 column. That is for every row in undf we will look for a row in the c_codes data frame where the values for code_3 match. Pandas will then add the rest of the columns from the matching row in c_codes to the current row in undf.

Dramatic foreshadowing In the c_codes data frame code_3 is the ‘primary key’ as no two rows will have the same value for code_3. In the undf data frame code_3 is a ‘Foreign key’ as we use it to lookup additional information in a table where code_3 is a primary key. More on this when we study SQL queries.

undfe = undf.merge(c_codes[['code_3', 'country', 'continent', 'sub_region']])
session year code_3 text country continent sub_region
0 44 1989 MDV It is indeed a pleasure for me and the member... Maldives Asia Southern Asia
1 68 2013 MDV I wish to begin by \nextending my heartfelt co... Maldives Asia Southern Asia
2 63 2008 MDV I am delivering this \nstatement on behalf of ... Maldives Asia Southern Asia
3 46 1991 MDV Allow me at the outset on behalf of the deleg... Maldives Asia Southern Asia
4 41 1986 MDV It is indeed a pleasure for me and all the mem... Maldives Asia Southern Asia
undfe[undf.code_3 == 'EU ']
/Users/bradleymiller/.local/share/virtualenvs/httlads--V2x4wK-/lib/python3.6/site-packages/ipykernel_launcher.py:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  """Entry point for launching an IPython kernel.
session year code_3 text country continent sub_region

Wait! What? what happened to EU?!! Why did it dissappear after the merge? What else may have dissappeared? The reason the EU dissappeared is that it is not in the c_codes data frame, and as you may recall the merge function does the equivalent of a set intersection. That is the key must be in BOTH data frames in order for it to be in the result. We can do our merge using an outer join to preserve the data and then see which countries have no text and which texts have no country name.

undfe = undf.merge(c_codes[['code_3', 'country', 'continent', 'sub_region']], how='outer')
session year code_3 text text_len country continent sub_region
0 44.0 1989.0 MDV It is indeed a pleasure for me and the member... 3011.0 Maldives Asia Southern Asia
1 68.0 2013.0 MDV I wish to begin by \nextending my heartfelt co... 2252.0 Maldives Asia Southern Asia
2 63.0 2008.0 MDV I am delivering this \nstatement on behalf of ... 1909.0 Maldives Asia Southern Asia
3 46.0 1991.0 MDV Allow me at the outset on behalf of the deleg... 2330.0 Maldives Asia Southern Asia
4 41.0 1986.0 MDV It is indeed a pleasure for me and all the mem... 2630.0 Maldives Asia Southern Asia

Now lets see which country names are not filled in.

array(['YDYE', 'CSK', 'YUG', 'DDR', 'EU'], dtype=object)
array(['ALA', 'ASM', 'AIA', 'ATA', 'ABW', 'BMU', 'BES', 'BVT', 'IOT',
       'CYM', 'CXR', 'CCK', 'COK', 'CUW', 'FLK', 'FRO', 'GUF', 'PYF',
       'ATF', 'GIB', 'GRL', 'GLP', 'GUM', 'GGY', 'HMD', 'HKG', 'IMN',
       'JEY', 'MAC', 'MTQ', 'MYT', 'MSR', 'NCL', 'NIU', 'NFK', 'MNP',
       'PCN', 'PRI', 'REU', 'BLM', 'SHN', 'MAF', 'SPM', 'SRB', 'SXM',
       'SGS', 'SJM', 'TWN', 'TKL', 'TCA', 'UMI', 'VGB', 'VIR', 'WLF',
       'ESH'], dtype=object)
array(['Åland Islands', 'American Samoa', 'Anguilla', 'Antarctica',
       'Aruba', 'Bermuda', 'Bonaire, Sint Eustatius and Saba',
       'Bouvet Island', 'British Indian Ocean Territory',
       'Cayman Islands', 'Christmas Island', 'Cocos (Keeling) Islands',
       'Cook Islands', 'Curaçao', 'Falkland Islands (Malvinas)',
       'Faroe Islands', 'French Guiana', 'French Polynesia',
       'French Southern Territories', 'Gibraltar', 'Greenland',
       'Guadeloupe', 'Guam', 'Guernsey',
       'Heard Island and McDonald Islands', 'Hong Kong', 'Isle of Man',
       'Jersey', 'Macao', 'Martinique', 'Mayotte', 'Montserrat',
       'New Caledonia', 'Niue', 'Norfolk Island',
       'Northern Mariana Islands', 'Pitcairn', 'Puerto Rico', 'Réunion',
       'Saint Barthélemy', 'Saint Helena, Ascension and Tristan da Cunha',
       'Saint Martin (French part)', 'Saint Pierre and Miquelon',
       'Serbia', 'Sint Maarten (Dutch part)',
       'South Georgia and the South Sandwich Islands',
       'Svalbard and Jan Mayen', 'Taiwan, Province of China', 'Tokelau',
       'Turks and Caicos Islands', 'United States Minor Outlying Islands',
       'Virgin Islands (British)', 'Virgin Islands (U.S.)',
       'Wallis and Futuna', 'Western Sahara'], dtype=object)

Do some research and fill in the country names for YDYE, CSK, YUG, DDR, and EU by hand.

undfe.loc[undfe.code_3 == 'EU', 'country'] = 'European Union'
by_country = undfe.groupby('country',as_index=False)['text'].count()
country    South Sudan
text                 5
Name: 161, dtype: object
c_codes[c_codes.code_2 == 'EU']
country code_2 code_3 country_code iso_3166_2 continent sub_region region_code sub_region_code

I suspect that EU is the European Union which has a place in the UN but is not a country. So OK, South Sudan has only spoken 5 times. Why is that? There is a very logical explanation, but it only makes you want to check out the 5 or 10 countries that have spoken the least. What are they?

Ok, but why did EU seem to dissappear? When we do a merge if the key is missing then the row is not included in the final result.

set(undf.code_3.unique()) - set(undfe.code_3.unique())
{'CSK', 'DDR', 'EU', 'YDYE', 'YUG'}

Can you figure out what each of the above stand for? Why are they not in the list I gave you?

At this point you may want to edit the csv file and add the data for these countries to the file. Then you can rerun the whole notebook and we will not lose as much data.

8.2.1. Tidy Data

A lot of the work in data science revolves around getting data into the proper format for analysis. A lot of data comes in messy formats for many different reasons. But if we apply some basic principles from the world of database design, data modeling, and some good old common sense as outlined in the Hadley Wickham paper we can whip our data into shape. Wickham says that tidy data has the following attributes:

  • Each variable belongs in a column and contains values

  • Each observation forms a row

  • Each type of observational unit forms a table

How does our United Nations data stack up? Pretty well. We have four columns: session, year, country, and text. If we think of the text of the speech as the thing we can observe then each row does, in fact, form an observation and session, year, and country are attributes that identify this particular observation.

Some of the common kinds of messiness that Wickham identifies include:

  • Column headers are values not variable names - imagine this table if we had one row for each year and a column for each country’s text! Now that would not be tidy!

  • Multiple variables are stored in one column. We’ve seen this untidiness in the movie data a couple of chapters ago. We’ll revisit that very soon to deal with it correctly.

  • Variables are stored in both rows and columns

  • Multiple types of observational units are stored in the same table

  • A single observational unit is stored in multiple tables.

Many of the problems with untidy data stem from not knowing how to handle relationships between multiple entities. Most of the time things that we want to observe interact with other things we can observe and when we try to combine them into a single data frame that causes trouble! There are three kinds of relationships that we should consider:

  • one to one relationships

  • one to many relationships

  • many to many relationships

An example of a one to one relationship would be a person and their passport. A person can have one passport, and a passport belongs to one person. There is data that we can collect about a person and that would make a fine DataFrame. There is also data that we can collect from a passport, such as the countries that you have visited, the place the passport was issued, This also would make a fine DataFrame.

An example of a one to many relationship is a customer and the the things they have ordered from Amazon. A particular customer may have ordered many things, but an order can only belong to a single customer.

An example of a many to many relationship is a student and a class. A student can be enrolled in more than one class, and a class has many students that have enrolled in it.

Whenever you see a DataFrame that has a column that contains a list, or a dictionary that is a sure sign of untidiness! It is also something that can be fixed an in the end will make your analysis easier.

8.2.2. Tidying the Movie Genres

Lets look at the genres column of the movies dataset. You may recall that it looks odd. In fact here is the result of df.iloc[0].genres

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

It looks like a list of dictionary literals. Except it is in double quotes like a string. Lets first figure out how we can get it to be an actual list of dictionaries. Then we’ll figure out what to do with it. Python has a nifty function called eval that allows you to evaluate a Python expression that is a string. For example:


Will return this:

[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]

Even better, we can assign the result of eval to a variable and then we can use the list and dictionary index syntax to access parts of the result! Just like we learned about when we discussed JSON in an earlier chapter!

glist = eval(df.iloc[0].genres)

One way we could solve this is to duplicate all of the rows for as many genres as the movie has storing one genere on each line, but that would mean we had to needlessly duplicate all of the other information on our first movie three times!

The better strategy for doing solving this problem is to create a new DataFrame with just two columns. One containing the movie’s unique id number and a second containing the genre. This allows you to use the merge method on the two DataFrames, but only temporarily when you need to know the genre of a particular movie.


To construct this table we need to iterate over all the rows of the DataFrame and gather the genres for this movie. For each genre of the movie we will add a an item to a list that contains the imdb_id of the movie and and item to a list that contains the name of the genre. These two lists are in sync with each other so that the ith element of each list will represent the same movie.

Here is some code you can use to construct two lists .. code:: python3

movie_list = [] genre_list = []

def map_genres(row):

glist = eval(row.genres)


glist = [] print(f”bad data for {row.title}”)

for g in glist:

movie_list.append(row.imdb_id) genre_list.append(g[‘name’])

_ = df.apply(map_genres, axis=1)

Using these two lists construct a new DataFrame with a column for imdb_id and genere

Q-1: How many movies are in the Family genre?

Q-2: Which genre has the most movies?

Now lets calculate the average revenue for the Comedy genre. We’ll do this is a couple of steps.

  1. We will reduce the genre DataFrame so it only has the Comedies left.

  2. Then we will merge the movie data frame with the genres DataFrame using the imdb_id column.

  3. We will be left with a DataFrame that only contains the rows for the movies that are comedies. You can think of a merge like this as being the intersection of the set of comedies and the set of all movies.

Q-3: What is the average revenue of a comedy movie?

Q-4: What is the title and number of genres of the movie that is in the most genres?

Problems to work on

  1. What is the total revenue for each genre?

  2. What is the average vote_average for each genre?

  3. What genre has the most votes?

  4. Use a similar process to create a data frame of collections and their movies. Which collection has the most movies?

  5. Again a similar process can be used for spoken_languages. How many movies are there for each language? Is English the most popular movie language?

Lesson Feedback

    During this lesson I was primarily in my...
  • Comfort Zone
  • Learning Zone
  • Panic Zone
    Completing this lesson took...
  • Very little time
  • A reasonable amount of time
  • More time than is reasonable
    Based on my own interests and needs, the things taught in this lesson...
  • Don't seem worth learning
  • May be worth learning
  • Are definitely worth learning
    For me to master the things taught in this lesson feels...
  • Definitely within reach
  • Within reach if I try my hardest
  • Out of reach no matter how hard I try
You have attempted of activities on this page
Next Section - 8.3. Most and Least Common UN words