8.4. Working with Text

The Series and Index objects in Pandas each have a set of string processing methods that make all of the standard Python string methods, and more, available for every string element in a Series. We call these “vectorized string methods” because Pandas is designed to apply these operations to all of the rows of the data frame simultaneously, given the computing power. They are accessed through an intermediate object called str. For example, suppose we wanted to convert all of our three-letter country codes to lowercase.

undf.country.str.lower()

This will do the job and is over 700 times faster than using a for loop.

The str object supports a long list of string functions, most of which should be very familiar to you from Python's built-in string methods.
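As a minimal, self-contained sketch of a few of these methods (using a small made-up Series of codes rather than the actual UN data frame):

```python
import pandas as pd

# A small made-up Series of country codes, standing in for undf.country
codes = pd.Series(['USA', 'Mex', 'can', 'GbR'])

print(codes.str.lower().tolist())  # ['usa', 'mex', 'can', 'gbr']
print(codes.str.upper().tolist())  # ['USA', 'MEX', 'CAN', 'GBR']
print(codes.str.len().tolist())    # [3, 3, 3, 3]
```

Each call applies the operation to every element of the Series at once; no explicit loop is needed.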

Q-1: How many rows from the United Nations data set have a country code that starts with ‘M’?

Q-2: How many distinct country codes from the United Nations data set start with ‘M’?
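One way to approach both questions uses str.startswith, sketched here on a tiny made-up frame (the real undf has one row per speech, so the same code applies to it):

```python
import pandas as pd

# Tiny stand-in for the UN data frame; the real one has one row per speech
undf = pd.DataFrame({'country': ['MEX', 'MLT', 'USA', 'MEX', 'FRA']})

starts_with_m = undf.country.str.startswith('M')  # boolean mask, one per row

print(starts_with_m.sum())                    # rows starting with 'M': 3
print(undf.country[starts_with_m].nunique())  # distinct codes starting with 'M': 2
```

Summing a boolean mask counts the True values, which is why Q-1 and Q-2 give different answers when a country gives more than one speech.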

Regular expression methods for strings

Q-3: What is the most common word that follows ‘global’ in all of the speeches and how many times does that word occur?
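One possible approach combines str.findall with a regular expression capture group and a Counter, sketched here on made-up speech texts standing in for undf.text:

```python
import re
from collections import Counter

import pandas as pd

# Made-up speech texts standing in for undf.text
texts = pd.Series([
    'We face a global crisis and a global challenge.',
    'The Global crisis demands global action.',
])

# For each speech, capture every word that immediately follows 'global'
following = texts.str.findall(r'\bglobal\s+(\w+)', flags=re.IGNORECASE)

# Flatten the per-speech lists into one Counter across all speeches
counts = Counter(w.lower() for words in following for w in words)
word, n = counts.most_common(1)[0]
print(word, n)  # crisis 2
```

The capture group (\w+) means findall returns only the word after ‘global’, not the whole match.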

We can use our new skills to do a minor bit of cleanup on the text. Many of the speeches start with an invisible zero-width no-break space character ('\ufeff', the Unicode byte order mark) followed by a newline (you will see it as \n in the text). We can eliminate this with:

undf['text'] = undf.text.str.replace('\ufeff','') # remove strange character
undf['text'] = undf.text.str.strip() # eliminate whitespace from beginning and end

8.5. Research Questions

  1. What is the average word count per speech?

  2. How does that average compare across all of the countries?

  3. What is the average sentence length per speech?

  4. Find or create a list of topics that the UN might discuss and debate, and make a graph to show how often these topics were mentioned. For example: ‘peace’, ‘nuclear war’, ‘terrorism’, ‘moon landing’. Think of your own!

  5. The five permanent members of the UN Security Council are sec_council = ['USA', 'RUS', 'GBR', 'FRA', 'CHN']. Make a graph of the frequency of topics and how often they are discussed by those countries. You could do this same exercise with any group of countries, maybe the Central European or North African countries, etc.

  6. Make a graph to show the frequency with which various topics are discussed over the years. For example, ‘peace’ is consistently a popular word, as are ‘freedom’ and ‘human rights’. What about ‘HIV’, ‘terrorism’, or ‘global warming’? Compare two phrases like ‘global warming’ and ‘climate change’.

  7. When did the internet become a popular topic?
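As a starting point for questions 1 and 6, here is a sketch on a made-up frame; it assumes the UN data frame has year and text columns, which you should check against your own data:

```python
import pandas as pd

# Made-up frame with the columns the questions assume: year and text
undf = pd.DataFrame({
    'year': [1989, 1989, 1990],
    'text': ['peace and freedom', 'a lasting peace', 'terrorism threatens peace'],
})

# Question 1: average word count per speech
undf['word_count'] = undf.text.str.split().str.len()
print(undf.word_count.mean())  # 3.0

# Question 6 (sketch): mentions of one topic word, summed by year
undf['peace_mentions'] = undf.text.str.count(r'\bpeace\b')
print(undf.groupby('year').peace_mentions.sum().to_dict())  # {1989: 2, 1990: 1}
```

The per-year totals from groupby are exactly the kind of Series you can pass straight to a plotting call to graph a topic over time.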

8.6. Text Complexity

For years people have been trying to find measures of text complexity, sometimes to determine what ‘reading level’ an article is at, or how much formal education is required to understand a piece of writing. These measures are often functions of things such as the number of sentences in a paragraph, sentence length, word length, the number of polysyllabic words used, etc.

There are several Python packages that compute the complexity automatically so that you don’t have to write that part yourself. One easy-to-use package is called textatistic. It calculates several different common measures of text complexity.

  1. Using the Gunning Fog or SMOG index, compute the reading complexity for each speech.

  2. Is there any correlation between the Fog index for a country and its GDP or literacy rate?

  3. Make a graph showing the distribution of each of the above measures.
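The textatistic package computes these scores for you, so you do not need the following in practice. Purely as a rough, self-contained sketch of what goes into such a score, here is a hand-rolled Gunning Fog estimate; the vowel-group syllable counter is a crude assumption of mine, not the algorithm textatistic uses, so its numbers will differ from the package's.

```python
import re

def count_syllables(word):
    # Crude heuristic: count runs of consecutive vowels (an assumption,
    # not how textatistic counts syllables)
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def gunning_fog(text):
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    # Gunning Fog formula:
    # 0.4 * (average sentence length + percentage of complex words)
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

print(round(gunning_fog('The cat sat. The dog ran.'), 2))  # 1.2
```

Applying a function like this (or textatistic's equivalent) to each row of the text column gives you a complexity score per speech to correlate and graph.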
