8.4. Tidying Text: Most and Least Common Words

When analyzing text, it is often useful to know the most common and least common words it contains. Word frequencies give us a quick sense of what a text is about and the context in which it was written. However, there are a few obstacles we will have to overcome first.

Before we tackle finding the most common and least common words in the UN speeches, we need to understand a couple of things about text processing. First we will clean up the text; then we need to learn about stop words. If you think about it for a minute, you can probably guess the most common words already: they will be words like “a”, “an”, “the”, and “and”. These words are nearly useless if we are trying to extract meaning from a long text. Our initial list of cleaning tasks is as follows.

  1. Convert all text to lower case.

  2. Remove all punctuation (a short example of how follows this list).

  3. Break the string into a list of words.

  4. Remove stop words from the list.
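
Step 2 relies on a small trick: str.maketrans builds a translation table that maps every punctuation character to a space, and translate applies that table to a string. Here is a minimal sketch with a made-up sentence:

import string

s = "Peace, security, and development!"
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
print(s.lower().translate(table))
peace  security  and development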

With those pieces in place, here is the code that carries out steps 1–3 and counts the words for 1970:

import string
import nltk
from collections import Counter

speeches_1970 = undf[undf.year == 1970].copy()   # undf holds the UN speeches loaded earlier
speeches_1970['text'] = speeches_1970.text.apply(lambda x: x.lower())
# replace every punctuation character with a space
speeches_1970['text'] = speeches_1970.text.apply(
    lambda x: x.translate(str.maketrans(
        string.punctuation, ' ' * len(string.punctuation))))
# split each speech into a list of words (word_tokenize needs the 'punkt' models)
speeches_1970['word_list'] = speeches_1970.text.apply(nltk.word_tokenize)
# count every word across all of the 1970 speeches
c = Counter(speeches_1970.word_list.sum())
c.most_common(10)
[('the', 25077),
 ('of', 16265),
 ('and', 9224),
 ('to', 9134),
 ('in', 6668),
 ('a', 4530),
 ('that', 3919),
 ('is', 3322),
 ('for', 2563),
 ('which', 2471)]
As predicted, the ten most common words are all stop words. Since most_common() returns the words sorted from most to least frequent, slicing from the end shows us the least common ones:

c.most_common()[-10:]
[('shabby', 1),
 ('predatory', 1),
 ('siphoned', 1),
 ('crop', 1),
 ('outflow', 1),
 ('ashes', 1),
 ('pr', 1),
 ('bystander', 1),
 ('antiimperialist', 1),
 ('earn', 1)]
The next step is to remove the stop words. NLTK distributes its stop word list as a corpus that must be downloaded separately, so the first time you try to use it you will probably see an error like this:

LookupError:
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

  Searched in:
    - '/Users/bradleymiller/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/Users/bradleymiller/.local/share/virtualenvs/httlads--V2x4wK-/bin/../nltk_data'
    - '/Users/bradleymiller/.local/share/virtualenvs/httlads--V2x4wK-/bin/../share/nltk_data'
    - '/Users/bradleymiller/.local/share/virtualenvs/httlads--V2x4wK-/bin/../lib/nltk_data'
**********************************************************************
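
Running the download fixes the problem. As an optional guard, here is a small sketch that uses NLTK's resource lookup to download the corpus only when it is missing:

import nltk

try:
    nltk.data.find('corpora/stopwords')   # is the corpus already installed?
except LookupError:
    nltk.download('stopwords')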
With the corpus in place, we can build a set of English stop words:

from nltk.corpus import stopwords

sw = set(stopwords.words('english'))
len(sw)
179
Now we can drop the stop words from every speech's word list and recount:

speeches_1970['word_list'] = speeches_1970.word_list.apply(
    lambda x: [y for y in x if y not in sw])

c = Counter(speeches_1970.word_list.sum())
c.most_common(25)
[('nations', 1997),
 ('united', 1996),
 ('international', 1251),
 ('world', 1101),
 ('peace', 1019),
 ('countries', 908),
 ('states', 897),
 ('organization', 763),
 ('would', 677),
 ('people', 649),
 ('development', 649),
 ('security', 594),
 ('general', 571),
 ('peoples', 567),
 ('assembly', 552),
 ('charter', 551),
 ('government', 544),
 ('one', 535),
 ('must', 474),
 ('also', 454),
 ('economic', 450),
 ('us', 401),
 ('years', 392),
 ('time', 371),
 ('great', 369)]
Slicing from the end again gives the least common words. A few of them, such as “sqon” and “gbout”, look like artifacts of scanning the printed speeches rather than real words:

c.most_common()[-25:]
[('reliably', 1),
 ('polish', 1),
 ('sqon', 1),
 ('ultra', 1),
 ('nonapplicability', 1),
 ('statutory', 1),
 ('2391', 1),
 ('renovation', 1),
 ('russia', 1),
 ('gbout', 1),
 ('•', 1),
 ('prediction', 1),
 ('oceania', 1),
 ('fat', 1),
 ('1848th', 1),
 ('shabby', 1),
 ('predatory', 1),
 ('siphoned', 1),
 ('crop', 1),
 ('outflow', 1),
 ('ashes', 1),
 ('pr', 1),
 ('bystander', 1),
 ('antiimperialist', 1),
 ('earn', 1)]
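
The practice problems below ask you to repeat this analysis for another year, so it is worth packaging the steps into a reusable function. The following is just a sketch: it assumes undf and the imports used above, and the name word_counts is our own invention.

import string
import nltk
from collections import Counter
from nltk.corpus import stopwords

def word_counts(df, year):
    # apply the cleaning pipeline from this section to one year of speeches
    speeches = df[df.year == year].copy()
    text = speeches.text.apply(lambda x: x.lower())
    text = text.apply(lambda x: x.translate(str.maketrans(
        string.punctuation, ' ' * len(string.punctuation))))
    word_lists = text.apply(nltk.word_tokenize)
    sw = set(stopwords.words('english'))
    # drop stop words, then flatten all the lists into one Counter
    return Counter(word_lists.apply(
        lambda x: [y for y in x if y not in sw]).sum())

c_2015 = word_counts(undf, 2015)
c_2015.most_common(10)

Note that summing a column of lists concatenates them into one long list, which is convenient but slow for a large corpus; itertools.chain.from_iterable would scale better.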

8.4.1. Practice

  1. Redo the analysis of the most common and least common words for 2015.

  2. Normalize the data so that you are looking at percentages, not raw counts.

  3. Build a graph to compare 1970 and 2015.

  4. Look at the documentation for the wordcloud package. Make a word cloud for both 1970 and 2015.
