8.6. Graphing Relationships Between Data

One thing that we might expect to come out of a bunch of people at the United Nations making speeches is that one country might mention another country. In fact, this probably happens quite frequently. In this section, we are interested in how we might visualize the relationship of one country talking about another country. We will do this in two ways: first as a “heatmap”, second as a graph.

8.6.1. Preparing the Data

Both of these visualizations rely on us processing the data in the speeches, to create pairs of countries based on one country mentioning the other in a speech. In fact, what we are going for will be a table that looks like this.

speaking_c

referenced_c

ref_count

USA

CAN

1234

USA

KOR

987

The columns are:

  • speaking_c: The country doing the speaking

  • referenced_c: The country that is referenced

  • ref_count: The number of times a reference is made over all of the speeches

When we are all done, we will put them together into a matrix that will have the country codes as the rows. The columns and the cells will contain the count of the number of references.

This project might sound a bit ambitious to tackle all at once, so it’s probably a good idea to limit the number of countries and the number of years to something more manageable, and more importantly, something we can check by hand. So, let’s limit our countries to the United States, Canada, Cuba, and Mexico for the years 2014 and 2015. The corresponding three-letter country codes are USA, CAN, CUB, and MEX.

Next, we’ll want to combine all of the speeches made by each country into one long string. You can arrange it so that the index of the resulting series is the three-letter country code.

Of course, countries do not refer to each other in speeches by their country codes, so you will want to add country name as a new column to the DataFrame. By now, you know of many sources for country codes, so I’ll leave that to you to choose whatever strategy you want to add country name.

At this point, you should have a DataFrame that looks like this.

text country
code
CAN It is both \nan honour and a pleasure for me t... Canada
CUB We live in a globalized world that is moving t... Cuba
MEX As \nPresident of Mexico, it is a high honour ... Mexico
USA We come together at a \ncrossroads between war... United States of America

You might think that counting the number of times Mexico refers to Cuba will be hard, but it is actually fairly straightforward, with only one line of code. See if you can think of it before looking at the solution.

What about counting the number of times that ALL the countries mention Cuba? Your first thought might be to write a for loop, but you don’t need to do that. (Remember the str object that we can use with a Series.)

The answer should look like this.

code
CAN     0
CUB    27
MEX     3
USA     7
Name: text, dtype: int64

This tells us that Canada did not mention Cuba at all in 2014 or 2015. Cuba referred to itself 27 times, Mexico referred to Cuba 3 times, and the United States referred to Cuba 7 times.

This feels like we are almost there! If we can convert the above result into a DataFrame, and add CUB as the referenced_c, column we could repeat this for each country and concatenate all of the small DataFrames together into one large DataFrame.

Hint: Use pd.concat. Contrary to previous advice, it is very difficult to do this without iterating over the rows of the data frame. If you can do it without a for loop, please tell your instructor!

Your initial result should look like this.

text referenced_c
code
CAN 40 CAN
CUB 0 CAN
MEX 0 CAN
USA 0 CAN
CAN 0 CUB
CUB 27 CUB
MEX 3 CUB
USA 7 CUB
CAN 0 MEX
CUB 0 MEX
MEX 20 MEX
USA 0 MEX
CAN 0 USA
CUB 2 USA
MEX 0 USA
USA 1 USA

Admittedly, this table is a bit hard to read in this format. It is much easier to read if we use our pivoting skills to make a table like this.

referenced_c CAN CUB MEX USA
speaking_c
CAN 40 0 0 0
CUB 0 27 0 2
MEX 0 3 20 0
USA 0 7 0 1

Challenge: Another way to go about this is to start by creating a DataFrame that looks like this.

code_3 CAN CUB MEX USA
code_3
CAN 40 0 0 0
CUB 0 27 3 7
MEX 0 0 20 0
USA 0 2 0 1

You will notice that this is flipped from our original, but we can easily fix that later. The challenge is to see if you can do it with just three lines of code.

Check Your Understanding

8.6.2. Visualizing the Relationships with a Heatmap

We will now look at a way to get a better visual representation of the table we have built, first using a heatmap and then using a graph.

The result we are after for our 2014, 2015 dataset looks like this.

Heatmap of how often USA, Mexico, Cuba and Canada are refered to by eachother.

With the narrow representation of the data, it is easy to have Altair make a heatmap using using a mark_bar and encoding y axis as the speaking_c, the x axis as the referenced_c, and the color as ref_count.

alt.Chart(narrow_test, height=200, width=200).mark_rect().encode(
    x='speaking_c:O',
    y='referenced_c:O',
    color='ref_count:Q'
)

The graph immediately visualizes that very few countries seem to reference the United States. This seems a bit strange…

Q-3: Can you explain why the United States has so few references? Is it a bug in our code? Is there something else going on? How can we fix it?

Another issue that the graph brings up is the problem with using the raw counts. Some countries are quite wordy, and others are less so, therefore it would be better to keep track of the percentages. That is, of all the countries referenced, Mexico references itself 87% of the time and Cuba 13% of the time.

Let’s iterate on this analysis and see what we learn.

  1. Update the country name for the USA to be United States instead of United States of America.

  2. Make the values for each country percentage based.

Your new heatmap should look like this.

Heatmap of how often the USA, Mexico, Cuba and Canada are referred to by each other.

Now, try to make your heatmap for these countries across all years, then move on to making a heatmap for all countries across all years.

If you inspect the data, you will see that many of the problem country names follow a pattern of name (something in parens). You can fix a bunch of these by replacing the name that has the parentheses with just the name. The str.extract function will be really useful to solve this.

To make a graph of all of the countries is a little overwhelming. So you may want to narrow it down to a group of approximately 12 related countries, just to get something a little more interesting and interpretable. For example, use one of our earlier datasets to get all of the three-letter country codes for countries in the same region.

You should also take a moment to step back and reflect on how we have built this in an incremental fashion, but how it continues to work at full scale. This is a very satisfying part of programming and data analysis! You have to enjoy your victories while you can.

8.6.3. Visualizing the Relationships with a Graph

The good news is that we have already done most of the hard work in the last section. This section will be all about how we can visualize that work in a different way. For this visualization, we are going to represent each country with a circle, and when one country talks about another, we’ll represent that by a line between the two circles. Our small example would look like this.

Digraph of USA, Mexico, Cuba and Canada based on how often they refer to each other. Mexico, USA and Cuba form a bidirectional line between each other while Canada points to itself.

The arrows on the graph indicate which country is referencing another country. Formally, we call the ovals with the country names nodes, and the arrows connecting them edges. One of the most common ways that computer scientists and mathematicians represent a graph is called an adjacency matrix. Don’t worry if this sounds daunting, you have actually already built an adjacency matrix!

referenced_c CAN CUB MEX USA
speaking_c
CAN 40 0 0 0
CUB 0 27 0 2
MEX 0 3 20 0
USA 0 7 0 1

In an adjacency matrix, the cells indicate if there is an edge from the row node to the column node. The values in the cells are often used to represent a weight or cost to go from one node to the other. A 0 in the cell indicates that there is no relationship.

A second common way to represent a graph is through an edge list. Our narrow representation that we built originally for this project fits that description perfectly. Even the names we chose for the columns (speaking_c, referenced_c) suggest a graph like relationship.

There are two graph packages we can use: networkx and graphviz. It’s not clear that one is preferable over the other; each has some strengths and weaknesses and in fact they can be used together to some extent. graphviz may be a little easier to use, since the file format is easy to edit, and can draw pretty graphs out of the box. The graph above was drawn using graphviz. You will need to install both networkx and graphviz on your computer.

Both packages are well documented.

Let’s look at some example code that shows how easy it was to build the graph above.

from graphviz import Digraph
g = Digraph()
g.edge('USA', 'Mexico')
g.edge('Mexico', 'Cuba')
g.edge('Cuba', 'USA')
g.edge('USA', 'Cuba')
g.edge('Canada', 'Canada')
g

Using networkx, we can build the graph directly from the DataFrame, but the graph produced is not very aesthetically pleasing.

g = nx.from_pandas_edgelist(narrow_test[narrow_test.ref_count > 0],
                        'speaking_c',
                        'referenced_c',
                        edge_attr='ref_count',
                       create_using=nx.DiGraph)
pos = graphvix_layout(g)
nx.draw(g, pos)
nx.draw_networkx_labels(g, pos)

The above produces a rather unattractive graph.

A digraph of the Mexico, USA, Cuba, and Canada set as nodes. Three undirected edges connect Mexico, USA and Cuba. Canada is not connected to any other node. All nodes are coloured Red.

The graph is missing the arrows, the text doesn’t fit, and the bright red is a bit alarming for no good reason. The layout is also not very easy to understand. We can immediately do much better by saving the graph we created with networkx as a dot file and then reading it back again and letting graphviz render the graph for us.

from networkx.drawing.nx_agraph import write_dot
from graphviz import Source

write_dot(g, 'mydots.dot')
s = Source.from_file('mydots.dot')
s

This produces a much nicer looking graph.

Directed digraph of USA, Mexico, Cuba, and Canada based on how often they refer to each other. Mexico has an edge pointing to Cuba and USA. Cuba has an edge pointing to the USA. USA has an edge pointing to Cuba. Canada has no edges connecting it to any other country. All four countries have edges pointing to themselves.

As with many tools, it’s easy to get 80% done in a pretty quick way, but if you want to make a graph worthy of a polished presentation, that last 20% can take some work. If we want to clean up the labels on the nodes to use the real names of the countries and add labels to the edges, we’ll have to combine what we have learned from the above examples and add our edges to a graphviz graph manually.

Pick a subregion to focus on, and build a graph where you label the edges with the fraction of times mentioned, using the real name of the country as the title of each node.

8.6.4. Projects for Further Exploration

  • Graph visualizations also lend themselves to literature. Check out this visualization of the interactions between the characters in the Lord of the Rings. You could make a similar visualization of a book. Project Gutenberg offers over 58,000 books that you are free to use for nearly any purpose.

  • Since graphing each country of the world individually is a bit difficult, build a heatmap or graph of how the countries within each subregion reference each other. There are about 22 sub-regions in the country_codes data file, which is quite manageable.

  • Find or create a group of topics and build a heatmap or a graph to visualize which countries or regions are most interested in those topics. We are defining “interest” to be somehow related to the number of times those topics come up in their UN speeches.

  • Challenge: A chord diagram is another great way to visualize relationships. Create a chord diagram to visualize the relationships between countries.

Lesson Feedback

    During this lesson I was primarily in my...
  • 1. Comfort Zone
  • 2. Learning Zone
  • 3. Panic Zone
    Completing this lesson took...
  • 1. Very little time
  • 2. A reasonable amount of time
  • 3. More time than is reasonable
    Based on my own interests and needs, the things taught in this lesson...
  • 1. Don't seem worth learning
  • 2. May be worth learning
  • 3. Are definitely worth learning
    For me to master the things taught in this lesson feels...
  • 1. Definitely within reach
  • 2. Within reach if I try my hardest
  • 3. Out of reach no matter how hard I try
You have attempted of activities on this page