6.2. World Factbook: Exploratory Data Analysis¶
6.2.1. Loading data into a DataFrame from a CSV file¶
The csv file is one of the most common ways you will find data. CSV stands for comma separated values and allows us to share data files in a simple text format. The data we will use to get started with Pandas is the data about countries we used in the spreadsheet module. You can open a CSV file in any text editor, but its not particularly easy to read. But because of its structure it is easy to parse. The first few lines of the raw csv file for this project look like this:
Country,Ctry,Code,CodeNum,Region,Population,Area,Pop. Density,Coastline,Net migration,Infant mortality,GDP,Literacy,Phones,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
Afghanistan,Afghanistan,AFG,4,ASIA (EX. NEAR EAST) ,31056997,647500,48.0,0.00,23.06,163.07,700,36.0,3.2,12.13,0.22,87.65,1,46.6,20.34,0.38,0.24,0.38
Albania ,Albania,ALB,8,EASTERN EUROPE ,3581655,28748,124.6,1.26,-4.93,21.52,4500,86.5,71.2,21.09,4.42,74.49,3,15.11,5.22,0.232,0.188,0.579
Algeria ,Algeria,DZA,12,NORTHERN AFRICA ,32930091,2381740,13.8,0.04,-0.39,31,6000,70.0,78.1,3.22,0.25,96.53,1,17.14,4.61,0.101,0.6,0.298
You may have some experience with reading and parsing CSV files on your own with Python. If not you may wish to have a quick review
%matplotlib inline
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import psycopg2
import textatistic
import seaborn as sbn
from altair import Chart, X, Y, Color, Scale
import altair as alt
from vega_datasets import data
import requests
from bs4 import BeautifulSoup
matplotlib.style.use('ggplot')
# for plotly py.offline.init_notebook_mode()
Meanwhile, we will make use of one of the many data reading functions
pandas provides for us read_csv
wd = pd.read_csv('world_countries.csv')
wd.head()
Country | Ct | Code | CodeNum | Region | Population | Area | Pop. Density | Coastline | Net migration | ... | Phones | Arable | Crops | Other | Climate | Birthrate | Deathrate | Agriculture | Industry | Service | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | Afghanistan | AFG | 4 | ASIA (EX. NEAR EAST) | 31056997 | 647500 | 48.0 | 0.00 | 23.06 | ... | 3.2 | 12.13 | 0.22 | 87.65 | 1.0 | 46.60 | 20.34 | 0.380 | 0.240 | 0.380 |
1 | Albania | Albania | ALB | 8 | EASTERN EUROPE | 3581655 | 28748 | 124.6 | 1.26 | -4.93 | ... | 71.2 | 21.09 | 4.42 | 74.49 | 3.0 | 15.11 | 5.22 | 0.232 | 0.188 | 0.579 |
2 | Algeria | Algeria | DZA | 12 | NORTHERN AFRICA | 32930091 | 2381740 | 13.8 | 0.04 | -0.39 | ... | 78.1 | 3.22 | 0.25 | 96.53 | 1.0 | 17.14 | 4.61 | 0.101 | 0.600 | 0.298 |
3 | American Samoa | American Samoa | ASM | 16 | OCEANIA | 57794 | 199 | 290.4 | 58.29 | -20.71 | ... | 259.5 | 10.00 | 15.00 | 75.00 | 2.0 | 22.46 | 3.27 | NaN | NaN | NaN |
4 | Andorra | Andorra | AND | 20 | WESTERN EUROPE | 71201 | 468 | 152.1 | 0.00 | 6.60 | ... | 497.2 | 2.22 | 0.00 | 97.78 | 3.0 | 8.71 | 6.25 | NaN | NaN | NaN |
5 rows × 23 columns
6.2.2. Describing the Data¶
Country
Area Square Miles
Population Density per square mile
Coastline coast/area ratio
Net migration
Infant mortaility per 1000 births
GDP $ per capita
Literacy %
Phones per 1000
Arable land %
Crops %
other %
Climate
Birthrate
Deathrate
Agriculture % GDP
Inustry % GDP
Service % GDP
The Climate numbers are as follows:
Dry tropical or tundra and ice
Wet tropical
Temperate humid subtropical and temperate continental
Dry hot summers and wet winters
Somehow some values of 1.5 and 2.5 have crept in, so we will assume that 1.5 is mixed tropical and 2.5 mixed.
wd.describe()
CodeNum | Population | Area | Pop. Density | Coastline | Net migration | Infant mortality | GDP | Literacy | Phones | Arable | Crops | Other | Climate | Birthrate | Deathrate | Agriculture | Industry | Service | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 225.000000 | 2.250000e+02 | 2.250000e+02 | 225.000000 | 225.000000 | 222.000000 | 222.000000 | 224.000000 | 209.000000 | 221.000000 | 223.000000 | 223.000000 | 223.000000 | 203.000000 | 222.000000 | 221.000000 | 210.000000 | 209.000000 | 210.000000 |
mean | 436.213333 | 2.897847e+07 | 6.035169e+05 | 362.911111 | 21.304089 | 0.017838 | 35.635180 | 9770.089286 | 82.838278 | 236.435294 | 13.715247 | 4.425695 | 81.858700 | 2.130542 | 21.993604 | 9.290045 | 0.151710 | 0.282722 | 0.564395 |
std | 254.713527 | 1.183891e+08 | 1.797370e+06 | 1650.160243 | 72.591840 | 4.906187 | 35.523302 | 10057.808157 | 19.722173 | 228.942889 | 13.057554 | 8.268356 | 16.029195 | 0.697558 | 11.147278 | 4.986086 | 0.147199 | 0.138935 | 0.166357 |
min | 4.000000 | 7.026000e+03 | 2.000000e+00 | 0.000000 | 0.000000 | -20.990000 | 2.290000 | 500.000000 | 17.600000 | 0.200000 | 0.000000 | 0.000000 | 33.330000 | 1.000000 | 7.290000 | 2.290000 | 0.000000 | 0.020000 | 0.062000 |
25% | 214.000000 | 4.361310e+05 | 5.128000e+03 | 29.000000 | 0.100000 | -0.962500 | 8.070000 | 1900.000000 | 70.600000 | 37.200000 | 3.160000 | 0.190000 | 72.825000 | 2.000000 | 12.597500 | 5.980000 | 0.038000 | 0.190000 | 0.427750 |
50% | 434.000000 | 5.042920e+06 | 8.836100e+04 | 77.400000 | 0.730000 | 0.000000 | 21.000000 | 5700.000000 | 92.500000 | 176.200000 | 10.380000 | 1.010000 | 86.070000 | 2.000000 | 18.750000 | 8.100000 | 0.099500 | 0.270000 | 0.566500 |
75% | 654.000000 | 1.765484e+07 | 4.465500e+05 | 183.500000 | 10.320000 | 0.965000 | 56.095000 | 15775.000000 | 98.000000 | 394.400000 | 20.000000 | 4.425000 | 95.470000 | 3.000000 | 29.645000 | 10.620000 | 0.223000 | 0.342000 | 0.677500 |
max | 894.000000 | 1.313974e+09 | 1.707520e+07 | 16271.500000 | 870.660000 | 23.060000 | 191.190000 | 55100.000000 | 100.000000 | 1035.600000 | 62.110000 | 50.680000 | 100.000000 | 4.000000 | 50.730000 | 29.740000 | 0.769000 | 0.906000 | 0.954000 |
6.2.3. Visualizing Distribution with Histograms¶
c = Chart(wd) # make a chart
m = c.mark_bar() # set the mark -- returns a new Chart
e = m.encode(X('Birthrate',bin=True),y='count()') # set the encoding
e.display()

We can shortcut a lot of what we did above into a single line because once we have created a mark there is really nothing more to do with it besides add the encoding. Because the methods are all cleverly designed to return the proper object we can string all of the calls above into a single line. We also do not need to explicitly call display because Altair returns an object that the Jupyter environment knows how to display automatically.
Chart(wd).mark_bar().encode(x=X('Birthrate', bin=True), y='count()')

6.2.4. Practice¶
Q-1: What is the range of values for the tallest bar when creating a histogram of the literacy rate? lower: upper:
Q-2: What is the range of values for the tallest bar when creating a histogram of the fraction of the economy due to service? lower: upper:
Q-3: Approximately how many countries (to the nearest 5) have between 90 and 100% of their economy based on service?
6.2.5. Scatter Plots for discovering relationships¶
Now lets make a simple scatter plot of area versus population of the countries.
Chart(wd).mark_point().encode(x='Population', y='Area', tooltip='Country')

Thats not a very satisfying graph. But it does make us want to focus more on the lower left corner. Let’s redo the graph but focus on the countries with a population under 150 million and an area under 4 million. Lets start with the first part
To do this we will create a new DataFrame where we focus on the countries with populations less than 150 million and areas less than 4 million. Pandas makes this really easy with its querying power.
The statement below produces a Series of boolean values. These boolean values are used to index the data frame and only the rows corresponding to True values are returned in the result.
(wd.Population < 150000000).head(20)
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
19 True
Name: Population, dtype: bool
To be a bit more dramatic lets look at the countries of less than 150,000
wd[wd.Population < 150000]
Country | Ct | Code | CodeNum | Region | Population | Area | Pop. Density | Coastline | Net migration | ... | Phones | Arable | Crops | Other | Climate | Birthrate | Deathrate | Agriculture | Industry | Service | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | American Samoa | American Samoa | ASM | 16 | OCEANIA | 57794 | 199 | 290.4 | 58.29 | -20.71 | ... | 259.5 | 10.00 | 15.00 | 75.00 | 2.0 | 22.46 | 3.27 | NaN | NaN | NaN |
4 | Andorra | Andorra | AND | 20 | WESTERN EUROPE | 71201 | 468 | 152.1 | 0.00 | 6.60 | ... | 497.2 | 2.22 | 0.00 | 97.78 | 3.0 | 8.71 | 6.25 | NaN | NaN | NaN |
6 | Anguilla | Anguilla | AIA | 660 | LATIN AMER. & CARIB | 13477 | 102 | 132.1 | 59.80 | 10.76 | ... | 460.0 | 0.00 | 0.00 | 100.00 | 2.0 | 14.17 | 5.34 | 0.040 | 0.180 | 0.780 |
7 | Antigua & Barbuda | Antigua & Barbuda | ATA | 10 | LATIN AMER. & CARIB | 69108 | 443 | 156.0 | 34.54 | -6.15 | ... | 549.9 | 18.18 | 4.55 | 77.27 | 2.0 | 16.93 | 5.37 | 0.038 | 0.220 | 0.743 |
10 | Aruba | Aruba | ABW | 533 | LATIN AMER. & CARIB | 71891 | 193 | 372.5 | 35.49 | 0.00 | ... | 516.1 | 10.53 | 0.00 | 89.47 | 2.0 | 11.03 | 6.68 | 0.004 | 0.333 | 0.663 |
22 | Bermuda | Bermuda | BMU | 60 | NORTHERN AMERICA | 65773 | 53 | 1241.0 | 194.34 | 2.49 | ... | 851.4 | 20.00 | 0.00 | 80.00 | 2.0 | 11.40 | 7.74 | 0.010 | 0.100 | 0.890 |
28 | British Virgin Is. | British Virgin Is. | IOT | 86 | LATIN AMER. & CARIB | 23098 | 153 | 151.0 | 52.29 | 10.01 | ... | 506.5 | 20.00 | 6.67 | 73.33 | 2.0 | 14.89 | 4.42 | 0.018 | 0.062 | 0.920 |
38 | Cayman Islands | Cayman Islands | CYM | 136 | LATIN AMER. & CARIB | 45436 | 262 | 173.4 | 61.07 | 18.75 | ... | 836.3 | 3.85 | 0.00 | 96.15 | 2.0 | 12.74 | 4.89 | 0.014 | 0.032 | 0.954 |
47 | Cook Islands | Cook Islands | COK | 184 | OCEANIA | 21388 | 240 | 89.1 | 50.00 | NaN | ... | 289.9 | 17.39 | 13.04 | 69.57 | 2.0 | 21.00 | NaN | 0.151 | 0.096 | 0.753 |
56 | Dominica | Dominica | DMA | 212 | LATIN AMER. & CARIB | 68910 | 754 | 91.4 | 19.63 | -13.87 | ... | 304.8 | 6.67 | 20.00 | 73.33 | 2.0 | 15.27 | 6.73 | 0.177 | 0.328 | 0.495 |
66 | Faroe Islands | Faroe Islands | FRO | 234 | WESTERN EUROPE | 47246 | 1399 | 33.8 | 79.84 | 1.41 | ... | 503.8 | 2.14 | 0.00 | 97.86 | NaN | 14.05 | 8.70 | 0.270 | 0.110 | 0.620 |
77 | Gibraltar | Gibraltar | GIB | 292 | WESTERN EUROPE | 27928 | 7 | 3989.7 | 171.43 | 0.00 | ... | 877.7 | 0.00 | 0.00 | 100.00 | NaN | 10.74 | 9.31 | NaN | NaN | NaN |
79 | Greenland | Greenland | GRL | 304 | NORTHERN AMERICA | 56361 | 2166086 | 0.0 | 2.04 | -8.37 | ... | 448.9 | 0.00 | 0.00 | 100.00 | 1.0 | 15.93 | 7.84 | NaN | NaN | NaN |
80 | Grenada | Grenada | GRD | 308 | LATIN AMER. & CARIB | 89703 | 344 | 260.8 | 35.17 | -13.92 | ... | 364.5 | 5.88 | 29.41 | 64.71 | 2.0 | 22.08 | 6.88 | 0.054 | 0.180 | 0.766 |
84 | Guernsey | Guernsey | GGY | 831 | WESTERN EUROPE | 65409 | 78 | 838.6 | 64.10 | 3.84 | ... | 842.4 | NaN | NaN | NaN | 3.0 | 8.81 | 10.01 | 0.030 | 0.100 | 0.870 |
98 | Isle of Man | Isle of Man | IMN | 833 | WESTERN EUROPE | 75441 | 572 | 131.9 | 27.97 | 5.36 | ... | 676.0 | 9.00 | 0.00 | 91.00 | 3.0 | 11.05 | 11.19 | 0.010 | 0.130 | 0.860 |
103 | Jersey | Jersey | JEY | 832 | WESTERN EUROPE | 91084 | 116 | 785.2 | 60.34 | 2.76 | ... | 811.3 | 0.00 | 0.00 | 100.00 | 3.0 | 9.30 | 9.28 | 0.050 | 0.020 | 0.930 |
107 | Kiribati | Kiribati | KIR | 296 | OCEANIA | 105432 | 811 | 130.0 | 140.94 | 0.00 | ... | 42.7 | 2.74 | 50.68 | 46.58 | 2.0 | 30.65 | 8.26 | 0.089 | 0.242 | 0.668 |
118 | Liechtenstein | Liechtenstein | LIE | 438 | WESTERN EUROPE | 33987 | 160 | 212.4 | 0.00 | 4.85 | ... | 585.5 | 25.00 | 0.00 | 75.00 | 4.0 | 10.21 | 7.18 | 0.060 | 0.390 | 0.550 |
129 | Marshall Islands | Marshall Islands | MHL | 584 | OCEANIA | 60422 | 11854 | 5.1 | 3.12 | -6.04 | ... | 91.2 | 16.67 | 38.89 | 44.44 | 2.0 | 33.05 | 4.78 | 0.317 | 0.149 | 0.534 |
135 | Micronesia, Fed. St. | Micronesia, Fed. St. | FSM | 583 | OCEANIA | 108004 | 702 | 153.9 | 870.66 | -20.99 | ... | 114.8 | 5.71 | 45.71 | 48.58 | 2.0 | 24.68 | 4.75 | 0.289 | 0.152 | 0.559 |
137 | Monaco | Monaco | MCO | 492 | WESTERN EUROPE | 32543 | 2 | 16271.5 | 205.00 | 7.75 | ... | 1035.6 | 0.00 | 0.00 | 100.00 | NaN | 9.19 | 12.91 | 0.170 | NaN | NaN |
139 | Montserrat | Montserrat | MSR | 500 | LATIN AMER. & CARIB | 9439 | 102 | 92.5 | 39.22 | 0.00 | ... | NaN | 20.00 | 0.00 | 80.00 | 2.0 | 17.59 | 7.10 | NaN | NaN | NaN |
143 | Nauru | Nauru | NRU | 520 | OCEANIA | 13287 | 21 | 632.7 | 142.86 | 0.00 | ... | 143.0 | 0.00 | 0.00 | 100.00 | 2.0 | 24.76 | 6.70 | NaN | NaN | NaN |
152 | N. Mariana Islands | N. Mariana Islands | MMR | 104 | OCEANIA | 82459 | 477 | 172.9 | 310.69 | 9.61 | ... | 254.7 | 13.04 | 4.35 | 82.61 | 2.0 | 19.43 | 2.29 | NaN | NaN | NaN |
156 | Palau | Palau | PLW | 585 | OCEANIA | 20579 | 458 | 44.9 | 331.66 | 2.85 | ... | 325.6 | 8.70 | 4.35 | 86.95 | 2.0 | 18.03 | 6.80 | 0.062 | 0.120 | 0.818 |
170 | Saint Helena | Saint Helena | BLM | 652 | SUB-SAHARAN AFRICA | 7502 | 413 | 18.2 | 14.53 | 0.00 | ... | 293.3 | 12.90 | 0.00 | 87.10 | NaN | 12.13 | 6.53 | NaN | NaN | NaN |
171 | Saint Kitts & Nevis | Saint Kitts & Nevis | SHN | 654 | LATIN AMER. & CARIB | 39129 | 261 | 149.9 | 51.72 | -7.11 | ... | 638.9 | 19.44 | 2.78 | 77.78 | 2.0 | 18.02 | 8.33 | 0.035 | 0.258 | 0.707 |
173 | St Pierre & Miquelon | St Pierre & Miquelon | LKA | 144 | NORTHERN AMERICA | 7026 | 242 | 29.0 | 49.59 | -4.86 | ... | 683.2 | 13.04 | 0.00 | 86.96 | NaN | 13.52 | 6.83 | NaN | NaN | NaN |
174 | Saint Vincent and the Grenadines | Saint Vincent and the Grenadines | VCT | 670 | LATIN AMER. & CARIB | 117848 | 389 | 303.0 | 21.59 | -7.64 | ... | 190.9 | 17.95 | 17.95 | 64.10 | 2.0 | 16.18 | 5.98 | 0.100 | 0.260 | 0.640 |
176 | San Marino | San Marino | SMR | 674 | WESTERN EUROPE | 29251 | 61 | 479.5 | 0.00 | 10.98 | ... | 704.3 | 16.67 | 0.00 | 83.33 | NaN | 10.02 | 8.17 | NaN | NaN | NaN |
181 | Seychelles | Seychelles | SYC | 690 | SUB-SAHARAN AFRICA | 81541 | 455 | 179.2 | 107.91 | -5.69 | ... | 262.4 | 2.22 | 13.33 | 84.45 | 2.0 | 16.03 | 6.29 | 0.032 | 0.304 | 0.665 |
202 | Tonga | Tonga | TON | 776 | OCEANIA | 114689 | 748 | 153.3 | 56.02 | 0.00 | ... | 97.7 | 23.61 | 43.06 | 33.33 | 2.0 | 25.37 | 5.28 | 0.230 | 0.270 | 0.500 |
207 | Turks & Caicos Is | Turks & Caicos Is | TKM | 795 | LATIN AMER. & CARIB | 21152 | 430 | 49.2 | 90.47 | 11.68 | ... | 269.5 | 2.33 | 0.00 | 97.67 | 2.0 | 21.84 | 4.21 | NaN | NaN | NaN |
208 | Tuvalu | Tuvalu | TUV | 798 | OCEANIA | 11810 | 26 | 454.2 | 92.31 | 0.00 | ... | 59.3 | 0.00 | 0.00 | 100.00 | 2.0 | 22.18 | 7.11 | 0.166 | 0.272 | 0.562 |
219 | Virgin Islands | Virgin Islands | VIR | 850 | LATIN AMER. & CARIB | 108605 | 1910 | 56.9 | 9.84 | -8.94 | ... | 652.8 | 11.76 | 2.94 | 85.30 | 2.0 | 13.96 | 6.43 | 0.010 | 0.190 | 0.800 |
220 | Wallis and Futuna | Wallis and Futuna | WLF | 876 | OCEANIA | 16025 | 274 | 58.5 | 47.08 | NaN | ... | 118.6 | 5.00 | 25.00 | 70.00 | 2.0 | NaN | NaN | NaN | NaN | NaN |
37 rows × 23 columns
Now lets graph these countries. The easiest way to do this is to plug the query right into the call to create a Chart rather than assigning it to a variable first.
Chart(wd[wd.Population < 150000]).mark_point().encode(x='Population', y='Area', tooltip='Country').interactive()

How interesting! One country has such a large value that it pushes all the others down. We added a tooltip parameter so that if you hover over that point you will see it is Greenland! Lots of land area but not too many people. There are large universities that have more people than the country of Greenland. Lets improve out query to focus on area less than 200,000
We can do more complicated boolean expressions by using the |
(logical or) and &
(logical and) operators. Normally in Python these
two operators are used for bitwise or and bitwise and. So we can create
a more complicated boolean expression to limit our DataFrame in both
directions.
wd[(wd.Population < 150000) & (wd.Area < 200000)]
Country | Ct | Code | CodeNum | Region | Population | Area | Pop. Density | Coastline | Net migration | ... | Phones | Arable | Crops | Other | Climate | Birthrate | Deathrate | Agriculture | Industry | Service | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | American Samoa | American Samoa | ASM | 16 | OCEANIA | 57794 | 199 | 290.4 | 58.29 | -20.71 | ... | 259.5 | 10.00 | 15.00 | 75.00 | 2.0 | 22.46 | 3.27 | NaN | NaN | NaN |
4 | Andorra | Andorra | AND | 20 | WESTERN EUROPE | 71201 | 468 | 152.1 | 0.00 | 6.60 | ... | 497.2 | 2.22 | 0.00 | 97.78 | 3.0 | 8.71 | 6.25 | NaN | NaN | NaN |
6 | Anguilla | Anguilla | AIA | 660 | LATIN AMER. & CARIB | 13477 | 102 | 132.1 | 59.80 | 10.76 | ... | 460.0 | 0.00 | 0.00 | 100.00 | 2.0 | 14.17 | 5.34 | 0.040 | 0.180 | 0.780 |
7 | Antigua & Barbuda | Antigua & Barbuda | ATA | 10 | LATIN AMER. & CARIB | 69108 | 443 | 156.0 | 34.54 | -6.15 | ... | 549.9 | 18.18 | 4.55 | 77.27 | 2.0 | 16.93 | 5.37 | 0.038 | 0.220 | 0.743 |
10 | Aruba | Aruba | ABW | 533 | LATIN AMER. & CARIB | 71891 | 193 | 372.5 | 35.49 | 0.00 | ... | 516.1 | 10.53 | 0.00 | 89.47 | 2.0 | 11.03 | 6.68 | 0.004 | 0.333 | 0.663 |
22 | Bermuda | Bermuda | BMU | 60 | NORTHERN AMERICA | 65773 | 53 | 1241.0 | 194.34 | 2.49 | ... | 851.4 | 20.00 | 0.00 | 80.00 | 2.0 | 11.40 | 7.74 | 0.010 | 0.100 | 0.890 |
28 | British Virgin Is. | British Virgin Is. | IOT | 86 | LATIN AMER. & CARIB | 23098 | 153 | 151.0 | 52.29 | 10.01 | ... | 506.5 | 20.00 | 6.67 | 73.33 | 2.0 | 14.89 | 4.42 | 0.018 | 0.062 | 0.920 |
38 | Cayman Islands | Cayman Islands | CYM | 136 | LATIN AMER. & CARIB | 45436 | 262 | 173.4 | 61.07 | 18.75 | ... | 836.3 | 3.85 | 0.00 | 96.15 | 2.0 | 12.74 | 4.89 | 0.014 | 0.032 | 0.954 |
47 | Cook Islands | Cook Islands | COK | 184 | OCEANIA | 21388 | 240 | 89.1 | 50.00 | NaN | ... | 289.9 | 17.39 | 13.04 | 69.57 | 2.0 | 21.00 | NaN | 0.151 | 0.096 | 0.753 |
56 | Dominica | Dominica | DMA | 212 | LATIN AMER. & CARIB | 68910 | 754 | 91.4 | 19.63 | -13.87 | ... | 304.8 | 6.67 | 20.00 | 73.33 | 2.0 | 15.27 | 6.73 | 0.177 | 0.328 | 0.495 |
66 | Faroe Islands | Faroe Islands | FRO | 234 | WESTERN EUROPE | 47246 | 1399 | 33.8 | 79.84 | 1.41 | ... | 503.8 | 2.14 | 0.00 | 97.86 | NaN | 14.05 | 8.70 | 0.270 | 0.110 | 0.620 |
77 | Gibraltar | Gibraltar | GIB | 292 | WESTERN EUROPE | 27928 | 7 | 3989.7 | 171.43 | 0.00 | ... | 877.7 | 0.00 | 0.00 | 100.00 | NaN | 10.74 | 9.31 | NaN | NaN | NaN |
80 | Grenada | Grenada | GRD | 308 | LATIN AMER. & CARIB | 89703 | 344 | 260.8 | 35.17 | -13.92 | ... | 364.5 | 5.88 | 29.41 | 64.71 | 2.0 | 22.08 | 6.88 | 0.054 | 0.180 | 0.766 |
84 | Guernsey | Guernsey | GGY | 831 | WESTERN EUROPE | 65409 | 78 | 838.6 | 64.10 | 3.84 | ... | 842.4 | NaN | NaN | NaN | 3.0 | 8.81 | 10.01 | 0.030 | 0.100 | 0.870 |
98 | Isle of Man | Isle of Man | IMN | 833 | WESTERN EUROPE | 75441 | 572 | 131.9 | 27.97 | 5.36 | ... | 676.0 | 9.00 | 0.00 | 91.00 | 3.0 | 11.05 | 11.19 | 0.010 | 0.130 | 0.860 |
103 | Jersey | Jersey | JEY | 832 | WESTERN EUROPE | 91084 | 116 | 785.2 | 60.34 | 2.76 | ... | 811.3 | 0.00 | 0.00 | 100.00 | 3.0 | 9.30 | 9.28 | 0.050 | 0.020 | 0.930 |
107 | Kiribati | Kiribati | KIR | 296 | OCEANIA | 105432 | 811 | 130.0 | 140.94 | 0.00 | ... | 42.7 | 2.74 | 50.68 | 46.58 | 2.0 | 30.65 | 8.26 | 0.089 | 0.242 | 0.668 |
118 | Liechtenstein | Liechtenstein | LIE | 438 | WESTERN EUROPE | 33987 | 160 | 212.4 | 0.00 | 4.85 | ... | 585.5 | 25.00 | 0.00 | 75.00 | 4.0 | 10.21 | 7.18 | 0.060 | 0.390 | 0.550 |
129 | Marshall Islands | Marshall Islands | MHL | 584 | OCEANIA | 60422 | 11854 | 5.1 | 3.12 | -6.04 | ... | 91.2 | 16.67 | 38.89 | 44.44 | 2.0 | 33.05 | 4.78 | 0.317 | 0.149 | 0.534 |
135 | Micronesia, Fed. St. | Micronesia, Fed. St. | FSM | 583 | OCEANIA | 108004 | 702 | 153.9 | 870.66 | -20.99 | ... | 114.8 | 5.71 | 45.71 | 48.58 | 2.0 | 24.68 | 4.75 | 0.289 | 0.152 | 0.559 |
137 | Monaco | Monaco | MCO | 492 | WESTERN EUROPE | 32543 | 2 | 16271.5 | 205.00 | 7.75 | ... | 1035.6 | 0.00 | 0.00 | 100.00 | NaN | 9.19 | 12.91 | 0.170 | NaN | NaN |
139 | Montserrat | Montserrat | MSR | 500 | LATIN AMER. & CARIB | 9439 | 102 | 92.5 | 39.22 | 0.00 | ... | NaN | 20.00 | 0.00 | 80.00 | 2.0 | 17.59 | 7.10 | NaN | NaN | NaN |
143 | Nauru | Nauru | NRU | 520 | OCEANIA | 13287 | 21 | 632.7 | 142.86 | 0.00 | ... | 143.0 | 0.00 | 0.00 | 100.00 | 2.0 | 24.76 | 6.70 | NaN | NaN | NaN |
152 | N. Mariana Islands | N. Mariana Islands | MMR | 104 | OCEANIA | 82459 | 477 | 172.9 | 310.69 | 9.61 | ... | 254.7 | 13.04 | 4.35 | 82.61 | 2.0 | 19.43 | 2.29 | NaN | NaN | NaN |
156 | Palau | Palau | PLW | 585 | OCEANIA | 20579 | 458 | 44.9 | 331.66 | 2.85 | ... | 325.6 | 8.70 | 4.35 | 86.95 | 2.0 | 18.03 | 6.80 | 0.062 | 0.120 | 0.818 |
170 | Saint Helena | Saint Helena | BLM | 652 | SUB-SAHARAN AFRICA | 7502 | 413 | 18.2 | 14.53 | 0.00 | ... | 293.3 | 12.90 | 0.00 | 87.10 | NaN | 12.13 | 6.53 | NaN | NaN | NaN |
171 | Saint Kitts & Nevis | Saint Kitts & Nevis | SHN | 654 | LATIN AMER. & CARIB | 39129 | 261 | 149.9 | 51.72 | -7.11 | ... | 638.9 | 19.44 | 2.78 | 77.78 | 2.0 | 18.02 | 8.33 | 0.035 | 0.258 | 0.707 |
173 | St Pierre & Miquelon | St Pierre & Miquelon | LKA | 144 | NORTHERN AMERICA | 7026 | 242 | 29.0 | 49.59 | -4.86 | ... | 683.2 | 13.04 | 0.00 | 86.96 | NaN | 13.52 | 6.83 | NaN | NaN | NaN |
174 | Saint Vincent and the Grenadines | Saint Vincent and the Grenadines | VCT | 670 | LATIN AMER. & CARIB | 117848 | 389 | 303.0 | 21.59 | -7.64 | ... | 190.9 | 17.95 | 17.95 | 64.10 | 2.0 | 16.18 | 5.98 | 0.100 | 0.260 | 0.640 |
176 | San Marino | San Marino | SMR | 674 | WESTERN EUROPE | 29251 | 61 | 479.5 | 0.00 | 10.98 | ... | 704.3 | 16.67 | 0.00 | 83.33 | NaN | 10.02 | 8.17 | NaN | NaN | NaN |
181 | Seychelles | Seychelles | SYC | 690 | SUB-SAHARAN AFRICA | 81541 | 455 | 179.2 | 107.91 | -5.69 | ... | 262.4 | 2.22 | 13.33 | 84.45 | 2.0 | 16.03 | 6.29 | 0.032 | 0.304 | 0.665 |
202 | Tonga | Tonga | TON | 776 | OCEANIA | 114689 | 748 | 153.3 | 56.02 | 0.00 | ... | 97.7 | 23.61 | 43.06 | 33.33 | 2.0 | 25.37 | 5.28 | 0.230 | 0.270 | 0.500 |
207 | Turks & Caicos Is | Turks & Caicos Is | TKM | 795 | LATIN AMER. & CARIB | 21152 | 430 | 49.2 | 90.47 | 11.68 | ... | 269.5 | 2.33 | 0.00 | 97.67 | 2.0 | 21.84 | 4.21 | NaN | NaN | NaN |
208 | Tuvalu | Tuvalu | TUV | 798 | OCEANIA | 11810 | 26 | 454.2 | 92.31 | 0.00 | ... | 59.3 | 0.00 | 0.00 | 100.00 | 2.0 | 22.18 | 7.11 | 0.166 | 0.272 | 0.562 |
219 | Virgin Islands | Virgin Islands | VIR | 850 | LATIN AMER. & CARIB | 108605 | 1910 | 56.9 | 9.84 | -8.94 | ... | 652.8 | 11.76 | 2.94 | 85.30 | 2.0 | 13.96 | 6.43 | 0.010 | 0.190 | 0.800 |
220 | Wallis and Futuna | Wallis and Futuna | WLF | 876 | OCEANIA | 16025 | 274 | 58.5 | 47.08 | NaN | ... | 118.6 | 5.00 | 25.00 | 70.00 | 2.0 | NaN | NaN | NaN | NaN | NaN |
36 rows × 23 columns
Chart(wd[(wd.Population < 150000) & (wd.Area < 200000)]).mark_point().encode(x='Population', y='Area', tooltip='Country').interactive()

OK, so maybe you have a favorite country you have visited or lived in at some point. I lived in Malta for six months, so I’m always curious about Malta. Lets see what data we have in the data frame for Malta using an equality:
wd[wd.Country == 'Malta']
Country | Ct | Code | CodeNum | Region | Population | Area | Pop. Density | Coastline | Net migration | ... | Phones | Arable | Crops | Other | Climate | Birthrate | Deathrate | Agriculture | Industry | Service |
---|
0 rows × 23 columns
Hmmm.. that seems odd that Malta would not be in the dataset. Lets try some other countries. Nothing seems to work. One common problem is that names and other strings can end up with spaces at the beginning or the end. If you do a quick try you will see that ‘Malta’ works. But that is horrible. We don’t want to have to remember to put spaces at the end of every string all the time. We should do a little data cleanup and strip those spaces.
wd[wd.Country == 'Malta ']
Country | Ct | Code | CodeNum | Region | Population | Area | Pop. Density | Coastline | Net migration | ... | Phones | Arable | Crops | Other | Climate | Birthrate | Deathrate | Agriculture | Industry | Service | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
128 | Malta | Malta | MLT | 470 | WESTERN EUROPE | 400214 | 316 | 1266.5 | 62.28 | 2.07 | ... | 505.0 | 28.13 | 3.13 | 68.74 | NaN | 10.22 | 8.1 | 0.03 | 0.23 | 0.74 |
1 rows × 23 columns
You may recall that Python has a string method called strip
that
does exactly what we want. How can we get that to apply to all of the
strings in the Series? Pandas allows us to do this using the str
attribute of the series in combination with most of the standard string
methods you know about.
wd.Country.str.strip()
0 Afghanistan
1 Albania
2 Algeria
3 American Samoa
4 Andorra
5 Angola
6 Anguilla
7 Antigua & Barbuda
8 Argentina
9 Armenia
10 Aruba
11 Australia
12 Austria
13 Azerbaijan
14 Bahamas, The
15 Bahrain
16 Bangladesh
17 Barbados
18 Belarus
19 Belgium
20 Belize
21 Benin
22 Bermuda
23 Bhutan
24 Bolivia
25 Bosnia & Herzegovina
26 Botswana
27 Brazil
28 British Virgin Is.
29 Brunei
...
195 Switzerland
196 Syria
197 Taiwan
198 Tajikistan
199 Tanzania
200 Thailand
201 Togo
202 Tonga
203 Trinidad & Tobago
204 Tunisia
205 Turkey
206 Turkmenistan
207 Turks & Caicos Is
208 Tuvalu
209 Uganda
210 Ukraine
211 United Arab Emirates
212 United Kingdom of Great Britain and Northern I...
213 United States of America
214 Uruguay
215 Uzbekistan
216 Vanuatu
217 Venezuela
218 Vietnam
219 Virgin Islands
220 Wallis and Futuna
221 Western Sahara
222 Yemen
223 Zambia
224 Zimbabwe
Name: Country, Length: 225, dtype: object
Now we can replace our original Country column with the stripped column.
wd['Country'] = wd.Country.str.strip()
wd[wd.Country == 'Malta']
Country | Ct | Code | CodeNum | Region | Population | Area | Pop. Density | Coastline | Net migration | ... | Phones | Arable | Crops | Other | Climate | Birthrate | Deathrate | Agriculture | Industry | Service | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
128 | Malta | Malta | MLT | 470 | WESTERN EUROPE | 400214 | 316 | 1266.5 | 62.28 | 2.07 | ... | 505.0 | 28.13 | 3.13 | 68.74 | NaN | 10.22 | 8.1 | 0.03 | 0.23 | 0.74 |
1 rows × 23 columns
6.2.5.1. Power Tools – Scatter Matrix¶
It would be pretty tedius to look at all the different pairs of things we might want to look at for correlation one at a time, but we can Use a scatter matrix to make life easier.
alt.Chart(wd).mark_circle().encode(
alt.X(alt.repeat("column"), type='quantitative'),
alt.Y(alt.repeat("row"), type='quantitative'),
color='Region:N'
).properties(
width=150,
height=150
).repeat(
row=['Birthrate', 'Deathrate', 'Infant mortality', 'GDP'],
column=['Birthrate', 'Deathrate', 'Infant mortality', 'GDP']
).interactive()

list(reversed(['a','b']))
['b', 'a']
6.2.6. Developing Fluency¶
Pandas will only become a part of your daily workflow when you develop fluency with the basics. You need to be able to do easy queries without having to think hard about the syntax. The only way that happens is through repetition. Lots of repetition and ideally that repetitive practice is spread out over time.
That doesn’t mean you can’t go on and do lots of much harder things, it just means it will take longer at first as you have to go back and review documentation in order to become efficient.
6.2.6.1. Practice Questions¶
What are the top 10 countries with the largest GDP?
What are the top 20 countries by Population?
What are the 10 countries with the largest net migration?
What is distribution of Argiculture, Industry, and service for the countries in Western Europe?
What are the names, population and Area of the 5 largest (by area) landlocked countries?
What are the names and population of the five most populous landlocked countries?
What what is the name and GDP of the 10 countries with the most cell phones/1000 people?
What are the 10 countries with the largest GDP that have a “Wet Tropical” climate?
Lesson Feedback
-
During this lesson I was primarily in my...
- Comfort Zone
- Learning Zone
- Panic Zone
-
Completing this lesson took...
- Very little time
- A reasonable amount of time
- More time than is reasonable
-
Based on my own interests and needs, the things taught in this lesson...
- Don't seem worth learning
- May be worth learning
- Are definitely worth learning
-
For me to master the things taught in this lesson feels...
- Definitely within reach
- Within reach if I try my hardest
- Out of reach no matter how hard I try