7.16. 🤔 Computing Statistics with Kiva Data

Kiva is an international nonprofit, founded in 2005 and based in San Francisco, with a mission to connect people through lending to alleviate poverty. Kiva celebrates and supports people looking to create a better future for themselves, their families and their communities. By lending as little as $25 on Kiva, anyone can help a borrower start or grow a business, go to school, access clean energy or realize their potential. For some, it’s a matter of survival; for others, it’s the fuel for a lifelong ambition. The following table contains some data that we will use to practice some basic descriptive statistics that are commonly used in data science.

Kiva Lending Data
id loan_amount country_name status time_to_raise num_lenders_total
212763 1250.0 Azerbaijan funded 193075.0 38
76281 500.0 El Salvador funded 1157108.0 18
444097 1450.0 Bolivia funded 1552939.0 51
402224 200.0 Paraguay funded 244945.0 3
634949 700.0 El Salvador funded 238797.0 21
1383386 100.0 Philippines funded 1248909.0 1
351 250.0 Philippines funded 773599.0 10
35651 225.0 Nicaragua funded 116181.0 8
784253 1200.0 Guatemala funded 2288095.0 42
1328839 150.0 Philippines funded 51668.0 1
1094905 600.0 Paraguay funded 26717.0 18
336986 300.0 Philippines funded 48030.0 6
163170 700.0 Bolivia funded 24078.0 28
1323915 125.0 Philippines funded 71117.0 5
528261 650.0 Philippines funded 580401.0 16
495978 175.0 Madagascar funded 800427.0 7
1251510 1800.0 Georgia funded 1156218.0 54
642684 1525.0 Uganda funded 1166045.0 1
974324 575.0 Kenya funded 2924705.0 18
7487 700.0 Tajikistan funded 470622.0 22
957 1450.0 Jordan funded 3046687.0 36
647494 400.0 Kenya funded 260044.0 12
706941 200.0 Philippines funded 445938.0 8
889708 1000.0 Ecuador funded 201408.0 24
882568 350.0 Kenya funded 2370450.0 8

There are some great (more advanced) tools in Python for working with massive tables of data. In fact this table is a random sample of a data set from Kiva that contains 1.4 million rows! We will move on to more and bigger data sets in time, but for now we need a simple way to work with this sample. To do that we will represent each column of the table as its own list.

To keep your coding easier and cleaner we will show you these lists here, but they will be automatically included for you in later activecodes. You can just use the list by name and it will be fine.
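As a sketch, the columns used in the exercises could be written as lists like this. The values are copied from the table above; the list names are an assumption that simply reuses the column headers, and the id and status columns are left out because no question refers to them.

```python
# One list per column; position i in every list describes the same loan.
# The list names are assumed to match the column headers of the table.
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0,
               1200.0, 150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0,
               1800.0, 1525.0, 575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0,
               350.0]
country_name = ['Azerbaijan', 'El Salvador', 'Bolivia', 'Paraguay',
                'El Salvador', 'Philippines', 'Philippines', 'Nicaragua',
                'Guatemala', 'Philippines', 'Paraguay', 'Philippines',
                'Bolivia', 'Philippines', 'Philippines', 'Madagascar',
                'Georgia', 'Uganda', 'Kenya', 'Tajikistan', 'Jordan',
                'Kenya', 'Philippines', 'Ecuador', 'Kenya']
time_to_raise = [193075.0, 1157108.0, 1552939.0, 244945.0, 238797.0,
                 1248909.0, 773599.0, 116181.0, 2288095.0, 51668.0,
                 26717.0, 48030.0, 24078.0, 71117.0, 580401.0, 800427.0,
                 1156218.0, 1166045.0, 2924705.0, 470622.0, 3046687.0,
                 260044.0, 445938.0, 201408.0, 2370450.0]
num_lenders_total = [38, 18, 51, 3, 21, 1, 10, 8, 42, 1, 18, 6, 28, 5, 16,
                     7, 54, 1, 18, 22, 36, 12, 8, 24, 8]

# Every column should have one entry per row of the table.
print(len(loan_amount), len(country_name),
      len(time_to_raise), len(num_lenders_total))
```

Because the lists are parallel, `country_name[i]` is the country for the loan whose amount is `loan_amount[i]`.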

7.16.1. Level 1 Questions

  1. What is the total amount of money loaned?
  2. What is the average loan amount?
  3. What is the largest/smallest loan?
  4. What country got the largest/smallest loan?
  5. What is the variance of the money loaned?
  6. What is the average number of days needed to fund a loan?
Compute the total amount of money loaned and store it in the variable loan_total.
Compute the average amount of money loaned and store it in the variable loan_average.
Store the amount of the minimum loan in min_loan and the amount of the maximum loan in max_loan. Then store the name of the country that received the largest loan in max_country and the name of the country that received the smallest loan in min_country. Hint: max and min are built-in Python functions that you can use to find the maximum or minimum value in any sequence.
Compute the average number of lenders per loan and store it in the variable average_lenders.
Compute the total number of loans made to the Philippines and store it in the variable philippines_count.
For each unique country name, print a line that shows the name of the country and then the number of loans made in that country, like this: “Guatemala 1”
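The techniques behind these exercises can be sketched on a smaller sample. The block below uses only the first five rows of the table, and every variable name in it is hypothetical, so it illustrates the approach rather than giving the answers for the full data.

```python
# Hypothetical sample: the first five rows of the table, not the full data.
sample_amounts = [1250.0, 500.0, 1450.0, 200.0, 700.0]
sample_countries = ["Azerbaijan", "El Salvador", "Bolivia", "Paraguay",
                    "El Salvador"]

# Total and average use the built-in sum and len functions.
sample_total = sum(sample_amounts)
sample_average = sample_total / len(sample_amounts)

# max and min find the extreme values; index finds where they sit, and the
# same position in the parallel list gives the matching country.
largest = max(sample_amounts)
smallest = min(sample_amounts)
largest_country = sample_countries[sample_amounts.index(largest)]
smallest_country = sample_countries[sample_amounts.index(smallest)]

# count tallies how many times a value appears in a list.
el_salvador_count = sample_countries.count("El Salvador")

# set removes duplicates, so each country is printed once with its count.
for name in sorted(set(sample_countries)):
    print(name, sample_countries.count(name))
```

Note that `index` returns the first matching position, so if two loans tie for the largest amount this sketch reports the country of the earlier one.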

7.16.2. Level 2 Questions

  1. What is the average amount of loans made to people in the Philippines?
  2. In which country was the loan granted that took the longest to fund?
  3. What is the average amount of time / dollar it takes to fund a loan?
  4. What is the standard deviation of the money loaned? The Empirical Rule, or 68-95-99.7% Rule, reminds us that about 68% of a normally distributed population falls within 1 standard deviation of the mean. Does this hold for our data?
  5. Is there a relationship between the loan amount and the number of lenders? Or between the loan amount and the time to fund? How would we measure this? Covariance? Correlation?
The index positions for the Philippines are [5, 6, 9, 11, 13, 14, 22]. Use that information to compute the average loan amount for the Philippines. Store your result in the variable p_average.
What is the name of the country with the loan that took the longest to raise? Store your result in the variable longest_to_fund
What is the arithmetic mean of the time / dollar it takes to fund a loan? The arithmetic mean is the average of the individual time/dollar calculations, not the total time divided by the total dollar amount. Store your result in the variable a_mean.
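The same ideas can be sketched on the first five rows of the table. The variable names and the list of index positions for El Salvador are hypothetical and chosen only for illustration. The last two calculations show why the mean of the individual ratios is not the same thing as the ratio of the totals.

```python
# Hypothetical sample: the first five rows of the table, not the full data.
sample_amounts = [1250.0, 500.0, 1450.0, 200.0, 700.0]
sample_countries = ["Azerbaijan", "El Salvador", "Bolivia", "Paraguay",
                    "El Salvador"]
sample_times = [193075.0, 1157108.0, 1552939.0, 244945.0, 238797.0]

# Average over a known set of index positions (El Salvador in this sample).
el_salvador_positions = [1, 4]
subtotal = 0
for i in el_salvador_positions:
    subtotal = subtotal + sample_amounts[i]
el_salvador_average = subtotal / len(el_salvador_positions)

# Country of the loan that took the longest to raise.
slowest_country = sample_countries[sample_times.index(max(sample_times))]

# Mean of the individual time/dollar ratios, one ratio per loan.
ratio_total = 0
for i in range(len(sample_amounts)):
    ratio_total = ratio_total + sample_times[i] / sample_amounts[i]
mean_of_ratios = ratio_total / len(sample_amounts)

# A different quantity: total time divided by total dollars.
ratio_of_sums = sum(sample_times) / sum(sample_amounts)

print(el_salvador_average, slowest_country)
print(mean_of_ratios, ratio_of_sums)
```

The two ratios differ because the mean of ratios weights every loan equally, while the ratio of sums is dominated by the largest loans.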

For our final few exercises we are interested in exploring the distribution of the data as well as the relationships between two of our variables. To do this we need to introduce a few more statistical concepts including variance, standard deviation, covariance and correlation.

Variance looks at a single variable and measures how far the set of numbers is spread out from its average value. However, it’s a bit hard to interpret because the units are squared, so it’s not on the same scale as our original numbers. This is why most of the time we use the standard deviation, which is just the square root of the variance. A large standard deviation tells us that our data is quite spread out, while a small standard deviation tells us that most of our data is pretty close to the mean.

\[variance = \frac{\sum{ (x-\bar{x})^2}}{n}\]
\[stdev = \sqrt{variance}\]

Don’t let the fancy math get you down. The variance is just the sum of the squared differences between each value and the average, divided by the number of values. This is a little more complicated than what you have done before, but you can definitely do this.

Calculate the variance and the standard deviation of the loan_amount variable. Store the variance in loan_var and the standard deviation in loan_stdev.
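Here is a minimal sketch of the two formulas, again on the first five rows of the table with hypothetical variable names. It divides by n, as in the variance formula above, and then counts how many values fall within one standard deviation of the mean as a rough check of the Empirical Rule.

```python
import math

# Hypothetical sample: the first five loan amounts from the table.
sample_amounts = [1250.0, 500.0, 1450.0, 200.0, 700.0]

n = len(sample_amounts)
sample_mean = sum(sample_amounts) / n

# Accumulate the squared distance of each value from the mean.
squared_total = 0
for x in sample_amounts:
    squared_total = squared_total + (x - sample_mean) ** 2

sample_var = squared_total / n        # variance, dividing by n
sample_stdev = math.sqrt(sample_var)  # standard deviation

# Count the values that lie within one standard deviation of the mean.
within_one = 0
for x in sample_amounts:
    if abs(x - sample_mean) <= sample_stdev:
        within_one = within_one + 1

print(sample_var, sample_stdev, within_one / n)
```

With only five values the fraction within one standard deviation is a coarse estimate, so it should not be expected to land close to 68%.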

In data science we are often most interested in two variables that seem to influence one another. That is, we can observe that as one variable grows a second grows with it, or as one variable grows another variable shrinks at a similar rate. We will look at two ways to explore the relationships between these variables.

Covariance measures the extent to which the larger values of one variable correspond to the larger values of a second variable, and the smaller values of one to the smaller values of the other. If the covariance is positive, the two variables grow together (positive correlation). If the covariance is negative, one variable grows while the other shrinks. The magnitude is hard to interpret because it depends on the scale of the variables, so most often the covariance is normalized so that its values fall between -1 and +1. This normalized value is the Pearson correlation coefficient. A value of -1 indicates a perfect negative linear correlation, a value of 0 indicates that the variables are not linearly correlated at all, and a value of +1 indicates a perfect positive linear correlation.

Historically, the Pearson correlation coefficient has been used in recommender systems to find groups of like-minded shoppers who can recommend products to each other. It was the basis of Amazon.com’s recommender system from 1997 to 2000. I know this because I was part of the team that wrote that software :-)

\[covariance = \frac{\sum{(x -\bar{x}) \cdot (y-\bar{y})}}{n}\]
\[pearson = \frac{covariance(x,y)}{stdev(x) \cdot stdev(y)}\]
Calculate the Pearson correlation between loan_amount and num_lenders_total, between time_to_raise and loan_amount, or between num_lenders_total and time_to_raise. If you divide up the class, you can compare values to see which pair has the strongest correlation.
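The two formulas above can be sketched as small helper functions. The function names (mean, covariance, pearson) and the five-row sample are hypothetical and are not part of the exercise. Both formulas divide by n, matching the variance formula used earlier, and the sketch relies on the fact that the covariance of a list with itself is its variance.

```python
import math

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def covariance(xs, ys):
    """Covariance of two equal-length lists, dividing by n."""
    x_bar = mean(xs)
    y_bar = mean(ys)
    total = 0
    for i in range(len(xs)):
        total = total + (xs[i] - x_bar) * (ys[i] - y_bar)
    return total / len(xs)

def pearson(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    stdev_x = math.sqrt(covariance(xs, xs))  # covariance(x, x) is variance
    stdev_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (stdev_x * stdev_y)

# Hypothetical sample: loan amounts and lender counts from the first five rows.
sample_amounts = [1250.0, 500.0, 1450.0, 200.0, 700.0]
sample_lenders = [38, 18, 51, 3, 21]

sample_cov = covariance(sample_amounts, sample_lenders)
sample_r = pearson(sample_amounts, sample_lenders)
print(sample_cov, sample_r)
```

The helper would divide by zero if either list had no spread at all (every value identical), a case this sketch does not handle.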
