7.16. 🤔 Computing Statistics with Kiva Data
Kiva is an international nonprofit, founded in 2005 and based in San Francisco, with a mission to connect people through lending to alleviate poverty. We celebrate and support people looking to create a better future for themselves, their families and their communities. By lending as little as $25 on Kiva, anyone can help a borrower start or grow a business, go to school, access clean energy or realize their potential. For some, it's a matter of survival; for others, it's the fuel for a lifelong ambition. The following table contains some data that we will use to practice some basic descriptive statistics that are commonly used in data science.
id  loan_amount  country_name  status  time_to_raise  num_lenders_total 

212763  1250.0  Azerbaijan  funded  193075.0  38 
76281  500.0  El Salvador  funded  1157108.0  18 
444097  1450.0  Bolivia  funded  1552939.0  51 
402224  200.0  Paraguay  funded  244945.0  3 
634949  700.0  El Salvador  funded  238797.0  21 
1383386  100.0  Philippines  funded  1248909.0  1 
351  250.0  Philippines  funded  773599.0  10 
35651  225.0  Nicaragua  funded  116181.0  8 
784253  1200.0  Guatemala  funded  2288095.0  42 
1328839  150.0  Philippines  funded  51668.0  1 
1094905  600.0  Paraguay  funded  26717.0  18 
336986  300.0  Philippines  funded  48030.0  6 
163170  700.0  Bolivia  funded  24078.0  28 
1323915  125.0  Philippines  funded  71117.0  5 
528261  650.0  Philippines  funded  580401.0  16 
495978  175.0  Madagascar  funded  800427.0  7 
1251510  1800.0  Georgia  funded  1156218.0  54 
642684  1525.0  Uganda  funded  1166045.0  1 
974324  575.0  Kenya  funded  2924705.0  18 
7487  700.0  Tajikistan  funded  470622.0  22 
957  1450.0  Jordan  funded  3046687.0  36 
647494  400.0  Kenya  funded  260044.0  12 
706941  200.0  Philippines  funded  445938.0  8 
889708  1000.0  Ecuador  funded  201408.0  24 
882568  350.0  Kenya  funded  2370450.0  8 
There are some great (more advanced) tools in Python for working with massive tables of data. In fact this table is a random sample of a data set from Kiva that contains 1.4 million rows! We will move on to more and bigger data sets in time, but for now we need a simple way to work with this sample. To do that we will represent each column of the table as its own list.
To keep your coding easier and cleaner we will show you these lists here, but they will be automatically included for you in later activecodes. You can just use the list by name and it will be fine.
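As a sketch of what "each column as its own list" looks like, here are the columns used in the exercises, transcribed from the table above. The list names are an assumption: they simply reuse the column headers, and the names provided in the activecodes may differ.

```python
# Each list holds one column of the table. Position i in every list
# refers to the same loan (the i-th data row of the table, counting from 0).
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0,
               1200.0, 150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0,
               1800.0, 1525.0, 575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0,
               350.0]
country_name = ["Azerbaijan", "El Salvador", "Bolivia", "Paraguay",
                "El Salvador", "Philippines", "Philippines", "Nicaragua",
                "Guatemala", "Philippines", "Paraguay", "Philippines",
                "Bolivia", "Philippines", "Philippines", "Madagascar",
                "Georgia", "Uganda", "Kenya", "Tajikistan", "Jordan",
                "Kenya", "Philippines", "Ecuador", "Kenya"]
time_to_raise = [193075.0, 1157108.0, 1552939.0, 244945.0, 238797.0,
                 1248909.0, 773599.0, 116181.0, 2288095.0, 51668.0,
                 26717.0, 48030.0, 24078.0, 71117.0, 580401.0, 800427.0,
                 1156218.0, 1166045.0, 2924705.0, 470622.0, 3046687.0,
                 260044.0, 445938.0, 201408.0, 2370450.0]
num_lenders_total = [38, 18, 51, 3, 21, 1, 10, 8, 42, 1, 18, 6, 28, 5, 16,
                     7, 54, 1, 18, 22, 36, 12, 8, 24, 8]

# Sanity check: every column should have one entry per loan.
print(len(loan_amount), len(country_name),
      len(time_to_raise), len(num_lenders_total))
```

Because the lists are parallel, `loan_amount[0]` and `country_name[0]` both describe the first loan in the table.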
7.16.1. Level 1 Questions
 What is the total amount of money loaned?
 What is the average loan amount?
 What is the largest/smallest loan?
 What country got the largest/smallest loan?
 What is the variance of the money loaned?
 What is the average number of days needed to fund a loan?
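Compute the total amount of money loaned and store it in loan_total. Then compute the average loan amount and store it in loan_average.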
Store the amount of the smallest loan in min_loan and the amount of the largest loan in max_loan. Then, store the name of the country that received the largest loan in max_country and the smallest loan in min_country. Hint: max and min are built-in Python functions that you can use to find the maximum or minimum value in any sequence.
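Here is one possible way to approach the total, average, minimum and maximum questions. It is a sketch, not the only answer. The lists are repeated so the snippet runs on its own, and the list names are assumed to match the table's column headers.

```python
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0,
               1200.0, 150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0,
               1800.0, 1525.0, 575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0,
               350.0]
country_name = ["Azerbaijan", "El Salvador", "Bolivia", "Paraguay",
                "El Salvador", "Philippines", "Philippines", "Nicaragua",
                "Guatemala", "Philippines", "Paraguay", "Philippines",
                "Bolivia", "Philippines", "Philippines", "Madagascar",
                "Georgia", "Uganda", "Kenya", "Tajikistan", "Jordan",
                "Kenya", "Philippines", "Ecuador", "Kenya"]

# Total and average of the loan amounts.
loan_total = sum(loan_amount)
loan_average = loan_total / len(loan_amount)

# Smallest and largest loan.
min_loan = min(loan_amount)
max_loan = max(loan_amount)

# list.index returns the position of the first match, so if two loans
# tie for largest or smallest, only the first one is reported.
max_country = country_name[loan_amount.index(max_loan)]
min_country = country_name[loan_amount.index(min_loan)]

print(loan_total, loan_average)
print(min_loan, min_country)
print(max_loan, max_country)
```

The key idea is that the position of a value in one list can be used to look up the matching value in another list.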
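Compute the average number of lenders per loan and store it in average_lenders. Count the number of loans made to the Philippines and store it in philippines_count.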
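A minimal sketch for these two variables, assuming average_lenders is meant to hold the average number of lenders per loan and philippines_count the number of loans made to the Philippines. The list names are again taken from the table's column headers.

```python
num_lenders_total = [38, 18, 51, 3, 21, 1, 10, 8, 42, 1, 18, 6, 28, 5, 16,
                     7, 54, 1, 18, 22, 36, 12, 8, 24, 8]
country_name = ["Azerbaijan", "El Salvador", "Bolivia", "Paraguay",
                "El Salvador", "Philippines", "Philippines", "Nicaragua",
                "Guatemala", "Philippines", "Paraguay", "Philippines",
                "Bolivia", "Philippines", "Philippines", "Madagascar",
                "Georgia", "Uganda", "Kenya", "Tajikistan", "Jordan",
                "Kenya", "Philippines", "Ecuador", "Kenya"]

# Average number of lenders across all loans.
average_lenders = sum(num_lenders_total) / len(num_lenders_total)

# list.count tallies how many entries equal the given value.
philippines_count = country_name.count("Philippines")

print(average_lenders, philippines_count)
```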
7.16.2. Level 2 Questions
 What is the average amount of loans made to people in the Philippines?
 In which country was the loan granted that took the longest to fund?
 What is the average amount of time / dollar it takes to fund a loan?
 What is the standard deviation of the money loaned? The Empirical Rule, or 68-95-99.7% Rule, reminds us that about 68% of a normally distributed population falls within 1 standard deviation of the mean. Does this hold for our data?
 Is there a relationship between the loan amount and the number of people? Or time to fund? How would we measure this? Covariance? Correlation?
The loans made to the Philippines are at the following indices in the lists: [5, 6, 9, 11, 13, 14, 22]
Use that information to compute the average loan amount for the Philippines. Store your result in the variable p_average
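One way to use the index list is sketched below. The name philippines_indices is hypothetical, introduced only for this illustration.

```python
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0,
               1200.0, 150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0,
               1800.0, 1525.0, 575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0,
               350.0]

# Positions of the Philippines loans, as given in the exercise.
philippines_indices = [5, 6, 9, 11, 13, 14, 22]

# Pull out just the loan amounts at those positions.
philippines_loans = [loan_amount[i] for i in philippines_indices]

p_average = sum(philippines_loans) / len(philippines_loans)
print(p_average)
```

The same pattern works for any subset of rows: collect the indices once, then use them to select values from whichever column you need.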
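Store the name of the country whose loan took the longest to fund in longest_to_fund. Store the average amount of time per dollar needed to fund a loan in a_mean.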
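The sketch below shows one reading of these two questions. It assumes a_mean is meant to hold the average time per dollar, and it interprets that as the mean of each loan's own time-to-amount ratio. Dividing total time by total dollars is another reasonable reading and gives a different number. The helper name time_per_dollar is hypothetical.

```python
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0,
               1200.0, 150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0,
               1800.0, 1525.0, 575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0,
               350.0]
country_name = ["Azerbaijan", "El Salvador", "Bolivia", "Paraguay",
                "El Salvador", "Philippines", "Philippines", "Nicaragua",
                "Guatemala", "Philippines", "Paraguay", "Philippines",
                "Bolivia", "Philippines", "Philippines", "Madagascar",
                "Georgia", "Uganda", "Kenya", "Tajikistan", "Jordan",
                "Kenya", "Philippines", "Ecuador", "Kenya"]
time_to_raise = [193075.0, 1157108.0, 1552939.0, 244945.0, 238797.0,
                 1248909.0, 773599.0, 116181.0, 2288095.0, 51668.0,
                 26717.0, 48030.0, 24078.0, 71117.0, 580401.0, 800427.0,
                 1156218.0, 1166045.0, 2924705.0, 470622.0, 3046687.0,
                 260044.0, 445938.0, 201408.0, 2370450.0]

# Find the position of the longest funding time, then look up its country.
longest_index = time_to_raise.index(max(time_to_raise))
longest_to_fund = country_name[longest_index]

# Time per dollar for each loan, in the table's time units per dollar.
time_per_dollar = [t / amount for t, amount in zip(time_to_raise, loan_amount)]
a_mean = sum(time_per_dollar) / len(time_per_dollar)

print(longest_to_fund)
print(a_mean)
```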
For our final few exercises we are interested in exploring the distribution of the data as well as the relationships between two of our variables. To do this we need to introduce a few more statistical concepts including variance, standard deviation, covariance and correlation.
Variance looks at a single variable and measures how far a set of numbers is spread out from its average value. However, it's a bit hard to interpret because the units are squared, so it's not on the same scale as our original numbers. This is why most of the time we use the standard deviation, which is just the square root of the variance. A large standard deviation tells us that our data is quite spread out, while a small standard deviation tells us that most of our data is pretty close to the mean.
Don't let the fancy math get you down: the variance is just the sum of the squared differences between each value and the average, divided by the number of values. This is a little more complicated than what you have done before, but you can definitely do this.
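The recipe in the paragraph above translates almost word for word into Python. This is a sketch that divides by the number of values, as described (the population variance). The function names mean and variance and the variable loan_variance are hypothetical.

```python
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0,
               1200.0, 150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0,
               1800.0, 1525.0, 575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0,
               350.0]


def mean(values):
    """Return the arithmetic average of a list of numbers."""
    return sum(values) / len(values)


def variance(values):
    """Sum of squared differences from the mean, divided by the count."""
    avg = mean(values)
    return sum((v - avg) ** 2 for v in values) / len(values)


loan_variance = variance(loan_amount)
print(loan_variance)
```

Python's standard library also offers `statistics.pvariance`, which computes the same population variance, but writing it by hand once makes the formula easier to remember.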
Compute the standard deviation of the loan amounts and store it in loan_stdev.
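A possible way to get the standard deviation and check the 68% part of the Empirical Rule against our sample is sketched below. The names within_one and share_within_one are hypothetical, and the variance is the population version (dividing by the number of values).

```python
import math

loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0,
               1200.0, 150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0,
               1800.0, 1525.0, 575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0,
               350.0]

# Mean and population variance of the loan amounts.
loan_average = sum(loan_amount) / len(loan_amount)
loan_variance = sum((v - loan_average) ** 2 for v in loan_amount) / len(loan_amount)

# The standard deviation is the square root of the variance.
loan_stdev = math.sqrt(loan_variance)

# Count the loans that fall within one standard deviation of the mean.
low = loan_average - loan_stdev
high = loan_average + loan_stdev
within_one = sum(1 for v in loan_amount if low <= v <= high)
share_within_one = within_one / len(loan_amount)

print(loan_stdev)
print(within_one, share_within_one)
```

On this sample, 16 of the 25 loans (64%) fall within one standard deviation of the mean, which is close to, but not exactly, the 68% the rule predicts for normally distributed data.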
In data science we are often most interested in two variables that seem to influence one another. That is, we can observe that as one variable grows a second grows with it, or as one variable grows another variable shrinks at a similar rate. We will look at two ways to explore the relationships between these variables.
Covariance measures the extent to which the larger values of one variable correspond to the larger values of a second variable, as well as the extent to which the smaller values of one variable correspond to the smaller values of the second. If the covariance is positive, the two variables grow together (positive correlation). If the covariance is negative, one variable grows while the other shrinks. The magnitude is hard to interpret because it depends on the values of the variables, so most often the covariance is normalized so that the values are between -1 and +1; this is the Pearson correlation coefficient. A value of -1 indicates a strong negative correlation, a value of 0 indicates no linear correlation at all, and a value of +1 indicates a strong positive correlation.
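To make these two measures concrete, here is a small sketch that computes the covariance and the Pearson correlation coefficient between the loan amount and the number of lenders. The function names covariance and pearson are hypothetical, and both use the population form (dividing by the number of values) to stay consistent with the variance described earlier.

```python
import math

loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0,
               1200.0, 150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0,
               1800.0, 1525.0, 575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0,
               350.0]
num_lenders_total = [38, 18, 51, 3, 21, 1, 10, 8, 42, 1, 18, 6, 28, 5, 16,
                     7, 54, 1, 18, 22, 36, 12, 8, 24, 8]


def mean(values):
    """Return the arithmetic average of a list of numbers."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Average product of each pair's deviations from their means."""
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def pearson(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    # The covariance of a variable with itself is its variance.
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)


loan_lender_cov = covariance(loan_amount, num_lenders_total)
loan_lender_corr = pearson(loan_amount, num_lenders_total)

print(loan_lender_cov)
print(loan_lender_corr)
```

Swapping `num_lenders_total` for another numeric column lets you explore the relationship between the loan amount and that column in the same way. Note that `pearson` divides by zero if either list has no spread at all, so this sketch assumes both columns vary.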
Historically, the Pearson correlation coefficient has been used in recommender systems to find groups of like-minded shoppers that can recommend products to each other. It was the basis of Amazon.com's recommender system from 1997 to 2000. I know this because I was part of the team that wrote that software :)