🤔 Computing Statistics with Kiva Data

Kiva is an international nonprofit, founded in 2005 and based in San Francisco, with a mission to connect people through lending to alleviate poverty. We celebrate and support people looking to create a better future for themselves, their families and their communities. By lending as little as $25 on Kiva, anyone can help a borrower start or grow a business, go to school, access clean energy or realize their potential. For some, it’s a matter of survival; for others, it’s the fuel for a life-long ambition. The following table contains data that we will use to practice some basic descriptive statistics commonly used in data science.

Kiva Lending Data
id	loan_amount	country_name	status	time_to_raise	num_lenders_total
212763	1250.0	Azerbaijan	funded	193075.0	38
76281	500.0	El Salvador	funded	1157108.0	18
444097	1450.0	Bolivia	funded	1552939.0	51
402224	200.0	Paraguay	funded	244945.0	3
634949	700.0	El Salvador	funded	238797.0	21
1383386	100.0	Philippines	funded	1248909.0	1
351	250.0	Philippines	funded	773599.0	10
35651	225.0	Nicaragua	funded	116181.0	8
784253	1200.0	Guatemala	funded	2288095.0	42
1328839	150.0	Philippines	funded	51668.0	1
1094905	600.0	Paraguay	funded	26717.0	18
336986	300.0	Philippines	funded	48030.0	6
163170	700.0	Bolivia	funded	24078.0	28
1323915	125.0	Philippines	funded	71117.0	5
528261	650.0	Philippines	funded	580401.0	16
495978	175.0	Madagascar	funded	800427.0	7
1251510	1800.0	Georgia	funded	1156218.0	54
642684	1525.0	Uganda	funded	1166045.0	1
974324	575.0	Kenya	funded	2924705.0	18
7487	700.0	Tajikistan	funded	470622.0	22
957	1450.0	Jordan	funded	3046687.0	36
647494	400.0	Kenya	funded	260044.0	12
706941	200.0	Philippines	funded	445938.0	8
889708	1000.0	Ecuador	funded	201408.0	24
882568	350.0	Kenya	funded	2370450.0	8

There are some great (more advanced) tools in Python for working with massive tables of data. In fact, this table is a random sample of a data set from Kiva that contains 1.4 million rows! We will move on to more and bigger data sets in time, but for now we need a simple way to work with this sample. To do that, we will represent each column of the table as its own list.

To keep your code simpler and cleaner, we show these lists here, but they will be automatically included for you in later activecodes. You can simply refer to each list by name.
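As a reference, here is a sketch of the four columns used in the exercises, transcribed from the table above as Python lists. The list names are assumed to match the column headers; the id and status columns are left out because the exercises below do not use them.

```python
# Each list holds one column of the Kiva sample table, in row order.
# The list names are an assumption: they mirror the column headers.
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0,
               1200.0, 150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0,
               1800.0, 1525.0, 575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0,
               350.0]
country_name = ["Azerbaijan", "El Salvador", "Bolivia", "Paraguay",
                "El Salvador", "Philippines", "Philippines", "Nicaragua",
                "Guatemala", "Philippines", "Paraguay", "Philippines",
                "Bolivia", "Philippines", "Philippines", "Madagascar",
                "Georgia", "Uganda", "Kenya", "Tajikistan", "Jordan",
                "Kenya", "Philippines", "Ecuador", "Kenya"]
time_to_raise = [193075.0, 1157108.0, 1552939.0, 244945.0, 238797.0,
                 1248909.0, 773599.0, 116181.0, 2288095.0, 51668.0,
                 26717.0, 48030.0, 24078.0, 71117.0, 580401.0, 800427.0,
                 1156218.0, 1166045.0, 2924705.0, 470622.0, 3046687.0,
                 260044.0, 445938.0, 201408.0, 2370450.0]
num_lenders_total = [38, 18, 51, 3, 21, 1, 10, 8, 42, 1, 18, 6, 28, 5, 16,
                     7, 54, 1, 18, 22, 36, 12, 8, 24, 8]

# Every column should have one entry per table row.
print(len(loan_amount), len(country_name),
      len(time_to_raise), len(num_lenders_total))
```

Because the lists are parallel, position i in every list refers to the same loan.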

Level 1 Questions

What is the total amount of money loaned?
What is the average loan amount?
What is the largest/smallest loan?
What country got the largest/smallest loan?
What is the variance of the money loaned?
What is the average number of days needed to fund a loan?

The questions above are phrased the way you would probably pose them when brainstorming or talking with a colleague. Answering them in code often requires posing them more precisely, so we restate them below in a more precise form.

Compute the total amount of money loaned and store it in the variable loan_total.

Compute the average amount of money loaned and store it in the variable loan_average.
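One possible approach to the total and the average, sketched with an explicit accumulator loop. It assumes the loan_amount list is transcribed from the table; the built-in sum function would give the same total.

```python
# loan_amount column, transcribed from the table (assumed list name)
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0,
               1200.0, 150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0,
               1800.0, 1525.0, 575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0,
               350.0]

# Accumulate the total one loan at a time.
loan_total = 0
for amount in loan_amount:
    loan_total += amount

# The average is the total divided by the number of loans.
loan_average = loan_total / len(loan_amount)

print(loan_total, loan_average)
```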

Store the amount of the minimum loan in min_loan and the amount of the maximum loan in max_loan. Then store the name of the country that received the largest loan in max_country and the name of the country that received the smallest loan in min_country. Hint: max and min are built-in Python functions that you can use to find the maximum or minimum value in any sequence.
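A minimal sketch of this exercise, assuming the loan_amount and country_name lists are transcribed from the table. Since the lists are parallel, the position of a loan in loan_amount is also the position of its country in country_name.

```python
# Columns transcribed from the table (assumed list names)
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0,
               1200.0, 150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0,
               1800.0, 1525.0, 575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0,
               350.0]
country_name = ["Azerbaijan", "El Salvador", "Bolivia", "Paraguay",
                "El Salvador", "Philippines", "Philippines", "Nicaragua",
                "Guatemala", "Philippines", "Paraguay", "Philippines",
                "Bolivia", "Philippines", "Philippines", "Madagascar",
                "Georgia", "Uganda", "Kenya", "Tajikistan", "Jordan",
                "Kenya", "Philippines", "Ecuador", "Kenya"]

min_loan = min(loan_amount)
max_loan = max(loan_amount)

# list.index returns the position of the first matching value;
# the same position in country_name holds that loan's country.
max_country = country_name[loan_amount.index(max_loan)]
min_country = country_name[loan_amount.index(min_loan)]

print(min_loan, min_country)
print(max_loan, max_country)
```

Note that index only finds the first match, so if several loans tied for the largest or smallest amount, this sketch would report only the first one.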

Compute the average number of lenders per loan and store it in a variable average_lenders.

Compute the total number of loans made to the Philippines and store it in a variable philippines_count.

For each unique country name, print a line that shows the name of the country and then the number of loans made in that country, like this: “Guatemala 1”
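The two counting exercises can be sketched as follows, assuming the country_name list is transcribed from the table. The dictionary name loans_per_country is a hypothetical helper introduced for illustration.

```python
# country_name column, transcribed from the table (assumed list name)
country_name = ["Azerbaijan", "El Salvador", "Bolivia", "Paraguay",
                "El Salvador", "Philippines", "Philippines", "Nicaragua",
                "Guatemala", "Philippines", "Paraguay", "Philippines",
                "Bolivia", "Philippines", "Philippines", "Madagascar",
                "Georgia", "Uganda", "Kenya", "Tajikistan", "Jordan",
                "Kenya", "Philippines", "Ecuador", "Kenya"]

# list.count tallies how many times a value appears in the list.
philippines_count = country_name.count("Philippines")

# Build a country -> number-of-loans tally in a dictionary
# (loans_per_country is a hypothetical name).
loans_per_country = {}
for country in country_name:
    loans_per_country[country] = loans_per_country.get(country, 0) + 1

# One line per unique country, for example "Guatemala 1".
for country, count in loans_per_country.items():
    print(country, count)
```

A dictionary visits each row once; calling country_name.count for every unique name would also work on a sample this small.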

Level 2 Questions

What is the average amount of loans made to people in the Philippines?
In which country was the loan granted that took the longest to fund?
What is the average amount of time / dollar it takes to fund a loan?
What is the standard deviation of the money loaned? The Empirical Rule or 68-95-99.7% Rule reminds us that 68% of the population falls within 1 standard deviation. Does this hold for our data?
Is there a relationship between the loan amount and the number of people? Or time to fund? How would we measure this? Covariance? Correlation?

The index positions for the Philippines are [5, 6, 9, 11, 13, 14, 22]. Use that information to compute the average loan amount for the Philippines. Store your result in the variable p_average.
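A sketch of one way to use the index positions, assuming the loan_amount list is transcribed from the table. The names p_index and p_total are hypothetical helpers.

```python
# loan_amount column, transcribed from the table (assumed list name)
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0,
               1200.0, 150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0,
               1800.0, 1525.0, 575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0,
               350.0]

# Index positions of the Philippines rows, as given in the exercise
p_index = [5, 6, 9, 11, 13, 14, 22]

# Add up only the loans at those positions.
p_total = 0
for i in p_index:
    p_total += loan_amount[i]

p_average = p_total / len(p_index)
print(p_average)
```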

What is the name of the country with the loan that took the longest to raise? Store your result in the variable longest_to_fund.
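This follows the same pattern as finding the country with the largest loan. A minimal sketch, assuming the time_to_raise and country_name lists are transcribed from the table; longest_time is a hypothetical helper name.

```python
# Columns transcribed from the table (assumed list names)
time_to_raise = [193075.0, 1157108.0, 1552939.0, 244945.0, 238797.0,
                 1248909.0, 773599.0, 116181.0, 2288095.0, 51668.0,
                 26717.0, 48030.0, 24078.0, 71117.0, 580401.0, 800427.0,
                 1156218.0, 1166045.0, 2924705.0, 470622.0, 3046687.0,
                 260044.0, 445938.0, 201408.0, 2370450.0]
country_name = ["Azerbaijan", "El Salvador", "Bolivia", "Paraguay",
                "El Salvador", "Philippines", "Philippines", "Nicaragua",
                "Guatemala", "Philippines", "Paraguay", "Philippines",
                "Bolivia", "Philippines", "Philippines", "Madagascar",
                "Georgia", "Uganda", "Kenya", "Tajikistan", "Jordan",
                "Kenya", "Philippines", "Ecuador", "Kenya"]

# Find the longest time, then look up the country at the same position.
longest_time = max(time_to_raise)
longest_to_fund = country_name[time_to_raise.index(longest_time)]

print(longest_to_fund)
```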

What is the arithmetic mean of the time per dollar it takes to fund a loan? The arithmetic mean here is the average of the individual time/dollar values computed for each loan, not the total time divided by the total dollar amount. Store your result in the variable a_mean.
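The distinction between the two calculations can be sketched like this, assuming the time_to_raise and loan_amount lists are transcribed from the table. The names ratio_total and ratio_of_totals are hypothetical, and ratio_of_totals is computed only to show the value the exercise is not asking for.

```python
# Columns transcribed from the table (assumed list names)
time_to_raise = [193075.0, 1157108.0, 1552939.0, 244945.0, 238797.0,
                 1248909.0, 773599.0, 116181.0, 2288095.0, 51668.0,
                 26717.0, 48030.0, 24078.0, 71117.0, 580401.0, 800427.0,
                 1156218.0, 1166045.0, 2924705.0, 470622.0, 3046687.0,
                 260044.0, 445938.0, 201408.0, 2370450.0]
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0,
               1200.0, 150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0,
               1800.0, 1525.0, 575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0,
               350.0]

# Compute time per dollar for each loan, then average those ratios.
ratio_total = 0
for time, amount in zip(time_to_raise, loan_amount):
    ratio_total += time / amount
a_mean = ratio_total / len(loan_amount)

# For contrast only: total time divided by total dollars.
ratio_of_totals = sum(time_to_raise) / sum(loan_amount)

print(a_mean, ratio_of_totals)
```

The two numbers differ because small loans that took a long time to fund have very large individual ratios, which pull the mean of ratios upward.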

For our final few exercises we are interested in exploring the distribution of the data as well as the relationships between two of our variables. To do this we need to introduce a few more statistical concepts including variance, standard deviation, covariance and correlation.

Variance looks at a single variable and measures how far a set of numbers is spread out from its average value. However, it is a bit hard to interpret because the units are squared, so it is not on the same scale as our original numbers. This is why most of the time we use the standard deviation, which is just the square root of the variance. A large standard deviation tells us that our data is quite spread out, while a small standard deviation tells us that most of our data is pretty close to the mean.

\[variance = \frac{\sum{ (x-\bar{x})^2}}{n}\]

\[stdev = \sqrt{variance}\]

Don’t let the fancy math get you down. The variance is just the sum of the squared differences between each value and the average, divided by the number of values. This is a little more complicated than what you have done before, but you can definitely do this.

Calculate the variance and the standard deviation of the loan_amount variable. Store the variance in loan_var and the standard deviation in loan_stdev.
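A sketch that follows the variance formula above term by term, dividing by n. It assumes the loan_amount list is transcribed from the table; loan_mean, squared_diffs, within_one and share_within_one are hypothetical helper names. The last part counts how many loans fall within one standard deviation of the mean, which is one way to compare the sample with the 68% figure from the Empirical Rule.

```python
import math

# loan_amount column, transcribed from the table (assumed list name)
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0,
               1200.0, 150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0,
               1800.0, 1525.0, 575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0,
               350.0]

n = len(loan_amount)
loan_mean = sum(loan_amount) / n

# Sum the squared differences from the mean, then divide by n.
squared_diffs = 0
for x in loan_amount:
    squared_diffs += (x - loan_mean) ** 2
loan_var = squared_diffs / n

# The standard deviation is the square root of the variance.
loan_stdev = math.sqrt(loan_var)

# Count the loans that lie within one standard deviation of the mean.
within_one = 0
for x in loan_amount:
    if abs(x - loan_mean) <= loan_stdev:
        within_one += 1
share_within_one = within_one / n

print(loan_var, loan_stdev, share_within_one)
```

This divides by n, as the formula in this section does. Some libraries divide by n - 1 by default (the sample variance), so their results can differ slightly.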

In data science we are often most interested in two variables that seem to influence one another. That is, we can observe that as one variable grows a second grows with it, or as one variable grows another variable shrinks at a similar rate. We will look at two ways to explore the relationships between these variables.

Covariance measures the extent to which the larger values of one variable correspond to the larger values of a second variable, as well as the extent to which the smaller values of one variable correspond to the smaller values of the second. If the covariance is positive, the two variables grow together (positive correlation). If the covariance is negative, one variable grows while the other shrinks. The magnitude is hard to interpret because it depends on the scale of the variables, so most often the covariance is normalized so that its values lie between -1 and +1. This normalized value is the Pearson correlation coefficient. A value of -1 indicates the strongest possible negative correlation, a value of 0 indicates that the variables are not linearly correlated at all, and a value of +1 indicates the strongest possible positive correlation.

Historically, the Pearson correlation coefficient has been used in recommender systems to find groups of like-minded shoppers who can recommend products to each other. It was the basis of Amazon.com’s recommender system from 1997 to 2000. I know this because I was part of the team that wrote that software :-)

\[covariance = \frac{\sum{(x -\bar{x}) \cdot (y-\bar{y})}}{n}\]

\[pearson = \frac{covariance(x,y)}{stdev(x) \cdot stdev(y)}\]

Calculate the Pearson correlation between loan_amount and num_lenders_total, between time_to_raise and loan_amount, or between num_lenders_total and time_to_raise. If you divide up the class, you can compare values to see which pair has the strongest correlation.
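The covariance and Pearson formulas above can be sketched as small functions. This is one possible layout, not the only one: the function names mean, covariance, stdev and pearson are hypothetical helpers, and the two lists are assumed to be transcribed from the table. Only the loan_amount and num_lenders_total pair is shown; the same pearson function can be called on the other pairs.

```python
import math

# Columns transcribed from the table (assumed list names)
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0,
               1200.0, 150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0,
               1800.0, 1525.0, 575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0,
               350.0]
num_lenders_total = [38, 18, 51, 3, 21, 1, 10, 8, 42, 1, 18, 6, 28, 5, 16,
                     7, 54, 1, 18, 22, 36, 12, 8, 24, 8]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Average product of paired deviations from each mean (divides by n)."""
    x_bar = mean(xs)
    y_bar = mean(ys)
    total = 0
    for x, y in zip(xs, ys):
        total += (x - x_bar) * (y - y_bar)
    return total / len(xs)


def stdev(values):
    """Square root of the variance; the variance is covariance(x, x)."""
    return math.sqrt(covariance(values, values))


def pearson(xs, ys):
    """Covariance normalized by the product of the standard deviations."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))


r_amount_lenders = pearson(loan_amount, num_lenders_total)
print(r_amount_lenders)
```

Defining stdev in terms of covariance keeps the divisor (n) consistent between the numerator and denominator, which is what keeps the result between -1 and +1.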

