🤔 Computing Statistics with Kiva Data

Kiva is an international nonprofit, founded in 2005 and based in San Francisco, with a mission to connect people through lending to alleviate poverty. We celebrate and support people looking to create a better future for themselves, their families and their communities. By lending as little as $25 on Kiva, anyone can help a borrower start or grow a business, go to school, access clean energy or realize their potential. For some, it's a matter of survival; for others, it's the fuel for a life-long ambition. The following table contains some data that we will use to practice some basic descriptive statistics that are commonly used in data science.

Kiva Lending Data

| id | loan_amount | country_name | status | time_to_raise | num_lenders_total |
|---|---|---|---|---|---|
| 212763 | 1250.0 | Azerbaijan | funded | 193075.0 | 38 |
| 76281 | 500.0 | El Salvador | funded | 1157108.0 | 18 |
| 444097 | 1450.0 | Bolivia | funded | 1552939.0 | 51 |
| 402224 | 200.0 | Paraguay | funded | 244945.0 | 3 |
| 634949 | 700.0 | El Salvador | funded | 238797.0 | 21 |
| 1383386 | 100.0 | Philippines | funded | 1248909.0 | 1 |
| 351 | 250.0 | Philippines | funded | 773599.0 | 10 |
| 35651 | 225.0 | Nicaragua | funded | 116181.0 | 8 |
| 784253 | 1200.0 | Guatemala | funded | 2288095.0 | 42 |
| 1328839 | 150.0 | Philippines | funded | 51668.0 | 1 |
| 1094905 | 600.0 | Paraguay | funded | 26717.0 | 18 |
| 336986 | 300.0 | Philippines | funded | 48030.0 | 6 |
| 163170 | 700.0 | Bolivia | funded | 24078.0 | 28 |
| 1323915 | 125.0 | Philippines | funded | 71117.0 | 5 |
| 528261 | 650.0 | Philippines | funded | 580401.0 | 16 |
| 495978 | 175.0 | Madagascar | funded | 800427.0 | 7 |
| 1251510 | 1800.0 | Georgia | funded | 1156218.0 | 54 |
| 642684 | 1525.0 | Uganda | funded | 1166045.0 | 1 |
| 974324 | 575.0 | Kenya | funded | 2924705.0 | 18 |
| 7487 | 700.0 | Tajikistan | funded | 470622.0 | 22 |
| 957 | 1450.0 | Jordan | funded | 3046687.0 | 36 |
| 647494 | 400.0 | Kenya | funded | 260044.0 | 12 |
| 706941 | 200.0 | Philippines | funded | 445938.0 | 8 |
| 889708 | 1000.0 | Ecuador | funded | 201408.0 | 24 |
| 882568 | 350.0 | Kenya | funded | 2370450.0 | 8 |

There are some great (more advanced) tools in Python for working with massive tables of data. In fact this table is a random sample of a data set from Kiva that contains 1.4 million rows! We will move on to more and bigger data sets in time, but for now we need a simple way to work with this sample. To do that we will represent each column of the table as its own list.

To keep your coding easier and cleaner we will show you these lists here, but they will be automatically included for you in later activecodes. You can just use the list by name and it will be fine.
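Transcribed from the sample table, the column lists would look something like the sketch below. The list names are assumed to match the column headers. The id and status columns are left out here because the exercises do not use them (every status is funded). The time_to_raise values are copied exactly as they appear in the table, which does not state their unit.

```python
# Each list holds one column of the sample table. Position i in every
# list refers to the same loan (row i of the table, counting from 0).
loan_amount = [
    1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0, 1200.0,
    150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0, 1800.0, 1525.0,
    575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0, 350.0,
]

country_name = [
    "Azerbaijan", "El Salvador", "Bolivia", "Paraguay", "El Salvador",
    "Philippines", "Philippines", "Nicaragua", "Guatemala", "Philippines",
    "Paraguay", "Philippines", "Bolivia", "Philippines", "Philippines",
    "Madagascar", "Georgia", "Uganda", "Kenya", "Tajikistan", "Jordan",
    "Kenya", "Philippines", "Ecuador", "Kenya",
]

time_to_raise = [
    193075.0, 1157108.0, 1552939.0, 244945.0, 238797.0, 1248909.0,
    773599.0, 116181.0, 2288095.0, 51668.0, 26717.0, 48030.0, 24078.0,
    71117.0, 580401.0, 800427.0, 1156218.0, 1166045.0, 2924705.0,
    470622.0, 3046687.0, 260044.0, 445938.0, 201408.0, 2370450.0,
]

num_lenders_total = [
    38, 18, 51, 3, 21, 1, 10, 8, 42, 1, 18, 6, 28, 5, 16, 7, 54, 1, 18,
    22, 36, 12, 8, 24, 8,
]

# Quick sanity check: every column should have one entry per table row.
print(len(loan_amount), len(country_name),
      len(time_to_raise), len(num_lenders_total))
```

Because the lists are parallel, loan_amount[0], country_name[0], time_to_raise[0] and num_lenders_total[0] all describe the first loan in the table.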

Level 1 Questions

  1. What is the total amount of money loaned?

  2. What is the average loan amount?

  3. What is the largest/smallest loan?

  4. What country got the largest/smallest loan?

  5. What is the variance of the money loaned?

  6. What is the average number of days needed to fund a loan?

The questions in the list above are the way you would probably think of them when brainstorming or having a discussion with a colleague. Answering them in code often requires more precision in the way the questions are posed. We will restate these questions below and make them more precise.

Compute the total amount of money loaned and store it in the variable loan_total.

Compute the average amount of money loaned and store it in the variable loan_average.
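If you have not computed a total or an average over a list before, the accumulator pattern is one way to do it. The sketch below uses a small made-up list called amounts (hypothetical values, not the Kiva data), so the variable names here are only for illustration.

```python
# Hypothetical sample values, not the Kiva data.
amounts = [300.0, 75.0, 920.0, 410.0]

# Accumulator pattern: start at zero and add each value in turn.
total = 0
for amount in amounts:
    total = total + amount

# The average is the total divided by how many values there are.
average = total / len(amounts)

print(total, average)
```

Python's built-in sum function gives the same total in one step, so sum(amounts) / len(amounts) is an equivalent way to write the average.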

Store the amount of the minimum loan in min_loan and the amount of the maximum loan in max_loan. Then, store the name of the country that received the largest loan in max_country and the name of the country that received the smallest loan in min_country. Hint: max and min are built-in Python functions that you can use to find the maximum or minimum value in any sequence.
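One way to connect a maximum or minimum value back to a second list is to look up its position with the list method index. Here is a minimal sketch on two made-up parallel lists (amounts and places are hypothetical, not the Kiva data).

```python
# Hypothetical parallel lists: places[i] goes with amounts[i].
amounts = [300.0, 75.0, 920.0, 410.0]
places = ["Peru", "Ghana", "Nepal", "Samoa"]

largest = max(amounts)
smallest = min(amounts)

# index() returns the position of the first matching value, which we
# can then use to look up the matching entry in the other list.
place_of_largest = places[amounts.index(largest)]
place_of_smallest = places[amounts.index(smallest)]

print(largest, place_of_largest)
print(smallest, place_of_smallest)
```

Keep in mind that index only reports the first position where the value occurs, so if two entries tie for the maximum you will only find the first one.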

Compute the average number of lenders per loan and store it in a variable average_lenders.

Compute the total number of loans made to the Philippines and store it in a variable philippines_count.

For each unique country name, print a line that shows the name of the country and then the number of loans made in that country, like this: "Guatemala 1"
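Counting how often each name appears is a common job for a dictionary. The sketch below shows one possible approach on a short made-up list (names is hypothetical, not the Kiva data); it is not the only way to solve the exercise.

```python
# Hypothetical list with repeated names, not the Kiva data.
names = ["Peru", "Ghana", "Peru", "Nepal", "Peru", "Ghana"]

# Map each unique name to the number of times it has been seen so far.
counts = {}
for name in names:
    counts[name] = counts.get(name, 0) + 1

# Print one line per unique name: the name, then its count.
for name in counts:
    print(name, counts[name])
```

If you only need the count for a single value, the list method count does it directly: names.count("Peru") gives the number of times "Peru" appears.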

Level 2 Questions

  1. What is the average amount of loans made to people in the Philippines?

  2. In which country was the loan granted that took the longest to fund?

  3. What is the average amount of time / dollar it takes to fund a loan?

  4. What is the standard deviation of the money loaned? The Empirical Rule or 68-95-99.7% Rule reminds us that, for normally distributed data, about 68% of the values fall within 1 standard deviation of the mean. Does this hold for our data?

  5. Is there a relationship between the loan amount and the number of lenders? Or the time to fund? How would we measure this? Covariance? Correlation?

The index positions for the Philippines are [5, 6, 9, 11, 13, 14, 22]. Use that information to compute the average loan amount for the Philippines. Store your result in the variable p_average.
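When you already know which positions you care about, you can loop over the positions instead of over the whole list. A small sketch with made-up data (values and positions are hypothetical, not the Kiva lists):

```python
# Hypothetical data: we only want the entries at positions 0, 2 and 4.
values = [10.0, 20.0, 30.0, 40.0, 50.0]
positions = [0, 2, 4]

# Add up just the selected entries by indexing into the list.
total = 0
for i in positions:
    total = total + values[i]

# Divide by the number of selected entries, not the length of values.
selected_average = total / len(positions)

print(selected_average)
```

The detail that is easy to get wrong is the divisor: it should be the number of positions you selected.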

What is the name of the country with the loan that took the longest to raise? Store your result in the variable longest_to_fund

What is the arithmetic mean of the time / dollar it takes to fund a loan? The arithmetic mean is the average of the individual time/dollar calculations, not the average of the sum of time divided by the sum of dollar amounts. Store your result in the variable a_mean
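The difference between the two calculations is easier to see with numbers. The sketch below uses two tiny made-up lists (times and dollars are hypothetical, not the Kiva data) and computes both versions so you can compare them.

```python
# Hypothetical parallel lists: times[i] and dollars[i] describe one loan.
times = [100.0, 300.0]
dollars = [50.0, 100.0]

# Mean of the ratios: compute time/dollar for each loan, then average.
ratio_total = 0
for i in range(len(times)):
    ratio_total = ratio_total + times[i] / dollars[i]
mean_of_ratios = ratio_total / len(times)

# Ratio of the sums: total time divided by total dollars.
ratio_of_sums = sum(times) / sum(dollars)

print(mean_of_ratios, ratio_of_sums)
```

Here the individual ratios are 2.0 and 3.0, so their mean is 2.5, while the ratio of the sums is 400 / 150, which is a bit larger. The exercise asks for the first kind of calculation.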

For our final few exercises we are interested in exploring the distribution of the data as well as the relationships between two of our variables. To do this we need to introduce a few more statistical concepts including variance, standard deviation, covariance and correlation.

Variance looks at a single variable and measures how far the set of numbers is spread out from its average value. However, it's a bit hard to interpret because the units are squared, so it's not on the same scale as our original numbers. This is why most of the time we use the standard deviation, which is just the square root of the variance. A large standard deviation tells us that our data is quite spread out, while a small standard deviation tells us that most of our data is pretty close to the mean.

\[variance = \frac{\sum{ (x-\bar{x})^2}}{n}\]
\[stdev = \sqrt{variance}\]

Don't let the fancy math get you down. The variance is just the sum of the squared differences between each value and the average, divided by the number of values. This is a little more complicated than what you have done before, but you can definitely do this.
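To see the formulas in action, here is a sketch that follows them step by step on a short made-up list (values is hypothetical, not the Kiva data). It divides by n, as in the variance formula used here, and also counts how many values land within one standard deviation of the mean.

```python
import math

# Hypothetical sample values, not the Kiva data.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)
mean = sum(values) / n

# Sum of squared differences from the mean, then divide by n.
squared_diffs = 0
for x in values:
    squared_diffs = squared_diffs + (x - mean) ** 2
variance = squared_diffs / n

# The standard deviation is the square root of the variance.
stdev = math.sqrt(variance)

# Fraction of values within one standard deviation of the mean.
within = 0
for x in values:
    if mean - stdev <= x <= mean + stdev:
        within = within + 1
fraction_within = within / n

print(variance, stdev, fraction_within)
```

For this list the mean is 5, the squared differences add up to 32, and 32 / 8 gives a variance of 4 and a standard deviation of 2. Six of the eight values fall between 3 and 7. Python's statistics module offers pvariance and pstdev, which also divide by n, if you want to check your own results.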

Calculate the variance and the standard deviation of the loan_amount variable. Store the variance in loan_var and the standard deviation in loan_stdev.

In data science we are often most interested in two variables that seem to influence one another. That is, we can observe that as one variable grows a second grows with it, or as one variable grows another variable shrinks at a similar rate. We will look at two ways to explore the relationships between these variables.

Covariance measures the extent to which the larger values of one variable correspond to the larger values of a second variable, as well as the extent to which the smaller values of one variable correspond to the smaller values of the second. If the covariance is positive, the two variables grow together (positive correlation). If the covariance is negative, one variable grows while the other shrinks. The magnitude is hard to interpret because it depends on the scale of the variables. So most often the covariance is normalized so that the values fall between -1 and +1; this is the Pearson correlation coefficient. A value of -1 indicates a perfect negative linear correlation, a value of 0 indicates that the variables are not linearly correlated at all, and a value of +1 indicates a perfect positive linear correlation.

Historically, the Pearson correlation coefficient has been used in recommender systems to find groups of like-minded shoppers who can recommend products to each other. It was the basis of Amazon.com's recommender system from 1997 to 2000. I know this because I was part of the team that wrote that software :-)

\[covariance = \frac{\sum{(x -\bar{x}) \cdot (y-\bar{y})}}{n}\]
\[pearson = \frac{covariance(x,y)}{stdev(x) \cdot stdev(y)}\]
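The two formulas translate fairly directly into small functions. The sketch below is one possible way to write them; the function names and the made-up lists xs, ys and zs are hypothetical and are not part of the Kiva data. It relies on the fact that the variance of a variable is its covariance with itself.

```python
import math

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def covariance(xs, ys):
    """Average product of paired deviations from the means (divides by n)."""
    x_bar = mean(xs)
    y_bar = mean(ys)
    total = 0
    for i in range(len(xs)):
        total = total + (xs[i] - x_bar) * (ys[i] - y_bar)
    return total / len(xs)

def stdev(values):
    """Standard deviation: square root of the covariance of a list with itself."""
    return math.sqrt(covariance(values, values))

def pearson(xs, ys):
    """Covariance rescaled by both standard deviations."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))

# Hypothetical data: ys mostly rises with xs, zs falls steadily as xs rises.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
zs = [10.0, 8.0, 6.0, 4.0, 2.0]

print(covariance(xs, ys), pearson(xs, ys))
print(covariance(xs, zs), pearson(xs, zs))
```

For xs and ys the covariance is 1.6 and both standard deviations are the square root of 2, so the correlation comes out very close to 0.8. For xs and zs the points lie exactly on a falling line, so the correlation is very close to -1. Because of floating point rounding, compare such results with a small tolerance instead of expecting exact values.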

Calculate the Pearson correlation between loan_amount and num_lenders_total, between time_to_raise and loan_amount, or between num_lenders_total and time_to_raise. If you divide up the class, you can compare values to see which pair has the strongest correlation.

Post Project Questions

    During this project I was primarily in my...
  • 1. Comfort Zone
  • 2. Learning Zone
  • 3. Panic Zone
    Completing this project took...
  • 1. Very little time
  • 2. A reasonable amount of time
  • 3. More time than is reasonable
    Based on my own interests and needs, the things taught in this project...
  • 1. Don't seem worth learning
  • 2. May be worth learning
  • 3. Are definitely worth learning
    For me to master the things taught in this project feels...
  • 1. Definitely within reach
  • 2. Within reach if I try my hardest
  • 3. Out of reach no matter how hard I try