2.6. Case Study 2: Considering Starting a Business?¶
Data science and data analytics can be used to analyze and understand data related to many different fields, such as education, business, targeted advertising, healthcare, and many more. In this case study, we will explore a data set related to starting a business.
2.6.1. Thinking About Starting Your Business¶
This case study utilizes the starting a business (also called business startup) data set obtained from the Doing BusinessWorld Bank website. The data set contains indicators from over 190 countries that measure the relative ease of starting a business in those countries. The data set looks at two limited liability companies in various regions and countries around the world.
Each country in the data set measures things such as the minimum amount of capital investment an entrepreneur must have to start a business, and the number of procedures that are necessary to register the business, and more that will be covered throughout this case study.
The ease of starting a business for each country is measured on a scale of 0100. A score of zero represents the country with the lowest performance while 100 represents the country with the highest performance.
Below are definitions of the indicators found in the data set.
Starting a Business Rank: Countries are ranked on a score representing the relative ease of starting a business.
Starting a Business Score: These scores are calculated using the simple average of all the indicators’ scores.
Procedures: The activities that must be accomplished to get the business registered with its associated governmental entity.
Time (days): The median number of days needed to get the business up and running for each country/region.
Cost: The amount of money that must be expended to get the business started (such as official, legal, and professional services fees, etc.).
PaidIn Minimum Capital: The minimum amount of money the entrepreneur must have in the bank for the business registration process to be completed.
Income Level: This represents the income levels of each country’s economy. This indicator is divided into low, lowermiddle, uppermiddle, and high, based on a country’s gross national income (GNI) per person.
We will use Google Sheets to explore which of these indicators are most important to start a new business in each economy’s largest cities. Import the data set that you downloaded earlier, starting a business, into Google Sheets.
Business StartUp Rank

Incorrect
Income Level

Correct
Business StartUp Score

Incorrect
Procedure – Men (number)

Incorrect
Q1: Identify and select the columns in the business startup dataset that represent categorical (nominal) data?
2.6.2. Business StartUp Data Analysis Research Questions¶
What are the different factors that lead to a high ranking in the business startup dataset?
What role does “income level” play in determining the rank of a country?
What factor, on average, contributes most/least to the business startup rank?
What similarities and differences do the countries experiencing the highest/lowest business startup rank have? Are there any discrepancies between different score factors of countries with similar rankings?
For the countries with the largest change, which factors changed the most? Are these factors the same as you identified in the first 3 questions?
The data set lists countries based on their business startup scores. While it is easy to see the best countries for starting a business based on the business startup rank, it is not easy to grasp the relative simplicity of each country. We can use the functions that we learned in the previous case study to create a common baseline: average, standard deviation, and median. So, let’s average the business startup score of all countries together.
Use the
AVERAGE
function to calculate the mean in column D. Scroll down and click on a cell in column 194. There should be an empty cell below the column of numbers for the business startup score. Now type=AVERAGE(D2:D192)
. You can also type=AVERAGE(
and then click and drag the numbers you want.D2:D192
specifies a range, from Column D Row 2 down to Column D Row 192.Many formulas in Google Sheets use ranges. They can span cells in a single column, single row, and they can span rows and columns, which form a rectangular shape. Try it yourself, calculate the range of:
E2:O2
E2:E192
E2:L192
Standard deviation is the average distance from the mean. It shows how spread out the data is more than other types of variabilities. The median is also as important because it provides another kind of baseline besides mean and mode. Calculate the
STDEV
andMEDIAN
for the business startup score column.Calculate the standard deviation and median by copying and pasting the formula to other columns.
Copy the formula for
=AVERAGE(D2:D141)
from a, and the formula for standard deviation from c then calculate:
Q2: What is the mean value for the GNI?
Q3: What is the standard deviation for the GNI?
Remember, use a
$
so Google Sheets will not change the cell references when copy/pasting.
2.6.3. Visualizing How to Start a Business¶
1. Visualizing the data is a great way to begin to interpret the data because doing so allows the viewer to easily see trends or find outliers. A histogram is one way to visualize the standard deviation of a particular data set.
2. When you have a data set covering the entire world, it can be interesting to identify certain information. For instance, you can calculate which countries have the largest or smallest GNI, the income per capita of women and men, and so on.
Remember, finding the maximum value of a column does not mean we know which country it corresponds to. Therefore, we can use the
MATCH
andINDEX
functions to fix this problem. Let’s find what country corresponds to the maximum value of GNI. First, calculate the maximum GNI in cell M193, then in cell M194 type=MATCH(M193, M2:M192, 0)
. Notice that the match function searches for the value in cell M193 in the rangeM2:M192
, and the 0 tells Google Sheets that the data is not sorted. The 0 is important because, without it, sheets will assume the data is sorted and will stop when it finds a value greater than the value in M194.Type
=INDEX(A2:A192, M194)
in cell M195. TheA2:A192
parameters is the range from whichINDEX
will return a corresponding value; in this case, it is the location. M194 from the previous question is=MATCH(M193, M2:M192, 0)
. So theINDEX
is practically telling sheets to find the location, from column A, that is found in the same row as the maximum value.All three steps shown above can be performed in a single cell. Let’s look at the country that has the lowest Procedure Men number. In cell E193 type
=INDEX($A2:$A141, MATCH(MIN(E2:E141), E2:E141, 0))
. TheMATCH
andMIN
functions both return one value. So, sheets will first find the minimum value in cellsJ2:J141
. Then it will use theMATCH
function to find the cell location (column and row) of where that minimum value is. Finally, it will use theINDEX
function to find what value fromA2:A141
matches up with the given parameters. Try this and see what it returns. It should return New Zealand, its region, business startup rank, and business startup score.Practice using the functions you have learned by finding the names of locations for other columns.
If you want to copy/paste, check the ranges carefully and add the
$
sign to avoid running into errors.
3. Another great way of visualizing data is to use a choropleth. As you know, a choropleth takes in a set of geographic data and uses a map to show another set of data, such as business startup score.
Click on Insert then select Chart
On the new Chart editor section, click on Chart Type and select Geo Chart
Select location column (
A2:A192
) as the region and any column that you wish to see as the Color.You may hover around each country to see its respective statistic.
4. You may be wondering if there is a correlation between a country’s ease of starting a business score and GNI or procedure.
One way to check this is to use the CORREL
function to see how the score is affected by each factor i.e., business startup score to GNI,
business startup score to the procedure, business startup score to time.
We can use the
CORREL
function to calculate the Pearson correlation between two ranges of data. Remember to use the$
sign to anchor the column and the values in the business startup data set$D$2:$D192
, so it doesn’t change when it is copypasted to use for other columns.Repeat the above exercise by changing or copypasting it to different columns to see the correlation with other factors listed.
5. To better understand what leads a country to have a high score in creating a business, calculate the top and bottom 20 countries’ correlation scores. Are there any interesting results?
Calculate the mean of each factor for the top 20 countries, then do so for the bottom 20 countries. Calculate the difference in those averages for each of the factors for the top and bottom 20 countries. Which factors have the most impact on the business startup score?
6. While using the choropleth, you might have noticed some outliers in the data, for example, South Africa has one of the lowest cost of starting a business but is ranked 139. The countries above and below South Africa have a cost of 5 and 5.7 while South Africa has a cost of 0.2.
For all countries, compute the countries’ average cost immediately above and below it and subtract that from the chosen country’s average cost. Store your findings in a new separate column called Average Cost.
Use conditional formatting to help visually pick out the outliers.
Sort the data by selecting the average cost column containing the average costs, clicking on the Data Tab and select the Sort sheet by A > Z.
Lesson Feedback

During this lesson I was primarily in my...
 1. Comfort Zone
 2. Learning Zone
 3. Panic Zone

Completing this lesson took...
 1. Very little time
 2. A reasonable amount of time
 3. More time than is reasonable

Based on my own interests and needs, the things taught in this lesson...
 1. Don't seem worth learning
 2. May be worth learning
 3. Are definitely worth learning

For me to master the things taught in this lesson feels...
 1. Definitely within reach
 2. Within reach if I try my hardest
 3. Out of reach no matter how hard I try