7.2. Good Data Sources

7.2.1. Data Source List

There are many places on the web where you can find datasets for exploration and analysis. Here is a list of some sites that organize their datasets into categories or make them searchable, to help you find datasets that match your own interests.

Most datasets are distributed under some kind of license. The license spells out the terms and conditions under which you are allowed to use the data. There are many different licenses. This article is a nice summary and comparison of many of the commonly used ones. Please make sure you check the license for any dataset that you wish to use. Nearly all of them are going to be fine for you to use in an educational setting, but it’s good to get in the habit of understanding the limits on what you can and cannot do with a dataset that you do not own.

One restriction that is important to pay attention to is whether or not you can redistribute the data. Most dataset owners do not want you to redistribute their data. You should respect that and document a link to the original data, not your own copy of the data. You should always acknowledge the source of your data as part of your documentation. Many dataset owners even provide you with their preferred way for you to cite a dataset. This is because much of the data that is available to you for free is the result of some academic research work. It is important to the career(s) of the researcher(s) that they are given appropriate credit for the work they have done and have published. It is also just good practice as a member of the scientific community. Most researchers are keenly interested in what others learn from their data, and if you cite it properly, it makes it easy for them to learn about your own work.

Even if you clean the data up, you should never republish or redistribute it under a license different from the one under which it was provided to you.

7.2.2. Screen Scraping Considerations

In section Case Study 1: Screen Scraping the CIA, we take you through the mechanics of screen scraping. In this section, we will look at some of the ethical considerations.

The first thing you should do before you get data from a website via screen scraping is to check the terms and conditions of the site. If screen scraping is prohibited by their terms, then you should definitely move on and look for a different source. If screen scraping is explicitly allowed, then you are good to proceed, but you are not quite finished with your responsible scraping research.

The next thing to check is the site’s robots.txt file. Many sites have these files to tell automated screen scraping programs, like Google’s web crawler, about any pages on their site the owners do not want to be scraped. Most sites have robots.txt in the top level of their domain. For example, the site robotstxt.org (which is a good resource for learning about the format of the robots.txt file) has the following information at the URL http://www.robotstxt.org/robots.txt.

User-agent: *

# too many repeated hits, too quick
User-agent: litefinder
Disallow: /

# Yahoo. too many repeated hits, too quick
User-agent: Slurp
Disallow: /

# too many repeated hits, too quick
User-agent: Baidu
Disallow: /

The first line says that, in general, robots are allowed to read the pages of the website. However, litefinder, Slurp, and Baidu are all asked to move along without reading any of the pages on the site. A site can also single out individual pages or directories by disallowing them explicitly. See the site for details.
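Checking these rules by hand gets tedious, so it can help to let Python do it. The following is a minimal sketch using the standard library's urllib.robotparser module. The robots.txt text, the example.com URLs, the bot name my-class-project-bot, and the helper function is_allowed are all made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration: everyone may read the site
# except the /private/ directory, and Slurp is asked to stay away entirely.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Slurp
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's contents as a list of lines, so no
    # network access is needed for this example.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical bot name and URLs.
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-class-project-bot",
                 "https://example.com/data/countries.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-class-project-bot",
                 "https://example.com/private/report.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "Slurp",
                 "https://example.com/data/countries.html"))
```

To check a live site instead of a pasted string, you would typically call the parser's set_url method with the address of the site's robots.txt and then call read() before asking can_fetch. Keep in mind that robots.txt only covers automated access; it does not replace reading the site's terms and conditions.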

If the site does not explicitly allow or disallow scraping, the best policy is to contact the owners and ask permission. It is much easier to get permission, or to learn up front that you should stay away, than it is to deal with a cease and desist letter from corporate lawyers after the fact.

For most academic projects, like class assignments, sites are happy to help you learn and are happy to share their data. If your class project turns into an entrepreneurial adventure and you start making money, then you should probably revisit the license and permissions.
