11 Websites to Find Free, Interesting Datasets
If you're new to the data space, recently learned a new skill, or are trying to build a more robust data science or analyst portfolio, a good way to solidify what you've learned is to do a few mini-projects that exercise it. Below we outline a few places where you can find publicly available data for your next project.
If you're interested in practicing real data scientist and analyst interview questions, feel free to sign up for our email newsletter, where we send a few curated questions per week to help you prepare for interviews at top companies.
1. FiveThirtyEight
FiveThirtyEight is an interactive news and sports site that has some incredible data visualizations (which you should totally check out). They make a lot of their data open to the public, meaning you can download and play with the source data yourself! Here are some examples:
- Airline Safety — contains information on accidents from each airline.
- US Weather History — historical weather data for the US.
- Study Drugs — data on who's taking Adderall in the US.
2. BuzzFeed News
BuzzFeed makes the data sets, analysis, libraries, tools, and guides used in its articles available on GitHub. Check them out to learn from some of the best! Here are some examples:
- Federal Surveillance Planes — contains data on planes used for domestic surveillance.
- Zika Virus — data about the geography of the Zika virus outbreak.
- Firearm Background Checks — data on background checks of people attempting to buy firearms.
3. Kaggle
Kaggle, acquired by Google in 2017, is a place where you can learn, practice, and fine-tune your data science and analytics skills. It has tons of data that's open to the public, and it lets users of the platform share code so you can learn best practices within the data space. It also hosts competitions where you can win real money if you have a top-ranking model!
4. Socrata
Socrata hosts cleaned open data sets spanning government, business, and education. Here are some examples:
- White House Staff Salaries — data on the salary of each White House staffer in 2010.
- Radiation Analysis — data on which milk products in the US were radioactive based on location.
- Workplace Fatalities by US State — the number of workplace deaths across the US.
5. Awesome-Public-Datasets on GitHub
This GitHub repository hosts a library of awesome public datasets! They are sorted by category, and each entry links you straight to the hosting website. Here are some examples:
- Global Climate Data — climate information for every country in the world with historical data (some date back to 1929).
- Heart Rate Time Series Data — two time series, each containing 1800 evenly spaced measurements of instantaneous heart rate from a single subject.
- Plane Crash Database — plane crash data dating from 1929.
6. Google Public Datasets
Google lists its public data sets on a single page. They are hosted on Google Cloud Platform (GCP), Google's cloud hosting service, and you can explore them with a query tool called BigQuery. You'll need to sign up for a GCP account, but the first 1 TB of query data processed each month is free! Here are some examples:
- US Name Data Set — contains all names from social security card applications for births that occurred after 1879.
- Major League Baseball Data — pitch-by-pitch data for Major League Baseball (MLB) games in 2016.
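To give a feel for querying these data sets, here is a minimal Python sketch that builds a BigQuery SQL statement for the most common names in a given year. It rests on assumptions: the table name `bigquery-public-data.usa_names.usa_1910_current` and its `name`, `year`, and `number` columns are taken to be the public names table, and the helpers `build_top_names_query` and `run_query` are hypothetical names made up for this example. Running the query itself needs the `google-cloud-bigquery` package and GCP credentials.

```python
def build_top_names_query(year: int, limit: int = 10) -> str:
    """Return SQL for the most common names recorded in a given year."""
    # Assumed table and columns; check the data set's schema before relying on them.
    table = "bigquery-public-data.usa_names.usa_1910_current"
    return (
        "SELECT name, SUM(number) AS total\n"
        f"FROM `{table}`\n"
        f"WHERE year = {int(year)}\n"
        "GROUP BY name\n"
        "ORDER BY total DESC\n"
        f"LIMIT {int(limit)}"
    )


def run_query(sql: str):
    """Run the SQL on BigQuery and return the rows as plain dictionaries."""
    # Imported lazily so the query builder above works without the package installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default GCP project and credentials
    return [dict(row.items()) for row in client.query(sql).result()]


if __name__ == "__main__":
    # Print the SQL only; call run_query(...) once your GCP account is set up.
    print(build_top_names_query(2000, limit=5))
```

Keeping the SQL builder separate from the client call lets you inspect the query before spending any of your free quota.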
7. UCI Machine Learning Repository
The University of California, Irvine hosts 440 data sets as a service to the machine learning community. These data sets are nice because most of them are squeaky clean and ready for modeling! Here are some examples:
- Iris Data Set — the most famous pattern recognition dataset.
- Wine Data Set — using chemical analysis to determine the origin of wine.
- Forest Fires — try to predict the burn area of forest fires using this dataset.
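As an illustration of how little preparation these files need, the sketch below parses rows in the layout of the Iris data file (four comma-separated measurements followed by the species, with no header row) using only the Python standard library. The `SAMPLE` string is a four-row excerpt included so the example runs offline, and the function names `parse_iris` and `mean_by_species` are hypothetical helpers, not part of any UCI tooling.

```python
import csv
import io
from collections import defaultdict

# Column order for the Iris file, which ships without a header row.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# A few rows in the same layout as the Iris file, so the sketch runs offline.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def parse_iris(text):
    """Turn the raw CSV text into a list of dictionaries with float measurements."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines, such as trailing ones at the end of the file
            continue
        *measurements, species = record
        row = dict(zip(COLUMNS[:4], map(float, measurements)))
        row["species"] = species
        rows.append(row)
    return rows


def mean_by_species(rows, column):
    """Average one numeric column separately for each species."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["species"]].append(row[column])
    return {species: sum(values) / len(values) for species, values in groups.items()}


if __name__ == "__main__":
    print(mean_by_species(parse_iris(SAMPLE), "petal_length"))
```

To use the full data set, read the downloaded file's contents and pass them to `parse_iris` in place of `SAMPLE`.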
8. Data.gov
Data.gov lets you download and explore data from multiple US government agencies, ranging from government budgets to climate data. The data is well documented, so you should have an easy time navigating the sources. You can browse by topic area or search for a specific data set directly on the site, without registering. Here are some examples:
- Food Environment Atlas — contains data on how local food choices affect diet in the US.
- School System Finances — a survey of the finances of school systems in the US.
- Chronic Disease Data — data on chronic disease indicators in areas across the US.
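Searching for a specific data set can also be scripted. The sketch below assumes the Data.gov catalog exposes the standard CKAN `package_search` endpoint at `https://catalog.data.gov/api/3/action/package_search` and returns the usual CKAN response shape (`result.results`, each with a `title` and a list of `resources`). If the catalog API has changed, treat this as a pattern rather than a working client. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the CKAN search action on the Data.gov catalog.
API = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Build the search URL for a free-text query, limited to `rows` results."""
    return f"{API}?{urlencode({'q': query, 'rows': rows})}"


def summarize(payload):
    """Extract each data set's title and the file formats it is offered in."""
    results = payload.get("result", {}).get("results", [])
    summary = []
    for package in results:
        formats = {
            resource.get("format")
            for resource in package.get("resources", [])
            if resource.get("format")
        }
        summary.append({"title": package.get("title", ""), "formats": sorted(formats)})
    return summary


def search(query, rows=5):
    """Call the catalog over the network and return the summarized results."""
    with urlopen(build_search_url(query, rows), timeout=30) as response:
        return summarize(json.load(response))


if __name__ == "__main__":
    # Print the URL only; call search(...) when you have network access.
    print(build_search_url("food environment atlas"))
```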
9. Academic Torrents
Academic Torrents is a site geared toward sharing the data sets from scientific papers. It has tons of interesting data sets, which you can browse and download directly on the site. Here are some examples:
- Enron Emails — a set of many emails from executives at Enron, a company that famously went bankrupt.
- Student Learning Factors — a set of factors that measure and influence student learning.
- News Articles — contains news article attributes and a target variable.
10. Quandl
Quandl is a repository of economic and financial data. Some of the datasets are free, while others are up for purchase. Here are some examples:
- Entrepreneurial Activity by Race and Other Factors — contains data from the Kauffman Foundation on entrepreneurs in the US.
- Chinese Macroeconomic Data — indicators of Chinese economic health.
- US Federal Reserve Data — US economic indicators from the Federal Reserve.
11. Jeremy Singer-Vine
Jeremy Singer-Vine collects awesome data sets across multiple sources. If you're interested in getting data sets straight to your inbox, you should consider signing up for his newsletter.