Personal Project Data Sources

Blog Summary:
  • Having no previous Big Data experience is not a barrier to getting hired as a Data Engineer if you have a well-executed personal project that showcases your skills.
  • Looking at available datasets can help you come up with an idea for your personal project and keep you focused.
  • Some interesting and novel datasets for personal projects include Planet's satellite imagery API, The GDELT Project's real-time monitoring of global news, municipal data dashboards, a large dataset of Jeopardy questions on Reddit, the GitHub Dataset on Google Cloud, and public datasets on Amazon Web Services.
  • There are also dedicated subreddits and lists of datasets for data science and machine learning projects.
  • If you don't find an interesting dataset, you can create one and highlight that in your project.

You don’t have previous Big Data experience, but you want to get hired as a Data Engineer. Don’t worry, you can get hired. You’ll need a well-executed personal project that gets you noticed and shows your skills. I’ve verified this with hiring managers all over the place. They will hire someone brand-new if that person has an awesome personal project.

You’ll obviously need the Big Data skills to complete the project and get the job. The next big hangup is coming up with an idea. You can constrain this idea search by looking at available datasets. This helps narrow down your search and keeps you focused.

I’m going to share a few datasets that are both novel and interesting. These can lead to the kinds of personal projects that will get you noticed.

Planet

Planet has an API for going through satellite imagery over time. They have a free tier for their API. This could be a fascinating way to add or process imagery for your personal project.

The GDELT Project

The GDELT Project is a site that monitors the world’s broadcast, print, and web news. All of this is done in real-time. They automatically translate from over 100 languages. You could start comparing how news is covered in the same language in the same country.

The project has created and participated in demos and challenges. They visualized the interconnectedness of the media ecosystem. They’re looking at fake news. You can find more of their projects that used BigQuery from Google Cloud.

Your Municipality

Depending on where you live, your city may have a municipal data dashboard. I live in Reno, and the local citizens have curated the city’s data.

Although Reno is just an example, many other cities publish their data. It could give your personal project a great local feel and local interest.

You can find similar data at the state or province level.

Jeopardy Questions

There is a large dataset on Reddit of Jeopardy questions. Could you use the GDELT or Wikipedia datasets to answer the questions?

GitHub Dataset

Google Cloud has the GitHub Dataset. You can run some interesting analyses on it. Felipe Hoffa answered some age-old programming questions with it.

Cloud Datasets

I’d be remiss if I didn’t point out some of the public datasets on Amazon Web Services.

Others

Updates:

Rainbow Six Siege released a 20 GB dataset. This could be an interesting project if you like games.

There is an entire subreddit dedicated to datasets.

Springboard has a list of data sources for data science projects.

Update: Here is another list of datasets that is more focused on machine learning.

What to do now?

Get the technical skills and create your project. If you didn’t see an interesting dataset, make one up (but be sure to point that out).
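If you do go the made-up route, here is a minimal sketch of one way to do it in Python using only the standard library. Everything in it is an assumption for illustration: the order schema, the city names, the `is_synthetic` column, and the function names are hypothetical and don't come from any real dataset. The idea is to generate the rows from a fixed seed so the data is reproducible, and to label every row as synthetic so nobody mistakes it for real data.

```python
import csv
import io
import random


def make_synthetic_orders(n_rows, seed=42):
    """Return a list of made-up order records (hypothetical schema)."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    cities = ["Reno", "Sparks", "Carson City"]  # illustrative values only
    rows = []
    for order_id in range(1, n_rows + 1):
        rows.append({
            "order_id": order_id,
            "city": rng.choice(cities),
            "amount_usd": round(rng.uniform(5.0, 500.0), 2),
            "is_synthetic": True,  # point out that the data is made up
        })
    return rows


def to_csv(rows):
    """Serialize the records to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = make_synthetic_orders(1000)
csv_text = to_csv(rows)
```

From there you could write `csv_text` to a file and load it into whatever Big Data tool your project is showcasing. Keeping the seed and the generator script in your repository is one simple way to point out that the data is made up.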

In my Platinum level of Professional Data Engineering, I share the tips and strategies that made one of my personal projects go viral.
