Saying You Have Small Data Isn’t Belittling Your Use Case

Blog Summary: (AI Summaries by Summarizes)
  • Many engineers starting out with Big Data ask which technology to use for processing a dataset of 3 billion rows in 10,000 files that is 100 GB in size.
  • The assumption is that small data technologies can't handle this, but this is a misunderstanding of what Big Data is and isn't.
  • A dataset of 100 GB can easily fit in memory, so it's likely not a Big Data problem.
  • Using a relational database instead of a Big Data technology has benefits such as less conceptual complexity, more prevalence in the marketplace, and faster speeds of queries.
  • When someone tells you that your use case is small data, they're not belittling you, they're saving you time, money, and effort.

There is a common question from engineers starting out with Big Data. An engineer will post on a social media site saying “I need to know which Big Data technology to use. I have 3 billion rows in 10,000 files. The whole dataset is 100 GB. Is Big Data Technology X efficient for processing this?”

The short answer is no. The long answer is more than likely no, and only a qualified data engineer can tell you for sure.

The issue starts with a misunderstanding of what Big Data is and isn’t. Here’s my definition. The person is assuming that small data technologies can’t do something for them. After all, 3 billion rows sounds like a lot. It isn’t.

If you think about it, you can easily provision a VM with 256 GB of RAM. For a dataset of 100 GB, the entire dataset could fit in memory. There are some nuances like how much this dataset will grow and the complexity of the processing, but this probably isn’t a Big Data problem.
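The arithmetic behind that claim can be sketched in a few lines. This is a back-of-the-envelope check, not a sizing tool: the row count, file count, and dataset size come from the question above, the 256 GB VM is the one mentioned in the previous paragraph, and the function name `dataset_profile` and the use of decimal gigabytes are assumptions made for illustration.

```python
# Figures quoted in the question above.
ROWS = 3_000_000_000
FILES = 10_000
DATASET_GB = 100

# Assumption: the 256 GB VM mentioned above, using decimal gigabytes.
VM_RAM_GB = 256
BYTES_PER_GB = 10**9


def dataset_profile(rows, files, dataset_gb, ram_gb):
    """Return rough per-row and per-file sizes and whether the data fits in RAM."""
    total_bytes = dataset_gb * BYTES_PER_GB
    return {
        "bytes_per_row": total_bytes / rows,
        "mb_per_file": total_bytes / files / 10**6,
        "rows_per_file": rows / files,
        "fraction_of_ram": dataset_gb / ram_gb,
        "fits_in_ram": dataset_gb < ram_gb,
    }


profile = dataset_profile(ROWS, FILES, DATASET_GB, VM_RAM_GB)
for name, value in profile.items():
    print(f"{name}: {value}")
```

Under these assumptions each row averages about 33 bytes, each file is about 10 MB, and the whole dataset takes up roughly 39% of the VM's memory. The in-memory representation can be larger than the on-disk size, which is one of the nuances to check before committing to a design.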

In the threads answering these questions, someone usually responds that the use case doesn’t need Big Data. Sometimes, the original poster will get insulted or think that people are belittling their use case. They aren’t.

This is because their use case would be so much better off in a small data technology like a relational database. Using a relational database instead of a Big Data technology has these major benefits:

  • Less conceptual complexity
  • More prevalent in the marketplace
  • More people who know the technology
  • Easier operationally
  • Faster queries
  • Cheaper operationally, technically, and people-wise
  • Shorter development cycles

When someone tells you that your use case is small data, they aren’t belittling you or your use case. They’re saving you time, money, and effort.

For toy and personal projects, using Big Data technologies on these sorts of small datasets is fine. If you’re doing this for a production use case or a real project, do yourself a favor and stick to the small data technologies.

If you do have Big Data problems, you are specifically held back by a small data technology limitation. You are saying “can’t” because you are hitting a known technical limitation. The only way to solve these problems is with Big Data technologies. For these problems you will need data engineers.

Remember that even if you have Big Data use cases, not every use case within an organization requires Big Data. There are still small data use cases that work nicely in their small data technologies. Using Big Data technologies for every use case will bring the same sorts of issues to those small data use cases.
