- Many engineers starting out with Big Data ask which technology to use for processing a dataset of 3 billion rows in 10,000 files that is 100 GB in size.
- The assumption is that small data technologies can't handle this, but this is a misunderstanding of what Big Data is and isn't.
- A dataset of 100 GB can easily fit in memory, so it's likely not a Big Data problem.
- Using a relational database instead of a Big Data technology has benefits such as less conceptual complexity, greater prevalence in the marketplace, and faster query speeds.
- When someone tells you that your use case is small data, they're not belittling you; they're saving you time, money, and effort.
There is a common beginner question from engineers starting out with Big Data. An engineer will post to a social media site saying “I need to know which Big Data technology to use. I have 3 billion rows in 10,000 files. The whole dataset is 100 GB. Is Big Data Technology X efficient for processing this?”
The short answer is no. The long answer is more than likely no, and only a qualified data engineer can tell you for sure.
The issue starts with a misunderstanding of what Big Data is and isn’t. The person is assuming that small data technologies can’t do something for them. After all, 3 billion rows sounds like a lot. It isn’t.
If you think about it, you can easily provision a VM with 256 GB of RAM. For a dataset of 100 GB, the entire dataset could fit in memory. There are some nuances like how much this dataset will grow and the complexity of the processing, but this probably isn’t a Big Data problem.
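A quick back-of-envelope check makes the point, using the numbers from the question above (the 2x in-memory overhead factor is my own illustrative assumption, not a measured figure):

```python
# Back-of-envelope sizing check using the numbers from the post.
rows = 3_000_000_000   # 3 billion rows
dataset_gb = 100       # total dataset size
ram_gb = 256           # an easily provisioned single VM

bytes_per_row = dataset_gb * 1024**3 / rows
print(f"~{bytes_per_row:.0f} bytes per row")  # roughly 36 bytes per row

# Even assuming a generous 2x in-memory overhead versus the on-disk size,
# the whole dataset fits comfortably on one machine.
assumed_overhead = 2
print("Fits in RAM on one VM:", dataset_gb * assumed_overhead < ram_gb)  # True
```

Thirty-odd bytes per row is tiny, and the whole dataset leaves plenty of headroom on a single commodity VM.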
In the threads answering these questions, another person will respond that the use case doesn’t need Big Data. Sometimes the original poster will get insulted or think that people are belittling their use case. They aren’t.
It’s because their use case would be so much better off in a small data technology like a relational database. Using a relational database instead of a Big Data technology has these major benefits:
- Less conceptual complexity
- Greater prevalence in the marketplace
- More people who know the technology
- Easier operations
- Faster query speeds
- Lower costs operationally, technically, and people-wise
- Shorter development cycles
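To make the relational option concrete, here is a minimal sketch using Python’s built-in sqlite3 module: the same filter-and-aggregate work people reach for Big Data tools to do is a one-line SQL query in an ordinary relational database. The table and column names are hypothetical, purely for illustration.

```python
import sqlite3

# Hypothetical example: a small events table queried with plain SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# A typical aggregation: total amount per user.
total_per_user = conn.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
print(total_per_user)  # [(1, 15.0), (2, 7.5)]
conn.close()
```

No cluster, no job scheduler, no distributed file system: one process, one file (or here, memory), and a query language most developers already know.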
When someone tells you that your use case is small data, they aren’t belittling you or your use case. They’re saving you time, money, and effort.
For toy and personal projects, using Big Data technologies on these sorts of small datasets is fine. If you’re doing this for real, for a production use case or a real project, do yourself a favor and stick to the small data technologies.
If you do have Big Data problems, you are specifically held back by a small data technology’s limitations. You say “can’t” because you are hitting a known technical limitation. The only way to solve these problems is with Big Data technologies, and for these problems you will need data engineers.
Remember that even if you have Big Data use cases, not every use case within an organization requires Big Data. There are still small data use cases that work nicely in their small data technologies. Using Big Data technologies for every use case will bring these same sorts of issues back when dealing with the small data use cases.