There Are Several Hard Problems with Big Data

Blog Summary: (AI Summaries by Summarizes)
  • Big Data has several different hard problems that cannot be solved by changing just one thing.
  • Big Data is 10-15x more complex than small data.
  • The three main problems for Big Data are operations, development, and management.
  • Management is crucial to the success of the project and problems tend to materialize early on.
  • Operational problems can be reduced in complexity by moving to the cloud or using purpose-built software.

There’s a common misconception that if I just change one thing in Big Data, everything else will be easier. In reality, Big Data has several different hard problems, and solving one of them doesn’t solve the others.

Sometimes, I’ll see tweets or posts about how companies or vendors haven’t made Big Data easy. These posts assume that everything about Hadoop can be made simple. They also perpetuate the assumption that there’s only one hard problem to solve.

Big Data is complex. In Chapter 2, “The Need for Data Engineering,” of Data Engineering Teams, I show how Big Data is 10-15x more complex than small data.

The three main problems for Big Data are: operations, development, and management.

Management

Setting up the team correctly is crucial to the success of the project. I make that point over 73 pages in Data Engineering Teams.

When it comes to making management easier, there isn’t much that can be done. I’ve written the book giving the steps. If you still need help, we provide mentoring services for management and teams.

Problems in management tend to materialize early on. These problems are the culprits behind the early failures of Big Data projects. These projects just never go anywhere because they have the wrong people on the team.

Operations

Operational problems can be the easiest to reduce in complexity. You can move entirely to the cloud and remove the majority of operational overhead. You can use purpose-built software like Cloudera Manager or Apache Ambari. These allow you to have fewer people monitor and maintain a cluster, but don’t remove the need for operations people.

Operations problems tend to manifest after the first few months of the project.

Development

Development problems are the most difficult to reduce in complexity. Many people think that the move from Apache Hadoop to Apache Spark will reduce complexity. It doesn’t.

Others think that the complexity stems from Hadoop or Spark being immature. It doesn’t; it comes from them being general-purpose systems.

Development problems tend to manifest throughout the project. A data pipeline is constantly being updated and added to. If the development team isn’t ready, these updates will take forever or the team will say they aren’t possible.

I stress the need for qualified Data Engineers. Without proper training and resources, data engineering projects never finish.

What to Do?

Some problems can be lessened and others require smart people. Don’t fall into the misconception that these problems can be magically made easy. In Big Data, an ounce of prevention is worth a ton of cure.
