- Big Data has several different hard problems that cannot be solved by changing just one thing.
- Big Data is 10-15x more complex than small data.
- The three main problems for Big Data are operations, development, and management.
- Management is crucial to the success of the project and problems tend to materialize early on.
- Operational problems can be reduced in complexity by moving to the cloud or using purpose-built software.
There’s a common misconception that if you change just one thing in Big Data, everything else will get easier. In reality, Big Data has several different hard problems, and solving one of them doesn’t solve the others.
Sometimes, I’ll see tweets or posts about how companies or vendors haven’t made Big Data easy. These posts assume that everything about Hadoop can be made simple. They also perpetuate the assumption that there’s only one hard problem to solve.
Big Data is complex. In chapter 2 “The Need for Data Engineering” in Data Engineering Teams, I show how Big Data is 10-15x more complex than small data.
The three main problems for Big Data are: operations, development, and management.
Management
Setting up the team correctly is crucial to the success of the project. I make that point over 73 pages in Data Engineering Teams.
There isn’t much that can be done to make this part easier. I’ve written the book that gives the steps. If you still need help, we provide mentoring services for management and teams.
Problems in management tend to materialize early on. These problems are the culprits behind the early failures of Big Data projects. These projects just never go anywhere because they have the wrong people on the team.
Operations
Operational problems can be the easiest to reduce in complexity. You can move entirely to the cloud and remove the majority of operational overhead. You can use purpose-built software like Cloudera Manager or Apache Ambari. These allow you to have fewer people monitor and maintain a cluster, but don’t remove the need for operations people.
Operations problems tend to manifest after the first few months of the project.
Development
Development problems are the most difficult to reduce in complexity. Many people think that the move from Apache Hadoop to Apache Spark will reduce complexity. It doesn’t.
Others think that the complexity stems from Hadoop or Spark being immature. It doesn’t; it comes from them being general-purpose systems.
Development problems tend to manifest throughout the project. A data pipeline is constantly being updated and added to. If the development team isn’t ready, these updates will take forever or the team will say they aren’t possible.
I stress the need for qualified Data Engineers. Without proper training and resources, data engineering projects never finish.
What to Do?
Some problems can be lessened, and others require smart people. Don’t fall for the misconception that these problems can be magically made easy. In Big Data, an ounce of prevention is worth a ton of cure.