- Data Scientists are often hired with the expectation that they will create models, but they may not have the necessary skills to create the data pipeline needed for those models.
- The definition of a Data Scientist is highly variable, and their programming and distributed systems skills can range from beginner to advanced.
- Beginner to intermediate programmers may struggle to create a data pipeline due to a lack of programming, distributed systems, and Big Data skills.
- This can lead to Data Scientists being idle for 2-6 months, which can result in them quitting after about 6 months.
- It is recommended to have a data pipeline in place before hiring a Data Scientist, which may require creating a data engineering team first.
Sometimes I’ll train at a company that’s creating a data engineering team. The team often includes a Data Scientist.
I always make a point of talking to the Data Scientist about their experience and interactions with the team before I arrived. These Data Scientists are recent hires, brought on within the last 6 months. A clear theme is that their time is under-utilized. They’ve been waiting 2-6 months for a Data Engineer to create the data pipeline for them.
The trouble is that the definition of Data Scientist is highly variable. For some, it means a person with math skills who also has some programming skills. The programming and distributed systems skill level of Data Scientists is incredibly variable, too. They can range from beginner programmers to people with a CS degree.
These beginner to intermediate programmers will have the most difficulty creating the data pipeline. They aren’t lacking the math or statistical skills. They’re lacking the programming, distributed systems, and Big Data skills, and creating a data pipeline is a complex endeavor that needs all three.
This skills gap leads to issues all around, because each side expects the other to build the pipeline. The Data Scientist expected the data pipeline to already be created when they were hired. They’re used to creating the models, not doing the hardcore data engineering that’s needed. They’re consumers of the pipeline, not the creators of it. The company and managers, meanwhile, are expecting the Data Scientist to create the data pipeline.
When I’ve encountered this issue, the Data Scientist has been idle for 2-6 months. After about 6 months they’ll quit. They haven’t done any of the cool stuff they thought they were signing on for. At small companies, this spells the end of the Big Data foray.
My suggestion is to make sure you have a data pipeline before hiring your first Data Scientist. This will require you to create a data engineering team before, or at the same time as, you create a data science team. At a minimum, you need to inventory your datasets and make them available before hiring a Data Scientist.
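To make "inventory your datasets" a little more concrete, here is a minimal sketch of what a first pass might look like. It assumes your datasets are CSV files with a header row sitting under a single directory, which is rarely the whole story at a real company. The function name `inventory_datasets` and the fields it reports are hypothetical choices for illustration, not part of any particular tool.

```python
import csv
from pathlib import Path


def inventory_datasets(root):
    """Return one summary dict per CSV file found under `root`.

    Assumption: every CSV file has a header row as its first line.
    """
    inventory = []
    for path in sorted(Path(root).rglob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])           # first row is treated as column names
            row_count = sum(1 for _ in reader)  # remaining rows are counted as data
        inventory.append({
            "path": str(path.relative_to(root)),
            "columns": header,
            "rows": row_count,
            "size_bytes": path.stat().st_size,
        })
    return inventory
```

Calling `inventory_datasets("/data/exports")` (a hypothetical path) would give a list you could hand to a new Data Scientist on day one. A real inventory would also need to cover databases, ownership, and access rules, which this sketch leaves out.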
I talk more about the relationship between a data science and data engineering team in my Data Engineering Teams book. It walks you through the skills the team needs and why they’re so important.