I want to share with you some of the traits that I’ve found in especially good Data Engineers. Not every Data Engineer will have every one of these traits, but you will find several. I can’t stress enough how important it is for a Data Engineer to have a...
Today’s blog post comes from a question from a subscriber on my mailing list. The question comes from Guruprasad B.R.: What are the best ways to ingest data into Big Data (HBase/HDFS) from different sources like FTP, Web, Email, RDBMS, etc.? There are a couple...
In this video, I live code a dedupe algorithm. If you’re not familiar with this algorithm, it takes several data files and removes the duplicates. I show the simple version. Then, I show a more complicated version that adds some custom logic. If you want...
Sometimes companies will start writing code or designing a solution before I train there. This is usually a bad idea. It really shows the difference between Big Data and small data. Making a mistake with small data isn’t costly and doesn’t take long to...
Working with complex and multi-module Maven projects can be a handful. These are a few tips to make that easier. I’m going to use Apache Beam as an example of a multi-module Maven project. The...
In a previous post, I showed how to use Beam’s Regex class to split up a string. In this post, I’m going to show some other features of the Regex class. The Regex class gives you a distributed way to work with strings. I tried to make the...