- Apache Beam has just had its first API stable release, highlighting the growth of the project and increased usage in pre-production/development or production deployments.
- Beam is a cross-language and framework-independent way of creating data pipelines, with a unified API for both batch and streaming computation.
- Beam is future-proof, allowing users to easily tune requirements around completeness and latency and run the same code across multiple runtime environments.
- Favorite features of Beam include StatefulDoFn and windowing/trigger, the ability to easily adapt business logic as needs change, and the ability to run the same code in different distributed systems.
- Beam SDKs are available in multiple languages (currently Java and Python) and run on any supported runner, with Python support on all runners coming soon.
Apache Beam just had its first API stable release. Now that we have one, I want to give an update on what's changed in the Beam ecosystem and highlight the growth of Beam as a project and its increased usage in pre-production/development or production deployments. Each committer and user is sharing their own opinion and not necessarily that of their company.
If you want to see how far Beam has come, you can read the original Q and A I did for the first release of Beam.
Our interviewees are:
- Jesse Anderson (JA)
- Luke Cwik (LC)
- Kenneth Knowles (KK)
- Ted Yu (TY)
- James Xu (JX)
- Jingsong Li (JL)
- Joshua Fox (JF)
- Ahmet Altay (AA)
- Frances Perry (FP)
- Ismaël Mejía (IM)
- Tyler Akidau (TA)
How do you describe Beam to others?
JA: I describe Beam as a cross-language and framework-independent way of creating data pipelines.
JX: A unified API to do both batch and streaming computation — The new assembly language for big data processing.
JL: Apache Beam is a portable, unified, cross-language programming model. It is very flexible too. (I love StatefulDoFn.)
JF: I describe it as like Spark or Hadoop but with more plumbing out of the box, including deployment of third-party libraries (jars) and autoscaling of workers.
FP: A set of abstractions for expressing data processing use cases across the spectrum from simple batch to complex stream processing, plus the ability to mix and match SDKs and runners.
IM: Apache Beam is a unified programming model to express both batch and streaming data processing jobs (pipelines). Beam enables users to easily tune requirements around completeness and latency and to run the same code across multiple runtime environments.
TA: A unified model for large-scale streaming and batch data processing, with concrete SDKs in multiple languages (currently Java & Python), and the ability to run those SDKs on any supported runner (currently Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow). And with a temporary caveat that Python can’t yet be run absolutely anywhere, but that’s coming very soon.
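To make the unified model concrete, here is a minimal sketch against the Beam Java SDK (the class and method names, the one-minute window, and the trigger delays are illustrative assumptions, not anything the interviewees specified). The same transform works on a bounded or an unbounded input, and the trigger settings are the dial IM mentions for trading completeness against latency:

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowedCounts {
  // Works identically whether 'events' came from a bounded source
  // (e.g. files via TextIO) or an unbounded one (e.g. Kafka, Pub/Sub).
  static PCollection<KV<String, Long>> countPerMinute(PCollection<String> events) {
    return events
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                // Speculative early results: lower latency, less complete.
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30)))
                // Refinements as late data arrives: more complete, later.
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardMinutes(5))
            .accumulatingFiredPanes())
        .apply(Count.<String>perElement());
  }
}
```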
What would you say to convince someone to use Beam for their next project?
LC: Python support
KK: Beam is the state of the art for writing unified batch/streaming big data processing pipelines, letting you use the latest innovations in a variety of operational scenarios (on premise, managed service, various engines) and your programming language of choice (currently Java, soon Python, more later).
JX: Focus on what you really care about — business logic — and let the professionals worry about how to implement your business logic on big data engines.
JL: Just write your business logic and run it in the best engine!
JL: It is the next-gen after MapReduce/Hadoop/Spark. Much easier to use. Just focus on your logic, not the plumbing.
AA: Python support and nice integration with TensorFlow through the tensorflow-transform project.
IM: Beam is future-proof: you won't need to rewrite your code if you plan to move to a different distributed system or cloud, which gives you incredible flexibility. Additionally, it is the most complete model for big data processing.
TA: Beam is at the frontier of data processing. We provide some really strong API stability guarantees, so you can rest assured things aren’t going to change out from underneath you on minor version releases as has been known to happen elsewhere. And the ease of switching runners is a big benefit in a lot of cases. That will become especially true once Python portability is available; you’ll be able to write Python pipelines for runners that historically have been Java/Scala only.
What’s your favorite feature of Beam?
JL: StatefulDoFn and windowing/trigger.
FP: The ability to easily adapt your business logic as your needs around latency and completeness change — from batch to assorted streaming use cases.
IM: The ability to run the same code in different distributed systems.
TA: Currently state & timers. But once Python can execute anywhere, it will probably become that. And then maybe SQL after that.
Are you running or about to run Beam in production? What has been your experience? Has it changed anything about your developer experience?
JL: I have written some ETL jobs and data statistics jobs. It made my code more useful, not tied to just one engine.
IM: We at Talend use Beam for our Data Preparation solution, already in production. Personally, using Beam has been an eye-opener for me. I had previous experience with other big data frameworks, and at the beginning the abstractions of Beam didn't seem as user-friendly to me as those of the other systems. But now I have the impression that other APIs hide complexity issues to captivate casual users, and this hidden complexity ends up biting you in the end, particularly the closer you get to ops. Beam is lean, clean, and straightforward; it is an approach with no surprises, and this is something developers care about.
If you’re non-Google, why are you working on Beam? If you are from Google, why were you drawn to working on Beam? If you’ve been working on Beam for a while, has your motivation changed over time?
LC: (Google) Combining all the knowledge that Google had developed internally with all the insights produced by the external community is an awesome challenge and bodes well for the data processing & analytics community overall.
KK: When we started Beam, I was excited to be a part of pushing the state of the art in unifying batch and streaming executions – for users, unifying historical analysis with ongoing low-latency results. There’s still great stuff happening around stream/table duality and the relationship between DAG-based models and higher-level querying. But looking ahead I am focused on cross-language portability. We have a variety of great processing engines implemented primarily in Java. I want to make these available to any language via our portability APIs. Python and R are obvious choices for data lovers.
JX: Non-Google (Alipay). I am currently working on BeamSQL. Batch processing is very mature in our company, but stream processing is not. We want to use BeamSQL to enable proficient batch processing users to reuse their knowledge in stream processing, letting them focus on business logic — and let Beam do the rest.
JL: Non-Google (Alibaba). Flink, Spark, Storm: so many engines cost users too much in learning and development. I hope there will be a standard model to make big data programming easier.
JF: Non-Google (Freightos). Beam looked like the easiest way to do simple ETL. (Actually, we are just copying a Datastore.) And it was easy, except for several cases of swallowed exceptions that were very hard to debug, and incompatibility of transitive Maven dependencies with the Google Cloud Datastore libraries. (For some reason, Beam uses an ad-hoc library to access Datastore.) But the real blocker was issue BEAM-991. I am pleased that it is resolved for an upcoming release, but I am puzzled that this bug has been in the codebase since Datastore support began, yet users continue to use Beam. Could it be that no one is using Beam with Datastore, or that the few who do have very small entities (<10 KB on average)?
FP (Googler): I worked on the internal Google technology that preceded Beam and was excited by the opportunity to integrate those ideas with the broader ecosystem.
IM: Non-Google (Talend). We (Talend) produce data integration solutions, and we found that we needed to rewrite our codebase every time a new data processing framework got traction. Beam offers a solution to this problem: not only an abstraction layer but also a rich model that covers all of our use cases. My motivation for Beam has evolved with the needs of the project. The first stable release is a great moment because we can now start focusing on higher-level abstractions and work on the missing data connectors, so users can reach all the different data stores in their servers or in the cloud.
TA (Google): I was drawn to working on Beam because I saw it as an opportunity to help push forward the state of the art in data processing (and particularly stream processing, which is where most of my background is) as part of an open ecosystem. I still see it as such.
What’s a feature that you’d really like to see added to Beam?
LC: Iterative computation to better support machine learning initiatives.
KK: Transparent and automatically correct use of retractions, to unlock the stream/table duality and queryable state. Iterative computation (loops) with correct watermarks.
JX: Like what LC said, but including iterative computation in Beam might be a big design challenge.
JL: Machine learning support, retractions, and schema support (like DataFrames; it would give the underlying engine more room to optimize performance).
JF: A status-update callback, to allow a monitoring tool to track progress and detect end of processing, including success/failure.
IM: A unified way to submit and monitor pipelines, like a REST service that lets you execute pipelines and monitor what is going on independently of the processing engine.
TA: SQL support (happening, thanks to Mingmin, James, Tarush, and Luke) and retractions (will have to happen for SQL).
How would you recommend a new person get started with Beam?
KK: If you like to just get something running, we’ve got a Java quickstart and a Python quickstart, whichever works for you. For a sense of what it is like to work with Beam, try the WordCount walkthrough. And if you prefer videos, there are also a ton of talks on YouTube with little Beam samples, demos, and background on Beam’s ideas.
IM: Play with the examples, but more importantly, try to solve a problem, even a trivial one, with Beam. Find a dataset that you want to analyze and do it with Beam. Translate a program from Hadoop, Storm, Spark, or whatever system you already know. The best way to learn is active learning, and don’t be afraid to ask questions; we have an open community and the user mailing list to help everyone.
TA: It really depends on what your goals are, but in general, just play with it. It’s pretty easy to start messing around with the direct runner, and not much additional work to bring that same pipeline up on any of the other runners.
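As a concrete version of TA’s point, the runner is just a pipeline option, so the same main() moves from local experimentation to a cluster via a flag. A minimal sketch (it assumes the chosen runner’s artifact is on the project’s classpath):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class Main {
  public static void main(String[] args) {
    // With no flag this defaults to the DirectRunner for local runs;
    // pass e.g. --runner=FlinkRunner, --runner=SparkRunner, or
    // --runner=DataflowRunner to target another engine without
    // touching the pipeline code.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);
    // ... apply the same transforms here, independent of the runner ...
    pipeline.run().waitUntilFinish();
  }
}
```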
How would you recommend a person start contributing to Beam?
LC: This is an easy one, https://beam.apache.org/contribute/contribution-guide/
KK: What Luke said :-). But if you do one thing, subscribe to dev@beam.apache.org!
TY: Also, subscribe to user@beam.apache.org to get to know what use cases Beam users have.
JX: Find the component you are really interested in, then read the code for that component, learn from others’ pull requests, and contribute some small PRs; then you are on your way.
AA: Look for the starter bugs in JIRA. They are tagged with the starter and newbie labels.
IM: Start humble: find a need you have that is not covered, e.g. fix a bug that you found, or write the data connector that you need and that does not exist yet. It can seem hard at the beginning, but Beam has good documentation now, a stable API, and a great community to help you.
TA: Dive in. Read the contribution guide others listed. Subscribe to and ask questions on the Dev List dev@beam.apache.org. Explore. Work on things you’re passionate about.