Using Apache Beam vs other frameworks for bulk data processing
Apache Beam is a unified programming model for implementing batch and stream processing. It can run on multiple runners, such as Apache Flink, Apache Spark, and Google Cloud Dataflow. This advantage makes Apache Beam more attractive compare to other frameworks, such as Apache Flink (for stream processing) or Apache Spark (for bulk/batch processing). However, Apache Beam has some caveats that are not suitable for some stream processing use cases, such as:
- No ordering guarantees, which means our application must handle/tolerate unordered messages.
- At least once, which means our application must handle/tolerate duplicates.
Thus, when we can use Apache Beam? Answering based on my own experiences, I have three conditions to choose Apache Beam over other frameworks.
- We use Google Cloud Platform (GCP). Google Dataflow has some operational advantages: dynamic scaling and managed infrastructure. Those advantages vanish if we run on top of Apache Flink or Apache Spark.
- We want to perform bulk/batch data transformation. It is easy using GCP native technologies, such as Google Bigquery or Google Cloud Storage. Don’t use Apache Beam to train (Machine Learning) models, but opt for Apache Spark/Pyspark instead. Apache Spark/Pyspark has tons of useful libraries intended for such tasks.
- We want to perform event stream processing where ordering and duplicates don’t matter. Also, we don’t care about calculating time spent on an event. Otherwise, opt for Apache Flink. Although Apache Flink is more challenging in infrastructural maintenance and scaling, it has the ordering and exactly once guarantees. Also, it is easy to perform a calculation of time spent on an event. On the other hand, if we want to apply machine learning on event streams, Apache Spark is a proper tool for that (i.e., DStream with windowing function).
I have created an example of bulk data processing using Apache Beam available at my Github repo: apache-beam-kotlin-bulk-sample. The example is adapted from the Apache Beam example ported into Kotlin and added with an integration test.
Have fun coding!
Comments