Apache Beam: how to execute a task before/after the record flow starts/stops by Romain Manni-Bucau, 2019-01-03

Big Data pipelines are designed to process tons of data, but it is still common to need to execute some task once, either before the flow starts or after it has been processed (for batches). Let's see how to do it with Apache Beam.
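As a starting point, here is a minimal sketch of the lifecycle hooks Beam exposes on a DoFn. The class name is illustrative, and note that these hooks run once per DoFn instance (per worker), not once per pipeline, which is exactly why a true "run once" task needs more care than this:

```java
import org.apache.beam.sdk.transforms.DoFn;

// Illustrative DoFn showing Beam's lifecycle annotations.
// @Setup/@Teardown run once per DoFn instance, not once per pipeline.
public class RecordFlowFn extends DoFn<String, String> {

    @Setup
    public void setup() {
        // e.g. open a connection, create a target table, ...
    }

    @ProcessElement
    public void onElement(ProcessContext context) {
        context.output(context.element());
    }

    @Teardown
    public void teardown() {
        // e.g. release the connection; not guaranteed to run on failures
    }
}
```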

Why my Apache Beam Spark fatjar job broke when upgrading to version 2.9.0 by Romain Manni-Bucau, 2018-12-18

Apache Beam 2.9.0 was released last week, so if you were using a previous version you are likely tempted to upgrade. This is what I did, and what was expected to be a single-line change in a pom turned out to be more work than anticipated. Let's see what changed!

How to select the best coder for your data with Apache Beam by Romain Manni-Bucau, 2018-09-19

Apache Beam's coder abstraction enables you to switch between implementations without rewriting your pipeline. But how do you select your coder? Performance and disk space are likely the most important criteria; let's see how to measure them.
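As a taste of the size criterion, here is a small sketch (helper class name is illustrative) that measures the serialized footprint of a value for a given coder by encoding it the same way a runner would:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.StringUtf8Coder;

// Sketch: measure the serialized size of a value for a given Beam coder.
public final class CoderSizer {

    public static <T> int serializedSize(Coder<T> coder, T value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        coder.encode(value, buffer); // encode exactly as the runner would
        return buffer.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(serializedSize(StringUtf8Coder.of(), "hello beam"));
    }
}
```

Measuring throughput works the same way: encode and decode a representative sample in a loop and time it for each candidate coder.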

Apache Beam: convert Row structure to an Avro IndexedRecord by Romain Manni-Bucau, 2018-09-12

We previously saw that Beam's Row structure allows you to write generic transforms, but that relying on its serialization can be a bad bet. To illustrate how to switch from one format to another, this post shows how to convert a Row to an Avro IndexedRecord.
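A hand-rolled conversion can be sketched like this (the record name, field names and Avro schema are illustrative; in practice the Avro schema would be derived from the Row's own schema):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.IndexedRecord;
import org.apache.beam.sdk.values.Row;

// Sketch: copy each Row field into an Avro GenericData.Record,
// which implements IndexedRecord.
public final class RowToAvro {

    // Assumed target schema for the example.
    private static final Schema AVRO_SCHEMA = SchemaBuilder.record("Person")
            .fields()
            .requiredString("name")
            .requiredInt("age")
            .endRecord();

    public static IndexedRecord toIndexedRecord(Row row) {
        GenericData.Record record = new GenericData.Record(AVRO_SCHEMA);
        for (String field : row.getSchema().getFieldNames()) {
            record.put(field, row.getValue(field));
        }
        return record;
    }
}
```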

Apache Beam and Row: a new Big Data record/serialization standard? by Romain Manni-Bucau, 2018-09-05

Handling data you don't know at compile time is a common concern for processing libraries. Apache Beam can't ignore it since it lets you build portable pipelines for Big Data engines. Let's see how it started to solve that concern!
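To give an idea of the API, here is a tiny sketch of Beam's Row: the schema is a plain runtime value, so the shape of the records does not need to be known at compile time (field names are illustrative):

```java
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;

// Sketch: build a schema and a matching Row entirely at runtime.
public final class RowExample {

    public static void main(String[] args) {
        Schema schema = Schema.builder()
                .addStringField("name")
                .addInt32Field("age")
                .build();

        Row row = Row.withSchema(schema)
                .addValues("beam", 42)
                .build();

        System.out.println(row.getString("name") + " / " + row.getInt32("age"));
    }
}
```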