Apache Beam: Look-up Table with Side Input
In this blog post I’ll show you how to use the side input feature of Apache Beam.
But maybe you’re asking why I’d use this in the first place.
For my “happiness levels” project that I’ve blogged about, I wanted to look up reference data for each element: a quick index lookup until something matched. My first idea was to have the pipeline workers (Apache Beam’s worker nodes) connect to a database like Datastore.
However, this couldn’t be done. Every transform in the pipeline must be serializable, since Beam ships it to the workers, and holding a database connector as a dependency (like a repository bean when using Spring) prevents the transform from being serialized.
Fortunately, the Apache Beam model has the side input feature, which lets us do something similar: instead of a database connection, we build an in-memory map that can be passed into any transform.
World City File
First, let’s prepare a file with the data. Then, we can construct an in-memory Map using the file records.
Looking forward: This file will be uploaded to a GCP Storage bucket. Then, each pipeline worker can initialize the Map with the file content.
We want to map a Twitter user’s location string — something like “Fort Lauderdale, USA” — to the country name of “United States”. So, in my project repo I have a file that maps various popular cities around the world to their country.
You may want to prepare your own file, in which case you may need to clean it up so there are no duplicate “keys”: the city names become the keys of our Map, and each key can appear only once. If this doesn’t make sense yet, hopefully it will very soon.
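For illustration, assuming a simple city,country layout with no header row, a few records would look like this (these particular rows are made up, not taken from the repo’s file):

```
Fort Lauderdale,United States
London,United Kingdom
Tokyo,Japan
```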
City Side Input
Here is the test for the code we’re about to implement. Reading it first should make it clear where we’re going.
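What follows is a minimal sketch of such a test, not the repo’s actual code: it works on plain location strings instead of Tweet objects, and the class name MapCityToCountryFn is my invention for this post.

```java
import java.util.Map;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.junit.Rule;
import org.junit.Test;

public class AddCountryDataToTweetFnTest {

  @Rule public final transient TestPipeline pipeline = TestPipeline.create();

  @Test
  public void mapsLocationStringToCountryName() {
    // First transform: turn CSV lines into a city -> country side-input map.
    PCollectionView<Map<String, String>> cityToCountryView =
        pipeline
            .apply("CityLines",
                Create.of("Fort Lauderdale,United States", "London,United Kingdom"))
            .apply("CityToCountry", ParDo.of(new MapCityToCountryFn()))
            .apply("AsMap", View.asMap());

    // Second transform: enrich each location string using the side input.
    PCollection<String> countries =
        pipeline
            .apply("Locations", Create.of("Fort Lauderdale, USA"))
            .apply(
                "AddCountry",
                ParDo.of(new AddCountryDataToTweetFn(cityToCountryView))
                    .withSideInputs(cityToCountryView));

    PAssert.that(countries).containsInAnyOrder("United States");
    pipeline.run();
  }
}
```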
After reading the test, you’ll see there are only two transforms. The first one takes each line from the input file and maps the city field to the country field. We’re using a .csv file here to keep the parsing simple.
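A sketch of what that first transform can look like (again, the class name and the exact parsing are my simplification, assuming one city,country pair per line):

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

/** Parses one CSV line ("city,country") into a KV pair keyed by city. */
public class MapCityToCountryFn extends DoFn<String, KV<String, String>> {

  @ProcessElement
  public void processElement(ProcessContext c) {
    String[] fields = c.element().split(",");
    // Lower-case the key so lookups can be case-insensitive later on.
    c.output(KV.of(fields[0].trim().toLowerCase(), fields[1].trim()));
  }
}
```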
The second transform, AddCountryDataToTweetFn, is where we use the side input. It’s pretty simple, but very useful.
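Here’s a sketch of it, again simplified to work on plain location strings rather than the Tweet objects used in the repo. The important parts are the PCollectionView field, which is serializable, and the c.sideInput(...) call that materializes the Map on the worker:

```java
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.PCollectionView;

/** Looks up pieces of a location string in the side-input map until one matches. */
public class AddCountryDataToTweetFn extends DoFn<String, String> {

  private final PCollectionView<Map<String, String>> cityToCountryView;

  public AddCountryDataToTweetFn(PCollectionView<Map<String, String>> cityToCountryView) {
    this.cityToCountryView = cityToCountryView;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // The side input arrives as a plain in-memory Map on each worker.
    Map<String, String> cityToCountry = c.sideInput(cityToCountryView);

    // A location like "Fort Lauderdale, USA" is checked piece by piece
    // until something matches; no database connection needed.
    for (String part : c.element().split(",")) {
      String country = cityToCountry.get(part.trim().toLowerCase());
      if (country != null) {
        c.output(country);
        return;
      }
    }
  }
}
```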
Those are the two transforms. If you want to see how the entire application comes together, visit the repo for this blog post.
With those in place, the test I showed you above should pass. Please visit the repo to see how the entire data pipeline is built.
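For orientation, the wiring looks roughly like this; the class name, bucket path, file name, and step names are placeholders, so check the repo for the real assembly:

```java
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class CityLookupPipeline {

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read the city file from the bucket and turn it into a side-input map.
    PCollectionView<Map<String, String>> cityToCountryView =
        pipeline
            .apply("ReadCityFile", TextIO.read().from("gs://YOUR_BUCKET/world-cities.csv"))
            .apply("CityToCountry", ParDo.of(new MapCityToCountryFn()))
            .apply("AsMap", View.asMap());

    // Stand-in for the tweet source used in the real pipeline.
    PCollection<String> locations =
        pipeline.apply("Locations", Create.of("Fort Lauderdale, USA"));

    locations.apply(
        "AddCountry",
        ParDo.of(new AddCountryDataToTweetFn(cityToCountryView))
            .withSideInputs(cityToCountryView));

    pipeline.run();
  }
}
```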
Conclusion
Also, keep in mind that you’ll need to upload your .csv file to a GCP Storage bucket if you want to run the pipeline on GCP Dataflow.
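With the gcloud SDK installed, that’s a one-liner (bucket and file names are placeholders):

```
gsutil cp world-cities.csv gs://YOUR_BUCKET/world-cities.csv
```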
If you have any questions, please ask! And thanks for reading. Talk to you soon. 👋