Ecosystem Track

Streaming ETL with Flink and Elasticsearch

At Intellify we have built a system in which we create Flink apps for streaming ETL into normalized datasets in Elasticsearch, with schemas specified in Avro. Our data arrives via a single Kafka topic, but in different shapes depending on the originating source. To handle this, we've developed a framework for implementing ETL apps in Flink. The framework supports nested and out-of-order streaming joins using a custom process function, as well as a seeding source that reads historical input from our "data lake" in S3 and seamlessly transitions to the live Kafka topic. Finally, the framework treats stream output as immutable, using conceptual namespaces and index aliasing in Elasticsearch, which lets us iteratively develop new ETL features without disrupting existing consumers of the dataset. This talk gives an overview of the streaming join algorithm and the custom seeding source function, and shows our web UI for managing the streaming apps and datasets.
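
The abstract does not include code, but the core idea behind an out-of-order streaming join can be sketched with Flink's standard KeyedCoProcessFunction: buffer whichever side of the join arrives first in keyed state, and emit the joined record once its partner shows up. The sketch below is a minimal illustration under assumed event types (UserEvent, ProfileEvent, and EnrichedEvent are hypothetical stand-ins for Avro-generated classes), not Intellify's actual framework code, which also handles nested joins and state cleanup.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical event types standing in for Avro-generated classes.
record UserEvent(String userId, String name) {}
record ProfileEvent(String userId, String courseId) {}
record EnrichedEvent(UserEvent user, ProfileEvent profile) {}

/**
 * Joins two keyed streams whose matching elements may arrive in either
 * order, by buffering whichever side shows up first in keyed state until
 * its partner arrives. Both streams are keyed by userId before connect().
 */
public class BufferingJoinFunction
        extends KeyedCoProcessFunction<String, UserEvent, ProfileEvent, EnrichedEvent> {

    private transient ValueState<UserEvent> pendingUser;
    private transient ValueState<ProfileEvent> pendingProfile;

    @Override
    public void open(Configuration parameters) {
        pendingUser = getRuntimeContext().getState(
                new ValueStateDescriptor<>("pending-user", UserEvent.class));
        pendingProfile = getRuntimeContext().getState(
                new ValueStateDescriptor<>("pending-profile", ProfileEvent.class));
    }

    @Override
    public void processElement1(UserEvent user, Context ctx, Collector<EnrichedEvent> out)
            throws Exception {
        ProfileEvent profile = pendingProfile.value();
        if (profile != null) {
            out.collect(new EnrichedEvent(user, profile)); // partner already buffered: emit the join
        } else {
            pendingUser.update(user);                      // otherwise wait for the other side
        }
    }

    @Override
    public void processElement2(ProfileEvent profile, Context ctx, Collector<EnrichedEvent> out)
            throws Exception {
        UserEvent user = pendingUser.value();
        if (user != null) {
            out.collect(new EnrichedEvent(user, profile));
        } else {
            pendingProfile.update(profile);
        }
    }
}
```

A production version of a function like this would also need timers or state TTL to expire buffered entries that never find a partner.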

Authors

Jared Stehler
Intellify Learning

Jared is a passionate expert in software architecture, continuous delivery, Platform-as-a-Service systems, and AWS. As Chief Architect of Intellify, Jared led the design and implementation of the cloud-based streaming analytics platform. Previously, he was a Staff Software Engineer at HubSpot, responsible for migrating their content management system from a Python-based monolith to Java microservices. He has worked in various industries including lending, search, ad tech, and marketing.
