Technology Deep Dive Track

Failure is not fatal: what is your recovery story?

Failures are inevitable. How can we recover a Flink job from an outage? How do we reprocess data from the outage period? What are the implications for downstream consumers? These are important questions to answer when running Flink for critical data processing applications. We implemented two solutions for our stream processing platform: (1) use a data warehouse, such as Hive, as the backfill source, and (2) rewind the Flink job using an external checkpoint. We will describe both solutions in detail and discuss the pros and cons of each approach. We will also look at some of the caveats to watch out for.

Authors

Steven Wu
Software Engineer, Netflix

Steven Wu is a software engineer at Netflix. He works on the real-time data infrastructure that powers a massive data ingestion pipeline and stream processing platform. Previously, he worked on the cloud platform that forms the foundation of Netflix’s cloud-native microservice architecture. He is passionate about building scalable distributed systems and empowering people with data.
