How Apache Flink™ Enables New Streaming Applications, Part 2
This is the second blog post in our series on how Flink enables new streaming applications. Part 1 introduced event time and out of order processing.
This post is about versioning application state and was followed by a post about sessions and advanced windows (part 3, available here).
Stateful stream processingStream processing can be divided into stateless and stateful processing. Stateless stream processing applications merely receive events, and produce some kind of response (e.g., an alert or a transformed event) based on the information recorded on the last received event alone. As such, they have no “memory” or aggregating power. While these are useful in many cases (e.g., filtering, simple transformations), many interesting streaming applications, like aggregations over time windows, complex event processing and other pattern matching over several events, as well as transaction processing are stateful. Older streaming systems, like Apache Storm (at its core API), did not have any support for stateful programs (Storm Trident, a library included with Apache Storm does provide support for state). Storm programs can define Java objects in bolts that hold state, or interact with external databases and key/value stores that hold state, but the system does not offer correctness guarantees for this state in the case of failures, falling back to “at least once” semantics (duplicates), or “at most once” semantics (data loss). This lack of accuracy guarantees, together with the inability to handle high volume streams (large throughput) made hybrid solutions like the Lambda architecture necessary. Flink represents a newer generation of streaming systems, and guarantees correctness of state, which makes stateful applications much easier. In Flink programs, you can define state in a variety of ways:
- Using Flink’s window transformations, you can define time-based windows based on event or processing time, count-based windows, as well as generalized custom windows. See here for a short introduction to Flink windows, and stay tuned for the next blog post in this series for more advanced forms of windows.
- Using the Checkpointed interface, you can register any kind of Java/Scala object (e.g., a HashMap) with Flink, guaranteeing its correct recovery after failure.
- Using Flink’s key/value state interface, you can use state that is partitioned in the cluster by a key.
Savepoints: versioning stateIn Flink, we recently introduced savepoints, a feature that brings Flink a long way towards resolving the above and more issues. A savepoint can be taken on a running Flink job, and essentially takes an externally accessible snapshot of the job at a well-defined consistent time. It includes the offset currently being read from the stream sources, and the program state as it is at exactly this offset. Internally, savepoints are just regular checkpoints that Flink takes periodically in order to guarantee correctness in the case of failures. The main differences are (1) savepoints can be manually triggered, and (2) savepoints never expire unless explicitly disposed by the user. To take a savepoint of a running job with a given JobID through the command line, simply run
<span style="font-weight: 400;">> flink savepoint <JobID></span>
<span style="font-weight: 400;">> flink run -s <pathToSavepoint> <jobJar> ...</span>
- Application code upgrades: assume that you have found a bug in an already running application, and you want the future events to be processed by the updated code with the bug fixed. By taking a savepoint of the job, and restarting from that savepoint using the new code, downstream applications will not see the difference.
- Flink version upgrades: upgrading Flink itself becomes also easy, as you can take savepoints of running pipelines and replay them from the savepoints using an upgraded Flink version.
- Maintenance and migration: using savepoints, you can easily “pause and resume” an application. This is especially useful for cluster maintenance, as well as migrating jobs consistently to a new cluster. In addition, this is useful for developing, testing, and debugging applications, as you do not need to replay the complete event stream.
- What-if simulations (reinstatements): many times it is very useful to run an alternative application logic to model “what if” scenarios from controllable points in the past.
- A/B testing: by running two different versions of application code in parallel from the exact same savepoint, you can model A/B testing scenarios.
ConclusionIn this blog post, we saw
- Many interesting streaming use cases like aggregations over time windows, complex event processing or pattern matching require support for stateful programs inside the system. Flink’s support for state as a first class citizen enables these kinds of applications and allows Flink to make guarantees about the correctness of state (exactly ones semantics).
- Stateful streaming applications face many operational issues like behaviour under upgrades (both of the application code and Flink itself), failures, as well as application and cluster maintenance. Flink’s support for savepoints help a great deal to solve these operational issues by allowing you to version both your application code and state.