Month: November 2016

Stream Processing Myths Debunked

Six Common Streaming Misconceptions

By @kostas_tzoumas and @wints

Needless to say, we here at data Artisans spend a lot of time thinking about stream processing. Even cooler: we spend a lot of time helping others think about stream processing and how to apply streaming to data problems in their organizations.

A good first step in this process is understanding common misconceptions about the modern stream processing space (and in a rapidly changing space high in its hype cycle, there are plenty of misconceptions worth talking about).

We’ve selected six of them to walk through in this post, and since Apache Flink® is the open-source stream processing framework that we’re most familiar with, we’ll provide examples in the context of Flink.

Myth 1: There’s no streaming without batch (the Lambda Architecture)
Myth 2: Latency and Throughput: Choose One
Myth 3: Micro-batching means better throughput
Myth 4: Exactly once? Completely impossible.
Myth 5: Streaming only applies to “real-time”
Myth 6: So what? Streaming is too hard anyway.

Read more

Savepoints, Part 2: Streaming Applications in the Pit Lane

Updating Apache Flink® applications with no lost state and minimal downtime

By Fabian Hueske (@fhueske) and Mike Winters (@wints)

Last month, we gave a high-level overview of Apache Flink® savepoints and touched on why and how you’d reprocess data in a streaming application. If you haven’t already read that post, or if you aren’t familiar with Flink’s savepoints, we recommend starting there.

A common use for savepoints is fixing a bug or making an improvement to a streaming application, a task whose requirements are in some sense similar to those of an F1 pit stop: every second of downtime counts, and the car needs to be back on the track as quickly as possible without sacrificing the driver’s current position in the race (the ‘application state’).

In this post, we’ll walk you through the process of updating a streaming application and deploying an improved version without losing application state and with minimal downtime.
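At a high level, the update relies on Flink’s command-line client: trigger a savepoint for the running job, cancel it, and restart the new version from the savepoint. Here is a minimal sketch of that workflow (the angle-bracket values are placeholders specific to your job, not values from the post):

  # 1) Trigger a savepoint for the running job (prints the savepoint path)
  bin/flink savepoint <jobID>

  # 2) Cancel the running job once the savepoint has completed
  bin/flink cancel <jobID>

  # 3) Resume the updated version of the application from the savepoint
  bin/flink run -s <savepointPath> <updated-application.jar>

The window between cancelling the old job and the new job resuming is the ‘pit stop’: the savepoint carries the application state across the restart.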

How do I update a running job?

Imagine that you have a streaming application that observes a stream of events sent out by an alarm system. There are three types of events:

  1. ActivateAlarm
  2. DeactivateAlarm
  3. MotionDetected

Each event is associated with a different room in a facility and has a timestamp. The job of our application is to trigger an alarm if a MotionDetected event is observed for a room for which the last received event was ActivateAlarm. The dataflow of such an application implemented as a Flink job would look like the figure below.

(Figure: dataflow of the alarm application)

Putting this dataflow into Scala source code is pretty straightforward with Flink’s DataStream API:
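The full walkthrough is behind the link below, but as a rough sketch of what such a job can look like (the event case classes, the Boolean ‘armed’ state encoding, and the stand-in source are our assumptions for illustration, not the post’s actual code):

  import org.apache.flink.streaming.api.scala._

  // Event types sent by the alarm system (field layout is an assumption)
  sealed trait AlarmEvent { def room: String; def timestamp: Long }
  case class ActivateAlarm(room: String, timestamp: Long) extends AlarmEvent
  case class DeactivateAlarm(room: String, timestamp: Long) extends AlarmEvent
  case class MotionDetected(room: String, timestamp: Long) extends AlarmEvent

  // Record emitted when motion is detected in a room whose alarm is armed
  case class Alert(room: String, timestamp: Long)

  object AlarmJob {
    def main(args: Array[String]): Unit = {
      val env = StreamExecutionEnvironment.getExecutionEnvironment

      // Stand-in source for the example; a real job would read from Kafka or similar
      val events: DataStream[AlarmEvent] = env.fromElements[AlarmEvent](
        ActivateAlarm("lobby", 1L),
        MotionDetected("lobby", 2L),
        DeactivateAlarm("lobby", 3L))

      // One Boolean of keyed state per room: is the alarm currently armed?
      val alerts: DataStream[Alert] = events
        .keyBy(_.room)
        .flatMapWithState[Alert, Boolean] {
          case (ActivateAlarm(_, _), _)               => (Seq.empty, Some(true))
          case (DeactivateAlarm(_, _), _)             => (Seq.empty, Some(false))
          case (MotionDetected(room, ts), Some(true)) => (Seq(Alert(room, ts)), Some(true))
          case (_, armed)                             => (Seq.empty, armed)
        }

      alerts.print()
      env.execute("Alarm monitor")
    }
  }

Keying the stream by room gives each room its own independent ‘armed’ flag, and that per-key state is exactly what a savepoint captures and restores when the job is updated.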

Read more

Announcing Flink Forward San Francisco 2017

We are thrilled to announce the first Flink Forward San Francisco on April 10-11, 2017 at Hotel Kabuki, in the center of Japantown in San Francisco.

In this first edition of Flink Forward San Francisco, we are looking to connect with the already thriving Flink community in the Bay Area and beyond. Our mission is to foster innovation and discussion with developers around the world in order to push Apache Flink to the next level.

The Call for Papers will open soon, and you are invited to share your knowledge, use cases, and best practices with the Apache Flink community and to shape the program of the first edition of Flink Forward San Francisco!

Participants are invited to join a day of hands-on Flink training sessions on April 10, followed by a day of speaker sessions on April 11 dedicated to technical talks on how Flink is used in the enterprise, Flink system internals, ecosystem integrations with Flink, and the future of the platform.

Last but not least, tickets will be on sale soon!