Stream Processing Myths Debunked

Six Common Streaming Misconceptions

By Kostas Tzoumas (@kostas_tzoumas) and Mike Winters (@wints)

Needless to say, we here at data Artisans spend a lot of time thinking about stream processing. Even cooler: we spend a lot of time helping others think about stream processing and how to apply streaming to data problems in their organizations.

A good first step in this process is understanding misconceptions about the modern stream processing space (and in a rapidly-changing space that’s high in its hype cycle, there are many misconceptions worth talking about).

We’ve selected six of them to walk through in this post, and since Apache Flink® is the open-source stream processing framework that we’re most familiar with, we’ll provide examples in the context of Flink.

Myth 1: There’s no streaming without batch (the Lambda Architecture)
Myth 2: Latency and Throughput: Choose One
Myth 3: Micro-batching means better throughput
Myth 4: Exactly once? Completely impossible.
Myth 5: Streaming only applies to “real-time”
Myth 6: So what? Streaming is too hard anyway.

Read more

Savepoints, Part 2: Streaming Applications in the Pit Lane

Updating Apache Flink® applications with no lost state and minimal downtime

By Fabian Hueske (@fhueske) and Mike Winters (@wints)

Last month, we gave a high-level overview of Apache Flink® savepoints and touched on why and how you’d reprocess data in a streaming application. If you haven’t already read that post, or if you aren’t familiar with Flink’s savepoints, we recommend starting there.

A common use for savepoints is to fix a bug or make an improvement to a streaming application, a task whose requirements are in some sense similar to those of an F1 pit stop: every second of downtime counts, and the car needs to be back on the track as quickly as possible without sacrificing the driver’s current position in the race (‘application state’).

In this post, we’ll walk you through the process of updating a streaming application and deploying an improved version without losing application state and with minimal downtime.

How do I update a running job?

Imagine that you have a streaming application that observes a stream of events sent out by an alarm system. There are three types of events:

  1. ActivateAlarm
  2. DeactivateAlarm
  3. MotionDetected

Each event is associated with a different room in a facility and has a timestamp. The job of our application is to trigger an alarm if a MotionDetected event is observed for a room for which the last received event was ActivateAlarm. The dataflow of such an application implemented as a Flink job would look like the figure below.

Putting this dataflow into Scala source code is pretty straightforward with Flink’s DataStream API:
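The full DataStream implementation is in the original post. To illustrate just the per-room logic described above, here is a stdlib-only Java sketch (class and method names are hypothetical, and it deliberately leaves out Flink’s keyed state, timestamps, and the DataStream API itself):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the per-room alarm logic: each room keeps only whether its
// alarm is currently armed; a MotionDetected event triggers an alert
// if and only if the room's last control event was ActivateAlarm.
public class AlarmLogic {
    enum EventType { ACTIVATE_ALARM, DEACTIVATE_ALARM, MOTION_DETECTED }

    // room -> is the alarm currently armed?
    private final Map<String, Boolean> armed = new HashMap<>();

    /** Returns true if this event should trigger an alarm. */
    public boolean onEvent(String room, EventType type) {
        switch (type) {
            case ACTIVATE_ALARM:
                armed.put(room, true);
                return false;
            case DEACTIVATE_ALARM:
                armed.put(room, false);
                return false;
            case MOTION_DETECTED:
                // Unarmed by default if no control event was seen yet.
                return armed.getOrDefault(room, false);
            default:
                return false;
        }
    }
}
```

In the actual Flink job, the per-room flag would live in keyed state (the stream keyed by room ID) so that Flink manages it fault-tolerantly and includes it in savepoints.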

Read more

Announcing Flink Forward San Francisco 2017

We are thrilled to announce the first Flink Forward San Francisco on April 10-11, 2017 at Hotel Kabuki, in the center of Japantown in San Francisco.

In this first edition of Flink Forward San Francisco, we are looking to connect with the already thriving Flink community in the Bay Area and beyond. Our mission is to foster innovation and discussion with developers around the world in order to push Apache Flink to the next level.

The Call for Papers will be open soon and you are invited to share your knowledge, use cases and best practices with the Apache Flink community and shape the program of the first edition of Flink Forward San Francisco!

Participants are invited to join one day of hands-on Flink training sessions on April 10, followed by a day of speaker sessions on April 11 dedicated to technical talks on how Flink is used in the enterprise, Flink system internals, ecosystem integrations with Flink, and the future of the platform.

Last but not least, tickets will be on sale soon!

Blink: How Alibaba Uses Apache Flink®

Improvements to Flink and its application in Alibaba Search

This is a guest post from Xiaowei Jiang, Senior Director of Alibaba’s search infrastructure team. The post is adapted from Alibaba’s presentation at Flink Forward 2016, and you can see the original talk from the conference here.

Alibaba is the largest e-commerce retailer in the world. Our annual sales in 2015 totalled $394 billion, more than eBay and Amazon combined. Alibaba Search, our personalized search and recommendation platform, is a critical entry point for our customers and is responsible for much of our online revenue, so the search infrastructure team is constantly exploring ways to improve the product.

What makes for a great search engine on an e-commerce site? Results that, in real-time, are as relevant and accurate as possible for each user. At Alibaba’s scale, this is a non-trivial problem, and it’s difficult to find technologies that are capable of handling our use cases.

Apache Flink® is one such technology, and Alibaba is using Blink, a system based on Flink, to power critical aspects of its search infrastructure and to deliver relevance and accuracy to end users. In this post, I’ll walk through Flink’s role in Alibaba search and outline the reasons we chose to work with Flink on the search infrastructure team.

Read more

Savepoints: Turning Back Time

Reprocessing Data Streams with Savepoints in Apache Flink®

This post is the first in a series in which the data Artisans team will highlight some of Apache Flink®’s core features.

By Fabian Hueske (@fhueske) and Mike Winters (@wints)

Stream processing is commonly associated with ‘data in motion’, powering systems that make sense of and respond to data in nearly the same instant it’s created. The most frequently discussed streaming topics, such as latency and throughput or watermarks and handling of late data, focus on the present rather than the past.

In reality, though, there are a number of cases where you’ll need to reprocess data that your streaming application has already processed before. Some examples include:

  • Deployment of a new version of your application with a new feature, a bug fix, or a better machine learning model
  • A/B testing different versions of an application using the same source data streams, starting the test from the same point in time without sacrificing prior state
  • Evaluating and carrying out the migration of applications to newer releases of the processing framework or to a different cluster

Apache Flink’s savepoint feature enables all of the above and is one of the unique points that distinguishes Flink from other distributed open source stream processors.
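In practice, the reprocessing workflow sketched above runs through Flink’s command-line client. A rough outline (job ID, savepoint path, and jar name are placeholders, and a running Flink cluster is assumed):

```shell
# Trigger a savepoint for the running job
bin/flink savepoint <jobID>

# Stop the old job once the savepoint has completed
bin/flink cancel <jobID>

# Start the updated application from the savepoint
bin/flink run -s <pathToSavepoint> updated-application.jar
```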

Read more

A Few of Our Favorite Insights from Flink Forward 2016

The data Artisans team was very much impressed by this year’s Flink Forward speaker sessions, and the speakers delivered tons of detail on Apache Flink® use cases and benchmarks. Here, we’ll share just a small selection of our favorite insights from the presentations.

And remember, all speaker session recordings and slides are available on the Flink Forward website.

  1. Bouygues Telecom, one of the largest telecom networks in France, is running 30 production applications powered by Flink and is processing 10 billion raw events per day. As of Flink Forward 2015, they were live with 5 Flink applications, so we’re looking forward to hearing about their 180 Flink applications in 2017. (All Slides, Talk)

Read more

September 2016 in Review: a fantastic Flink Forward, dA Platform debut, and Strata + Hadoop World NYC

Berlin’s surprise 32° September weather (90° F for those of you Stateside) has come and gone, and there was a lot happening in the last few weeks of summer. Here are a few of the highlights.

Apache Flink® to the enterprise

In order to make Flink more accessible to organizations seeking enterprise support, data Artisans announced the dA Platform, a data Artisans-certified distribution of Flink bundled with 24x7x365 support. Get in touch with us if you’d like to learn more.

And we were thrilled to see that Lightbend included Flink in its Fast Data Platform. September was a month of great progress in growing the Flink community and broadening the user base.

Read more

data Artisans at Strata + Hadoop World NYC 2016

From September 26-29, 2016, the big data community will meet at Strata + Hadoop World in NYC. This year, data Artisans will take part in the conference in several ways. For the first time, we will have a booth at the conference (#P2), where we will be demonstrating Apache Flink® and our brand new dA Platform. Stop by to connect with Apache Flink experts and learn more about implementing enterprise-grade streaming data applications in production.

Read more

Announcing the dA Platform, our distribution of Apache Flink®

A team of original Apache Flink® contributors founded data Artisans in 2014 because we believed that existing data processing frameworks weren’t adequately addressing the needs of organizations and their engineering teams. From the global saturation of smartphones, to the rapid adoption of the Internet of Things and connected devices, the very nature of data and how it is generated had evolved far more quickly than the tools available to manage that data.

Read more

August 2016 in Review: Apache Flink® 1.1, Flink Forward announcements, and more

While most of the Continent was away on holiday, it was a productive August for data Artisans and for the Apache Flink® community. Here are highlights from the past month, and we can’t wait to see what the rest of 2016 has in store.

Apache Flink 1.1

There were many long-awaited features included in the Flink community’s 1.1 release, which was supported by 95 contributors. If you haven’t already, we recommend that you browse the release notes. Here are a few of the highlights:

Read more