Apache Flink® Community Announces 1.2.0 Release

Rescalable state, queryable state, async I/O, low-level stream operations, SQL improvements, and more

On Monday, February 6, the Apache Flink® community announced the project’s 1.2.0 release. We at data Artisans would like to extend a sincere thanks to the 122 members of the Flink community who contributed to 1.2.0. The release included contributors employed by Alibaba, Amazon, Cloudera, King, and many other enterprises.

At data Artisans, we spend most of our waking hours thinking about and working on Flink, and so there’s lots that we’re excited about in the 1.2.0 release. In this post, members of the data Artisans engineering team will share their thoughts on just a subset of the release’s new features.

For a complete overview, be sure to check out the changelog on the project site.

And in the coming weeks, we’ll be writing about 1.2.0 features in more detail here on the data Artisans blog.

Read more

Apache Flink® User Survey 2016 Results, Part 2

Last week, we published the first of two blog posts recapping the results of the 2016 Apache Flink® user survey. In part 1, we shared a selection of graphs summarizing responses to the survey’s multiple choice questions. In part 2, we’ll look at responses to the survey’s open-ended questions:

  • What new features or functionality would you like to see in Flink?
  • Please briefly describe the application(s) your team is building or plans to build with Flink.
  • Are there any other challenges (when working with Flink) not listed (in the previous question) that you’d like to mention?
  • What other sources / sinks not included in the list (that was provided in the survey), if any, are important for your Flink application?
  • We welcome any final comments about any aspect of Flink.

Read more

Apache Flink® User Survey 2016 Results, Part 1

(You can find part 2 here)

At the end of 2016, data Artisans organized the first-ever Apache Flink® user survey in order to better understand Flink usage in the community, asking for feedback about both common patterns and the most-needed Flink features.

The results are in, and we’ll be sharing them in a two-post series. This first post will include a summary of answers to the survey’s multiple-choice questions, and the second post will include written answers to open-ended questions that respondents gave us permission to share anonymously.

For context, here’s some general information about the survey:

  • We collected responses between 18 Nov 2016 and 13 Dec 2016
  • The survey was distributed via the Apache Flink mailing lists, the data Artisans Twitter account, and Apache Flink meetup groups around the world
  • In total, 119 respondents from 21 different countries answered at least 1 question; note that each graph includes a count of respondents for that particular question

If you’d like to download a single file with all 5 of the graph images from this post, you can do so here.

First, a fun one: where in the world are Flink users? The Flink community has long been a global one: 27% of respondents are based in the United States, with many more throughout continental Europe, South America, and Asia.

Read more

November 2016 in Review: Flink Forward 2017, Amazon EMR + Google Dataproc, and kicking off the Flink training series in Germany

We’re in the home stretch of 2016, and November was another action-packed month for the Apache Flink® community and the data Artisans team.

Here’s a recap of November’s most exciting highlights.

Announcing Flink Forward 2017

Flink Forward is coming to San Francisco! For the first time ever, the annual Apache Flink user conference will expand beyond Berlin with a 2-day event at the Hotel Kabuki in Japantown on 10-11 April, 2017. The call for papers is open, so submit your talk or register today.

And the Berlin event will return to Kulturbrauerei on 11-13 September, 2017. Registration is available now, and stay tuned for the call for papers.

Flink in Amazon EMR and Google Cloud Dataproc

Flink is now natively supported in Amazon EMR 5.1.0, and Google included support for Flink 1.1.3 in its November 29 Cloud Dataproc release. We’re excited to see Flink become available in an increasing number of commercial distributions.

Flink Training in Frankfurt (Munich and Hamburg coming up soon)

Last month, the data Artisans team hosted the first of three Flink training sessions–this time in Frankfurt–in coordination with codecentric. The Munich workshop happens tomorrow, Tuesday 13 December, and timing for our Hamburg event is still TBD. Interested in setting up a Flink training for your organization? Learn more and get in touch.

Apache Flink User Survey: Results Coming Soon

In late November and early December, data Artisans ran the first-ever Apache Flink user survey. There have been over 100 responses so far, and we’ll be publishing a summary of results to share with the community before the end of 2016. We’re excited to share feedback from Flink users around the world.

Community and Conference Circuit

CEO Kostas Tzoumas hosted a session at Big Data London, CTO Stephan Ewen gave a keynote at Apache Big Data Europe, Director of Applications Engineering Jamie Grier gave a workshop at QCon San Francisco, software engineer Aljoscha Krettek hosted a session at Big Data Spain, software engineers Robert Metzger and Maximilian Michels presented at a Flink meetup in the San Francisco Bay Area, and data Artisans hosted a Flink meetup at our Berlin office (setting an attendance record for the Berlin group). It was quite a month! We very much enjoyed meeting members of the Flink user community in person.

On the data Artisans Blog

Check out Savepoints, Part 2 to see an example of how to update a streaming application using Flink’s savepoints (of course, we recommend starting with Part 1).

And for a debunking of commonly-held myths in the stream processing space, we recommend this post from data Artisans’ CEO Kostas Tzoumas.


Hello, San Francisco! And nice to see you again, Berlin.

By @danibentrup

We’re excited to announce that Apache Flink® enthusiasts have two events to look forward to in 2017, both fully-packed with the latest and greatest on Flink.

Flink Forward, the premier Flink conference, is coming to Berlin for a third time on September 11-13, 2017. But before our annual event in Germany, we invite the data stream processing community to the first-ever Flink Forward San Francisco on April 10-11, 2017.

In this first edition of Flink Forward San Francisco, we’ll connect with the already-thriving Flink community in the Bay Area and beyond. Our mission is to foster innovation and discussion with developers around the world in order to push Apache Flink to the next level.

The call for submissions is already open and you are invited to share your knowledge, use cases, and best practices with the Apache Flink community and to shape the program of the first edition of Flink Forward San Francisco! Submit your talk here.

Flink Forward San Francisco will take place at Hotel Kabuki, in the heart of Japantown in the city center. Participants are invited to join one day of hands-on Flink training sessions on April 10 followed by one day of speaker sessions on April 11. The speaker sessions will be made up of technical talks covering Flink in the enterprise, Flink system internals, ecosystem integrations with Flink, and the future of the platform.

Last but not least, tickets are on sale here: http://sf.flink-forward.org/registration/

From September 11-13, we welcome the stream data processing community to the third edition of Flink Forward Berlin at Kulturbrauerei in the heart of Prenzlauer Berg in Berlin.

Participants are invited to join one day of hands-on Flink training sessions on September 11 followed by two days of speaker sessions on September 12-13. Tickets are on sale now, and you can purchase your Early Bird Ticket here.

In 2017, we seek to provide a platform for developers, architects, engineering managers, and C-level executives to gain in-depth insights on Apache Flink. We hope that you’ll join us.

Stream Processing Myths Debunked

Six Common Streaming Misconceptions

By @kostas_tzoumas and @wints

Needless to say, we here at data Artisans spend a lot of time thinking about stream processing. Even cooler: we spend a lot of time helping others think about stream processing and how to apply streaming to data problems in their organizations.

A good first step in this process is understanding misconceptions about the modern stream processing space (and as a rapidly-changing space high in its hype cycle, there are many misconceptions worth talking about).

We’ve selected six of them to walk through in this post, and since Apache Flink® is the open-source stream processing framework that we’re most familiar with, we’ll provide examples in the context of Flink.

Myth 1: There’s no streaming without batch (the Lambda Architecture)
Myth 2: Latency and Throughput: Choose One
Myth 3: Micro-batching means better throughput
Myth 4: Exactly once? Completely impossible.
Myth 5: Streaming only applies to “real-time”
Myth 6: So what? Streaming is too hard anyway.

Read more

Savepoints, Part 2: Streaming Applications in the Pit Lane

Updating Apache Flink® applications with no lost state and minimal downtime

By Fabian Hueske (@fhueske) and Mike Winters (@wints)

Last month, we gave a high-level overview of Apache Flink® savepoints and touched on why and how you’d reprocess data in a streaming application. If you haven’t already read that post, or if you aren’t familiar with Flink’s savepoints, we recommend starting there.

A common use for savepoints is to fix a bug or make an improvement to a streaming application, which is a task that in some sense has requirements that are similar to an F1 pit stop: every second of downtime counts, and the car needs to be back on the track as quickly as possible without sacrificing the driver’s current position in the race (‘application state’).

In this post, we’ll walk you through the process of updating a streaming application and deploying an improved version without losing application state and with minimal downtime.

How do I update a running job?

Imagine that you have a streaming application that observes a stream of events sent out by an alarm system. There are three types of events:

  1. ActivateAlarm
  2. DeactivateAlarm
  3. MotionDetected

Each event is associated with a different room in a facility and has a timestamp. The job of our application is to trigger an alarm if a MotionDetected event is observed for a room for which the last received event was ActivateAlarm. The dataflow of such an application implemented as a Flink job would look like the figure below.
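To make the rule concrete, here is a minimal, framework-independent sketch in plain Python. It is not Flink's DataStream API and not the code from the original application: the `Event` class, the `detect_alarms` function, and the room names are hypothetical, and the sketch assumes events arrive in timestamp order. It only illustrates the per-room state the job has to keep, namely the type of the last event seen for each room.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Event type names taken from the list above.
ACTIVATE = "ActivateAlarm"
DEACTIVATE = "DeactivateAlarm"
MOTION = "MotionDetected"


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: a room, an event type, and a timestamp."""
    room: str
    kind: str
    timestamp: int


def detect_alarms(events: Iterable[Event]) -> List[Event]:
    """Return the MotionDetected events that should trigger an alarm.

    Assumes events are already ordered by timestamp.
    """
    # Per-room state: the type of the last event received for that room.
    last_kind: Dict[str, str] = {}
    alarms: List[Event] = []
    for event in events:
        # Alarm only if the previous event for this room was ActivateAlarm.
        if event.kind == MOTION and last_kind.get(event.room) == ACTIVATE:
            alarms.append(event)
        # Every event, including motion, becomes the room's new last event.
        last_kind[event.room] = event.kind
    return alarms


if __name__ == "__main__":
    events = [
        Event("lobby", ACTIVATE, 1),
        Event("lab", MOTION, 2),        # lab was never armed: no alarm
        Event("lobby", MOTION, 3),      # lobby was just armed: alarm
        Event("lobby", DEACTIVATE, 4),
        Event("lobby", MOTION, 5),      # lobby is disarmed: no alarm
    ]
    print([(e.room, e.timestamp) for e in detect_alarms(events)])
    # prints [('lobby', 3)]
```

The dictionary keyed by room stands in for the application state that has to survive an update of the job. Losing it would mean forgetting which rooms are currently armed.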

Putting this dataflow into Scala source code is pretty straightforward with Flink’s DataStream API:

Read more

Announcing Flink Forward San Francisco 2017

We are thrilled to announce the first Flink Forward San Francisco on April 10-11, 2017 at Hotel Kabuki, in the center of Japantown in San Francisco.

In this first edition of Flink Forward San Francisco, we are looking to connect with the already thriving Flink community in the Bay Area and beyond. Our mission is to foster innovation and discussion with developers around the world in order to push Apache Flink to the next level.

The Call for Papers will be open soon and you are invited to share your knowledge, use cases and best practices with the Apache Flink community and shape the program of the first edition of Flink Forward San Francisco!

Flink Forward San Francisco will take place on April 10-11, 2017. Participants are invited to join one day of hands-on Flink training sessions on April 10 followed by one day of speaker sessions. April 11 is dedicated to technical talks on how Flink is used in the enterprise, Flink system internals, ecosystem integrations with Flink, and the future of the platform.

Last but not least, tickets will be on sale soon!

Blink: How Alibaba Uses Apache Flink®

Improvements to Flink and its application in Alibaba Search

This is a guest post from Xiaowei Jiang, Senior Director of Alibaba’s search infrastructure team. The post is adapted from Alibaba’s presentation at Flink Forward 2016, and you can see the original talk from the conference here.

Alibaba is the largest e-commerce retailer in the world. Our annual sales in 2015 totalled $394 billion–more than eBay and Amazon combined. Alibaba Search, our personalized search and recommendation platform, is a critical entry point for our customers and is responsible for much of our online revenue, and so the search infrastructure team is constantly exploring ways to improve the product.

What makes for a great search engine on an e-commerce site? Results that, in real-time, are as relevant and accurate as possible for each user. At Alibaba’s scale, this is a non-trivial problem, and it’s difficult to find technologies that are capable of handling our use cases.

Apache Flink® is one such technology, and Alibaba is using Blink, a system based on Flink, to power critical aspects of its search infrastructure and to deliver relevance and accuracy to end users. In this post, I’ll walk through Flink’s role in Alibaba search and outline the reasons we chose to work with Flink on the search infrastructure team.

Read more

Savepoints: Turning Back Time

Reprocessing Data Streams with Savepoints in Apache Flink®

This post is the first in a series where the data Artisans team will highlight some of Apache Flink®’s core features.

By Fabian Hueske (@fhueske) and Mike Winters (@wints)

Stream processing is commonly associated with ‘data in motion’, powering systems that make sense of and respond to data in nearly the same instant it’s created. The most frequently discussed streaming topics, such as latency and throughput or watermarks and handling of late data, focus on the present rather than the past.

In reality, though, there are a number of cases where you’ll need to reprocess data that your streaming application has already processed. Some examples include:

  • Deployment of a new version of your application with a new feature, a bug fix, or a better machine learning model
  • A/B testing different versions of an application using the same source data streams, starting the test from the same point in time without sacrificing prior state
  • Evaluating and carrying out the migration of applications to newer releases of the processing framework or to a different cluster

Apache Flink’s savepoint feature enables all of the above and is one of the features that distinguish Flink from other distributed open source stream processors.

Read more