Apache Flink, Flink Features

Announcing the Apache Flink 1.4.0 Release

twitterredditlinkedinmailtwitterredditlinkedinmail
This post originally appeared on the Apache Flink blog. It was reproduced here under the Apache License, Version 2.0. 

The Apache Flink community is pleased to announce the 1.4.0 release. Over the past 5 months, the Flink community has been working hard to resolve more than 900 issues. See the complete changelog for more detail.

This is the fifth major release in the 1.x.y series. It is API-compatible with the other 1.x.y releases for APIs annotated with the @Public annotation.

We encourage everyone to download the release and check out the documentation.

Feedback through the Flink mailing lists is, as always, gladly encouraged!

You can find the binaries on the updated Downloads page on the Flink project site.

The release includes improvements to many different aspects of Flink, including:
  • The ability to build end-to-end exactly-once applications with Flink and popular data sources and sinks such as Apache Kafka.
  • A more developer-friendly dependency structure as well as Hadoop-free Flink for Flink users who do not have Hadoop dependencies.
  • Support for JOIN and for new sources and sinks in table API and SQL, expanding the range of logic that can be expressed with these APIs.
A summary of some of the features in the release is available below.

For more background on the Flink 1.4.0 release and the work planned for the Flink 1.5.0 release, please refer to this blog post on the Apache Flink blog.

New Features and Improvements

End-to-end Exactly Once Applications with Apache Flink and Apache Kafka and TwoPhaseCommitSinkFunction

Flink 1.4 includes a first version of an exactly-once producer for Apache Kafka 0.11. This producer enables developers who build Flink applications with Kafka as a data source and sink to compute exactly-once results not just within the Flink program, but truly “end-to-end” in the application.

The common pattern used for exactly-once applications in Kafka and in other sinks–the two-phase commit algorithm–has been extracted in Flink 1.4.0 into a common class, the TwoPhaseCommitSinkFunction (FLINK-7210). This will make it easier for users to create their own exactly-once data sinks in the future.

Table API and Streaming SQL Enhancements

Flink SQL now supports windowed joins based on processing time and event time (FLINK-5725). Users will be able to execute a join between 2 streaming tables and compute windowed results according to these 2 different concepts of time. The syntax and semantics in Flink are the same as standard SQL with JOIN and with Flink’s streaming SQL more broadly.

Flink SQL also now supports “INSERT INTO SELECT” queries, which makes it possible to write results from SQL directly into a data sink (an external system that receives data from a Flink application). This improves operability and ease-of-use of Flink SQL.

The Table API now supports aggregations on streaming tables; previously, the only supported operations on streaming tables were projection, selection, and union (FLINK-4557). This feature was initially discussed in Flink Improvement Proposal 11: FLIP-11.

The release also adds support for new table API and SQL sources and sinks, including a Kafka 0.11 source and JDBC sink.

Lastly, Flink SQL now uses Apache Calcite 1.14, which was just released in October 2017 (FLINK-7051).

A Significantly-Improved Dependency Structure and Reversed Class Loading

Flink 1.4.0 shades a number of dependences and subtle runtime conflicts, including:
  • ASM
  • Guava
  • Jackson
  • Netty
  • Apache Zookeeper
These changes improve Flink’s overall stability and removes friction when embedding Flink or calling Flink “library style”.

The release also introduces default reversed (child-first) class loading for dynamically-loaded user code, allowing for different dependencies than those included in the core framework.

For details on those changes please check out the relevant Jira issues:

Hadoop-free Flink

Apache Flink users without any Apache Hadoop dependencies can now run Flink without Hadoop. Flink programs that do not rely on Hadoop components can now be much smaller, a benefit particularly in a container-based setup resulting in less network traffic and better performance.

This includes the addition of Flink’s own Amazon S3 filesystem implementations based on Hadoop’s S3a and Presto’s S3 file system with properly shaded dependencies (FLINK-5706).

The details of these changes regarding Hadoop-free Flink are available in the Jira issue: FLINK-2268.

Improvements to Flink Internals

Flink 1.4.0 introduces a new blob storage architecture that was first discussed in Flink Improvement Proposal 19 (FLINK-6916).

This will enable easier integration with both the work being done in Flink Improvement Proposal 6 in the future and with other improvements in the 1.4.0 release, such as support for messages larger than the maximum Akka Framesize (FLINK-6046).

The improvement also enables Flink to leverage distributed file systems in high availability settings for optimized distribution of deployment data to TaskManagers.

Improvements to the Queryable State Client

Flink’s queryable state makes it possible for users to access application state directly in Flink before the state has been sent to an external database or key-value store.

Flink 1.4.0 introduces a range of improvements to the queryable state client, including a more container-friendly architecture, a more user-friendly API that hides configuration parameters, and the groundwork to be able to expose window state (the state of an in-flight window) in the future.

For details about the changes to queryable state please refer to the umbrella Jira issue: FLINK-5675.

Metrics and Monitoring

Flink’s metrics system now also includes support for Prometheus, an increasingly-popular metrics and reporting system within the Flink community (FLINK-6221).

And the Apache Kafka connector in Flink now exposes metrics for failed and successful offset commits in the Kafka consumer callback (FLINK-6998).

Connector improvements and fixes

Flink 1.4.0 introduces an Apache Kafka 0.11 connector and, as described above, support for an exactly-once producer for Kafka 0.11 (FLINK-6988).

Additionally, the Flink-Kafka consumer now supports dynamic partition discovery & topic discovery based on regex. This means that the Flink-Kafka consumer can pick up new Kafka partitions without needing to restart the job and while maintaining exactly-once guarantees (FLINK-4022).

Flink’s Apache Kinesis connector now uses an updated version of the Kinesis Consumer Library and Kinesis Consumer Library. This introduces improved retry logic to the connector and should significantly reduce the number of failures caused by Flink writing too quickly to Kinesis (FLINK-7366).

Flink’s Apache Cassandra connector now supports Scala tuples–previously, only streams of Java tuples were supported (FLINK-4497). Also, a bug was fixed in the Cassandra connector that caused messages to be lost in certain instances (FLINK-4500).

Release Notes – Please Read

Some of these changes will require updating the configuration or Maven dependencies for existing programs. Please read below to see if you might be affected.

Changes to dynamic class loading of user code

As mentioned above, we changed the way Flink loads user code from the previous default of parent-first class loading (the default for Java) to child-first classloading, which is a common practice in Java Application Servers, where this is also referred to as inverted or reversed class loading.

This should not affect regular user code but will enable programs to use a different version of dependencies that come with Flink – for example Akka, netty, or Jackson. If you want to change back to the previous default, you can use the configuration setting classloader.resolve-order: parent-first, the new default being child-first.

No more Avro dependency included by default

Flink previously included Avro by default so user programs could simply use Avro and not worry about adding any dependencies. This behavior was changed in Flink 1.4 because it can lead to dependency clashes.

You now must manually include the Avro dependency (flink-avro) with your program jar (or add it to the Flink lib folder) if you want to use Avro.

Hadoop-free Flink

Starting with version 1.4, Flink can run without any Hadoop dependencies present in the Classpath. Along with simply running without Hadoop, this enables Flink to dynamically use whatever Hadoop version is available in the classpath.

You could, for example, download the Hadoop-free release of Flink but use that to run on any supported version of YARN, and Flink would dynamically use the Hadoop dependencies from YARN.

This also means that in cases where you used connectors to HDFS, such as the BucketingSink or RollingSink, you now have to ensure that you either use a Flink distribution with bundled Hadoop dependencies or make sure to include Hadoop dependencies when building a jar file for your application.

List of Contributors

According to git shortlog, the following 106 people contributed to the 1.4.0 release. Thank you to all contributors!

Ajay Tripathy, Alejandro Alcalde, Aljoscha Krettek, Bang, Phiradet, Bowen Li, Chris Ward, Cristian, Dan Kelley, David Anderson, Dawid Wysakowicz, Dian Fu, Dmitrii Kniazev, DmytroShkvyra, Fabian Hueske, FlorianFan, Fokko Driesprong, Gabor Gevay, Gary Yao, Greg Hogan, Haohui Mai, Hequn Cheng, James Lafa, Jark Wu, Jie Shen, Jing Fan, JingsongLi, Joerg Schad, Juan Paulo Gutierrez, Ken Geis, Kent Murra, Kurt Young, Lim Chee Hau, Maximilian Bode, Michael Fong, Mike Kobit, Mikhail Lipkovich, Nico Kruber, Novotnik, Petr, Nycholas de Oliveira e Oliveira, Patrick Lucas, Piotr Nowojski, Robert Metzger, Rodrigo Bonifacio, Rong Rong, Scott Kidder, Sebastian Klemke, Shuyi Chen, Stefan Richter, Stephan Ewen, Svend Vanderveken, Till Rohrmann, Tony Wei, Tzu-Li (Gordon) Tai, Ufuk Celebi, Usman Younas, Vetriselvan1187, Vishnu Viswanath, Wright, Eron, Xingcan Cui, Xpray, Yestin, Yonatan Most, Zhenzhong Xu, Zhijiang, adebski, asdf2014, bbayani, biao.liub, cactuslrd.lird, dawidwys, desktop, fengyelei, godfreyhe, gosubpl, gyao, hongyuhong, huafengw, kkloudas, kl0u, lincoln-lil, lingjinjiang, mengji.fy, minwenjun, mtunique, p1tz, paul, rtudoran, shaoxuan-wang, sirko bretschneider, sunjincheng121, tedyu, twalthr, uybhatti, wangmiao1981, yew1eb, z00376786, zentol, zhangminglei, zhe li, zhouhai02, zjureel, 付典, 军长, 宝牛, 淘江, 金竹
twitterredditlinkedinmailtwitterredditlinkedinmail