Needless to say, we here at data Artisans spend a lot of time thinking about stream processing. Even cooler: we spend a lot of time helping others think about stream processing and how to apply streaming to data problems in their organizations.
A good first step in this process is understanding misconceptions about the modern stream processing space (and in a rapidly changing space that is high in its hype cycle, there are many misconceptions worth talking about).
We’ve selected six of them to walk through in this post, and since Apache Flink® is the open-source stream processing framework that we’re most familiar with, we’ll provide examples in the context of Flink.
A team of original Apache Flink® contributors founded data Artisans in 2014 because we believed that existing data processing frameworks weren’t adequately addressing the needs of organizations and their engineering teams. From the global saturation of smartphones, to the rapid adoption of the Internet of Things and connected devices, the very nature of data and how it is generated had evolved far more quickly than the tools available to manage that data.
In this post, we would like to shed some light on Apache Beam, the new Apache Incubator project that Google initiated with us and other partners. We would like to highlight our involvement in Beam and how we see the relationship between Beam and Flink developing in the future. See also Google’s perspective on how Beam and Flink relate.
We are delighted to see that the Flink community has announced the availability of Apache Flink™ 1.0. This release is one of the largest Flink releases ever, with about 64 individuals resolving more than 450 JIRA issues, and, most importantly, it marks the beginning of the Flink 1.x.y series, which guarantees backwards compatibility across all future 1.x minor releases. We see this release as the most important milestone in the project since Flink graduated from the Apache Incubator one year ago. Additionally, we see this release as (1) validating production-readiness for Flink, and (2) significantly pushing the envelope in stream processing with features that are unique in the open source world.
This is the second blog post in our series on how Flink enables new streaming applications. The first post introduced event time and out-of-order processing. This post is about versioning application state, and it will be followed by a post about sessions and advanced windows.
Today, we are delighted to see Dataflow publicly proposed as an Apache Incubator project with an initial project team from Google, data Artisans, Cloudera, and others. data Artisans is the second-largest contributing organization after Google, with a total of 4 proposed initial committers.
Part I: The power of event time and out-of-order stream processing
Stream data processing is booming in popularity, as it promises better insights from fresher data, as well as a radically simplified pipeline from data ingestion to analytics. Data production in the real world has always been a continuous process (for example, web server logs, user activity in mobile applications, database transactions, or sensor readings). As has been noted by others, until now most pieces of the data infrastructure stack were built on the underlying assumption that data is finite and static. To bridge this fundamental gap between continuous data production and the limitations of older “batch” systems, companies have been building complex and fragile end-to-end pipelines. Modern data streaming technology obviates the need for such workarounds by modeling and processing data in the form in which it is produced: a stream of real-world events.
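To make the idea of processing data “in the form in which it is produced” concrete, here is a toy sketch (plain Python, not the Flink API) that counts timestamped events in fixed-size tumbling windows, the kind of continuous aggregation a stream processor performs natively; the event data and the `tumbling_window_counts` helper are invented for illustration:

```python
from collections import defaultdict

# Hypothetical event stream: (event_timestamp_seconds, user_id) pairs.
# In a real system these would arrive continuously, not as a finite list.
events = [(1, "a"), (3, "b"), (7, "a"), (11, "c"), (14, "a")]

def tumbling_window_counts(stream, window_size):
    """Count events per fixed-size (tumbling) time window,
    keyed by each window's start timestamp."""
    counts = defaultdict(int)
    for ts, _user in stream:
        window_start = (ts // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)

# Count events per 5-second window.
print(tumbling_window_counts(events, 5))  # → {0: 2, 5: 1, 10: 2}
```

A batch system would have to collect, sort, and re-segment these events into artificial files before computing the same result; a stream processor simply keeps the window counts as running state and emits each window when it closes.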
While these numbers are impressive on their own, in Flink 0.10 the whole is truly greater than the sum of its parts. Together, the new features make Flink a data stream processor that truly stands out in the open source space and significantly eases the effort of bringing streaming jobs into production. While the official release announcement provides an extensive list of new features, this blog post focuses on those features that jointly improve the experience of developing and operating stream processing applications.