Announcing Google Cloud Dataflow on Flink and easy Flink deployment on Google Cloud
Posted on Apr 5th, 2015 by Maximilian Michels
Today, we are pleased to announce a deeper engagement between Google, data Artisans, and the broader Apache Flink™ community to bring easy Flink deployment to Google Cloud Platform, and enable Google Cloud Dataflow users to leverage Apache Flink™ as a backend.
Flink deployment on Google Cloud Platform
We recently contributed a patch to bdutil, Google’s open source tool for deploying data processing systems on Google Compute Engine. In addition to managing Hadoop on Google Compute Engine, bdutil now lets you deploy Flink as easily as:
bdutil -e extensions/flink/flink_env.sh deploy
See here for detailed instructions.
Automatic Flink deployment on Google Compute Engine is a natural next step after our recent experience of using Flink and Google Compute Engine to factorize a 28-billion-element matrix in 5 hours on a 40-node cluster. Check out our recent blog post here and an extended version here.
Google Cloud Dataflow on Flink
Google Cloud Dataflow is a data analytics service running on Google’s infrastructure. It allows users to write sophisticated data analytics pipelines for both batch and streaming programs and run them at scale on Google Cloud Platform. Dataflow offers a unified view of batch and stream processing, as well as highly flexible window semantics that support complex event stream analysis patterns.
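To make the idea of window semantics a little more concrete, here is a minimal sketch of the simplest case: fixed (tumbling) windows, where each event is assigned to a window by its timestamp and events are then counted per window. The class and method names (FixedWindowSketch, windowStart, countPerWindow) are hypothetical and chosen for illustration only. This is plain Java, not the Dataflow SDK or Flink API, and it ignores the late-data and triggering concerns that a real engine handles.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative only: hypothetical names, not part of the Dataflow SDK or Flink.
final class FixedWindowSketch {

    /** Returns the start of the fixed window [start, start + size) containing the timestamp. */
    static long windowStart(long timestampMillis, long sizeMillis) {
        // floorMod keeps the result correct for negative timestamps as well.
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    /** Counts events per fixed window; the map key is the window start in milliseconds. */
    static SortedMap<Long, Integer> countPerWindow(long[] timestampsMillis, long sizeMillis) {
        SortedMap<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMillis) {
            counts.merge(windowStart(ts, sizeMillis), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long oneMinute = 60_000L;
        long[] events = {0L, 59_999L, 60_000L, 125_000L};
        // Prints the event count for each one-minute window.
        System.out.println(countPerWindow(events, oneMinute));
    }
}
```

Because the window is derived from the event's own timestamp rather than from arrival time, the same assignment logic applies whether the events come from a bounded file or an unbounded stream.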
Cloud Dataflow is a descendant of Google’s FlumeJava and MillWheel projects. Google recently released an SDK for Dataflow as open source. The SDK decouples the programming model from the execution engine, via pluggable “runners”. Google provides runners to run Dataflow programs on Google Cloud Platform, or on a local machine (for development).
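The decoupling described above can be illustrated with a small, self-contained sketch. All names here (SketchPipeline, SketchRunner, LocalSketchRunner, ParallelSketchRunner) are hypothetical and are not the Dataflow SDK's classes; the point is only to show the pattern in which a pipeline records what to compute, while an interchangeable runner decides how to execute it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative only: hypothetical names, not the Dataflow SDK API.

/** Records an ordered list of element-wise steps without executing anything. */
final class SketchPipeline {
    private final List<Function<String, String>> steps = new ArrayList<>();

    SketchPipeline apply(Function<String, String> step) {
        steps.add(step);
        return this;
    }

    /** Applies every recorded step, in order, to a single element. */
    String applyAll(String element) {
        String value = element;
        for (Function<String, String> step : steps) {
            value = step.apply(value);
        }
        return value;
    }
}

/** The pluggable part: each runner executes the same pipeline in its own way. */
interface SketchRunner {
    List<String> run(SketchPipeline pipeline, List<String> input);
}

/** Sequential, in-process execution, comparable to a local development runner. */
final class LocalSketchRunner implements SketchRunner {
    @Override
    public List<String> run(SketchPipeline pipeline, List<String> input) {
        List<String> output = new ArrayList<>();
        for (String element : input) {
            output.add(pipeline.applyAll(element));
        }
        return output;
    }
}

/** Stand-in for a distributed backend: same contract, parallel execution. */
final class ParallelSketchRunner implements SketchRunner {
    @Override
    public List<String> run(SketchPipeline pipeline, List<String> input) {
        // A parallel stream over a List keeps encounter order when collected.
        return input.parallelStream()
                .map(pipeline::applyAll)
                .collect(Collectors.toList());
    }
}

final class RunnerSketchDemo {
    static SketchPipeline buildPipeline() {
        return new SketchPipeline().apply(String::trim).apply(String::toUpperCase);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(" flink ", "dataflow");
        SketchPipeline pipeline = buildPipeline();
        // The pipeline definition is unchanged; only the runner differs.
        System.out.println(new LocalSketchRunner().run(pipeline, input));
        System.out.println(new ParallelSketchRunner().run(pipeline, input));
    }
}
```

Swapping the runner changes where and how the work happens, but the pipeline definition and its results stay the same. That is the property that lets a new execution backend be added without touching user programs.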
Today, we are pleased to announce a Flink runner for Cloud Dataflow. Dataflow users can now run their programs using Apache Flink™ as the execution backend. The current Flink runner supports all the batch functionality of Dataflow. We are currently working on bringing the Dataflow streaming functionality into the Flink runner. Fortunately, Flink already supports flexible window semantics, as does Cloud Dataflow.
Flink and Cloud Dataflow are very well aligned, as they share the vision of natively unifying stream and batch processing at the engine level. Flink has always executed both batch and streaming programs using the same streaming (pipelined) engine. The addition of Flink to the family of Dataflow SDK runners (which now includes Google’s cloud platform, a local runner, and a Cloudera-contributed Apache Spark runner) is great for users who want to run the same hybrid analytical pipelines both in the cloud and on premises.
Click here to get started on Google Dataflow. To install the Flink Dataflow runner, follow the instructions here. As always, we would love to know what you think, so please give us feedback by submitting an issue.
For more information, see the announcement on the Google Cloud Platform Blog.