Stream processing at Bouygues Telecom with Apache Flink™
This is a guest post written by Mohamed Amine Abdessemed from Bouygues Telecom
About Bouygues TelecomBouygues Telecom is a full-service communication operator (mobile, fixed telephony, TV, Internet, and Cloud computing) and one of the largest providers in France, with over 11 million mobile subscribers and 2.5 million fixed customers, generating a revenue of 5.43 billion euros (2014). At Bouygues Telecom we stand for our innovative products, services and award-winning customer relations: We are pioneers in Ultra High Speed Mobile with over 72% of the population covered in 4G/4G+, we offer the first TV box running Android and more. Our team is in charge of providing systems and applications to our telecom engineers in order to create, develop, and maintain our network; to provide the best services for our customers. We came to know Apache Flink™ during a Hadoop User Group meetup in Paris. Since then, we have been using Flink internally for a variety of use cases with very good results.
LUX: Logged User ExperienceBut first, a bit of context: At Bouygues, our customers experience is a top priority, that’s why we’ve invested in using Big Data solutions to provide a new network-oriented vision of our Data and give our engineers real-time insight about:
- The customer’s experience on our network.
- Global & detailed picture on what’s happening on the network.
- The network evolutions and operations.
- The Planck system is responsible for data collection, and is implemented using Apache Camel.
- The Vortex system, responsible for gathering all collected data in a central bus, and is implemented using Apache Kafka.
- The Candela system is responsible for providing real-time diagnostics to upstream users, and is based on Elasticsearch.
- Finally, the Lumen system is responsible for data transformation, alarming, reporting and analytics, and is implemented using several components of Cloudera’s Hadoop distribution (in particular Apache Flume, Apache Spark, Apache Hive, Apache Oozie), as well as Apache Flink™, the focus of this post.
Real-time data integrationLinkedIn pioneered a design pattern for using Kafka as a central component to collect streams to solve its data integration problem. The idea is to collect all the organization’s data and put it into a central log for real-time data subscription. We have tried to put this design pattern to work at Bouygues as well. Early on, we ran into the following problem: lots of data comes in a raw format, even in binary encoding with no visible business logic information. This renders this data virtually unusable. We explored three possible solutions to this problem:
- Each data source processes its own data before pushing it into Kafka. We rejected this solution as not viable for our case. Data sources have really limited computation resources assigned to log collection, so they need to push the data in raw format.
- Each Kafka consumer decodes the raw data. This is problematic, as every consumer implements the same business logic wasting resources and makes code deployment and maintenance too hard to achieve.
- Inject an intermediate processing pipeline that processes the raw data in Kafka and feeds the data back to Kafka in decoded and enriched format. This seemed like the best solution for our case. However, this processing pipeline needs to be very fast in order to respect our SLAs.
Apache Camel collects raw log data from the sources and forwards it to Kafka with an average rate of 20K event per second (and up to 40K events per second in busy hours). Then, a Flink data transformation streaming topology with exactly-once guarantees that uses Flink’s persistent Kafka source is transforming the raw data into a usable and enriched form on the fly and pushing it back to Kafka. Upstream systems (such as Elasticsearch) consume the transformed data that have been fed back to Kafka. Additionally, Flink is used to implement our alarming functionality, consuming directly the data from the previous Flink topology. There, a sliding window is used to create counters and push them to a specialized metric system to keep track of failure occurrences over time, and forwards alarms to an alarming system if a certain threshold of failures is detected.