Research Track

Hardware-efficient Stream Processing

In the era of big data and AI, many data-intensive applications, such as streaming, have requirements that cannot be satisfied by traditional batch processing models. In response, distributed stream processing systems, such as Spark Streaming or Apache Flink, exploit the resources of a compute cluster for streaming applications. As with any distributed system, this raises the question of how efficiently these systems utilise the available hardware resources on each node. At the same time, with highly parallel heterogeneous architectures becoming commonplace in data centres, stream processing systems can exploit previously unseen levels of parallelism even from single nodes. Our engine SABER is a step in this direction: it exploits the parallelism that both multi-core CPUs and GPGPUs offer within a single node to achieve high processing throughput while maintaining low latency. We use a modified version of the Yahoo Streaming Benchmark to measure the performance of the aforementioned systems and to compare distributed and centralised computation. SABER processes 79 million tuples per second with 8 CPU cores, outperforming Flink (3x), Spark Streaming (7x) and StreamBox (7x), and it even outperforms a cluster-based deployment with 40 CPU cores.
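
To make the single-node design concrete, the sketch below shows one way a hybrid engine can hand fixed-size batches of tuples to whichever worker becomes free first; this is a minimal illustration, not SABER's actual scheduler, and the names (TaskQueue, sumOnCpu) are placeholders. A GPU backend would pull batches from the same shared queue.

```cpp
// Hedged sketch of single-node, batch-at-a-time dispatch: a producer splits
// the stream into batches and a pool of workers drains a shared task queue.
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <numeric>
#include <queue>
#include <thread>
#include <vector>

struct Batch { std::vector<int64_t> values; };

class TaskQueue {
    std::queue<Batch> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
public:
    void push(Batch b) {
        { std::lock_guard<std::mutex> l(m_); q_.push(std::move(b)); }
        cv_.notify_one();
    }
    bool pop(Batch& out) {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [&] { return !q_.empty() || done_; });
        if (q_.empty()) return false;           // closed and drained
        out = std::move(q_.front()); q_.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> l(m_); done_ = true; }
        cv_.notify_all();
    }
};

// Stand-in for a CPU streaming operator (here: a per-batch aggregation).
int64_t sumOnCpu(const Batch& b) {
    return std::accumulate(b.values.begin(), b.values.end(), int64_t{0});
}

int main() {
    TaskQueue queue;
    std::mutex printMutex;
    std::vector<std::thread> workers;
    for (int w = 0; w < 4; ++w) {               // 4 CPU workers; a GPU worker
        workers.emplace_back([&] {              // would share the same queue
            Batch b;
            while (queue.pop(b)) {
                int64_t s = sumOnCpu(b);
                std::lock_guard<std::mutex> l(printMutex);
                std::cout << "batch sum = " << s << "\n";
            }
        });
    }
    for (int i = 0; i < 16; ++i) {              // producer: emit 16 batches
        Batch b; b.values.assign(1024, i);
        queue.push(std::move(b));
    }
    queue.close();
    for (auto& t : workers) t.join();
}
```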

However, even these results are not satisfactory, as there is still a large performance gap between handwritten code and current stream processing systems. By comparing a handwritten C++ program with the single-core implementations of these systems, we observe a speedup of more than 2x, which reveals performance lost to the systems' design. Thus, we have to recalibrate the way we approach stream processing and focus on hardware-conscious techniques. In our current research, we have begun designing highly efficient streaming operator implementations that exploit superscalar execution and SIMD parallelism. We also plan to introduce compilation-based techniques that keep data in CPU registers for as long as possible, while taking into account the non-uniform memory access (NUMA) effects caused by multiple CPU sockets on modern scale-up architectures.
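
As a small illustration of what a hardware-conscious operator can look like, the sketch below computes a tumbling-window sum over contiguous tuples with explicit SIMD lanes. It is not the operator code from this work; the AVX2 path is an assumption (x86-64 built with -mavx2), with a scalar fallback otherwise, and 32-bit lane overflow is ignored for brevity.

```cpp
// Hedged sketch: SIMD-friendly tumbling-window sum over dense int32 tuples.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Sum `n` contiguous int32 values (one window's worth of data).
int64_t windowSum(const int32_t* data, size_t n) {
#if defined(__AVX2__)
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {                       // 8 int32 lanes per step
        __m256i v = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(data + i));
        acc = _mm256_add_epi32(acc, v);                // lane overflow ignored here
    }
    alignas(32) int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    int64_t sum = 0;
    for (int32_t l : lanes) sum += l;                  // horizontal reduction
    for (; i < n; ++i) sum += data[i];                 // scalar tail
    return sum;
#else
    int64_t sum = 0;
    for (size_t i = 0; i < n; ++i) sum += data[i];     // compiler may auto-vectorise
    return sum;
#endif
}

int main() {
    std::vector<int32_t> window(1 << 20, 1);           // one window of tuples
    std::cout << windowSum(window.data(), window.size()) << "\n";  // prints 1048576
}
```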

Authors

George Theodorakis
Imperial College

George is a Ph.D. student in the Large-Scale Distributed Systems (LSDS) group at Imperial College London, under the supervision of Dr. Peter Pietzuch. His Ph.D. is supported by a CDT HiPEDS scholarship. Prior to this, he was an undergraduate student in the Electrical and Computer Engineering department of the National Technical University of Athens, where he conducted his thesis in affiliation with CSLab.
