Database Systems and Information Management

RHINO - Managing Very Large Distributed State for Scalable Stream Processing

Problem. Data stream processing systems are unable to fully cope with the massive amounts of complex-data generated at high-rates in Big Data-, Industry 4.0-, and IoT-based applications.

Challenge. Guaranteeing fault-tolerance, resource elasticity, and dynamic load balance requires the transfer of state, which in turn introduces latency, proportional to its size. Exactly-once stream processing engines (SPEs) require consistent state (i.e., results must be accurate, regardless of system failure, and the rescaling & rebalancing of operations on the state). SPEs must be able to continuously process stream tuples despite any of these operations. To the best of our knowledge, there is no stream processing system that fully offers robust state management, in order to efficiently handle very large, distributed state.

Existing SPEs (e.g., Apache Flink, Apache Spark, Apache Samza, and Timely Dataflow) offer fast stateful processing of data streams with low-latency and high-throughput, despite fluctuations in the data rate. However, stateful processing would further benefit from on-demand resource elasticity. Today, academic and industry researchers are able to address resource elasticity for stateful processing, while ensuring fault tolerance solely for partitioned or partially-distributed large state. Many streaming applications require stateful processing and generate large state that pushes SPEs to their limits (e.g., multimedia services, online marketplaces). In these applications, the size of the state can swell (up to many TBs). Current SPEs fail to use computing resources efficiently for large state sizes.

Objectives

  1. To address scalable data stream processing and analytics challenges arising in Big Data, Cloud Computing, Industry 4.0, and IoT (Internet of Things) applications.
  2. To develop a novel state management solution for scalable (i.e., low-latency, high-throughput) stream processing that enables fine-grained fault-tolerance, on-demand resource scaling, and load balancing in the presence of very large (e.g., hundreds of GBs) distributed state.

To devise a technological framework that seamlessly provides fault-tolerance, resource-scaling with zero downtime, and offers high-resource efficiency, lower operational costs, and reduced time-to-knowledge to end-users working on large-scale data applications.

The Rhino project is funded by the Federal Ministry of Education and Research (BMBF) as part of the Software Campus program, and is supported by Huawei Technologies.

Projektdauer: 03/2019 - 06/2021