Despite recent advances in distributed computing and the availability of big data platforms such as Apache Flink and Apache Spark, datasets continue to grow in size. Emerging technologies such as the Internet of Things (IoT) further increase the need for novel solutions that speed up data analysis, particularly for streaming data. Computer scientists use a variety of approaches to cope with this data deluge. One approach is to compute sketches of large datasets, i.e., compact summaries that allow us to approximate certain characteristics of the original data, such as the average, variance, or extrema. In the EDADS (Efficient Data Analysis Based on Data Summaries) Software Campus Project, our principal aim is to design and implement sketch algorithms for streaming data in modern dataflow engines. Such sketches would reduce the size of data streams. Furthermore, data analytics tasks (e.g., anomaly detection) could then operate on the sketch instead of examining the entire dataset, thereby shortening the execution time of the analysis.
Project Duration: 06/2019 - 09/2020
Supervisor: Prof. Dr. Volker Markl
Advisor: Martin Kiefer