DORIAN is a project of Software Campus.
Modern companies rely heavily on data-driven insights. They employ complex data science (DS) processes that span a wide spectrum of tasks: analysis of business cases, data collection, integration, preprocessing, modeling and predictive analytics, experimentation and evaluation of results, deployment, monitoring, visualization, and reporting. The process is highly iterative and dynamic, as are modern business and computational environments. Data sources and execution systems are heterogeneous, and the responsible teams are diverse. This complexity and variation produce significant overhead for the analysts who run and manage data-intensive applications.
In this project, we aim to reduce the overhead of monitoring and inspecting complex data science workflows by designing a prototype system for end-to-end management of data science processes. We focus on one common management task: automated documentation of workflows for data-intensive experiments, to facilitate reproducibility, systematic comparison, and reuse. By documentation we mean deriving a declarative representation of the workflow and capturing provenance and metadata of the underlying digital artifacts (e.g., datasets, DS pipelines, predictive models) at runtime, so that the state of the experiment (software dependencies, hardware specification, source code version, intermediate artifacts, etc.) is tracked and the experiment can be reproduced.
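To make the idea of capturing experiment state at runtime concrete, here is a minimal Python sketch. It is not the DORIAN implementation: the function names `fingerprint` and `capture_state` and the layout of the snapshot dictionary are hypothetical, chosen only for illustration. The sketch records the software and hardware environment and a content hash per artifact, so that two runs can later be compared.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 content hash identifying one version of an artifact."""
    return hashlib.sha256(data).hexdigest()


def capture_state(artifacts: dict) -> dict:
    """Collect a snapshot of the experiment state (hypothetical schema).

    `artifacts` maps an artifact name to its raw bytes.
    """
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "software": {
            "python": platform.python_version(),
            "implementation": platform.python_implementation(),
        },
        "hardware": {
            "machine": platform.machine(),
            "system": platform.system(),
        },
        # Content hashes change whenever an artifact changes, which makes
        # differences between two runs easy to detect.
        "artifacts": {name: fingerprint(blob) for name, blob in artifacts.items()},
    }


snapshot = capture_state({"dataset": b"x,y\n1,2\n3,4\n", "model": b"weights-v1"})
print(json.dumps(snapshot, indent=2))
```

A real system would also need installed package versions and the source code revision; these are left out here to keep the sketch dependency-free.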
As part of this project, we design a high-level abstraction for the declarative specification of DS workflows. We implement a prototype management system that automatically extracts this declarative intermediate representation (IR) from a data science experiment and persists it in an experiment database for later reproduction, search, comparison, and reuse.
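As a rough illustration of what a declarative intermediate representation and an experiment database could look like, the following Python sketch models a workflow as an ordered list of steps and stores it as JSON in SQLite. The `Step` and `Workflow` classes, the table layout, and the helpers `persist` and `find_by_operator` are assumptions made for this example, not the project's actual schema. The IR is also built by hand here, whereas the prototype extracts it automatically.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    """One declarative workflow step (hypothetical schema)."""
    name: str
    operator: str
    params: dict = field(default_factory=dict)


@dataclass
class Workflow:
    """Declarative IR of an experiment: an ordered list of steps."""
    experiment_id: str
    steps: list = field(default_factory=list)


def persist(conn: sqlite3.Connection, wf: Workflow) -> None:
    """Store the IR as JSON, keyed by experiment id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS experiments (id TEXT PRIMARY KEY, ir TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO experiments VALUES (?, ?)",
        (wf.experiment_id, json.dumps(asdict(wf), sort_keys=True)),
    )
    conn.commit()


def find_by_operator(conn: sqlite3.Connection, operator: str) -> list:
    """Return ids of experiments whose IR contains a step using `operator`."""
    rows = conn.execute("SELECT id, ir FROM experiments ORDER BY id").fetchall()
    return [
        exp_id
        for exp_id, ir in rows
        if any(step["operator"] == operator for step in json.loads(ir)["steps"])
    ]


conn = sqlite3.connect(":memory:")
persist(conn, Workflow("exp-001", [
    Step("scale", "StandardScaler"),
    Step("fit", "LogisticRegression", {"C": 1.0}),
]))
persist(conn, Workflow("exp-002", [Step("fit", "RandomForest", {"n_estimators": 100})]))

print(find_by_operator(conn, "LogisticRegression"))  # prints ['exp-001']
```

Because the representation is plain data rather than executable code, it can be searched and compared without rerunning the experiment.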
For more information, please visit https://softwarecampus.de/en/project/dorian-reproducibility-inspection-and-automation-of-data-oriented-experiments/.
Project Duration: 01/01/2020 - 31/12/2021
Software Campus Participant: Sergey Redyuk
Project Partner: Software AG
Funding Agency: German Aerospace Center (DLR)