The Software Campus-backed project LAPSE aims to develop a system architecture that mitigates communication costs for distributed machine learning.
Problem. Training machine learning (ML) models on a cluster, in contrast to a single machine, increases both compute power and available memory. As a trade-off, however, it requires communication among cluster nodes to synchronize model parameters. For some ML models, this synchronization can dominate the training process and thereby negate the benefits of employing a cluster.
Solution. To reduce communication, researchers have developed algorithms that exploit locality, in which each worker updates only a subset of the model parameters at a given time. Typically, workers update different subsets over the course of training. Locality-exploiting algorithms (LEAs) exist for multiple types of ML models, and locality can stem from the training algorithm, the ML model, or the training data.
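To illustrate the idea, the following is a minimal sketch (not the LAPSE API) of a toy parameter server in which each worker pulls, updates, and pushes only the subset of parameters it currently needs; the class and function names are hypothetical:

```python
# Illustrative sketch of locality-exploiting training against a parameter
# server: workers touch only a small "key" subset per step, so only that
# subset needs to be communicated and synchronized.

class ToyParameterServer:
    def __init__(self, num_keys):
        # All model parameters, keyed by integer id, initialized to zero.
        self.params = {k: 0.0 for k in range(num_keys)}

    def pull(self, keys):
        # A worker fetches only the parameters it will update.
        return {k: self.params[k] for k in keys}

    def push(self, updates):
        # A worker sends back updates only for the keys it touched.
        for k, delta in updates.items():
            self.params[k] += delta


def worker_step(server, keys):
    # A locality-exploiting worker operates on a small key subset;
    # over the course of training, different steps use different subsets.
    local = server.pull(keys)
    updates = {k: 0.1 for k in local}  # placeholder for a gradient update
    server.push(updates)


server = ToyParameterServer(num_keys=6)
worker_step(server, keys=[0, 1])  # this step touches keys 0 and 1 only
worker_step(server, keys=[4, 5])  # a later step touches a different subset
```

In this sketch, keys 2 and 3 are never communicated at all; in a real system, the savings come from workers avoiding synchronization on the (often large) majority of parameters they do not currently touch.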
Lingering Hurdle. Typically, ML developers implement LEAs from scratch, which requires them to possess knowledge of low-level details of distributed computing systems. In the LAPSE project, we aim to develop a system that enables both researchers and practitioners to implement LEAs without the need for detailed distributed computing knowledge.
Contribution. A novel state-of-the-art architecture for distributed ML that meets the needs of parameter servers and is usable and efficient for LEAs. Our intention is to yield a solution that is applicable to a wide range of ML applications and aids in the development of advanced ML-based solutions for today's societal challenges.
Project Duration: 01/2019 - 06/2021
Supervisor: Prof. Dr. Volker Markl