Database Systems and Information Management

Myriad

Myriad is a toolkit for scalable parallel data generators.

Generating large sets of synthetic data according to a predefined schema and a set of statistical restrictions is a challenging problem of increasing importance, especially for benchmarking and testing systems designed to handle web-scale amounts of data. Myriad aims to ease this process by providing a fast and intuitive way to define custom data generators tailored to the requirements of a concrete use case.

All generators created with the toolkit support parallel execution on shared-nothing architectures. Our parallelization approach builds on the idea of mapping fixed-size chunks from an underlying pseudorandom number generator (PRNG) into pseudorandom sequences with user-defined data types. The parallel execution model adopted by the toolkit relies on horizontal partitioning of all generated record sequences. To this end, the runtime library performs efficient seed-skip operations on the underlying PRNGs, so that each generator node starts directly at the first position of the record sequences assigned to it.
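The partitioning scheme described above can be sketched as follows. This is a minimal illustration and not Myriad's actual implementation: the SplitMix64 generator stands in for the toolkit's PRNG because its state is a plain counter and can therefore be skipped ahead in constant time, and the names SkippablePRNG, CHUNK and generate_partition are hypothetical.

```python
MASK64 = (1 << 64) - 1
GAMMA = 0x9E3779B97F4A7C15  # SplitMix64 state increment


class SkippablePRNG:
    """SplitMix64, used here only as an example of a PRNG with O(1) skip."""

    def __init__(self, seed):
        self.state = seed & MASK64

    def skip(self, n):
        # The state advances by a constant per draw, so skipping n draws
        # is a single multiply-add instead of n calls to next().
        self.state = (self.state + n * GAMMA) & MASK64

    def next(self):
        self.state = (self.state + GAMMA) & MASK64
        z = self.state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        return z ^ (z >> 31)


CHUNK = 4  # assumed number of random values consumed per record


def generate_partition(seed, total, node, nodes):
    """Generate the records of one horizontal partition on one node."""
    begin = total * node // nodes
    end = total * (node + 1) // nodes
    rng = SkippablePRNG(seed)
    rng.skip(begin * CHUNK)  # jump to the chunk of the first assigned record
    records = []
    for pos in range(begin, end):
        chunk = tuple(rng.next() for _ in range(CHUNK))
        records.append((pos, chunk))  # a real generator maps chunk to typed fields
    return records
```

Under these assumptions, concatenating the partitions produced by all nodes yields exactly the sequence a single node would generate, without any communication between nodes.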

Moreover, since the random values of each record depend entirely on its sequence position, the same technique facilitates an efficient realization of a broad set of reference-based model restrictions. Consider a simple data model consisting of two random sequences of type MOVIE (m_i) and DIRECTOR (d_j) and a directed link between them (MOVIE → DIRECTOR), which implies a foreign key constraint of the form m.directorid = d.id for each linked (m_i, d_j) pair. The seed-skip approach proposed by Myriad enables position-based sampling of an arbitrary director d_j for each movie m_i, even though the partitions containing m_i and d_j are in general assigned to different nodes. The imposed foreign key constraint can then be implemented through local re-computation of the d_j.id value based on the position j of the currently sampled DIRECTOR.
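A minimal sketch of this position-based reference resolution is given below. It is an illustration under stated assumptions rather than Myriad's API: SplitMix64 is again used as a stand-in PRNG with constant-time random access, and the seeds, the table size, the field layout and the functions random_at, director_at and movie_at are all hypothetical.

```python
MASK64 = (1 << 64) - 1
GAMMA = 0x9E3779B97F4A7C15  # SplitMix64 state increment


def random_at(seed, index):
    """Return the index-th value (0-based) of a SplitMix64 stream in O(1)."""
    z = (seed + (index + 1) * GAMMA) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


# All constants below are assumptions made for the example.
DIRECTOR_SEED, MOVIE_SEED = 11, 23
NUM_DIRECTORS = 1000
CHUNK = 2  # random values reserved per record


def director_at(j):
    """Recompute DIRECTOR d_j from its position alone."""
    base = j * CHUNK
    return {
        "id": j + 1,
        "birth_year": 1930 + random_at(DIRECTOR_SEED, base) % 60,
    }


def movie_at(i):
    """Compute MOVIE m_i, including its foreign key, on any node."""
    base = i * CHUNK
    # Sample the position j of the referenced director from m_i's own chunk.
    j = random_at(MOVIE_SEED, base) % NUM_DIRECTORS
    # Recompute d_j locally; no lookup on the node that owns d_j is needed.
    director = director_at(j)
    return {
        "id": i + 1,
        "year": 1950 + random_at(MOVIE_SEED, base + 1) % 70,
        "directorid": director["id"],
    }
```

Because director_at(j) is a pure function of j, every node that samples the same position obtains the same d_j.id, so m.directorid = d.id holds for each generated pair regardless of how the two sequences are partitioned.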

Myriad is developed in the context of the Stratosphere project as an ongoing collaboration between the Database Systems Research Group, TU Berlin and the IBM Center for Advanced Studies, Toronto.