Wasserwirtschaft und Hydrosystemmodellierung

Performance optimisations for the Hydroinformatics Modeling System (hms++)

The Hydroinformatics Modeling System (hms) is a modeling framework developed by the Chair of Water Resources Management and Modeling of Hydrosystems, Technische Universität Berlin, since 2004. See the project page for details.

hms++ is a complete rewrite of hms with the goal of comprehensive performance optimisation.

Software architecture

hms++ aims to provide an efficient framework for the finite volume method on two-dimensional structured and unstructured grids, as well as an optimised implementation of first and second order HLLC Riemann solvers for the shallow water equations (SWE).

As the name indicates, hms++ is fully written in C++, which allows several avenues for optimisation when compared to the previous implementation of hms in Java:

  • no Java Virtual Machine being required on the target machine
  • allowing machine-specific optimisations by the compiler
  • allowing to manually place compiler optimisation directives
  • availability of mature high-performance computation libraries
  • greater control of the memory layout, which is essential for achieving locality and alignment
  • availability of certain advanced optimisation strategies, such as expression templates
  • easier interfacing with graphics processing units (GPUs) using CUDA or HIP

Furthermore, the availability of data pointers allows for more efficient and intuitive usage of the MPI standard for distributed computing.

Optimisation strategies

Multiple optimisation strategies are considered.
Code metrics such as cache utilisation (hits/misses), vectorisation and branch predicts/mispredicts are to be investigated for critical code sections. For flow simulations with the finite volume method (FVM), the most likely candidate for a performance critical section is the flux calculation during each time step.
Both cache utilisation as well as vectorisation depend on data locality. Locality is achieved, when firstly, instructions operating on the same data are executed within a minimal time frame and secondly, data operated on in a given time frame is located within a minimally long memory segment.

For flow simulations with the FVM, the relative memory layout of edge-related and cell-related data is the key to locality. As can be seen in the picture, all horizontal edges lie between cells which are neighbours within the mesh, but not in memory (as indicated by their indices). This is the core challenge for achieving locality in FVM simulations. Different indexing strategies for structured meshes, (possibly dynamic) element reordering for unstructured meshes, and edge traversal strategies are to be investigated in this project as candidates for improving locality and thus performance.

Very early in a flow simulation workflow, the mesh type is decided. This has a significant impact on the memory footprint of the calculation: Uniformly rectangular meshes (type a in the picture) only store two integers (number of cells in either direction) and four floating point numbers (origin and spacing) and thus have constant memory requirements, no matter the mesh size. The same mesh can, however, also be represented as an arbitrary unstructured mesh (type d), but with much higher memory requirements.
This is often the case when dealing with the output of external mesh generators. As an example, the widespread mesh generator gmsh outputs all meshes as type d, even if their topology would allow representation as types a-c.
To ascertain that the optimal mesh representation is used, independent of its source, hms++ can detect structured topology and convert any mesh to the most memory efficient type which can represent it.

During the runtime of a shallow water flow simulation, normally both wet and dry cells occur. Dry cells typically produce a lower computational load, which leads to a load imbalance between processing nodes. Intra-node load balancing can be handled by the operating system, but inter-node load balancing must be handled by the program.

In the picture, two strategies are shown. The upper one prescribes a fixed load factor to wet and dry cells, and then arranges node boundaries to achieve a similar load for each node. This heuristic method requires good calibration of the load factor, which may only be partially portable between machines. However, its high resolution down to cell level makes it a good candidate for a precise and efficient method.

The lower method in the picture takes the actual load of the system and partitions the mesh accordingly. With this method, only one load factor per node is calculated, which may require more iterations to achieve an optimal inter-node load balance. However, this method does not require calibration and inherently accounts for unknown factors (such as services running in the background, I/O operations for writing out calculation results etc.)

Both are to be implemented and compared, for both structured and unstructured meshes. The latter will require efficient graph partitioning, e.g. with the ParMETIS library.