Rob Ross, ANL
- Standards-based Input/Output (I/O) interfaces are a cornerstone of DOE science codes
- The ROMIO MPI-IO implementation is the most widely deployed implementation of this interface in HPC systems
- Tuning ROMIO for specific platforms is critical for performance of many applications
Many DOE computational science codes take advantage of the standards-based “I/O stack” available on virtually every HPC platform, which includes the MPI-IO interface and libraries such as Parallel netCDF and HDF5. This I/O stack provides features not available in parallel file systems alone, such as the ability to write data in portable formats, to store attributes alongside the data itself, and optimizations that accelerate application I/O. The MPI-IO interface is a critical building block in this stack, and ROMIO is the MPI-IO implementation available on virtually all HPC platforms, typically provided by the vendor.
ROMIO is maintained by Argonne National Laboratory in part under SciDAC SDAV funding, and as part of SDAV activities we work to ensure that this critical component of the I/O stack is not only available but also well tuned on platforms at DOE facilities. One way we ensure the performance of MPI-IO on DOE platforms is by using I/O “proxy applications”: small codes that reproduce the I/O behavior of DOE codes, helping us understand application behavior and tune ROMIO to respond appropriately. The HACC-IO proxy represents the checkpoint operation of the HACC cosmology code, which is being developed under HEP SciDAC funding. This workload writes out 1.5 million particles per process, with each particle comprising 9 variables. Initial results obtained by the HACC team on the Mira Blue Gene/Q were quite poor, delivering unacceptable performance (leftmost bar).
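For a sense of scale, a back-of-the-envelope sketch of this workload's data volume follows. The 4-bytes-per-value constant is an assumption for illustration (the actual HACC particle record mixes value types), so treat the numbers as rough.

```python
# Rough sizing of the HACC-IO checkpoint workload described above.
# ASSUMPTION: 4 bytes per value; the real particle record mixes types.

PARTICLES_PER_PROC = 1_500_000   # from the HACC-IO proxy description
VARS_PER_PARTICLE = 9            # variables stored per particle
BYTES_PER_VALUE = 4              # illustrative assumption

def checkpoint_bytes(nprocs: int) -> int:
    """Total checkpoint size in bytes for an nprocs-process run."""
    return nprocs * PARTICLES_PER_PROC * VARS_PER_PARTICLE * BYTES_PER_VALUE

print(f"per process: {checkpoint_bytes(1) / 1e6:.0f} MB")
print(f"16K processes: {checkpoint_bytes(16384) / 1e9:.0f} GB")
```

Even per process the volume is tens of megabytes, so at Mira scale a single checkpoint is hundreds of gigabytes written concurrently, which is why the write pattern presented to ROMIO matters so much.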
Optimization for HACC involved two categories of changes. First, we adjusted HACC's use of the I/O library to take advantage of MPI-IO features that allow the entire checkpoint operation to be described in a single (collective) call, rather than one call per variable (shown as two steps, second and third bars). This provides ROMIO with more information with which to optimize, without significantly impacting the design of the code (since all the data was going to be written at that point in any case). Second, we adjusted specific tuning parameters in ROMIO to control how many processes directly interact with the underlying GPFS file system, to eliminate locking by these processes, and to fine-tune the rearrangement of data (last three bars). The end result was a 15x performance improvement over the initial code and ROMIO configuration, bringing performance through ROMIO on par with other, proprietary options.
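To make the first change concrete, the sketch below contrasts the request streams ROMIO sees under the two schemes, assuming a variable-major shared-file layout (all of variable 0, then variable 1, and so on) and 4-byte values; the layout and sizes are illustrative, not HACC's actual file format. In MPI terms, the first scheme corresponds to one write call per variable, while the second corresponds to a file view describing all nine segments followed by a single MPI_File_write_all.

```python
# Illustrative sketch of the file regions one rank writes in a shared
# checkpoint file with an assumed variable-major layout. Sizes and the
# layout itself are assumptions for illustration, not HACC's format.

P = 1_500_000   # particles per process
B = 4           # assumed bytes per value
NVARS = 9       # variables per particle

def rank_segments(rank, nprocs):
    """(offset, length) of this rank's slice of each variable."""
    seg = P * B                  # bytes per rank per variable
    var_extent = nprocs * seg    # bytes per variable across all ranks
    return [(v * var_extent + rank * seg, seg) for v in range(NVARS)]

segs = rank_segments(rank=3, nprocs=1024)

# Per-variable scheme: nine separate write calls, so ROMIO sees each
# (offset, length) pair in isolation and cannot coordinate across them.
per_variable_requests = [[s] for s in segs]

# Single-collective scheme: one call whose file view describes all nine
# segments at once, so ROMIO can aggregate and rearrange the whole set.
one_collective_request = segs

print(len(per_variable_requests), "calls vs 1 call covering",
      sum(n for _, n in one_collective_request), "bytes")
```

The data written is identical in both cases; what changes is how much of the access pattern ROMIO can see in a single call, and therefore how aggressively it can merge and rearrange requests across ranks.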
This experience has broader implications: the tuning performed here should be applicable to a range of codes using this “write a single checkpoint file” pattern, and it has pointed out opportunities for additional optimization within the ROMIO implementation that are being explored with IBM to further improve performance.
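Tuning of the kind described above is typically expressed as MPI-IO hints, which ROMIO can also read from a plain-text hints file named by the ROMIO_HINTS environment variable. The fragment below uses standard ROMIO hint keys: cb_nodes sets the number of aggregator processes that touch the file system, romio_cb_write forces collective buffering on writes, and romio_no_indep_rw tells ROMIO that only aggregators will perform I/O. The values shown are illustrative, not the settings used in the Mira experiments.

```
cb_nodes 64
romio_cb_write enable
romio_no_indep_rw true
```

Because hints are advisory, an MPI-IO implementation that does not recognize a key simply ignores it, so hint-based tuning like this does not affect code portability.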