PIs: Adam Hammouda (NE), Andrew Siegel (MCS), and Pete Beckman (CLS)
Objective: To study operating system design decisions for next-generation HPC architectures. In particular, we aim to study the impact those decisions have on application performance characteristics. Further, we aim to understand how these decisions and performance characteristics relate to the thermal stability and energy efficiency of the architecture in question. Broad questions of interest to our research include the following:

  • Is there a layer of the HPC runtime environment best equipped to mitigate thermal load imbalances (i.e., the OS, the runtime system, or the application)?
  • If it is better to share these responsibilities between layers, what is the optimal division of labor?
  • What are the best metrics for judging optimality in the two questions above?
  • With such metrics, what are acceptable tradeoffs between performance and energy?
  • What are generalizable strategies for the application layer to handle thermal load imbalances? What is a generalizable strategy for the application layer to receive and share information with the runtime system?

In earlier work, we reformulated a class of bulk-synchronous algorithms to account for nonuniformities in process execution rates. The energy and thermal efficiency challenges posed by next-generation machines are precisely what motivated those algorithmic developments. The COOLR project presents an opportunity to pursue this line of inquiry further by exploring the precise characteristics of these anticipated nonuniformities.
Testbed: We have already begun exploring the Intel Xeon E5-2670 machine available to us. We are studying the effects of per-socket controls over CPU frequency and P-state settings, and we have developed metrics to quantify the tradeoffs between performance and thermal load balance. However, our algorithmic work in Hammouda and Siegel (2013) was designed to exploit per-process nonuniformities in execution rates, and we could therefore benefit greatly from a machine that provides access to these controls at a correspondingly finer granularity.
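
To make the idea of quantifying such tradeoffs concrete, the following is a minimal sketch of two commonly used quantities: the spread between the hottest socket and the mean socket temperature, and the energy-delay product. These are illustrative stand-ins chosen for this example, not the metrics developed in this project; the function names and the sample readings are hypothetical.

```python
from statistics import mean


def thermal_spread(temps_celsius):
    """Return max minus mean temperature (degrees C); 0.0 means a perfectly balanced thermal load.

    Assumption: one reading per socket. This is an illustrative metric,
    not the project's own definition of thermal load balance.
    """
    temps = list(temps_celsius)
    if not temps:
        raise ValueError("need at least one temperature reading")
    return max(temps) - mean(temps)


def energy_delay_product(avg_power_watts, runtime_seconds):
    """Return energy (J) multiplied by runtime (s); lower is better.

    Penalizes configurations that save energy only by running much longer.
    """
    if avg_power_watts < 0 or runtime_seconds < 0:
        raise ValueError("power and runtime must be non-negative")
    energy_joules = avg_power_watts * runtime_seconds
    return energy_joules * runtime_seconds


if __name__ == "__main__":
    # Hypothetical readings for two per-socket frequency settings (not measured data).
    configs = {
        "both sockets at nominal frequency": ([60.0, 70.0], 100.0, 2.0),
        "hot socket at reduced frequency": ([62.0, 64.0], 90.0, 2.5),
    }
    for name, (temps, power, runtime) in configs.items():
        print(name, thermal_spread(temps), energy_delay_product(power, runtime))
```

A comparison of this kind shows why a single number is rarely sufficient: a setting can narrow the thermal spread while raising the energy-delay product, which is the sort of tradeoff the questions above ask about.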

© 2020 Joint Laboratory for System Evaluation