PI: William E. Allcock, LCF

Objective: Change the way we build compute clusters by treating RAM as a schedulable resource, similar to how disk is centrally managed in a SAN.  As with a SAN, the idea is that a cluster can be provisioned with less aggregate RAM, provided the RAM can be placed where it is needed, when it is needed.  The local on-node RAM can be thought of as an “L4 cache”, while an application can access any amount of RAM up to the total available in the external pool.  This ameliorates the problem of some nodes having significant amounts of unused RAM while jobs on other nodes fail because of insufficient RAM.  Sharing node RAM is not a new idea, but in this case we are not stealing RAM (or performance) from a compute node; instead, dedicated RAM servers sit on the InfiniBand (IB) fabric.
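
The provisioning argument can be illustrated with a toy model. The sketch below is not the Kove software or any real scheduler; every name and number in it is a hypothetical chosen for illustration. It assumes one job per node, all jobs running at once, and compares a cluster with fixed per-node RAM against one with smaller local RAM plus a shared external pool.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    ram_gb: int  # peak memory demand (hypothetical figure)


def schedule_fixed(jobs, node_ram_gb):
    """One job per node; a job fails if it needs more than its node's RAM."""
    ok = [j.name for j in jobs if j.ram_gb <= node_ram_gb]
    failed = [j.name for j in jobs if j.ram_gb > node_ram_gb]
    return ok, failed


def schedule_pooled(jobs, local_ram_gb, pool_gb):
    """One job per node; demand above local RAM is drawn from a shared pool."""
    remaining = pool_gb
    ok, failed = [], []
    for j in jobs:
        overflow = max(0, j.ram_gb - local_ram_gb)  # part that spills to the pool
        if overflow <= remaining:
            remaining -= overflow
            ok.append(j.name)
        else:
            failed.append(j.name)
    return ok, failed, remaining


# Hypothetical workload: three small jobs and one large one.
jobs = [Job("a", 16), Job("b", 32), Job("c", 48), Job("d", 300)]

# Conventional cluster: 4 nodes x 128 GB = 512 GB aggregate.
fixed_ok, fixed_failed = schedule_fixed(jobs, node_ram_gb=128)

# Pooled cluster: 4 nodes x 32 GB local + 320 GB pool = 448 GB aggregate.
pooled_ok, pooled_failed, pool_left = schedule_pooled(
    jobs, local_ram_gb=32, pool_gb=320
)

print("fixed :", fixed_ok, "failed:", fixed_failed)
print("pooled:", pooled_ok, "failed:", pooled_failed, "pool left (GB):", pool_left)
```

In this made-up workload the conventional cluster cannot run job "d" even though most of its 512 GB sits idle, while the pooled cluster runs all four jobs with 448 GB in total. The model ignores the latency difference between local and remote RAM, which is what the testbed comparisons are meant to measure.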

Testbed: In JLSE, the host neurosis is dedicated to this project, and there is currently one QDR XPD on the JLSE IB fabric, with a second expected to be added in the future.  The host has 1 TB of RAM, enabling “apples-to-apples” comparisons between on-host and remote RAM at sizes up to nearly 1 TB.

Although it is not part of JLSE, cooley also has the Kove software loaded, and we have 3 FDR XPD units with an aggregate of 5.5 TB of external RAM and 14 FDR IB links.

© 2020 Joint Laboratory for System Evaluation