PI: Yanjing Li, UofC

ANL Contact: Prasanna Balaprakash, MCS

Description: Cross-layer resilience, in which resilience techniques at different layers of the system design stack (circuit, logic, architecture, and software) cooperate to achieve optimized system tradeoffs, is essential for overcoming the so-called “resilience wall” in computing systems. An important aspect of cross-layer resilience is application-level error analysis, which enables cross-layer resilience strategies to be optimized for different program regions. However, such analysis is challenging. Previous work either (1) relies on error injection techniques, which are extremely time-consuming, or (2) uses ad hoc heuristics, which cannot ensure accuracy.

We propose a novel, scalable, and extensible machine learning (ML) approach to analyze error behaviors at the application level. Our approach uses a carefully crafted set of dynamic and static features of any given application (derived from our domain-specific expertise and with assistance from modern ML techniques), then quickly and accurately predicts, for each instruction, the probabilities that an error in the instruction's destination results in (1) a visible abnormal symptom (e.g., a segfault), (2) silent data corruption, or (3) correct program behavior.
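To illustrate the shape of the prediction described above, the following is a minimal sketch of a classifier that maps one instruction's feature vector to probabilities over the three outcomes. It is not the project's model: the softmax form, the feature names (is_address_computation, is_loop_counter, fan_out), and the weight values are all hypothetical placeholders chosen for illustration.

```python
import math

# The three outcome classes named in the description.
OUTCOMES = ("symptom", "sdc", "correct")


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_outcome_probs(features, weights, biases):
    """Return {outcome: probability} for a single instruction.

    features: list of floats (hypothetical per-instruction features)
    weights:  one weight list per outcome, same length as features
    biases:   one bias per outcome
    """
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return dict(zip(OUTCOMES, softmax(scores)))


# Hypothetical features: [is_address_computation, is_loop_counter, fan_out]
features = [1.0, 0.0, 3.0]

# Placeholder weights, NOT trained values (assumption for illustration).
weights = [
    [2.0, 0.5, 0.1],     # symptom
    [-1.0, 0.3, 0.4],    # sdc
    [-0.5, -0.2, -0.3],  # correct
]
biases = [0.0, 0.0, 0.0]

probs = predict_outcome_probs(features, weights, biases)
for outcome, p in probs.items():
    print(f"{outcome}: {p:.3f}")
```

The point of the sketch is the output contract: each instruction receives three probabilities that sum to one, which a cross-layer strategy could then use to decide how much protection a program region needs.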

To evaluate the effectiveness of our technique, we plan to train and cross-validate our model using millions of data samples from a wide variety of real-world applications, including HPC applications. From our study, we also expect the ML model to reveal several key insights (e.g., about feature importance and correlation) that are essential to a fundamental understanding of error behaviors.
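As a rough illustration of the cross-validation step, the sketch below splits sample indices into k folds so that every sample is held out for testing exactly once. The evaluation protocol is not specified here, so the function name k_fold_indices, the fold count, and the random shuffling are assumptions for illustration only.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Toy sizes (assumption); the real data set would be far larger.
splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), "train /", len(test_idx), "test")
```

With samples drawn from many applications, one might instead hold out whole applications per fold so that the model is tested on programs it has never seen; that variant is a design choice not stated in the description.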

Testbed: Computing resources/nodes for running machine learning applications written in Python and many C++/C programs in parallel.