Evaluating FP64 and FP32 Emulation: Performance, Accuracy, and Energy Efficiency

Project Description

This study investigates NVIDIA cuBLAS emulation techniques that deliver high-precision (FP32/FP64) results on low-precision hardware such as Tensor Cores. While these methods can offer significant speedups (up to 13.2x for DGEMM on modern architectures like Blackwell), they introduce potential "hidden" costs. The goal is to characterize when emulation is "safe" and efficient versus when it incurs prohibitive overhead.

Objectives

- Investigate Data-Dependent Performance: Measure how the absolute-value range and exponent distribution of the input matrices affect emulation throughput.
- Quantify Memory Overhead: Evaluate the memory footprint of the intermediate data structures (e.g., matrix slicing and residue storage) required for error-free transformations.
- Profile Latency & Throughput: Benchmark the cublasLtMatmul dispatcher to separate the fixed "analysis" overhead from the execution gain.
- Evaluate Power Efficiency: Compare the energy consumption of emulated operations against the native FP64/FP32 paths.
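To illustrate why the exponent distribution and the slice storage named in the objectives are linked, the following is a minimal sketch of a generic fixed-point slicing split, in the spirit of Ozaki-style schemes. It is not taken from cuBLAS, and the project description does not say which algorithm the library uses. The function names (`split_row`, `rebuild`, `slices_needed`), the per-row shared exponent, and the 7-bit payload per slice (chosen so each signed digit fits in an 8-bit integer) are all assumptions made for illustration.

```python
import math

def slices_needed(row, mantissa_bits=53, bits_per_slice=7):
    """Hypothetical model: slices required to hold every entry of a row
    exactly when all entries share the exponent of the largest one."""
    exps = [math.frexp(x)[1] for x in row if x != 0.0]
    spread = max(exps) - min(exps) if exps else 0
    return math.ceil((mantissa_bits + spread) / bits_per_slice)

def split_row(row, num_slices, bits_per_slice=7):
    """Split each entry into small signed integer digits relative to a
    shared row exponent. Every step is exact in binary floating point."""
    emax = max((math.frexp(x)[1] for x in row if x != 0.0), default=0)
    all_digits = []
    for x in row:
        r = math.ldexp(x, -emax)            # exact scaling, |r| < 1
        digits = []
        for _ in range(num_slices):
            r = math.ldexp(r, bits_per_slice)  # expose the next chunk of bits
            d = math.trunc(r)                  # |d| < 2**bits_per_slice
            digits.append(d)
            r -= d                             # exact remainder
        all_digits.append(digits)
    return emax, all_digits

def rebuild(emax, digits, bits_per_slice=7):
    """Recombine the digits of one entry; bits beyond the slices are lost."""
    total = 0.0
    for i, d in enumerate(digits, start=1):
        total += math.ldexp(float(d), -bits_per_slice * i)
    return math.ldexp(total, emax)

if __name__ == "__main__":
    narrow = [1.0, 1.5, 1.75]                             # same exponent
    wide = [1.0 + 2**-52, 2**-40 * (1.0 + 2**-52)]        # 40-bit spread
    for name, row in (("narrow", narrow), ("wide", wide)):
        k = slices_needed(row)
        emax, digits = split_row(row, num_slices=k)
        exact = [rebuild(emax, d) for d in digits] == row
        # One byte per digit, versus 8 bytes per native FP64 entry.
        print(name, "slices:", k, "bytes/entry:", k, "exact:", exact)
```

Under this simplified model, a row whose entries span a wider exponent range needs more slices to be represented exactly, which raises both the storage for the sliced operands and the number of low-precision products. This is one plausible mechanism behind data-dependent throughput and memory overhead; a real library may instead cap the slice count and accept a bounded error.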

Testbed

B200, GH200, H100

Argonne Joint Laboratory for System Evaluation (JLSE)
