This project proposes application-phase-aware performance-energy tradeoff modeling for Multi-Modal Large Language Models (MLLMs). MLLMs extend LLMs with additional input modalities, such as images, video, and audio, enabling perception-grounded reasoning, visual question answering (VQA), captioning, and scene understanding. Unlike text-only LLMs, MLLMs introduce an additional stage, the visual encoding stage, which transforms multimodal inputs into embeddings consumed by the language model’s prefill and decoding stages. Each stage exhibits fundamentally different computational and memory behavior. While text-only LLMs are typically compute-bound during prefill and memory-bound during decoding, the encoding stage that MLLMs add may be either compute-bound or memory-bound, depending on input resolution, model size, and hardware characteristics.
NVIDIA GH200, NVIDIA H100
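
The stage-dependent bottlenecks described above can be illustrated with a simple roofline-style estimate. The sketch below is not the project's model: it assumes a single dense layer of width d_model in 2-byte precision as a stand-in for each stage, and the device numbers are illustrative placeholders, not datasheet values for the GPUs listed above. The names Device, linear_layer_cost, and classify are hypothetical helpers introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical accelerator description (placeholder numbers only)."""
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        # Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound.
        return self.peak_flops / self.mem_bandwidth


def linear_layer_cost(n_tokens: int, d_model: int, bytes_per_elem: int = 2):
    """Rough FLOPs and bytes moved for one d_model x d_model dense layer.

    Assumption: weights are read once, activations are read once and
    written once; caches and kernel fusion are ignored.
    """
    flops = 2 * n_tokens * d_model * d_model
    weight_bytes = bytes_per_elem * d_model * d_model
    activation_bytes = 2 * bytes_per_elem * n_tokens * d_model
    return flops, weight_bytes + activation_bytes


def classify(flops: float, bytes_moved: float, device: Device) -> str:
    """Label a workload by comparing its intensity with the ridge point."""
    intensity = flops / bytes_moved
    if intensity >= device.ridge_point:
        return "compute-bound"
    return "memory-bound"


if __name__ == "__main__":
    # Placeholder device: 1 PFLOP/s and 3 TB/s are assumed round numbers.
    dev = Device(peak_flops=1e15, mem_bandwidth=3e12)
    # Token counts per stage are assumptions chosen for illustration.
    stages = {
        "encode (64 patches)": 64,
        "encode (4096 patches)": 4096,
        "prefill (2048 tokens)": 2048,
        "decode (1 token)": 1,
    }
    for name, n_tokens in stages.items():
        flops, moved = linear_layer_cost(n_tokens, d_model=4096)
        print(f"{name}: {flops / moved:.1f} FLOP/byte, "
              f"{classify(flops, moved, dev)}")
```

Under these assumptions, the same layer moves from memory-bound to compute-bound as the number of tokens or image patches processed per weight read grows, which is why the encoding stage can fall on either side of the ridge point as input resolution changes. A performance-energy model would additionally need measured power per stage, which this sketch does not attempt.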