Meta’s Llama 3.1 405B large language model (LLM) is achieving new levels of performance thanks to NVIDIA’s TensorRT Model Optimizer, according to the NVIDIA Technical Blog. The enhancements have resulted in up to a 1.44x increase in throughput when running on NVIDIA H200 GPUs.
Outstanding Llama 3.1 405B Inference Throughput with TensorRT-LLM
TensorRT-LLM has delivered high inference throughput for Llama 3.1 405B since the model’s release. This was achieved through optimizations including in-flight batching, KV caching, and optimized attention kernels, which accelerate inference while computing in lower precision.
TensorRT-LLM added support for the official Llama FP8 quantization recipe, which calculates static and dynamic scaling factors to preserve maximum accuracy. Additionally, user-defined kernels such as matrix multiplications from FBGEMM are optimized via plug-ins inserted into the network graph at compile time.
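As a rough illustration of the difference between static and dynamic scaling factors, the following Python sketch computes a scale once from calibration data (static) and again from the live tensor (dynamic). It is not the Llama recipe or the TensorRT-LLM implementation: the function names and sample values are hypothetical, per-tensor granularity is an assumption, and the code only scales and clamps values to the FP8 E4M3 range (largest finite value 448) without rounding them to the FP8 grid.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def scale_from_amax(amax):
    """Map an observed absolute maximum onto the FP8 E4M3 range."""
    # Guard against an all-zero tensor, which would otherwise give a zero scale.
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def static_scale(calibration_batches):
    """Static scaling: one scale computed offline from calibration data."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    return scale_from_amax(amax)


def dynamic_scale(values):
    """Dynamic scaling: a fresh scale computed from the tensor at run time."""
    return scale_from_amax(max(abs(v) for v in values))


def scale_and_clamp(values, scale):
    """Divide by the scale and clamp to the FP8 range (no FP8 rounding)."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


if __name__ == "__main__":
    calibration = [[0.5, -2.0], [1.5, 4.48]]  # hypothetical calibration values
    live = [0.25, -8.96]                      # hypothetical run-time values
    print("static :", scale_and_clamp(live, static_scale(calibration)))
    print("dynamic:", scale_and_clamp(live, dynamic_scale(live)))
```

In this sketch the static scale saturates a run-time value larger than anything seen during calibration. The dynamic scale adapts to it, at the cost of computing a maximum on every call.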
Boosting Performance Up to 1.44x with TensorRT Model Optimizer
NVIDIA’s custom FP8 post-training quantization (PTQ) recipe, available through the TensorRT Model Optimizer library, enhances Llama 3.1 405B throughput and reduces latency without sacrificing accuracy. This recipe incorporates FP8 KV cache quantization and self-attention static quantization, reducing inference compute overhead.
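One concrete effect of an FP8 KV cache is memory: each cached key and value element takes one byte instead of the two bytes needed for FP16. The sketch below estimates the per-sequence cache size under assumptions that do not come from this article: the layer count, KV head count, and head dimension are taken from Meta's published Llama 3.1 405B configuration, and the storage for scaling factors and any paging overhead are ignored.

```python
# Assumed Llama 3.1 405B attention shape (from Meta's published configuration).
NUM_LAYERS = 126
NUM_KV_HEADS = 8   # grouped-query attention
HEAD_DIM = 128


def kv_cache_bytes(num_tokens, bytes_per_element):
    """Bytes needed to cache keys and values for num_tokens of one sequence."""
    elements_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM  # 2 = key + value
    return num_tokens * elements_per_token * bytes_per_element


if __name__ == "__main__":
    tokens = 120_000 + 2_048  # longest input plus output length in the benchmarks
    for name, width in (("FP16", 2), ("FP8", 1)):
        print(f"{name}: {kv_cache_bytes(tokens, width) / 1e9:.1f} GB per sequence")
```

Halving the cache per sequence leaves more GPU memory for additional concurrent sequences, which is one plausible reason the throughput gain is largest at long sequence lengths.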
Table 1 demonstrates the maximum throughput performance, showing significant improvements across various input and output sequence lengths on an 8-GPU HGX H200 system. The system features eight NVIDIA H200 Tensor Core GPUs with 141 GB of HBM3e memory each and four NVLink Switches, providing 900 GB/s of GPU-to-GPU bandwidth.
Table 1. Maximum Throughput Performance – Output Tokens/Second, 8 NVIDIA H200 Tensor Core GPUs

| Input / Output Sequence Lengths | 2,048 / 128 | 32,768 / 2,048 | 120,000 / 2,048 |
|---|---|---|---|
| TensorRT Model Optimizer FP8 | 463.1 | 320.1 | 71.5 |
| Official Llama FP8 Recipe | 399.9 | 230.8 | 49.6 |
| Speedup | 1.16x | 1.39x | 1.44x |
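The speedup row is the ratio of the two throughput rows. A short Python check using the Table 1 figures (the dictionary and function names are illustrative only):

```python
# Output tokens/second from Table 1, keyed by (input length, output length).
MODEL_OPTIMIZER_FP8 = {(2048, 128): 463.1, (32768, 2048): 320.1, (120000, 2048): 71.5}
OFFICIAL_LLAMA_FP8 = {(2048, 128): 399.9, (32768, 2048): 230.8, (120000, 2048): 49.6}


def speedups(optimized, baseline):
    """Return optimized/baseline throughput per sequence-length pair, rounded to 2 places."""
    return {key: round(optimized[key] / baseline[key], 2) for key in baseline}


if __name__ == "__main__":
    for (inp, out), ratio in speedups(MODEL_OPTIMIZER_FP8, OFFICIAL_LLAMA_FP8).items():
        print(f"{inp:>7,} / {out:,}: {ratio:.2f}x")
```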
Similarly, Table 2 presents the minimum latency performance using the same input and output sequence lengths.
Table 2. Batch Size = 1 Performance – Output Tokens/Second, 8 NVIDIA H200 Tensor Core GPUs

| Input / Output Sequence Lengths | 2,048 / 128 | 32,768 / 2,048 | 120,000 / 2,048 |
|---|---|---|---|
| TensorRT Model Optimizer FP8 | 49.6 | 44.2 | 27.2 |
| Official Llama FP8 Recipe | 37.4 | 33.1 | 22.8 |
| Speedup | 1.33x | 1.33x | 1.19x |
These results indicate that, on H200 GPUs, TensorRT-LLM with the TensorRT Model Optimizer FP8 recipe outperforms the official Llama FP8 recipe in both latency-optimized and throughput-optimized scenarios. The TensorRT Model Optimizer FP8 recipe also achieved accuracy comparable to the official Llama 3.1 FP8 recipe on the Massive Multitask Language Understanding (MMLU) and MT-Bench benchmarks.
Fitting Llama 3.1 405B on Just Two H200 GPUs with INT4 AWQ
For developers with hardware resource constraints, the INT4 AWQ technique in TensorRT Model Optimizer compresses the model, allowing Llama 3.1 405B to fit on just two H200 GPUs. This method reduces the required memory footprint significantly by compressing the weights down to 4-bit integers while encoding activations using FP16.
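A back-of-the-envelope calculation shows why 4-bit weights make a two-GPU deployment possible. The sketch below counts weight storage only, assuming exactly 405 billion parameters and 141 GB per H200. It ignores the AWQ scaling factors, the KV cache, and activation memory, so the GPU counts are lower bounds rather than deployment recommendations.

```python
import math

PARAMS = 405e9         # parameter count implied by the model name
H200_MEMORY_GB = 141   # HBM3e capacity per H200 GPU


def weight_gb(bits_per_weight):
    """Weight storage in GB (10^9 bytes) at the given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9


def min_gpus_for_weights(bits_per_weight):
    """Smallest number of H200 GPUs whose combined memory holds the weights."""
    return math.ceil(weight_gb(bits_per_weight) / H200_MEMORY_GB)


if __name__ == "__main__":
    for name, bits in (("FP16", 16), ("FP8", 8), ("INT4", 4)):
        print(f"{name}: {weight_gb(bits):.1f} GB of weights, "
              f"at least {min_gpus_for_weights(bits)} GPU(s)")
```

At 4 bits per weight, the weights occupy about 202.5 GB, which fits within the 282 GB of two H200 GPUs. At 8 bits they need about 405 GB, which does not.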
Tables 4 and 5 show the maximum throughput and minimum latency performance measurements. The INT4 AWQ method also provides accuracy scores comparable to the official Llama 3.1 FP8 recipe from Meta.
Table 4. Maximum Throughput Performance – Output Tokens/Second, 2 NVIDIA H200 Tensor Core GPUs

| Input / Output Sequence Lengths | 2,048 / 128 | 32,768 / 2,048 | 60,000 / 2,048 |
|---|---|---|---|
| TensorRT Model Optimizer INT4 AWQ | 75.6 | 28.7 | 16.2 |

Table 5. Batch Size = 1 Performance – Output Tokens/Second, 2 NVIDIA H200 Tensor Core GPUs

| Input / Output Sequence Lengths | 2,048 / 128 | 32,768 / 2,048 | 60,000 / 2,048 |
|---|---|---|---|
| TensorRT Model Optimizer INT4 AWQ | 21.6 | 18.7 | 12.8 |
NVIDIA’s advancements in TensorRT Model Optimizer and TensorRT-LLM are paving the way for enhanced performance and efficiency in running large language models like Llama 3.1 405B. These improvements offer developers more flexibility and cost-efficiency, whether they have extensive hardware resources or more constrained environments.