IBM Research has announced significant advancements in the PyTorch framework, aiming to make AI model training more efficient. These improvements were presented at the PyTorch Conference, highlighting a new data loader capable of handling massive datasets and substantial gains in large language model (LLM) training throughput.
Enhancements to PyTorch’s Data Loader
The new high-throughput data loader allows PyTorch users to distribute LLM training workloads seamlessly across multiple machines. This innovation enables developers to save checkpoints more efficiently, reducing duplicated work. According to IBM Research, this tool was developed out of necessity by Davis Wertheimer and his colleagues, who needed a solution to manage and stream vast quantities of data across multiple devices efficiently.
Initially, the team faced challenges with existing data loaders, which caused bottlenecks in training processes. By iterating and refining their approach, they created a PyTorch-native data loader that supports dynamic and adaptable operations. This tool ensures that previously seen data isn’t revisited, even if the resource allocation changes mid-job.
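The core idea behind such a loader can be sketched in plain Python. The class below (`CheckpointableStream` is an invented name for illustration; IBM's actual loader is PyTorch-native and far more capable, handling distributed streaming, shuffling, and mid-job rescaling) shows how tracking a per-worker position makes iteration checkpointable, so a restarted job resumes exactly where it left off without revisiting data:

```python
class CheckpointableStream:
    """Toy stateful data stream: each worker strides across the dataset
    and remembers how many samples it has consumed, so a restart never
    revisits data. Illustrative only -- not IBM's implementation."""

    def __init__(self, dataset, rank=0, world_size=1):
        self.dataset = dataset          # any indexable data source
        self.rank = rank                # this worker's id
        self.world_size = world_size    # total number of workers
        self.position = 0               # samples this worker has consumed

    def __iter__(self):
        while True:
            # workers interleave by striding, so they never overlap
            idx = self.rank + self.position * self.world_size
            if idx >= len(self.dataset):
                return
            self.position += 1          # advance state before yielding
            yield self.dataset[idx]

    def state_dict(self):
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]


data = list(range(10))
worker = CheckpointableStream(data, rank=0, world_size=2)
it = iter(worker)
seen = [next(it) for _ in range(3)]      # consumes 0, 2, 4
ckpt = worker.state_dict()               # {"position": 3}

resumed = CheckpointableStream(data, rank=0, world_size=2)
resumed.load_state_dict(ckpt)
rest = list(resumed)                     # picks up at 6, 8 -- no repeats
```

Because the saved state is just a position counter rather than a buffer of data, checkpoints stay tiny and resumption is deterministic.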
In stress tests, the data loader streamed 2 trillion tokens over a month of continuous operation without a single failure. It loaded over 90,000 tokens per second per worker, which translates to roughly half a trillion tokens per day on 64 GPUs.
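The per-day figure follows directly from the per-worker rate (a quick arithmetic check, assuming one data-loader worker per GPU):

```python
tokens_per_sec_per_worker = 90_000
gpus = 64                         # one loader worker per GPU assumed
seconds_per_day = 24 * 60 * 60    # 86,400

tokens_per_day = tokens_per_sec_per_worker * gpus * seconds_per_day
print(f"{tokens_per_day:,}")      # 497,664,000,000 -- about half a trillion
```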
Maximizing Training Throughput
Another significant focus for IBM Research is optimizing GPU usage to prevent bottlenecks in AI model training. The team has employed fully sharded data parallel (FSDP) training, which shards a model's parameters, gradients, and optimizer states evenly across multiple machines, improving the efficiency and speed of model training and tuning. Using FSDP in conjunction with torch.compile has led to substantial gains in throughput.
IBM Research scientist Linsong Chu highlighted that their team was among the first to train a model using torch.compile and FSDP, achieving a training rate of 4,550 tokens per second per GPU on A100 GPUs. This breakthrough was demonstrated with the Granite 7B model, recently released on Red Hat Enterprise Linux AI (RHEL AI).
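The sharding idea behind FSDP can be illustrated in a few lines of plain Python (purely a mental model, with no distributed runtime: real FSDP shards parameters, gradients, and optimizer states across GPUs, gathers full parameters just in time for compute, and overlaps communication with computation):

```python
# Toy illustration of the FSDP idea: each "worker" stores only its shard
# of the model parameters and reconstructs the full set on demand.

def shard(params, world_size):
    """Split a flat parameter list into one contiguous shard per worker."""
    per = -(-len(params) // world_size)  # ceiling division
    return [params[i * per:(i + 1) * per] for i in range(world_size)]

def all_gather(shards):
    """Every worker reconstructs the full parameter list from all shards."""
    return [p for s in shards for p in s]

params = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # pretend model weights
shards = shard(params, world_size=4)      # each worker holds at most 2
full = all_gather(shards)                 # materialized only for compute

print(shards)          # [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], []]
print(full == params)  # True
```

The payoff is memory: each worker's steady-state footprint is roughly 1/N of the full model state, which is what lets models be trained that would not fit on any single GPU.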
Further optimizations are being explored, including the FP8 (8-bit floating point) datatype supported by Nvidia H100 GPUs, which has shown up to 50% gains in throughput. IBM Research scientist Raghu Ganti emphasized the significant impact of these improvements on infrastructure cost reduction.
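For intuition about what FP8 trades for speed, here is a rough pure-Python model of rounding to the E4M3 variant (4 exponent bits, 3 mantissa bits, maximum finite value 448), the FP8 format commonly used for forward-pass tensors on H100s. In practice the cast happens in hardware; this sketch is not any library's implementation:

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value (bias 7, 3 mantissa bits,
    max finite 448). Illustrative model of the precision loss only."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    m, e = math.frexp(mag)            # mag = m * 2**e, with 0.5 <= m < 1
    exp = e - 1                       # rewrite as 1.xxx * 2**exp
    if exp < -6:                      # subnormal range: fixed 2**-9 steps
        step = 2.0 ** -9
        return sign * round(mag / step) * step
    if exp > 8:                       # beyond the largest binade: clamp
        return sign * 448.0
    frac = mag / (2.0 ** exp)         # in [1, 2)
    q = round(frac * 8) / 8           # 3 mantissa bits -> steps of 1/8
    return sign * min(q * (2.0 ** exp), 448.0)

print(quantize_e4m3(0.3))     # 0.3125 -- about 4% relative error
print(quantize_e4m3(1000.0))  # 448.0  -- overflow clamps to max finite
```

With only eight mantissa steps per binade, relative error can reach a few percent, which is why FP8 training pairs the low-precision compute with higher-precision accumulation and careful scaling.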
Future Prospects
IBM Research continues to explore new frontiers, including the use of FP8 for model training and tuning on IBM's artificial intelligence unit (AIU). The team is also focusing on Triton, OpenAI's open-source language and compiler for GPU programming, which aims to further optimize training by compiling Python code directly into the hardware's native instructions.
These advancements collectively aim to move faster cloud-based model training out of the experimental stage and into broader community use, potentially transforming the landscape of AI model training.
Image source: Shutterstock