Reduce Start Latency in LLM Inference with NVIDIA Run:ai Model Streamer


Deploying large language models (LLMs) poses the challenge of optimizing inference efficiency. In particular, cold start delays, the time a model needs to load into GPU memory before it can serve requests, hurt both the end-user experience and operational efficiency. These models often require tens to hundreds of gigabytes of memory, so efficient model loading becomes essential when scaling to meet unpredictable demand in increasingly complex production environments.
This article presents the NVIDIA Run:ai Model Streamer, an open-source Python SDK designed to mitigate these issues by simultaneously reading model weights from storage and streaming them directly to GPU memory. Comparative tests were conducted with vLLM's default loader, Hugging Face (HF) Safetensors Loader, and CoreWeave Tensorizer, using local SSDs and Amazon S3.
The experiments described show that the NVIDIA Run:ai Model Streamer significantly reduces model loading times, decreasing cold start latency even in cloud environments. Moreover, it is compatible with the Safetensors format, avoiding the need for weight conversion. The findings underscore the importance of storage choice and concurrent streaming for efficient LLM deployment. Specifically, to improve inference performance, it is recommended to use the NVIDIA Run:ai Model Streamer to reduce cold start latency, saturate storage bandwidth, and accelerate time to inference.
How is a model loaded onto a GPU for inference?
To provide context, this section explains the two main steps involved in loading a machine learning model into GPU memory for inference: reading weights from storage into CPU memory and transferring them to the GPU. Understanding this process is key to optimizing inference latency, especially in large-scale or cloud deployments.
Reading weights from storage to CPU memory
Model weights are loaded from storage into CPU memory. These can be in various formats, such as .pt, .h5, and .safetensors, or in custom formats; the storage can be local, clustered, or in the cloud. For the purposes of this article, the .safetensors format is used due to its wide adoption. However, other formats can be employed in different contexts.
Transferring the model to the GPU
The model parameters and relevant tensors are transferred to GPU memory.
Loading models from cloud storage, like Amazon S3, often involves an additional step: first, the weights are downloaded to local disk before being moved to CPU memory and then to GPU memory. Traditionally, these steps are performed sequentially, making model loading one of the biggest bottlenecks when scaling inference.
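For reference, the sketch below shows this conventional, sequential path using the safetensors and PyTorch libraries. It assumes a local .safetensors checkpoint; the file name is illustrative.

```python
# Conventional two-step load: storage -> CPU memory -> GPU memory, performed sequentially.
import torch
from safetensors.torch import load_file

checkpoint_path = "model-00001-of-00002.safetensors"  # illustrative shard name

# Step 1: read all tensors from storage into CPU memory.
cpu_state_dict = load_file(checkpoint_path, device="cpu")

# Step 2: only after the read completes, copy each tensor to GPU memory.
gpu_state_dict = {name: t.to("cuda") for name, t in cpu_state_dict.items()}
```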
How does the Model Streamer work?
The Model Streamer is an SDK with a high-performance C++ backend designed to accelerate loading models onto GPUs from various storage sources (e.g., network file systems, cloud object stores, local disks). It uses multiple threads to read tensors concurrently from one or more files, on a file system or in object storage, into a dedicated buffer in CPU memory. Each tensor has an identifier, which allows simultaneous reading and transferring: while some tensors are being read from storage to the CPU, others are being transferred from the CPU to the GPU.
The tool takes full advantage of the fact that GPU and CPU subsystems are independent. GPUs can access CPU memory directly via PCIe without CPU intervention, allowing for real-time overlap between storage reads and memory transfers. The experiments were conducted on an AWS g5.12xlarge instance with NVIDIA A10G GPUs and 2nd generation AMD EPYC CPUs, providing a balanced architecture for efficient parallel data handling.
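As a rough illustration of this overlap, the following Python sketch uses a background reader thread and a bounded queue. It is a conceptual approximation only, not the Model Streamer's C++ implementation, and the checkpoint path is illustrative.

```python
# Conceptual sketch of the read/transfer overlap: a reader thread fills pinned CPU
# buffers while the main thread copies finished tensors to the GPU, so storage reads
# and PCIe transfers proceed concurrently.
import queue
import threading
import torch
from safetensors import safe_open

def reader(path: str, q: queue.Queue) -> None:
    # Read tensors one by one into pinned CPU memory and hand them off.
    with safe_open(path, framework="pt", device="cpu") as f:
        for name in f.keys():
            q.put((name, f.get_tensor(name).pin_memory()))
    q.put(None)  # sentinel: no more tensors

def load_overlapped(path: str) -> dict:
    q: queue.Queue = queue.Queue(maxsize=8)  # bounded buffer between the two stages
    threading.Thread(target=reader, args=(path, q), daemon=True).start()
    gpu_tensors = {}
    while (item := q.get()) is not None:
        name, cpu_tensor = item
        # Asynchronous host-to-device copy overlaps with the next storage read.
        gpu_tensors[name] = cpu_tensor.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
    return gpu_tensors

gpu_tensors = load_overlapped("model.safetensors")  # illustrative path
```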
Key features of the Model Streamer include:
- Concurrency: Multiple threads read model weight files in parallel, including support for splitting large tensors.
- Balanced workload: Work is distributed based on tensor size to saturate storage bandwidth.
- Support for multiple types of storage: Works with SSDs, remote storage, and cloud object stores like S3.
- No tensor format conversion: Natively supports Safetensors, avoiding conversion overhead.
- Easy integration: Offers a Python API and an iterator similar to Safetensors, but with concurrent background reading. Easily integrates with inference engines like vLLM and TGI.
For more details on setup and usage, please refer to the Model Streamer documentation.
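The snippet below sketches what such a load looks like from Python, following the usage pattern shown in the Model Streamer documentation. The class and method names (SafetensorsStreamer, stream_file, get_tensors) and the file path should be verified against the installed release.

```python
# Streaming a checkpoint with the Model Streamer's Python API.
from runai_model_streamer import SafetensorsStreamer

file_path = "/models/llama-8b/model.safetensors"  # illustrative path

gpu_tensors = {}
with SafetensorsStreamer() as streamer:
    streamer.stream_file(file_path)            # background threads start reading concurrently
    for name, cpu_tensor in streamer.get_tensors():
        # Tensors are yielded as soon as they are ready, so the copy to GPU
        # overlaps with the remaining storage reads.
        gpu_tensors[name] = cpu_tensor.to("cuda")
```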
Operation of the HF Safetensors Loader
The HF Safetensors Loader is an open-source utility that provides a safe and fast format for saving and loading multiple tensors. It uses memory mapping to minimize data copying: on the CPU, tensors are mapped directly from the file into memory; on the GPU, it creates an empty tensor with PyTorch and then copies the tensor data in with cudaMemcpy, keeping extra copies to a minimum.
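For comparison, this is a minimal example of that loader's typical usage through the safetensors library; the file name is illustrative.

```python
# Memory-mapped Safetensors load: tensors are materialized directly on the requested device.
from safetensors import safe_open

path = "model.safetensors"  # illustrative
with safe_open(path, framework="pt", device="cuda:0") as f:
    tensors = {name: f.get_tensor(name) for name in f.keys()}
```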
Operation of the CoreWeave Tensorizer
The CoreWeave Tensorizer is an open-source tool that serializes model weights into a single file. Instead of loading the entire model into RAM before moving it to the GPU, Tensorizer streams the model data tensor by tensor from an HTTP/HTTPS or S3 source.
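The following sketch approximates Tensorizer usage based on CoreWeave's documentation. The S3 URI and model ID are hypothetical, and argument names should be checked against the installed tensorizer version.

```python
# Approximate Tensorizer deserialization: weights stream tensor by tensor from
# object storage into an already-constructed model.
import torch
from transformers import AutoConfig, AutoModelForCausalLM
from tensorizer import TensorDeserializer

# Build the module skeleton; the streamed weights below overwrite these initial values.
# (The tensorizer docs also describe constructing the model without materializing
# weights to avoid this redundant allocation.)
config = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B")  # hypothetical model id
model = AutoModelForCausalLM.from_config(config)

deserializer = TensorDeserializer("s3://my-bucket/llama-8b.tensors", device="cuda")
deserializer.load_into_module(model)  # weights stream in tensor by tensor
deserializer.close()
```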
Performance Comparison of Model Loaders Across Three Types of Storage
The performance of different model loaders (NVIDIA Run:ai Model Streamer, CoreWeave Tensorizer, and HF Safetensors Loader) was compared across three types of storage:
Experiment #1: GP3 SSD
Model loading times were measured for each loader. For the Model Streamer, the impact of concurrency was evaluated; for the Tensorizer, the effect of the number of worker threads was examined.
Experiment #2: IO2 SSD
The same loaders were tested on IO2 SSD to assess the impact of higher IOPS and bandwidth.
Experiment #3: Amazon S3
Loaders were compared on cloud storage; the Safetensors Loader was excluded as it does not support S3.
Experiment #4: vLLM with Different Loaders
The Model Streamer was integrated into vLLM to measure total loading and readiness times, compared with vLLM's default HF Safetensors-based loader and with Tensorizer.
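As an illustration, the snippet below selects the loader from Python, assuming a vLLM build that includes the Run:ai Model Streamer integration. The model ID is illustrative, and the load_format value should be checked against the vLLM version in use.

```python
# Choosing the weight loader in vLLM at engine construction time.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B",   # illustrative model id
    load_format="runai_streamer",       # stream weights concurrently during model load
)
```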
All tests were conducted under cold start conditions to avoid cache effects. During these experiments, unexpected cache behaviors emerged on AWS S3. When conducting experiments in rapid succession, model loading times improved significantly, likely due to a caching mechanism in S3. To ensure consistency and avoid benefiting from this "hot cache," a minimum waiting period of three minutes was introduced between each test execution.
Results and Conclusions
The tests show that the NVIDIA Run:ai Model Streamer significantly accelerates model loading times in both local and remote storage, outperforming other common loaders. By enabling concurrent weight loading and streaming to GPU memory, it provides a practical and high-impact solution for production-scale inference workloads.
For those building or scaling inference systems, especially with large models or cloud-based storage, these results provide immediate takeaways: use the Model Streamer to reduce cold start latency, saturate storage bandwidth, and shorten time to inference. With easy integration into frameworks like vLLM and support for high-concurrency environments and multiple storage types, it represents a simple optimization that can yield measurable improvements.
Boost your model loading performance with the NVIDIA Run:ai Model Streamer. For more information, continue exploring related posts on this blog.