What is serverless inference?
Serverless inference is a cloud computing model that allows you to deploy and serve machine learning models without managing the underlying infrastructure. Notable characteristics of a serverless model include:
No server management required
Automatic scaling to handle varying loads
Pay-per-use pricing model
Low operational overhead
Why use serverless inference?
Serverless inference offers several advantages, particularly for deploying and managing expensive transformer-based models. Here’s why it’s beneficial:
Cost-efficiency: Serverless inference eliminates idle GPU time costs. You only pay for the compute resources used during actual inference, making it ideal for models with variable or “bursty” traffic patterns.
Scalability: It automatically scales to handle varying loads, from sporadic requests to sudden traffic spikes, without manual intervention.
Reduced operational overhead: There’s no need to manage servers or worry about capacity planning. The cloud provider handles infrastructure management, allowing you to focus on model development and optimization.
Flexibility: Serverless inference adapts to your needs, whether you’re serving a single model or multiple models with different resource requirements.
While serverless inference may appear more expensive on a “per-minute” basis compared to traditional server-based deployments, it eliminates the need to provision for maximum capacity scenarios. This can lead to significant cost savings, especially for workloads with variable demand.
It’s worth noting that even if you anticipate running GPUs around the clock, actual utilization rarely matches this expectation. Serverless inference helps optimize resource usage and costs in these scenarios.
What are serverless GPUs used for?
You can think of serverless GPUs as AWS Lambda with GPU support: run your function, get GPU acceleration in the cloud, and avoid paying for idle time. Serverless GPUs are especially useful for these use cases:
AI Inference: serving custom models as autoscaling HTTP endpoints, including LLMs, diffusion models, and other deep learning models.
Fine-Tuning AI Models: run fine-tuning jobs on custom datasets. Instead of provisioning a training cluster manually, you launch it serverlessly and let it disappear when done.
Image and Video Workloads: running heavy jobs, like 3D rendering, video transcoding, and batch processing.
Scientific Simulations: computational pipelines that require GPU parallelism, like protein folding and bioinformatics analysis.
CI/CD and Testing: spin up ephemeral GPU environments to test GPU-accelerated code as part of a build pipeline.
Best practices for serverless inference
To optimize your serverless inference deployments:
Leverage GPU acceleration: For compute-intensive models, utilize GPU resources effectively:
Choose the appropriate GPU type and memory for your model to ensure efficient resource utilization.
Consult your provider’s documentation on how to specify GPU requirements for your functions.
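To get a feel for what "appropriate memory" means, a rough back-of-the-envelope sizing rule is parameter count times bytes per parameter, plus headroom for activations and any KV cache. The sketch below is purely illustrative and not tied to any provider's API:

```python
# Rough rule of thumb for sizing GPU memory: parameter count x bytes per parameter,
# plus headroom for activations and KV cache. Purely illustrative.
def estimate_weights_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights (fp16/bf16 = 2 bytes per parameter)."""
    return num_params_billion * 1e9 * bytes_per_param / 1024**3

# A 7B-parameter model in fp16 needs ~13 GB for weights alone,
# so a 24 GB-class GPU is a reasonable floor before quantization.
print(f"{estimate_weights_gb(7):.1f} GB")
```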
Minimize cold starts: Cold starts (the time it takes to spin up a new container with your model in it) can significantly impact latency for serverless functions. Consider these techniques:
Maintain a pool of warm instances that are always up and running.
Adjust container idle timeouts to keep containers warm for longer periods, if supported.
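If your platform doesn't expose a built-in warm-pool or keep-warm setting, a common workaround is a lightweight keep-warm pinger that calls the endpoint on a schedule so at least one container stays resident. A minimal sketch, where the URL and interval are placeholders you would tune to your platform's idle timeout:

```python
# A simple keep-warm pinger: periodically hit a lightweight health route so the
# platform keeps at least one container resident. The URL is a placeholder.
import time
import urllib.request

ENDPOINT = "https://example.com/health"  # hypothetical endpoint
PING_INTERVAL_S = 240  # ping slightly more often than the platform's idle timeout

while True:
    try:
        urllib.request.urlopen(ENDPOINT, timeout=10)
    except OSError:
        pass  # a failed ping is not fatal; try again next interval
    time.sleep(PING_INTERVAL_S)
```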
Optimize model loading and initialization:
Utilize lifecycle methods or initialization hooks provided by your serverless platform to load models during container warm-up rather than on first invocation.
Move large file downloads (e.g. model weights) to the build or deployment phase when possible, so that they are downloaded only once.
Take advantage of pre-built images or layers which come with optimized dependencies for common ML frameworks.
Consider model quantization or pruning techniques to reduce the size of the model that needs to be loaded without significantly impacting performance.
Use persistent storage options to cache model weights, reducing load times on subsequent invocations.
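As a concrete illustration of the last two points above, here is a minimal sketch that downloads weights into an assumed persistent cache directory and loads the model once at container warm-up. The model name and cache path are assumptions, and the exact initialization hook depends on your platform:

```python
# Minimal sketch: cache weights on a persistent volume and load the model at
# container start rather than on the first request. Paths and model ID are
# illustrative assumptions.
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

CACHE_DIR = "/cache/models"  # assumed persistent volume mounted by the platform
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # example model

# Runs at import time, i.e. during container warm-up, not per request.
local_path = snapshot_download(MODEL_ID, cache_dir=CACHE_DIR)
tokenizer = AutoTokenizer.from_pretrained(local_path)
model = AutoModelForCausalLM.from_pretrained(local_path, device_map="auto")

def handler(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```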
Implement efficient batching:
Utilize batching mechanisms provided by your serverless platform to automatically batch incoming requests, improving throughput.
Implement custom batching logic within your inference function for fine-grained control over batch size and processing.
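For the custom-batching route, a minimal sketch of dynamic micro-batching with asyncio is shown below; the batch size, wait time, and the model_fn callable are placeholders you would replace with your own inference code:

```python
# Minimal micro-batching sketch: requests are queued, and a background loop drains
# the queue every few milliseconds (or when the batch is full) and runs one forward
# pass for the whole batch. The model call is a placeholder.
import asyncio

MAX_BATCH_SIZE = 8
MAX_WAIT_S = 0.01  # wait up to 10 ms to fill a batch

queue: asyncio.Queue = asyncio.Queue()

async def infer(payload):
    """Called by each incoming request; resolves when its batch has been processed."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((payload, fut))
    return await fut

async def batch_worker(model_fn):
    while True:
        payload, fut = await queue.get()
        batch = [(payload, fut)]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
        # Collect more requests until the batch is full or the deadline passes.
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        results = model_fn([p for p, _ in batch])  # one forward pass for the batch
        for (_, f), result in zip(batch, results):
            f.set_result(result)
```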
Serverless inference core technology
Addressing Cold Boot Latency
Cold start latency is the delay incurred when a new container has to start and load the model before it can serve inference requests. It slows the first responses after a scale-up and can degrade the user experience.
Pre-Loading Model Weights
The first optimization for minimizing cold boot latency is to pre-load the model when the container first starts.
This prevents the model from having to be retrieved each time the API is invoked.
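To make the difference concrete, here is a hedged sketch contrasting per-request loading with loading once at container start. The file path is an assumption, and PyTorch is used only for illustration:

```python
# Contrast of the two patterns. The file path and framework calls are illustrative.
import torch

MODEL_PATH = "/models/classifier.pt"  # assumed to be baked into the image or mounted

# Anti-pattern: the weights are read from disk on every invocation.
def handler_slow(x):
    model = torch.load(MODEL_PATH, map_location="cuda", weights_only=False)
    return model(x)

# Pre-loading: the weights are read once when the container starts; every
# invocation after the first reuses the in-memory (and in-VRAM) model.
_model = torch.load(MODEL_PATH, map_location="cuda", weights_only=False)
_model.eval()

def handler_fast(x):
    with torch.no_grad():
        return _model(x)
```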
Caching Weights in Distributed Storage
One of the slowest operations in machine learning inference is downloading large model weights over the network. We advise against pulling weights from remote object storage on every cold boot; it's far too slow. Instead, we suggest mounting the files directly into the container running inference.
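A minimal sketch of that pattern, assuming the platform mounts a shared volume into the container (the mount point and model directory below are assumptions):

```python
# Sketch: read weights from a volume the platform mounts into the container,
# instead of downloading them from object storage at request time.
from pathlib import Path
from transformers import AutoModelForCausalLM

WEIGHTS_DIR = Path("/vol/models/my-llm")  # hypothetical mount point for a shared volume

assert WEIGHTS_DIR.exists(), "expected the platform to mount the weights volume"
model = AutoModelForCausalLM.from_pretrained(WEIGHTS_DIR, device_map="auto")
```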
Enabling GPU Checkpoint Restore
Checkpoint restore is a technique that saves a snapshot, or checkpoint, of a GPU process so that model weights don't have to be loaded back into memory on each cold boot.
Conclusion
Serverless inference offers a powerful way to deploy machine learning models with minimal operational overhead. By understanding the concepts and following best practices, you can leverage serverless platforms to efficiently serve your AI models at scale.
FAQs
Are serverless GPUs cheaper than EC2?
For bursty workloads, serverless GPUs are significantly cheaper than using on-demand EC2 instances. When using EC2, you pay for the full GPU instance by the hour, regardless of whether it's running code or sitting idle. With serverless GPUs, you only pay when your code is running.
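As a rough illustration with hypothetical prices (placeholders, not quotes from any provider), consider a workload that is busy about two hours a day:

```python
# Illustrative only: hypothetical prices, not real provider rates.
ec2_hourly = 4.00                # assumed on-demand GPU instance price, $/hour
serverless_per_second = 0.0020   # assumed serverless GPU price, $/second (~$7.20/hour while running)

busy_seconds_per_day = 2 * 3600  # workload actually runs ~2 hours/day

ec2_daily = ec2_hourly * 24                                      # billed around the clock
serverless_daily = serverless_per_second * busy_seconds_per_day  # billed only while running

print(f"EC2 (always on): ${ec2_daily:.2f}/day")
print(f"Serverless:      ${serverless_daily:.2f}/day")
```

Even at a higher effective per-hour rate while running, paying only for busy seconds comes out far cheaper when utilization is low.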
Can I train LLMs on serverless GPUs?
You can train and fine-tune LLMs on serverless GPUs. It's often difficult and expensive to get access to large GPUs, like H100s, from conventional on-demand cloud providers. Many serverless GPU providers offer H100s and also let you attach multiple GPUs to a single workload.
What’s the difference between serverless GPUs and spot GPUs?
Spot instances are hourly rentals that are interruptible, which means the cloud provider can decide to stop your workload if the instance is needed by another customer. Unlike spot instances, serverless GPUs are billed by the second, and are typically not interruptible.
How do you reduce serverless GPU cold starts?
There are several strategies:
Optimize your container image to exclude unnecessary packages so there's less to load at runtime
Pre-load model weights at start time so they’re loaded into VRAM before the first request
Store model weights in distributed storage volumes instead of downloading them from remote object storage
Some platforms support checkpoint restore, which saves a snapshot of the GPU memory and prevents the need to re-load weights into VRAM each time the container boots