torchserve

Model serving engine for PyTorch. Focuses on MAR packaging, custom handlers for preprocessing/inference, and management of multi-GPU worker scaling. (torchserve, mar-file, handler, basehandler, model-archiver, inference-api)

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the following and send it to your AI assistant to learn this skill

Install skill "torchserve" with this command: npx skills add cuba6112/skillfactory/cuba6112-skillfactory-torchserve

Overview

TorchServe is a flexible and easy-to-use tool for serving PyTorch models. It provides capabilities for packaging models, scaling workers based on hardware availability, and managing multiple model versions via a REST/gRPC API.

When to Use

Use TorchServe when you need a production-ready inference server that handles multi-GPU load balancing, request batching, and custom preprocessing/postprocessing logic via Python handlers.

Decision Tree

  1. Do you need custom logic for image resizing or JSON parsing before model inference?
    • OVERRIDE: preprocess() in a class inheriting from BaseHandler (see the handler sketch under Workflows).
  2. Do you have multiple GPUs available?
    • RELY: On TorchServe's round-robin assignment; read the gpu_id from the handler context (a minimal sketch follows this list).
  3. Do you want to deploy to a system with limited resources?
    • CAUTION: TorchServe runs a frontend process plus one or more Python workers per model, and the project is in limited-maintenance mode; check resource and environment compatibility first.
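
A minimal sketch of how a handler honors the worker's GPU assignment. The class name, file name, and the serialized-file name model.pt are illustrative; context.system_properties and its gpu_id/model_dir keys are the same fields BaseHandler itself reads.

```python
# handler.py (illustrative): pick the device assigned to this worker
import os

import torch
from ts.torch_handler.base_handler import BaseHandler


class DeviceAwareHandler(BaseHandler):
    def initialize(self, context):
        properties = context.system_properties
        gpu_id = properties.get("gpu_id")
        # Workers are assigned GPUs round-robin; honor that assignment here.
        if torch.cuda.is_available() and gpu_id is not None:
            self.device = torch.device(f"cuda:{gpu_id}")
        else:
            self.device = torch.device("cpu")
        model_dir = properties.get("model_dir")
        # Load the weights packaged into the .mar (the file name is whatever
        # was passed to --serialized-file; "model.pt" is assumed here).
        self.model = torch.jit.load(
            os.path.join(model_dir, "model.pt"), map_location=self.device
        )
        self.model.eval()
        self.initialized = True
```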

Workflows

  1. Packaging and Serving a Model

    1. Write a custom handler or use a default one (e.g., 'image_classifier').
    2. Use torch-model-archiver to package the model, weights, and handler into a .mar file.
    3. Start TorchServe specifying the model store and the initial models to load.
    4. Test the endpoint using curl or a gRPC client (example commands below).
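
For example, with a TorchScript weights file model.pt and a handler script handler.py (both names are placeholders), the workflow looks roughly like this:

```bash
# 1. Package weights and handler into a .mar in ./model_store
mkdir -p model_store
torch-model-archiver --model-name mymodel --version 1.0 \
    --serialized-file model.pt --handler handler.py \
    --export-path model_store

# 2. Start TorchServe and load the model at startup
torchserve --start --model-store model_store --models mymodel=mymodel.mar

# 3. Smoke-test the inference endpoint (default port 8080)
curl http://localhost:8080/predictions/mymodel -T sample_input.jpg
```
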
  2. Customizing Inference Logic

    1. Define a class inheriting from BaseHandler.
    2. Override preprocess() to handle incoming JSON/Image data.
    3. Override inference() or postprocess() to customize output formatting.
    4. Pass this script via --handler to torch-model-archiver so it ships inside the .mar (a sketch of such a handler follows).
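
A hedged sketch of such a handler, assuming JSON request bodies. The data/body keys are the ones TorchServe populates per request; the "values" field and the tensor dtype are placeholders for your own payload.

```python
# custom_handler.py (illustrative): JSON in, list of floats out
import json

import torch
from ts.torch_handler.base_handler import BaseHandler


class JsonHandler(BaseHandler):
    def preprocess(self, data):
        # `data` is a list of requests; each payload sits under "data" or "body".
        rows = []
        for request in data:
            payload = request.get("data") or request.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = json.loads(payload)
            rows.append(payload["values"])  # placeholder field name
        return torch.tensor(rows, dtype=torch.float32).to(self.device)

    def postprocess(self, inference_output):
        # Must return one entry per request in the incoming batch.
        return inference_output.detach().cpu().tolist()
```
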
  3. Scaling Inference Capacity

    1. Use the Management API (typically on port 8081) to adjust the number of workers.
    2. Send a PUT request to /models/{model_name}?min_worker=N.
    3. Monitor the logs to confirm the new workers initialize on the available hardware (example requests below).
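
Assuming the default management port (8081) and a model registered as mymodel, a scaling request and a status check look like this; the worker counts are placeholders:

```bash
# Ask TorchServe to keep between 2 and 4 workers for this model
curl -X PUT "http://localhost:8081/models/mymodel?min_worker=2&max_worker=4&synchronous=true"

# Verify the new worker status (also reflected in the server logs)
curl http://localhost:8081/models/mymodel
```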

Non-Obvious Insights

  • A/B Testing: TorchServe can serve multiple versions of a model simultaneously, so A/B testing amounts to registering a second version and routing requests to a specific version of the prediction endpoint (see the example below).
  • GPU Round-Robin: Workers are assigned GPUs in a round-robin fashion. Handlers must use the gpu_id provided in the context to ensure the model is loaded onto the correct physical device, as in the initialize() sketch above.
  • The MAR Format: The Model Archive (.mar) file is a self-contained ZIP that bundles the model definition, state dictionary, and handler script, so the code and weights that were tested together are the ones that get deployed together.
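
As an illustration of version routing for A/B testing (model names, version numbers, and file names are placeholders; the endpoints are the standard Management and Inference API paths):

```bash
# Register a second version of the same model
curl -X POST "http://localhost:8081/models?url=mymodel_v2.mar"

# Send traffic explicitly to version 2.0
curl http://localhost:8080/predictions/mymodel/2.0 -T sample_input.jpg

# Promote 2.0 to the default version once it wins the comparison
curl -X PUT http://localhost:8081/models/mymodel/2.0/set-default
```

Since a .mar file is a plain ZIP archive, running unzip -l mymodel.mar is a quick way to confirm what was packaged.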

Evidence

Scripts

  • scripts/torchserve_tool.py: Skeleton for a custom TorchServe handler.
  • scripts/torchserve_tool.js: Script to send inference requests to a running TorchServe instance.

Dependencies

  • torchserve
  • torch-model-archiver

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

  • ollama-rag — no summary provided by upstream source (repository source; needs review)
  • torchaudio — no summary provided by upstream source (repository source; needs review)
  • pytorch-onnx — no summary provided by upstream source (repository source; needs review)
  • unsloth-sft — no summary provided by upstream source (repository source; needs review)