torchserve

Model serving engine for PyTorch. Focuses on MAR packaging, custom handlers for preprocessing/inference, and management of multi-GPU worker scaling. (torchserve, mar-file, handler, basehandler, model-archiver, inference-api)

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the following and send it to your AI assistant to learn this skill

Install skill "torchserve" with this command: npx skills add cuba6112/skillfactory/cuba6112-skillfactory-torchserve

Overview

TorchServe is a flexible and easy-to-use tool for serving PyTorch models. It provides capabilities for packaging models, scaling workers based on hardware availability, and managing multiple model versions via a REST/gRPC API.

When to Use

Use TorchServe when you need a production-ready inference server that handles multi-GPU load balancing, request batching, and custom preprocessing/postprocessing logic via Python handlers.

Decision Tree

  1. Do you need custom logic for image resizing or JSON parsing before model inference?
    • OVERRIDE: preprocess() in a class inheriting from BaseHandler (see the handler sketch under Workflows).
  2. Do you have multiple GPUs available?
    • RELY: On TorchServe's round-robin assignment; read the gpu_id from the handler context (a minimal sketch follows this list).
  3. Do you want to deploy to a system with limited resources?
    • CAUTION: TorchServe runs a frontend process plus one or more Python workers per model, and the project is in limited-maintenance mode; check resource and environment compatibility first.
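
A minimal sketch of how a handler honors the worker's GPU assignment. The class name, file name, and the serialized-file name model.pt are illustrative; context.system_properties and its gpu_id/model_dir keys are the same fields BaseHandler itself reads.

```python
# handler.py (illustrative): pick the device assigned to this worker
import os

import torch
from ts.torch_handler.base_handler import BaseHandler


class DeviceAwareHandler(BaseHandler):
    def initialize(self, context):
        properties = context.system_properties
        gpu_id = properties.get("gpu_id")
        # Workers are assigned GPUs round-robin; honor that assignment here.
        if torch.cuda.is_available() and gpu_id is not None:
            self.device = torch.device(f"cuda:{gpu_id}")
        else:
            self.device = torch.device("cpu")
        model_dir = properties.get("model_dir")
        # Load the weights packaged into the .mar (the file name is whatever
        # was passed to --serialized-file; "model.pt" is assumed here).
        self.model = torch.jit.load(
            os.path.join(model_dir, "model.pt"), map_location=self.device
        )
        self.model.eval()
        self.initialized = True
```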

Workflows

  1. Packaging and Serving a Model

    1. Write a custom handler or use a default one (e.g., 'image_classifier').
    2. Use torch-model-archiver to package the model, weights, and handler into a .mar file.
    3. Start TorchServe specifying the model store and the initial models to load.
    4. Test the endpoint using curl or a gRPC client (example commands below).
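
For example, with a TorchScript weights file model.pt and a handler script handler.py (both names are placeholders), the workflow looks roughly like this:

```bash
# 1. Package weights and handler into a .mar in ./model_store
mkdir -p model_store
torch-model-archiver --model-name mymodel --version 1.0 \
    --serialized-file model.pt --handler handler.py \
    --export-path model_store

# 2. Start TorchServe and load the model at startup
torchserve --start --model-store model_store --models mymodel=mymodel.mar

# 3. Smoke-test the inference endpoint (default port 8080)
curl http://localhost:8080/predictions/mymodel -T sample_input.jpg
```
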
  2. Customizing Inference Logic

    1. Define a class inheriting from BaseHandler.
    2. Override preprocess() to handle incoming JSON/Image data.
    3. Override inference() or postprocess() to customize output formatting.
    4. Pass this script via --handler to torch-model-archiver so it ships inside the .mar (a sketch of such a handler follows).
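
A hedged sketch of such a handler, assuming JSON request bodies. The data/body keys are the ones TorchServe populates per request; the "values" field and the tensor dtype are placeholders for your own payload.

```python
# custom_handler.py (illustrative): JSON in, list of floats out
import json

import torch
from ts.torch_handler.base_handler import BaseHandler


class JsonHandler(BaseHandler):
    def preprocess(self, data):
        # `data` is a list of requests; each payload sits under "data" or "body".
        rows = []
        for request in data:
            payload = request.get("data") or request.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = json.loads(payload)
            rows.append(payload["values"])  # placeholder field name
        return torch.tensor(rows, dtype=torch.float32).to(self.device)

    def postprocess(self, inference_output):
        # Must return one entry per request in the incoming batch.
        return inference_output.detach().cpu().tolist()
```
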
  3. Scaling Inference Capacity

    1. Use the Management API (typically on port 8081) to adjust the number of workers.
    2. Send a PUT request to /models/{model_name}?min_worker=N.
    3. Monitor the logs to confirm the new workers initialize on the available hardware (example requests below).
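
Assuming the default management port (8081) and a model registered as mymodel, a scaling request and a status check look like this; the worker counts are placeholders:

```bash
# Ask TorchServe to keep between 2 and 4 workers for this model
curl -X PUT "http://localhost:8081/models/mymodel?min_worker=2&max_worker=4&synchronous=true"

# Verify the new worker status (also reflected in the server logs)
curl http://localhost:8081/models/mymodel
```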

Non-Obvious Insights

  • A/B Testing: TorchServe can serve multiple versions of a model simultaneously, so A/B testing amounts to registering a second version and routing requests to a specific version of the prediction endpoint (see the example below).
  • GPU Round-Robin: Workers are assigned GPUs in a round-robin fashion. Handlers must use the gpu_id provided in the context to ensure the model is loaded onto the correct physical device, as in the initialize() sketch above.
  • The MAR Format: The Model Archive (.mar) file is a self-contained ZIP that bundles the model definition, state dictionary, and handler script, so the code and weights that were tested together are the ones that get deployed together.
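
As an illustration of version routing for A/B testing (model names, version numbers, and file names are placeholders; the endpoints are the standard Management and Inference API paths):

```bash
# Register a second version of the same model
curl -X POST "http://localhost:8081/models?url=mymodel_v2.mar"

# Send traffic explicitly to version 2.0
curl http://localhost:8080/predictions/mymodel/2.0 -T sample_input.jpg

# Promote 2.0 to the default version once it wins the comparison
curl -X PUT http://localhost:8081/models/mymodel/2.0/set-default
```

Since a .mar file is a plain ZIP archive, running unzip -l mymodel.mar is a quick way to confirm what was packaged.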

Evidence

Scripts

  • scripts/torchserve_tool.py: Skeleton for a custom TorchServe handler.
  • scripts/torchserve_tool.js: Script to send inference requests to a running TorchServe instance.

Dependencies

  • torchserve
  • torch-model-archiver

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

  • ollama-rag — no summary provided by upstream source (repository source; needs review)
  • torchaudio — no summary provided by upstream source (repository source; needs review)
  • pytorch-onnx — no summary provided by upstream source (repository source; needs review)
  • unsloth-sft — no summary provided by upstream source (repository source; needs review)