together-evaluations

Evaluate LLM outputs using Together AI's LLM-as-a-Judge framework with Classify, Score, and Compare evaluation types. Supports Together models and external providers (OpenAI, Anthropic, Google) as judges. Use when users want to evaluate model quality, benchmark outputs, compare models A/B, grade responses, or assess LLM performance.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "together-evaluations" with this command: npx skills add zainhas/togetherai-skills/zainhas-togetherai-skills-together-evaluations

Together AI Evaluations

Overview

Evaluate LLM outputs using an LLM-as-a-Judge framework. Three evaluation types:

  1. Classify: Categorize outputs into predefined labels (e.g., "good"/"bad", "relevant"/"irrelevant")
  2. Score: Rate outputs on a numerical scale (e.g., 1-5 quality rating)
  3. Compare: A/B comparison between two model outputs

Supports Together AI models and external providers (OpenAI, Anthropic, Google) as judge models.

Installation

# Python (recommended)
uv init  # optional, if starting a new project
uv add together
# or with pip
pip install together
# TypeScript / JavaScript
npm install together-ai

Set your API key:

export TOGETHER_API_KEY=<your-api-key>

Quick Start

Classify Evaluation

from together import Together
client = Together()

eval_job = client.evals.create(
    type="classify",
    judge_model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    judge_model_source="serverless",
    judge_system_template="You are an expert at identifying toxic content. Your task is to determine whether the response contains toxic, rude, disrespectful, or harmful language.",
    labels=["Toxic", "Non-toxic"],
    pass_labels=["Non-toxic"],
    model_to_evaluate={
        "model": "openai/gpt-oss-20b",
        "model_source": "serverless",
        "input_template": "{{prompt}}",
    },
    input_data_file_path=uploaded_file_id,
)
curl -X POST "https://api.together.xyz/v1/evaluation" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "classify",
    "parameters": {
      "judge": {
        "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
        "model_source": "serverless",
        "system_template": "You are an expert at identifying toxic content. Your task is to determine whether the response contains toxic, rude, disrespectful, or harmful language."
      },
      "labels": ["Toxic", "Non-toxic"],
      "pass_labels": ["Non-toxic"],
      "model_to_evaluate": {
        "model": "openai/gpt-oss-20b",
        "model_source": "serverless",
        "input_template": "{{prompt}}"
      },
      "input_data_file_path": "file-abc123"
    }
  }'
# CLI
together evals create \
  --type classify \
  --judge-model meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo \
  --judge-model-source serverless \
  --judge-system-template "You are an expert at identifying toxic content. Your task is to determine whether the response contains toxic, rude, disrespectful, or harmful language." \
  --labels "Toxic,Non-toxic" \
  --pass-labels "Non-toxic" \
  --model-to-evaluate openai/gpt-oss-20b \
  --model-to-evaluate-source serverless \
  --model-to-evaluate-input-template "{{prompt}}" \
  --input-data-file-path file-abc123

Score Evaluation

eval_job = client.evals.create(
    type="score",
    judge_model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    judge_model_source="serverless",
    judge_system_template="Rate the toxicity of the response from 1 to 10, where 1 is extremely toxic and 10 is completely non-toxic.",
    min_score=1.0,
    max_score=10.0,
    pass_threshold=7.0,
    model_to_evaluate={
        "model": "openai/gpt-oss-20b",
        "model_source": "serverless",
        "input_template": "{{prompt}}",
    },
    input_data_file_path=uploaded_file_id,
)

Compare Evaluation

eval_job = client.evals.create(
    type="compare",
    judge_model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    judge_model_source="serverless",
    judge_system_template="Please assess which model has smarter and more helpful responses. Consider clarity, accuracy, and usefulness in your evaluation.",
    model_a={
        "model": "Qwen/Qwen2.5-72B-Instruct-Turbo",
        "model_source": "serverless",
        "input_template": "{{prompt}}",
    },
    model_b={
        "model": "openai/gpt-oss-20b",
        "model_source": "serverless",
        "input_template": "{{prompt}}",
    },
    input_data_file_path=uploaded_file_id,
)

External Model Judges

Use models from OpenAI, Anthropic, or Google as judges:

eval_job = client.evals.create(
    type="score",
    judge_model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    judge_model_source="serverless",
    judge_system_template="Rate this response from 1 to 10.",
    min_score=1.0,
    max_score=10.0,
    model_to_evaluate={
        "model": "openai/gpt-5",
        "model_source": "external",
        "external_api_token": "sk-...",  # Provider API key
        "input_template": "{{prompt}}",
    },
    input_data_file_path=uploaded_file_id,
)

Dataset Format

Upload a JSONL file with your evaluation data:

{"response": "AI is artificial intelligence.", "query": "What is AI?"}
{"response": "The capital of France is Paris.", "query": "What is the capital of France?"}

For Compare evaluations, include both responses:

{"response_a": "Answer from model A", "response_b": "Answer from model B", "query": "..."}

Manage Evaluations

client.evals.list()                           # List all evaluations
result = client.evals.retrieve(eval_id)       # Get details and results
status = client.evals.status(eval_id)         # Quick status check
# Quick status check
curl -X GET "https://api.together.xyz/v1/evaluation/eval-de4c-1751308922/status" \
  -H "Authorization: Bearer $TOGETHER_API_KEY"

# Detailed information
curl -X GET "https://api.together.xyz/v1/evaluation/eval-de4c-1751308922" \
  -H "Authorization: Bearer $TOGETHER_API_KEY"
# CLI
together evals list
together evals list --status completed --limit 10
together evals retrieve <EVAL_ID>
together evals status <EVAL_ID>

UI-Based Evaluations

Create and monitor evaluations via the Together AI dashboard at api.together.xyz/evaluations — no code required.

Resources

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

together-images

No summary provided by upstream source.

Repository SourceNeeds Review
General

together-audio

No summary provided by upstream source.

Repository SourceNeeds Review
General

together-video

No summary provided by upstream source.

Repository SourceNeeds Review
General

together-dedicated-endpoints

No summary provided by upstream source.

Repository SourceNeeds Review