phoenix-evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

To install, copy the following and send it to your AI assistant:

Install skill "phoenix-evals" with this command: npx skills add arize-ai/phoenix/arize-ai-phoenix-phoenix-evals


Quick Reference

Task Files

Setup: setup-python, setup-typescript
Decide what to evaluate: evaluators-overview
Choose a judge model: fundamentals-model-selection
Use pre-built evaluators: evaluators-pre-built
Build code evaluator: evaluators-code-python, evaluators-code-typescript (see the sketch after this table)
Build LLM evaluator: evaluators-llm-python, evaluators-llm-typescript, evaluators-custom-templates
Batch evaluate DataFrame: evaluate-dataframe-python
Run experiment: experiments-running-python, experiments-running-typescript
Create dataset: experiments-datasets-python, experiments-datasets-typescript
Generate synthetic data: experiments-synthetic-python, experiments-synthetic-typescript
Validate evaluator accuracy: validation, validation-evaluators-python, validation-evaluators-typescript
Sample traces for review: observe-sampling-python, observe-sampling-typescript
Analyze errors: error-analysis, error-analysis-multi-turn, axial-coding
RAG evals: evaluators-rag
Avoid common mistakes: common-mistakes-python, fundamentals-anti-patterns
Production: production-overview, production-guardrails, production-continuous
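
The "code first" row above favors deterministic checks before reaching for an LLM judge. Here is a minimal illustrative sketch in plain Python; the function name and the label/explanation return contract are assumptions for illustration, not a phoenix-evals API (see evaluators-code-python for the real conventions):

```python
import json

def evaluate_json_output(output: str) -> dict:
    """Binary pass/fail: does the model output parse as a JSON object?"""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return {"label": "fail", "explanation": "output is not valid JSON"}
    if not isinstance(parsed, dict):
        return {"label": "fail", "explanation": "output is JSON but not an object"}
    return {"label": "pass", "explanation": "output parsed as a JSON object"}

print(evaluate_json_output('{"answer": 42}'))  # label: pass
print(evaluate_json_output("not json"))        # label: fail
```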

Workflows

Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview
Building Evaluator: fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript} (LLM-judge sketch below)
RAG Systems: evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)
Production: production-overview → production-guardrails → production-continuous
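
As a rough sketch of the Building Evaluator workflow's LLM-judge step, assuming the Python phoenix.evals package with its documented llm_classify helper and a pre-built relevance template; import paths and argument names vary across versions, so treat evaluators-llm-python and evaluators-pre-built as authoritative:

```python
import pandas as pd
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# The relevance template expects an `input` (query) and a `reference`
# (retrieved document) column.
df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["Paris is the capital and largest city of France."],
    }
)

# Rails constrain the judge to a fixed binary label set (Binary > Likert).
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o"),  # judge choice: fundamentals-model-selection
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=rails,
    provide_explanation=True,  # ask the judge to justify each label
)
print(results[["label", "explanation"]])
```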

Reference Categories

fundamentals-*: Types, scores, anti-patterns
observe-*: Tracing, sampling
error-analysis-*: Finding failures
axial-coding-*: Categorizing failures
evaluators-*: Code, LLM, RAG evaluators
experiments-*: Datasets, running experiments
validation-*: Validating evaluator accuracy against human labels
production-*: CI/CD, monitoring

Key Principles

Error analysis first: Can't automate what you haven't observed
Custom > generic: Build from your failures
Code first: Deterministic before LLM
Validate judges: 80% TPR/TNR against human labels (see the sketch below)
Binary > Likert: Pass/fail, not 1-5
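
To make the Validate judges row concrete, here is a library-agnostic sketch of scoring a binary judge against human labels; the column names, toy data, and applying the 80% bar to both rates are illustrative assumptions:

```python
import pandas as pd

# Human labels vs. judge labels on the same examples (toy data).
df = pd.DataFrame(
    {
        "human": ["pass", "pass", "fail", "fail", "pass"],
        "judge": ["pass", "fail", "fail", "fail", "pass"],
    }
)

tp = ((df.human == "pass") & (df.judge == "pass")).sum()
fn = ((df.human == "pass") & (df.judge == "fail")).sum()
tn = ((df.human == "fail") & (df.judge == "fail")).sum()
fp = ((df.human == "fail") & (df.judge == "pass")).sum()

tpr = tp / (tp + fn)  # true positive rate: agreement on human "pass"
tnr = tn / (tn + fp)  # true negative rate: agreement on human "fail"
print(f"TPR={tpr:.0%} TNR={tnr:.0%}")

if tpr < 0.8 or tnr < 0.8:
    print("Judge below 80% TPR or TNR: iterate on the prompt before trusting it.")
```

Checking both rates matters: a judge that passes everything scores a perfect TPR and a useless TNR, a failure mode a single accuracy number would hide.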

