phoenix-evals

Build and run evaluators for AI/LLM applications using Phoenix.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the command below and send it to your AI assistant to install and learn this skill:

Install skill "phoenix-evals" with this command: npx skills add arize-ai/phoenix/arize-ai-phoenix-phoenix-evals

Phoenix Evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
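
The "code first" principle can be made concrete with a tiny example: anything deterministic (format, length, required fields) is checked in plain code, and only nuanced questions go to an LLM judge. The function name and the citation format below are illustrative, not part of this skill.

```python
# Minimal sketch of "code first": deterministic checks before any LLM call.
import re

def citations_well_formed(output: str) -> bool:
    """Pass if every bracketed citation in the answer looks like [doc-N]."""
    citations = re.findall(r"\[[^\]]+\]", output)
    return all(re.fullmatch(r"\[doc-\d+\]", c) is not None for c in citations)

# Nuanced questions (e.g. "is the answer faithful to the retrieved context?")
# are left to an LLM evaluator; see the sketches further down.
```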

Quick Reference

| Task | Files |
| --- | --- |
| Setup | setup-python, setup-typescript |
| Decide what to evaluate | evaluators-overview |
| Choose a judge model | fundamentals-model-selection |
| Use pre-built evaluators | evaluators-pre-built |
| Build code evaluator | evaluators-code-{python\|typescript} |
| Build LLM evaluator | evaluators-llm-{python\|typescript}, evaluators-custom-templates |
| Batch evaluate DataFrame (see the sketch after this table) | evaluate-dataframe-python |
| Run experiment | experiments-running-{python\|typescript} |
| Create dataset | experiments-datasets-{python\|typescript} |
| Generate synthetic data | experiments-synthetic-{python\|typescript} |
| Validate evaluator accuracy | validation, validation-evaluators-{python\|typescript} |
| Sample traces for review | observe-sampling-{python\|typescript} |
| Analyze errors | error-analysis, error-analysis-multi-turn, axial-coding |
| RAG evals | evaluators-rag |
| Avoid common mistakes | common-mistakes-python, fundamentals-anti-patterns |
| Production | production-overview, production-guardrails, production-continuous |
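
For the "Batch evaluate DataFrame" task above, a minimal sketch using the `llm_classify` helper and the built-in hallucination template from `phoenix.evals` is shown below. The judge model, column names, and parameter names are assumptions and may differ across Phoenix versions; treat the skill's evaluate-dataframe-python file as authoritative.

```python
# A minimal sketch, assuming the phoenix.evals llm_classify API and the
# built-in hallucination template/rails; verify against the skill's files.
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# One row per example: user input, retrieved reference text, model output.
df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["France's capital is Paris."],
        "output": ["The capital of France is Paris."],
    }
)

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),  # judge model choice is an assumption
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),  # binary: factual / hallucinated
    provide_explanation=True,
)
print(results[["label", "explanation"]])
```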

Workflows

Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview

Building Evaluator: fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}

RAG Systems: evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness) (see the sketch after these workflows)

Production: production-overview → production-guardrails → production-continuous
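
The retrieval step of the RAG workflow above can usually be scored with plain code before any LLM judge is involved. The sketch below computes recall@k against gold document ids; the field names, the k value, and the pass threshold are assumptions about your data, and faithfulness would still be handed to an LLM evaluator such as the one sketched earlier.

```python
# A minimal code evaluator for the retrieval half of a RAG pipeline.
# Field names (retrieved_ids, gold_ids) and the thresholds are illustrative.
from typing import Sequence, Set

def recall_at_k(retrieved_ids: Sequence[str], gold_ids: Set[str], k: int = 5) -> float:
    """Fraction of gold documents that appear in the top-k retrieved results."""
    if not gold_ids:
        return 1.0  # nothing to retrieve counts as a pass
    return len(set(retrieved_ids[:k]) & gold_ids) / len(gold_ids)

def retrieval_passes(retrieved_ids: Sequence[str], gold_ids: Set[str]) -> bool:
    """Binary pass/fail, in line with the 'Binary > Likert' principle below."""
    return recall_at_k(retrieved_ids, gold_ids) >= 0.8  # threshold is an assumption
```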

Rule Categories

| Prefix | Description |
| --- | --- |
| fundamentals-* | Types, scores, anti-patterns |
| observe-* | Tracing, sampling |
| error-analysis-* | Finding failures |
| axial-coding-* | Categorizing failures |
| evaluators-* | Code, LLM, RAG evaluators |
| experiments-* | Datasets, running experiments |
| validation-* | Validating evaluator accuracy against human labels |
| production-* | CI/CD, monitoring |

Key Principles

| Principle | Action |
| --- | --- |
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR (see the sketch below) |
| Binary > Likert | Pass/fail, not 1-5 |
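
One way to check the "Validate judges" bar is to compare the judge's binary labels with human labels on the same examples and compute the true positive and true negative rates. The sketch below is plain pandas; the column names and the tiny inline dataset are assumptions for illustration.

```python
# Minimal sketch of validating an LLM judge against human labels (TPR/TNR).
# Assumes boolean columns "human_label" and "judge_label" where True = pass.
import pandas as pd

def judge_agreement(df: pd.DataFrame) -> dict:
    human = df["human_label"].astype(bool)
    judge = df["judge_label"].astype(bool)
    tpr = (judge & human).sum() / max(int(human.sum()), 1)       # true positive rate
    tnr = (~judge & ~human).sum() / max(int((~human).sum()), 1)  # true negative rate
    return {"tpr": tpr, "tnr": tnr, "meets_bar": tpr > 0.8 and tnr > 0.8}

labels = pd.DataFrame(
    {
        "human_label": [True, True, False, False, True],
        "judge_label": [True, False, False, False, True],
    }
)
print(judge_agreement(labels))  # tpr ≈ 0.67, tnr = 1.0, meets_bar = False
```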

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

phoenix-tracing
vercel-react-best-practices
mintlify
phoenix-playwright-tests