Author Profile: majidraza1228

Skills published by majidraza1228, listed with real star/download counts and source-aware metadata.

Total Skills: 9
Total Stars: 0
Total Downloads: 0

Skills Performance

Comparison chart based on real star and download signals from the source data. All charted skills currently report 0 stars and 0 downloads:

build-review-interface: 0 stars, 0 downloads
error-analysis: 0 stars, 0 downloads
eval-audit: 0 stars, 0 downloads
eval-coding-agent: 0 stars, 0 downloads
eval-tool-use: 0 stars, 0 downloads
evaluate-rag: 0 stars, 0 downloads
generate-synthetic-data: 0 stars, 0 downloads
validate-evaluator: 0 stars, 0 downloads

Published Skills

General

build-review-interface

Build a custom browser-based annotation interface for reviewing LLM traces and collecting human labels. Use when reviewers are working with raw JSON files, when you need to collect Pass/Fail labels at scale, or when trace data needs domain-specific formatting to be readable.

Repository Source | Needs Review
Research

error-analysis

Systematically identify and categorize failure modes in an LLM pipeline by reading traces. Use when starting a new eval project, after significant pipeline changes (new features, model switches, prompt rewrites), when production metrics drop, or after incidents.
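
The categorize-and-count step this skill describes can be sketched in a few lines; the failure categories and trace notes below are hypothetical examples, not output of the skill itself:

```python
from collections import Counter

# Sketch: after open-coding each trace with a short failure note,
# tally the notes to find the dominant failure modes.
trace_notes = [
    "hallucinated order id",
    "ignored user constraint",
    "hallucinated order id",
    "formatting error",
    "ignored user constraint",
    "hallucinated order id",
]

counts = Counter(trace_notes)
for failure_mode, n in counts.most_common():
    print(f"{failure_mode}: {n}")  # most frequent failure mode first
```

In practice the notes come from reading real traces, and the ranked tally tells you which failure mode to build an evaluator for first.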

Repository Source | Needs Review
Security

eval-audit

Audit an LLM eval pipeline and surface problems: missing error analysis, unvalidated judges, vanity metrics, etc. Use when inheriting an eval system, when unsure whether evals are trustworthy, or as a starting point when no eval infrastructure exists.

Repository Source | Needs Review
Coding

eval-coding-agent

Evaluate a coding agent's output quality across the failure modes specific to code generation and editing: correctness, scope discipline, instruction following, safety, and diff quality. Use when building or improving a Claude-powered coding assistant, code review agent, or code generation pipeline.

Repository Source | Needs Review
Automation

eval-tool-use

Evaluate whether an LLM agent selects the right tools, constructs correct arguments, sequences tool calls appropriately, and handles errors gracefully. Use for any Claude agent that has access to tools (function calling, MCP servers, API integrations).
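
One way to check tool selection, arguments, and sequencing is to diff the agent's calls against an expected trace; this is a minimal sketch with hypothetical tool names and argument checks, not the skill's actual implementation:

```python
def check_tool_calls(actual, expected):
    """Compare tool selection, argument correctness, and call order."""
    errors = []
    # Tool sequencing: the ordered list of tool names must match.
    if [c["tool"] for c in actual] != [c["tool"] for c in expected]:
        errors.append("tool sequence mismatch")
    # Argument construction: each expected argument must match exactly.
    for a, e in zip(actual, expected):
        for key, want in e.get("args", {}).items():
            if a.get("args", {}).get(key) != want:
                errors.append(f"{a['tool']}: bad arg {key!r}")
    return errors

# Toy trace: the agent picked the right tools but refunded the wrong amount.
actual = [
    {"tool": "search_orders", "args": {"customer_id": "c42"}},
    {"tool": "refund", "args": {"order_id": "o9", "amount": 10}},
]
expected = [
    {"tool": "search_orders", "args": {"customer_id": "c42"}},
    {"tool": "refund", "args": {"order_id": "o9", "amount": 15}},
]
print(check_tool_calls(actual, expected))  # ["refund: bad arg 'amount'"]
```

Error handling (retries, graceful degradation) usually needs looser, judge-based checks rather than exact matching.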

Repository Source | Needs Review
Research

evaluate-rag

Evaluate a RAG (retrieval-augmented generation) pipeline's retrieval quality and generation quality separately. Use when the pipeline retrieves context from a knowledge base before generating answers.
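
Scoring retrieval separately from generation typically means computing a retrieval metric on its own; here is a minimal recall@k sketch (the metric choice and document IDs are illustrative assumptions, not the skill's prescribed method):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents found in the top-k retrieved set."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Toy example: retrieval is scored independently of the generated answer.
retrieved = ["doc7", "doc2", "doc9", "doc1"]
relevant = {"doc2", "doc1"}
print(recall_at_k(retrieved, relevant, k=3))  # 0.5
```

If recall@k is low, fixing generation prompts won't help; the two stages fail for different reasons, which is why they are evaluated separately.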

Repository Source | Needs Review
General

generate-synthetic-data

Create diverse synthetic test inputs for LLM pipeline evaluation using dimension-based tuple generation. Use when bootstrapping an eval dataset, when real user data is sparse, or when stress-testing specific failure hypotheses. Do NOT use when you already have 100+ representative real traces.
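
Dimension-based tuple generation, as named in the description, can be sketched roughly as a cross product over coverage dimensions; the dimension names below are illustrative, not from the skill itself:

```python
import itertools
import random

# Hypothetical dimensions for a customer-support pipeline; real dimensions
# come from your application and the failure hypotheses you want to test.
dimensions = {
    "persona": ["new user", "power user", "frustrated customer"],
    "topic": ["billing", "login", "data export"],
    "complexity": ["single question", "multi-part", "ambiguous"],
}

# Cross the dimensions into tuples, then sample a manageable subset.
tuples = list(itertools.product(*dimensions.values()))  # 3*3*3 = 27 combos
random.seed(0)
sample = random.sample(tuples, k=5)

for combo in sample:
    spec = dict(zip(dimensions.keys(), combo))
    # Each spec would then be expanded into a natural-language test input,
    # e.g. by prompting an LLM with the tuple as constraints.
    print(spec)
```

Sampling from the full cross product keeps the test set diverse without forcing you to write 27 inputs by hand.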

Repository Source | Needs Review
General

validate-evaluator

Calibrate an LLM judge against human labels using data splits, TPR/TNR, and bias correction. Use after writing a judge prompt (write-judge-prompt) to verify alignment before trusting its outputs in production.
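
The TPR/TNR and bias-correction steps can be sketched as follows; the split sizes, binary Pass/Fail labels, and the Rogan-Gladen-style correction are illustrative assumptions, not the skill's exact procedure:

```python
def tpr_tnr(human, judge):
    """True-positive and true-negative rates of the judge vs. human labels."""
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    tn = sum(1 for h, j in zip(human, judge) if not h and not j)
    pos = sum(human)
    neg = len(human) - pos
    return tp / pos, tn / neg

def corrected_pass_rate(observed_rate, tpr, tnr):
    """Correct the judge's observed pass rate for its known error rates."""
    # p_true = (p_obs - FPR) / (TPR - FPR), with FPR = 1 - TNR
    fpr = 1.0 - tnr
    return (observed_rate - fpr) / (tpr - fpr)

# Toy held-out split: 1 = Pass, 0 = Fail
human = [1, 1, 1, 0, 0, 0, 1, 0]
judge = [1, 1, 0, 0, 1, 0, 1, 0]
tpr, tnr = tpr_tnr(human, judge)
print(tpr, tnr)                            # 0.75 0.75
print(corrected_pass_rate(0.5, tpr, tnr))  # 0.5
```

Once TPR/TNR are measured on a held-out split, the correction lets you report a bias-adjusted pass rate instead of the judge's raw (and possibly skewed) one.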

Repository Source | Needs Review
Coding

write-judge-prompt

Design LLM-as-Judge evaluators for subjective criteria that code-based checks cannot handle. Use when a failure mode requires interpretation (tone, faithfulness, relevance, completeness, reasoning quality). Do NOT use when the failure mode can be checked with code (regex, schema validation, test execution). Default to Claude (claude-sonnet-4-6 or claude-opus-4-6) as the judge model.

Repository Source | Needs Review
Author majidraza1228 | V50.AI