Skill Logger
Track, measure, and improve skill quality through systematic logging and scoring.
When to Use This Skill
Use for:
- Setting up skill usage logging
- Defining quality metrics for skill outputs
- Analyzing skill performance over time
- Identifying skills that need improvement
- Building feedback loops for skill enhancement
- A/B testing skill variations
NOT for:
- Creating new skills → use agent-creator
- Skill documentation → use skill-coach
- Runtime debugging → use appropriate debugger skills
- General logging/monitoring → use devops-automator
Core Logging Architecture
┌────────────────────────────────────────────────────────────────┐
│                     SKILL LOGGING PIPELINE                     │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  1. CAPTURE           2. ANALYZE          3. SCORE             │
│  ├─ Invocation        ├─ Output parse     ├─ Quality metrics   │
│  ├─ Input context     ├─ Token usage      ├─ User satisfaction │
│  ├─ Output            ├─ Tool calls       ├─ Goal completion   │
│  └─ Timing            └─ Error patterns   └─ Efficiency        │
│                                                                │
│  4. AGGREGATE         5. ALERT            6. IMPROVE           │
│  ├─ Per-skill stats   ├─ Quality drops    ├─ Identify patterns │
│  ├─ Trend analysis    ├─ Error spikes     ├─ Suggest changes   │
│  └─ Comparisons       └─ Underuse         └─ Track experiments │
│                                                                │
└────────────────────────────────────────────────────────────────┘
What to Log
Invocation Data
{
  "invocation_id": "uuid",
  "timestamp": "ISO8601",
  "skill_name": "wedding-immortalist",
  "skill_version": "1.2.0",
  "input": {
    "user_query": "Create a 3D model from my wedding photos",
    "context_tokens": 1500,
    "files_referenced": ["photos/", "config.json"]
  },
  "execution": {
    "duration_ms": 45000,
    "tool_calls": [
      {"tool": "Bash", "count": 5},
      {"tool": "Write", "count": 3}
    ],
    "tokens_used": {
      "input": 8500,
      "output": 3200
    },
    "errors": []
  },
  "output": {
    "type": "code_generation",
    "artifacts_created": ["pipeline.py", "config.yaml"],
    "response_length": 3200
  }
}
Quality Signals
QUALITY_SIGNALS = {
    # Implicit signals (automated)
    'completion': 'Did the skill complete without errors?',
    'token_efficiency': 'Output quality per token used',
    'tool_success_rate': 'Tool calls that succeeded',
    'retry_count': 'How many retries needed?',
    # Explicit signals (user feedback)
    'user_edit_ratio': 'How much did user modify output?',
    'user_accepted': 'Did user accept/use the output?',
    'follow_up_needed': 'Did user need to ask for fixes?',
    'explicit_rating': 'Thumbs up/down if available',
    # Outcome signals (delayed)
    'code_ran_successfully': 'Did generated code work?',
    'tests_passed': 'Did it pass tests?',
    'reverted': 'Was the output later reverted?',
}
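The `user_edit_ratio` signal needs a concrete measurement, which the list above does not specify. One possible sketch, assuming the ratio is defined as one minus a `difflib` similarity between the generated text and the text the user finally kept (the function name and that definition are illustrative assumptions, not part of the skill):

```python
import difflib


def user_edit_ratio(generated: str, final: str) -> float:
    """Estimate the fraction of generated text the user changed (0.0 to 1.0).

    Assumption: "edit ratio" means 1 - SequenceMatcher similarity.
    """
    if not generated and not final:
        return 0.0  # nothing generated, nothing changed
    similarity = difflib.SequenceMatcher(None, generated, final).ratio()
    return round(1.0 - similarity, 3)
```

A value of 0.0 means the output was kept verbatim and 1.0 means nothing in common survived. A token-level or line-level diff may suit code output better than this character-level comparison.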
Scoring Framework
Multi-Dimensional Quality Score
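# Assumed placeholder token budget for the efficiency score; tune per skill.
BASELINE_TOKENS = 10_000


def calculate_skill_score(invocation_log):
    """Score a skill invocation 0-100.

    Expects a flat dict of signals; 'tokens_used' is the total token count.
    Missing signals are treated as absent and earn no points.
    """
    tokens_used = invocation_log.get('tokens_used') or 0
    edit_ratio = invocation_log.get('user_edit_ratio', 1.0)
    scores = {
        # Completion (25%)
        'completion': (
            25 if not invocation_log.get('errors') else
            15 if invocation_log.get('recovered') else
            0
        ),
        # Efficiency (20%), capped so runs under the baseline earn no bonus
        'efficiency': (
            min(20, 20 * (BASELINE_TOKENS / tokens_used))
            if tokens_used > 0 else 0
        ),
        # Output Quality (30%)
        'quality': (
            30 if invocation_log.get('user_accepted') else
            20 if edit_ratio < 0.2 else
            10 if edit_ratio < 0.5 else
            0
        ),
        # User Satisfaction (25%)
        'satisfaction': (
            25 if invocation_log.get('explicit_rating') == 'positive' else
            15 if invocation_log.get('no_follow_up') else
            5 if invocation_log.get('follow_up_resolved') else
            0
        ),
    }
    return sum(scores.values())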
Score Interpretation
Score Range   Quality Level   Action
90-100        Excellent       Document as exemplar
75-89         Good            Monitor for consistency
50-74         Acceptable      Review for improvements
25-49         Poor            Prioritize fixes
0-24          Failing         Immediate intervention
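The bands above can be applied programmatically. A minimal sketch, assuming each band is keyed by its lower bound so that fractional scores such as 89.5 fall into the band below the next threshold (`SCORE_BANDS` and `interpret_score` are illustrative names):

```python
# (lower bound, quality level, action), highest band first
SCORE_BANDS = [
    (90, "Excellent", "Document as exemplar"),
    (75, "Good", "Monitor for consistency"),
    (50, "Acceptable", "Review for improvements"),
    (25, "Poor", "Prioritize fixes"),
    (0, "Failing", "Immediate intervention"),
]


def interpret_score(score: float) -> tuple[str, str]:
    """Map a 0-100 quality score to its (quality level, action) pair."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for lower_bound, level, action in SCORE_BANDS:
        if score >= lower_bound:
            return level, action
    raise AssertionError("unreachable: the last band starts at 0")
```

For example, `interpret_score(92)` returns `("Excellent", "Document as exemplar")`.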
Log Storage Schema
SQLite Schema (Local)
CREATE TABLE skill_invocations (
    id TEXT PRIMARY KEY,
    skill_name TEXT NOT NULL,
    skill_version TEXT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    -- Input
    user_query TEXT,
    context_tokens INTEGER,
    -- Execution
    duration_ms INTEGER,
    tokens_input INTEGER,
    tokens_output INTEGER,
    tool_calls_json TEXT,
    errors_json TEXT,
    -- Output
    output_type TEXT,
    artifacts_json TEXT,
    response_length INTEGER,
    -- Quality signals
    user_accepted BOOLEAN,
    user_edit_ratio REAL,
    follow_up_needed BOOLEAN,
    explicit_rating TEXT,
    -- Computed
    quality_score REAL
);

-- SQLite does not accept inline INDEX clauses; create indexes separately.
CREATE INDEX idx_skill_name ON skill_invocations (skill_name);
CREATE INDEX idx_timestamp ON skill_invocations (timestamp);
CREATE INDEX idx_quality ON skill_invocations (quality_score);

CREATE TABLE skill_aggregates (
    skill_name TEXT,
    period TEXT,  -- 'daily', 'weekly', 'monthly'
    period_start DATE,
    invocation_count INTEGER,
    avg_quality_score REAL,
    error_rate REAL,
    avg_tokens_used INTEGER,
    avg_duration_ms INTEGER,
    PRIMARY KEY (skill_name, period, period_start)
);
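The schema does not say how `skill_aggregates` gets populated. One way to do it, sketched here under assumptions, is a rollup query that recomputes daily rows from `skill_invocations`. `DEMO_SCHEMA` is a trimmed copy of the two tables for in-memory experimentation, and `rollup_daily` is a hypothetical helper; treating `avg_tokens_used` as input plus output tokens is also an assumption.

```python
import sqlite3

# Trimmed copy of the two tables, only for trying the rollup in memory.
DEMO_SCHEMA = """
CREATE TABLE skill_invocations (
    id TEXT PRIMARY KEY, skill_name TEXT NOT NULL, timestamp DATETIME,
    duration_ms INTEGER, tokens_input INTEGER, tokens_output INTEGER,
    errors_json TEXT, quality_score REAL
);
CREATE TABLE skill_aggregates (
    skill_name TEXT, period TEXT, period_start DATE,
    invocation_count INTEGER, avg_quality_score REAL, error_rate REAL,
    avg_tokens_used INTEGER, avg_duration_ms INTEGER,
    PRIMARY KEY (skill_name, period, period_start)
);
"""


def rollup_daily(conn: sqlite3.Connection) -> None:
    """Recompute the 'daily' rows of skill_aggregates from raw invocations."""
    conn.execute("""
        INSERT OR REPLACE INTO skill_aggregates
            (skill_name, period, period_start, invocation_count,
             avg_quality_score, error_rate, avg_tokens_used, avg_duration_ms)
        SELECT
            skill_name,
            'daily',
            date(timestamp),
            COUNT(*),
            AVG(quality_score),
            -- error_rate is stored as a fraction between 0 and 1
            AVG(CASE WHEN errors_json != '[]' THEN 1.0 ELSE 0.0 END),
            CAST(AVG(tokens_input + tokens_output) AS INTEGER),
            CAST(AVG(duration_ms) AS INTEGER)
        FROM skill_invocations
        GROUP BY skill_name, date(timestamp)
    """)
    conn.commit()
```

To try it, open `sqlite3.connect(":memory:")`, run `conn.executescript(DEMO_SCHEMA)`, insert a few rows, and call `rollup_daily(conn)`. Because of the composite primary key, `INSERT OR REPLACE` makes the rollup safe to rerun.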
JSON Log Format (Portable)
{
  "logs_version": "1.0",
  "skill_name": "wedding-immortalist",
  "entries": [
    {
      "id": "uuid",
      "timestamp": "2025-01-15T14:30:00Z",
      "input": {...},
      "execution": {...},
      "output": {...},
      "quality": {
        "signals": {...},
        "score": 85,
        "computed_at": "2025-01-15T14:35:00Z"
      }
    }
  ]
}
Analytics Queries
Skill Performance Dashboard
-- Overall skill rankings
SELECT
    skill_name,
    COUNT(*) AS uses,
    AVG(quality_score) AS avg_quality,
    AVG(tokens_output) AS avg_tokens,
    SUM(CASE WHEN errors_json != '[]' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS error_rate
FROM skill_invocations
WHERE timestamp > datetime('now', '-30 days')
GROUP BY skill_name
ORDER BY avg_quality DESC;

-- Quality trend (weekly)
SELECT
    skill_name,
    strftime('%Y-%W', timestamp) AS week,
    AVG(quality_score) AS avg_quality,
    COUNT(*) AS uses
FROM skill_invocations
GROUP BY skill_name, week
ORDER BY skill_name, week;

-- Problem detection
SELECT
    skill_name,
    COUNT(*) AS failures
FROM skill_invocations
WHERE quality_score < 50
  AND timestamp > datetime('now', '-7 days')
GROUP BY skill_name
HAVING failures >= 3
ORDER BY failures DESC;
Improvement Opportunities
def identify_improvement_opportunities(skill_name, logs):
    """Analyze logs to suggest skill improvements.

    extract_follow_up_patterns, analyze_edit_patterns, and cluster_errors
    are analysis helpers that are not defined in this document.
    """
    opportunities = []

    # Pattern 1: Common follow-up questions
    follow_ups = extract_follow_up_patterns(logs)
    if follow_ups:
        opportunities.append({
            'type': 'missing_capability',
            'description': f'Users frequently ask: {follow_ups[0]}',
            'suggestion': 'Add guidance for this common need'
        })

    # Pattern 2: High edit ratio in specific output types
    edit_patterns = analyze_edit_patterns(logs)
    if edit_patterns.get('code', 0) > 0.4:
        opportunities.append({
            'type': 'code_quality',
            'description': 'Users frequently edit generated code',
            'suggestion': 'Review code examples and templates'
        })

    # Pattern 3: Repeated errors (expects (error_type, count) pairs)
    error_patterns = cluster_errors(logs)
    for error_type, count in error_patterns:
        if count >= 3:
            opportunities.append({
                'type': 'recurring_error',
                'description': f'{error_type} occurred {count} times',
                'suggestion': 'Add error handling or documentation'
            })

    return opportunities
Implementation Guide
Basic Logger Hook
# hooks/skill_logger.py
import json
import sqlite3
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

LOG_DB = Path.home() / '.claude' / 'skill_logs.db'


def log_skill_invocation(
    skill_name: str,
    user_query: str,
    output: str,
    tool_calls: list,
    duration_ms: int,
    tokens: dict,
    errors: Optional[list] = None,
) -> str:
    """Log a skill invocation to the database and return its ID."""
    invocation_id = str(uuid.uuid4())
    # UTC in the 'YYYY-MM-DD HH:MM:SS' form that SQLite's datetime() produces,
    # so the range comparisons in the analytics queries compare like with like.
    timestamp = datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S')

    conn = sqlite3.connect(LOG_DB)
    try:
        conn.execute('''
            INSERT INTO skill_invocations
            (id, skill_name, timestamp, user_query, duration_ms,
             tokens_input, tokens_output, tool_calls_json, errors_json,
             response_length)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ''', (
            invocation_id,
            skill_name,
            timestamp,
            user_query,
            duration_ms,
            tokens.get('input', 0),
            tokens.get('output', 0),
            json.dumps(tool_calls),
            json.dumps(errors or []),
            len(output),
        ))
        conn.commit()
    finally:
        conn.close()
    return invocation_id
Quality Signal Collection
def collect_quality_signals(invocation_id: str, signals: dict):
    """Update an invocation with quality signals.

    calculate_score is not defined in this document; it is assumed to be a
    scorer that accepts this signals dict (keys: accepted, edit_ratio,
    follow_up, rating) and returns a 0-100 score.
    """
    conn = sqlite3.connect(LOG_DB)
    try:
        # Update with user feedback
        conn.execute('''
            UPDATE skill_invocations
            SET user_accepted = ?,
                user_edit_ratio = ?,
                follow_up_needed = ?,
                explicit_rating = ?,
                quality_score = ?
            WHERE id = ?
        ''', (
            signals.get('accepted'),
            signals.get('edit_ratio'),
            signals.get('follow_up'),
            signals.get('rating'),
            calculate_score(signals),
            invocation_id,
        ))
        conn.commit()
    finally:
        conn.close()
Alerting & Notifications
Alert Conditions
ALERT_CONDITIONS = {
    'quality_drop': {
        'condition': 'avg_quality_7d < avg_quality_30d * 0.8',
        'message': 'Skill {skill} quality dropped 20%+ in past week',
        'severity': 'warning'
    },
    'error_spike': {
        'condition': 'error_rate_24h > error_rate_7d * 2',
        'message': 'Skill {skill} error rate doubled in past 24h',
        'severity': 'critical'
    },
    'underused': {
        'condition': 'uses_7d < uses_30d_avg * 0.5',
        'message': 'Skill {skill} usage down 50%+ this week',
        'severity': 'info'
    },
    'high_performer': {
        'condition': 'avg_quality_7d > 90 AND uses_7d > 10',
        'message': 'Skill {skill} performing excellently',
        'severity': 'positive'
    }
}
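The condition strings above describe the rules but are not directly executable. One way to evaluate them, sketched here as an assumption, is to restate each condition as a Python predicate over a per-skill metrics dict. `ALERT_CHECKS`, `ALERT_MESSAGES`, and `evaluate_alerts` are illustrative names, and the metrics are assumed to be precomputed elsewhere, with `uses_30d_avg` read as average weekly usage over the last 30 days.

```python
from typing import Callable

Metrics = dict[str, float]

# Each predicate restates one condition string from ALERT_CONDITIONS.
ALERT_CHECKS: dict[str, Callable[[Metrics], bool]] = {
    "quality_drop": lambda m: m["avg_quality_7d"] < m["avg_quality_30d"] * 0.8,
    "error_spike": lambda m: m["error_rate_24h"] > m["error_rate_7d"] * 2,
    "underused": lambda m: m["uses_7d"] < m["uses_30d_avg"] * 0.5,
    "high_performer": lambda m: m["avg_quality_7d"] > 90 and m["uses_7d"] > 10,
}

ALERT_MESSAGES = {
    "quality_drop": "Skill {skill} quality dropped 20%+ in past week",
    "error_spike": "Skill {skill} error rate doubled in past 24h",
    "underused": "Skill {skill} usage down 50%+ this week",
    "high_performer": "Skill {skill} performing excellently",
}


def evaluate_alerts(skill: str, metrics: Metrics) -> list[tuple[str, str]]:
    """Return (alert_name, message) for every condition the metrics satisfy."""
    fired = []
    for name, check in ALERT_CHECKS.items():
        if check(metrics):
            fired.append((name, ALERT_MESSAGES[name].format(skill=skill)))
    return fired
```

Plain predicates avoid calling `eval` on the condition strings, at the cost of keeping the two definitions in sync by hand.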
Anti-Patterns
"Log Everything"
Wrong: Logging complete input/output for every invocation.
Why: Privacy concerns, storage explosion, noise.
Right: Log metadata, summaries, and opt-in detailed logging.
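As a rough illustration of metadata-first logging, the sketch below records only a hash and a length for each query unless detailed logging is explicitly enabled. `summarize_query` and its `detailed` flag are hypothetical names, not part of the logger hook.

```python
import hashlib


def summarize_query(user_query: str, detailed: bool = False) -> dict:
    """Build a log record for a query, storing raw text only on opt-in."""
    record = {
        # The hash lets identical queries be grouped without storing the text.
        "query_sha256": hashlib.sha256(user_query.encode("utf-8")).hexdigest(),
        "query_length": len(user_query),
    }
    if detailed:  # opt-in detailed logging
        record["user_query"] = user_query
    return record
```

A hash is not anonymization: short or predictable queries can still be guessed, so treat it as a grouping key, not a privacy guarantee.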
"Score Once, Forget"
Wrong: Calculating quality score immediately after completion.
Why: Misses delayed signals (did code work? was it reverted?).
Right: Collect signals over time, recalculate periodically.
"Averages Only"
Wrong: Only tracking average quality scores.
Why: Hides distribution, misses failure modes.
Right: Track percentiles, failure rates, and patterns.
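To make the point concrete, here is a small sketch that reports percentiles and a failure rate alongside the mean, using Python's `statistics` module. The function name is illustrative, and the default failure threshold of 50 is borrowed from the problem-detection query.

```python
import statistics


def score_distribution(scores: list[float], failure_threshold: float = 50) -> dict:
    """Summarize quality scores beyond the mean."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to compute percentiles")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(scores, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(scores),
        "p10": cuts[9],
        "p50": cuts[49],
        "p90": cuts[89],
        "failure_rate": sum(s < failure_threshold for s in scores) / len(scores),
    }
```

For nine invocations scoring 90 and one scoring 10, the mean is 82, which looks healthy, while the failure rate of 0.1 shows that one run in ten went badly.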
"No Baseline"
Wrong: Measuring quality without establishing baselines.
Why: Can't detect improvement or regression.
Right: Establish baselines per skill, compare trends.
Output Reports
Weekly Skill Health Report
Skill Health Report - Week of 2025-01-13
Overview
- Total invocations: 247
- Average quality: 78.3 (up 2.1 from last week)
- Error rate: 4.2% (down 1.8%)
Top Performers
- wedding-immortalist - 92.1 avg quality, 18 uses
- skill-coach - 89.4 avg quality, 34 uses
- api-architect - 87.2 avg quality, 22 uses
Needs Attention
- legacy-code-converter - 52.3 avg quality (down 15%)
  - Common issue: Missing dependency detection
  - Suggested fix: Add dependency scanning step
Improvement Opportunities
- partner-text-coach: Users frequently ask for tone adjustment
- yard-landscaper: High edit ratio on plant recommendations
Integration Points
- skill-coach: Feed quality data for skill improvements
- agent-creator: Use metrics when designing new skills
- automatic-stateful-prompt-improver: Quality signals for prompt optimization
Core Philosophy: What gets measured gets improved. Skill logging transforms intuition about skill quality into actionable data, enabling continuous improvement of the entire skill ecosystem.