Ontology Engineer
Extract candidate ontology models from existing data. Build and maintain personal knowledge graphs.
Core principle: Make implicit business models in existing data explicit. Don't create from scratch.
Division of labor: Scripts handle mechanical extraction (file scanning, format conversion, table parsing). LLM handles semantic judgment (entity identification, property selection, relationship discovery, naming, cross-source merging).
Security model:
- No external API calls. The LLM running this skill (Claude, OpenClaw, etc.) IS the semantic engine. No credentials, no network endpoints, no data exfiltration paths.
- User-scoped scanning. Step 1.5 is a MANDATORY interactive checkpoint — the user reviews and approves every folder before any content is read. Nothing is analyzed without explicit confirmation.
- Local-only output. All artifacts (graph.jsonl, schema.yaml, review.md) are written to a user-specified local directory. No data leaves the machine.
- Append-only writes. Scripts only create/append files. No deletion, no modification of existing user files.
When This Skill Adds Value (and When It Doesn't)
Knowledge graphs and ontology extraction are not universally useful. Before starting, assess fit:
| Scenario | Value | Why |
|---|---|---|
| 3+ heterogeneous systems with inconsistent naming for the same concepts | High (Mode A) | Cross-system concept alignment is the core use case |
| Agent product needs factual grounding to reduce hallucination | High (Mode B/C) | Graph becomes Agent's fact base — auto-query before every response |
| 1000+ entities with dense relationships across long time spans | High | Pattern discovery humans can't do manually (churn, cross-sell, capability mapping) |
| Client consulting engagement analyzing their data landscape | High (Mode A) | Core consulting deliverable: "here's what your data assets look like" |
| Small org, <200 entities, info fits in one person's head + Excel | Low (Mode B) | Graph just re-stores what user already knows — use as PoC/capability validation only |
| Single system, no cross-system integration need | Low (Mode A) | Read the schema directly; ontology layer adds overhead without value |
Rule of thumb: If the user's reaction to the output is "I already knew all this", the graph isn't producing incremental value. Redirect to Mode A (client projects) or Agent integration.
Detailed value scenarios: references/value-scenarios.md
Operating Modes
| Mode | Input | Output | Use When |
|---|---|---|---|
| A: Database Extraction | SQL DDL, data dictionaries, Word/Excel schemas | ontology.json + review.md | Analyzing enterprise business systems |
| B: Filesystem Scanning | Local/cloud directories | graph.jsonl + schema.yaml | Building personal knowledge graph |
| C: External Data | Others' data spaces, shared drives | graph.jsonl (source=external) | Acquiring others' business models |
Mode A: Database Extraction
Three-phase workflow for extracting ontology from structured data sources.
Phase 1: SCAN
Run scripts/scan_directory.py to discover and classify files by priority (P1-P7).
python scripts/scan_directory.py "<dir>" --output scan_result.json --report
Review scan report. Process P1-P2 files first, expand as needed.
Phase 2: EXTRACT
- Convert .doc files if needed:
scripts/convert_doc.py - Extract tables from Word/Excel:
scripts/extract_tables.py - Read extracted data, apply Rules 1-7 (see analysis-rules.md)
- For text formats (.sql, .json, .yaml, .md, .csv): read directly
Phase 3: MERGE
- Cross-source entity deduplication (Rule 5)
- Relationship consolidation
- Output:
ontology.json+review.md
Detailed rules: references/analysis-rules.md (Rules 1-7) Quality checks: references/quality-checks.md Script details: references/script-operations.md Modeling decisions: references/modeling-decisions.md
Mode B/C: Knowledge Graph
Two-step pipeline for building personal knowledge graphs from file systems.
Step 1: File Indexing (script, no LLM)
python scripts/scan_filesystem.py --root /path --config namespace_rules.yaml --extract-metadata
Creates Document + Project entities in graph.jsonl. Pure mechanical operation.
Key features: Auto namespace inference, duplicate detection, .docx/.pdf metadata extraction, universal noise filtering.
Step 1.5: User Scope Confirmation (MANDATORY interactive step)
After Step 1 completes, present the scan summary to the user and ask for scope confirmation before proceeding to Step 2. The user knows which folders matter most.
Display a table of all discovered projects/namespaces with document counts, then ask:
扫描完成,发现 {N} 个项目,共 {M} 篇文档。请标记每个文件夹的优先级:
- 🔴 重点(高采样率,优先分析)
- ⚪ 普通(默认采样率)
- ⚫ 忽略(跳过,不分析)
- 或输入"全部"跳过选择,按默认策略处理所有文件夹
| # | 项目 | 文档数 | 格式分布 | 默认优先级 |
|---|------|--------|----------|-----------|
| 1 | work/myfiles | 15,617 | .doc .docx .pdf .xlsx | 🔴 重点 |
| 2 | work/classified | 1,578 | .doc .pdf .xlsx | ⚪ 普通 |
| ... | ... | ... | ... | ... |
请输入调整(如 "2=忽略, 5=重点")或 "全部" 或 "确认":
Rules:
- User can mark any folder as 重点/普通/忽略
- User can type "全部" to skip selection and use defaults
- 重点 folders get 2-3x sampling rate, 忽略 folders are skipped entirely
- Default priority is auto-inferred: human work folders=重点, AI-generated=普通, downloads/cache=忽略
- Never skip this step. Even if obvious, let the user confirm.
Step 2: Semantic Analysis (LLM, core step)
Five phases: Sampling → Document Reading → Aggregation → Cross-project Alignment → Output.
Key decisions (details in knowledge-graph-workflow.md):
- Minimum 10% coverage, 重点 folders 2-3x, 忽略 folders skip
- Structured lists (Rule 13): Files named 列表/台账/名单/登记表/清单 etc. → full extraction (every row = one entity), NOT sampling. See analysis-rules.md Rule 13.
- Dual-track extraction: Track A (named entities) + Track B (domain terms)
- Subagents must be
general-purposetype (Bash access). Never use Explore type. - Format tools: see formats-and-deps.md
- Relation semantics: Use enriched relation format with direction, cardinality, temporal range. See relation-ontology.md.
Step 3: Runtime Evolution
Agent enriches the knowledge graph during daily conversations. source.type = "runtime".
When to trigger (passive, no user action needed):
- User mentions a person by name + role/org → check graph, append if new
- User discusses a project/event with dates → append Event
- User makes a strategic decision or key insight → append Note
- User mentions a new organization/client → append Organization
How to append:
python query_graph.py search "张三" # Check if entity exists
# If not found, append to graph.jsonl:
echo '{"op":"create","ts":"...","entity":{"id":"per-NNNNN","type":"Person","graph":"core/persons","source":{"type":"runtime","conversation_id":"..."},...}}' >> graph.jsonl
Rules:
- Only append entities with concrete evidence from the conversation
- Never overwrite existing entities — only add new ones or note conflicts
- Use
source.type = "runtime"to distinguish from scan-derived entities - Keep it lightweight: 1-3 entities per conversation, not a full re-scan
Full workflow details: references/knowledge-graph-workflow.md Analysis rules (8-12): references/analysis-rules.md Format support & deps: references/formats-and-deps.md
Key Principles
- Model business concepts, not database tables. Table names ≠ object names.
- Extract then express. Make implicit models explicit, don't create from nothing.
- Experts judge. Produce candidates; final decisions belong to humans. When in doubt, flag it.
- Invest in invariants. Stable entities and relationships, not technical details.
- Handle what exists. Real projects use Word and Excel. Adapt to the data.
- Scripts extract, LLM analyzes. Mechanical extraction via Python. Semantic judgment via LLM.
- Coverage over perfection. 60% of files at moderate depth beats 3% at maximum depth.
- Generic skeleton + domain discovery. 8 core types (BFO-aligned). Domain types discovered by scanning.
- Single source of truth. All data in one graph.jsonl. Soft partition via graph/labels/source.
- Relations carry semantics. Direction, cardinality, temporal range, evidence. Not just type + target.
- Append-only evolution. Never delete entities. Deprecate, reclassify, version.
Ontology Theory References
| Reference | When to Read |
|---|---|
| modeling-decisions.md | Core type boundaries, entity vs enum, promotion judgment |
| relation-ontology.md | Relation format, core relation catalog, ternary relations |
| ontology-evolution.md | Schema versioning, entity reclassification, conflict resolution |
| constraints-and-inference.md | Type/relation constraints, inference rules, inconsistency detection |
| value-scenarios.md | When this skill adds value and when it doesn't |
Output Formats
graph.jsonl (Mode B/C)
{"op":"create","ts":"2026-01-15T10:00:00Z","entity":{"id":"per-00001","type":"Person","graph":"core/persons","labels":["employee"],"source":{"type":"scan","scan_id":"step2-r1"},"properties":{"name":"张三","roles":["项目经理"],"organizations":["某科技公司"]},"relations":[{"type":"works_at","target_id":"org-00002","direction":"forward","cardinality":"N:1","temporal":{"start":"2019-01","end":null},"confidence":"high"}],"created_at":"2026-01-15T10:00:00Z"}}
Required: id, type, graph, source, created_at. Optional: labels, properties, relations.
Relation fields: type + target_id required. Optional: direction (forward/reverse/bidirectional), cardinality (1:1/1:N/N:1/N:M), temporal ({start, end}), evidence (source entity ID), confidence (high/medium/low). See relation-ontology.md.
schema.yaml (Mode B/C)
meta:
version: "2.0"
core_types: # 8 fixed (BFO-aligned): Person, Organization, Project, Task, Document, Event, Note, Goal
domain_types: # Discovered by Step 2 Track B, grouped by domain
namespaces: # core/, work/*, personal/*, external/*, uncategorized/*
source_types: # scan | runtime | manual | email | cloud | chat
relation_schema: # Relation fields: type, target_id, direction, cardinality, temporal, evidence, confidence
relation_types: # Core relation catalog grouped by source type pair
constraints: # type_constraints (required props, enums), relation_constraints, id_pattern
inference_rules: # Transitive subsidiary, symmetric partner, inverse works_at, etc.
schema_evolution: # Version format, backward compatibility rules
ontology.json (Mode A)
{
"meta": {"generated_by": "ontology-engineer", "source_files": [], "domain": "..."},
"object_types": [{"name": "...", "english": "...", "core_properties": [], "confidence": "high|medium|low"}],
"link_types": [{"from": "A", "relation": "verb", "to": "B", "cardinality": "1:N", "evidence": "..."}],
"review_flags": [{"type": "promotion|merge|ambiguity|missing", "item": "...", "question": "..."}]
}
review.md
- Scan summary 2. Model overview 3. Object catalog 4. Relationship map 5. Review items 6. Cross-source merges 7. Data quality notes 8. Decision log
Scripts
| Script | Mode | Purpose |
|---|---|---|
scripts/scan_filesystem.py | B/C | File indexing, namespace inference, metadata extraction |
scripts/scan_directory.py | A | File discovery with P1-P7 priority classification |
scripts/convert_doc.py | A | .doc → .docx conversion |
scripts/extract_tables.py | A | Table extraction from Word/Excel |
Details: references/script-operations.md
Agent Integration
| Component | Purpose | Status |
|---|---|---|
query_graph.py | Search entities by type/name/graph/labels, traverse relations | Done |
| Runtime write | Agent appends new entities during conversation (Step 3) | Done |
| MCP Server | Expose graph as tools: search_entities, get_relations | Planned |
| Prompt injection | Agent auto-queries graph for context before handling tasks | Planned |
Query tool usage:
python query_graph.py stats # Overview
python query_graph.py search "关键词" # Search
python query_graph.py type Person --limit 20 # By type
python query_graph.py get per-00001 # Details
python query_graph.py relations per-00001 # Relations
python query_graph.py domain --limit 30 # Domain terms
python query_graph.py export Person --format csv # Export