Document Extraction Skill
Extract requirements from existing documentation sources for systematic requirement mining.
When to Use This Skill
Keywords: extract requirements, document mining, PDF requirements, transcript analysis, parse document, existing documentation, legacy requirements, competitive analysis
Invoke this skill when:
-
Mining requirements from existing documents
-
Processing meeting transcripts for requirements
-
Extracting requirements from competitor products
-
Analyzing regulatory documents for compliance requirements
-
Converting legacy documentation to structured requirements
Supported Document Types
Type Extension Extraction Method
Markdown .md Direct Read
Text .txt Direct Read
PDF .pdf Read tool (PDF support)
Word .docx Read tool
Web Page URL WebFetch tool
Meeting Notes .md, .txt Transcript patterns
Specification .md, .docx Requirement patterns
Extraction Workflow
Step 1: Document Assessment
Analyze the document to determine extraction strategy:
document_assessment: path: "{file path or URL}" type: "{detected document type}" size: "{approximate size}" structure: has_sections: true|false has_lists: true|false has_tables: true|false quality: formal_language: true|false clear_requirements: true|false needs_interpretation: true|false
Step 2: Pattern Matching
Apply requirement detection patterns:
Explicit Requirement Markers:
- "The system shall..."
- "The system must..."
- "Users should be able to..."
- "REQ-XXX:"
- Numbered requirements (1.1, 1.2, etc.)
EARS Patterns:
- "When [trigger], the [system] shall [response]"
- "While [state], the [system] shall [behavior]"
- "Where [feature], the [system] shall [behavior]"
- "If [condition], then the [system] shall [response]"
Implicit Requirement Indicators:
- "It is important that..."
- "We need..."
- "The goal is to..."
- "Users expect..."
- "Performance should..."
Step 3: Requirement Extraction
For each identified requirement:
extracted_requirement: id: REQ-{sequence} text: "{cleaned requirement statement}" source: document source_file: "{file path}" source_location: "{section/page/line}" original_text: "{exact text from document}" type: functional|non-functional|constraint|assumption confidence: high|medium|low extraction_method: explicit|pattern|inferred needs_review: true|false review_notes: "{why review needed}"
Step 4: Categorization
Categorize extracted requirements:
categories: functional: - features - behaviors - interactions non_functional: - performance - security - usability - reliability - scalability constraints: - technical - business - regulatory assumptions: - environmental - user_behavior - dependencies
Step 5: Deduplication
Identify and merge duplicate requirements:
deduplication: strategy: semantic_similarity threshold: 0.8 action: merge|flag_for_review merged_requirements: - id: REQ-merged-001 sources: [REQ-001, REQ-015] text: "{consolidated requirement}"
Document-Specific Strategies
Meeting Transcripts
transcript_extraction: focus_on: - Action items - Decisions made - Requirements discussed - Concerns raised patterns: - "We decided that..." - "The requirement is..." - "Action item:" - "TODO:" - "Need to..." speaker_context: - Note who said what - Weight by speaker role
Regulatory Documents
regulatory_extraction: focus_on: - Mandatory requirements ("shall", "must") - Prohibited actions ("shall not", "must not") - Conditional requirements ("if...then") compliance_mapping: - Reference section numbers - Note effective dates - Track version/revision
Competitor Analysis
competitor_extraction: focus_on: - Feature descriptions - User capabilities - Unique selling points output: - Feature requirements - Differentiation opportunities - Gap identification confidence: low # Based on external observation
Legacy Specifications
legacy_extraction: focus_on: - Existing requirements - System behaviors - Integration points modernization: - Update terminology - Convert to EARS format - Flag deprecated requirements
Output Format
Per-Document Output
extraction_result: source: file: "{path or URL}" type: "{document type}" extraction_date: "{ISO-8601}" confidence: high|medium|low
statistics: total_candidates: {number} extracted: {number} filtered: {number} needs_review: {number}
requirements: - id: REQ-{number} text: "{requirement}" type: functional|non-functional|constraint source_location: "{section/page}" confidence: high|medium|low original_text: "{exact source text}"
review_items: - requirement_id: REQ-{number} reason: "{why review needed}" suggestion: "{proposed action}"
metadata: sections_processed: {number} extraction_patterns_used: ["{pattern names}"]
Autonomy Levels
Guided Mode
guided_behavior: document_selection: Human selects extraction_strategy: AI suggests, human approves each_requirement: AI highlights, human confirms categorization: AI suggests, human validates
Semi-Autonomous Mode
semi_auto_behavior: document_selection: AI suggests priority, human approves list extraction_strategy: AI chooses autonomously requirements: AI extracts all, human reviews in batches categorization: AI categorizes, human spot-checks
Fully Autonomous Mode
full_auto_behavior: document_selection: AI processes all relevant extraction_strategy: AI optimizes per document requirements: AI extracts, deduplicates, categorizes output: Full extraction report for final review
Quality Indicators
High Confidence Extraction
-
Explicit requirement markers ("shall", "must")
-
EARS-pattern matches
-
Numbered requirement lists
-
Clear imperative statements
Medium Confidence Extraction
-
Implicit indicators ("should", "needs to")
-
Context-dependent interpretation
-
Partial pattern matches
-
Requires domain knowledge
Low Confidence Extraction
-
Inferred from descriptions
-
Narrative text interpretation
-
Competitive analysis
-
Assumptions based on context
Delegation
For related tasks, delegate to:
-
gap-analysis: Check extracted requirements for completeness
-
domain-research: Research unfamiliar terms or concepts
-
elicitation-methodology: Route back for technique selection
Output Location
Save extraction results to:
.requirements/{domain}/documents/DOC-{filename}-{timestamp}.yaml
Related
-
elicitation-methodology
-
Parent hub skill
-
gap-analysis
-
Post-extraction completeness checking
-
interview-conducting
-
Clarify extracted requirements with stakeholders