
Large-Scale Text Editing



Overview

This skill provides guidance for efficiently transforming large text files containing thousands to millions of lines. It covers strategies for understanding transformation requirements, designing efficient solutions (particularly with Vim macros), testing approaches, and verification techniques.

When to Use This Skill

  • Transforming CSV, TSV, or other delimited files at scale

  • Applying repetitive edits across files with millions of rows

  • Working within keystroke or operation count constraints

  • Using Vim macros, sed, awk, or similar batch processing tools

  • Pattern-based text transformations requiring regex

Approach Strategy

Phase 1: Understand the Transformation

Before writing any transformation logic:

Assess file size first - Check file size with ls -lh or wc -l before attempting to read. Avoid reading multi-million line files directly.

Sample strategically - Extract samples from multiple locations:

  • Beginning: head -n 100 input.csv > sample_head.csv

  • Middle: sed -n '500000,500100p;500100q' input.csv > sample_middle.csv (the trailing q stops sed after the range instead of scanning the rest of the file)

  • End: tail -n 100 input.csv > sample_tail.csv

Compare input and expected output - Identify all transformations needed:

  • Column reordering or removal

  • Delimiter changes

  • Case transformations

  • Whitespace handling

  • Value appending or prepending

  • Format conversions
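A quick way to enumerate the transformations needed is to diff small samples of the input and the expected output side by side. A minimal sketch; the file names and contents are stand-ins generated here so the commands run standalone:

```shell
# Stand-in samples: input uses ';' and lowercase, expected uses ',' and uppercase.
printf 'alice;90\nbob;85\n' > in_sample.csv
printf 'ALICE,90\nBOB,85\n' > exp_sample.csv

# The diff reveals both required transformations at once:
# delimiter change (';' -> ',') and case change (lower -> upper).
diff in_sample.csv exp_sample.csv || true
```

The `|| true` keeps the pipeline's exit status clean, since diff exits non-zero whenever the files differ.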

Verify structural assumptions:

  • Consistent column count across all rows

  • Presence of header rows

  • Empty lines or malformed rows

  • Special characters that might break regex patterns
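The structural checks above can be scripted with awk. A sketch, assuming a comma delimiter; a small synthetic file with one deliberately malformed row is generated so the commands run standalone:

```shell
# Stand-in CSV: header, two clean rows, and one row missing a field.
printf 'id,name,score\n1,alice,90\n2,bob\n3,carol,85\n' > sample.csv

# Count fields per row; a clean file shows exactly one distinct count.
awk -F',' '{ print NF }' sample.csv | sort | uniq -c

# Flag rows whose field count differs from the header's
# (empty lines show up here too, as 0 fields).
awk -F',' 'NR==1 { n=NF; next } NF != n { print "line " NR ": " NF " fields" }' sample.csv
```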

Phase 2: Design the Solution

When designing transformations:

Break complex transformations into discrete steps - Each step should handle one logical transformation. This improves debuggability and allows independent testing.

Choose the right tool for the scale:

  • Vim macros: Excellent for complex, multi-step transformations; efficient keystroke counting

  • sed: Fast for simple substitutions across large files

  • awk: Powerful for column manipulation and conditional logic

  • Perl/Python: For complex logic that exceeds regex capabilities

Design for efficiency:

  • Minimize the number of passes through the file

  • Use line-based operations (:%normal! in Vim) rather than iterating with explicit loops

  • Leverage built-in commands (e.g., gU for uppercase in Vim) over manual character manipulation
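As an illustration of single-pass design, the hypothetical transformation below swaps two columns and uppercases one of them in one awk invocation, rather than one pass per step (the column layout is an assumption for the example):

```shell
# Stand-in input: name,score rows.
printf 'alice,90\nbob,85\n' > in.csv

# One pass over the file: reorder the columns and uppercase the name field.
awk -F',' -v OFS=',' '{ print $2, toupper($1) }' in.csv > out.csv

cat out.csv
# 90,ALICE
# 85,BOB
```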

Document design decisions - Record why specific approaches were chosen, especially when multiple valid alternatives exist.

Phase 3: Test Incrementally

Create a test sample - Use a small subset (100-1000 lines) for initial testing:

head -n 100 input.csv > test_input.csv head -n 100 expected.csv > test_expected.csv

Test each transformation independently - Verify each macro or command produces correct output before combining.

Verify with diff - Use byte-for-byte comparison:

diff test_output.csv test_expected.csv
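diff reports line-level differences; cmp -s is a strict byte-for-byte check with a clean exit code, which is convenient in scripts. A sketch with generated stand-in files:

```shell
# Stand-in output and expected files with identical bytes.
printf 'a\nb\n' > test_output.csv
printf 'a\nb\n' > test_expected.csv

# cmp -s is silent and exits 0 only on a byte-for-byte match.
if cmp -s test_output.csv test_expected.csv; then
  echo "exact match"
else
  echo "files differ"
fi
```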

Check for edge cases in test output:

  • First and last lines transformed correctly

  • Lines with varying content lengths handled

  • Special characters preserved or transformed as expected
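The edge-case checks above can be spot-verified from the shell. A sketch; the file name matches the workflow above, and a stand-in file is generated so the commands run standalone:

```shell
# Stand-in transformed output.
printf 'HEADER\nrow1\nrow2\n' > test_output.csv

# First and last lines are the most common places a macro misfires.
head -n 1 test_output.csv
tail -n 1 test_output.csv

# Look for characters the transformation should have removed;
# grep exits 1 when nothing matches, which is the desired outcome here.
if grep -n '[[:space:]]$' test_output.csv; then
  echo "trailing whitespace found"
else
  echo "no trailing whitespace"
fi
```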

Phase 4: Execute with Safeguards

Create backups before in-place modifications:

cp input.csv input.csv.backup

Set appropriate timeouts - For million-row files, allow sufficient processing time (e.g., 2-5 minutes depending on complexity).

Monitor progress when possible - Use tools that show progress or check intermediate output.

Verify final output:

  • Confirm row count matches: wc -l output.csv

  • Run diff against expected output

  • Spot-check samples from different file locations
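Putting the safeguards together, a minimal wrapper might look like the following. The file names and the sed expression are placeholders; a tiny input is generated so the script runs end to end:

```shell
set -e
# Stand-in input file.
printf 'a\nb\nc\n' > input.csv

cp input.csv input.csv.backup            # backup before any modification

before=$(wc -l < input.csv)
# Placeholder transformation; for huge files, consider wrapping the
# command in timeout(1), e.g. timeout 300 sed ... .
sed 's/^/row_/' input.csv > output.csv
after=$(wc -l < output.csv)

# A line-preserving transformation must leave the row count unchanged.
[ "$before" -eq "$after" ] || { echo "row count mismatch" >&2; exit 1; }
echo "ok: $after rows"
```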

Vim-Specific Guidance

Macro Design Principles

  • Register allocation: Use distinct registers (a, b, c) for different transformation stages

  • Keystroke efficiency: Prefer built-in commands over character-by-character operations

  • Regex patterns: Use non-greedy patterns and explicit delimiters to avoid over-matching

Common Vim Patterns for Large Files

  • Apply macro to all lines: :%normal! @a

  • Uppercase transformation: gU with a motion, or \U in a substitution

  • Column manipulation: capture groups \(...\) with backreferences \1, \2

  • Delimiter replacement: :%s/old_delim/new_delim/g

  • Whitespace removal: :%s/\s\+//g

Escaping in Vim Scripts

When using setreg() for macro definitions:

  • Escape backslashes: write \\ for a literal backslash

  • Use \r for carriage return

  • Special characters may need double-escaping

Verification Checklist

Before considering the task complete:

  • Output file exists and is non-empty

  • Row count matches expected count

  • Byte-for-byte diff passes against expected output (if available)

  • Spot-check samples from beginning, middle, and end of file

  • Any constraints (keystroke limits, command restrictions) are satisfied

  • Tool exited with success code (exit code 0)

Common Pitfalls

  • Reading large files directly: check file size first; use head/tail/sed for sampling

  • No backup before in-place edits: create a backup copy before any modification

  • Testing only on the first few lines: sample from multiple file locations

  • Assuming uniform structure: verify structure with samples from different positions

  • Regex over-matching: use explicit delimiters and non-greedy quantifiers

  • Insufficient timeout: estimate expected processing time from the file size

  • Not verifying exit codes: check the tool's exit status after each operation
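The over-matching pitfall is easy to demonstrate with sed on illustrative data: a greedy .* runs past delimiters, while an explicit negated character class cannot cross one:

```shell
# Stand-in record with a quoted field containing the delimiter.
line='name,"quoted,value",end'

# Greedy: .* stretches to the LAST comma, swallowing the middle field.
echo "$line" | sed 's/^[^,]*,.*,/X,/'
# -> X,end

# Bounded: [^,]* stops at the first comma, touching only the first field.
echo "$line" | sed 's/^[^,]*,/X,/'
# -> X,"quoted,value",end
```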

Efficiency Considerations

When keystroke or operation counts matter:

  • Count accurately - Understand what constitutes a "keystroke" in the specific context (escape sequences, special keys)

  • Combine operations - A single regex substitution may replace multiple simpler operations

  • Use built-in commands - Native commands are typically more efficient than manual equivalents

  • Minimize redundancy - Avoid repeated file reads or redundant transformations
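Combining operations can be shown concretely. The two-pass pipeline and the single invocation below produce identical output, but the second reads the data once (the substitutions are chosen purely for illustration):

```shell
# Stand-in input.
printf 'foo bar\nbaz qux\n' > data.txt

# Two passes: the data is read and rewritten twice.
sed 's/foo/FOO/' data.txt | sed 's/bar/BAR/'

# One pass: both substitutions in a single invocation.
sed -e 's/foo/FOO/' -e 's/bar/BAR/' data.txt
```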

