manuscript-provenance

Computational provenance audit verifying that every number, table, figure, ordering, and term in a manuscript is derived from code and scripts — not manually entered. Cross-references LaTeX source against the codebase to detect hardcoded values, stale outputs, broken pipelines, and manual data entry. Companion to manuscript-review: that skill audits the document as prose; this skill audits whether the document is faithfully generated from code. Use when the user says "check provenance", "verify reproducibility", "audit my pipeline", "are my numbers from code", "check manuscript against scripts", "provenance audit", or any request to verify that manuscript content traces back to computational outputs.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "manuscript-provenance" with this command: npx skills add mathews-tom/praxis-skills/mathews-tom-praxis-skills-manuscript-provenance

Manuscript Provenance Audit

Purpose

Verify that a manuscript is a faithful rendering of computational outputs. Every number, table, figure, category label, ordering, and threshold in the document must trace to a specific script, config file, or pipeline output. Manual data entry in a manuscript is a reproducibility defect.

This skill produces a provenance map — a structured report linking each manuscript artifact to its generating code — and flags every break in the chain.

Companion skill: manuscript-review audits the document as prose (structure, argumentation, citations). This skill audits whether the document content is computationally grounded. Run both for complete pre-publication coverage.

Boundary Agreement with manuscript-review

Concern | manuscript-review | This skill (manuscript-provenance)
--- | --- | ---
Reproducibility | Does the paper describe enough to reproduce? (§6) | Does the code actually produce what the paper claims? (§1, §7)
Figures/Tables | Legible, accessible, well-formatted? (§12) | Generated by scripts, not manual entry? (§2, §3)
Rendered visuals | Readable at print scale? Floats near references? (§23) | Figure generation script produces correct format? (§3)
Hyperparameters | Listed in the paper with rationale? (§6) | Values trace to config files, not hardcoded? (§1, §8)
Code availability | Statement exists in the paper? (§17) | Repo URL valid, README accurate, pipeline works? (§11)
Terminology | Abbreviations consistent within document? (§14) | Terms match code identifiers? (§5)
Significant figures | Consistent precision within document? (§12) | Precision matches script output? (§2)
Figure format | Appropriate format for document quality? (§12) | Format generated by script, not manually exported? (§3)
Computational cost | Reported in the paper? (§7) | Values trace to benchmarking scripts? (§1)
Macro-prose coherence | Prose framing appropriate for injected value? (§24) | Value traced to code, macro manifest produced? (§4)
Cross-element consistency | Prose, captions, figures, tables mutually consistent? (§24) | All elements from same run/pipeline output? (§9)

Rule: This skill never judges prose quality. manuscript-review never opens the codebase. Each reads the other's report when available.

Integration point — Macro Manifest: This skill produces a macro manifest as part of the §4 audit: a structured list of every macro-injected value with:

  • Macro name (e.g., \bestf)
  • Resolved value (e.g., 0.847)
  • Source (script + output file that generates it)
  • Location(s) in manuscript text (file, line number, surrounding sentence)
  • Classification (TRACED / MACRO-TRACED / CONFIG-TRACED / UNTRACED / STALE)

manuscript-review's Pass 13 (Cross-Element Coherence, §24) consumes this manifest to check whether the prose surrounding each injected value is appropriate for the actual numeric value. Provenance owns "is this value computationally grounded?" Review owns "does the text wrapping this value make sense given what the value is?"

Scope

In scope:

  • Numbers, metrics, percentages in manuscript text
  • Tables (content, ordering, formatting)
  • Figures (generation scripts, data sources)
  • LaTeX macros (\newcommand, \def, \pgfmathsetmacro)
  • Terminology, mode names, mechanism labels, category names
  • Ordering of items in enumerations, tables, discussion
  • Config values (thresholds, hyperparameters, model names)
  • Pipeline completeness (raw data → final PDF)
  • Timestamp consistency (scripts vs outputs)

Out of scope:

  • Prose quality (→ manuscript-review)
  • Citation hygiene (→ manuscript-review)
  • Argumentation structure (→ manuscript-review)
  • Code quality/style (separate concern)

Inputs

This audit requires TWO artifacts:

  1. Manuscript source — LaTeX .tex files (preferred), or PDF/DOCX as fallback
  2. Codebase — the scripts, configs, and pipeline that generate manuscript content

If the user provides only one, ask for the other. LaTeX source is strongly preferred over compiled PDF — provenance auditing requires seeing the raw markup, macros, and input commands.

Workflow

Phase 1 — Inventory

1a. Manuscript Artifact Extraction

Read all .tex files (main + included via \input/\include). Extract:

  • Inline values: bare numbers in running text (percentages, counts, metrics, p-values, confidence intervals, thresholds, sizes)
  • LaTeX macros: all \newcommand, \def, \pgfmathsetmacro, and custom command definitions that carry data values
  • Tables: full content of every tabular/table environment — cell values, row/column ordering, headers
  • Figures: \includegraphics paths, caption content, referenced data
  • Input files: any \input{generated/*.tex} patterns that pull from script-generated LaTeX fragments
  • Labels and references: \label/\ref pairs for cross-referencing
  • Terminology: named modes, mechanisms, strategies, categories, method names used in prose
  • Ordered lists: any enumerated or ranked items (methods compared, features listed, results ordered)

Build an artifact registry — a flat list of every data-carrying element in the manuscript with its location (file, line number).
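A first pass over the extraction step above can be sketched with regular expressions. Both patterns are illustrative assumptions, not the skill's prescribed implementation; a robust audit needs a real LaTeX tokenizer to handle nested braces, comments, and math mode.

```python
import re

# First-pass patterns for Phase 1a extraction (illustrative only).
MACRO_DEF = re.compile(r"\\newcommand\{\\(\w+)\}\{([^{}]*)\}")
BARE_NUMBER = re.compile(r"(?<![\w\\.])\d+(?:\.\d+)?%?(?![\w.])")

def extract_artifacts(tex: str):
    """Return (macro definitions, bare numbers) found in LaTeX source."""
    return MACRO_DEF.findall(tex), BARE_NUMBER.findall(tex)
```

Each hit would then be recorded in the artifact registry together with its file and line number.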

1b. Codebase Mapping

Scan the project directory. Identify:

  • Pipeline entry points: Makefile, snakemake, dvc.yaml, run.sh, main.py, or equivalent orchestration
  • Analysis scripts: files that produce numbers, tables, figures
  • Config files: config.toml, config.yaml, .env, params.yaml, hyperparameter files
  • Output directories: where scripts write results (results/, output/, figures/, tables/, generated/)
  • Generated LaTeX fragments: .tex files in output directories that scripts produce for \input inclusion
  • Data files: CSVs, JSON, HDF5, and pickle files through which intermediate results flow

Build a source registry — a flat list of every code artifact that produces or configures manuscript content.

Phase 2 — Provenance Tracing

For each entry in the artifact registry, attempt to establish a provenance chain: manuscript value → generated output → script → input data/config.

2a. Value Provenance

For every number in the manuscript:

  1. Search for the value in script outputs (logs, result files, generated LaTeX)
  2. Trace the output back to the script that produces it
  3. Verify the script reads from data/config (not hardcoded)
  4. Record the full chain or flag as UNTRACED

Classification:

  • TRACED — full chain from manuscript value to generating code
  • MACRO-TRACED — value defined in a LaTeX macro that is generated by a script
  • CONFIG-TRACED — value comes from a config file read by scripts
  • UNTRACED — no provenance chain found; manually entered
  • STALE — provenance chain exists but output is older than generating script
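The decision logic behind these labels can be sketched as a pure function, collapsing the traced subtypes into TRACED (how the (text, mtime) pairs are gathered from disk is left out, and is an assumption about project layout):

```python
def classify_value(value: str, outputs, script_mtime: float) -> str:
    """Phase 2a sketch. outputs: iterable of (text, mtime) pairs read
    from candidate output files (logs, result files, generated LaTeX).
    MACRO-TRACED and CONFIG-TRACED refine TRACED by where the hit lives."""
    hits = [mtime for text, mtime in outputs if value in text]
    if not hits:
        return "UNTRACED"   # no provenance chain found
    if max(hits) < script_mtime:
        return "STALE"      # newest matching output predates the script
    return "TRACED"
```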

2b. Table Provenance

For each table:

  1. Is the table content generated by a script (CSV → LaTeX, or direct LaTeX generation)?
  2. Is the row/column ordering determined by code (sorted by metric, alphabetical, grouped by category) or manually arranged?
  3. Do header labels match code-defined names?
  4. Are formatting choices (bold for best, significant figures) applied by code?

Classification:

  • GENERATED — entire table produced by script
  • PARTIAL — some cells generated, some manual
  • MANUAL — no generation script found
  • ORDER-MANUAL — content generated but ordering is manually set
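A minimal cell-level check for the GENERATED/PARTIAL/MANUAL split, under the assumption that the table is meant to come from a CSV the generation script consumes:

```python
import csv
import io

def table_cells_traced(latex_cells, csv_text):
    """Phase 2b sketch: every data cell in the LaTeX table should appear
    in the CSV that the generation script consumes."""
    reader = csv.reader(io.StringIO(csv_text))
    csv_values = {cell.strip() for row in reader for cell in row}
    untraced = [c for c in latex_cells if c not in csv_values]
    if not untraced:
        return "GENERATED"
    return "PARTIAL" if len(untraced) < len(latex_cells) else "MANUAL"
```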

2c. Figure Provenance

For each figure:

  1. Does a script produce the exact file referenced by \includegraphics?
  2. Does the script use a deterministic seed for reproducibility?
  3. Is the figure output path in the script consistent with the LaTeX reference?
  4. Are figure parameters (colors, labels, axis ranges) set in code or manually edited post-generation?

Classification:

  • GENERATED — script produces the exact file
  • POST-EDITED — script generates base figure, but manual edits detected (e.g., Illustrator metadata, different checksum than script output)
  • MANUAL — no generating script found
  • STALE — generating script modified after figure file
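POST-EDITED detection can lean on checksums, as one hedged sketch: this only works if the generation script is deterministic (fixed seed, pinned library versions), since a non-deterministic script produces a mismatch even without manual edits.

```python
import hashlib

def is_post_edited(script_output: bytes, manuscript_figure: bytes) -> bool:
    """Flag POST-EDITED: the figure shipped with the manuscript differs
    byte-for-byte from what the generation script produces."""
    digest = lambda blob: hashlib.sha256(blob).hexdigest()
    return digest(script_output) != digest(manuscript_figure)
```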

2d. Terminology Provenance

For each named mode, mechanism, category, or method label:

  1. Is the term defined in code (enum, constant, config key, class name)?
  2. Does the manuscript term match the code term exactly?
  3. If the manuscript uses a display-friendly name, is there an explicit mapping in code or config?

Classification:

  • CODE-DEFINED — term matches code definition
  • MAPPED — explicit code→display mapping exists
  • UNMAPPED — term appears in manuscript but not in code
  • INCONSISTENT — term appears in both but differs (e.g., code says greedy_search, manuscript says "Greedy Search" in some places and "greedy approach" in others)
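One possible ordering of the §2d checks, sketched below; the display_map argument assumes the project keeps an explicit code-identifier to display-name mapping in config, which is not guaranteed:

```python
def classify_term(term, code_terms, display_map):
    """Phase 2d sketch: classify a manuscript term against code terms."""
    if term in code_terms:
        return "CODE-DEFINED"
    if term in display_map.values():
        return "MAPPED"
    normalized = term.lower().replace(" ", "_").replace("-", "_")
    if normalized in code_terms:
        return "INCONSISTENT"  # same concept, drifted casing or spelling
    return "UNMAPPED"
```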

2e. Ordering Provenance

For each ordered list, ranked comparison, or sequenced enumeration:

  1. Does code determine the ordering (sort by metric, alphabetical, enum order)?
  2. Does the manuscript ordering match the code-determined order?
  3. Are there items in the manuscript list not present in code output, or vice versa?

Classification:

  • CODE-ORDERED — ordering matches code output
  • MANUAL-ORDER — ordering differs from code output or no ordering logic in code
  • SUBSET-MISMATCH — manuscript lists different items than code produces
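The §2e comparison reduces to a three-way check between the manuscript's list and the order the pipeline emits, sketched here:

```python
def classify_ordering(manuscript_items, code_items):
    """Phase 2e sketch: compare an ordered manuscript list with the
    order the pipeline emits."""
    if set(manuscript_items) != set(code_items):
        return "SUBSET-MISMATCH"
    if manuscript_items == code_items:
        return "CODE-ORDERED"
    return "MANUAL-ORDER"
```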

Phase 3 — Infrastructure Audit

3a. LaTeX Macro Hygiene

  • Every data-carrying macro should be generated by a script, not hand-typed in the preamble
  • Pattern to detect: \newcommand{\someMetric}{42.7} defined directly in .tex files (bad) vs \input{generated/metrics.tex} where that file is script output (good)
  • Flag macros whose values appear nowhere in script outputs
  • Flag macros defined in main .tex files that carry numeric/data values
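In source, the bad and good patterns look like this (the macro name \meanAcc is hypothetical):

```latex
% BAD: data value hand-typed in the preamble
\newcommand{\meanAcc}{42.7}

% GOOD: value injected from a script-generated fragment
\input{generated/metrics.tex}  % written by the pipeline, never edited by hand
```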

3b. Pipeline Completeness

  • Does a single command reproduce all manuscript artifacts from raw data?
  • Is the pipeline documented (Makefile, README, CI config)?
  • Are intermediate steps cached or do they require full re-execution?
  • Are random seeds fixed for reproducibility?
  • Are software versions pinned (requirements.txt, environment.yml, lock files)?
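A single-command pipeline of the kind §3b asks for might look like this Makefile sketch; all file and script names are assumptions modeled on the examples elsewhere in this document:

```make
all: paper.pdf

results/metrics.json: scripts/run_experiments.py config/params.yaml
	python scripts/run_experiments.py --config config/params.yaml

generated/metrics.tex: scripts/generate_latex_macros.py results/metrics.json
	python scripts/generate_latex_macros.py

paper.pdf: paper.tex generated/metrics.tex
	latexmk -pdf paper.tex
```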

3c. Config/Code Separation

  • Are hyperparameters, thresholds, model names in config files?
  • Are file paths relative (portable) or absolute (fragile)?
  • Are credentials, API keys, or machine-specific paths absent from committed code?
  • Is there a single config entry point or are settings scattered across scripts?

3d. Stale Output Detection

  • Compare modification timestamps: script vs its output files
  • Flag outputs that are older than their generating scripts (stale)
  • Flag outputs with no corresponding script (orphaned)
  • Flag scripts with no corresponding output (dead code or unrun)
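The three §3d flags fall out of simple set and timestamp comparisons. Pairing scripts to outputs by a shared logical name is an assumption for the sketch; real pipelines need an explicit map (Makefile, dvc.yaml).

```python
def timestamp_audit(scripts, outputs):
    """Phase 3d sketch. scripts and outputs map a logical name to an
    mtime (e.g. from os.path.getmtime)."""
    shared = sorted(scripts.keys() & outputs.keys())
    return {
        "stale": [n for n in shared if outputs[n] < scripts[n]],
        "orphaned": sorted(outputs.keys() - scripts.keys()),  # output, no script
        "unrun": sorted(scripts.keys() - outputs.keys()),     # script never run
    }
```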

3e. Version Pinning

  • Are dependencies locked (requirements.txt with versions, conda environment.yml, poetry.lock, package-lock.json)?
  • Are data versions tracked (DVC, git-lfs, data checksums)?
  • Is the manuscript itself versioned alongside code (same repo, tagged releases)?

Phase 4 — Cross-Reference and Manifest Generation

4a. Macro Manifest Generation

Produce the macro manifest — the primary handoff artifact to manuscript-review. For every data-carrying macro identified in Phase 1a and traced in Phase 2a:

Macro: \bestf
Value: 0.847
Source: results/metrics.json → scripts/generate_latex_macros.py → generated/metrics.tex
Locations:
  - paper.tex:142 — "achieving an F1 score of \bestf{}"
  - paper.tex:287 — "The \bestf{} result represents a substantial improvement"
  - abstract.tex:8 — "...with \bestf{} F1 score"
Classification: MACRO-TRACED

Also include every bare number (not a macro) found in Phase 1a that carries data (metrics, counts, parameters) — these are values that SHOULD be macros but aren't:

Bare value: 50
Location: paper.tex:198 — "convergence after 50 epochs"
Should-be-macro: YES — this is a training parameter, should trace to config
Classification: UNTRACED (no macro, no provenance)

Save the manifest as [manuscript-name]-macro-manifest.json alongside the provenance report. This file is consumed by manuscript-review Pass 13 (Cross-Element Coherence) to verify prose-value appropriateness.

4b. Cross-Reference with manuscript-review

If a manuscript-review report exists for this manuscript, load it and:

  • Map UNTRACED values to manuscript-review §6 (Methodology) and §7 (Results) findings — provenance gaps often co-occur with reproducibility concerns
  • Flag terminology inconsistencies as potential §14 (Abbreviations) or §15 (Notation) issues in the manuscript-review framework
  • Feed HIGH-priority provenance issues as §6/§7 failures
  • Feed macro manifest into manuscript-review §24 (Cross-Element Coherence) findings — macro values whose surrounding prose uses inappropriate qualitative language ("marginal" for 14.3%, "dramatic" for 0.3%) are §24 failures

If no manuscript-review report exists, recommend running it as a companion audit and note that the macro manifest is available for its Pass 13.

Phase 5 — Report Generation

Load references/checklist.md and references/report-template.md.

Read references/checklist.md
Read references/report-template.md

Generate the provenance report following the template structure:

  1. Provenance Summary — overall score, breakdown by category
  2. Provenance Map — each manuscript artifact linked to its source
  3. Defect Registry — every UNTRACED, STALE, MANUAL, INCONSISTENT finding
  4. Infrastructure Assessment — pipeline, config, versioning status
  5. Remediation Queue — prioritized fixes
  6. Checklist Status — full checklist with pass/fail per checkpoint

Phase 6 — Output

Save two files in the manuscript directory:

  1. [manuscript-name]-provenance-report.md — the full provenance report
  2. [manuscript-name]-macro-manifest.json — the structured macro manifest for consumption by manuscript-review Pass 13

The macro manifest JSON structure:

{
  "macros": [
    {
      "name": "\\bestf",
      "value": "0.847",
      "source_chain": "results/metrics.json → scripts/gen_macros.py → generated/metrics.tex",
      "locations": [
        {
          "file": "paper.tex",
          "line": 142,
          "context": "achieving an F1 score of \\bestf{}"
        },
        {
          "file": "paper.tex",
          "line": 287,
          "context": "The \\bestf{} result represents a substantial improvement"
        }
      ],
      "classification": "MACRO-TRACED"
    }
  ],
  "bare_numbers": [
    {
      "value": "50",
      "location": {
        "file": "paper.tex",
        "line": 198,
        "context": "convergence after 50 epochs"
      },
      "section": "methodology",
      "should_be_macro": true,
      "rationale": "Training parameter — should trace to config",
      "classification": "UNTRACED"
    }
  ]
}

Present to the user:

  • Provenance coverage percentage (TRACED / total artifacts)
  • Count of UNTRACED / STALE / MANUAL findings by severity
  • Count of bare numbers that should be macros
  • Top 5 remediation actions
  • Pipeline completeness verdict
  • Note that macro manifest is available for manuscript-review Pass 13

Severity Classification

  • CRITICAL — Value in manuscript has no provenance chain AND is a key result (main finding, abstract metric, table headline number). This means the paper's core claims cannot be verified from code.

  • HIGH — Value/table/figure is untraced or stale, and appears in results or methodology sections. Reproducibility gap.

  • MEDIUM — Terminology mismatch, manual ordering, partial table generation, config values hardcoded in scripts. Maintenance and consistency risk.

  • LOW — Minor issues: display-name mapping missing but terms are close, non-critical figures without generation scripts, cosmetic post-editing of generated figures.

Core Principles

  • Binary provenance. Every artifact is either traced or not. No "partially reproducible" — partial means broken.

  • Code is truth. When manuscript and code disagree, the manuscript is wrong until proven otherwise. Flag the disagreement; do not assume the manuscript author "meant to" override code output.

  • Macros over magic numbers. Every data value in LaTeX should be a macro. Every macro should be generated. No exceptions for "obvious" values.

  • Pipeline as proof. If make (or equivalent) does not produce the PDF from raw data, the manuscript is not reproducible. Partial pipelines get partial credit, not a pass.

  • Config is not code. Hyperparameters, thresholds, model names, file paths — all belong in config files, not scattered through script bodies.

  • Ordering is data. The sequence of items in a table or enumeration is an assertion. It must come from code (sort order, enum definition), not from the author's sense of what "looks right."

  • Timestamps matter. A figure generated last month from a script modified yesterday is suspect. Stale outputs are provenance failures.

  • Companion, not replacement. This audit checks computational grounding. manuscript-review checks document quality. Both are needed. Neither subsumes the other.

Example Invocation Patterns

User says any of:

  • "Check provenance"
  • "Are my numbers from code"
  • "Audit my pipeline"
  • "Verify reproducibility"
  • "Check manuscript against scripts"
  • "Provenance audit"
  • "Are my tables generated"
  • "Do my figures come from scripts"
  • "/manuscript-provenance"

All trigger this skill.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
