# Minimize Issue Reproduction
Fetch a GitHub issue, evaluate whether it has a reasonable repro, check if it still reproduces, and systematically minimize the repro to the smallest possible self-contained script.
## Tools

Assume the current environment is correct and run `python` directly. Only use `conda run -n <env>` for version bisection (step 5a), where you need to temporarily use a different environment. Use the Bash tool's `timeout` parameter to enforce timeouts when running repro scripts.
- `gh issue view <NUMBER> --repo pytorch/pytorch` to fetch the issue body
- `gh issue view --comments <NUMBER> --repo pytorch/pytorch` to fetch comments
- `python <script>` to run repro scripts
- `gh issue comment <NUMBER> --repo pytorch/pytorch --body <BODY>` to comment
- `gh issue edit <NUMBER> --repo pytorch/pytorch --add-label <LABEL>` to add labels
Multiple `gh issue edit` flags can be combined in a single command (e.g. `--add-label "bug,help wanted" --add-assignee "@me"`). Prefer batching edits into one command to minimize API calls and reduce the chance of auto-subscribing to notifications.
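The batching guidance above can be sketched as a small argv builder. This is a hypothetical helper for illustration, not part of any real tooling:

```python
# Hypothetical helper: batch several issue modifications into a single
# `gh issue edit` invocation instead of one API call per change.
def build_issue_edit(number, repo, add_labels=(), remove_labels=(), assignees=()):
    cmd = ["gh", "issue", "edit", str(number), "--repo", repo]
    if add_labels:
        cmd += ["--add-label", ",".join(add_labels)]
    if remove_labels:
        cmd += ["--remove-label", ",".join(remove_labels)]
    for assignee in assignees:
        cmd += ["--add-assignee", assignee]
    return cmd

# One batched call instead of three separate edits:
cmd = build_issue_edit(12345, "pytorch/pytorch",
                       add_labels=("bug", "help wanted"), assignees=("@me",))
```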
## Preserving notification subscription state

Modifying an issue (commenting, adding labels) auto-subscribes you to notifications. Use `tools/stale_issues.py` to save and restore subscription state:
- Before the first modification, save the current state. This can be run in parallel with fetching the issue body/comments, but it must complete before any `gh issue edit`, `gh issue comment`, or `gh issue close` command is executed: `python tools/stale_issues.py subscription save <NUMBER>`
- After the last modification, restore the saved state — unless the issue was closed (via `gh issue close`), in which case skip the restore so the user stays subscribed to follow any responses to the closure. The restore must be the very last GitHub API call — do not run it in the background or in parallel with any `gh issue edit` or `gh issue comment` commands: `python tools/stale_issues.py subscription restore <NUMBER>`
If either the save or restore command fails, warn the user and continue without the save/restore mechanism.
Important: Never run `gh issue edit`, `gh issue comment`, `gh issue close`, or subscription save/restore commands in the background. These must all run in the foreground so their completion can be verified before proceeding. If commenting or editing fails because the issue is locked, report this to the user and skip the modification.
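The ordering rules above (save first, every modification in the foreground, restore last, and skipped when the issue was closed) can be sketched as follows. The `modify_issue` wrapper is hypothetical; only the `tools/stale_issues.py` subcommands come from this document:

```python
import subprocess

def run_foreground(cmd) -> bool:
    """Run a command to completion in the foreground; True iff it succeeded."""
    return subprocess.run(cmd).returncode == 0

# Hypothetical wrapper showing the required ordering; `modifications` would be
# the actual `gh issue edit` / `gh issue comment` argv lists.
def modify_issue(number, modifications, closed=False):
    save = ["python", "tools/stale_issues.py", "subscription", "save", str(number)]
    restore = ["python", "tools/stale_issues.py", "subscription", "restore", str(number)]
    saved = run_foreground(save)          # must finish before any modification
    if not saved:
        print("warning: subscription save failed; continuing without save/restore")
    for cmd in modifications:             # each runs (and is verified) in the foreground
        if not run_foreground(cmd):
            print(f"warning: {cmd} failed (locked issue?); skipping")
    if saved and not closed:              # restore is the very last GitHub API call
        run_foreground(restore)
```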
## Security review checklist

Before running any repro code, check for the following concerns:
- Network requests to untrusted URLs (`requests`, `urllib`, `curl`, `wget`)
- File operations outside `/tmp/`
- Shell command execution (`os.system`, `subprocess`, `eval`, `exec`) — but `subprocess` used to launch `torchrun` or `mp.spawn` for distributed repros is expected and not a concern
- Downloading or loading external files (model weights, pickled objects, data files) — especially `torch.load` on untrusted `.pt`/`.pth` files
- Obfuscated code (base64-encoded strings, encoded bytes, unusual escapes)
- Package installation (`pip install`, `conda install`)
- Environment variable manipulation that could affect the host system — but setting `MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`, `CUDA_VISIBLE_DEVICES`, or other standard PyTorch/CUDA env vars is expected and not a concern
If any of these are present, explain the concern to the user and ask whether to proceed, skip, or modify the repro to remove the risky parts. If the user chooses to skip, still refresh the `triaged` label timestamp (remove and re-add, or just add if not present) before reporting that the analysis is finished.
Even if the repro passes the checklist above, check whether the author of the repro code is a PyTorch collaborator by running `python tools/stale_issues.py collaborator-check <username>`. If the command exits with a non-zero status (user is not a collaborator), show the repro code to the user and ask them to verify it is safe to run before executing it.
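As a rough first pass over the checklist, a simple pattern scan can surface obvious concerns before a manual read. This sketch is illustrative only and deliberately not exhaustive; obfuscated code can evade it, so it complements rather than replaces reading the repro:

```python
import re

# Illustrative (not exhaustive) patterns for the checklist above.
RISKY_PATTERNS = {
    "network access": r"\b(requests|urllib|curl|wget)\b",
    "shell execution": r"\b(os\.system|subprocess|eval\(|exec\()",
    "untrusted load": r"\btorch\.load\b",
    "obfuscation": r"\b(base64|b64decode)\b",
    "package install": r"\b(pip|conda) install\b",
}

def flag_concerns(source: str) -> list[str]:
    """Return the checklist categories matched by the repro source."""
    return [name for name, pat in RISKY_PATTERNS.items() if re.search(pat, source)]

flag_concerns("x = torch.load(urllib.request.urlopen(url))")
# flags "network access" and "untrusted load"
```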
## Steps

1. Fetch the issue

Fetch the issue body and comments in parallel. Identify the reported repro script and error. If multiple repros are present, prefer the most recent one from the original poster. If a commenter has posted a strictly shorter and more self-contained repro that doesn't require additional context from the issue description, prefer that one. Note which repro you selected. If the repro code is only present in screenshots or images rather than copyable text, stop and report this to the user.
2. Check if the issue is actionable

Before investing effort in reproduction, check whether the issue is actionable. Do not proceed to later steps if any of the following apply:
- Already closed or resolved in comments. Report to the user and stop. Do not modify labels on closed issues.
- Duplicate of another issue (linked or obviously the same bug).
- Not a bug report (feature request, question, discussion, refactoring / code cleanup task). If the issue is a feature request and doesn't already have the `feature` label, add it via `gh issue edit`. If the issue is a better-engineering / refactoring task and doesn't already have the `better-engineering` label, add it via `gh issue edit`. If the issue includes a repro script that demonstrates the current behavior, apply the security review checklist first, then run it to verify the behavior persists. If the repro needed to be modernized (e.g. updated imports for renamed APIs), or if you verified that the behavior still persists, comment on the issue with the findings and updated repro. After adding any applicable labels (and optionally running/commenting on a repro), report the analysis to the user and do not proceed to later steps.
- Tracking/meta issue (umbrella issue tracking multiple bugs, burn-down lists, improvement proposals without a specific repro). If the issue doesn't already have the `tracker` label, add it via `gh issue edit`. Then stop and report to the user.
- Requires unavailable hardware (specific GPU models, TPUs, multi-node) with no path to simplify. Note: CUDA is available on the current machine, so single-GPU CUDA repros can be run directly.
For non-actionable issues that are old (more than ~1 year), have no assignees, no recent progress, and are underspecified or lack concrete motivation, suggest closing them as "not planned" (`gh issue close --reason "not planned"`) with a comment explaining the rationale. Ask the user before closing.
If the issue is not actionable and no GitHub-visible modification was made (no label added, no comment posted — saving subscription state does not count), refresh the `triaged` label to update the issue's "last updated" timestamp. A single `gh issue edit` with both `--remove-label` and `--add-label` doesn't work because the remove and add cancel each other out. Instead, chain both edits in a single Bash tool call: `gh issue edit ... --remove-label triaged && gh issue edit ... --add-label triaged`. If the issue doesn't have the `triaged` label, just add it.
Then summarize why the issue is not actionable. If a label or other update was made, just report that the analysis is finished. Only ask the user how to proceed if no update was made and the situation is ambiguous. Always ask the user before closing an issue.
3. Analyze the repro

Evaluate whether the issue has a reasonable repro:
- Is there a code snippet that can be run?
- Are the dependencies available (CUDA, distributed, specific hardware)? Note: `torch.distributed` repros often don't require special hardware — they can be launched with `torchrun --nproc_per_node=1` or `mp.spawn` on a single machine.
- Is the expected error described?
- Is the repro self-contained or does it need external data/models?
If there is no repro code at all, or the issue is missing critical info (no error message, no description of expected vs actual behavior), add the `needs reproduction` label (if not already present) via `gh issue edit`, then stop and report to the user. Do not attempt to write a repro from the description without being asked.
Conversely, if the issue already has the `needs reproduction` label but does have a valid repro, remove the label via `gh issue edit --remove-label "needs reproduction"`.
Apply the security review checklist (see above) to the repro before running it.
If the repro requires third-party packages that are not installed (e.g. `transformers`, `torchvision`, `numpy`), stop and ask the user how to proceed rather than installing them yourself.
Summarize your assessment before proceeding.
4. Check for recent verification

If a comment from the last six months already confirms the issue still reproduces (with a matching error and a reasonable repro), stop and ask the user whether they want to re-verify or skip ahead to minimization.
5. Check if it still reproduces

Extract the repro code into a temporary file under `/tmp/` and run it with a timeout of 120 seconds. For repros involving `torch.compile` that call compile multiple times in the same process, add `torch._dynamo.reset()` between invocations to reset in-memory Dynamo state. This is unnecessary for scripts that compile once and exit.
If the script times out, consider whether a hang is the reported bug or an unrelated issue, and report to the user.
Record the PyTorch version before running (`python -c "import torch; print(torch.__version__)"`) for inclusion in the report.
Check both the exit code and output to determine the result. An exit code > 128 indicates the process was killed by a signal (e.g. segfault = 139, OOM kill = 137) — this is a valid crash reproduction even without a Python traceback. If the repro crashes with a CUDA out-of-memory error and OOM is not the reported bug, try reducing tensor sizes before concluding it doesn't reproduce. For correctness bugs (wrong numerical results rather than crashes), the repro should include an assertion that fails when the bug is present. If the original repro only prints output without asserting, add a simple assertion based on the expected behavior described in the issue (e.g. `assert torch.allclose(actual, expected)`) so the repro has a clear pass/fail signal.
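A sketch of the exit-status logic above. POSIX shells report a signal death as 128 + signum, while Python's `subprocess` reports it as a negative code, so both forms are normalized:

```python
import signal

def classify_exit(returncode: int) -> str:
    """Classify a repro's exit status.

    A shell reports death-by-signal as 128 + signum (segfault -> 139, OOM
    kill -> 137); subprocess.run reports it as -signum. Normalize both.
    """
    if returncode == 0:
        return "passed"
    signum = -returncode if returncode < 0 else (returncode - 128 if returncode > 128 else None)
    if signum is not None:
        return f"killed by {signal.Signals(signum).name}"
    return "failed with Python-level error"

classify_exit(139)  # "killed by SIGSEGV": a valid crash repro even with no traceback
classify_exit(137)  # "killed by SIGKILL": typical of an OOM kill
```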
If the result is inconsistent across runs, run 3-5 times to assess flakiness. For non-deterministic bugs, try setting `PYTHONHASHSEED=0` and a fixed `torch.manual_seed` to stabilize reproduction. Report the success/failure ratio to the user. Flaky repros are still valid bugs — note the flakiness in the issue comment (step 8) and include the success/failure ratio.
Three possible outcomes (to determine which outcome applies, match on the exception class and a distinctive substring of the error message — the substring should be specific enough to identify the bug, e.g. `RuntimeError: expected scalar type Float` rather than just `RuntimeError`):
- Same error as reported: the bug still reproduces. Continue to step 6.
- No error (passes): the bug may have been fixed. Try to identify the fixing PR (see step 5a), then report to the user. The issue will be closed in step 8.
- Different error: distinguish between setup issues (missing import, renamed API) that can be fixed and genuinely different bugs. If the error is due to API changes between the reported version and the current version (renamed functions, moved modules, changed signatures), adapt the repro to use the current API while preserving the original intent. If the error is unclear, consider re-running with `TORCH_LOGS=+dynamo` or other relevant logging flags for more diagnostic output. Report genuinely different errors to the user.
5a) Identify when and how it was fixed
When the bug no longer reproduces, try to determine which version fixed it and which PR introduced the fix.
Version bisection: Check if conda environments named `pytorch-<version>` (e.g. `pytorch-2.6`, `pytorch-2.8`) are available (`conda env list | grep pytorch-`). If they exist, binary-search across them to find the first version where the bug is fixed. Run from `/tmp` in a subshell and clear `PYTHONPATH` to avoid picking up the local source tree (which would cause `torch._C` import errors): `(cd /tmp && PYTHONPATH= conda run -n pytorch-<version> python /tmp/repro....py)`. To speed up bisection, pick 2-3 evenly-spaced probe points from the candidate range and test them in parallel each round (e.g. if candidates are 2.2 through 2.8, test 2.4 and 2.6 simultaneously to split the range into thirds). If no versioned conda environments are available, skip bisection and just report that the bug no longer reproduces on the current version.
PR identification: Try to find the specific PR that fixed the bug. Use version control blame on the relevant fix code to find the changeset, then look up the commit message for the PR number (format `(#NNNNN)`). Alternatively, search version control history for commits touching the relevant file with a related keyword. If this doesn't yield a clear answer quickly, just report the version — don't spend extra time on PR identification.
6. Minimize the repro

Save the original working repro to `/tmp/repro_<issue_number>_original.py` before making any changes.
First, assess whether the repro is already reasonably minimal. Only minimize if the repro has significant unnecessary complexity (large models, unused code paths, unnecessary dependencies, etc.). When counting complexity, don't count irreducible boilerplate — e.g. a tensor subclass definition that only contains the required dunder methods (`__new__`, `__init__`, `__torch_dispatch__`, `__tensor_flatten__`, `__tensor_unflatten__`) is not reducible even if it's 20+ lines. Focus on whether the trigger code and model/setup complexity can be meaningfully reduced. If the only possible "simplifications" are cosmetic (inlining variables, removing `repr`), skip minimization.
Systematically reduce the repro by testing whether each simplification still triggers the same error. Use a shorter timeout (30-60 seconds) during minimization since simplified repros should run faster. Run multiple candidate simplifications in parallel when they are independent (i.e. they modify non-overlapping parts of the code and neither depends on what the other removes).
Reduction strategies (apply in roughly this order):
- Remove unnecessary imports and setup (distributed init, env vars, logging)
- Shrink the model (replace large modules with minimal equivalents)
- Remove the class/module wrapper if a bare function suffices
- Reduce tensor sizes (large dims → small dims like 4 or 8)
- Remove device/dtype requirements (try CPU and float32 first)
- Simplify the computation (replace complex ops with minimal ones that still trigger the bug)
- Remove unnecessary control flow (branches, loops, conditions)
- Try simpler backends (e.g. `aot_eager` instead of `inductor`) if the bug is not backend-specific
After each round, verify the error still reproduces — same exception class and a distinctive error message substring as described in step 5 (minor traceback differences are fine). For correctness bugs, preserve the assertion that demonstrates the wrong result and verify it still shows the same incorrect behavior. When merging multiple successful parallel simplifications, verify the combined result still reproduces since independent simplifications can interact. Stop minimizing when the repro is under ~20 lines of non-blank non-import code, or when two consecutive rounds (where a round is one full pass through the applicable reduction strategies) fail to simplify further.
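The verify-each-simplification loop, with the two-fruitless-rounds stopping rule, might look like this sketch; `reproduces` stands in for running a candidate script and matching the exception class plus distinctive substring:

```python
# One or more minimization rounds. `simplify_fns` are candidate reductions
# (remove setup, shrink tensors, ...); each reduction is kept only after the
# simplified script is verified to still trigger the same error.
def minimize(script, simplify_fns, reproduces):
    rounds_without_progress = 0
    while rounds_without_progress < 2:   # stop after two fruitless rounds
        changed = False
        for simplify in simplify_fns:
            candidate = simplify(script)
            if candidate != script and reproduces(candidate):
                script = candidate       # keep only verified reductions
                changed = True
        rounds_without_progress = 0 if changed else rounds_without_progress + 1
    return script
```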
7. Apply trivial fixes

If the analysis reveals a trivial fix (e.g. removing a stale `xfailIfTorchDynamo` or `expectedFailure` annotation from a test because the underlying issue is fixed), report the fix to the user and ask whether to apply it. Do not modify source files without the user's approval. In step 8, mention the fix regardless of whether it was applied or declined.
8. Report findings

If the repro was minimized, save it to `/tmp/repro_<issue_number>.py` so the user can run it directly.
Present findings to the user including:
- Whether the bug still reproduces (and on what PyTorch version)
- The fixing PR, if identified (only if the bug no longer reproduces)
- The minimized repro script (only if we minimized it)
- The necessary conditions to trigger the bug
- Any trivial fix identified in step 7 (whether applied or not)
- Recommended next action (e.g. "still a valid bug", "appears fixed", "needs more info from reporter")
After presenting findings, always comment on the issue with the results. Keep the comment concise — don't repeat information already on the issue. Only include a repro if it was materially changed from the original (e.g. minimized, modernized imports, fixed to run on current API). Only include trigger conditions if they are new findings not already discussed in prior comments. If the only finding is "still reproduces" or "no longer reproduces", a short comment is sufficient.
This issue [still reproduces / no longer reproduces] on PyTorch <version>.
[If fixed and PR identified:] Fixed by #NNNNN.
[If minimized or modernized — only include repro if changed from original:] Minimized repro:
```python
<repro script>
```
[Only if new findings about trigger conditions:] All of the following are necessary to trigger the bug:
- <condition 1>
- <condition 2>
(Analysis done by Claude.)
If the bug no longer reproduces, after commenting close the issue: `gh issue close <NUMBER> --repo pytorch/pytorch --reason completed --comment "Closing as this was fixed in PyTorch <version>."`. Do not ask the user before closing — fixed bugs should always be closed.
After the last GitHub modification, restore the notification subscription state (see "Preserving notification subscription state" above) — unless the issue was closed, in which case skip the restore.