Defect Handling & Debugging Playbook
Purpose
This playbook defines a repeatable process for handling defects and debugging issues in a controlled, evidence-driven way.
In AI-assisted development, debugging can easily become chaotic (trial-and-error changes, speculative fixes, untracked regressions). This playbook prevents that by enforcing discipline.
🛡️ The "Stop Loss" Rule for AIs
Strict Rule: If the AI has tried to fix the same bug 3 times without success:
- Stop. Do not generate a 4th attempt.
- Escalate: Ask user for new logs/context.
- Rethink: Often the test is wrong, not the code.
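The stop-loss rule can be enforced mechanically rather than by convention. The sketch below (the `FixAttemptTracker` class, its bug-id scheme, and the limit constant are illustrative, not part of the playbook) raises once a fourth attempt on the same bug is requested:

```python
class StopLossExceeded(Exception):
    """Raised when the same bug has already consumed the allowed fix attempts."""

class FixAttemptTracker:
    """Counts fix attempts per bug and enforces the 3-strike stop-loss rule."""

    MAX_ATTEMPTS = 3

    def __init__(self):
        self.attempts = {}  # bug id -> number of fix attempts so far

    def record_attempt(self, bug_id: str) -> int:
        """Register one more attempt; refuse the 4th with an escalation hint."""
        count = self.attempts.get(bug_id, 0) + 1
        if count > self.MAX_ATTEMPTS:
            raise StopLossExceeded(
                f"{bug_id}: {self.MAX_ATTEMPTS} attempts failed. "
                "Escalate: ask for new logs/context and re-check the test itself."
            )
        self.attempts[bug_id] = count
        return count
```

In an AI-assisted workflow, the orchestration layer would call `record_attempt` before each generated fix, so the stop happens before a 4th speculative change is ever produced.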
Scope
This playbook applies to:
- Production bugs
- Test failures
- Regressions
- Unexpected behavior discovered during development
It does not replace incident response for large outages (see Incident Response Playbook).
Core Principle
Debugging is an investigation, not a guessing game. Prefer evidence, reproduction, and minimal fixes over broad changes.
Inputs
- Defect report (even if minimal)
- Logs, error messages, stack traces
- Steps to reproduce (if available)
- Expected vs actual behavior
Phase 1: Triage and Classification
Goal
Understand severity, impact, and urgency.
Decision Matrix (The "Stop the Bleeding" Check)
- Critical (P0): Data loss, security breach, or the main flow is blocked. -> Drop everything. Fix immediately.
- High (P1): Major feature broken, no workaround. -> Fix before the next feature.
- Medium (P2): Annoyance, workaround exists. -> Schedule in the next sprint/slot.
- Low (P3): Visual glitch, typo. -> Backlog.
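One way to keep triage consistent is to encode the matrix as a small function. This is a sketch under assumed signal names (`data_loss`, `workaround_exists`, etc. are illustrative inputs, not fields the playbook defines), and the fall-through ordering is a heuristic, not a rule:

```python
from enum import Enum

class Severity(Enum):
    """Severity levels from the decision matrix, with their response."""
    P0 = "Critical: drop everything, fix immediately"
    P1 = "High: fix before the next feature"
    P2 = "Medium: schedule in the next sprint/slot"
    P3 = "Low: backlog"

def triage(data_loss: bool, security_breach: bool, main_flow_blocked: bool,
           major_feature_broken: bool, workaround_exists: bool) -> Severity:
    """Map the triage signals onto a severity, checked from worst to best."""
    if data_loss or security_breach or main_flow_blocked:
        return Severity.P0
    if major_feature_broken and not workaround_exists:
        return Severity.P1
    if major_feature_broken or workaround_exists:
        return Severity.P2
    return Severity.P3
```

The value of writing it down is less the code than the forced ordering: P0 conditions are always checked first, so a "minor" bug that also risks data integrity cannot be filed as P3.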
Questions
- Who is affected and how badly?
- Is this a regression?
- Is data integrity at risk?
- Is there a safe workaround?
Outputs
- Severity (Critical / High / Medium / Low)
- Impact summary
- Immediate mitigation plan (if required)
Phase 2: Reproduce and Observe
Goal
Achieve reliable reproduction and collect evidence.
Activities
- Reproduce the issue locally or in a controlled environment
- Record exact steps
- Capture logs and relevant state
- Reduce reproduction to the smallest possible case
Outputs
- Minimal reproduction steps
- Evidence (logs, traces, screenshots)
Phase 3: Hypothesis and Isolation
Goal
Form plausible hypotheses and narrow the cause.
Activities
- List 2–4 plausible hypotheses (not 20)
- Identify which signals support or refute each hypothesis
- Add temporary instrumentation if needed
- Bisect recent changes if it appears to be a regression
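Bisecting is usually done with `git bisect`, which binary-searches history for the commit that introduced the regression. As a sketch of what that automates (the `is_bad` predicate, e.g. "run the failing test at this commit", is an assumed callback; commit ids are illustrative):

```python
def bisect_first_bad(commits, is_bad):
    """Binary-search an oldest-to-newest commit list for the first bad commit.

    Assumes the history flips from good to bad exactly once, which is the
    same precondition `git bisect` relies on. `is_bad(commit)` should
    reproduce the defect deterministically (e.g. run the failing test).
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # still good here; the break is later
    return commits[lo]
```

The payoff is the cost model: isolating one bad commit among n candidates takes about log2(n) reproductions, so a reliable, fast reproduction from Phase 2 directly shortens this phase.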
Outputs
- Leading hypothesis with supporting evidence
- Identified probable root cause area
Phase 4: Create a Safety Net
Goal
Prevent the defect from returning.
Activities
- Create a failing automated test that reproduces the defect
- If testing is hard, create a characterization test or harness
Outputs
- Failing test that demonstrates the defect
Phase 5: Minimal Fix
Goal
Fix the defect with minimal collateral impact.
Activities
- Implement the smallest change that resolves the issue
- Avoid unrelated refactoring
- Re-run the failing test until it passes
- Run full test suite
Outputs
- Minimal code change
- Passing tests
Phase 6: Validation and Closure
Goal
Confirm resolution and ensure quality.
Activities
- Verify expected behavior matches requirements
- Check for regressions in adjacent areas
- Validate against Definition of Done
- Add release note entry if relevant
Outputs
- Validation summary
- Updated documentation (if applicable)
Phase 7: Knowledge Capture (The Learning Loop)
Goal
Prevent this bug from happening in the future or in other parts of the system.
Activities
- Ask: "Is this a recurring pattern?"
- Ask: "Did we learn a system constraint (e.g., 'Do not call API X without Header Y')?"
- Action: If yes, update docs/lessons-learned.md.
Outputs
- Updated Project Intelligence.
Completion Criteria
A defect is considered resolved only if:
- It is reproducible (or evidence is sufficient to confidently diagnose)
- A test exists that would fail without the fix (where feasible)
- The fix is minimal and reviewed against DoD
- No new regressions are introduced
Interaction with AI (Recommended Pattern)
Act as a QA Engineer.
Context:
- Defect description and evidence (logs, steps, stack trace)
- Definition of Done
- This Defect Handling Playbook
Task:
Guide me through triage, reproduction, hypothesis formation, and root cause isolation.
Then propose a minimal fix strategy.
Rules:
- Do not propose broad refactors.
- Do not guess without evidence.
- Prefer adding a failing test before fixing.
Anti-Patterns
- Shotgun Debugging: Random code changes to "see if it helps".
- Symptom Fixing: Hiding errors instead of solving the cause.
- Scope Creep: Large refactors during debugging.
- Blind Closing: Closing without a test or clear evidence.
Status
This playbook is intentionally conservative.
It reduces time-to-fix by preventing chaos and regression cycles.