Research-driven code review with confidence-scored, evidence-validated findings. Session review or full codebase audit via parallel teams. Use when reviewing changes or auditing quality; NOT for writing code or benchmarking.
Install:

```shell
npx skills add wyattowalsh/agents/skills/honest-review -g
```

Use:

```
/honest-review [path | audit | PR#]
```
Works with Claude Code, Gemini CLI, and other agentskills.io-compatible agents.
Research-driven code review where every finding is validated with evidence. The core differentiator is research validation — findings are confirmed with external evidence (Context7, WebSearch, gh) rather than relying solely on LLM knowledge.
Reasoning Chains
Every finding must explain WHY before stating WHAT. Reduces false positives by 51% (Cubic research).
Citation Anchors
[file:start-end] references mechanically verified against source. Mismatched refs discard the finding.
Agentic Verification
Three-phase review: Flag, Verify (tool calls), then Validate (research). Grep/Read confirm before reporting.
Multi-Pass Diversity
3 parallel Pass A subagents with deterministic ordering diversity. Majority voting elevates consensus flags.
Conventional Comments
Machine-parseable PR output: issue (blocking): ... format for CI annotations and PR comments.
Dependency Context
Cross-file dependency graph built during triage. High fan-in files auto-elevated to HIGH risk.
Learning Loop
Store false-positive dismissals per project. Similar findings suppressed in future reviews.
OWASP 2025
Updated checklists for A03:2025 (Supply Chain) and A10:2025 (Exception Handling).
Also includes: 10 creative lenses, review history, HTML dashboard, degraded mode, classification gating, CI integration, and hooks.
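As a rough sketch of how the multi-pass majority voting described above might work (the function name, quorum threshold, and finding keys here are illustrative, not the skill's actual implementation):

```python
from collections import Counter

def majority_vote(pass_findings, quorum=2):
    """Elevate findings flagged by at least `quorum` of the parallel
    Pass A subagents. Each pass reports a set of finding keys,
    e.g. (file, rule) tuples."""
    counts = Counter()
    for findings in pass_findings:
        counts.update(set(findings))  # one vote per pass
    return {f for f, n in counts.items() if n >= quorum}

# Three parallel passes with deterministic ordering diversity:
passes = [
    {("auth.py", "missing-timeout"), ("db.py", "n-plus-one")},
    {("auth.py", "missing-timeout")},
    {("auth.py", "missing-timeout"), ("db.py", "n-plus-one")},
]
consensus = majority_vote(passes)  # both findings reach quorum
```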
| $ARGUMENTS | Mode |
|---|---|
| Empty + changes in session (git diff) | Session review of changed files |
| Empty + no changes (first message) | Full codebase audit |
| File or directory path | Scoped review of that path |
| "audit" | Force full codebase audit |
| PR number/URL | Review PR changes (gh pr diff) |
| Git range (HEAD~3..HEAD) | Review changes in that range |
| "history" [project] | Show review history for project |
| "diff" or "delta" [project] | Compare current vs. previous review |
| --format sarif (with any mode) | Output findings in SARIF v2.1 |
| "learnings" [command] | Manage false-positive learnings (add/list/clear) |
| --format conventional (with any mode) | Output in Conventional Comments format |
| Unrecognized input | Ask for clarification |
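The dispatch table above can be sketched as a simple classifier. This is a simplified illustration; the real skill also inspects git state and the full argument string, and the mode names here are assumptions:

```python
import re

def classify(arguments: str, has_session_changes: bool) -> str:
    """Map $ARGUMENTS to a review mode (simplified sketch)."""
    arg = arguments.strip()
    if not arg:
        return "session-review" if has_session_changes else "full-audit"
    head = arg.split()[0].strip('"')
    if head == "audit":
        return "full-audit"
    if head == "delta":          # alias for diff
        return "diff"
    if head in ("history", "diff", "learnings"):
        return head
    if head.isdigit() or "/pull/" in head:
        return "pr-review"
    if ".." in head:             # e.g. HEAD~3..HEAD
        return "git-range"
    if "/" in head or "." in head:  # looks like a path
        return "scoped-review"
    return "clarify"
```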
Both modes follow a 4-wave pipeline:
1. Triage (Wave 0) — Risk-stratify files as HIGH/MEDIUM/LOW. Run `uv run scripts/project-scanner.py` for project profiling. Compute a review depth score (0-10) for classification gating. Determine specialist triggers (security, observability, requirements).
2. Analysis (Wave 1) — Always run the content-adaptive team at maximum depth (no inline-only mode): Correctness, Design, Efficiency, Code Reuse, and Test Quality reviewers always spawn; specialists (Security, Observability, Requirements, Data Migration, Frontend) are triggered by triage. Each reviewer runs 3 internal passes (A: scan, B: deep dive, C: research).
3. Research Validation (Wave 2) — Three-phase review: Flag (hypothesize), Verify (tool calls via Grep/Read to confirm assumptions before reporting), Validate (spawn research subagents for external evidence). Dispatch order: slopsquatting detection first, then HIGH-risk (2+ sources), then MEDIUM-risk. In degraded mode, apply confidence ceilings per unavailable tool.
4. Judge Reconciliation (Wave 3) — Normalize findings, cluster by root cause, deduplicate with weighted confidence merging (`1 - (1-c1)(1-c2)...`), apply the confidence filter, resolve conflicts, check interactions, elevate systemic patterns (3+ files), and rank by `score = severity_weight × confidence × blast_radius`.
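The Judge's merge and ranking formulas can be sketched directly. The noisy-OR merge treats each duplicate's confidence as an independent signal; the severity weights and blast radius values below are illustrative, not the skill's actual constants:

```python
def merge_confidence(confidences):
    """Weighted confidence merge for duplicates: 1 - (1-c1)(1-c2)..."""
    p_all_wrong = 1.0
    for c in confidences:
        p_all_wrong *= (1.0 - c)
    return 1.0 - p_all_wrong

def rank_score(severity_weight, confidence, blast_radius):
    """score = severity_weight x confidence x blast_radius"""
    return severity_weight * confidence * blast_radius

# Two reviewers flag the same root cause at 0.6 and 0.5 confidence:
merged = merge_confidence([0.6, 0.5])  # 1 - 0.4 * 0.5 = 0.8
```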
Three abstraction levels, each examining defects and unnecessary complexity:
| Level | Focus | Simplify |
|---|---|---|
| Correctness (does it work?) | Error handling, boundary conditions, security, API misuse, concurrency, resource leaks | Phantom error handling, defensive checks for impossible states, dead error paths |
| Design (is it well-built?) | Abstraction quality, coupling, cohesion, test quality, cognitive complexity | Dead code, 1:1 wrappers, single-use abstractions, over-engineering |
| Efficiency (is it economical?) | Algorithmic complexity, N+1, data structure choice, resource usage, caching | Unnecessary serialization, redundant computation, premature optimization |
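For example, the classic N+1 pattern the Efficiency level looks for, shown with a toy database stand-in that counts round trips (all names here are illustrative):

```python
class CountingDB:
    """Toy stand-in for a database client; counts round trips."""
    def __init__(self, orders):
        self.orders = orders          # {user_id: [order, ...]}
        self.round_trips = 0

    def orders_for(self, user_id):
        self.round_trips += 1
        return self.orders.get(user_id, [])

    def orders_for_many(self, user_ids):
        self.round_trips += 1
        return {u: self.orders.get(u, []) for u in user_ids}

def n_plus_one(db, user_ids):
    # One query per user: N round trips for N users.
    return {u: db.orders_for(u) for u in user_ids}

def batched(db, user_ids):
    # Single batched query: one round trip regardless of N.
    return db.orders_for_many(user_ids)
```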
Context-dependent triggers activate automatically when relevant: security, observability, AI code smells, config/secrets, resilience, i18n/accessibility, data migration, backward compatibility, infrastructure as code, and requirements validation.
Apply at least 2 lenses per review scope. For security-sensitive code, Adversary is mandatory.
Every finding follows this mandatory order:
1. Citation anchor — [file:start-end] exact source location, mechanically verified
2. Reasoning chain — WHY this is a problem (written before the finding statement)
3. Finding statement — WHAT the problem is
4. Evidence — external validation source (Context7, WebSearch, gh)
5. Fix — recommended approach
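A finding in this order might look like the following (the file, line range, and evidence wording are illustrative):

```text
Citation anchor: [src/api/client.py:42-48]
Reasoning chain: Network calls without a timeout can block the worker
  indefinitely, and this client is called on the request path.
Finding: requests.get() is invoked with no timeout argument.
Evidence: Context7 — the requests documentation notes that without a
  timeout, the call can hang indefinitely.
Fix: Pass an explicit timeout (e.g. timeout=5) and handle requests.Timeout.
```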
Adapts review depth to project type:
| Project Type | Review Depth |
|---|---|
| Prototype | P0/S0 only. Skip style, structure, and optimization concerns. |
| Production | Full review at all levels and severities. |
| Library | Full review plus backward compatibility focus on public API surfaces. |
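Classification gating over the table above can be sketched as a filter on (project type, severity, concern) triples; the severity and concern labels here are assumptions about the skill's internal taxonomy:

```python
def should_report(project_type: str, severity: str, concern: str) -> bool:
    """Gating sketch: prototypes only see P0/S0 findings, and style,
    structure, and optimization concerns are skipped for them."""
    if project_type == "prototype":
        return severity in ("P0", "S0") and concern not in (
            "style", "structure", "optimization")
    # Production and library projects get the full review.
    return True
```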
```
Lead: triage (Wave 0), Judge reconciliation (Wave 3), final report
 |-- Correctness Reviewer  --> Passes A/B/C internally
 |-- Design Reviewer       --> Passes A/B/C internally
 |-- Efficiency Reviewer   --> Passes A/B/C internally
 |-- [Security Specialist, if triage triggers]
 |-- [Observability Specialist, if triage triggers]
 |-- [Requirements Validator, if intent available]
```

Each reviewer runs 3 internal passes: Pass A (quick scan, haiku), Pass B (deep dive on HIGH-risk files, opus), Pass C (research-validate findings).
Review history is persisted to `~/.claude/honest-reviews/` via `scripts/review-store.py`:

| Command | Description |
|---|---|
| save | Save review findings with project, mode, commit, and scope metadata |
| load | Retrieve a specific review (by project and optional date) |
| list | List saved reviews with metadata |
| diff | Compare two reviews — shows new, resolved, and recurring findings |
Use `/honest-review history my-project` to view history or `/honest-review diff my-project` to compare against a previous review.
After Judge reconciliation, findings can be rendered into a self-contained HTML dashboard using `templates/dashboard.html`. Inject the findings JSON into the `<script id="data">` tag. The dashboard auto-detects the view type.
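Injecting the findings JSON might look like this minimal sketch, assuming the template contains an empty `<script id="data" type="application/json">` element; check the actual template's attributes before relying on this:

```python
import json

def inject_findings(template_html: str, findings: list) -> str:
    """Splice the findings JSON into the dashboard's data script tag."""
    marker = '<script id="data" type="application/json">'
    payload = json.dumps(findings)
    return template_html.replace(marker, marker + payload, 1)

template = '<html><script id="data" type="application/json"></script></html>'
page = inject_findings(template, [{"severity": "P0", "file": "auth.py"}])
```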
| Script | Purpose |
|---|---|
| scripts/project-scanner.py | Wave 0 triage — project profiling with dependency graph construction and fan-in risk scoring |
| scripts/finding-formatter.py | Wave 3 Judge — normalize findings to JSON; supports --format sarif and --format conventional |
| scripts/review-store.py | State management — save, load, list, diff review history (schema v2 with reasoning tracking) |
| scripts/learnings-store.py | Learning loop — add, check, list, clear false-positive dismissals per project |
| scripts/sarif-uploader.py | Upload SARIF results to GitHub Code Scanning |
| File | When to Read | ~Tokens |
|---|---|---|
| references/triage-protocol.md | Wave 0 triage (incl. dependency graph) | 1500 |
| references/checklists.md | Analysis or teammate prompts (incl. OWASP 2025) | 2800 |
| references/research-playbook.md | Three-phase research validation (Wave 2) | 2200 |
| references/judge-protocol.md | Judge reconciliation (Wave 3, incl. learnings check) | 1200 |
| references/self-verification.md | Wave 3.5 — agentic verification + hallucination detection | 900 |
| references/output-formats.md | Final output (reasoning chains + citation anchors) | 1100 |
| references/team-templates.md | Team design (multi-pass ordering diversity) | 2200 |
| references/review-lenses.md | Creative review lenses (10 lenses) | 1600 |
| references/ci-integration.md | CI pipelines (Conventional Comments format) | 700 |
| references/conventional-comments.md | PR comments and CI annotations | 400 |
| references/dependency-context.md | Cross-file dependency analysis | 500 |
| references/supply-chain-security.md | Dependency and build pipeline security | 1000 |
| references/auto-fix-protocol.md | Implementing fixes after approval | 800 |
| references/sarif-output.md | SARIF format for CI tooling | 700 |
18 evaluation scenarios covering all dispatch paths, including session review, full audit, finding quality, healthy codebase, PR review, git range, degraded mode, reasoning chains, agentic verification, conventional comments, dependency context, learnings loop, multi-pass review, and self-verification.
| Field | Value |
|---|---|
| Name | honest-review |
| License | MIT |
| Version | 5.0 |
| Author | wyattowalsh |
| Agent | Reads | Bridge File |
|---|---|---|
| Claude Code | CLAUDE.md | CLAUDE.md |
| Gemini CLI | GEMINI.md | GEMINI.md |
| Antigravity | GEMINI.md | GEMINI.md |
| Codex | AGENTS.md | — |
| Crush | AGENTS.md | — |
| OpenCode | AGENTS.md | — |
| Cursor | AGENTS.md | — |
| GitHub Copilot | Generated .github/copilot-instructions.md + AGENTS.md | .github/copilot-instructions.md |
```yaml
---
name: honest-review
description: >-
  Research-driven code review with confidence-scored, evidence-validated
  findings. Session review or full codebase audit via parallel teams.
  Use when reviewing changes, auditing codebases, verifying work quality.
  NOT for writing new code, explaining code, or benchmarking.
argument-hint: "[path | audit | PR#]"
license: MIT
metadata:
  author: wyattowalsh
  version: "5.0"
model: sonnet
hooks:
  PreToolUse:
    - matcher: Edit
      hooks:
        - command: "bash -c 'if git diff --quiet \"$TOOL_INPUT_file_path\" 2>/dev/null; then exit 0; else echo \"WARNING: $(basename \"$TOOL_INPUT_file_path\") has uncommitted changes\" >&2; exit 0; fi'"
  PostToolUse:
    - matcher: Edit
      hooks:
        - command: "bash -c 'git diff --stat \"$TOOL_INPUT_file_path\" 2>/dev/null || true'"
---
```
# Honest Review
Research-driven code review. Every finding is validated with evidence. 4-wave pipeline: Triage → Analysis → Research → Judge.
**Scope:** Code review and audit only. NOT for writing new code, explaining code, or benchmarking.