
honest-review

Research-driven code review with confidence-scored, evidence-validated findings. Session review or full codebase audit via parallel teams.

honest-review · MIT · v5.0 · wyattowalsh

Confidence-scored code review with evidence validation. Session or full codebase audit. Use when reviewing changes or auditing quality. NOT for writing code or benchmarking.

Install:

```sh
npx skills add wyattowalsh/agents/skills/honest-review -g
```

Use: `/honest-review [path | audit | PR#]`

Works with Claude Code, Gemini CLI, and other agentskills.io-compatible agents.

Research-driven code review where every finding is validated with evidence. The core differentiator is research validation — findings are confirmed with external evidence (Context7, WebSearch, gh) rather than relying solely on LLM knowledge.

Reasoning Chains

Every finding must explain WHY before stating WHAT. Reduces false positives by 51% (Cubic research).

Citation Anchors

`[file:start-end]` references are mechanically verified against the source; mismatched references cause the finding to be discarded.

Agentic Verification

Three-phase review: Flag, Verify (tool calls), then Validate (research). Grep/Read confirm before reporting.

Multi-Pass Diversity

3 parallel Pass A subagents with deterministic ordering diversity. Majority voting elevates consensus flags.
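The majority-voting step can be sketched as follows. This is a minimal illustration, not the skill's actual implementation; the finding-key format and the 2-of-3 quorum are assumptions.

```python
from collections import Counter


def consensus_flags(pass_results: list[set[str]], quorum: int = 2) -> set[str]:
    """Elevate flags raised by at least `quorum` of the parallel Pass A subagents."""
    votes = Counter(flag for result in pass_results for flag in result)
    return {flag for flag, count in votes.items() if count >= quorum}


# Three subagents scan the same files in different deterministic orders;
# each returns a set of hypothetical finding keys.
passes = [
    {"auth.py:12 sql-injection", "api.py:40 n-plus-one"},
    {"auth.py:12 sql-injection", "util.py:7 dead-code"},
    {"auth.py:12 sql-injection", "api.py:40 n-plus-one"},
]
print(consensus_flags(passes))
```

Flags raised by two or more passes survive; a flag seen by only one subagent stays a low-confidence hypothesis for the verification phase.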

Conventional Comments

Machine-parseable PR output in the `issue (blocking): ...` format for CI annotations and PR comments.

Dependency Context

Cross-file dependency graph built during triage. High fan-in files auto-elevated to HIGH risk.
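The fan-in elevation can be sketched from import edges alone; the threshold of 3 and the two-level risk labels are assumptions for illustration.

```python
from collections import defaultdict


def fanin_risk(edges: list[tuple[str, str]], threshold: int = 3) -> dict[str, str]:
    """edges are (importer, imported) pairs; files imported by many others become HIGH risk."""
    fan_in: dict[str, int] = defaultdict(int)
    for _importer, imported in edges:
        fan_in[imported] += 1
    return {f: ("HIGH" if n >= threshold else "LOW") for f, n in fan_in.items()}


# core.py is imported by three files, so a defect there has a wide blast radius.
edges = [("a.py", "core.py"), ("b.py", "core.py"), ("c.py", "core.py"), ("a.py", "util.py")]
print(fanin_risk(edges))
```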

Learning Loop

Store false-positive dismissals per project. Similar findings suppressed in future reviews.
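Conceptually, the suppression check is a fingerprint lookup. The sketch below assumes a JSON list of fingerprint strings per project; the actual `scripts/learnings-store.py` schema may differ.

```python
import json
from pathlib import Path


def load_learnings(store: Path) -> set[str]:
    """Dismissed false-positive fingerprints for this project (empty if no store yet)."""
    if not store.exists():
        return set()
    return set(json.loads(store.read_text()))


def suppress_known(findings: list[dict], dismissed: set[str]) -> list[dict]:
    """Drop findings whose fingerprint matches a prior false-positive dismissal."""
    return [f for f in findings if f["fingerprint"] not in dismissed]


# A finding dismissed as a false positive in a past review is not re-reported.
dismissed = {"util.py:dead-code"}
findings = [{"fingerprint": "util.py:dead-code"}, {"fingerprint": "auth.py:sqli"}]
print(suppress_known(findings, dismissed))
```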

OWASP 2025

Updated checklists for A03:2025 (Supply Chain) and A10:2025 (Exception Handling).

Also includes: 10 creative lenses, review history, HTML dashboard, degraded mode, classification gating, CI integration, and hooks.

| `$ARGUMENTS` | Mode |
| --- | --- |
| Empty + changes in session (`git diff`) | Session review of changed files |
| Empty + no changes (first message) | Full codebase audit |
| File or directory path | Scoped review of that path |
| `"audit"` | Force full codebase audit |
| PR number/URL | Review PR changes (`gh pr diff`) |
| Git range (`HEAD~3..HEAD`) | Review changes in that range |
| `"history" [project]` | Show review history for project |
| `"diff"` or `"delta"` `[project]` | Compare current vs. previous review |
| `--format sarif` (with any mode) | Output findings in SARIF v2.1 |
| `"learnings" [command]` | Manage false-positive learnings (add/list/clear) |
| `--format conventional` (with any mode) | Output in Conventional Comments format |
| Unrecognized input | Ask for clarification |

Both modes follow a 4-wave pipeline:

  1. Triage (Wave 0) — Risk-stratify files as HIGH/MEDIUM/LOW. Run `uv run scripts/project-scanner.py` for project profiling. Compute review depth score (0-10) for classification gating. Determine specialist triggers (security, observability, requirements).

  2. Analysis (Wave 1) — Always run the content-adaptive team at maximum depth (no inline-only mode): Correctness, Design, Efficiency, Code Reuse, and Test Quality reviewers always spawn; specialists (Security, Observability, Requirements, Data Migration, Frontend) are triggered by triage. Each reviewer runs 3 internal passes (A: scan, B: deep dive, C: research).

  3. Research Validation (Wave 2) — Three-phase review: Flag (hypothesize), Verify (tool calls via Grep/Read to confirm assumptions before reporting), Validate (spawn research subagents for external evidence). Dispatch order: slopsquatting detection first, then HIGH-risk (2+ sources), then MEDIUM-risk. In degraded mode, apply confidence ceilings per unavailable tool.

  4. Judge Reconciliation (Wave 3) — Normalize findings, cluster by root cause, deduplicate with weighted confidence merging (`1 - (1-c1)(1-c2)...`), apply confidence filter, resolve conflicts, check interactions, elevate systemic patterns (3+ files), and rank by `score = severity_weight × confidence × blast_radius`.
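The merge and ranking formulas above can be sketched directly. The noisy-OR merge comes from the pipeline description; the P0-P3 severity weights and the integer blast-radius scale are illustrative assumptions.

```python
from math import prod


def merge_confidence(scores: list[float]) -> float:
    """Noisy-OR merge of duplicate findings: 1 - (1-c1)(1-c2)..."""
    return 1 - prod(1 - c for c in scores)


SEVERITY_WEIGHT = {"P0": 4, "P1": 3, "P2": 2, "P3": 1}  # assumed scale


def rank_score(severity: str, confidence: float, blast_radius: int) -> float:
    """score = severity_weight × confidence × blast_radius"""
    return SEVERITY_WEIGHT[severity] * confidence * blast_radius


# Two reviewers independently flag the same root cause at 0.6 and 0.5 confidence;
# the merged finding is more confident than either alone.
merged = merge_confidence([0.6, 0.5])
print(round(merged, 2))
print(rank_score("P1", merged, 3))
```

Note that the noisy-OR merge only increases confidence, which is why deduplication happens before the confidence filter: two weak duplicate signals can jointly clear the threshold.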

Three abstraction levels, each examining defects and unnecessary complexity:

| Level | Focus | Simplify |
| --- | --- | --- |
| Correctness (does it work?) | Error handling, boundary conditions, security, API misuse, concurrency, resource leaks | Phantom error handling, defensive checks for impossible states, dead error paths |
| Design (is it well-built?) | Abstraction quality, coupling, cohesion, test quality, cognitive complexity | Dead code, 1:1 wrappers, single-use abstractions, over-engineering |
| Efficiency (is it economical?) | Algorithmic complexity, N+1 queries, data structure choice, resource usage, caching | Unnecessary serialization, redundant computation, premature optimization |

Context-dependent triggers activate automatically when relevant: security, observability, AI code smells, config/secrets, resilience, i18n/accessibility, data migration, backward compatibility, infrastructure as code, and requirements validation.

Apply at least 2 lenses per review scope. For security-sensitive code, Adversary is mandatory.

  • Inversion — assume the code is wrong; what would break first?
  • Deletion — remove each unit; does anything else notice?
  • Newcomer — read as a first-time contributor; where do you get lost?
  • Incident — imagine a 3 AM page; what path led here?
  • Evolution — fast-forward 6 months of feature growth; what becomes brittle?
  • Adversary — what would an attacker do with this code?
  • Compliance — does this code meet regulatory requirements?
  • Dependency — is the dependency graph healthy?
  • Cost — what does this cost to run?
  • Sustainability — will this scale without linear cost growth?

Every finding follows this mandatory order:

  1. Citation anchor — `[file:start-end]` exact source location, mechanically verified

  2. Reasoning chain — WHY this is a problem (written before the finding statement)

  3. Finding statement — WHAT the problem is

  4. Evidence — external validation source (Context7, WebSearch, gh)

  5. Fix — recommended approach
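The mechanical verification in step 1 amounts to checking that the anchor names a real file and a line range that exists inside it. A minimal sketch, assuming anchors of the exact form `[path:start-end]` (the actual skill's parser may accept more):

```python
import re
import tempfile
from pathlib import Path

ANCHOR = re.compile(r"\[([^\[\]:]+):(\d+)-(\d+)\]")


def verify_anchor(anchor: str, root: Path = Path(".")) -> bool:
    """True iff the anchor names an existing file and a line range inside it."""
    m = ANCHOR.fullmatch(anchor)
    if not m:
        return False
    path = root / m.group(1)
    start, end = int(m.group(2)), int(m.group(3))
    if not path.is_file() or start < 1 or end < start:
        return False
    n_lines = len(path.read_text(errors="replace").splitlines())
    return end <= n_lines


# Demo: a three-line file accepts [demo.py:1-3] but rejects [demo.py:2-9].
tmp = Path(tempfile.mkdtemp()) / "demo.py"
tmp.write_text("line1\nline2\nline3\n")
print(verify_anchor(f"[{tmp.name}:1-3]", root=tmp.parent))
print(verify_anchor(f"[{tmp.name}:2-9]", root=tmp.parent))
```

A finding whose anchor fails this check is discarded rather than reported with a guessed location.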

Adapts review depth to project type:

| Project Type | Review Depth |
| --- | --- |
| Prototype | P0/S0 only. Skip style, structure, and optimization concerns. |
| Production | Full review at all levels and severities. |
| Library | Full review plus backward compatibility focus on public API surfaces. |
```
[Lead: triage (Wave 0), Judge reconciliation (Wave 3), final report]
|-- Correctness Reviewer --> Passes A/B/C internally
|-- Design Reviewer --> Passes A/B/C internally
|-- Efficiency Reviewer --> Passes A/B/C internally
|-- [Security Specialist if triage triggers]
|-- [Observability Specialist if triage triggers]
|-- [Requirements Validator if intent available]
```

Each reviewer runs 3 internal passes: Pass A (quick scan, haiku), Pass B (deep dive HIGH-risk files, opus), Pass C (research validate findings).

Review history is persisted to `~/.claude/honest-reviews/` via `scripts/review-store.py`:

| Command | Description |
| --- | --- |
| `save` | Save review findings with project, mode, commit, and scope metadata |
| `load` | Retrieve a specific review (by project and optional date) |
| `list` | List saved reviews with metadata |
| `diff` | Compare two reviews — shows new, resolved, and recurring findings |

Use `/honest-review history my-project` to view history or `/honest-review diff my-project` to compare against a previous review.

After Judge reconciliation, findings can be rendered into a self-contained HTML dashboard from the template at `templates/dashboard.html`. Inject the findings JSON into the `<script id="data">` tag. The dashboard auto-detects the view type:

  • Session view — findings table with severity/confidence heatmap, strengths, statistics
  • Audit view — multi-domain visualization with health radar chart
  • Diff view — three-column layout: new (red), resolved (green), recurring (yellow)
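The injection step can be sketched as a plain string substitution. This assumes the template ships with an empty `<script id="data" type="application/json"></script>` placeholder; the real `templates/dashboard.html` may use a different attribute set.

```python
import json
import tempfile
from pathlib import Path

PLACEHOLDER = '<script id="data" type="application/json"></script>'


def render_dashboard(template: Path, findings: list[dict], out: Path) -> None:
    """Inject the findings JSON into the dashboard's <script id="data"> tag."""
    html = template.read_text()
    payload = json.dumps(findings)
    filled = PLACEHOLDER.replace("></script>", f">{payload}</script>")
    out.write_text(html.replace(PLACEHOLDER, filled))


# Demo against a minimal stand-in template.
template = Path(tempfile.mkdtemp()) / "dashboard.html"
template.write_text(f"<html>{PLACEHOLDER}</html>")
out = template.with_name("report.html")
render_dashboard(template, [{"severity": "P1"}], out)
print(out.read_text())
```

Keeping the data in a `type="application/json"` script tag means the dashboard stays a single self-contained file: no fetch, no server, just open it in a browser.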
| Script | Purpose |
| --- | --- |
| `scripts/project-scanner.py` | Wave 0 triage — project profiling with dependency graph construction and fan-in risk scoring |
| `scripts/finding-formatter.py` | Wave 3 Judge — normalize findings to JSON; supports `--format sarif` and `--format conventional` |
| `scripts/review-store.py` | State management — save, load, list, diff review history (schema v2 with reasoning tracking) |
| `scripts/learnings-store.py` | Learning loop — add, check, list, clear false-positive dismissals per project |
| `scripts/sarif-uploader.py` | Upload SARIF results to GitHub Code Scanning |
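For orientation, a minimal SARIF v2.1.0 log of the kind `--format sarif` would emit looks roughly like this. The required `version`/`runs`/`results` shape follows the SARIF spec; the rule IDs, the finding fields, and the `to_sarif` helper itself are illustrative assumptions, not the formatter's actual schema.

```python
import json


def to_sarif(findings: list[dict]) -> dict:
    """Wrap findings in a minimal SARIF v2.1.0 log for GitHub Code Scanning."""
    return {
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": "honest-review", "version": "5.0"}},
            "results": [
                {
                    "ruleId": f["rule"],
                    "level": f["level"],  # "error" | "warning" | "note"
                    "message": {"text": f["message"]},
                    "locations": [{
                        "physicalLocation": {
                            "artifactLocation": {"uri": f["file"]},
                            "region": {"startLine": f["start"], "endLine": f["end"]},
                        }
                    }],
                }
                for f in findings
            ],
        }],
    }


sarif = to_sarif([{
    "rule": "correctness/resource-leak", "level": "error",
    "message": "file handle never closed", "file": "io/loader.py",
    "start": 42, "end": 48,
}])
print(json.dumps(sarif, indent=2)[:120])
```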
  1. Never skip triage (Wave 0) — risk classification informs everything downstream.
  2. Every non-trivial finding must have evidence validation or be discarded.
  3. Confidence < 0.3 = discard (except P0/S0 — report as unconfirmed).
  4. Do not police style preferences; follow the codebase’s conventions. Exception: violations of agent behavior rules in AGENTS.md/CLAUDE.md are reportable findings.
  5. Do not report phantom bugs requiring impossible conditions.
  6. More than 12 findings means re-prioritize — 5 validated findings beat 50 speculative.
  7. Never skip Judge reconciliation (Wave 3).
  8. Always present findings before implementing (approval gate).
  9. Always verify after implementing (build, tests, behavior).
  10. Never assign overlapping file ownership.
  11. Maintain a 3:1 positive-to-constructive ratio, except when 3+ P0/P1 findings are present.
  12. Acknowledge healthy codebases explicitly when no P0/P1 or S0 findings survive.
  13. Apply at least 2 creative lenses per scope — Adversary is mandatory for security-sensitive code.
  14. Load ONE reference file at a time.
  15. Review against the codebase’s conventions, not an idealized standard.
  16. Run self-verification (Wave 3.5) when 2+ findings survive Judge (skip for fewer findings or fully degraded mode).
  17. Follow the auto-fix protocol — never apply without diff preview and user confirmation.
  18. Check convention files (AGENTS.md, CLAUDE.md, GEMINI.md, .github/copilot-instructions.md) during triage.
  19. Every finding must include a reasoning chain (WHY) before the finding statement (WHAT).
  20. Every finding must include a mechanically verified citation anchor [file:start-end].
  21. Check learnings store during Judge Wave 3 Step 4 — suppress known false positives.
  22. Code Reuse Reviewer runs on every review scope and must cite existing equivalents (file:line).
  23. Fact evidence (Grep) is sufficient for reuse/simplification findings; external research is required for assumption-based findings.
  24. No inline-only review — always spawn the full content-adaptive team.
  25. Test Quality Reviewer always spawns (full review when tests are in scope; coverage-gap search otherwise).
  26. Agent behavior rule violations are findings; style-only preferences remain non-findings.
  27. Post-approval execution uses orchestration Pattern E and parallelizes independent fixes.

| Field | Value |
| --- | --- |
| Name | honest-review |
| License | MIT |
| Version | 5.0 |
| Author | wyattowalsh |
| Agent | Reads | Bridge File |
| --- | --- | --- |
| Claude Code | CLAUDE.md | CLAUDE.md |
| Gemini CLI | GEMINI.md | GEMINI.md |
| Antigravity | GEMINI.md | GEMINI.md |
| Codex | AGENTS.md | |
| Crush | AGENTS.md | |
| OpenCode | AGENTS.md | |
| Cursor | AGENTS.md | |
| GitHub Copilot | Generated `.github/copilot-instructions.md` + AGENTS.md | `.github/copilot-instructions.md` |
SKILL.md
```md
---
name: honest-review
description: >-
  Research-driven code review with confidence-scored, evidence-validated findings.
  Session review or full codebase audit via parallel teams. Use when reviewing
  changes, auditing codebases, verifying work quality. NOT for writing new code,
  explaining code, or benchmarking.
argument-hint: "[path | audit | PR#]"
license: MIT
metadata:
  author: wyattowalsh
  version: "5.0"
model: sonnet
hooks:
  PreToolUse:
    - matcher: Edit
      hooks:
        - command: "bash -c 'if git diff --quiet \"$TOOL_INPUT_file_path\" 2>/dev/null; then exit 0; else echo \"WARNING: $(basename \"$TOOL_INPUT_file_path\") has uncommitted changes\" >&2; exit 0; fi'"
  PostToolUse:
    - matcher: Edit
      hooks:
        - command: "bash -c 'git diff --stat \"$TOOL_INPUT_file_path\" 2>/dev/null || true'"
---

# Honest Review

Research-driven code review. Every finding validated with evidence.
4-wave pipeline: Triage → Analysis → Research → Judge.

**Scope:** Code review and audit only. NOT for writing new code, explaining code, or benchmarking.
```
