research
Deep multi-source research with confidence scoring. Auto-classifies complexity. Use for technical investigation, fact-checking. NOT for code review or simple Q&A.
Deep multi-source research with confidence scoring. Auto-classifies complexity. Use for technical investigation, fact-checking. NOT for code review or simple Q&A.
Quick Start
Section titled “Quick Start”Install:
npx skills add wyattowalsh/agents/skills/research -gUse: /research <question or topic> [--depth quick|standard|deep|exhaustive] [--format brief|deep|bib|matrix]
Works with Claude Code, Gemini CLI, and other agentskills.io-compatible agents.
What It Does
Section titled “What It Does”General-purpose deep research with multi-source synthesis, confidence scoring, and anti-hallucination verification. Adopts SOTA patterns from OpenAI Deep Research (multi-agent triage pipeline), Google Gemini Deep Research (user-reviewable plans), STORM (perspective-guided conversations), Perplexity (source confidence ratings), and LangChain ODR (supervisor-researcher with reflection).
$ARGUMENTS | Action |
|---|---|
Question or topic text (has verb or ?) | Investigate — classify complexity, execute wave pipeline |
Vague input (<5 words, no verb, no ?) | Intake — ask 2-3 clarifying questions, then classify |
check <claim> or verify <claim> | Fact-check — verify claim against 3+ search engines |
compare <A> vs <B> [vs <C>...] | Compare — structured comparison with decision matrix output |
survey <field or topic> | Survey — landscape mapping, annotated bibliography |
track <topic> | Track — load prior journal, search for updates since last session |
resume [number or keyword] | Resume — resume a saved research session |
list [active|domain|tier] | List — show journal metadata table |
archive | Archive — move journals older than 90 days |
delete <N> | Delete — delete journal N with confirmation |
export [N] | Export — render HTML dashboard for journal N (default: current) |
| Empty | Gallery — show topic examples + “ask me anything” prompt |
Critical Rules
Section titled “Critical Rules”- No claim >= 0.7 unless supported by 2+ independent sources — single-source claims cap at 0.6
- Never fabricate citations — if URL, author, title, or date cannot be verified, use vague attribution (“a study in this tradition”) rather than inventing specifics
- Always surface contradictions explicitly — never silently resolve disagreements; present both sides with evidence
- Always present triage scoring before executing research — user must see and can override complexity tier
- Save journal after every wave in Deep/Exhaustive mode — enables resume after interruption
- Never skip Wave 3 (cross-validation) for Standard/Deep/Exhaustive tiers — this is the anti-hallucination mechanism
- Multi-engine search is mandatory for fact-checking — use minimum 2 different search tools (e.g., brave-search + duckduckgo-search)
- Apply the Accounting Rule after every parallel dispatch — N dispatched = N accounted for before proceeding to next wave
- Distinguish facts from interpretations in all output — factual claims carry evidence; interpretive claims are explicitly labeled as analysis
- Flag all LLM-prior findings — claims matching common training data but lacking fresh evidence must be flagged with bias marker
- Max confidence 0.4 in degraded mode — when all research tools are unavailable, report all findings as “unverified — based on training knowledge”
- Load ONE reference file at a time — do not preload all references into context
- Track mode must load prior journal before searching — avoid re-researching what is already known
- The synthesis is not a summary — it must integrate findings into novel analysis, identify patterns across sources, and surface emergent insights not present in any single source
- PreToolUse Edit hook is non-negotiable — the research skill never modifies source files; it only creates/updates journals in
~/.claude/research/
| Field | Value |
|---|---|
| Name | research |
| License | MIT |
| Version | 1.0.0 |
| Author | wyattowalsh |
| Field | Value |
|---|---|
| Model | opus |
| Argument Hint | `[question or topic] [—depth quick |
View Full SKILL.md
---name: researchdescription: >- Deep multi-source research with confidence scoring. Auto-classifies complexity. Use for technical investigation, fact-checking. NOT for code review or simple Q&A.argument-hint: "<question or topic> [--depth quick|standard|deep|exhaustive] [--format brief|deep|bib|matrix]"model: opuslicense: MITmetadata: author: wyattowalsh version: "1.0.0"hooks: PreToolUse: - matcher: Edit hooks: - command: 'bash -c "echo BLOCKED: research skill is read-only — no file edits permitted >&2; exit 1"'---
# Deep Research
General-purpose deep research with multi-source synthesis, confidence scoring, and anti-hallucination verification. Adopts SOTA patterns from OpenAI Deep Research (multi-agent triage pipeline), Google Gemini Deep Research (user-reviewable plans), STORM (perspective-guided conversations), Perplexity (source confidence ratings), and LangChain ODR (supervisor-researcher with reflection).
## Vocabulary
| Term | Definition ||------|-----------|| **query** | The user's research question or topic; the unit of investigation || **claim** | A discrete assertion to be verified; extracted from sources or user input || **source** | A specific origin of information: URL, document, database record, or API response || **evidence** | A source-backed datum supporting or contradicting a claim; always has provenance || **provenance** | The chain from evidence to source: tool used, URL, access timestamp, excerpt || **confidence** | Score 0.0-1.0 per claim; based on evidence strength and cross-validation || **cross-validation** | Verifying a claim across 2+ independent sources; the core anti-hallucination mechanism || **triangulation** | Confirming a finding using 3+ methodologically diverse sources || **contradiction** | When two credible sources assert incompatible claims; must be surfaced explicitly || **synthesis** | The final research product: not a summary but a novel integration of evidence with analysis || **journal** | The saved markdown record of a research session, stored in `~/.claude/research/` || **sweep** | Wave 1: broad parallel search across multiple tools and sources || **deep dive** | Wave 2: targeted follow-up on specific leads from the sweep || **lead** | A promising source or thread identified during the sweep, warranting deeper investigation || **tier** | Complexity classification: Quick (0-2), Standard (3-5), Deep (6-8), Exhaustive (9-10) || **finding** | A verified claim with evidence chain, confidence score, and provenance; the atomic unit of output || **gap** | An identified area where evidence is insufficient, contradictory, or absent || **bias marker** | An explicit flag on a finding indicating potential bias (recency, authority, LLM prior, etc.) || **degraded mode** | Operation when research tools are unavailable; confidence ceilings applied |
## Dispatch
| `$ARGUMENTS` | Action ||---|---|| Question or topic text (has verb or `?`) | **Investigate** — classify complexity, execute wave pipeline || Vague input (<5 words, no verb, no `?`) | **Intake** — ask 2-3 clarifying questions, then classify || `check <claim>` or `verify <claim>` | **Fact-check** — verify claim against 3+ search engines || `compare <A> vs <B> [vs <C>...]` | **Compare** — structured comparison with decision matrix output || `survey <field or topic>` | **Survey** — landscape mapping, annotated bibliography || `track <topic>` | **Track** — load prior journal, search for updates since last session || `resume [number or keyword]` | **Resume** — resume a saved research session || `list [active\|domain\|tier]` | **List** — show journal metadata table || `archive` | **Archive** — move journals older than 90 days || `delete <N>` | **Delete** — delete journal N with confirmation || `export [N]` | **Export** — render HTML dashboard for journal N (default: current) || Empty | **Gallery** — show topic examples + "ask me anything" prompt |
### Auto-Detection Heuristic
If no mode keyword matches:
1. Ends with `?` or starts with question word (who/what/when/where/why/how/is/are/can/does/should/will) → **Investigate**2. Contains `vs`, `versus`, `compared to`, `or` between noun phrases → **Compare**3. Declarative statement with factual claim, no question syntax → **Fact-check**4. Broad field name with no specific question → ask: "Investigate a specific question, or survey the entire field?"5. Ambiguous → ask: "Would you like me to investigate this question, verify this claim, or survey this field?"
### Gallery (Empty Arguments)
Present research examples spanning domains:
| # | Domain | Example | Likely Tier ||---|--------|---------|-------------|| 1 | Technology | "What are the current best practices for LLM agent architectures?" | Deep || 2 | Academic | "What is the state of evidence on intermittent fasting for longevity?" | Standard || 3 | Market | "How does the competitive landscape for vector databases compare?" | Deep || 4 | Fact-check | "Is it true that 90% of startups fail within the first year?" | Standard || 5 | Architecture | "When should you choose event sourcing over CRUD?" | Standard || 6 | Trends | "What emerging programming languages gained traction in 2025-2026?" | Standard |
> Pick a number, paste your own question, or type `guide me`.
### Skill Awareness
Before starting research, check if another skill is a better fit:
| Signal | Redirect ||--------|----------|| Code review, PR review, diff analysis | Suggest `/honest-review` || Strategic decision with adversaries, game theory | Suggest `/wargame` || Multi-perspective expert debate | Suggest `/host-panel` || Prompt optimization, model-specific prompting | Suggest `/prompt-engineer` |
If the user confirms they want general research, proceed.
## Complexity Classification
Score the query on 5 dimensions (0-2 each, total 0-10):
| Dimension | 0 | 1 | 2 ||-----------|---|---|---|| **Scope breadth** | Single fact/definition | Multi-faceted, 2-3 domains | Cross-disciplinary, 4+ domains || **Source difficulty** | Top search results suffice | Specialized databases or multiple source types | Paywalled, fragmented, or conflicting sources || **Temporal sensitivity** | Stable/historical | Evolving field (months matter) | Fast-moving (days/weeks matter), active controversy || **Verification complexity** | Easily verifiable (official docs) | 2-3 independent sources needed | Contested claims, expert disagreement, no consensus || **Synthesis demand** | Answer is a fact or list | Compare/contrast viewpoints | Novel integration of conflicting threads |
| Total | Tier | Strategy ||-------|------|----------|| 0-2 | **Quick** | Inline, 1-2 searches, fire-and-forget || 3-5 | **Standard** | Subagent wave, 3-5 parallel searchers, report delivered || 6-8 | **Deep** | Agent team (TeamCreate), 3-5 teammates, interactive session || 9-10 | **Exhaustive** | Agent team, 4-6 teammates + nested subagent waves, interactive |
Present the scoring to the user. User can override tier with `--depth <tier>`.
## Wave Pipeline
All non-Quick research follows this 5-wave pipeline. Quick merges Waves 0+1+4 inline.
### Wave 0: Triage (always inline, never parallelized)
1. Run `!uv run python skills/research/scripts/research-scanner.py "$ARGUMENTS"` for deterministic pre-scan2. Decompose query into 2-5 sub-questions3. Score complexity on the 5-dimension rubric4. Check tool availability — probe key MCP tools; set degraded mode flags and confidence ceilings per `references/source-selection.md`5. Select tools per domain signals — read `references/source-selection.md`6. Check for existing journals — if `track` or `resume`, load prior state7. **Present triage to user** — show: complexity score, sub-questions, planned strategy, estimated tier. User may override.
### Wave 1: Broad Sweep (parallel)
Scale by tier:
**Quick (inline):** 1-2 tool calls sequentially. No subagents.
**Standard (subagent wave):** Dispatch 3-5 parallel subagents via Task tool:```Subagent A → brave-search + duckduckgo-search for sub-question 1Subagent B → exa + g-search for sub-question 2Subagent C → context7 / deepwiki / arxiv / semantic-scholar for technical specificsSubagent D → wikipedia / wikidata for factual grounding[Subagent E → PubMed / openalex if academic domain detected]```
**Deep (agent team):** TeamCreate `"research-{slug}"`:```Lead: triage (Wave 0), orchestrate, judge reconcile (Wave 3), synthesize (Wave 4) |-- web-researcher: brave-search, duckduckgo-search, exa, g-search |-- tech-researcher: context7, deepwiki, arxiv, semantic-scholar, package-version |-- content-extractor: fetcher, trafilatura, docling, wikipedia, wayback |-- [academic-researcher: arxiv, semantic-scholar, openalex, crossref, PubMed] |-- [adversarial-reviewer: devil's advocate — counter-search all emerging findings]```Spawn academic-researcher if domain signals include academic/scientific. Spawn adversarial-reviewer for Exhaustive tier or if verification complexity >= 2.
**Exhaustive:** Deep team + each teammate runs nested subagent waves internally.
Each subagent/teammate returns structured findings:```json{ "sub_question": "...", "findings": [{"claim": "...", "source_url": "...", "source_tool": "...", "excerpt": "...", "confidence_raw": 0.6}], "leads": ["url1", "url2"], "gaps": ["could not find data on X"]}```
### Wave 1.5: Perspective Expansion (Deep/Exhaustive only)
STORM-style perspective-guided conversation. Spawn 2-4 perspective subagents:
| Perspective | Focus | Question Style ||-------------|-------|---------------|| **Skeptic** | What could be wrong? What's missing? | "What evidence would disprove this?" || **Domain Expert** | Technical depth, nuance, edge cases | "What do practitioners actually encounter?" || **Practitioner** | Real-world applicability, trade-offs | "What matters when you actually build this?" || **Theorist** | First principles, abstractions, frameworks | "What underlying model explains this?" |
Each perspective agent reviews Wave 1 findings and generates 2-3 additional sub-questions from their viewpoint. These sub-questions feed into Wave 2.
### Wave 2: Deep Dive (parallel, targeted)
1. Rank leads from Wave 1 by potential value (citation frequency, source authority, relevance)2. Dispatch deep-read subagents — use fetcher/trafilatura/docling to extract full content from top leads3. Follow citation chains — if a source cites another, fetch the original4. Fill gaps — for each gap identified in Wave 1, dispatch targeted searches5. Use thinking MCPs: - `cascade-thinking` for multi-perspective analysis of complex findings - `structured-thinking` for tracking evidence chains and contradictions - `think-strategies` for complex question decomposition (Standard+ only)
### Wave 3: Cross-Validation (parallel)
The anti-hallucination wave. Read `references/confidence-rubric.md` and `references/self-verification.md`.
For every claim surviving Waves 1-2:
1. **Independence check** — are supporting sources truly independent? Sources citing each other are NOT independent.2. **Counter-search** — explicitly search for evidence AGAINST each major claim using a different search engine3. **Freshness check** — verify sources are current (flag if >1 year old for time-sensitive topics)4. **Contradiction scan** — read `references/contradiction-protocol.md`, identify and classify disagreements5. **Confidence scoring** — assign 0.0-1.0 per `references/confidence-rubric.md`6. **Bias sweep** — check each finding against 10 bias categories (7 core + 3 LLM-specific) per `references/bias-detection.md`
**Self-Verification (3+ findings survive):** Spawn devil's advocate subagent per `references/self-verification.md`:> For each finding, attempt to disprove it. Search for counterarguments. Check if evidence is outdated. Verify claims actually follow from cited evidence. Flag LLM confabulations.
Adjust confidence: Survives +0.05, Weakened -0.10, Disproven set to 0.0.Adjustments are subject to hard caps — single-source claims remain capped at 0.60 even after survival adjustment.
### Wave 4: Synthesis (always inline, lead only)
Produce the final research product. Read `references/output-formats.md` for templates.
The synthesis is NOT a summary. It must:
1. **Answer directly** — answer the user's question clearly2. **Map evidence** — all verified findings with confidence and citations3. **Surface contradictions** — where sources disagree, with analysis of why4. **Show confidence landscape** — what is known confidently, what is uncertain, what is unknown5. **Audit biases** — biases detected during research6. **Identify gaps** — what evidence is missing, what further research would help7. **Distill takeaways** — 3-7 numbered key findings8. **Cite sources** — full bibliography with provenance
**Output format** adapts to mode:- Investigate → Research Brief (Standard) or Deep Report (Deep/Exhaustive)- Fact-check → Quick Answer with verdict + evidence- Compare → Decision Matrix- Survey → Annotated Bibliography- User can override with `--format brief|deep|bib|matrix`
## Confidence Scoring
| Score | Basis ||-------|-------|| 0.9-1.0 | Official docs + 2 independent sources agree, no contradictions || 0.7-0.8 | 2+ independent sources agree, minor qualifications || 0.5-0.6 | Single authoritative source, or 2 sources with partial agreement || 0.3-0.4 | Single non-authoritative source, or conflicting evidence || 0.2-0.3 | Multiple non-authoritative sources with partial agreement, or single source with significant caveats || 0.1-0.2 | LLM reasoning only, no external evidence found || 0.0 | Actively contradicted by evidence |
**Hard rules:**- No claim reported at >= 0.7 unless supported by 2+ independent sources- Single-source claims cap at 0.6 regardless of source authority- Degraded mode (all research tools unavailable): max confidence 0.4, all findings labeled "unverified"
**Merged confidence** (for claims supported by multiple sources):`c_merged = 1 - (1-c1)(1-c2)...(1-cN)` capped at 0.99
## Evidence Chain Structure
Every finding carries this structure:
```FINDING RR-{seq:03d}: [claim statement] CONFIDENCE: [0.0-1.0] EVIDENCE: 1. [source_tool] [url] [access_timestamp] — [relevant excerpt, max 100 words] 2. [source_tool] [url] [access_timestamp] — [relevant excerpt, max 100 words] CROSS-VALIDATION: [agrees|contradicts|partial] across [N] independent sources BIAS MARKERS: [none | list of detected biases with category] GAPS: [none | what additional evidence would strengthen this finding]```
Use `!uv run python skills/research/scripts/finding-formatter.py --format markdown` to normalize.
## Source Selection
Read `references/source-selection.md` during Wave 0 for the full tool-to-domain mapping. Summary:
| Domain Signal | Primary Tools | Secondary Tools ||--------------|---------------|-----------------|| Library/API docs | context7, deepwiki, package-version | brave-search || Academic/scientific | arxiv, semantic-scholar, PubMed, openalex | crossref, brave-search || Current events/trends | brave-search, exa, duckduckgo-search, g-search | fetcher, trafilatura || GitHub repos/OSS | deepwiki, repomix | brave-search || General knowledge | wikipedia, wikidata, brave-search | fetcher || Historical content | wayback, brave-search | fetcher || Fact-checking | 3+ search engines mandatory | wikidata for structured claims || PDF/document analysis | docling | trafilatura |
**Multi-engine protocol:** For any claim requiring verification, use minimum 2 different search engines. Different engines have different indices and biases. Agreement across engines increases confidence.
## Bias Detection
Check every finding against 10 bias categories. Read `references/bias-detection.md` for full detection signals and mitigation strategies.
| Bias | Detection Signal | Mitigation ||------|-----------------|------------|| **LLM prior** | Matches common training patterns, lacks fresh evidence | Flag; require fresh source confirmation || **Recency** | Overweighting recent results, ignoring historical context | Search for historical perspective || **Authority** | Uncritically accepting prestigious sources | Cross-validate even authoritative claims || **Confirmation** | Queries constructed to confirm initial hypothesis | Use neutral queries; search for counterarguments || **Survivorship** | Only finding successful examples | Search for failures/counterexamples || **Selection** | Search engine bubble, English-only | Use multiple engines; note coverage limitations || **Anchoring** | First source disproportionately shapes interpretation | Document first source separately; seek contrast |
## State Management
- **Journal path:** `~/.claude/research/`- **Archive path:** `~/.claude/research/archive/`- **Filename convention:** `{YYYY-MM-DD}-{domain}-{slug}.md` - `{domain}`: `tech`, `academic`, `market`, `policy`, `factcheck`, `compare`, `survey`, `track`, `general` - `{slug}`: 3-5 word semantic summary, kebab-case - Collision: append `-v2`, `-v3`- **Format:** YAML frontmatter + markdown body + `<!-- STATE -->` blocks
**Save protocol:**- Quick: save once at end with `status: Complete`- Standard/Deep/Exhaustive: save after Wave 1 with `status: In Progress`, update after each wave, finalize after synthesis
**Resume protocol:**1. `resume` (no args): find `status: In Progress` journals. One → auto-resume. Multiple → show list.2. `resume N`: Nth journal from `list` output (reverse chronological).3. `resume keyword`: search frontmatter `query` and `domain_tags` for match.
Use `!uv run python skills/research/scripts/journal-store.py` for all journal operations.
**State snapshot** (appended after each wave save):```html<!-- STATEwave_completed: 2findings_count: 12leads_pending: ["url1", "url2"]gaps: ["topic X needs more sources"]contradictions: 1next_action: "Wave 3: cross-validate top 8 findings"-->```
## In-Session Commands (Deep/Exhaustive)
Available during active research sessions:
| Command | Effect ||---------|--------|| `drill <finding #>` | Deep dive into a specific finding with more sources || `pivot <new angle>` | Redirect research to a new sub-question || `counter <finding #>` | Explicitly search for evidence against a finding || `export` | Render HTML dashboard || `status` | Show current research state without advancing || `sources` | List all sources consulted so far || `confidence` | Show confidence distribution across findings || `gaps` | List identified knowledge gaps || `?` | Show command menu |
Read `references/session-commands.md` for full protocols.
## Reference File Index
| File | Content | Read When ||------|---------|-----------|| `references/source-selection.md` | Tool-to-domain mapping, multi-engine protocol, degraded mode | Wave 0 (selecting tools) || `references/confidence-rubric.md` | Scoring rubric, cross-validation rules, independence checks | Wave 3 (assigning confidence) || `references/evidence-chain.md` | Finding template, provenance format, citation standards | Any wave (structuring evidence) || `references/bias-detection.md` | 10 bias categories (7 core + 3 LLM-specific), detection signals, mitigation strategies | Wave 3 (bias audit) || `references/contradiction-protocol.md` | 4 contradiction types, resolution framework | Wave 3 (contradiction detection) || `references/self-verification.md` | Devil's advocate protocol, hallucination detection | Wave 3 (self-verification) || `references/output-formats.md` | Templates for all 5 output formats | Wave 4 (formatting output) || `references/team-templates.md` | Team archetypes, subagent prompts, perspective agents | Wave 0 (designing team) || `references/session-commands.md` | In-session command protocols | When user issues in-session command || `references/dashboard-schema.md` | JSON data contract for HTML dashboard | `export` command |
**Loading rule:** Load ONE reference at a time per the "Read When" column. Do not preload.
## Critical Rules
1. **No claim >= 0.7 unless supported by 2+ independent sources** — single-source claims cap at 0.62. **Never fabricate citations** — if URL, author, title, or date cannot be verified, use vague attribution ("a study in this tradition") rather than inventing specifics3. **Always surface contradictions explicitly** — never silently resolve disagreements; present both sides with evidence4. **Always present triage scoring before executing research** — user must see and can override complexity tier5. **Save journal after every wave in Deep/Exhaustive mode** — enables resume after interruption6. **Never skip Wave 3 (cross-validation) for Standard/Deep/Exhaustive tiers** — this is the anti-hallucination mechanism7. **Multi-engine search is mandatory for fact-checking** — use minimum 2 different search tools (e.g., brave-search + duckduckgo-search)8. **Apply the Accounting Rule after every parallel dispatch** — N dispatched = N accounted for before proceeding to next wave9. **Distinguish facts from interpretations in all output** — factual claims carry evidence; interpretive claims are explicitly labeled as analysis10. **Flag all LLM-prior findings** — claims matching common training data but lacking fresh evidence must be flagged with bias marker11. **Max confidence 0.4 in degraded mode** — when all research tools are unavailable, report all findings as "unverified — based on training knowledge"12. **Load ONE reference file at a time** — do not preload all references into context13. **Track mode must load prior journal before searching** — avoid re-researching what is already known14. **The synthesis is not a summary** — it must integrate findings into novel analysis, identify patterns across sources, and surface emergent insights not present in any single source15. **PreToolUse Edit hook is non-negotiable** — the research skill never modifies source files; it only creates/updates journals in `~/.claude/research/`