observability-advisor
Design and review logs, metrics, traces, SLOs, and alerting for reliable systems. Use for telemetry strategy and coverage gaps. NOT for live incident command or vendor-specific setup.
Quick Start
Install:
`npx skills add github:wyattowalsh/agents --skill observability-advisor -y -g --agent antigravity --agent claude-code --agent codex --agent crush --agent cursor --agent gemini-cli --agent github-copilot --agent grok --agent opencode`
Use: `/observability-advisor <mode> [target]`
Works with Claude Code, Gemini CLI, OpenCode, and other agentskills.io-compatible agents.
What It Does
Design and review telemetry that helps teams detect, diagnose, and improve service behavior before and during reliability problems.
| $ARGUMENTS | Mode |
|---|---|
| `design <system>` | Design an observability architecture for a service or workflow |
| `review <service or stack>` | Audit existing telemetry, dashboards, and alerts |
| `instrument <service or path>` | Plan what to emit and where to add instrumentation |
| `alert <service or journey>` | Design actionable alerting and escalation |
| `slo <service or journey>` | Define SLIs, SLOs, and error budget policy |
| `investigate <signal or symptom>` | Structure cross-signal diagnosis for an issue |
| Natural language about logs, metrics, traces, dashboards, or alerting | Auto-detect the closest mode |
| Empty | Show the mode menu with examples |
Critical Rules
- Reject telemetry plans that optimize infrastructure visibility while leaving user-impact questions unanswered.
- Require a stable request, workflow, or journey identifier whenever the proposed design needs cross-signal correlation.
- Reject labels, fields, or exemplars that create avoidable cardinality explosions or expose raw PII.
- Keep dashboards, alerts, and runbooks as separate deliverables; do not collapse them into one artifact or one ownerless checklist.
- Page only on symptoms or leading indicators that demand operator action; downgrade the rest to ticket, dashboard, or review-only signals.
- Redirect vendor-specific setup, implementation commands, or managed-service configuration to the relevant platform skill instead of inventing provider steps here.
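The cardinality and PII rule above can be checked before any instrumentation ships. The sketch below is illustrative only: the function name `review_labels`, the budget of 1,000 series per metric, and the deny-list of sensitive keys are assumptions made for this example, not part of the skill.

```python
from math import prod

# Hypothetical deny-list of label keys that usually carry raw PII
# or unbounded identifiers; adjust to the team's data policy.
SENSITIVE_KEYS = {"email", "user_id", "ip", "phone", "session_id"}


def review_labels(labels: dict[str, int], max_series: int = 1000) -> list[str]:
    """Return findings for a proposed metric label set.

    `labels` maps each label key to its expected number of distinct values.
    `max_series` is an assumed per-metric budget, not a universal limit.
    """
    findings = []
    for key in sorted(labels):
        if key.lower() in SENSITIVE_KEYS:
            findings.append(f"reject label '{key}': likely PII or unbounded identifier")
    # Worst-case series count is the product of per-label distinct values.
    series = prod(labels.values())
    if series > max_series:
        findings.append(f"reject label set: up to {series} series exceeds budget {max_series}")
    return findings


if __name__ == "__main__":
    print(review_labels({"route": 20, "status_class": 5, "region": 4}))
    print(review_labels({"route": 20, "user_id": 50000}))
```

The product is a worst-case bound; real series counts are often lower because not every label combination occurs.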
Canonical Vocabulary
| Term | Definition |
|---|---|
| telemetry | Logs, metrics, traces, profiles, and events emitted by a system |
| signal | A measurable indicator used to detect or explain behavior |
| metric | Numeric time-series measurement aggregated over time |
| log | Structured event record capturing context for a specific occurrence |
| trace | End-to-end record of work moving through distributed components |
| span | A timed unit of work within a trace |
| SLI | Concrete measurement of a user-relevant reliability property |
| SLO | Target threshold and window for an SLI |
| error budget | Allowed unreliability implied by an SLO over its window |
| cardinality | Number of unique label or attribute values attached to telemetry |
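The relationship between an SLO and its error budget is simple arithmetic. As a minimal sketch, assuming a time-based SLO expressed as a fraction (the function name and parameters are illustrative, not defined by this skill):

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed minutes of unreliability for a time-based SLO over its window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43.2 minutes of budget.
    print(round(error_budget_minutes(0.999, 30), 1))
```

A request-based SLO works the same way, with the window's total valid request count in place of minutes.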
When To Use
- A team can see failures but cannot explain them quickly
- Alerts are noisy, late, or missing user-impact context
- A service lacks clear SLIs, SLOs, or error budget policy
- You need to add instrumentation to a new service, workflow, or migration
- Dashboards exist but ownership, escalation, or runbook linkage is weak
Classification Gate
- If the task is active outage coordination, use incident-response-engineer.
- If the task is CPU, memory, query, or runtime hotspot analysis, use performance-profiler.
- If the task is AWS-native dashboard, alarm, or log-group setup, use cloudwatch.
- If the task is CI, deploy, or platform rollout wiring, use devops-engineer.
Mode Menu
| # | Mode | Example |
|---|---|---|
| 1 | Design | design observability for multi-region checkout service |
| 2 | Review | review telemetry coverage for payments-api |
| 3 | Instrument | instrument order placement workflow across api and workers |
| 4 | Alert | alert strategy for login availability and latency |
| 5 | SLO | slo for customer webhook delivery |
| 6 | Investigate | investigate rising 5xx with queue lag and timeout traces |
Instructions
Mode: Design
- Identify the user journeys, critical dependencies, and failure domains that matter most.
- Define the questions operators must be able to answer within minutes during degradation.
- Read `references/signal-selection-matrix.md` when signal tradeoffs, sampling, or join strategy are unclear.
- Choose the minimum useful signals across logs, metrics, and traces for each critical boundary.
- Specify correlation identifiers, structured fields, and service naming so signals can be joined reliably.
- Define dashboards, alerts, runbook links, and ownership for each critical path.
- Call out sampling, retention, and cardinality constraints before recommending implementation details.
- Use `references/output-templates.md#design-template` when producing the final deliverable.
Mode: Review
- Inspect current logs, metrics, traces, dashboards, alerts, and on-call pathways.
- Check whether user-visible symptoms can be detected before customer reports arrive.
- Read `references/alert-anti-patterns.md` when alert noise, duplication, or escalation quality is part of the review.
- Identify blind spots, duplicate signals, noisy alerts, weak labels, and missing trace correlation.
- Separate findings into coverage gaps, alert quality issues, and operational debt.
- Rank issues by detection risk and operator impact.
- Use `references/output-templates.md#review-template` when formatting the audit.
Mode: Instrument
- Map the request or workflow path and identify the decision points, retries, queues, and external calls.
- Read `references/signal-selection-matrix.md` before choosing signal types for each boundary.
- Define which metrics, logs, and spans should be emitted at each boundary.
- Require stable request, tenant, or workflow identifiers only where they aid diagnosis without creating cardinality explosions.
- Keep logs structured and redact or exclude secrets and unnecessary PII.
- Produce a rollout plan that starts with the highest-value path first.
- Use `references/output-templates.md#instrumentation-template` for the emitted deliverable shape.
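The structured-logging and redaction steps above can be sketched with the standard library alone. Everything named here is an assumption for illustration: the `log_event` helper, the `request_id` field name, and the set of redacted keys are not prescribed by this skill.

```python
import json
import time

# Hypothetical field names treated as secrets or PII; adjust per data policy.
REDACTED_FIELDS = {"password", "token", "email", "card_number"}


def log_event(event: str, request_id: str, **fields) -> str:
    """Build one structured JSON log line that carries a correlation identifier."""
    record = {"ts": time.time(), "event": event, "request_id": request_id}
    for key, value in fields.items():
        # Replace sensitive values instead of emitting them.
        record[key] = "[REDACTED]" if key in REDACTED_FIELDS else value
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    print(log_event("order.placed", "req-123", tenant="acme", email="a@example.com", attempt=2))
```

Redacting by key name is a coarse safeguard; dropping the field at the call site is safer where the value is never needed for diagnosis.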
Mode: Alert
- Distinguish page-worthy conditions from ticket-only or dashboard-only signals.
- Prefer alerts tied to user symptoms, SLO burn, saturation, or stalled workflows over internal noise.
- Read `references/alert-anti-patterns.md` before recommending thresholds, paging, or deduplication changes.
- Define threshold, duration, owner, runbook, and escalation target for every alert.
- Call out what evidence an operator should inspect first after the alert fires.
- Reduce duplicate alerts that page different teams for the same symptom.
- Use `references/output-templates.md#alert-template` when presenting the alert plan.
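One way to enforce "threshold, duration, owner, runbook, and escalation target for every alert" is to model an alert as a record and refuse to route incomplete ones. This is a vendor-neutral sketch; the `AlertRule` class, its field names, and the routing labels are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: str               # e.g. "5xx ratio > 2%"
    duration_minutes: int        # how long the condition must hold
    owner: str
    runbook: str
    escalation_target: str
    user_impacting: bool         # symptom or leading indicator of user impact
    needs_operator_action: bool


def missing_fields(rule: AlertRule) -> list[str]:
    """Names of required text fields that were left empty."""
    return [f.name for f in fields(rule) if getattr(rule, f.name) == ""]


def route(rule: AlertRule) -> str:
    """Page only when the alert is user-impacting and demands operator action."""
    if missing_fields(rule):
        return "reject"
    if rule.user_impacting and rule.needs_operator_action:
        return "page"
    return "ticket" if rule.needs_operator_action else "dashboard"


if __name__ == "__main__":
    rule = AlertRule("login-5xx", "5xx ratio > 2%", 5, "identity-team",
                     "runbooks/login-5xx", "identity-oncall", True, True)
    print(route(rule))
```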
Mode: SLO
- Start from the user-facing promise, not the easiest internal metric to measure.
- Read `references/sli-slo-examples.md` when choosing SLI type, exclusions, windows, or error-budget policy.
- Define the SLI precisely: numerator, denominator, exclusions, and measurement window.
- Choose a target that matches business expectations and operational reality.
- State the error budget policy, review cadence, and what actions are triggered when the budget is burned.
- Separate availability, latency, freshness, or correctness objectives when one combined SLO would hide tradeoffs.
- Use `references/output-templates.md#slo-template` for the final deliverable.
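Defining an SLI "precisely" means writing down the numerator, denominator, and exclusions so that anyone can recompute it. The sketch below shows one possible availability SLI; the `Request` shape, the choice to exclude synthetic traffic, and the "status below 500 is good" rule are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    synthetic: bool = False  # hypothetical exclusion: health checks and probes


def availability_sli(requests: list[Request]) -> float:
    """Good events divided by valid events, with synthetic traffic excluded."""
    valid = [r for r in requests if not r.synthetic]       # denominator
    if not valid:
        raise ValueError("no valid events in window")
    good = sum(1 for r in valid if r.status < 500)         # numerator
    return good / len(valid)


def budget_consumed(sli: float, slo_target: float) -> float:
    """Fraction of the error budget used; above 1.0 the budget is exhausted."""
    return (1.0 - sli) / (1.0 - slo_target)


if __name__ == "__main__":
    window = [Request(200)] * 998 + [Request(503)] * 2 + [Request(500, synthetic=True)] * 10
    sli = availability_sli(window)
    print(sli, budget_consumed(sli, 0.999))
```

In this example the SLI is 0.998 against a 0.999 target, so about twice the budget is consumed, which is the condition an error budget policy should act on.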
Mode: Investigate
- Start from verified symptoms, not assumed root causes.
- Correlate recent deploys, traffic changes, metrics, logs, traces, and dependency health.
- Read `references/investigation-workflows.md` when building the hypothesis tree or evidence order.
- Build a short hypothesis list and name the next measurement that would confirm or reject each one.
- Distinguish signal quality problems from system behavior problems.
- If the issue is actively impacting customers and needs command-and-control response, route to incident-response-engineer.
- Use `references/output-templates.md#investigation-template` for the final response.
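The hypothesis list from the steps above can be kept as plain data, so each claim is paired with the measurement that would settle it. This is a sketch of one possible shape; the `Hypothesis` class and its status values are hypothetical, not a template defined by this skill.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    claim: str
    next_measurement: str
    status: str = "open"  # "open", "confirmed", or "rejected"


def next_steps(hypotheses: list[Hypothesis]) -> list[str]:
    """Measurements still needed, in listed order, skipping settled hypotheses."""
    return [h.next_measurement for h in hypotheses if h.status == "open"]


if __name__ == "__main__":
    tree = [
        Hypothesis("Queue consumers are saturated", "consumer lag per partition"),
        Hypothesis("Recent deploy raised timeouts", "error rate split by version", "rejected"),
    ]
    print(next_steps(tree))
```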
Output Requirements
- Every design must name the key questions, signals, owners, and escalation path.
- Every review must separate missing coverage, alert quality, and observability debt.
- Every instrumentation plan must define correlation strategy and data-safety constraints.
- Every alert plan must distinguish paging from informational notifications.
- Every SLO plan must name the SLI, target, window, and error budget policy.
Scaling Strategy
- Start with the highest-value user journey or failure path before broadening coverage.
- Prefer one dependable service-level dashboard and a small alert set over wide but noisy signal sprawl.
- Expand dimensions, retention, and trace depth only after the base signal set proves useful in practice.
State Management
- Preserve correlation identifiers across service boundaries, queue hops, and async retries.
- Track alert ownership, runbook links, and SLO definitions as first-class operational metadata.
- Re-evaluate telemetry after major architecture, dependency, or traffic-shape changes.
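Preserving a correlation identifier across a queue hop usually means copying it into the message envelope on publish and restoring it on consume. A minimal stdlib sketch, assuming a hypothetical envelope with a `correlation_id` header (no real broker or tracing library is implied):

```python
import contextvars
import uuid

# Holds the correlation identifier for the unit of work currently executing.
correlation_id = contextvars.ContextVar("correlation_id", default="")


def publish(payload: dict) -> dict:
    """Wrap a payload in an envelope that carries the current identifier."""
    cid = correlation_id.get() or str(uuid.uuid4())  # mint one only if none exists
    return {"headers": {"correlation_id": cid}, "body": payload}


def consume(message: dict, attempt: int = 1) -> str:
    """Restore the identifier before handling, so retries report the same value."""
    correlation_id.set(message["headers"]["correlation_id"])
    return f"{correlation_id.get()} attempt={attempt}"


if __name__ == "__main__":
    correlation_id.set("req-42")
    msg = publish({"order": 1})
    print(consume(msg))
    print(consume(msg, attempt=2))
```

Because the identifier travels with the message rather than the process, a retry on a different worker still joins to the original request.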
Progressive Disclosure
- Do not load all references by default.
- Read only the reference files needed for the active mode:
  - signal selection work: `references/signal-selection-matrix.md`
  - alert quality work: `references/alert-anti-patterns.md`
  - SLI or SLO design: `references/sli-slo-examples.md`
  - symptom-first diagnosis: `references/investigation-workflows.md`
  - final formatting: `references/output-templates.md`
- Keep `SKILL.md` as the operator contract and use the references for matrices, examples, and output shapes.
Scope Boundaries
IS for: telemetry design, coverage reviews, instrumentation strategy, SLO definition, alert quality, cross-signal diagnosis.
NOT for: live incident command, low-level profiler output analysis, or vendor-specific configuration walkthroughs.
| Field | Value |
|---|---|
| Source Type | repo-owned |
| Display Source | github:wyattowalsh/agents |
| Source Kind | repo |
| Installability | portable command |
| Review State | reviewed |
| Target Agents | antigravity, claude-code, codex, crush, cursor, gemini-cli, github-copilot, grok, opencode |
| Field | Value |
|---|---|
| Name | observability-advisor |
| License | MIT |
| Version | 1.0.0 |
| Author | wyattowalsh |
| Field | Value |
|---|---|
| Argument Hint | [mode] [target] |
Related Skills
- incident-response-engineer: Operational incident response for triage, containment, communications, recovery, and postmortems. Use when coordinating outages or service degradation.
- performance-profiler: Performance analysis: complexity estimation, profiler output parsing, caching design, regression risk. Use for optimization guidance.
- devops-engineer: Design, optimize, and debug CI/CD pipelines. GitHub Actions and GitLab CI patterns. Use for pipeline work.
**Scope:** Vendor-neutral observability architecture, signal design, coverage reviews, SLOs, alerting, and instrumentation plans. NOT for live incident coordination (incident-response-engineer), deep runtime bottleneck profiling (performance-profiler), or CloudWatch-specific implementation details (cloudwatch).
## Reference File Index
| File | Use When |
|------|----------|
| `references/signal-selection-matrix.md` | Choosing between metrics, logs, traces, profiles, and workflow events |
| `references/alert-anti-patterns.md` | Reviewing noisy, duplicate, or unactionable alerts |
| `references/sli-slo-examples.md` | Defining availability, latency, freshness, or correctness SLIs and SLOs |
| `references/investigation-workflows.md` | Structuring symptom-first diagnosis across signals and dependency boundaries |
| `references/output-templates.md` | Formatting design, review, instrumentation, alert, SLO, and investigation deliverables |
Resources
- Skill Catalog: Browse custom and external skills.
- CLI Reference: Install and manage skills.
- agentskills.io: The open ecosystem for cross-agent skills.