Design and review logs, metrics, traces, SLOs, and alerting for reliable systems. Use for telemetry strategy and coverage gaps. NOT for live incident command or vendor-specific setup.
- incident-response-engineer: Operational incident response for triage, containment, communications, recovery, and postmortems. Use when coordinating outages or service degradation.
- performance-profiler: Performance analysis: complexity estimation, profiler output parsing, caching design, regression risk. Use for optimization guidance.
- devops-engineer: Design, optimize, and debug CI/CD pipelines. GitHub Actions and GitLab CI patterns. Use for pipeline work.
---
name: observability-advisor
description: >-
  Design and review logs, metrics, traces, SLOs, and alerting for reliable
  systems. Use for telemetry strategy and coverage gaps. NOT for live incident
  command or vendor-specific setup.
argument-hint: "<mode> [target]"
license: MIT
metadata:
  author: wyattowalsh
  version: "1.0.0"
---
# Observability Advisor
Design and review telemetry that helps teams detect, diagnose, and improve
service behavior before and during reliability problems.
**Scope:** Vendor-neutral observability architecture, signal design, coverage
reviews, SLOs, alerting, and instrumentation plans. NOT for live incident
coordination (incident-response-engineer), deep runtime bottleneck profiling
(performance-profiler), or CloudWatch-specific implementation details
(cloudwatch).
## Canonical Vocabulary
| Term | Definition |
|------|------------|
| **telemetry** | Logs, metrics, traces, profiles, and events emitted by a system |
| **signal** | A measurable indicator used to detect or explain behavior |
| **metric** | Numeric time-series measurement aggregated over time |
| **log** | Structured event record capturing context for a specific occurrence |
| **trace** | End-to-end record of work moving through distributed components |
| **span** | A timed unit of work within a trace |
| **SLI** | Concrete measurement of a user-relevant reliability property |
| **SLO** | Target threshold and window for an SLI |
| **error budget** | Allowed unreliability implied by an SLO over its window |
| **cardinality** | Number of unique label or attribute values attached to telemetry |
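
The relationship between an SLO and its error budget is plain arithmetic, so a small worked sketch may help. The `Slo` class and the helper names below are hypothetical and exist only for illustration; the sketch assumes a ratio-style SLI (good events over total events) and a target strictly between 0 and 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO record: a target ratio of good events over a window."""
    target: float      # e.g. 0.999 means 99.9% of events must be good
    window_days: int   # length of the evaluation window in days

    def __post_init__(self) -> None:
        # A target of exactly 1.0 leaves no budget, so it is rejected here.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def error_budget_fraction(slo: Slo) -> float:
    """Allowed unreliability implied by the SLO: 1 - target."""
    return 1.0 - slo.target


def allowed_bad_minutes(slo: Slo) -> float:
    """Error budget expressed as minutes of total unavailability per window."""
    return error_budget_fraction(slo) * slo.window_days * 24 * 60


def budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the budget still unspent; negative means overspent."""
    if total_events == 0:
        return 1.0  # nothing measured yet, nothing spent
    allowed_bad = error_budget_fraction(slo) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad
```

For a 99.9% target over 30 days, `allowed_bad_minutes` works out to roughly 43 minutes, and 500 bad events out of 1,000,000 leaves about half of the budget.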
## Dispatch
| $ARGUMENTS | Mode |
|------------|------|
| `design <system>` | Design an observability architecture for a service or workflow |
| `review <service or stack>` | Audit existing telemetry, dashboards, and alerts |
| `instrument <service or path>` | Plan what to emit and where to add instrumentation |
| `alert <service or journey>` | Design actionable alerting and escalation |
| `slo <service or journey>` | Define SLIs, SLOs, and error budget policy |
| `investigate <signal or symptom>` | Structure cross-signal diagnosis for an issue |
| Natural language about logs, metrics, traces, dashboards, or alerting | Auto-detect the closest mode |
| Empty | Show the mode menu with examples |
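
One way to read the dispatch table is as a small routing function. The sketch below is illustrative only: the `dispatch` function, the `KEYWORD_HINTS` mapping, and the fallback to the menu when no keyword matches are all assumptions, not part of this skill's definition.

```python
import re

# Explicit modes taken from the dispatch table.
MODES = ("design", "review", "instrument", "alert", "slo", "investigate")

# Hypothetical keyword hints for natural-language auto-detection.
KEYWORD_HINTS = {
    "slo": ("slo", "slos", "sli", "slis", "budget"),
    "alert": ("alert", "alerts", "alerting", "paging", "pager"),
    "instrument": ("instrument", "instrumentation", "emit"),
    "investigate": ("investigate", "diagnose", "symptom"),
    "review": ("review", "audit"),
    "design": ("design", "architecture"),
}


def dispatch(arguments: str) -> tuple[str, str]:
    """Return (mode, target) for a raw argument string."""
    text = arguments.strip()
    if not text:
        return ("menu", "")  # empty input shows the mode menu

    # Explicit form: "<mode> [target]".
    first, _, rest = text.partition(" ")
    if first.lower() in MODES:
        return (first.lower(), rest.strip())

    # Natural language: match whole words so "slow" does not trigger "slo".
    words = set(re.findall(r"[a-z]+", text.lower()))
    for mode, hints in KEYWORD_HINTS.items():
        if words.intersection(hints):
            return (mode, text)

    return ("menu", text)  # assumed fallback when nothing matches
```

For example, `dispatch("design checkout-api")` yields `("design", "checkout-api")`, while a sentence such as "our alerts are too noisy" is routed to the `alert` mode with the full sentence as the target.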
## When to Use
- A team can see failures but cannot explain them quickly
- Alerts are noisy, late, or missing user-impact context
- A service lacks clear SLIs, SLOs, or error budget policy
- You need to add instrumentation to a new service, workflow, or migration
- Dashboards exist but ownership, escalation, or runbook linkage is weak
## Classification Gate
- If the task is active outage coordination, use incident-response-engineer.
- If the task is CPU, memory, query, or runtime hotspot analysis, use performance-profiler.
- If the task is AWS-native dashboard, alarm, or log-group setup, use cloudwatch.
- If the task is CI, deploy, or platform rollout wiring, use devops-engineer.
3. Read `references/investigation-workflows.md` when building the hypothesis tree or evidence order.
4. Build a short hypothesis list and name the next measurement that would confirm or reject each one.
5. Distinguish signal quality problems from system behavior problems.
6. If the issue is actively impacting customers and needs command-and-control response, route to incident-response-engineer.
7. Use `references/output-templates.md#investigation-template` for the final response.
## Output Requirements
- Every design must name the key questions, signals, owners, and escalation path.
- Every review must separate missing coverage, alert quality, and observability debt.
- Every instrumentation plan must define correlation strategy and data-safety constraints.
- Every alert plan must distinguish paging from informational notifications.
- Every SLO plan must name the SLI, target, window, and error budget policy.
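
The requirement to distinguish paging from informational notifications can be sketched as a routing decision. This is a minimal sketch under stated assumptions: `AlertCandidate`, `Route`, and the three boolean fields are hypothetical names, and the two-step rule simply restates the "page only on symptoms or leading indicators that demand operator action" rule from Critical Rules.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"
    TICKET = "ticket"
    DASHBOARD = "dashboard"


@dataclass(frozen=True)
class AlertCandidate:
    """Hypothetical description of a proposed alert."""
    name: str
    symptom_or_leading_indicator: bool  # reflects or predicts user impact
    needs_operator_action_now: bool     # a human must act immediately
    needs_follow_up: bool               # worth tracked work, but not urgent


def route(candidate: AlertCandidate) -> Route:
    """Page only when the signal is a symptom and demands action now."""
    if candidate.symptom_or_leading_indicator and candidate.needs_operator_action_now:
        return Route.PAGE
    if candidate.needs_follow_up:
        return Route.TICKET
    return Route.DASHBOARD  # review-only signal


candidates = [
    AlertCandidate("checkout error ratio above SLO burn", True, True, False),
    AlertCandidate("disk 70% full", False, False, True),
    AlertCandidate("cpu above 80% for 5m", False, False, False),
]
plan = {c.name: route(c).value for c in candidates}
```

Here `plan` maps the checkout alert to `page`, the disk alert to `ticket`, and the CPU alert to `dashboard`.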
## Critical Rules
1. Reject telemetry plans that optimize infrastructure visibility while leaving user-impact questions unanswered.
2. Require a stable request, workflow, or journey identifier whenever the proposed design needs cross-signal correlation.
3. Reject labels, fields, or exemplars that create avoidable cardinality explosions or expose raw PII.
4. Keep dashboards, alerts, and runbooks as separate deliverables; do not collapse them into one artifact or one ownerless checklist.
5. Page only on symptoms or leading indicators that demand operator action; downgrade the rest to ticket, dashboard, or review-only signals.
6. Redirect vendor-specific setup, implementation commands, or managed-service configuration to the relevant platform skill instead of inventing provider steps here.
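
Rule 3 can be partly checked mechanically. The sketch below estimates a worst-case series count as the product of distinct values per label and flags label keys that look like raw PII or unbounded identifiers. The threshold, the key pattern, and the `UNBOUNDED_KEYS` set are assumptions chosen for illustration; real limits depend on the metrics backend and the data policy in force.

```python
import re

# Assumed limits and patterns; tune per backend and data policy.
MAX_SERIES_PER_METRIC = 10_000
PII_KEY_PATTERN = re.compile(r"(email|phone|ssn|ip_address|full_name)", re.IGNORECASE)
UNBOUNDED_KEYS = {"user_id", "request_id", "session_id", "trace_id"}


def estimated_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series: product of distinct values per label."""
    total = 1
    for count in label_value_counts.values():
        total *= count
    return total


def review_labels(label_value_counts: dict[str, int]) -> list[str]:
    """Return findings for one metric's proposed label set."""
    findings = []
    for key in label_value_counts:
        if PII_KEY_PATTERN.search(key):
            findings.append(f"{key}: possible raw PII in a label")
        if key in UNBOUNDED_KEYS:
            findings.append(f"{key}: unbounded identifier, keep it in logs or traces")
    series = estimated_series(label_value_counts)
    if series > MAX_SERIES_PER_METRIC:
        findings.append(f"estimated {series} series exceeds {MAX_SERIES_PER_METRIC}")
    return findings
```

A label set such as `{"method": 5, "status_class": 5, "route": 40}` stays at 1,000 series and produces no findings, whereas adding `user_id` with millions of values is flagged twice: once as an unbounded identifier and once for the series estimate.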
## Scaling Strategy
- Start with the highest-value user journey or failure path before broadening coverage.
- Prefer one dependable service-level dashboard and a small alert set over wide but noisy signal sprawl.
- Expand dimensions, retention, and trace depth only after the base signal set proves useful in practice.
## State Management
- Preserve correlation identifiers across service boundaries, queue hops, and async retries.
- Track alert ownership, runbook links, and SLO definitions as first-class operational metadata.
- Re-evaluate telemetry after major architecture, dependency, or traffic-shape changes.
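
Preserving a correlation identifier across queue hops and retries can be sketched with an in-memory message type. Everything here is hypothetical: the `Message` class, the `correlation_id` header name, and the helper functions stand in for whatever transport and logging layer a real service uses.

```python
import uuid
from dataclasses import dataclass, field

CORRELATION_KEY = "correlation_id"  # hypothetical header name


@dataclass
class Message:
    """Hypothetical queue message carrying headers alongside its body."""
    body: str
    headers: dict[str, str] = field(default_factory=dict)
    attempt: int = 1


def ensure_correlation_id(headers: dict[str, str]) -> dict[str, str]:
    """Reuse the inbound identifier if present; mint one only at the edge."""
    out = dict(headers)
    out.setdefault(CORRELATION_KEY, uuid.uuid4().hex)
    return out


def enqueue(body: str, inbound_headers: dict[str, str]) -> Message:
    """Queue hop: copy the identifier from the inbound request onto the message."""
    return Message(body=body, headers=ensure_correlation_id(inbound_headers))


def retry(message: Message) -> Message:
    """Async retry: same identifier, incremented attempt counter."""
    return Message(
        body=message.body,
        headers=dict(message.headers),
        attempt=message.attempt + 1,
    )


def log_fields(message: Message, event: str) -> dict[str, object]:
    """Structured log fields that let logs join with traces and metrics."""
    return {
        "event": event,
        CORRELATION_KEY: message.headers[CORRELATION_KEY],
        "attempt": message.attempt,
    }
```

The design choice worth noting is that a retry copies the headers instead of minting a new identifier, so every attempt of the same unit of work shares one identifier and the attempt number distinguishes them.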
## Progressive Disclosure
- Do not load all references by default.
- Read only the reference files needed for the active mode:
  - signal selection work: `references/signal-selection-matrix.md`