Incidents are records of problems detected in your infrastructure. They are created automatically when:
  • An uptime monitor detects a failure (site down, slow response, SSL expiring)
  • An alert rule threshold is breached (CPU, memory, disk, etc.)
  • A CloudWatch metric crosses a critical threshold (RDS, ALB, CloudFront, EC2)
  • A cost budget is approaching or exceeded
Every incident tracks who acknowledged it, who resolved it, and how long it lasted — giving your team a clear audit trail for post-mortems.

Incident Lifecycle

Incidents follow a three-state lifecycle:
State | Meaning
Open | A problem was detected and no one has responded yet. Notifications are sent.
Acknowledged | A team member has seen the incident and is working on it. Recorded with timestamp and user.
Resolved | The problem is fixed. Can happen manually or automatically when conditions return to normal.
The typical flow is: Open → Acknowledged → Resolved
You can resolve an incident directly from Open — acknowledging first is recommended but not required.
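The lifecycle above can be sketched as a small state machine. This is an illustrative model, not the product's actual implementation; the `Incident` class and status strings are assumptions made for the example.

```python
from datetime import datetime, timezone

# Legal transitions per the lifecycle described above.
# Open -> Resolved is allowed directly; acknowledging first is optional.
ALLOWED = {
    ("open", "acknowledged"),
    ("open", "resolved"),
    ("acknowledged", "resolved"),
}

class Incident:
    """Hypothetical incident record tracking who acted and when."""

    def __init__(self):
        self.status = "open"
        self.acknowledged_by = None  # (user, timestamp) once acknowledged
        self.resolved_by = None      # (user, timestamp) once resolved

    def transition(self, new_status: str, user: str) -> None:
        if (self.status, new_status) not in ALLOWED:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        stamp = (user, datetime.now(timezone.utc))
        if new_status == "acknowledged":
            self.acknowledged_by = stamp
        else:
            self.resolved_by = stamp
        self.status = new_status
```

Note that the model rejects any backwards move (e.g., reopening a resolved incident), matching the one-way flow described above.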

Viewing Incidents

Navigate to Incidents from the sidebar. The page shows:
  • Summary cards at the top — total open, acknowledged, critical open, and resolved today
  • Incident list — sortable by date, with status badges and severity indicators

Filters

Use the filter bar to narrow down incidents:
Filter | Options
Status | All, Open, Acknowledged, Resolved
Severity | All, Critical, Warning, Info
Trigger Type | Dynamically populated from your existing incidents (e.g., monitor_down, infra_metric)
Date Range | Start and end date pickers
Filters are reflected in the URL, so you can bookmark or share filtered views.
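Because filters live in the URL as query parameters, a bookmarkable view is just a base path plus an encoded query string. The parameter names below (`status`, `severity`, `start`, `end`) are illustrative assumptions, not the product's documented names:

```python
from urllib.parse import urlencode

def build_incident_url(base: str, **filters) -> str:
    """Build a shareable filtered-view URL; drops unset filters."""
    params = {k: v for k, v in filters.items() if v is not None}
    return f"{base}?{urlencode(params)}" if params else base

url = build_incident_url(
    "/incidents",
    status="open",
    severity="critical",
    start="2024-01-01",
    end="2024-01-31",
)
```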

Managing Incidents

Acknowledge

Acknowledging signals to your team that someone is looking at the problem. Click an incident to expand it, then click Acknowledge. You can optionally add notes explaining what you’re doing.
  • Records your name and timestamp
  • Sends an acknowledgement notification to configured channels (Slack, email, etc.)
  • Logged in the audit trail
Acknowledging incidents quickly — even before fixing them — reduces noise for the rest of the team.

Resolve

Click Resolve on an open or acknowledged incident. Add resolution notes to document what was done.
  • Records your name and timestamp
  • Resets the notification rule cooldown so future alerts on the same rule can fire immediately
  • Sends a resolution notification to configured channels
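The cooldown reset is worth understanding: while an incident is within its rule's cooldown, the rule stays quiet; resolving clears that timer so a recurrence can fire immediately. A minimal sketch of that behavior, with hypothetical class and method names:

```python
class AlertRule:
    """Illustrative cooldown model: after firing, a rule is suppressed
    until the cooldown elapses OR its incident is resolved."""

    def __init__(self, cooldown_seconds: float):
        self.cooldown_seconds = cooldown_seconds
        self.last_fired_at = None  # epoch seconds of the last firing

    def can_fire(self, now: float) -> bool:
        if self.last_fired_at is None:
            return True
        return now - self.last_fired_at >= self.cooldown_seconds

    def fire(self, now: float) -> None:
        self.last_fired_at = now

    def reset_cooldown(self) -> None:
        # Called when the associated incident is resolved, so a
        # persisting problem can open a new incident right away.
        self.last_fired_at = None
```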

Delete

Superadmins can delete incidents for cleanup (e.g., test incidents or false positives). This action is permanent and logged in the audit trail.

Bulk Actions

Select multiple incidents using the checkboxes, then use the bulk action buttons to Acknowledge or Resolve all selected incidents at once. Bulk actions require admin or superadmin role.
Bulk resolve resets the cooldown on all associated notification rules. If the underlying problem persists, new incidents will be created immediately.

Diagnostics

When an incident is linked to an infrastructure agent, you can run remote diagnostic commands directly from the incident detail panel. This helps you triage without SSH-ing into the server. Available diagnostics include:
  • System uptime
  • Top processes by CPU/memory usage
  • Memory and swap usage
  • Disk space usage
  • Failed services
  • Recent error logs
  • Listening ports
  • Container status
  • Network connectivity
Click Run Diagnostics on an incident to dispatch commands to the agent. Results appear inline with parsed summaries highlighting critical findings (e.g., “Memory usage: 92%”).
Diagnostics require the diagnostics module on your plan, and the infrastructure agent must be online. If the agent is offline, diagnostic commands will fail.
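The inline summaries flag critical readings from raw command output (e.g., "Memory usage: 92%"). A hedged sketch of how such a parser might pick out high percentage readings; the threshold and regex are assumptions for illustration:

```python
import re

def summarize(output: str, warn_at: int = 85) -> list[str]:
    """Return lines whose percentage reading meets or exceeds warn_at."""
    findings = []
    for line in output.splitlines():
        # Matches readings like "Memory usage: 92%" or "Disk usage: 40%".
        m = re.search(r"(\w[\w ]*):\s*(\d+)%", line)
        if m and int(m.group(2)) >= warn_at:
            findings.append(f"{m.group(1)}: {m.group(2)}%")
    return findings
```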

Severity Levels

Severity | Meaning | Examples
Critical | Immediate action required. Service is down or at risk of failure. | Site unreachable, CPU > 90%, disk > 95%, RDS storage < 1 GB
Warning | Attention needed. Something is degraded but not yet critical. | Slow response time, CPU > 75%, memory > 85%, budget approaching limit
Info | Informational. No immediate action needed. | SSL certificate expiring in 30 days, monitor recovered, budget notification
Severity is set by the alert rule that created the incident.

Trigger Types

Trigger Type | Source | Description
monitor_down | Uptime Monitoring | HTTP check failed after all retry attempts
monitor_degraded | Uptime Monitoring | Response time exceeded threshold
ssl_expiring | Uptime Monitoring | SSL certificate approaching expiration
infra_metric | Infrastructure Agent | Host metric (CPU, memory, disk, network) crossed alert rule threshold
cloudwatch_metric | CloudWatch | AWS managed service metric breached threshold
budget_threshold | Cost Tracking | Budget usage approaching the limit
budget_exceeded | Cost Tracking | Budget limit exceeded

Auto-Resolution

Many incidents resolve automatically when the underlying condition returns to normal:
  • Uptime monitors: When a site comes back up, the incident is auto-resolved after the monitor passes the recovery threshold (default: 2 consecutive successful checks). This prevents false recoveries from a single lucky check.
  • Infrastructure metrics: When a metric drops back below the alert threshold for N consecutive evaluations (default: 2), the incident is auto-resolved. Evaluations run every 2 minutes.
  • CloudWatch metrics: Similar to infra metrics, but with a default recovery threshold of 3 consecutive evaluations.
Auto-resolved incidents show “System” as the resolver instead of a user name.
If you’re seeing incidents auto-resolve and then immediately re-open, increase the recovery threshold on the alert rule to filter out flapping.
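The recovery-threshold behavior above reduces flapping by requiring N consecutive healthy evaluations before auto-resolving. A minimal sketch of that counter, under the assumption that each evaluation reports a single healthy/unhealthy reading:

```python
class RecoveryTracker:
    """Illustrative recovery counter: auto-resolve only after
    `threshold` consecutive healthy evaluations (default 2, as above)."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.consecutive_ok = 0

    def evaluate(self, healthy: bool) -> bool:
        """Record one evaluation; return True when auto-resolve should fire."""
        self.consecutive_ok = self.consecutive_ok + 1 if healthy else 0
        return self.consecutive_ok >= self.threshold
```

Raising `threshold` is exactly the "increase the recovery threshold" fix for flapping: a single lucky healthy reading resets nothing, but it no longer resolves the incident either.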

Troubleshooting

If an incident isn't being created:
  • Verify the alert rule is enabled (not toggled off)
  • Check that the metric is being collected — go to the agent or monitor page and confirm data is flowing
  • Check the alert rule cooldown — if a recent incident already fired, the rule won't trigger again until the cooldown expires or the previous incident is resolved
  • For CloudWatch metrics, ensure the cloud account has the required IAM permissions

If you're getting too many incidents:
  • Increase the breach threshold on the alert rule so the metric must exceed the threshold for multiple evaluations before firing
  • Increase the cooldown period to space out repeated alerts
  • Use per-agent overrides if specific servers are naturally noisy

If an incident isn't auto-resolving:
  • Verify the metric is actually back below the threshold — check the agent or monitor page
  • Ensure the infrastructure agent or uptime monitor is still running and reporting data
  • The recovery threshold requires multiple consecutive good readings — wait for the full cycle

If diagnostics aren't available:
  • The incident must be linked to an infrastructure agent (trigger type: infra_metric)
  • Your plan must include the diagnostics module
  • The agent must be online — check agent status at Settings > Infra Agents

If you can't acknowledge or resolve:
  • Acknowledging and resolving require admin or superadmin role
  • Viewers can see incidents but cannot change their status
  • If you need to take action, ask your organization admin to upgrade your role