- An uptime monitor detects a failure (site down, slow response, SSL expiring)
- An alert rule threshold is breached (CPU, memory, disk, etc.)
- A CloudWatch metric crosses a critical threshold (RDS, ALB, CloudFront, EC2)
- A cost budget is approaching or exceeded
Incident Lifecycle
Incidents follow a three-state lifecycle:

| State | Meaning |
|---|---|
| Open | A problem was detected and no one has responded yet. Notifications are sent. |
| Acknowledged | A team member has seen the incident and is working on it. Recorded with timestamp and user. |
| Resolved | The problem is fixed. Can happen manually or automatically when conditions return to normal. |
You can resolve an incident directly from Open — acknowledging first is recommended but not required.
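
The lifecycle above can be read as a small state machine. The following is a minimal sketch, not the product's implementation: the `State`, `ALLOWED`, and `Incident` names are hypothetical, and treating Resolved as a terminal state is an assumption, since the text does not say whether an incident can be reopened.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class State(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


# Transitions described in the text: Open -> Acknowledged -> Resolved,
# plus the direct Open -> Resolved shortcut.
# Assumption: nothing leaves Resolved (reopening is not documented).
ALLOWED = {
    State.OPEN: {State.ACKNOWLEDGED, State.RESOLVED},
    State.ACKNOWLEDGED: {State.RESOLVED},
    State.RESOLVED: set(),
}


@dataclass
class Incident:
    state: State = State.OPEN
    history: list = field(default_factory=list)

    def transition(self, new_state: State, user: Optional[str] = None) -> None:
        """Move to new_state, recording the user and a UTC timestamp."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"cannot go from {self.state.value} to {new_state.value}"
            )
        # user stays None when the change is automatic (auto-resolution).
        self.history.append((new_state, user, datetime.now(timezone.utc)))
        self.state = new_state
```

Calling `Incident().transition(State.RESOLVED, user="alice")` succeeds without an acknowledgement step, which matches the note that acknowledging first is recommended but not required.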
Viewing Incidents
Navigate to Incidents from the sidebar. The page shows:
- Summary cards at the top: total open, acknowledged, critical open, and resolved today
- Incident list: sortable by date, with status badges and severity indicators
Filters
Use the filter bar to narrow down incidents:

| Filter | Options |
|---|---|
| Status | All, Open, Acknowledged, Resolved |
| Severity | All, Critical, Warning, Info |
| Trigger Type | Dynamically populated from your existing incidents (e.g., monitor_down, infra_metric) |
| Date Range | Start and end date pickers |
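
To illustrate how the filters combine, here is a rough sketch that treats each filter as an optional condition and keeps only incidents matching all of them. The `IncidentRow` fields and the function names are hypothetical and do not reflect the product's data model or API. Combining filters with AND is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class IncidentRow:
    status: str        # "open", "acknowledged" or "resolved"
    severity: str      # "critical", "warning" or "info"
    trigger_type: str  # e.g. "monitor_down", "infra_metric"
    created: date


def filter_incidents(
    incidents: Iterable[IncidentRow],
    status: Optional[str] = None,
    severity: Optional[str] = None,
    trigger_type: Optional[str] = None,
    start: Optional[date] = None,
    end: Optional[date] = None,
) -> List[IncidentRow]:
    """Keep incidents matching every filter that is set (None means "All")."""
    def keep(i: IncidentRow) -> bool:
        return (
            (status is None or i.status == status)
            and (severity is None or i.severity == severity)
            and (trigger_type is None or i.trigger_type == trigger_type)
            and (start is None or i.created >= start)
            and (end is None or i.created <= end)
        )
    return [i for i in incidents if keep(i)]


def trigger_type_options(incidents: Iterable[IncidentRow]) -> List[str]:
    """Build the Trigger Type choices from the incidents that exist."""
    return sorted({i.trigger_type for i in incidents})
```

The second function mirrors the note that the Trigger Type filter is populated from your existing incidents rather than from a fixed list.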
Managing Incidents
Acknowledge
Acknowledging signals to your team that someone is looking at the problem. Click an incident to expand it, then click Acknowledge. You can optionally add notes explaining what you’re doing.
- Records your name and timestamp
- Sends an acknowledgement notification to configured channels (Slack, email, etc.)
- Logged in the audit trail
Resolve
Click Resolve on an open or acknowledged incident. Add resolution notes to document what was done.
- Records your name and timestamp
- Resets the notification rule cooldown so future alerts on the same rule can fire immediately
- Sends a resolution notification to configured channels
Delete
Superadmins can delete incidents for cleanup (e.g., test incidents or false positives). This action is permanent and logged in the audit trail.
Bulk Actions
Select multiple incidents using the checkboxes, then use the bulk action buttons to Acknowledge or Resolve all selected incidents at once. Bulk actions require admin or superadmin role.
Diagnostics
When an incident is linked to an infrastructure agent, you can run remote diagnostic commands directly from the incident detail panel. This helps you triage without SSH-ing into the server. Available diagnostics include:
- System uptime
- Top processes by CPU/memory usage
- Memory and swap usage
- Disk space usage
- Failed services
- Recent error logs
- Listening ports
- Container status
- Network connectivity
Diagnostics require the diagnostics module on your plan, and the infrastructure agent must be online. If the agent is offline, diagnostic commands will fail.
Severity Levels
| Severity | Meaning | Examples |
|---|---|---|
| Critical | Immediate action required. Service is down or at risk of failure. | Site unreachable, CPU > 90%, disk > 95%, RDS storage < 1 GB |
| Warning | Attention needed. Something is degraded but not yet critical. | Slow response time, CPU > 75%, memory > 85%, budget approaching limit |
| Info | Informational. No immediate action needed. | SSL certificate expiring in 30 days, monitor recovered, budget notification |
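
As a worked example of how a single metric maps onto these levels, the sketch below classifies a CPU reading using the example values from the table (above 90% is Critical, above 75% is Warning). The function name is hypothetical, and the cut-offs are taken from the table's examples only. Your own alert rules may use different thresholds.

```python
from typing import Optional


def classify_cpu(cpu_percent: float) -> Optional[str]:
    """Map a CPU reading to a severity, or None if no rule is breached.

    The 90 and 75 cut-offs are the illustrative values from the
    severity table, not necessarily the configured defaults.
    """
    if cpu_percent > 90:
        return "critical"
    if cpu_percent > 75:
        return "warning"
    return None
```

The stricter check runs first so that a reading of 95% is reported once, as Critical, rather than matching both rules.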
Trigger Types
| Trigger Type | Source | Description |
|---|---|---|
| monitor_down | Uptime Monitoring | HTTP check failed after all retry attempts |
| monitor_degraded | Uptime Monitoring | Response time exceeded threshold |
| ssl_expiring | Uptime Monitoring | SSL certificate approaching expiration |
| infra_metric | Infrastructure Agent | Host metric (CPU, memory, disk, network) crossed alert rule threshold |
| cloudwatch_metric | CloudWatch | AWS managed service metric breached threshold |
| budget_threshold | Cost Tracking | Budget usage approaching the limit |
| budget_exceeded | Cost Tracking | Budget limit exceeded |
Auto-Resolution
Many incidents resolve automatically when the underlying condition returns to normal:
- Uptime monitors: When a site comes back up, the incident is auto-resolved after the monitor passes the recovery threshold (default: 2 consecutive successful checks). This prevents false recoveries from a single lucky check.
- Infrastructure metrics: When a metric drops back below the alert threshold for N consecutive evaluations (default: 2), the incident is auto-resolved. Evaluations run every 2 minutes.
- CloudWatch metrics: Similar to infra metrics, but with a default recovery threshold of 3 consecutive evaluations.
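
The recovery threshold described above amounts to counting consecutive healthy results and resetting the count on any failure. The following is a minimal sketch of that idea. The `RecoveryTracker` class is hypothetical and is not the product's code.

```python
class RecoveryTracker:
    """Counts consecutive healthy results for one open incident."""

    def __init__(self, recovery_threshold: int = 2):
        # 2 is the documented default for uptime monitors and infra
        # metrics; CloudWatch metrics default to 3.
        self.recovery_threshold = recovery_threshold
        self.consecutive_ok = 0

    def observe(self, healthy: bool) -> bool:
        """Feed one check or evaluation; True means auto-resolve now."""
        if healthy:
            self.consecutive_ok += 1
        else:
            # A single bad reading restarts the recovery count.
            self.consecutive_ok = 0
        return self.consecutive_ok >= self.recovery_threshold
```

With the default threshold, the sequence healthy, unhealthy, healthy, healthy only signals recovery on the fourth result, because the unhealthy reading resets the count. This is how a single lucky check is kept from closing the incident.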
Troubleshooting
Incidents are not being created
- Verify the alert rule is enabled (not toggled off)
- Check that the metric is being collected — go to the agent or monitor page and confirm data is flowing
- Check the alert rule cooldown — if a recent incident already fired, the rule won’t trigger again until the cooldown expires or the previous incident is resolved
- For CloudWatch metrics, ensure the cloud account has the required IAM permissions
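
The cooldown behaviour mentioned in the list above can be sketched as a single check. This is an illustration under stated assumptions: `can_fire` and its parameters are hypothetical names, and the logic simply encodes the two documented conditions (the cooldown has expired, or the previous incident has been resolved).

```python
from datetime import datetime, timedelta
from typing import Optional


def can_fire(
    now: datetime,
    last_fired: Optional[datetime],
    cooldown: timedelta,
    previous_resolved: bool,
) -> bool:
    """Return True if the rule is allowed to create a new incident."""
    if last_fired is None:
        return True  # the rule has never fired
    if previous_resolved:
        return True  # resolving resets the cooldown
    return now - last_fired >= cooldown
```

If a rule seems silent, checking these two inputs (when it last fired, and whether that incident is still open) usually explains why.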
Too many incidents for the same issue
- Increase the breach threshold on the alert rule so the metric must exceed the threshold for multiple evaluations before firing
- Increase the cooldown period to space out repeated alerts
- Use per-agent overrides if specific servers are naturally noisy
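
To see why a higher breach threshold reduces noise, the sketch below counts how often a rule would fire on a spiky metric for different breach thresholds. It is a simplified model with hypothetical names: it ignores cooldowns and open-incident state, and only shows the effect of requiring several consecutive breaching evaluations.

```python
from typing import Iterable


def count_firings(values: Iterable[float], limit: float, breach_threshold: int) -> int:
    """Count how many times a rule fires on a series of evaluations.

    The rule fires once when the metric has exceeded `limit` for
    `breach_threshold` consecutive evaluations.
    """
    firings = 0
    consecutive = 0
    for value in values:
        if value > limit:
            consecutive += 1
            if consecutive == breach_threshold:
                firings += 1
        else:
            consecutive = 0
    return firings


cpu = [95, 60, 96, 61, 97, 98, 99, 50]
print(count_firings(cpu, limit=90, breach_threshold=1))  # 3
print(count_firings(cpu, limit=90, breach_threshold=3))  # 1
```

With a breach threshold of 1, each short spike fires. With a threshold of 3, only the sustained run of 97, 98, 99 does.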
Incidents are not auto-resolving
- Verify the metric is actually back below the threshold — check the agent or monitor page
- Ensure the infrastructure agent or uptime monitor is still running and reporting data
- The recovery threshold requires multiple consecutive good readings — wait for the full cycle
Diagnostics button is not showing
Cannot acknowledge or resolve incidents
- Acknowledging and resolving require admin or superadmin role
- Viewers can see incidents but cannot change their status
- If you need to take action, ask your organization admin to upgrade your role
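
The role requirements scattered through this page can be summarised as a small permission table. The sketch below is an illustration only: the action names and the `can` helper are hypothetical, the role list may be incomplete, and restricting delete to superadmins is inferred from the Delete section rather than stated as an exclusive rule.

```python
# Actions each role may perform, as described in this page.
# Assumption: only the three roles mentioned in the text exist.
PERMISSIONS = {
    "viewer": {"view"},
    "admin": {"view", "acknowledge", "resolve", "bulk_action"},
    "superadmin": {"view", "acknowledge", "resolve", "bulk_action", "delete"},
}


def can(role: str, action: str) -> bool:
    """Return True if the role is allowed to perform the action."""
    return action in PERMISSIONS.get(role, set())
```

For example, `can("viewer", "resolve")` is `False`, which corresponds to a viewer seeing the incident but being unable to change its status.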

