Incidents are records of problems detected in your infrastructure. They are created automatically when:
  • An uptime monitor detects a failure (site down, slow response, SSL expiring)
  • An alert rule threshold is breached (CPU, memory, disk, etc.)
  • A CloudWatch metric crosses a critical threshold (RDS, ALB, CloudFront, EC2)
  • A cost budget is approaching or exceeded
Every incident tracks who acknowledged it, who resolved it, and how long it lasted — giving your team a clear audit trail for post-mortems.

Incident Lifecycle

Incidents follow a three-state lifecycle:
State | Meaning
Open | A problem was detected and no one has responded yet. Notifications are sent.
Acknowledged | A team member has seen the incident and is working on it. Recorded with timestamp and user.
Resolved | The problem is fixed. Can happen manually or automatically when conditions return to normal.
The typical flow is: Open → Acknowledged → Resolved
You can resolve an incident directly from Open — acknowledging first is recommended but not required.
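The lifecycle above can be sketched as a small state machine. This is an illustrative model, not the product's actual implementation; the `Incident` class and status strings are assumptions made for the example.

```python
from datetime import datetime, timezone

# Legal transitions per the lifecycle described above.
# Open -> Resolved is allowed directly; acknowledging first is optional.
ALLOWED = {
    ("open", "acknowledged"),
    ("open", "resolved"),
    ("acknowledged", "resolved"),
}

class Incident:
    """Hypothetical incident record tracking who acted and when."""

    def __init__(self):
        self.status = "open"
        self.acknowledged_by = None  # (user, timestamp) once acknowledged
        self.resolved_by = None      # (user, timestamp) once resolved

    def transition(self, new_status: str, user: str) -> None:
        if (self.status, new_status) not in ALLOWED:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        stamp = (user, datetime.now(timezone.utc))
        if new_status == "acknowledged":
            self.acknowledged_by = stamp
        else:
            self.resolved_by = stamp
        self.status = new_status
```

Note that the model rejects any backwards move (e.g., reopening a resolved incident), matching the one-way flow described above.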

Viewing Incidents

Navigate to Incidents from the sidebar. The page shows:
  • Summary cards at the top — total open, acknowledged, critical open, and resolved today
  • Incident list — sortable by date, with status badges and severity indicators

Filters

Use the filter bar to narrow down incidents:
Filter | Options
Status | All, Open, Acknowledged, Resolved
Severity | All, Critical, Warning, Info
Trigger Type | Dynamically populated from your existing incidents (e.g., monitor_down, infra_metric)
Date Range | Start and end date pickers
Filters are reflected in the URL, so you can bookmark or share filtered views.
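Because filters live in the URL as query parameters, a bookmarkable view is just a base path plus an encoded query string. The parameter names below (`status`, `severity`, `start`, `end`) are illustrative assumptions, not the product's documented names:

```python
from urllib.parse import urlencode

def build_incident_url(base: str, **filters) -> str:
    """Build a shareable filtered-view URL; drops unset filters."""
    params = {k: v for k, v in filters.items() if v is not None}
    return f"{base}?{urlencode(params)}" if params else base

url = build_incident_url(
    "/incidents",
    status="open",
    severity="critical",
    start="2024-01-01",
    end="2024-01-31",
)
```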

Managing Incidents

Acknowledge

Acknowledging signals to your team that someone is looking at the problem. Click an incident to expand it, then click Acknowledge. You can optionally add notes explaining what you’re doing.
  • Records your name and timestamp
  • Sends an acknowledgement notification to configured channels (Slack, email, etc.)
  • Logged in the audit trail
Acknowledging incidents quickly — even before fixing them — reduces noise for the rest of the team.

Resolve

Click Resolve on an open or acknowledged incident. Add resolution notes to document what was done.
  • Records your name and timestamp
  • Resets the notification rule cooldown so future alerts on the same rule can fire immediately
  • Sends a resolution notification to configured channels
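The cooldown reset is worth understanding: while an incident is within its rule's cooldown, the rule stays quiet; resolving clears that timer so a recurrence can fire immediately. A minimal sketch of that behavior, with hypothetical class and method names:

```python
class AlertRule:
    """Illustrative cooldown model: after firing, a rule is suppressed
    until the cooldown elapses OR its incident is resolved."""

    def __init__(self, cooldown_seconds: float):
        self.cooldown_seconds = cooldown_seconds
        self.last_fired_at = None  # epoch seconds of the last firing

    def can_fire(self, now: float) -> bool:
        if self.last_fired_at is None:
            return True
        return now - self.last_fired_at >= self.cooldown_seconds

    def fire(self, now: float) -> None:
        self.last_fired_at = now

    def reset_cooldown(self) -> None:
        # Called when the associated incident is resolved, so a
        # persisting problem can open a new incident right away.
        self.last_fired_at = None
```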

Delete

Superadmins can delete incidents for cleanup (e.g., test incidents or false positives). This action is permanent and logged in the audit trail.

Bulk Actions

Select multiple incidents using the checkboxes, then use the bulk action buttons to Acknowledge or Resolve all selected incidents at once. Bulk actions require admin or superadmin role.
Bulk resolve resets the cooldown on all associated notification rules. If the underlying problem persists, new incidents will be created immediately.

Diagnostics

When an incident is linked to an infrastructure agent, you can run remote diagnostic commands directly from the incident detail panel. This helps you triage without SSH-ing into the server. Available diagnostics include:
  • System uptime
  • Top processes by CPU/memory usage
  • Memory and swap usage
  • Disk space usage
  • Failed services
  • Recent error logs
  • Listening ports
  • Container status
  • Network connectivity
Click Run Diagnostics on an incident to dispatch commands to the agent. Results appear inline with parsed summaries highlighting critical findings (e.g., “Memory usage: 92%”).
Diagnostics require the diagnostics module on your plan, and the infrastructure agent must be online. If the agent is offline, diagnostic commands will fail.
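The inline summaries flag critical readings from raw command output (e.g., "Memory usage: 92%"). A hedged sketch of how such a parser might pick out high percentage readings; the threshold and regex are assumptions for illustration:

```python
import re

def summarize(output: str, warn_at: int = 85) -> list[str]:
    """Return lines whose percentage reading meets or exceeds warn_at."""
    findings = []
    for line in output.splitlines():
        # Matches readings like "Memory usage: 92%" or "Disk usage: 40%".
        m = re.search(r"(\w[\w ]*):\s*(\d+)%", line)
        if m and int(m.group(2)) >= warn_at:
            findings.append(f"{m.group(1)}: {m.group(2)}%")
    return findings
```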

Severity Levels

Severity | Meaning | Examples
Critical | Immediate action required. Service is down or at risk of failure. | Site unreachable, CPU > 90%, disk > 95%, RDS storage < 1 GB
Warning | Attention needed. Something is degraded but not yet critical. | Slow response time, CPU > 75%, memory > 85%, budget approaching limit
Info | Informational. No immediate action needed. | SSL certificate expiring in 30 days, monitor recovered, budget notification
Severity is set by the alert rule that created the incident.

Trigger Types

Trigger Type | Source | Description
monitor_down | Uptime Monitoring | HTTP check failed after all retry attempts
monitor_degraded | Uptime Monitoring | Response time exceeded threshold
ssl_expiring | Uptime Monitoring | SSL certificate approaching expiration
infra_metric | Infrastructure Agent | Host metric (CPU, memory, disk, network) crossed alert rule threshold
cloudwatch_metric | CloudWatch | AWS managed service metric breached threshold
budget_threshold | Cost Tracking | Budget usage approaching the limit
budget_exceeded | Cost Tracking | Budget limit exceeded

Auto-Resolution

Many incidents resolve automatically when the underlying condition returns to normal:
  • Uptime monitors: When a site comes back up, the incident is auto-resolved after the monitor passes the recovery threshold (default: 2 consecutive successful checks). This prevents false recoveries from a single lucky check.
  • Infrastructure metrics: When a metric drops back below the alert threshold for N consecutive evaluations (default: 2), the incident is auto-resolved. Evaluations run every 2 minutes.
  • CloudWatch metrics: Similar to infra metrics, but with a default recovery threshold of 3 consecutive evaluations.
Auto-resolved incidents show “System” as the resolver instead of a user name.
If you’re seeing incidents auto-resolve and then immediately re-open, increase the recovery threshold on the alert rule to filter out flapping.
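The recovery-threshold behavior above reduces flapping by requiring N consecutive healthy evaluations before auto-resolving. A minimal sketch of that counter, under the assumption that each evaluation reports a single healthy/unhealthy reading:

```python
class RecoveryTracker:
    """Illustrative recovery counter: auto-resolve only after
    `threshold` consecutive healthy evaluations (default 2, as above)."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.consecutive_ok = 0

    def evaluate(self, healthy: bool) -> bool:
        """Record one evaluation; return True when auto-resolve should fire."""
        self.consecutive_ok = self.consecutive_ok + 1 if healthy else 0
        return self.consecutive_ok >= self.threshold
```

Raising `threshold` is exactly the "increase the recovery threshold" fix for flapping: a single lucky healthy reading resets nothing, but it no longer resolves the incident either.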

Troubleshooting

If an incident isn't being created:
  • Verify the alert rule is enabled (not toggled off)
  • Check that the metric is being collected — go to the agent or monitor page and confirm data is flowing
  • Check the alert rule cooldown — if a recent incident already fired, the rule won't trigger again until the cooldown expires or the previous incident is resolved
  • For CloudWatch metrics, ensure the cloud account has the required IAM permissions

If you're getting too many incidents:
  • Increase the breach threshold on the alert rule so the metric must exceed the threshold for multiple evaluations before firing
  • Increase the cooldown period to space out repeated alerts
  • Use per-agent overrides if specific servers are naturally noisy

If an incident isn't auto-resolving:
  • Verify the metric is actually back below the threshold — check the agent or monitor page
  • Ensure the infrastructure agent or uptime monitor is still running and reporting data
  • The recovery threshold requires multiple consecutive good readings — wait for the full cycle

If diagnostics aren't available:
  • The incident must be linked to an infrastructure agent (trigger type: infra_metric)
  • Your plan must include the diagnostics module
  • The agent must be online — check agent status at Settings > Infra Agents

If you can't acknowledge or resolve:
  • Acknowledging and resolving require admin or superadmin role
  • Viewers can see incidents but cannot change their status
  • If you need to take action, ask your organization admin to upgrade your role