Overview
EdgeFlow's alert system monitors device metrics and events, triggering notifications
when configurable thresholds are breached. Alert rules support multiple trigger types
and can send notifications through email, Slack, webhooks, and SMS channels.
Alert Rule Types
| Type | Description | Example |
|------|-------------|---------|
| metric_threshold | Trigger when a metric exceeds a value | CPU > 90% for 5 minutes |
| event_count | Trigger when event frequency exceeds a count | More than 10 errors in 1 hour |
| anomaly | ML-based anomaly detection on metrics | Unusual memory usage pattern |
| status_change | Trigger on device or flow state change | Device goes offline |
Severity Levels
| Severity | Description |
|----------|-------------|
| Critical | Immediate attention required, service impact |
| High | Significant issue, may cause service degradation |
| Medium | Notable condition, should be investigated |
| Low | Minor issue, informational |
| Info | Informational only, no action needed |
Alert States
┌─────────┐ threshold ┌─────────┐ acknowledge ┌──────────────┐
│ (none) │──────────────>│ Firing │────────────────>│ Acknowledged │
└─────────┘ breached └────┬────┘ └──────┬───────┘
│ │
condition auto-resolve
clears │
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ Resolved │ │ Resolved │
└──────────┘ └──────────┘
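The lifecycle in the diagram can be modeled as a small state machine. The sketch below is illustrative only: the enum values and transition rules are inferred from the diagram, not from a documented EdgeFlow API.

```python
from enum import Enum

class AlertState(Enum):
    FIRING = "firing"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"

# Transitions taken from the diagram:
#   firing -> acknowledged   (operator acknowledges)
#   firing -> resolved       (condition clears)
#   acknowledged -> resolved (auto-resolve)
TRANSITIONS = {
    AlertState.FIRING: {AlertState.ACKNOWLEDGED, AlertState.RESOLVED},
    AlertState.ACKNOWLEDGED: {AlertState.RESOLVED},
    AlertState.RESOLVED: set(),
}

def transition(current: AlertState, target: AlertState) -> AlertState:
    """Validate a transition against the diagram; reject anything else."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"invalid transition: {current.value} -> {target.value}")
    return target
```

Resolved is terminal in both branches, so it has no outgoing transitions.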
Notification Channels
Each alert rule can send notifications to multiple channels simultaneously:
| Channel | Description |
|---------|-------------|
| Email | SMTP-based email notifications with templates |
| Slack | Slack channel messages via webhook |
| Webhooks | HTTP POST to custom endpoints |
| SMS | Text messages via Twilio |
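Because a rule can target several channels at once, delivery is naturally a fan-out where one channel's failure should not block the others. A minimal sketch of that fan-out, assuming hypothetical per-channel sender callables (not part of the documented API):

```python
def fan_out(alert: dict, channels: list, senders: dict) -> dict:
    """Send `alert` through every configured channel; record per-channel success.

    `senders` maps a channel type (e.g. "email", "slack") to a
    callable(alert, config).  A raised exception from one sender is
    caught so the remaining channels still receive the notification.
    """
    results = {}
    for channel in channels:
        sender = senders.get(channel["type"])
        if sender is None:
            results[channel["type"]] = False  # unknown channel type
            continue
        try:
            sender(alert, channel.get("config", {}))
            results[channel["type"]] = True
        except Exception:
            results[channel["type"]] = False
    return results
```

The `channels` argument takes the same shape as the `notification_channels` list in the rule configuration below.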
Alert Rule Configuration
# Create an alert rule
POST /api/v1/alerts
{
  "name": "High CPU Alert",
  "description": "Alert when CPU exceeds 90% for 5 minutes",
  "type": "metric_threshold",
  "severity": "high",
  "enabled": true,
  "metric": "cpu_usage_percent",
  "condition": "greater_than",
  "threshold": 90,
  "evaluation_interval": 60,
  "evaluation_window": 300,
  "threshold_count": 5,
  "device_id": "dev_abc123",
  "notification_channels": [
    {"type": "email", "config": {"to": "ops@acme.com"}},
    {"type": "slack", "config": {"webhook_url": "https://hooks.slack.com/..."}}
  ],
  "labels": {"team": "ops", "environment": "production"},
  "annotations": {"runbook": "https://wiki.acme.com/high-cpu"}
}
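The request above can be issued with Python's standard library. The helper below only builds the request object, so it can be inspected before sending; the base URL and bearer-token auth header are assumptions for illustration, not part of the documented API.

```python
import json
import urllib.request

# Hypothetical deployment values -- substitute your own.
BASE_URL = "https://edgeflow.example.com"
API_TOKEN = "YOUR_API_TOKEN"

def build_create_request(rule: dict) -> urllib.request.Request:
    """Build the POST request for /api/v1/alerts."""
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/alerts",
        data=json.dumps(rule).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )
```

Pass the JSON body shown above as `rule`; `urllib.request.urlopen(build_create_request(rule))` then performs the call.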
Alert Rule Parameters
| Parameter | Description |
|-----------|-------------|
| evaluation_interval | How often the rule is checked (seconds) |
| evaluation_window | Time window for metric aggregation (seconds) |
| threshold_count | Number of breaches before the alert fires |
| labels | Key-value metadata for filtering and grouping |
| annotations | Additional context (runbook URLs, descriptions) |
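With the example rule above (`evaluation_interval: 60`, `evaluation_window: 300`, `threshold_count: 5`), the metric is checked once a minute and the alert fires after five breaches, matching the 5-minute window. A minimal sketch of the breach-counting logic, under the assumption that the breaches must occur in consecutive evaluations (the actual aggregation inside the window may differ):

```python
from collections import deque

def should_fire(samples, threshold, threshold_count):
    """Return True once `threshold_count` consecutive evaluations breach.

    `samples` holds one metric value per evaluation_interval tick.
    """
    recent = deque(maxlen=threshold_count)
    for value in samples:
        recent.append(value > threshold)
        if len(recent) == threshold_count and all(recent):
            return True
    return False
```

A single sub-threshold reading inside the run resets the count, which keeps brief spikes from paging anyone.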
Alert Management
# List all alerts
GET /api/v1/alerts
# Get alert details
GET /api/v1/alerts/:id
# Update alert rule
PUT /api/v1/alerts/:id
# Delete alert rule
DELETE /api/v1/alerts/:id
# Acknowledge a firing alert
POST /api/v1/alerts/:id/acknowledge
Throttling
Alert notifications include built-in throttling to prevent notification storms.
Once an alert fires, duplicate notifications for the same rule are suppressed for
a configurable period before re-alerting.
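The suppression logic can be sketched as a per-rule timestamp check. The class and parameter names below are hypothetical (the source does not name the configurable period); an injectable clock makes the behavior testable.

```python
import time

class NotificationThrottle:
    """Suppress duplicate notifications per rule for `suppress_seconds`."""

    def __init__(self, suppress_seconds, clock=time.monotonic):
        self.suppress_seconds = suppress_seconds
        self._clock = clock
        self._last_sent = {}  # rule_id -> last send time

    def allow(self, rule_id):
        """Return True if a notification for this rule may be sent now."""
        now = self._clock()
        last = self._last_sent.get(rule_id)
        if last is not None and now - last < self.suppress_seconds:
            return False  # still inside the suppression window
        self._last_sent[rule_id] = now
        return True
```

Each rule is throttled independently, so a storm on one rule does not delay notifications for unrelated rules.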