Recommended Monitoring & Alerting Strategy
Overview
This document provides a tiered approach to monitoring and alerting for your OpenWrt mesh network, building on the existing infrastructure.
Current Monitoring Stack
You already have:
- ✅ Collectd - Metrics collection (CPU, memory, disk, network, wireless)
- ✅ vnStat - Bandwidth usage tracking
- ✅ mesh-monitor.sh - Mesh health checks (every 5 minutes)
- ✅ Distributed syslog - All system logs to /x00/logs/
- ✅ LuCI Statistics - Web UI with graphs
- ✅ USB Storage - Persistent data on /x00/
Strengths:
- Comprehensive data collection
- Persistent storage
- Web interface for viewing
- Mesh-specific health monitoring
Gap:
- ❌ No proactive alerting
- ❌ Must manually check logs/graphs
- ❌ No notifications when issues occur
Tier 1: Essential Monitoring (RECOMMENDED)
Goal: Get notified about critical issues only, minimal maintenance.
What to Monitor
| Alert | Trigger | Why Critical |
|---|---|---|
| Node Offline | Can't ping node for 5+ min | Mesh partition, hardware failure |
| Disk Full | USB /x00 > 95% | Logs/monitoring will fail |
| No Mesh Neighbors | 0 neighbors for 10+ min | Mesh network broken |
| High CPU | > 90% for 15+ min | Performance issues, runaway process |
| Low Memory | < 5MB free | System instability, OOM kills |
| Gateway Down | No WAN connectivity | Internet access lost |
Implementation Option A: Simple Email Alerts (Lightest)
Best for: Home users, low complexity, no additional infrastructure.
Add alerting script to mesh nodes:
#!/bin/sh
###############################################################################
# Simple Email Alerting Script
# Sends email via external SMTP when critical issues detected
###############################################################################
ALERT_EMAIL="your-email@example.com"
SMTP_SERVER="smtp.gmail.com:587"
SMTP_USER="your-email@gmail.com"
SMTP_PASS="your-app-password" # Use app password, not real password
HOSTNAME=$(uci -q get system.@system[0].hostname || hostname)
ALERT_LOG="/x00/logs/alerts-sent.log"
# Check if alert already sent recently (debounce)
check_alert_sent() {
local alert_key="$1"
local cooldown_seconds=3600 # 1 hour
if [ -f "$ALERT_LOG" ]; then
last_alert=$(grep "$alert_key" "$ALERT_LOG" | tail -1 | cut -d' ' -f1)
if [ -n "$last_alert" ]; then
current_time=$(date +%s)
time_diff=$((current_time - last_alert))
if [ $time_diff -lt $cooldown_seconds ]; then
return 0 # Alert sent recently
fi
fi
fi
return 1 # OK to send alert
}
# Send alert email
send_alert() {
local subject="$1"
local message="$2"
local alert_key="$3"
# Check cooldown
if check_alert_sent "$alert_key"; then
return
fi
# Send email using curl. Note: mail headers (Subject, From, To) must be part
# of the uploaded message body; curl's --header does not set SMTP mail headers.
printf 'From: %s\nTo: %s\nSubject: [MESH ALERT] %s - %s\n\n%s\n' \
"$SMTP_USER" "$ALERT_EMAIL" "$HOSTNAME" "$subject" "$message" | \
curl --ssl-reqd \
--url "smtp://$SMTP_SERVER" \
--user "$SMTP_USER:$SMTP_PASS" \
--mail-from "$SMTP_USER" \
--mail-rcpt "$ALERT_EMAIL" \
--upload-file - \
2>/dev/null
# Log alert sent
echo "$(date +%s) $alert_key" >> "$ALERT_LOG"
}
# Check disk space
check_disk() {
usage=$(df /x00 | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$usage" -gt 95 ]; then
send_alert "Disk Full" "USB storage at ${usage}% - cleanup needed!" "disk-full"
fi
}
# Check mesh neighbors
check_neighbors() {
neighbor_count=$(batctl o 2>/dev/null | grep -v "No" | grep -c "^[0-9a-f]")  # grep -c always prints a count
if [ "$neighbor_count" -eq 0 ]; then
send_alert "No Mesh Neighbors" "Node is isolated - no mesh neighbors detected!" "no-neighbors"
fi
}
# Check memory
check_memory() {
free_kb=$(free | grep Mem | awk '{print $4}')
free_mb=$((free_kb / 1024))
if [ "$free_mb" -lt 5 ]; then
send_alert "Low Memory" "Only ${free_mb}MB RAM free - system may crash!" "low-memory"
fi
}
# Check CPU
check_cpu() {
cpu_idle=$(top -bn1 | grep -m1 "CPU:" | awk '{print $8}' | tr -d '%')
cpu_usage=$((100 - ${cpu_idle:-100}))  # treat a parse failure as idle
if [ "$cpu_usage" -gt 90 ]; then
send_alert "High CPU" "CPU usage at ${cpu_usage}% - performance degraded!" "high-cpu"
fi
}
# Check WAN connectivity (gateway nodes only)
check_wan() {
if batctl gw | grep -q "server"; then
if ! ping -c 3 -W 5 1.1.1.1 >/dev/null 2>&1; then
send_alert "WAN Down" "Internet connectivity lost!" "wan-down"
fi
fi
}
# Run all checks
check_disk
check_neighbors
check_memory
check_cpu
check_wan
Installation:
# Add to deployment playbook
cat > /usr/bin/mesh-alert.sh << 'EOF'
[script above]
EOF
chmod +x /usr/bin/mesh-alert.sh
# Add to cron (every 15 minutes)
(crontab -l; echo "*/15 * * * * /usr/bin/mesh-alert.sh") | crontab -
Pros:
- ✅ Simple, self-contained
- ✅ No external infrastructure
- ✅ Email = accessible anywhere
- ✅ Low resource usage
Cons:
- ❌ Requires email credentials on nodes
- ❌ Depends on node having WAN access
- ❌ Email may be delayed
Implementation Option B: Webhook Alerts (Recommended)
Best for: Users with smartphone, more flexible than email.
Use webhook services like:
- Pushover (push notifications to phone - $5 one-time)
- Telegram Bot (free, unlimited)
- Discord Webhook (free)
- Slack Webhook (free)
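Discord and Slack webhooks follow the same pattern: POST a small JSON payload to a URL you create in your server settings. A minimal sketch for Discord (the webhook URL is a placeholder; the payload builder is split out so it can be tested, and messages must not contain double quotes since no JSON escaping is done):

```shell
#!/bin/sh
# Hypothetical webhook URL - create a real one under
# Server Settings > Integrations > Webhooks in Discord
DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/ID/TOKEN"
HOSTNAME=$(uci -q get system.@system[0].hostname 2>/dev/null || hostname)

# Build the JSON body separately so it is easy to test and reuse.
# Caveat: no escaping is performed, so keep messages free of double quotes.
build_payload() {
    printf '{"content":"[%s] %s"}' "$1" "$2"
}

send_discord() {
    curl -s -H "Content-Type: application/json" \
        -d "$(build_payload "$HOSTNAME" "$1")" \
        "$DISCORD_WEBHOOK_URL" >/dev/null 2>&1
}
```

Call it the same way as `send_telegram` below, e.g. `send_discord "Disk Full: 96% used"`.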
Example: Telegram Bot Integration
#!/bin/sh
###############################################################################
# Telegram Webhook Alerting
###############################################################################
TELEGRAM_BOT_TOKEN="your-bot-token"
TELEGRAM_CHAT_ID="your-chat-id"
HOSTNAME=$(uci -q get system.@system[0].hostname || hostname)
send_telegram() {
local message="$1"
curl -s -X POST \
"https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
-d "chat_id=${TELEGRAM_CHAT_ID}" \
-d "text=π¨ *${HOSTNAME}*: ${message}" \
-d "parse_mode=Markdown" \
>/dev/null 2>&1
}
# Check disk space
usage=$(df /x00 | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$usage" -gt 95 ]; then
send_telegram "⚠️ Disk Full: ${usage}% used on /x00"
fi
# Check mesh neighbors (grep -c always prints a count)
neighbor_count=$(batctl o 2>/dev/null | grep -v "No" | grep -c "^[0-9a-f]")
if [ "$neighbor_count" -eq 0 ]; then
send_telegram "❌ No mesh neighbors detected!"
fi
# Check memory
free_mb=$(free | awk '/Mem:/ {print int($4/1024)}')
if [ "$free_mb" -lt 5 ]; then
send_telegram "⚠️ Low Memory: Only ${free_mb}MB free"
fi
# Check gateway WAN (if gateway)
if batctl gw | grep -q "server"; then
if ! ping -c 3 -W 5 1.1.1.1 >/dev/null 2>&1; then
send_telegram "❌ WAN connectivity lost!"
fi
fi
Setup Telegram Bot:
# 1. Message @BotFather on Telegram
# Send: /newbot
# Follow prompts to create bot
# Save the token
# 2. Get your chat ID
# Start chat with your bot
# Send any message
# Visit: https://api.telegram.org/bot<TOKEN>/getUpdates
# Find your chat_id in response
Pros:
- ✅ Instant push notifications
- ✅ Free (Telegram)
- ✅ Works from anywhere
- ✅ Simple webhook integration
Cons:
- ❌ Requires external service
- ❌ Token stored on node
Implementation Option C: External Monitoring (Most Robust)
Best for: Users who want centralized monitoring dashboard.
Run monitoring on external device (Raspberry Pi, NAS, server):
Option C1: Simple Ping Monitor + Webhook
#!/bin/bash
###############################################################################
# External Mesh Monitor (Run on external server/Pi)
# Pings nodes and sends alerts if unreachable
###############################################################################
NODES=("10.11.12.1" "10.11.12.2" "10.11.12.3")
NODE_NAMES=("mesh-node1" "mesh-node2" "mesh-node3")
TELEGRAM_BOT_TOKEN="your-token"
TELEGRAM_CHAT_ID="your-chat-id"
STATE_FILE="/tmp/mesh-monitor-state.txt"
send_alert() {
local message="$1"
curl -s -X POST \
"https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
-d "chat_id=${TELEGRAM_CHAT_ID}" \
-d "text=${message}" \
-d "parse_mode=Markdown" >/dev/null
}
# Create state file if not exists
touch "$STATE_FILE"
for i in "${!NODES[@]}"; do
node_ip="${NODES[$i]}"
node_name="${NODE_NAMES[$i]}"
if ping -c 3 -W 5 "$node_ip" >/dev/null 2>&1; then
# Node is up
if grep -q "${node_name}:down" "$STATE_FILE"; then
# Node recovered
sed -i "/${node_name}:down/d" "$STATE_FILE"
send_alert "β
*${node_name}* is back online!"
fi
else
# Node is down
if ! grep -q "${node_name}:down" "$STATE_FILE"; then
# New failure
echo "${node_name}:down:$(date +%s)" >> "$STATE_FILE"
send_alert "β *${node_name}* is offline! (${node_ip})"
fi
fi
done
# Add to cron on external server: */5 * * * * /path/to/mesh-monitor.sh
Option C2: Uptime Kuma (Web Dashboard + Alerts)
Popular open-source monitoring tool with web UI:
# Install on Raspberry Pi / NAS / Server
docker run -d --restart=always \
-p 3001:3001 \
-v uptime-kuma:/app/data \
--name uptime-kuma \
louislam/uptime-kuma:1
# Access: http://your-server:3001
# Add monitors for:
# - Ping: 10.11.12.1, 10.11.12.2, 10.11.12.3
# - HTTP: http://10.11.12.1 (LuCI)
# - Port: 22 (SSH)
# Configure notifications:
# - Telegram
# - Discord
# - Email
# - Pushover
# - Slack
Features:
- ✅ Beautiful web dashboard
- ✅ Multi-channel alerts
- ✅ Uptime statistics
- ✅ Status page
- ✅ Historical data
Cons:
- ❌ Requires external server
- ❌ More complex setup
Tier 2: Enhanced Monitoring (Optional)
Goal: Better visibility, trending, and analysis.
Daily/Weekly Reports
Add reporting script to send summaries:
#!/bin/sh
###############################################################################
# Daily Mesh Report
###############################################################################
TELEGRAM_BOT_TOKEN="your-token"
TELEGRAM_CHAT_ID="your-chat-id"
HOSTNAME=$(uci -q get system.@system[0].hostname || hostname)
# Gather stats
UPTIME=$(uptime | awk -F'up ' '{print $2}' | cut -d',' -f1)
LOAD=$(uptime | awk -F'load average:' '{print $2}')
MEM_FREE=$(free | awk '/Mem:/ {printf "%.1f", $4/$2*100}')
DISK_USAGE=$(df /x00 | tail -1 | awk '{print $5}')
NEIGHBOR_COUNT=$(batctl o 2>/dev/null | grep -v "No" | grep -c "^[0-9a-f]")  # grep -c always prints a count
# WAN stats (gateway only)
if batctl gw | grep -q "server"; then
WAN_STATUS=$(ping -c 3 -W 5 1.1.1.1 >/dev/null 2>&1 && echo "✅ Online" || echo "❌ Offline")
else
WAN_STATUS="N/A (client)"
fi
# Bandwidth stats
BW_TODAY=$(vnstat -i bat0 -d | grep "today" | awk '{print $2" "$3}')
# Build report
REPORT="π *Daily Report - ${HOSTNAME}*
β± Uptime: ${UPTIME}
π Load: ${LOAD}
πΎ Memory Free: ${MEM_FREE}%
πΏ Disk Usage: ${DISK_USAGE}
π Mesh Neighbors: ${NEIGHBOR_COUNT}
π WAN: ${WAN_STATUS}
π‘ Bandwidth (today): ${BW_TODAY}
_Generated: $(date)_"
# Send report
curl -s -X POST \
"https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
-d "chat_id=${TELEGRAM_CHAT_ID}" \
-d "text=${REPORT}" \
-d "parse_mode=Markdown" >/dev/null
# Add to cron: 0 8 * * * /usr/bin/daily-report.sh (8 AM daily)
Log Analysis Alerts
Monitor logs for specific patterns:
#!/bin/sh
###############################################################################
# Log Analysis Alerting
###############################################################################
TELEGRAM_BOT_TOKEN="your-bot-token"
TELEGRAM_CHAT_ID="your-chat-id"
LOG_DIR="/x00/logs"
HOSTNAME=$(uci -q get system.@system[0].hostname || hostname)
TODAY=$(date +%Y-%m-%d)
LOG_FILE="${LOG_DIR}/${HOSTNAME}-${TODAY}.log"
# Same helper as in the Option B script above
send_telegram() {
curl -s -X POST \
"https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
-d "chat_id=${TELEGRAM_CHAT_ID}" \
-d "text=$1" \
-d "parse_mode=Markdown" >/dev/null 2>&1
}
# Check for errors in last capture
if [ -f "$LOG_FILE" ]; then
# Count critical errors since last check (15 min)
ERROR_COUNT=$(tail -100 "$LOG_FILE" | grep -cE "err|fail|crit")  # grep -c always prints a count
if [ "$ERROR_COUNT" -gt 10 ]; then
# Extract error samples
ERRORS=$(tail -100 "$LOG_FILE" | grep -E "err|fail|crit" | head -5)
send_telegram "β οΈ *High Error Rate*
${ERROR_COUNT} errors in last 15 minutes
Sample errors:
\`\`\`
${ERRORS}
\`\`\`"
fi
# Check for specific critical events
if tail -100 "$LOG_FILE" | grep -qi "out of memory"; then
send_telegram "π¨ *OOM Event Detected!*
System is out of memory - possible crash imminent!"
fi
if tail -100 "$LOG_FILE" | grep -qi "batman.*disconnected"; then
send_telegram "β οΈ *Mesh Topology Change*
Batman-adv reported disconnection event"
fi
fi
Tier 3: Advanced Monitoring (For Power Users)
Goal: Professional monitoring stack with dashboards and alerting.
Option 1: Prometheus + Grafana
Architecture:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  OpenWrt    │────▶│ Prometheus  │────▶│  Grafana    │
│  Nodes (3x) │     │ (on Pi/NAS) │     │ (Dashboard) │
│             │     │             │     │             │
│  collectd   │     │  Scrapes    │     │ Visualizes  │
│  + exporter │     │  metrics    │     │  + Alerts   │
└─────────────┘     └─────────────┘     └─────────────┘
Setup:
- Install collectd-mod-prometheus on nodes (exports metrics)
- Run Prometheus on external server to scrape nodes
- Grafana for dashboards and alerting
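The Prometheus side of these steps can be sketched as a minimal scrape config. This assumes collectd's `write_prometheus` plugin exposing metrics on its default port 9103 on each node; adjust targets and port to your actual setup:

```yaml
# prometheus.yml sketch - assumes collectd write_prometheus on port 9103
global:
  scrape_interval: 60s

scrape_configs:
  - job_name: "mesh-nodes"
    static_configs:
      - targets:
          - "10.11.12.1:9103"
          - "10.11.12.2:9103"
          - "10.11.12.3:9103"
```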
Pros:
- ✅ Industry-standard monitoring
- ✅ Beautiful dashboards
- ✅ Powerful alerting
- ✅ Historical data retention
Cons:
- ❌ Complex setup
- ❌ Requires external infrastructure
- ❌ Overkill for 3-node mesh
Option 2: LibreNMS
Full network monitoring system:
# Install on Ubuntu server/VM
curl https://raw.githubusercontent.com/librenms/librenms/master/install.sh | bash
Features:
- Auto-discovery of devices
- SNMP monitoring
- Alerting (email, Slack, webhook)
- Network maps
- Historical graphs
Cons:
- Heavy (requires server)
- Complex setup
- Way more than needed for 3 nodes
My Recommended Setup (Best Balance)
Based on your 3-node home mesh, here's what I recommend:
Tier 1: Essential Alerts (IMPLEMENT THIS)
- On Each Node:
  - Run alerting script (Telegram or email) every 15 minutes
  - Alert on: Disk full, no neighbors, low memory, high CPU
- On External Device (Optional but Recommended):
  - Simple ping monitor (checks if nodes are reachable)
  - Sends alert if node offline for 5+ minutes
- Keep Existing Monitoring:
  - ✅ Collectd + LuCI graphs (for viewing metrics)
  - ✅ Distributed syslog (for troubleshooting)
  - ✅ mesh-monitor.sh (for health checks)
Tier 2: Weekly Summary (Nice to Have)
- Daily or weekly report via Telegram/email
- Shows uptime, bandwidth, neighbors, disk usage
- Runs at 8 AM daily via cron
Skip Tier 3 Unless:
- You're managing 10+ nodes
- You need professional dashboards
- You have compliance requirements
- You enjoy complex monitoring setups
Implementation Priority
Week 1: Critical Alerts
1. Set up Telegram bot (5 minutes)
2. Deploy alert script to nodes (use playbook)
3. Test alerts (fill disk, stop batman, etc.)
4. Verify notifications working
Week 2: External Monitoring
Week 3: Reports
Alert Tuning Guidelines
Good Alerts (Actionable)
- ✅ Disk > 95% - Need to clean up logs
- ✅ No neighbors for 10+ min - Mesh broken, check hardware
- ✅ Memory < 5MB - System about to crash
- ✅ Node offline 5+ min - Hardware/network issue
Bad Alerts (Noise)
- ❌ CPU > 50% - Normal during updates
- ❌ Any error in logs - Too sensitive
- ❌ Neighbor count changed - Normal mesh behavior
- ❌ Bandwidth spike - Expected during backups
Alert Fatigue Prevention
- Use cooldown periods (1 hour between duplicate alerts)
- Alert on sustained issues (not momentary spikes)
- Combine related alerts (one "node unhealthy" vs separate CPU/memory/disk)
- Test alert thresholds (adjust based on false positives)
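The cooldown idea above can be factored into one reusable helper rather than repeated per script. A minimal sketch, assuming a writable flat state file (the default path under /tmp is an example) and single-token alert keys:

```shell
#!/bin/sh
# Reusable cooldown helper: should_send returns 0 (send the alert) only when
# the last alert with this key is older than the cooldown window.
# State is a flat file of "key timestamp" lines; keys must not contain spaces.
STATE_FILE="${STATE_FILE:-/tmp/alert-cooldown.txt}"
COOLDOWN_SECONDS="${COOLDOWN_SECONDS:-3600}"

should_send() {
    key="$1"
    now=$(date +%s)
    last=$(grep "^${key} " "$STATE_FILE" 2>/dev/null | tail -1 | awk '{print $2}')
    if [ -n "$last" ] && [ $((now - last)) -lt "$COOLDOWN_SECONDS" ]; then
        return 1   # still inside the cooldown window - suppress
    fi
    echo "${key} ${now}" >> "$STATE_FILE"
    return 0       # ok to send; timestamp recorded for next time
}
```

Usage in any of the check scripts: `should_send disk-full && send_telegram "Disk full"`.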
Monitoring Checklist
Daily (Automated):
- All nodes pingable
- Mesh neighbors present
- Disk usage < 95%
- Memory free > 5MB
Weekly (Manual):
- Review graphs in LuCI
- Check bandwidth usage
- Review syslog for errors
- Verify backups working
Monthly (Manual):
- Update firmware if available
- Review long-term trends
- Clean old logs/data
- Test failover scenarios
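The automated parts of this checklist boil down to a handful of cron entries. A sketch consolidating the schedules used in this guide (`mesh-alert.sh` and `daily-report.sh` are the example names from earlier sections; the weekly report script name is hypothetical):

```shell
# Consolidated crontab sketch (paths match the examples in this guide)
*/15 * * * * /usr/bin/mesh-alert.sh      # Tier 1 critical alerts
0 8 * * *    /usr/bin/daily-report.sh    # Tier 2 daily summary
0 8 * * 1    /usr/bin/weekly-report.sh   # optional weekly report (hypothetical name)
```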
Cost Comparison
| Solution | Setup Time | Monthly Cost | Complexity |
|---|---|---|---|
| Telegram Alerts | 30 min | $0 | Low |
| Email Alerts | 15 min | $0 | Low |
| Uptime Kuma | 2 hours | $0 (self-hosted) | Medium |
| Prometheus + Grafana | 8+ hours | $0 (self-hosted) | High |
| Commercial (DataDog) | 1 hour | $15+/month | Low |
Recommendation: Start with Telegram alerts ($0, 30 minutes setup)
Next Steps
- Read this guide ✅
- Choose alerting method (Telegram, email, or webhook)
- Review implementation section for your choice
- Set up test alerts (manually trigger conditions)
- Deploy to production (add to playbooks)
- Tune thresholds (adjust based on false positives)
- Add external ping monitor (optional but recommended)
- Schedule weekly reports (optional)
Support & Resources
- Telegram Bot Guide: https://core.telegram.org/bots
- Pushover: https://pushover.net/
- Uptime Kuma: https://github.com/louislam/uptime-kuma
- Discord Webhooks: https://support.discord.com/hc/en-us/articles/228383668
- Prometheus on OpenWrt: https://openwrt.org/docs/guide-user/perf_and_log/monitoring
Summary
For your 3-node mesh, implement:
- ✅ Telegram bot alerts (critical issues only)
  - Disk full, no neighbors, low memory, node offline
  - 15-minute checks, 1-hour cooldown
- ✅ External ping monitor (optional but recommended)
  - Simple script on Raspberry Pi/NAS
  - Alerts if any node offline 5+ minutes
- ✅ Weekly summary reports (nice to have)
  - Bandwidth, uptime, neighbors, disk usage
  - Sent every Monday at 8 AM
- ✅ Keep existing monitoring
  - LuCI graphs for detailed analysis
  - Distributed syslog for troubleshooting
  - mesh-monitor.sh for health checks
Skip: Prometheus, Grafana, LibreNMS (overkill for 3 nodes)
Result: Professional monitoring without complexity!