Monitoring Guide for OpenWrt Mesh Nodes¶
This document describes the lightweight monitoring solution for OpenWrt mesh nodes using collectd, vnStat, and custom health checks.
Overview¶
Monitoring architecture showing data collection, storage, and access paths.
The monitoring solution provides:
- Collectd: Metrics collection (CPU, memory, disk, network, wireless)
- LuCI Statistics: Web UI with graphs for all metrics
- vnStat: Long-term bandwidth usage tracking
- Mesh Health Monitoring: Custom scripts monitoring mesh topology and node health
- USB Storage: All data stored on /x00 for persistence
Key Features¶
- Lightweight: Low resource usage suitable for embedded devices
- Independent: Each node monitors itself (no central collector)
- Persistent: Data stored on USB storage survives reboots
- Comprehensive: System, network, wireless, and mesh-specific metrics
- Automated: Health checks run every 5 minutes via cron
- Flexible: Can be deployed automatically during node setup or manually afterward
Prerequisites¶
CRITICAL: USB storage must be mounted at /x00 before deploying monitoring.
# First, ensure USB storage is configured
make usb-storage NODE=1
# Verify USB is mounted
make usb-status NODE=1
Deployment Methods¶
Monitoring can be deployed in two ways:
Method 1: Automatic Deployment (During Node Setup)¶
Monitoring is automatically deployed during make deploy-node NODE=1 if:
- USB storage is detected and mounted at /x00
- ENABLE_MONITORING=true in the .env file (default)
To enable/disable automatic monitoring deployment:
# Edit .env file
ENABLE_MONITORING=true # Deploy monitoring automatically (default)
ENABLE_MONITORING=false # Skip monitoring during deployment
Benefits:
- One-step deployment - monitoring configured along with node
- No separate manual step required
- Consistent configuration across all nodes
When monitoring auto-deploys:
- Collectd, vnStat, and health scripts are installed
- Services are enabled and started
- Data directories created on /x00/monitoring/
- Cron jobs configured for health checks
Method 2: Manual Deployment (After Node Setup)¶
If monitoring was skipped during initial deployment, you can add it later:
# Deploy monitoring to a single node
make deploy-monitoring NODE=1
# Deploy monitoring to all nodes
make deploy-monitoring
Use cases for manual deployment:
- Monitoring was disabled during initial deployment (ENABLE_MONITORING=false)
- USB storage was added after initial node deployment
- Re-deploying monitoring after configuration changes
Deployment¶
Note: The sections below describe manual deployment. For automatic deployment, see "Method 1: Automatic Deployment" above.
Deploy Monitoring to a Single Node¶
# Deploy monitoring to node 1
make deploy-monitoring NODE=1
# With verbose output
make deploy-monitoring NODE=1 VERBOSE=1
Deploy Monitoring to All Nodes¶
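Omitting the NODE variable targets every node, as with the single-node command shown earlier:

```shell
# Deploy monitoring to all nodes
make deploy-monitoring
```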
What Gets Installed¶
Packages (~3-4MB):
- collectd - Core metrics collector
- collectd-mod-* - Plugins for CPU, memory, disk, network, wireless, etc.
- luci-app-statistics - Web UI for collectd graphs
- luci-app-vnstat2 - Web UI for vnStat bandwidth graphs
- luci-app-commands - Custom command shortcuts in LuCI
- vnstat - Bandwidth tracking daemon
- vnstati - Graph generation for vnStat
Configuration:
- Data storage: /x00/monitoring/
- Collection interval: 30 seconds
- Disk write interval: 5 minutes (reduces wear)
- Log rotation: 30 days
Services:
- collectd - Metrics collection daemon
- vnstat - Bandwidth tracking daemon
Scripts:
- /usr/bin/mesh-monitor.sh - Health check script (runs every 5 min)
- /usr/bin/monitoring-report.sh - Status report generator
LuCI Custom Commands:
The following quick-access commands are configured in the LuCI web interface:
- Mesh Neighbors - batctl o - Shows all visible mesh neighbors
- Gateway List - batctl gwl - Shows available gateway nodes
- Mesh Status Report - /usr/bin/monitoring-report.sh - Full monitoring status
- Batman Interface - batctl if - Shows interfaces participating in mesh
- Bandwidth Stats (bat0) - vnstat -i bat0 - Bandwidth usage on mesh interface
Access these commands via: LuCI → System → Custom Commands
Accessing Monitoring Data¶
Web Interface (LuCI)¶
Access monitoring via web browser:
# Open collectd statistics for node 1
make monitoring-graphs NODE=1
# Or manually navigate to:
# Collectd Statistics: http://10.11.12.1/cgi-bin/luci/admin/statistics/graph
# vnStat Bandwidth: http://10.11.12.1/cgi-bin/luci/admin/status/vnstat
# Custom Commands: http://10.11.12.1/cgi-bin/luci/admin/system/commands
LuCI → Statistics → Graphs (Collectd):
- CPU usage (user, system, idle)
- Memory usage (free, cached, buffered)
- Load average (1, 5, 15 minutes)
- Disk usage (/x00, /overlay)
- Disk I/O (read/write operations)
- Network traffic (bat0, br-lan, wlan0, wlan1)
- Wireless stats (signal, noise, bitrate)
- Temperature sensors
- Ping latency (to other mesh nodes)
- Process count
- System uptime
LuCI → Status → vnStat Traffic Monitor:
- Real-time bandwidth usage
- Hourly, daily, monthly, yearly statistics
- Per-interface graphs (bat0, br-lan, wlan0, wlan1)
- Top traffic days/hours
LuCI → System → Custom Commands:
- Quick-access buttons for mesh status commands
- One-click access to batctl o, batctl gwl, and monitoring reports
- No need to SSH for common operations
Command Line Reports¶
# View comprehensive status report
make monitoring-status NODE=1
# View mesh health logs
make monitoring-logs NODE=1
# SSH to node and run report
ssh root@10.11.12.1 monitoring-report.sh
Direct SSH Access¶
# SSH to the node
ssh root@10.11.12.1
# View monitoring status
monitoring-report.sh
# View bandwidth stats
vnstat -i bat0 # All-time stats
vnstat -i bat0 -d # Daily stats
vnstat -i bat0 -h # Hourly stats
vnstat -i bat0 -m # Monthly stats
vnstat -i bat0 -l # Live stats
# View health logs
tail -f /x00/monitoring/logs/mesh-health.log
# Check collectd status
/etc/init.d/collectd status
/etc/init.d/collectd restart
# Check vnstat status
/etc/init.d/vnstat status
# View RRD files
ls -lh /x00/monitoring/collectd/rrd/
# Check disk usage
du -sh /x00/monitoring/*
Monitored Metrics¶
System Metrics¶
- CPU: Usage per core, user/system/idle time
- Memory: Used, free, cached, buffered
- Load: 1, 5, 15 minute averages
- Uptime: System uptime
- Processes: Total count, running, sleeping
- Temperature: CPU/system temperature sensors
Storage Metrics¶
- Disk Usage: Free space on /x00 and /overlay
- Disk I/O: Read/write operations and throughput
- Alerts: Warning when /x00 usage > 90%
Network Metrics¶
- Interfaces: bat0, br-lan, wlan0, wlan1, eth0, eth1
- Traffic: Bytes/packets in/out per interface
- Errors: Packet errors and drops
- Bandwidth: Historical usage via vnStat
Wireless Metrics (via iwinfo)¶
- Signal Strength: Per-station RSSI
- Noise Level: Background noise
- Bitrate: Current transmission rate
- Channel Utilization: Airtime usage
Mesh-Specific Metrics¶
- Neighbor Count: Number of mesh neighbors
- Neighbor Connectivity: Ping latency to other nodes
- Gateway Status: Current gateway mode
- Interface Status: bat0, wlan0, wlan1 operational state
Health Monitoring¶
Automated Health Checks¶
The mesh-monitor.sh script runs every 5 minutes and checks:
- Batman-adv Module: Verifies module is loaded
- Bat0 Interface: Confirms mesh interface is up
- Mesh Neighbors: Counts visible neighbors (expects 2 for 3-node mesh)
- USB Storage: Verifies /x00 is mounted
- Disk Space: Warns if usage > 90%
- Wireless Interfaces: Checks wlan0/wlan1 operational state
- Gateway Mode: Logs current gateway status
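The neighbor-count check above can be sketched as a small shell function. This is a minimal sketch, not the actual mesh-monitor.sh source: the function names and the assumption that `batctl o` prints two header lines before one line per originator are illustrative.

```shell
#!/bin/sh
# Sketch of the neighbor-count check (assumed logic, not the deployed script).
EXPECTED_NEIGHBORS=2   # per the docs: expect 2 neighbors in a 3-node mesh

# Count originator lines from `batctl o` output on stdin,
# skipping the banner line, the column header, and blank lines.
count_neighbors() {
    grep -cv -e '^\[' -e 'Originator' -e '^$'
}

check_neighbors() {
    n=$(count_neighbors)
    if [ "$n" -eq 0 ]; then
        echo "ERROR: No mesh neighbors detected (expected $EXPECTED_NEIGHBORS)"
    elif [ "$n" -lt "$EXPECTED_NEIGHBORS" ]; then
        echo "WARNING: Only $n mesh neighbors detected (expected $EXPECTED_NEIGHBORS)"
    else
        echo "INFO: $n mesh neighbors detected"
    fi
}

# On a node this would be invoked as: batctl o | check_neighbors
```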
Health Log Format¶
2025-11-22 12:00:00 - INFO: 2 mesh neighbors detected
2025-11-22 12:00:00 - INFO: Gateway mode: client
2025-11-22 12:05:00 - WARNING: USB storage usage at 85%
2025-11-22 12:10:00 - ERROR: No mesh neighbors detected (expected 2 for 3-node mesh)
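Log lines in this format can be produced by a one-line helper. A sketch, assuming the documented log path; the `log_msg` name is hypothetical:

```shell
#!/bin/sh
# Minimal sketch of the logger behind mesh-health.log.
# LOG_FILE matches the documented location; log_msg is a hypothetical name.
LOG_FILE="${LOG_FILE:-/x00/monitoring/logs/mesh-health.log}"

log_msg() {
    # $1 = level (INFO/WARNING/ERROR), $2 = message
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1: $2" >> "$LOG_FILE"
}

# log_msg INFO "2 mesh neighbors detected"
```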
Viewing Logs¶
# Last 50 lines
make monitoring-logs NODE=1
# Follow logs in real-time
ssh root@10.11.12.1
tail -f /x00/monitoring/logs/mesh-health.log
Data Storage¶
Storage Layout¶
/x00/monitoring/
├── collectd/
│ └── rrd/ # RRD database files (time-series data)
│ ├── cpu-0/
│ ├── memory/
│ ├── interface-bat0/
│ └── ...
├── vnstat/ # vnStat database
│ ├── bat0
│ ├── br-lan
│ └── wlan0
└── logs/ # Health check logs
└── mesh-health.log
Data Retention¶
- RRD Files: 1200 rows per RRA (configurable)
- 30-second intervals = ~10 hours of detailed data
- Automatically aggregates to longer intervals
- 1 year of historical trends
- Health Logs: 30 days (auto-rotated)
- vnStat Database: Unlimited (until disk full)
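The "~10 hours of detailed data" figure follows directly from rows times step size:

```shell
# 1200 RRA rows at a 30-second step cover 1200 * 30 = 36000 seconds.
ROWS=1200
STEP=30
HOURS=$(( ROWS * STEP / 3600 ))
echo "$HOURS hours of full-resolution data"   # 10 hours
```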
Disk Usage¶
Typical disk usage after 30 days:
- Collectd RRD files: ~50-100MB
- vnStat database: ~5-10MB
- Health logs: ~1-5MB
- Total: ~100MB per node
Performance Impact¶
Resource Usage¶
Typical resource consumption per node:
- CPU: < 1% average
- Memory: ~10-15MB RSS
- Disk I/O: Minimal (5-minute write interval)
- Network: Negligible (ping monitoring only)
Optimization Settings¶
The configuration is optimized for flash storage:
- Collection Interval: 30 seconds (reduces CPU load)
- Cache Flush: 5 minutes (reduces disk writes)
- RRA Single: Enabled (one file per metric, reduces I/O)
- Background GC: Enabled on F2FS (wear leveling)
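These settings map to entries in /etc/config/luci_statistics. The fragment below is a sketch: Interval, RRARows, and the RRD data directory appear elsewhere in this guide, while the RRASingle option name is an assumption to verify with `uci show luci_statistics` on your firmware version.

```
config statistics 'collectd'
        # collection interval (seconds)
        option Interval '30'

config statistics 'collectd_rrdtool'
        option DataDir '/x00/monitoring/collectd/rrd'
        # one RRD file per metric (option name is an assumption)
        option RRASingle '1'
        option RRARows '1200'
```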
Troubleshooting¶
Monitoring Not Working¶
# Check USB storage is mounted
mount | grep /x00
# Check services running
/etc/init.d/collectd status
/etc/init.d/vnstat status
# Restart services
/etc/init.d/collectd restart
/etc/init.d/vnstat restart
# Check logs
logread | grep collectd
logread | grep vnstat
No Data in Graphs¶
# Wait 5-10 minutes for initial data collection
# Verify RRD files exist
ls -l /x00/monitoring/collectd/rrd/
# Check collectd is collecting
/etc/init.d/collectd restart
sleep 60
ls -l /x00/monitoring/collectd/rrd/
vnStat "Unable to read database" Error¶
If you see Error: Unable to read database "/x00/monitoring/vnstat/bat0": No such file or directory:
# Stop vnStat
/etc/init.d/vnstat stop
# Ensure directory exists
mkdir -p /x00/monitoring/vnstat
chmod 755 /x00/monitoring/vnstat
# Recreate databases for existing interfaces
for iface in bat0 br-lan wlan0 wlan1; do
if ip link show "$iface" >/dev/null 2>&1; then
rm -f "/x00/monitoring/vnstat/$iface"
vnstat --create -i "$iface"
fi
done
# Restart service
/etc/init.d/vnstat start
# Verify databases created
ls -lh /x00/monitoring/vnstat/
# Wait 5-10 minutes for data collection
vnstat -i bat0
Monitoring Report Script Issues¶
If /usr/bin/monitoring-report.sh shows errors:
"hostname: not found" - Fixed in updated playbooks, hostname is now read from UCI/proc
"Unknown parameter '1'" in vnStat - Fixed in updated playbooks, changed vnstat -i bat0 -d 1 to vnstat -i bat0 -d
To apply these fixes on already-deployed nodes, re-run make deploy-monitoring NODE=1 (see Method 2 above).
Web UI Not Accessible¶
# Check LuCI is running
/etc/init.d/uhttpd status
/etc/init.d/uhttpd restart
# Check firewall allows access
uci show firewall | grep "wan.*input"
# Access from mesh network (not WAN)
# http://10.11.12.1/cgi-bin/luci/admin/statistics/graph
Disk Full on /x00¶
# Check current usage
df -h /x00
# Clean old health logs
find /x00/monitoring/logs -name "*.log" -mtime +30 -delete
# Reduce RRD retention (if needed)
# Edit /etc/config/luci_statistics
uci set luci_statistics.collectd_rrdtool.RRARows='600'
uci commit luci_statistics
/etc/init.d/collectd restart
Health Checks Not Running¶
# Check cron job exists
crontab -l | grep mesh-monitor
# Run manually to test
/usr/bin/mesh-monitor.sh
# Check logs
tail /x00/monitoring/logs/mesh-health.log
Customization¶
Adjust Collection Interval¶
Edit /etc/config/luci_statistics:
# Change from 30 to 60 seconds
uci set luci_statistics.collectd.Interval='60'
uci commit luci_statistics
/etc/init.d/collectd restart
Add Custom Metrics¶
Create a custom collectd plugin:
# Example: Monitor specific process
cat > /etc/collectd/conf.d/custom.conf << 'EOF'
LoadPlugin processes
<Plugin processes>
Process "hostapd"
Process "dnsmasq"
</Plugin>
EOF
/etc/init.d/collectd restart
Change Monitored Interfaces¶
# Edit interface list
uci set luci_statistics.collectd_interface.Interfaces='bat0 br-lan wlan0'
uci commit luci_statistics
/etc/init.d/collectd restart
Modify Health Check Frequency¶
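On OpenWrt, root's cron entries live in /etc/crontabs/root. The exact installed line is an assumption (check it with `crontab -l | grep mesh-monitor`, as in Troubleshooting); to run the health check every 10 minutes instead of 5, the entry would look like:

```
# /etc/crontabs/root - run mesh-monitor.sh every 10 minutes instead of 5
*/10 * * * * /usr/bin/mesh-monitor.sh
```

Reload cron afterwards with /etc/init.d/cron restart.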
Integration with External Systems¶
Exporting to Grafana (Future Enhancement)¶
To send metrics to a central Grafana instance:
- Install collectd-mod-network on nodes
- Configure the network plugin to forward metrics to the Grafana server
- Set up Grafana with collectd data source
Alerting (Future Enhancement)¶
Options for alerts:
- Email: Configure collectd notification plugin
- Webhook: Use collectd exec plugin to call webhook on threshold
- MQTT: Publish health status to MQTT broker
- Custom Script: Extend mesh-monitor.sh to send notifications
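One way to implement the webhook option is a small notifier sourced by mesh-monitor.sh. This is a sketch under stated assumptions: `WEBHOOK_URL`, the helper names, and the JSON payload shape are all hypothetical, and busybox wget's --post-data is used for the actual send.

```shell
#!/bin/sh
# Hypothetical webhook notifier for mesh-monitor.sh.
WEBHOOK_URL="${WEBHOOK_URL:-}"   # assumption: set by the operator; empty = disabled

build_payload() {
    # $1 = level, $2 = message; payload shape is an assumption
    printf '{"node":"%s","level":"%s","message":"%s"}' "$(uname -n)" "$1" "$2"
}

notify() {
    # Skip silently when no webhook is configured
    [ -n "$WEBHOOK_URL" ] || return 0
    wget -q -O /dev/null --post-data="$(build_payload "$1" "$2")" "$WEBHOOK_URL"
}

# Example: notify WARNING "USB storage usage at 85%"
```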
Best Practices¶
- Deploy After USB Storage: Always ensure USB is mounted before deploying monitoring
- Monitor Disk Space: Keep /x00 usage under 80%
- Review Health Logs: Check logs weekly for issues
- Backup RRD Data: Periodically backup /x00/monitoring/ directory
- Test Failover: Verify monitoring works after node reboots
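For the backup recommendation, a dated tarball of the monitoring directory is enough. A sketch; the destination path and function name are assumptions, and on a live node you would stop collectd around the tar so RRD files are consistent:

```shell
#!/bin/sh
# Sketch of a periodic monitoring backup (destination path is an assumption).
SRC="${SRC:-/x00/monitoring}"
DEST="${DEST:-/x00/backups}"

backup_monitoring() {
    mkdir -p "$DEST"
    # On a real node, stop collectd first for consistent RRD files:
    # /etc/init.d/collectd stop
    tar -czf "$DEST/monitoring-$(date +%Y%m%d).tar.gz" \
        -C "$(dirname "$SRC")" "$(basename "$SRC")"
    # /etc/init.d/collectd start
}
```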