Monitoring Guide for OpenWrt Mesh Nodes¶

This document describes the lightweight monitoring solution for OpenWrt mesh nodes using collectd, vnStat, and custom health checks.

Overview¶

Figure: Monitoring architecture showing data collection, storage, and access paths.

The monitoring solution provides:

Collectd: Metrics collection (CPU, memory, disk, network, wireless)
LuCI Statistics: Web UI with graphs for all metrics
vnStat: Long-term bandwidth usage tracking
Mesh Health Monitoring: Custom scripts monitoring mesh topology and node health
USB Storage: All data stored on /x00 for persistence

Key Features¶

Lightweight: Low resource usage suitable for embedded devices
Independent: Each node monitors itself (no central collector)
Persistent: Data stored on USB storage survives reboots
Comprehensive: System, network, wireless, and mesh-specific metrics
Automated: Health checks run every 5 minutes via cron
Flexible: Can be deployed automatically during node setup or manually afterward

Prerequisites¶

CRITICAL: USB storage must be mounted at /x00 before deploying monitoring.

# First, ensure USB storage is configured
make usb-storage NODE=1

# Verify USB is mounted
make usb-status NODE=1
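If you want to confirm the mount directly on a node, the following is a minimal sketch that looks for /x00 in /proc/mounts. The function name `is_mounted` is an illustrative assumption and is not part of the project's Makefile or playbooks.

```shell
#!/bin/sh
# is_mounted MOUNTPOINT [MOUNTS_FILE]
# Returns 0 if MOUNTPOINT appears as a mount target in MOUNTS_FILE
# (default: /proc/mounts), 1 otherwise.
# NOTE: hypothetical helper, not shipped with the monitoring deployment.
is_mounted() {
    mountpoint="$1"
    mounts_file="${2:-/proc/mounts}"
    # Field 2 of each /proc/mounts line is the mount target.
    awk -v mp="$mountpoint" '$2 == mp { found = 1 } END { exit !found }' "$mounts_file"
}

# Example: stop early when /x00 is not mounted.
# is_mounted /x00 || { echo "ERROR: /x00 is not mounted" >&2; exit 1; }
```

Matching the mount target exactly avoids false positives from paths that merely contain the string /x00, which a plain grep over the mount output would accept.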

Deployment Methods¶

Monitoring can be deployed in two ways:

Method 1: Automatic Deployment (During Node Setup)¶

Monitoring is deployed automatically during make deploy-node NODE=1 when both of the following hold:

USB storage is detected and mounted at /x00
ENABLE_MONITORING=true is set in the .env file (the default)

To enable/disable automatic monitoring deployment:

# Edit .env file
ENABLE_MONITORING=true   # Deploy monitoring automatically (default)
ENABLE_MONITORING=false  # Skip monitoring during deployment
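How the deployment tooling actually parses .env is not shown in this guide. As a rough sketch of the documented behaviour (deploy unless the flag is explicitly set to something other than true), a wrapper script could read the flag like this. The function name `monitoring_enabled` is an assumption, and quoted values such as "true" are not handled.

```shell
#!/bin/sh
# monitoring_enabled [ENV_FILE]
# Returns 0 when monitoring should be deployed, 1 when it is disabled.
# A missing file or a missing key counts as "true" (the documented default).
# NOTE: illustrative helper; the real Makefile logic may differ.
monitoring_enabled() {
    env_file="${1:-.env}"
    value="true"
    if [ -f "$env_file" ]; then
        # Use the last uncommented assignment; drop inline comments and spaces.
        found=$(sed -n 's/^ENABLE_MONITORING=\([^#[:space:]]*\).*/\1/p' "$env_file" | tail -n 1)
        if [ -n "$found" ]; then
            value="$found"
        fi
    fi
    [ "$value" = "true" ]
}

# Example:
# if monitoring_enabled .env; then echo "monitoring will be deployed"; fi
```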

Benefits:

One-step deployment - monitoring configured along with node
No separate manual step required
Consistent configuration across all nodes

When monitoring auto-deploys:

Collectd, vnStat, and health scripts are installed
Services are enabled and started
Data directories are created under /x00/monitoring/
Cron jobs are configured for health checks

Method 2: Manual Deployment (After Node Setup)¶

If monitoring was skipped during initial deployment, you can add it later:

# Deploy monitoring to a single node
make deploy-monitoring NODE=1

# Deploy monitoring to all nodes
make deploy-monitoring

Use cases for manual deployment:

Monitoring was disabled during initial deployment (ENABLE_MONITORING=false)
USB storage was added after initial node deployment
Re-deploying monitoring after configuration changes

Deployment¶

Note: The sections below describe manual deployment. For automatic deployment, see "Method 1: Automatic Deployment" above.

Deploy Monitoring to a Single Node¶

# Deploy monitoring to node 1
make deploy-monitoring NODE=1

# With verbose output
make deploy-monitoring NODE=1 VERBOSE=1

Deploy Monitoring to All Nodes¶

# Deploy to all nodes at once
make deploy-monitoring

What Gets Installed¶

Packages (~3-4MB):

collectd - Core metrics collector
collectd-mod-* - Plugins for CPU, memory, disk, network, wireless, etc.
luci-app-statistics - Web UI for collectd graphs
luci-app-vnstat2 - Web UI for vnStat bandwidth graphs
luci-app-commands - Custom command shortcuts in LuCI
vnstat - Bandwidth tracking daemon
vnstati - Graph generation for vnStat

Configuration:

Data storage: /x00/monitoring/
Collection interval: 30 seconds
Disk write interval: 5 minutes (reduces wear)
Log rotation: 30 days

Services:

collectd - Metrics collection daemon
vnstat - Bandwidth tracking daemon

Scripts:

/usr/bin/mesh-monitor.sh - Health check script (runs every 5 min)
/usr/bin/monitoring-report.sh - Status report generator

LuCI Custom Commands:

The following quick-access commands are configured in the LuCI web interface:

Mesh Neighbors - batctl o - Shows all visible mesh neighbors
Gateway List - batctl gwl - Shows available gateway nodes
Mesh Status Report - /usr/bin/monitoring-report.sh - Full monitoring status
Batman Interface - batctl if - Shows interfaces participating in mesh
Bandwidth Stats (bat0) - vnstat -i bat0 - Bandwidth usage on mesh interface

Access these commands via: LuCI → System → Custom Commands
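To add a further shortcut by hand, the sketch below prints input for `uci batch`. It assumes luci-app-commands stores its entries as `command` sections in /etc/config/luci with `name` and `command` options; check the config on your node before applying it. The helper name `luci_command_batch` is hypothetical.

```shell
#!/bin/sh
# luci_command_batch NAME COMMAND
# Prints `uci batch` input that would register one LuCI custom command.
# ASSUMPTION: entries live in /etc/config/luci as "config command" sections.
# NAME and COMMAND must not contain single quotes.
luci_command_batch() {
    printf "add luci command\n"
    printf "set luci.@command[-1].name='%s'\n" "$1"
    printf "set luci.@command[-1].command='%s'\n" "$2"
    printf "commit luci\n"
}

# Print the batch for review. On a node, pipe it into `uci batch` to apply it.
luci_command_batch "Mesh Neighbors" "batctl o"
```

Printing the batch first, rather than calling uci directly, lets you review the change before it touches the node's configuration.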

Accessing Monitoring Data¶

Web Interface (LuCI)¶

Access monitoring via web browser:

# Open collectd statistics for node 1
make monitoring-graphs NODE=1

# Or manually navigate to:
# Collectd Statistics: http://10.11.12.1/cgi-bin/luci/admin/statistics/graph
# vnStat Bandwidth:     http://10.11.12.1/cgi-bin/luci/admin/status/vnstat
# Custom Commands:      http://10.11.12.1/cgi-bin/luci/admin/system/commands

LuCI → Statistics → Graphs (Collectd):

CPU usage (user, system, idle)
Memory usage (free, cached, buffered)
Load average (1, 5, 15 minutes)
Disk usage (/x00, /overlay)
Disk I/O (read/write operations)
Network traffic (bat0, br-lan, wlan0, wlan1)
Wireless stats (signal, noise, bitrate)
Temperature sensors
Ping latency (to other mesh nodes)
Process count
System uptime

LuCI → Status → vnStat Traffic Monitor:

Real-time bandwidth usage
Hourly, daily, monthly, yearly statistics
Per-interface graphs (bat0, br-lan, wlan0, wlan1)
Top traffic days/hours

LuCI → System → Custom Commands:

Quick-access buttons for mesh status commands
One-click access to batctl o, batctl gwl, monitoring reports
No need to SSH for common operations

Command Line Reports¶

# View comprehensive status report
make monitoring-status NODE=1

# View mesh health logs
make monitoring-logs NODE=1

# SSH to node and run report
ssh root@10.11.12.1 monitoring-report.sh

Direct SSH Access¶

# SSH to the node
ssh root@10.11.12.1

# View monitoring status
monitoring-report.sh

# View bandwidth stats
vnstat -i bat0              # All-time stats
vnstat -i bat0 -d           # Daily stats
vnstat -i bat0 -h           # Hourly stats
vnstat -i bat0 -m           # Monthly stats
vnstat -i bat0 -l           # Live stats

# View health logs
tail -f /x00/monitoring/logs/mesh-health.log

# Check collectd status
/etc/init.d/collectd status
/etc/init.d/collectd restart

# Check vnstat status
/etc/init.d/vnstat status

# View RRD files
ls -lh /x00/monitoring/collectd/rrd/

# Check disk usage
du -sh /x00/monitoring/*

Monitored Metrics¶

System Metrics¶

CPU: Usage per core, user/system/idle time
Memory: Used, free, cached, buffered
Load: 1, 5, 15 minute averages
Uptime: System uptime
Processes: Total count, running, sleeping
Temperature: CPU/system temperature sensors

Storage Metrics¶

Disk Usage: Free space on /x00 and /overlay
Disk I/O: Read/write operations and throughput
Alerts: A warning is logged when /x00 usage exceeds 90%

Network Metrics¶

Interfaces: bat0, br-lan, wlan0, wlan1, eth0, eth1
Traffic: Bytes/packets in/out per interface
Errors: Packet errors and drops
Bandwidth: Historical usage via vnStat

Wireless Metrics (via iwinfo)¶

Signal Strength: Per-station RSSI
Noise Level: Background noise
Bitrate: Current transmission rate
Channel Utilization: Airtime usage

Mesh-Specific Metrics¶

Neighbor Count: Number of mesh neighbors
Neighbor Connectivity: Ping latency to other nodes
Gateway Status: Current gateway mode
Interface Status: bat0, wlan0, wlan1 operational state

Health Monitoring¶

Automated Health Checks¶

The mesh-monitor.sh script runs every 5 minutes and checks:

Batman-adv Module: Verifies module is loaded
bat0 Interface: Confirms the mesh interface is up
Mesh Neighbors: Counts visible neighbors (expects 2 for 3-node mesh)
USB Storage: Verifies /x00 is mounted
Disk Space: Warns if usage > 90%
Wireless Interfaces: Checks wlan0/wlan1 operational state
Gateway Mode: Logs current gateway status
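The source of mesh-monitor.sh is not reproduced in this guide. As an illustration of the disk space check listed above, a threshold test of this kind can be written as follows. Both function names are assumptions, and the shipped script may be structured differently.

```shell
#!/bin/sh
# disk_usage_pct PATH
# Prints the used-space percentage (digits only) of the filesystem holding PATH.
disk_usage_pct() {
    # `df -P` prints one line per filesystem; column 5 is the capacity in percent.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# disk_level PERCENT [THRESHOLD]
# Prints WARNING when PERCENT is above THRESHOLD (default 90), else INFO.
# NOTE: hypothetical helper mirroring the documented "> 90%" rule.
disk_level() {
    if [ "$1" -gt "${2:-90}" ]; then
        echo "WARNING"
    else
        echo "INFO"
    fi
}

# Example on a node:
# pct=$(disk_usage_pct /x00)
# echo "$(disk_level "$pct"): USB storage usage at ${pct}%"
```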

Health Log Format¶

2025-11-22 12:00:00 - INFO: 2 mesh neighbors detected
2025-11-22 12:00:00 - INFO: Gateway mode: client
2025-11-22 12:05:00 - WARNING: USB storage usage at 92%
2025-11-22 12:10:00 - ERROR: No mesh neighbors detected (expected 2 for 3-node mesh)
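If you extend the health checks with your own messages, a small helper along these lines reproduces the timestamp, level, and message layout of the sample lines above. The name `log_msg` is an assumption and is not taken from mesh-monitor.sh.

```shell
#!/bin/sh
# log_msg LEVEL MESSAGE...
# Prints "YYYY-MM-DD HH:MM:SS - LEVEL: MESSAGE".
# NOTE: hypothetical helper for custom checks.
log_msg() {
    level="$1"
    shift
    printf '%s - %s: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$level" "$*"
}

log_msg INFO "2 mesh neighbors detected"
```

On a node, append the output to the health log, for example `log_msg WARNING "custom check failed" >> /x00/monitoring/logs/mesh-health.log`.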

Viewing Logs¶

# Last 50 lines
make monitoring-logs NODE=1

# Follow logs in real-time
ssh root@10.11.12.1
tail -f /x00/monitoring/logs/mesh-health.log

Data Storage¶

Storage Layout¶

/x00/monitoring/
├── collectd/
│   └── rrd/              # RRD database files (time-series data)
│       ├── cpu-0/
│       ├── memory/
│       ├── interface-bat0/
│       └── ...
├── vnstat/               # vnStat database
│   ├── bat0
│   ├── br-lan
│   └── wlan0
└── logs/                 # Health check logs
    └── mesh-health.log

Data Retention¶

RRD Files: 1200 rows per RRA (configurable). At 30-second intervals this gives ~10 hours of detailed data (1200 × 30 s = 36,000 s); older data is automatically aggregated to longer intervals, retaining 1 year of historical trends
Health Logs: 30 days (auto-rotated)
vnStat Database: Unlimited (until disk full)
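The ~10 hour figure for detailed data follows from rows multiplied by the collection interval. The sketch below reproduces that arithmetic so you can estimate the effect of changing either value. It assumes the finest archive stores one row per collection interval; collectd derives the real archive layout from its configured timespans, so treat the result as an estimate. The function name is illustrative.

```shell
#!/bin/sh
# rra_window_hours ROWS STEP_SECONDS
# Prints how many whole hours ROWS samples cover at STEP_SECONDS per sample.
# NOTE: estimate only; assumes one row per collection interval.
rra_window_hours() {
    echo $(( $1 * $2 / 3600 ))
}

rra_window_hours 1200 30   # documented defaults, prints 10
rra_window_hours 600 30    # after lowering RRARows to 600, prints 5
```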

Disk Usage¶

Typical disk usage after 30 days:

Collectd RRD files: ~50-100MB
vnStat database: ~5-10MB
Health logs: ~1-5MB
Total: ~100MB per node

Performance Impact¶

Resource Usage¶

Typical resource consumption per node:

CPU: < 1% average
Memory: ~10-15MB RSS
Disk I/O: Minimal (5-minute write interval)
Network: Negligible (ping monitoring only)

Optimization Settings¶

The configuration is optimized for flash storage:

Collection Interval: 30 seconds (reduces CPU load)
Cache Flush: 5 minutes (reduces disk writes)
RRA Single: Enabled (only average RRAs are created, which keeps RRD files small and reduces I/O)
Background GC: Enabled on F2FS (wear leveling)

Troubleshooting¶

Monitoring Not Working¶

# Check USB storage is mounted
mount | grep /x00

# Check services running
/etc/init.d/collectd status
/etc/init.d/vnstat status

# Restart services
/etc/init.d/collectd restart
/etc/init.d/vnstat restart

# Check logs
logread | grep collectd
logread | grep vnstat

No Data in Graphs¶

# Wait 5-10 minutes for initial data collection

# Verify RRD files exist
ls -l /x00/monitoring/collectd/rrd/

# Check collectd is collecting
/etc/init.d/collectd restart
sleep 60
ls -l /x00/monitoring/collectd/rrd/
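Listing the directory shows whether files exist, but not whether they are still being updated. The sketch below counts RRD files modified within a recent window. It assumes your `find` supports `-mmin` (GNU find does; BusyBox includes it only when built with that feature), and the function name is hypothetical. Because writes are flushed every 5 minutes, use a window of at least 10 minutes.

```shell
#!/bin/sh
# count_recent_rrd DIR [MINUTES]
# Prints the number of *.rrd files under DIR modified in the last MINUTES
# (default 10). A result of 0 suggests collectd is not writing data.
# ASSUMPTION: find supports -mmin on this system.
count_recent_rrd() {
    find "$1" -type f -name '*.rrd' -mmin "-${2:-10}" | wc -l | tr -d ' '
}

# Example on a node:
# count_recent_rrd /x00/monitoring/collectd/rrd 10
```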

vnStat "Unable to read database" Error¶

If you see Error: Unable to read database "/x00/monitoring/vnstat/bat0": No such file or directory:

# Stop vnStat
/etc/init.d/vnstat stop

# Ensure directory exists
mkdir -p /x00/monitoring/vnstat
chmod 755 /x00/monitoring/vnstat

# Recreate databases for existing interfaces
for iface in bat0 br-lan wlan0 wlan1; do
  if ip link show "$iface" >/dev/null 2>&1; then
    rm -f "/x00/monitoring/vnstat/$iface"
    vnstat --create -i "$iface"
  fi
done

# Restart service
/etc/init.d/vnstat start

# Verify databases created
ls -lh /x00/monitoring/vnstat/

# Wait 5-10 minutes for data collection
vnstat -i bat0

Monitoring Report Script Issues¶

If /usr/bin/monitoring-report.sh shows errors:

"hostname: not found" - Fixed in updated playbooks, hostname is now read from UCI/proc

"Unknown parameter '1'" in vnStat - Fixed in updated playbooks, changed vnstat -i bat0 -d 1 to vnstat -i bat0 -d

To manually fix on deployed nodes:

# Redeploy monitoring (will update scripts)
make deploy-monitoring NODE=1

Web UI Not Accessible¶

# Check LuCI is running
/etc/init.d/uhttpd status
/etc/init.d/uhttpd restart

# Check firewall allows access
uci show firewall | grep "wan.*input"

# Access from mesh network (not WAN)
# http://10.11.12.1/cgi-bin/luci/admin/statistics/graph

Disk Full on /x00¶

# Check current usage
df -h /x00

# Clean old health logs
find /x00/monitoring/logs -name "*.log" -mtime +30 -delete

# Reduce RRD retention (if needed)
# Edit /etc/config/luci_statistics
uci set luci_statistics.collectd_rrdtool.RRARows='600'
uci commit luci_statistics
/etc/init.d/collectd restart

Health Checks Not Running¶

# Check cron job exists
crontab -l | grep mesh-monitor

# Run manually to test
/usr/bin/mesh-monitor.sh

# Check logs
tail /x00/monitoring/logs/mesh-health.log

Customization¶

Adjust Collection Interval¶

Edit /etc/config/luci_statistics:

# Change from 30 to 60 seconds
uci set luci_statistics.collectd.Interval='60'
uci commit luci_statistics
/etc/init.d/collectd restart

Add Custom Metrics¶

Create a custom collectd plugin:

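# Example: monitor specific processes
# Assumes the collectd-mod-processes package is installed and that the
# generated collectd.conf includes /etc/collectd/conf.d
mkdir -p /etc/collectd/conf.d
cat > /etc/collectd/conf.d/custom.conf << 'EOF'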
LoadPlugin processes
<Plugin processes>
  Process "hostapd"
  Process "dnsmasq"
</Plugin>
EOF

/etc/init.d/collectd restart

Change Monitored Interfaces¶

# Edit interface list
uci set luci_statistics.collectd_interface.Interfaces='bat0 br-lan wlan0'
uci commit luci_statistics
/etc/init.d/collectd restart

Modify Health Check Frequency¶

# Change from 5 minutes to 10 minutes
crontab -e
# Change: */5 * * * * to */10 * * * *
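For scripted changes, where an interactive editor is inconvenient, the same edit can be sketched with sed. This assumes the cron entry starts with a `*/N` minute field and invokes mesh-monitor.sh. On OpenWrt the root crontab normally lives at /etc/crontabs/root, but verify the path on your node. The function name is an assumption.

```shell
#!/bin/sh
# set_health_interval CRONTAB_FILE MINUTES
# Rewrites the minute field of the mesh-monitor.sh entry to */MINUTES.
# Other cron entries are left untouched.
# NOTE: hypothetical helper; assumes the entry starts with "*/N ".
set_health_interval() {
    file="$1"
    minutes="$2"
    tmp="${file}.tmp.$$"
    # Only edit lines that call mesh-monitor.sh and use a */N minute field.
    sed "/mesh-monitor\.sh/ s|^\*/[0-9][0-9]* |*/${minutes} |" "$file" > "$tmp" || return 1
    # Write back in place so ownership and permissions are preserved.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example on a node (restart cron afterwards so the change is picked up):
# set_health_interval /etc/crontabs/root 10 && /etc/init.d/cron restart
```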

Integration with External Systems¶

Exporting to Grafana (Future Enhancement)¶

To send metrics to a central Grafana instance:

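Install collectd-mod-network on nodes
Configure the network plugin to forward metrics to a central collector that accepts the collectd network protocol (for example a central collectd instance or InfluxDB); Grafana itself does not receive collectd traffic
Add that collector's data store to Grafana as a data source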

Alerting (Future Enhancement)¶

Options for alerts:

Email: Configure collectd notification plugin
Webhook: Use collectd exec plugin to call webhook on threshold
MQTT: Publish health status to MQTT broker
Custom Script: Extend mesh-monitor.sh to send notifications
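As a starting point for the custom script option, the sketch below pulls ERROR lines from the health log and hands them to a notification hook. Everything here is hypothetical: `recent_errors`, `notify_if_errors`, and `send_alert` are not part of mesh-monitor.sh, and the default hook only prints what it would send. Replace it with your own mail, webhook, or MQTT call.

```shell
#!/bin/sh
# recent_errors LOG_FILE [COUNT]
# Prints the last COUNT (default 5) ERROR lines from a health log.
recent_errors() {
    sed -n '/ - ERROR: /p' "$1" | tail -n "${2:-5}"
}

# send_alert TEXT
# Placeholder hook: prints the alert instead of sending it anywhere.
# uname -n is used because the hostname command may be missing on a node.
send_alert() {
    printf 'ALERT from %s:\n%s\n' "$(uname -n)" "$1"
}

# notify_if_errors LOG_FILE
# Calls send_alert with the collected ERROR lines, or does nothing if none.
notify_if_errors() {
    errors=$(recent_errors "$1")
    if [ -z "$errors" ]; then
        return 0
    fi
    send_alert "$errors"
}

# Example on a node:
# notify_if_errors /x00/monitoring/logs/mesh-health.log
```

This version reports the same errors again on every run. A real integration would need to remember the last line it reported, for example by storing a timestamp or line count.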

Best Practices¶

Deploy After USB Storage: Always ensure USB is mounted before deploying monitoring
Monitor Disk Space: Keep /x00 usage under 80%
Review Health Logs: Check logs weekly for issues
Backup RRD Data: Periodically backup /x00/monitoring/ directory
Test Failover: Verify monitoring works after node reboots
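For the backup practice above, a timestamped tar archive is one simple approach. The sketch below is an assumption-laden example, not a shipped script: the function name and archive naming are illustrative. RRD files that collectd updates during the run may be captured mid-write, so consider stopping collectd first if you need a consistent copy.

```shell
#!/bin/sh
# backup_monitoring SRC_DIR DEST_DIR
# Creates DEST_DIR/monitoring-YYYYmmdd-HHMMSS.tar.gz from SRC_DIR
# and prints the archive path.
# NOTE: hypothetical helper; DEST_DIR should be on a different device.
backup_monitoring() {
    src="$1"
    dest="$2"
    archive="${dest}/monitoring-$(date '+%Y%m%d-%H%M%S').tar.gz"
    mkdir -p "$dest" || return 1
    # -C keeps archive paths relative to the parent of SRC_DIR.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    echo "$archive"
}

# Example on a node, writing to a hypothetical backup mount:
# backup_monitoring /x00/monitoring /mnt/backup
```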