Common Issues

This page covers frequently encountered problems and their solutions.

SSH Connection Issues

Can't SSH to Node

Symptom: ssh root@10.11.12.1 times out or refuses connection.

Solutions:

  1. Check network connectivity:
ping 10.11.12.1
  2. Verify you're on the right network:
ip addr show | grep 10.11.12
# You should have an IP in the 10.11.12.x range
  3. Check that your SSH key is loaded:
ssh-add -l
# Should show your key; if it doesn't, add it:
ssh-add ~/.ssh/openwrt_mesh
  4. Try with password (if key auth is not yet configured):
ssh -o PreferredAuthentications=password root@10.11.12.1
  5. Check the firewall on your machine:
sudo iptables -L | grep ssh
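
The network check in step 2 can be scripted so it gives a yes/no answer. This is a minimal sketch, assuming the output of `ip -4 -o addr show` is piped in; the function name `has_mesh_addr` is hypothetical and not part of the project.

```shell
# has_mesh_addr: hypothetical helper. Reads `ip -4 -o addr show` output on
# stdin and succeeds if any local address is in the 10.11.12.x range.
has_mesh_addr() {
    grep -Eq 'inet 10\.11\.12\.[0-9]+/'
}

# Usage on your machine (not run here):
#   ip -4 -o addr show | has_mesh_addr && echo "on mesh network"
```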

SSH Key Rejected

Symptom: Permission denied (publickey).

Solutions:

  1. Verify key is deployed:
ssh root@10.11.12.1 "cat /etc/dropbear/authorized_keys"
# Or for OpenSSH:
ssh root@10.11.12.1 "cat ~/.ssh/authorized_keys"
  2. Redeploy key:
ssh-copy-id -i ~/.ssh/openwrt_mesh root@10.11.12.1
  3. Check key permissions (on your machine):
chmod 600 ~/.ssh/openwrt_mesh
chmod 644 ~/.ssh/openwrt_mesh.pub

Connection Drops During Deployment

Symptom: SSH disconnects mid-playbook, deployment incomplete.

Solutions:

  1. Run the playbook with SKIP_REBOOT:
SKIP_REBOOT=true make deploy-node NODE=1
  2. After network config changes, reconnect to the new IP:
# Node changed from 192.168.1.1 → 10.11.12.1
ssh root@10.11.12.1
  3. Use console access for initial setup: connect a serial cable (115200 baud, 8N1) and configure basic networking first.
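
After a network config change the node may take a while to answer on its new address. A small retry loop can wait for it instead of reconnecting by hand. This is a sketch only: `wait_for` is a hypothetical helper, and the attempt count and delay in the usage comment are arbitrary.

```shell
# wait_for ATTEMPTS DELAY COMMAND...: hypothetical helper that reruns COMMAND
# until it succeeds or ATTEMPTS tries have been made, sleeping DELAY seconds
# between tries.
wait_for() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
        i=$((i + 1))
    done
    return 1
}

# Usage (not run here): wait up to about a minute for the node's new IP.
#   wait_for 12 5 ssh -o ConnectTimeout=3 root@10.11.12.1 true
```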

Mesh Not Forming

Nodes Don't See Each Other

Symptom: batctl n shows no neighbors.

Solutions:

  1. Check physical connections:
# On each node
ip link show | grep -E "(lan3|lan4)"
# Should show "state UP"
  2. Verify VLAN interfaces exist:
ip link show | grep "lan3.100\|lan4.100"
  3. Check batman interfaces:
batctl if
# Should show mesh interfaces with status "active"
  4. Verify batman is running:
lsmod | grep batman
# Should show batman_adv module
  5. Check for MTU issues:
ip link show bat0 | grep mtu
# Should be 1500 (after batman overhead)
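
The `batctl if` check in step 3 can be reduced to a count of active interfaces. This is a rough sketch, assuming `batctl if` prints one `<interface>: active` or `<interface>: inactive` line per interface; `count_active` is a hypothetical name.

```shell
# count_active: hypothetical helper. Reads `batctl if` output on stdin and
# prints how many interfaces are reported as "active".
count_active() {
    grep -c ': active$'
}

# Usage on a node (not run here):
#   batctl if | count_active
```

A result of 0 on a node with cabled mesh ports suggests going back to steps 1 and 2.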

Wireless Mesh Not Working

Symptom: Wired mesh works but wireless backup doesn't.

Solutions:

  1. Check 802.11s interface:
iw dev mesh0 info
# Should show type "mesh point"
  2. Verify mesh is in batman:
batctl if | grep mesh0
  3. Check wireless is on correct channel:
iw dev mesh0 info | grep channel
# All nodes must be on same channel
  4. Verify mesh ID matches:
uci get wireless.mesh.mesh_id
# Must match on all nodes
  5. Reload wireless:
wifi reload
sleep 5
batctl n
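
To compare channels across nodes it helps to extract just the channel number. This is a minimal sketch, assuming the `iw dev mesh0 info` output contains a line beginning with `channel <number>`; `mesh_channel` is a hypothetical helper.

```shell
# mesh_channel: hypothetical helper. Reads `iw dev <iface> info` output on
# stdin and prints the channel number from the first "channel" line.
mesh_channel() {
    awk '$1 == "channel" { print $2; exit }'
}

# Usage on each node (not run here):
#   iw dev mesh0 info | mesh_channel
```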

Poor Mesh Quality (Low TQ)

Symptom: batctl o shows TQ values below 200.

Solutions:

  1. Check for interference (wireless):
iw dev wlan0 survey dump
# Look for high "noise" values
  2. Check cable quality (wired):
ethtool lan3 | grep -i speed
# Should show 1000Mb/s
  3. Verify VLAN tagging:
# On switch, check VLAN 100 tagging
# On node:
tcpdump -i lan3 -e | grep vlan
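
The symptom check itself can be filtered so only weak links are shown. The sketch below assumes each originator line of `batctl o` carries its TQ in parentheses, for example `(255)`; the exact column layout differs between batctl versions, so treat the parsing as an assumption. `low_tq` is a hypothetical helper.

```shell
# low_tq [LIMIT]: hypothetical helper. Reads `batctl o` output on stdin and
# prints the lines whose TQ value (the number in parentheses) is below LIMIT.
# LIMIT defaults to 200, the threshold used in this guide.
low_tq() {
    awk -v limit="${1:-200}" '
        match($0, /\( *[0-9]+\)/) {
            # Strip the surrounding parentheses and convert to a number
            tq = substr($0, RSTART + 1, RLENGTH - 2) + 0
            if (tq < limit) print
        }'
}

# Usage on a node (not run here):
#   batctl o | low_tq 200
```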

WiFi Issues

5GHz AP Not Visible

Symptom: Can't see the client SSID.

Solutions:

  1. Check radio is enabled:
uci get wireless.radio1.disabled
# Should be 0 or not set
  2. Check AP interface:
iw dev
# Should show wlan1 with type "AP"
  3. Verify channel is valid for your region:
iw reg get
# Check if channel 36 is allowed
  4. Restart wireless:
wifi reload

Clients Can't Connect

Symptom: SSID visible but authentication fails.

Solutions:

  1. Verify password (on node):
uci get wireless.client.key
  2. Check encryption matches:
uci get wireless.client.encryption
# Usually "psk2+ccmp" for WPA2
  3. Check hostapd is running:
pgrep hostapd
ps | grep hostapd
  4. Review hostapd logs:
logread | grep hostapd | tail -20

Clients Not Getting DHCP

Symptom: Connected but no IP address.

Solutions:

  1. Check dnsmasq is running:
pgrep dnsmasq
  2. Verify DHCP pool:
uci show dhcp.lan
  3. Check bridge configuration:
brctl show br-lan
# wlan1 should be listed
  4. Restart DHCP server:
/etc/init.d/dnsmasq restart
  5. Check DHCP leases:
cat /tmp/dhcp.leases
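
When the lease file is long, looking up a single client by MAC address is quicker than reading it by eye. This is a sketch that relies on the dnsmasq lease format (expiry, MAC, IP, hostname, client ID per line); `lease_for_mac` is a hypothetical helper.

```shell
# lease_for_mac MAC: hypothetical helper. Reads a dnsmasq lease file on stdin
# and prints the IP leased to MAC (case-insensitive). Fails if none is found.
lease_for_mac() {
    awk -v mac="$1" '
        tolower($2) == tolower(mac) { print $3; found = 1 }
        END { exit !found }'
}

# Usage on a node (not run here):
#   lease_for_mac aa:bb:cc:dd:ee:ff < /tmp/dhcp.leases
```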

VLAN Issues

VLAN Interfaces Missing

Symptom: ip link show doesn't show VLAN interfaces.

Solutions:

  1. Check 8021q module:
lsmod | grep 8021q
# If nothing is listed, load it:
modprobe 8021q
  2. Verify network config:
uci show network | grep vlan
  3. Recreate VLAN interfaces:
/etc/init.d/network restart
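
After the restart, the expected VLAN interfaces can be checked in one pass. This is a minimal sketch, assuming `ip -o link show` output where the second field is the interface name followed by `:` or `@parent:`; `missing_ifaces` is a hypothetical helper, and `lan3.100` and `lan4.100` are the mesh VLAN names used elsewhere in this guide.

```shell
# missing_ifaces NAME...: hypothetical helper. Reads `ip -o link show` output
# on stdin and prints every NAME that does not appear as an interface.
missing_ifaces() {
    links=$(cat)
    for name in "$@"; do
        # Compare the name exactly, so "." is not treated as a wildcard
        printf '%s\n' "$links" | awk -v n="$name" '
            { split($2, p, /[@:]/); if (p[1] == n) found = 1 }
            END { exit !found }' || echo "$name"
    done
}

# Usage on a node (not run here):
#   ip -o link show | missing_ifaces lan3.100 lan4.100
```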

VLAN Tagging Mismatch

Symptom: Traffic not reaching destination, works without VLANs.

Solutions:

  1. Verify switch VLAN config matches node config
  2. Check PVID settings on switch
  3. Use tcpdump to verify tagging:
tcpdump -i lan3 -e vlan

IoT Devices Can Reach Main Network

Symptom: VLAN isolation not working.

Solutions:

  1. Check firewall zones:
uci show firewall | grep iot
  2. Verify forward policy:
uci get firewall.@zone[X].forward
# Replace X with the index of the IoT zone shown by the previous command
# Should be "REJECT" for IoT
  3. Check inter-zone rules:
iptables -L FORWARD -v
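
Finding the zone index by hand is error-prone, so the lookup can be done by zone name instead. The sketch below parses `uci show firewall` output of the form `firewall.@zone[N].name='iot'`; the helper name `zone_forward` is hypothetical, and the zone name `iot` is assumed from the grep in step 1.

```shell
# zone_forward ZONE: hypothetical helper. Reads `uci show firewall` output on
# stdin and prints the forward policy of the zone named ZONE.
# Fails if the zone or its forward option is not found.
zone_forward() {
    awk -F= -v zone="$1" -v q="'" '
        { val = $2; gsub(q, "", val) }   # strip the quoting uci adds
        $1 ~ /^firewall\.@zone\[[0-9]+\]\.name$/ && val == zone {
            sec = $1; sub(/\.name$/, "", sec)
        }
        # Remember each section forward policy, keyed by section path
        $1 ~ /\.forward$/ { fwd[substr($1, 1, length($1) - 8)] = val }
        END { if (sec != "" && (sec in fwd)) print fwd[sec]; else exit 1 }'
}

# Usage on a node (not run here):
#   uci show firewall | zone_forward iot
```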

Gateway Issues

All Traffic Goes Through One Node

Symptom: Gateway list shows only one gateway selected.

Solutions:

  1. Check gateway mode on all nodes:
batctl gw_mode
# Should be "server" on all nodes
  2. Verify gateway bandwidth configured:
uci get network.bat0.gw_bandwidth
  3. Check if WAN is up on all nodes:
ping -I wan 1.1.1.1

Internet Not Working

Symptom: Can ping mesh IPs but not internet.

Solutions:

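  1. Check default route:
ip route show default
  2. Verify NAT rules:
nft list ruleset | grep -i masquerade
# On nftables-based OpenWrt (fw4) the NAT rules live in table "inet fw4"
# Or on iptables-based releases: iptables -t nat -L
  3. Check WAN interface:
ip addr show wan
  4. Test DNS:
nslookup google.com 1.1.1.1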

Management Network Issues

Intermittent Connectivity to Nodes

Symptom: Pings to node management IPs (10.11.10.x) sometimes fail, then work again. SSH sessions drop randomly.

Cause: In multi-switch topologies, short ARP cache times (default 30-60 seconds) can cause race conditions during MAC/ARP relearning, leading to brief connectivity outages.

Solution: Increase ARP cache times on all mesh nodes:

# Check current settings
cat /proc/sys/net/ipv4/neigh/br-mgmt/gc_stale_time
cat /proc/sys/net/ipv4/neigh/br-mgmt/base_reachable_time_ms

# Apply fix (if not deployed via Ansible)
sysctl -w net.ipv4.neigh.br-mgmt.gc_stale_time=300
sysctl -w net.ipv4.neigh.br-mgmt.base_reachable_time_ms=120000

# Make persistent
cat >> /etc/sysctl.conf << 'EOF'
# ARP cache settings for management network (br-mgmt)
net.ipv4.neigh.br-mgmt.gc_stale_time = 300
net.ipv4.neigh.br-mgmt.base_reachable_time_ms = 120000
EOF

Note: This fix is automatically applied by Ansible during deployment (see group_vars/all.yml for configuration).

Verification:

# Test all nodes from management network
for ip in 10.11.10.1 10.11.10.2 10.11.10.3; do
  ping -c 10 $ip
done
# All should show 0% packet loss
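
To avoid reading each ping summary by eye, the loss percentage can be extracted and compared. This is a sketch, assuming the usual `N% packet loss` summary line printed by both iputils and BusyBox ping; `packet_loss` is a hypothetical helper.

```shell
# packet_loss: hypothetical helper. Reads ping output on stdin and prints the
# packet loss percentage from the summary line (without the % sign).
packet_loss() {
    sed -n 's/.*[ ,]\([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Usage from the management network (not run here):
#   for ip in 10.11.10.1 10.11.10.2 10.11.10.3; do
#     loss=$(ping -c 10 "$ip" | packet_loss)
#     [ "$loss" = "0" ] || echo "$ip: ${loss:-unknown}% packet loss"
#   done
```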

Can't Reach Node from Different Switch

Symptom: Devices on Switch B can't reach Node 1 (connected to Switch A), but can reach other nodes.

Solutions:

  1. Check switch VLAN 10 configuration. Management traffic uses VLAN 10, so VLAN 10 must be properly trunked between switches.
  2. Verify BLA (Bridge Loop Avoidance) is working:
batctl cl   # Check claim table
batctl bl   # Check backbone table
  3. Check ARP cache settings (see "Intermittent Connectivity to Nodes" above)
  4. Verify the path:
# On the affected device
ip neigh show | grep 10.11.10
# Check if MAC addresses are correct

Performance Issues

Slow Network Speeds

Solutions:

  1. Check mesh TQ values:
batctl o
# Low values indicate poor link quality
  2. Test direct link speed:
# On one node:
nc -l -p 5001 > /dev/null
# On another:
dd if=/dev/zero bs=1M count=100 | nc 10.11.12.1 5001
  3. Check for CPU overload:
top
  4. Verify Gigabit negotiation:
cat /sys/class/net/lan3/speed
# Should show 1000
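
The link test in step 2 sends 100 MiB (104857600 bytes). If you time the transfer yourself, the throughput follows from bytes and seconds. This is a small sketch; `mbits_per_sec` is a hypothetical helper, and the result is in decimal megabits per second.

```shell
# mbits_per_sec BYTES SECONDS: hypothetical helper that prints throughput in
# Mbit/s (1 Mbit = 1,000,000 bits), rounded to one decimal place.
mbits_per_sec() {
    awk -v bytes="$1" -v secs="$2" \
        'BEGIN { printf "%.1f\n", bytes * 8 / secs / 1000000 }'
}

# Example: 100 MiB transferred in 10 seconds
mbits_per_sec 104857600 10   # prints 83.9
```

For comparison, a clean Gigabit link should land well above this figure, provided the node's CPU is not the bottleneck.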

High Latency

Solutions:

  1. Check hop count:
batctl traceroute <destination-MAC>
  2. Look for routing loops:
batctl o
# Check for inconsistent next hops
  3. Check for interference (wireless):
iw dev wlan0 survey dump

Getting More Help

If these solutions don't resolve your issue:

  1. Gather diagnostic info:
# Run audit playbook
make audit-node NODE=1
  2. Check logs:
logread | tail -100
dmesg | tail -50
  3. Open a GitHub issue with the OpenWrt version, exact error messages, output of the diagnostic commands, and steps to reproduce.
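
The information requested for an issue could be gathered into one file on the node. The following is only a sketch and not the project's audit playbook: `run_section` and `collect_diag` are hypothetical helpers, the default output path is arbitrary, and the command list is an assumption drawn from the commands used on this page.

```shell
# run_section COMMAND...: hypothetical helper. Prints a header, then up to
# 100 lines of the command's output, or a note if the command is missing.
run_section() {
    echo "===== $* ====="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 | tail -n 100 || true
    else
        echo "($1 not available)"
    fi
    echo
}

# collect_diag [FILE]: hypothetical helper. Writes version info, recent logs
# and mesh state to FILE (default: /tmp/mesh-diag.txt).
collect_diag() {
    out="${1:-/tmp/mesh-diag.txt}"
    {
        run_section cat /etc/openwrt_release
        run_section logread
        run_section dmesg
        run_section batctl n
        run_section batctl o
    } > "$out"
    echo "Diagnostics written to $out"
}

# Usage on a node (not run here):
#   collect_diag /tmp/mesh-diag.txt
```

Review the file before attaching it, since logs may contain MAC addresses or other details you prefer not to publish.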

See also: Debugging Guide for advanced troubleshooting techniques.