Common Issues¶

This page covers frequently encountered problems and their solutions.

SSH Connection Issues¶

Can't SSH to Node¶

Symptom: ssh root@10.11.12.1 times out or the connection is refused.

Solutions:

Check network connectivity:

ping 10.11.12.1

Verify you're on the right network:

ip addr show | grep 10.11.12
# You should have an IP in the 10.11.12.x range

Check SSH key is loaded:

ssh-add -l
# Should show your key. If it is not listed, add it:
ssh-add ~/.ssh/openwrt_mesh

Try with password (if key auth not yet configured):

ssh -o PreferredAuthentications=password root@10.11.12.1

Check firewall on your machine:

sudo iptables -L | grep ssh

SSH Key Rejected¶

Symptom: Permission denied (publickey).

Solutions:

Verify key is deployed (log in with a password or via the serial console if key auth fails):

ssh root@10.11.12.1 "cat /etc/dropbear/authorized_keys"
# Or for OpenSSH:
ssh root@10.11.12.1 "cat ~/.ssh/authorized_keys"

Redeploy key:

ssh-copy-id -i ~/.ssh/openwrt_mesh root@10.11.12.1

Check key permissions (on your machine):

chmod 600 ~/.ssh/openwrt_mesh
chmod 644 ~/.ssh/openwrt_mesh.pub
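
The permission check above can be scripted so the mode is only changed when it is actually wrong. This is a minimal sketch: `key_mode_ok` is a hypothetical helper name, and it assumes GNU `stat` (the `-c '%a'` form), as found on most Linux machines.

```shell
# key_mode_ok is a hypothetical helper, not part of OpenSSH.
# Assumes GNU stat; BSD/macOS stat uses different flags.
# Succeeds only if the file exists and its mode is exactly 600,
# the mode recommended above for the private key.
key_mode_ok() {
    [ -f "$1" ] && [ "$(stat -c '%a' "$1")" = "600" ]
}

# Example: fix the mode only when needed.
# key_mode_ok ~/.ssh/openwrt_mesh || chmod 600 ~/.ssh/openwrt_mesh
```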

Connection Drops During Deployment¶

Symptom: SSH disconnects mid-playbook, deployment incomplete.

Solutions:

Run playbook with SKIP_REBOOT:

SKIP_REBOOT=true make deploy-node NODE=1

After network config changes, reconnect to new IP:

# Node changed from 192.168.1.1 → 10.11.12.1
ssh root@10.11.12.1

Use console access for initial setup: connect a serial cable (115200 baud, 8N1) and configure basic networking first.

Mesh Not Forming¶

Nodes Don't See Each Other¶

Symptom: batctl n shows no neighbors.

Solutions:

Check physical connections:

# On each node
ip link show | grep -E "(lan3|lan4)"
# Should show "state UP"

Verify VLAN interfaces exist:

ip link show | grep -E "lan3\.100|lan4\.100"

Check batman interfaces:

batctl if
# Should show mesh interfaces with status "active"

Verify batman is running:

lsmod | grep batman
# Should show batman_adv module

Check for MTU issues:

ip link show bat0 | grep mtu
# Should be 1500. batman-adv adds a per-packet header, so the underlying
# mesh interfaces need a larger MTU for bat0 to stay at 1500
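
The MTU check comes down to simple arithmetic: each underlying mesh interface has to carry the bat0 payload plus the batman-adv encapsulation header. The sketch below assumes an overhead of 32 bytes, the figure commonly cited for recent batman-adv releases; verify it for your version. `required_link_mtu` is a hypothetical helper name.

```shell
# required_link_mtu is a hypothetical helper.
# $1: MTU wanted on bat0
# $2: encapsulation overhead in bytes (assumed default: 32)
required_link_mtu() {
    payload_mtu="$1"
    overhead="${2:-32}"
    echo $((payload_mtu + overhead))
}

required_link_mtu 1500   # prints 1532
```

Compare the result with the MTU reported by `ip link show lan3.100` and `ip link show lan4.100`.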

Wireless Mesh Not Working¶

Symptom: Wired mesh works but wireless backup doesn't.

Solutions:

Check 802.11s interface:

iw dev mesh0 info
# Should show type "mesh point"

Verify mesh is in batman:

batctl if | grep mesh0

Check wireless is on correct channel:

iw dev mesh0 info | grep channel
# All nodes must be on same channel

Verify mesh ID matches:

uci get wireless.mesh.mesh_id
# Must match on all nodes

Reload wireless:

wifi reload
sleep 5
batctl n
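
Because the channel and mesh ID must match on every node, it can help to collect the value from each node and compare them in one pass. This is a sketch only: `all_same` is a hypothetical helper, the node addresses in the usage comment are assumptions, and values are assumed to contain no spaces.

```shell
# all_same is a hypothetical helper. It reads "node value" lines on stdin,
# prints every line whose value differs from the first one, and fails if
# any mismatch was found.
all_same() {
    awk '
        NR == 1 { first = $2 }
        $2 != first { print "mismatch: " $0; bad = 1 }
        END { exit bad + 0 }
    '
}

# Example (node addresses are assumptions; adjust to your mesh):
# for n in 10.11.12.1 10.11.12.2 10.11.12.3; do
#     echo "$n $(ssh root@"$n" uci get wireless.mesh.mesh_id)"
# done | all_same
```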

Poor Mesh Quality (Low TQ)¶

Symptom: batctl o shows TQ values below 200 (TQ ranges from 0 to 255).

Solutions:

Check for interference (wireless):

iw dev wlan0 survey dump
# Look for high "noise" values

Check cable quality (wired):

ethtool lan3 | grep -i speed
# Should show 1000Mb/s

Verify VLAN tagging:

# On switch, check VLAN 100 tagging
# On node:
tcpdump -i lan3 -e | grep vlan
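
To spot weak links quickly, the originator table can be filtered for entries under the threshold. The sketch below assumes the TQ value appears in parentheses on each line, as in typical `batctl o` output; the exact layout varies between batctl versions, so treat the parsing as an assumption. `low_tq` is a hypothetical helper name.

```shell
# low_tq is a hypothetical helper. It reads originator-table lines on stdin
# and prints those whose parenthesised TQ value is below the threshold.
# $1: threshold (defaults to 200, the value used in the symptom above)
low_tq() {
    threshold="${1:-200}"
    awk -v min="$threshold" '
        match($0, /\( *[0-9]+\)/) {
            # Strip the surrounding parentheses and convert to a number.
            tq = substr($0, RSTART + 1, RLENGTH - 2) + 0
            if (tq < min) print
        }
    '
}

# Example:
# batctl o | low_tq 200
```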

WiFi Issues¶

5GHz AP Not Visible¶

Symptom: Can't see the client SSID.

Solutions:

Check radio is enabled:

uci get wireless.radio1.disabled
# Should be 0 or not set

Check AP interface:

iw dev
# Should show wlan1 with type "AP"

Verify channel is valid for your region:

iw reg get
# Check if channel 36 is allowed

Restart wireless:

wifi reload

Clients Can't Connect¶

Symptom: SSID visible but authentication fails.

Solutions:

Verify password (on node):

uci get wireless.client.key

Check encryption matches:

uci get wireless.client.encryption
# Usually "psk2+ccmp" for WPA2

Check hostapd is running:

pgrep hostapd
ps | grep hostapd

Review hostapd logs:

logread | grep hostapd | tail -20

Clients Not Getting DHCP¶

Symptom: Connected but no IP address.

Solutions:

Check dnsmasq is running:

pgrep dnsmasq

Verify DHCP pool:

uci show dhcp.lan

Check bridge configuration:

brctl show br-lan
# wlan1 should be listed

Restart DHCP server:

/etc/init.d/dnsmasq restart

Check DHCP leases:

cat /tmp/dhcp.leases

VLAN Issues¶

VLAN Interfaces Missing¶

Symptom: ip link show doesn't show VLAN interfaces.

Solutions:

Check 8021q module:

lsmod | grep 8021q
# If nothing is printed, load the module:
modprobe 8021q

Verify network config:

uci show network | grep vlan

Recreate VLAN interfaces:

/etc/init.d/network restart
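
After restarting the network, the expected VLAN interfaces can be checked by name instead of reading the full listing. A minimal sketch, assuming the one-line output format of `ip -o link show` (index, then `name@parent:`); `missing_ifaces` is a hypothetical helper, and the interface names are the ones used elsewhere on this page.

```shell
# missing_ifaces is a hypothetical helper. It reads `ip -o link show`
# output on stdin and prints each expected interface that is absent.
missing_ifaces() {
    # Field 2 looks like "lan3.100@lan3:"; strip the colon and the parent.
    present=$(awk '{ name = $2; sub(/:$/, "", name); sub(/@.*/, "", name); print name }')
    for want in "$@"; do
        printf '%s\n' "$present" | grep -qxF "$want" || echo "$want"
    done
}

# Example:
# ip -o link show | missing_ifaces lan3.100 lan4.100
```

No output means every listed interface exists.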

VLAN Tagging Mismatch¶

Symptom: Traffic not reaching destination, works without VLANs.

Solutions:

Verify switch VLAN config matches node config
Check PVID settings on switch
Use tcpdump to verify tagging:

tcpdump -i lan3 -e vlan

IoT Devices Can Reach Main Network¶

Symptom: VLAN isolation not working.

Solutions:

Check firewall zones:

uci show firewall | grep iot

Verify forward policy:

uci get firewall.@zone[X].forward
# Replace X with the index of the IoT zone (shown by the previous command)
# Should be "REJECT" for IoT

Check inter-zone rules:

iptables -L FORWARD -v
# Or, on nftables-based firewalls: nft list ruleset
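
Looking up the zone index by hand is error-prone, so the forward policy can instead be resolved by zone name from the `uci show firewall` output. This sketch assumes the zones are anonymous sections (printed as `firewall.@zone[N]...`) and that the zone is named `iot`; both are assumptions about your configuration. `zone_forward` is a hypothetical helper.

```shell
# zone_forward is a hypothetical helper. It reads `uci show firewall`
# output on stdin and prints the forward policy of the named zone.
# It fails if the zone or its forward option is not found.
zone_forward() {
    awk -v zone="$1" -v q="'" -F= '
        { gsub(q, "", $2); split($1, part, ".") }
        $1 ~ /@zone\[[0-9]+\]\.name$/ && $2 == zone { section = part[2] }
        $1 ~ /@zone\[[0-9]+\]\.forward$/ { policy[part[2]] = $2 }
        END {
            if (section != "" && (section in policy)) {
                print policy[section]
            } else {
                exit 1
            }
        }
    '
}

# Example:
# uci show firewall | zone_forward iot
```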

Gateway Issues¶

All Traffic Goes Through One Node¶

Symptom: Gateway list (batctl gwl) shows only one gateway selected.

Solutions:

Check gateway mode on all nodes:

batctl gw_mode
# Should be "server" on all nodes

Verify gateway bandwidth configured:

uci get network.bat0.gw_bandwidth

Check if WAN is up on all nodes:

ping -I wan 1.1.1.1

Internet Not Working¶

Symptom: Can ping mesh IPs but not internet.

Solutions:

Check default route:

ip route show default

Verify NAT rules:

nft list ruleset | grep -i masquerade
# Or, on iptables-based firewalls: iptables -t nat -L

Check WAN interface:

ip addr show wan

Test DNS:

nslookup google.com 1.1.1.1

Management Network Issues¶

Intermittent Connectivity to Nodes¶

Symptom: Pings to node management IPs (10.11.10.x) sometimes fail, then work again. SSH sessions drop randomly.

Cause: In multi-switch topologies, short ARP cache times (default 30-60 seconds) can cause race conditions during MAC/ARP relearning, leading to brief connectivity outages.

Solution: Increase ARP cache times on all mesh nodes:

# Check current settings
cat /proc/sys/net/ipv4/neigh/br-mgmt/gc_stale_time
cat /proc/sys/net/ipv4/neigh/br-mgmt/base_reachable_time_ms

# Apply fix (if not deployed via Ansible)
sysctl -w net.ipv4.neigh.br-mgmt.gc_stale_time=300
sysctl -w net.ipv4.neigh.br-mgmt.base_reachable_time_ms=120000

# Make persistent
cat >> /etc/sysctl.conf << 'EOF'
# ARP cache settings for management network (br-mgmt)
net.ipv4.neigh.br-mgmt.gc_stale_time = 300
net.ipv4.neigh.br-mgmt.base_reachable_time_ms = 120000
EOF

Note: This fix is automatically applied by Ansible during deployment (see group_vars/all.yml for configuration).
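
When applying the fix by hand, note that appending with `cat >>` adds duplicate lines if it is run more than once. A minimal sketch of an idempotent alternative follows; `ensure_line` is a hypothetical helper name, not an existing tool.

```shell
# ensure_line is a hypothetical helper. It appends a line to a file only
# if an identical line is not already present.
ensure_line() {
    file="$1"
    line="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example:
# ensure_line /etc/sysctl.conf 'net.ipv4.neigh.br-mgmt.gc_stale_time = 300'
# ensure_line /etc/sysctl.conf 'net.ipv4.neigh.br-mgmt.base_reachable_time_ms = 120000'
```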

Verification:

# Test all nodes from management network
for ip in 10.11.10.1 10.11.10.2 10.11.10.3; do
  ping -c 10 $ip
done
# All should show 0% packet loss
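
To avoid scanning the full ping output for each node, the loss figure can be pulled out of the summary line. This sketch assumes the summary contains the text "% packet loss", which is the case for both iputils and BusyBox ping as far as known; `loss_pct` is a hypothetical helper.

```shell
# loss_pct is a hypothetical helper. It reads ping output on stdin and
# prints only the packet-loss percentage from the summary line.
loss_pct() {
    sed -n 's/.*[, ]\([0-9.]*\)% packet loss.*/\1/p'
}

# Example:
# for ip in 10.11.10.1 10.11.10.2 10.11.10.3; do
#     printf '%s: %s%% loss\n' "$ip" "$(ping -c 10 "$ip" | loss_pct)"
# done
```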

Can't Reach Node from Different Switch¶

Symptom: Devices on Switch B can't reach Node 1 (connected to Switch A), but can reach other nodes.

Solutions:

Check switch VLAN 10 configuration: management traffic uses VLAN 10, so VLAN 10 must be properly trunked between switches.
Verify BLA (Bridge Loop Avoidance) is working:

batctl cl   # Check claim table
batctl bl   # Check backbone table

Check ARP cache settings (see "Intermittent Connectivity to Nodes" above)
Verify the path:

# On the affected device
ip neigh show | grep 10.11.10
# Check if MAC addresses are correct

Performance Issues¶

Slow Network Speeds¶

Solutions:

Check mesh TQ values:

batctl o
# Low values indicate poor link quality

Test direct link speed:

# On one node:
nc -l -p 5001 > /dev/null
# On another (replace 10.11.12.1 with the listening node's address):
dd if=/dev/zero bs=1M count=100 | nc 10.11.12.1 5001

Check for CPU overload:

top

Verify Gigabit negotiation:

cat /sys/class/net/lan3/speed
# Should show 1000
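
The direct link test above transfers 100 MiB, so the throughput follows from the elapsed time. The sketch below does the conversion to Mbit/s; `mbit_per_s` is a hypothetical helper, and the 8-second duration is only an illustrative figure.

```shell
# mbit_per_s is a hypothetical helper.
# $1: bytes transferred, $2: elapsed seconds
mbit_per_s() {
    awk -v bytes="$1" -v secs="$2" \
        'BEGIN { printf "%.1f\n", bytes * 8 / secs / 1000000 }'
}

# 100 MiB (104857600 bytes) in 8 seconds:
mbit_per_s 104857600 8   # prints 104.9
```

A result far below the negotiated link speed points at the link or the CPU rather than the mesh routing.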

High Latency¶

Solutions:

Check hop count:

batctl traceroute <destination-MAC>

Look for routing loops:

batctl o
# Check for inconsistent next hops

Check for interference (wireless):

iw dev wlan0 survey dump

Getting More Help¶

If these solutions don't resolve your issue:

Gather diagnostic info:

# Run audit playbook
make audit-node NODE=1

Check logs:

logread | tail -100
dmesg | tail -50

Open a GitHub issue with:
OpenWrt version
Exact error messages
Output of diagnostic commands
Steps to reproduce
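
To make the issue report easier to assemble, the diagnostic commands can be captured into a single file. This is a sketch under stated assumptions: `collect_diag` is a hypothetical helper, and the output path and command list in the usage comment are only examples.

```shell
# collect_diag is a hypothetical helper. It runs each command string via
# sh -c and writes a labelled section per command to the report file.
# $1: report file, remaining arguments: command strings
collect_diag() {
    out="$1"
    shift
    : > "$out"
    for cmd in "$@"; do
        printf '=== %s ===\n' "$cmd" >> "$out"
        sh -c "$cmd" >> "$out" 2>&1 \
            || printf '(command failed or not available)\n' >> "$out"
    done
}

# Example:
# collect_diag /tmp/mesh-diag.txt "batctl n" "batctl o" \
#     "logread | tail -100" "dmesg | tail -50"
```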

See also: Debugging Guide for advanced troubleshooting techniques.