I have a question about a problem I'm having with wireguard used between two routers to implement a tunnel. The scenario is quite complex but the basic idea is that the wireguard session between router a and router b should always be active, regardless of the underlying IP route used to carry the wireguard UDP packets. Specifically what I have seen is that when routes change such that router a ends up using a different source IP address than before, the wireguard session becomes a "zombie" -- the session stays up as far as both ends are concerned but no traffic passes. Presumably this is because router b tries to send any wireguard packets to router a's old IP address and so they never arrive. In this state I've found that if I manually terminate the wireguard session then re-start it, everything works as expected. My question is: is there a way that I can have router a re-start the wireguard session it has with router b in the event that its routing topology changes? The routers are running RouterOS by the way.

This is a very well-known WireGuard behavior, and what you’re seeing is not a bug so much as a consequence of how WireGuard is deliberately designed. Let me first ground the behavior, then get specific about what can and cannot be automated on RouterOS, and finally give you practical RouterOS patterns that actually work.


Why this happens (and why WireGuard calls it “working as designed”)

WireGuard is stateless with respect to connectivity, not to cryptography.

Key points:

  1. Peers are identified by public keys, not IPs

  2. The “endpoint” (IP:port) is learned dynamically from the last authenticated packet received

  3. WireGuard does not actively probe reachability

  4. WireGuard does not reset a session just because packets stop flowing

So in your scenario:

  • Router A’s routing changes → source IP changes

  • Router A continues to send nothing (or only sends data packets that never arrive)

  • Router B keeps the last-known endpoint for A (old IP)

  • Router B’s replies go to the old IP → black hole

  • Handshake timers never trigger because no new handshake is initiated

  • Result: cryptographically valid, operationally dead (“zombie”) tunnel

This is exactly what you’re observing.


Important RouterOS-specific clarification

On RouterOS:

  • WireGuard will not automatically re-handshake just because the local source address changes

  • There is no native “on route change” hook that restarts WG

  • There is no built-in keepalive failure detection

So you must force a new handshake.


What actually forces WireGuard to recover

There are only three things that reliably fix this situation:

1. A new handshake initiated from Router A

2. Router B receiving any authenticated packet from Router A

3. Restarting or disabling/enabling the peer or interface

The good news: you can trigger #1 and #3 automatically on RouterOS.


The single most important fix: PersistentKeepalive

If you are not using this already, start here:

/interface wireguard peers set [find where comment="to-router-b"] persistent-keepalive=25

Why this helps

  • Forces Router A to send authenticated packets every 25s

  • When routing changes, those packets go out via the new source IP

  • Router B updates the endpoint automatically

  • No zombie state

When this is sufficient

✔ NAT
✔ Multiple uplinks
✔ Route failover
✔ Dynamic source IP changes

In most deployments, this alone solves your problem.

If you’re already using PersistentKeepalive and still seeing zombies → continue.


Forcing a re-handshake on route changes (RouterOS patterns)

RouterOS doesn’t expose routing events directly, but you can infer route changes.

Option 1: Script-based peer reset (most reliable)

Create a script that:

  1. Checks the current source IP used to reach router B

  2. Compares it to the last known source IP

  3. If changed → reset the peer

Example logic (simplified)

:local peerName "to-router-b"
:local dstIP "B.PUBLIC.IP"
:global wgLastSrc
:local currentSrc [/ip route get [find dst-address=($dstIP . "/32") active=yes] pref-src]
:if ($currentSrc != $wgLastSrc) do={
    /interface wireguard peers disable [find comment=$peerName]
    :delay 2
    /interface wireguard peers enable [find comment=$peerName]
    :set wgLastSrc $currentSrc
}

Run this every 10–30 seconds via scheduler.
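Assuming you save the script above under a name such as wg-src-watch (the name and 15-second interval are illustrative, not prescriptive), the scheduler entry might look like:

```
# hypothetical scheduler entry; "wg-src-watch" is whatever name
# you saved the watchdog script under
/system scheduler add name=wg-src-watch-run interval=15s \
    on-event="/system script run wg-src-watch"
```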

✔ Brutally effective
✔ RouterOS-friendly
✔ No WireGuard internals needed


Option 2: Netwatch-triggered restart

If Router B has a stable address:

/tool netwatch add host=B.PUBLIC.IP interval=10s down-script="\
    /interface wireguard peers disable [find comment=to-router-b]; \
    :delay 2; \
    /interface wireguard peers enable [find comment=to-router-b]"

This works because failed routing causes netwatch to flap, which resets the tunnel.


Option 3: Periodic handshake reset (crude but safe)

If you want absolute reliability:

/system scheduler add name=wg-periodic-bounce interval=5m on-event="\
    /interface wireguard peers disable [find comment=to-router-b]; \
    :delay 2; \
    /interface wireguard peers enable [find comment=to-router-b]"

WireGuard tolerates this extremely well.


What you cannot do (important)

❌ Force WireGuard to “notice” routing changes automatically
❌ Hook directly into routing table changes
❌ Make WireGuard re-resolve endpoints unless traffic flows

This is by design and applies to Linux, BSD, RouterOS, etc.


Best-practice summary (what I recommend)

For RouterOS in complex routing environments:

  1. Always set persistent-keepalive

  2. Add a scripted peer reset tied to routing health

  3. Treat WireGuard as cryptographic plumbing, not a routing protocol

  4. Let RouterOS scripts handle topology awareness

This combination gives you carrier-grade stability, even with:

  • ECMP

  • VRFs

  • Multiple uplinks

  • Policy routing

  • Dynamic IPs


If you want, you can describe:

  • How many uplinks

  • Whether NAT is involved

  • Whether Router B is static or dynamic

  • RouterOS version

…and I can give you a drop-in RouterOS script tailored exactly to your topology.

Thank you. I am a bit confused by your answer. On the one hand you said that RouterOS has no automatic keepalive failure detection, but then you say to enable persistent keepalives. For your information I do have persistent keepalive enabled with a 25 second interval. That doesn't seem to help with the problem.

You’re totally right to call that out — I used “keepalive” in two different senses.

Two different “keepalive” concepts

1) WireGuard PersistentKeepalive
This is not a health check. It does not detect failure and it does not restart anything. It only sends an authenticated packet periodically so that:

  • NAT mappings stay open

  • the other side can learn a new “current endpoint” if it receives that packet

WireGuard itself still happily considers the peer “configured” even if nothing is getting through. (That’s the “no failure detection” part.)

WireGuard’s own docs describe PersistentKeepalive as simply making WireGuard “more chatty,” mainly for NAT traversal.

2) A “tunnel is dead” detector + recovery action
That’s what RouterOS does not have built-in for WireGuard. You have to add it (script/netwatch/routing rules).


Why PersistentKeepalive might not help in your case

PersistentKeepalive only fixes the zombie state if Router B actually receives those packets from Router A. If B never receives them, B can’t update its “current endpoint,” so it keeps sending to the old IP.

Common reasons B might not be receiving them when A’s routing changes:

  1. Your policy routing / marks aren’t applied to WireGuard handshake/keepalive traffic, so RouterOS may emit the handshake out the “wrong” WAN/interface after a topology change. MikroTik explicitly warns about this class of issue and mentions using the peer “responder” option and correct routing-mark handling so handshakes don’t leave via the wrong route.

  2. The new path changes source NAT behavior (different public IP/CGNAT mapping/firewall), and inbound UDP from B to A’s new source IP/port doesn’t make it back.

  3. B has an endpoint-address pinned and A’s packets aren’t arriving to override it (same root cause: B isn’t seeing A). RouterOS shows both “endpoint” and “current endpoint” for exactly this reason — they can diverge.


What to do on RouterOS when PersistentKeepalive isn’t enough

At that point, you need a RouterOS-side “dead tunnel detector” + “force new handshake” action.

The most RouterOS-native approach is:

A) Detect: “tunnel can’t carry traffic”

Ping something across the tunnel (the peer’s WG IP is ideal), or check handshake age + a ping.

B) Recover: bounce the peer (or interface)

Disable/enable the peer (or the WG interface) to force a fresh handshake.

This is a common operational workaround on MikroTik and there are multiple community scripts built around “if last-handshake is too old / ping fails → disable+enable peer.”

Note: there are also reports that “disable/enable peer from script” can behave differently than clicking in WinBox in some RouterOS versions. If you hit that, bouncing the interface instead of the peer is often more reliable.


Practical recovery pattern (what I’d implement)

1) Add a watchdog that pings Router B’s WireGuard IP

  • If ping fails for (say) 3 intervals → bounce WG peer or WG interface.

2) (If you use multiple WANs / PBR) ensure WireGuard traffic is routed consistently

This is the “handshake left via wrong uplink” class of issue MikroTik warns about.
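The detect-and-recover pattern above can be sketched as a scheduled RouterOS script. This is an illustration only: Router B's WireGuard-internal address of 10.0.0.2 and the peer comment "to-router-b" are placeholders, and the three-strikes threshold is arbitrary.

```
# hypothetical watchdog: ping the peer's WireGuard-internal IP and
# bounce the peer after three consecutive failed checks
:global wgFailCount
:if ([:typeof $wgFailCount] = "nothing") do={ :set wgFailCount 0 }
# /ping returns the number of replies received
:if ([/ping 10.0.0.2 count=3] = 0) do={
    :set wgFailCount ($wgFailCount + 1)
} else={
    :set wgFailCount 0
}
:if ($wgFailCount >= 3) do={
    /interface wireguard peers disable [find comment="to-router-b"]
    :delay 2
    /interface wireguard peers enable [find comment="to-router-b"]
    :set wgFailCount 0
}
```

Run it from a scheduler at whatever interval matches your tolerance for downtime (e.g. every 10–30 seconds).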


The key troubleshooting check (quick, high-signal)

When it’s “zombie,” look at Router B:

  • Current Endpoint for Router A’s peer — does it still show the old IP?

  • Latest handshake — is it still advancing? (Keepalives go out every 25s, but WireGuard only renews the handshake roughly every two minutes, so expect this field to cycle on that timescale rather than every 25s.)

If handshake isn’t updating, PersistentKeepalive packets from A are not reaching B (routing/NAT/PBR issue).
If handshake is updating but data doesn’t pass, that’s a different problem (routes/allowed-address/firewall/NAT).
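Both values are readable from the RouterOS 7 CLI. A sketch, assuming the peer is tagged with comment "to-router-b" (field names here match current RouterOS 7 releases, but verify against your version):

```
# full detail view: shows endpoint-address, current-endpoint-address,
# and last-handshake side by side
/interface wireguard peers print detail where comment="to-router-b"

# or pull just the interesting fields in a script
:put [/interface wireguard peers get [find comment="to-router-b"] current-endpoint-address]
:put [/interface wireguard peers get [find comment="to-router-b"] last-handshake]
```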


If you paste (redact public IPs if you want):

  • /interface wireguard peers print detail from both routers (the peer in question)

  • how you do multi-WAN / PBR (routing marks? check-gateway? recursive routes?)

…I can tell you which of the two failure modes you’re in and give you the exact RouterOS script + rule placement to make it self-heal.

Ok thanks. I understand the concept that the keepalive packets may be black-holed, but I don't see how that can happen while packets from a newly initiated wireguard session do get through. I'd expect both kinds of packet to be subject to the same routing regime. But you have given me some ideas for further diagnosis. The next time I see this failure condition I will take some packet traces to attempt to diagnose further.

That’s a very reasonable expectation — and your intuition is basically right. If keepalive packets are black-holed but a fresh session works, something subtle and RouterOS-specific is going on. Let me try to reconcile that apparent contradiction and give you some concrete things to look for in your traces.


Why this can happen even though it “shouldn’t”

From a pure IP routing perspective you’re correct:

A WireGuard keepalive and a WireGuard handshake packet are both just UDP packets and should follow the same routing logic.

However, in practice on RouterOS there are three differences that can matter.


1. “Current endpoint” vs “endpoint-address” asymmetry

When a peer is established, RouterOS stores:

  • endpoint-address (configured)

  • current-endpoint (learned from last valid packet)

If the tunnel is already “up” (zombie state):

  • RouterOS continues to send data/keepalives to current-endpoint

  • It does not re-evaluate the endpoint until a handshake is forced

When you manually disable/enable the peer:

  • RouterOS discards current-endpoint

  • It re-resolves and re-evaluates routing for endpoint-address

  • A fresh handshake is generated

So although both packets are UDP, they are addressed to different destination IPs in the failure case. That alone is enough to make routing differ.

👉 In traces
Check whether keepalives are still destined to the old peer IP, while the new session starts with packets to the configured endpoint.


2. Source-address selection differences

RouterOS chooses a source address at send time, not at peer configuration time.

In a zombie state:

  • Keepalives may continue using the old preferred source IP

  • Especially if pref-src, routing marks, or VRFs were involved when the peer first came up

When you restart the peer:

  • RouterOS recomputes source address selection

  • New source IP matches the new routing topology

This behavior has been observed with policy routing and recursive routes on RouterOS even when the FIB looks correct.

👉 In traces
Compare:

  • Source IP of keepalive packets

  • Source IP of packets in the newly initiated session

If they differ, you’ve found your explanation.


3. Connection tracking & NAT state (very common)

If NAT is involved anywhere (even upstream):

  • Keepalives reuse the same UDP 5-tuple

  • NAT state may still point to the old path or interface

  • Replies go nowhere

A fresh session:

  • Uses a new source port

  • Creates a fresh NAT mapping

  • Works immediately

This is especially common with:

  • Multiple WANs

  • CGNAT

  • Asymmetric routing

  • FastTrack (worth disabling for WG traffic)

👉 In traces
Look for:

  • Same source port reused during zombie state

  • Different source port after restart


What to capture when it happens (high-value traces)

When you catch it again, I’d suggest capturing four things:

  1. On Router A (egress interface)

    • Do keepalive packets leave?

    • What source IP/port?

    • What destination IP?

  2. On Router A (any other WAN interface)

    • Make sure they’re not exiting the “wrong” interface

  3. On Router B

    • Are any packets from Router A arriving?

    • Is current-endpoint updating?

  4. Immediately after peer restart

    • Compare handshake packet headers vs keepalive headers

You don’t need a long capture — 10–20 seconds is plenty.
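For the capture itself, RouterOS's built-in sniffer is sufficient. A sketch, assuming the egress interface is ether1 and the WireGuard port is 13231 (both placeholders for your actual values):

```
# capture WireGuard UDP on the egress interface for a short window
/tool sniffer set filter-interface=ether1 filter-port=13231 \
    filter-ip-protocol=udp file-name=wg-zombie.pcap
/tool sniffer start
:delay 20s
/tool sniffer stop
# then download wg-zombie.pcap from Files and inspect it in Wireshark
```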


A key RouterOS knob to double-check

If you use policy routing:

  • Make sure WireGuard traffic is explicitly routed

    • Either via routing marks

    • Or by forcing the peer endpoint into a specific routing table

Many MikroTik multi-WAN zombie WG issues ultimately come down to handshake packets not matching the intended routing rules. MikroTik themselves call this out in RouterOS WG examples and forum threads.


Bottom line

You’re not missing something obvious — this is one of those cases where:

  • IP theory says “these packets should behave the same”

  • RouterOS implementation details make them behave differently

Your plan to take packet traces is exactly the right next step. Once you have them, the explanation almost always becomes obvious within a few minutes.

If you want, feel free to paste a redacted snippet of:

  • packet headers (src/dst IP+port)

  • current-endpoint vs endpoint-address

  • routing table at the time

…and I’m happy to help you interpret what you see.