That’s a very reasonable expectation — and your intuition is basically right. If keepalive packets are black-holed but a fresh session works, something subtle and RouterOS-specific is going on. Let me try to reconcile that apparent contradiction and give you some concrete things to look for in your traces.
Why this can happen even though it “shouldn’t”
From a pure IP routing perspective you’re correct:
A WireGuard keepalive and a WireGuard handshake packet are both just UDP packets and should follow the same routing logic.
However, in practice on RouterOS there are three differences that can matter.
1. “Current endpoint” vs “endpoint-address” asymmetry
When a peer is established, RouterOS stores:
If the tunnel is already “up” (zombie state):
When you manually disable/enable the peer:
- RouterOS discards current-endpoint
- It re-resolves and re-evaluates routing for endpoint-address
- A fresh handshake is generated
So although both packets are UDP, they are addressed to different destination IPs in the failure case. That alone is enough to make routing differ.
👉 In traces
Check whether keepalives are still destined to the old peer IP, while the new session starts with packets to the configured endpoint.
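If it helps when reading the capture, here is a minimal Python sketch for sorting captured UDP payloads into WireGuard message kinds and counting the destination IPs seen for each kind. It relies on the WireGuard wire format: the first byte is the message type, followed by three reserved zero bytes, and a keepalive is a transport-data message with an empty payload (32 bytes in total). The function names and the (dst_ip, payload) input shape are my own assumptions for illustration; you would need to extract those pairs from your capture yourself.

```python
import struct
from collections import Counter

# WireGuard message types, taken from the first byte of the UDP payload.
WG_TYPES = {
    1: "handshake_initiation",
    2: "handshake_response",
    3: "cookie_reply",
    4: "transport_data",
}

# A keepalive is transport data with an empty inner packet:
# 16-byte header (type, reserved, receiver index, counter) + 16-byte auth tag.
KEEPALIVE_LEN = 32


def classify_wg_payload(payload: bytes) -> str:
    """Return a label for one UDP payload (hypothetical helper)."""
    if len(payload) < 4:
        return "not_wireguard"
    # Type byte plus three reserved zero bytes read as a little-endian u32.
    (msg_type,) = struct.unpack_from("<I", payload, 0)
    kind = WG_TYPES.get(msg_type)
    if kind is None:
        return "not_wireguard"
    if kind == "transport_data" and len(payload) == KEEPALIVE_LEN:
        return "keepalive"
    return kind


def destinations_by_kind(packets):
    """Count destination IPs per message kind.

    `packets` is an iterable of (dst_ip, udp_payload) pairs; this input
    shape is an assumption, not something a capture tool hands you directly.
    """
    result = {}
    for dst_ip, payload in packets:
        kind = classify_wg_payload(payload)
        result.setdefault(kind, Counter())[dst_ip] += 1
    return result
```

If the "keepalive" entry and the "handshake_initiation" entry show different destination IPs, that matches the stale-endpoint explanation above.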
2. Source-address selection differences
RouterOS chooses a source address at send time, not at peer configuration time.
In a zombie state:
- Keepalives may continue using the old preferred source IP
- Especially if pref-src, routing marks, or VRFs were involved when the peer first came up
When you restart the peer:
This behavior has been observed with policy routing and recursive routes on RouterOS even when the FIB looks correct.
👉 In traces
Compare:
If they differ, you’ve found your explanation.
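To make that comparison mechanical, here is a small Python sketch that takes packets you have already noted down from the two captures and reports which (source IP, egress interface) pairs appear only on keepalives and which only on the fresh handshake. The record fields, the "keepalive" / "handshake" labels, and the function name are assumptions chosen for illustration, not anything RouterOS exports in this form.

```python
from typing import Iterable, NamedTuple


class Observed(NamedTuple):
    """One packet noted from a capture (hypothetical record shape)."""
    kind: str       # "keepalive" or "handshake"
    src_ip: str     # source address seen on the wire
    out_iface: str  # interface the packet left on


def source_mismatch(packets: Iterable[Observed]) -> dict:
    """Report (src_ip, out_iface) pairs unique to keepalives or to handshakes."""
    seen = {}
    for p in packets:
        seen.setdefault(p.kind, set()).add((p.src_ip, p.out_iface))
    keepalive = seen.get("keepalive", set())
    handshake = seen.get("handshake", set())
    return {
        "keepalive_only": keepalive - handshake,
        "handshake_only": handshake - keepalive,
    }
```

Two empty sets mean both kinds of packet left with the same source address on the same interface, so source selection is probably not your culprit. Anything in either set is worth checking against your routing rules.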
3. Connection tracking & NAT state (very common)
If NAT is involved anywhere (even upstream):
A fresh session:
This is especially common with:
👉 In traces
Look for:
What to capture when it happens (high-value traces)
When you catch it again, I’d suggest capturing four things:
- On Router A (egress interface)
- On Router A (any other WAN interface)
- On Router B
- Immediately after peer restart
You don’t need a long capture — 10–20 seconds is plenty.
A key RouterOS knob to double-check
If you use policy routing:
Many MikroTik multi-WAN zombie WG issues ultimately come down to handshake packets not matching the intended routing rules. MikroTik themselves call this out in RouterOS WG examples and forum threads.
Bottom line
You’re not missing something obvious — this is one of those cases where:
Your plan to take packet traces is exactly the right next step. Once you have them, the explanation almost always becomes obvious within a few minutes.
If you want, feel free to paste a redacted snippet of:
- packet headers (src/dst IP+port)
- current-endpoint vs endpoint-address
- routing table at the time
…and I’m happy to help you interpret what you see.