Add multi-hop NextHop recovery tests and unit tests for routing reliability

- Introduced a new test suite for multi-hop NextHop directed-message delivery and relay recovery in `test_nexthop_multihop_recovery.py`. This includes tests for end-to-end delivery and recovery after relay drop.
- Implemented unit tests in `test_main.cpp` for NextHop routing reliability mitigations, covering:
  - M1: Ambiguity-aware last-byte resolution.
  - M2: NextHopRouter's strict-neighbor gate and hop limit checks.
  - M3: Route-health freshness and failure decay.
- Enhanced mock classes to facilitate controlled testing of node behaviors and routing logic.
This commit is contained in:
Ben Meadors
2026-06-13 16:31:13 -05:00
parent 9ea1f0065a
commit abd66bfca3
11 changed files with 1604 additions and 34 deletions

View File

@@ -0,0 +1,456 @@
# NextHop direct-message reliability on dense meshes — findings & plan
**Status:** Proposal / design doc for branch handoff (no code changes yet)
**Date:** 2026-06-13
**Area:** `src/mesh` router stack (`NextHopRouter`, `ReliableRouter`, `FloodingRouter`, `Router`, `NodeDB`, `PacketHistory`)
**Constraint:** No over-the-air / wire-format changes — `next_hop` and `relay_node` stay 1 byte, no `PacketHeader` changes, no breaking protobuf changes. All new state is RAM-only.
This document captures the analysis and the proposed mitigations so the work can be
continued on this branch by anyone. It is intentionally code-grounded (file:line
references throughout) and standalone — you should not need the original investigation
context to pick it up.
---
## TL;DR
NextHop routing for direct messages (DMs) is unreliable on dense meshes. The headline
cause is the **birthday problem**: `next_hop` and `relay_node` are each a single byte
(the last byte of a 32-bit node number), so on a mesh of N nodes the probability that
two share the same byte hits ~50% at **~19 nodes** and is near-certain by 50100. But
there are **other, equally important issues**: that single byte is trusted blindly at
five different code sites, learned routes **never decay**, routes are learned from the
**reverse (ACK) path** (asymmetric-link hazard), and collision-driven spurious
rebroadcasts **amplify congestion** exactly when the mesh is busy.
Because we can't widen the on-wire field, the fix is **interpretation-side** ("don't
trust a byte that doesn't map to a unique reachable neighbor — flood instead") plus
**recovery-side** ("decay stale/failing routes so they get re-discovered"). Four
mitigations, M1M4, all RAM-only. The net behavioral change: on dense/mobile meshes a
DM that today silently misroutes or black-holes instead falls back to managed flooding
(which still delivers) and re-learns a fresh route quickly. Sparse-mesh happy paths are
unchanged.
---
## How NextHop routing works today (mechanics)
Inheritance chain: `Router``FloodingRouter``NextHopRouter``ReliableRouter`.
**The single-byte identifiers.** Both routing bytes come from one helper:
```cpp
// src/mesh/NodeDB.h:255
uint8_t getLastByteOfNodeNum(NodeNum num) { return (uint8_t)((num & 0xFF) ? (num & 0xFF) : 0xFF); }
```
It projects a 32-bit node number onto 255 values (`0x00` is remapped to `0xFF` so it
never collides with the `0`-valued sentinels `NO_NEXT_HOP_PREFERENCE` / `NO_RELAY_NODE`,
`src/mesh/MeshTypes.h:44-46`). `next_hop` and `relay_node` in the packet header are
`uint8_t` (`src/mesh/mesh.pb.h`, comments "Last byte of the node number…"). The learned
route stored per destination, `meshtastic_NodeInfoLite::next_hop`, is also a single byte
(`src/mesh/generated/meshtastic/deviceonly.pb.h:83`).
**Sending a DM**`NextHopRouter::send` (`src/mesh/NextHopRouter.cpp:23`):
1. `p->relay_node = getLastByteOfNodeNum(getNodeNum())` (mark ourselves as relayer).
2. `p->next_hop = getNextHop(p->to, p->relay_node)` (`src/mesh/NextHopRouter.cpp:192`):
look up `nodeDB->getMeshNode(to)->next_hop`; return it unless it equals the relayer
byte; otherwise `NO_NEXT_HOP_PREFERENCE` (→ flood).
**Relaying**`NextHopRouter::perhapsRebroadcast` (`src/mesh/NextHopRouter.cpp:133`):
rebroadcast iff `next_hop == NO_NEXT_HOP_PREFERENCE` (flood) **or**
`next_hop == getLastByteOfNodeNum(getNodeNum())` (we are the addressed next hop)
(`:147`). Each node only ever compares against **its own** byte.
**Learning**`NextHopRouter::sniffReceived` (`src/mesh/NextHopRouter.cpp:89`): on an
ACK/reply (`request_id`/`reply_id` set), if the relayer of the ACK was also a relayer of
the original packet (validated via `PacketHistory::checkRelayers`), set
`origTx->next_hop = p->relay_node` (`:114`). I.e. the **forward** next-hop is learned
from the **reverse** path's relayer.
**Retransmission / fallback**`NextHopRouter::doRetransmissions`
(`src/mesh/NextHopRouter.cpp:284`). Budgets: `NUM_RELIABLE_RETX=3` (originator: initial
- 2 retries), `NUM_INTERMEDIATE_RETX=2` (relayer: 1 retry). On the **last** retry
(`numRetransmissions==1`) it resets `next_hop` to `NO_NEXT_HOP_PREFERENCE` on the packet
**and** clears `sentTo->next_hop` in NodeDB, then floods (`:313-321`). Retransmit timing
comes from `iface->getRetransmissionMsec`, whose contention window **grows with channel
utilization** (`src/mesh/RadioInterface.cpp` `getTxDelayMsec`/`getTxDelayMsecWeighted`).
**Dedup / relayer history**`PacketHistory` (`src/mesh/PacketHistory.cpp`): a bounded
ring (`PACKETHISTORY_MAX = max(MAX_NUM_NODES*2, 100)`, 20 B/record) keyed by
`(sender,id)`, tracking up to `NUM_RELAYERS=6` relayer **bytes** per packet in
`relayed_by[]`. `wasRelayer` (`:490`) and `checkRelayers` (`:517`) match bytes against
that array.
---
## Root-cause analysis
### 1. The single byte is trusted blindly at five sites (the birthday problem)
| # | Site | File:line | Failure on collision |
| --- | -------------------------------- | --------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| 1 | Rebroadcast self-check | `NextHopRouter.cpp:147` | A remote "impostor" node sharing the intended next-hop's byte also rebroadcasts → wasted airtime / congestion. |
| 2 | Route learning | `NextHopRouter.cpp:111-114` | Stores an ambiguous byte as the route; later resolves to the wrong physical node. |
| 3 | Relayer validation | `PacketHistory.cpp:490-538` | `wasRelayer(byte)` returns true for the wrong node → mis-validated ACK / mis-learn. |
| 4 | Favorite-router hop preservation | `Router.cpp:120-145` | **First** NodeDB node whose last byte matches wins — non-deterministic; can preserve hops for the wrong relay (hop leak). |
| 5 | Send-path lookup | `NextHopRouter.cpp:192-207` | Emits a byte that may address the wrong node; no check it still maps to a reachable neighbor. |
Collision math (uniform last byte over 255 buckets): P(collision) ≈ 50% at ~19 nodes,
> 99% by ~75 nodes. Dense meshes are squarely in the "always colliding" regime.
### 2. Stale routes never decay
The learned `next_hop` byte is cleared only on the **current DM's** last retry
(`NextHopRouter.cpp:313-321`). A route learned hours ago that has since gone dead is
still trusted on the **next** DM's first attempt — which on a congested mesh is also the
slowest attempt. Result: silent black-hole at a dead hop until the retransmission budget
drains, then a late flood. Intermediate nodes hold stale routes indefinitely.
### 3. Reverse-path (asymmetric-link) learning
`origTx->next_hop` is learned from the ACK's relayer (`NextHopRouter.cpp:110-114`) — the
**reverse** direction. RF links are frequently asymmetric, so the best reverse relay can
be a poor forward relay. Worse, the next reverse ACK immediately re-learns the same bad
hop, so the route **flaps** back to the bad value even after a failure reset.
### 4. Congestion amplification
Collision-driven impostor rebroadcasts (issue 1) add airtime; the contention window
grows with channel utilization, so retransmit intervals **lengthen** exactly when the
mesh is busy. The 3-try reliable budget can then expire before delivery. On dense
meshes, efficiency _is_ reliability.
### Note: pubkey-derived node numbers (develop / 2.8) — does not change the plan
develop derives the node number from the public key:
`my_node_num = crc32Buffer(public_key)` (`src/mesh/NodeDB.cpp:481`), re-derived on key
change in `createNewIdentity()` (`src/mesh/NodeDB.cpp:3113`). This **reinforces** the
plan rather than changing it:
- **Birthday problem unchanged and now textbook-exact.** CRC32 mixes well → the last
byte is uniformly distributed over 256 values. Derivation adds no wire bits.
- **Node numbers are now immutable / identity-bound.** Pre-2.8 `pickNewNodeNum()` could
renumber a node to dodge a conflict; now the number is fixed by the key, so a last-byte
collision **cannot be resolved operationally by renumbering** → M1/M2/M3 become _more_
necessary.
- **Resolver gets cleaner inputs.** Stable node numbers keep a learned byte bound to one
identity (good for M3 freshness). `createNewIdentity()` retires the old entry by marking
it **ignored** and clearing its pubkey (`src/mesh/NodeDB.cpp:3123-3125`), which M1's
candidate gate already skips — so key rotation can't pollute resolution.
- **No wire-free disambiguation unlocked.** A receiver still gets only 1 byte and cannot
recover which full node number a colliding value meant — so "detect ambiguity → flood"
remains the correct strategy.
---
## Proposed mitigations
Key insight for all of M1/M2: **a 1-byte ID only needs to be unique among a node's
direct neighbors / plausible relays, not the whole mesh.** That candidate set is small
(typically 515), so a byte usually resolves unambiguously there; when it doesn't, fall
back to the _safe_ behavior (flood / decrement / don't-learn).
### M1 — Ambiguity-aware last-byte resolution (new NodeDB primitive)
New types + methods in `src/mesh/NodeDB.h` (near line 255) / `src/mesh/NodeDB.cpp`
(near `getMeshNode`, ~2936):
```cpp
enum class LastByteResolution : uint8_t { None, Unique, Ambiguous };
struct ResolvedNode { LastByteResolution status = LastByteResolution::None; NodeNum num = 0; };
// Resolve a single on-wire last-byte to a unique full NodeNum among relevant candidates.
ResolvedNode resolveLastByte(uint8_t lastByte, bool requireDirectNeighbor);
// Convenience: true iff exactly one relevant candidate (Ambiguous and None both -> false = SAFE).
bool resolveUniqueLastByte(uint8_t lastByte, bool requireDirectNeighbor, NodeNum *outNum = nullptr);
```
- **One linear pass** over `meshNodes`, reusing `getNumMeshNodes()`/`getMeshNodeByIndex()`,
the bitfield helpers (`nodeInfoLiteIsFavorite/HasUser/IsIgnored`), `sinceLastSeen()`,
and `getLastByteOfNodeNum()`. **Early-exit** on the 2nd match (return `Ambiguous`).
- **Guard:** `if (lastByte == 0) return {None, 0};` (covers `NO_RELAY_NODE` / MQTT-invalid).
- **Candidate gate** (skip): `num == getNodeNum()` (never resolve to ourselves), `num == 0`,
`num == NODENUM_BROADCAST`, `nodeInfoLiteIsIgnored`. Then match
`getLastByteOfNodeNum(node->num) == lastByte` (cheapest test last, mirroring `Router.cpp:119`).
- **Relevance gate:**
- `requireDirectNeighbor == true` (strict, for SEND): `has_hops_away && hops_away == 0`
**and** `sinceLastSeen(node) < NEXTHOP_NEIGHBOR_FRESH_SECS`.
- `requireDirectNeighbor == false` (lenient, for learn / hop-preserve): accept if direct
neighbor **or** `nodeInfoLiteIsFavorite` **or** role ∈ {ROUTER, ROUTER_LATE, CLIENT_BASE}.
- **No tie-break.** A collision must return `Ambiguous` — picking "best SNR" would
resurrect the silent-misroute bug. (Deliberate non-goal; document in code.)
New constant in `src/mesh/MeshTypes.h` (near line 44):
`#define NEXTHOP_NEIGHBOR_FRESH_SECS (60 * 60 * 2)` (mirrors `NUM_ONLINE_SECS`).
### M2 — Only route on bytes that resolve to a unique, reachable neighbor
In `getNextHop` (`src/mesh/NextHopRouter.cpp:192-207`), after the existing split-horizon
check (`node->next_hop != relay_node`), require the stored byte to resolve to a **unique,
currently-fresh direct neighbor**; else flood:
```cpp
if (node->next_hop != relay_node) {
ResolvedNode r = nodeDB->resolveLastByte(node->next_hop, /*requireDirectNeighbor=*/true);
if (r.status == LastByteResolution::Unique) return node->next_hop;
LOG_WARN("Next hop 0x%x for 0x%x %s -> flood", node->next_hop, to,
r.status == LastByteResolution::Ambiguous ? "ambiguous among neighbors" : "no longer a neighbor");
return std::nullopt;
}
```
This self-heals when a neighbor goes away (unicast-into-a-void becomes a flood). It
applies to originating, relaying, and retrying, since all route through `getNextHop`.
Apply M1's safe fallback at the other sites:
- **Learning** (`NextHopRouter.cpp:111-114`): gate `origTx->next_hop = p->relay_node` on
`resolveUniqueLastByte(p->relay_node, /*direct=*/false)`. Ambiguous/unknown → don't
learn (leave route unset → flood).
- **Favorite-router preservation** (`Router.cpp:120-145`): replace the "first match wins"
loop with `resolveUniqueLastByte(p->relay_node, /*direct=*/false)` + a re-check that the
resolved node is favorite/has_user/router. Ambiguous/none/not-favorite → **decrement**
(safe). Net: removes one full DB scan, adds one resolver scan (wash).
**Left unchanged, by design (document why in code):**
- **Site 1** rebroadcast self-check (`NextHopRouter.cpp:147`) and self-identity checks
(`ReliableRouter.cpp:127`): a node matches its **own** byte — no DB resolution helps. A
remote impostor sharing the intended next-hop's byte will still rebroadcast. M1/M2
shrink the blast radius by reducing how often an ambiguous byte is ever stored or
originated; a true fix needs a wider field (out of scope). **This is the one residual
the plan cannot fully close.**
- **Site 3** `wasRelayer`/`checkRelayers` (`PacketHistory.cpp:490-538`): intentionally
byte-domain (both sides are on-wire bytes); the consumer (learning) is now hardened.
Add a one-line comment; do not change.
### M3 — Route freshness / failure memory (RAM table on NextHopRouter)
A bounded, LRU-evicted table keyed by destination, mirroring `PacketHistory`'s
reuse-oldest discipline (not an unbounded map) to cap RAM.
`src/mesh/NextHopRouter.h` (near `pending`, line 99):
```cpp
struct RouteHealth {
NodeNum dest = 0; // 0 == empty slot
uint32_t learnedAtMsec = 0; // millis() at last (re)learn; rollover-aware
uint8_t consecutiveFailures = 0;
uint8_t lastNextHop = NO_NEXT_HOP_PREFERENCE; // byte this health refers to
};
static constexpr uint8_t ROUTE_HEALTH_MAX = 32; // ~384B; drop to 16 if RAM-tight
RouteHealth routeHealth[ROUTE_HEALTH_MAX] = {};
// Helpers take `now` (pure/testable): findRouteHealth, getOrAllocRouteHealth,
// noteRouteLearned, noteRouteSuccess, noteRouteFailure, isRouteStale, clearRouteHealth
```
Policy:
| Constant | Value | Rationale |
| ------------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `ROUTE_TTL_MSEC` | 30 min | Survives a normal conversation; re-discovers a moved node within a telemetry interval. |
| `ROUTE_FAILURE_THRESHOLD` | 3 | 12 consecutive failures are transient LoRa collisions; 3 to the same hop = dead. Accumulates **across** DMs (independent of the per-DM 3-try budget). |
`isRouteStale(h, now)` = `(now - h.learnedAtMsec) >= ROUTE_TTL_MSEC || h.consecutiveFailures >= ROUTE_FAILURE_THRESHOLD`.
All age math uses **unsigned subtraction** (rollover-safe, matching
`PacketHistory.cpp:364`); treat `learnedAtMsec == 0` as "set now".
Wiring (as built — `src/mesh/NextHopRouter.cpp`, `src/mesh/ReliableRouter.cpp`):
- `getNextHop`: if a health record matches the stored byte and `isRouteStale`, clear
`node->next_hop` (NodeDB) **and** `clearRouteHealth`, return `nullopt` (flood). No
record yet (cold path, first DM after boot) → trust NodeDB, but the M2 strict-neighbor
gate still applies.
- `sniffReceived` learn: gate the write through `resolveUniqueLastByte` (M2), then
`noteRouteLearned(p->from, p->relay_node, millis())` — resets `consecutiveFailures`
**only if the hop changed** (anti-flap for asymmetric re-learn); otherwise just refreshes
`learnedAtMsec`. (No success signal is taken on the intermediate reverse-pass: an ACK
merely passing through us is not proof that _we_ delivered, and resetting failures there
would reintroduce the asymmetric flap.)
- `doRetransmissions`: on the last-retransmission branch (`numRetransmissions == 1`, the
point a directed delivery has gone un-ACKed for both originator and intermediate) →
`noteRouteFailure(to)`, then the existing NodeDB `next_hop` reset + flood. We deliberately
do **not** `clearRouteHealth` here: keeping the record is what lets the failure count
accumulate across DMs so a flapping reverse-path-relearned dead hop eventually ages out.
- `ReliableRouter::sniffReceived` ACK path → `noteRouteSuccess(getFrom(p), millis())`
(an end-to-end ACK addressed to us is genuine forward-delivery proof; clears failures and
refreshes freshness). `noteRouteSuccess`/`noteRouteFailure` are no-ops when no record
exists, so flood-only destinations never pollute the table.
**Reconciliation (no double-handling):** `doRetransmissions` owns _in-flight_ failure of
the current DM (reset NodeDB `next_hop` + flood, and bump the cross-DM failure counter);
`getNextHop` owns _between-DM_ staleness (TTL or failure-threshold → flood + clear). The
only place that erases a health record is the `getNextHop` decay path; the retransmission
path leaves it intact so the counter survives a reverse-path re-learn.
### M4 — Earlier flood for unverified routes (gated, off by default)
Compile-gated so healthy sparse meshes are untouched. **Default is off** — the define
lives in `NextHopRouter.h` and must be flipped to measure:
`#define NEXTHOP_EARLY_FLOOD_ON_UNVERIFIED 1`.
In `doRetransmissions`, the directed-retry `else` branch: if the route is **not verified**
(`!findRouteHealth(to) || consecutiveFailures > 0 || isRouteStale`), reset `next_hop` and
flood on this attempt instead of spending another directed try. A **verified** route
(record present, `consecutiveFailures == 0`, within TTL — i.e. recently ACKed) takes the
unchanged directed-retry path, so the sparse-mesh happy path is untouched. Trade-off:
airtime ↔ latency; the gate ensures we never pay the flood cost on a proven route, only on
one we already distrust. Off by default precisely so it can be A/B-measured on the
simulator before broad enable.
---
## Files to modify
| File | Change |
| ------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| `src/mesh/MeshTypes.h` | `NEXTHOP_NEIGHBOR_FRESH_SECS`, `ROUTE_TTL_MSEC`, `ROUTE_FAILURE_THRESHOLD`, `NEXTHOP_EARLY_FLOOD_ON_UNVERIFIED` |
| `src/mesh/NodeDB.h` / `src/mesh/NodeDB.cpp` | `LastByteResolution`, `ResolvedNode`, `resolveLastByte`, `resolveUniqueLastByte` |
| `src/mesh/NextHopRouter.h` | `RouteHealth` + array + helpers; `#ifdef PIO_UNIT_TESTING public:` for helpers and `getNextHop` |
| `src/mesh/NextHopRouter.cpp` | `getNextHop` (M2 gate + M3 decay); `sniffReceived` (learn gate + health seed + success); `doRetransmissions` (failure counting + M4); comment site 1 |
| `src/mesh/Router.cpp` | `shouldDecrementHopLimit` → resolver + favorite/router re-check |
| `src/mesh/ReliableRouter.cpp` | ACK path → `noteRouteSuccess` |
| `test/test_nexthop_routing/test_main.cpp` | **new** unit suite (auto-built under `[env:native]`) |
**Reuse, don't reinvent:** `getLastByteOfNodeNum`, `sinceLastSeen`, the bitfield helpers,
`getMeshNodeByIndex`/`getNumMeshNodes`, PacketHistory's reuse-oldest eviction shape, and
`MockNodeDB::addTestNode` (from `test/test_hop_scaling/test_main.cpp`).
---
## Edge cases
- **`0x00``0xFF` projection:** the resolver compares via `getLastByteOfNodeNum` on both
sides, so a `…00` node and a `…FF` node correctly collide on `0xFF``Ambiguous`. Test
explicitly.
- **MQTT packets:** `relay_node`/`next_hop` are forced invalid when `hop_start == 0`
(`src/mesh/RadioLibInterface.cpp:603-605`) → byte 0 → resolver `None` → don't learn
(correct).
- **`has_hops_away == false`** nodes are excluded from the strict gate (never fabricate a
Unique neighbor for M2); admitted to the lenient gate only via favorite/router role.
Safe; self-corrects once `hops_away` is learned.
- **Self / broadcast:** the resolver skips `getNodeNum()` and `NODENUM_BROADCAST`;
`getNextHop` already early-returns for broadcast.
- **Perf:** M2 adds one O(N) resolver scan per directed send/relay (early-exit on the 2nd
match), cheaper than the crypto already on that path; site-4 is a wash. If ever hot, a
future 256-entry last-byte index is the optimization (not now — RAM).
---
## Verification (all tiers)
### 1. Native unit tests — new `test/test_nexthop_routing/test_main.cpp`
`pio test -e native -f test_nexthop_routing`; on macOS `./bin/test-native-docker.sh -f test_nexthop_routing`.
Design the RouteHealth helpers to take `now` as a parameter so the 30-min TTL logic is
testable without a clock mock.
- **Resolver:** None / Unique / **Ambiguous (birthday collision)** / strict-excludes-stale /
strict-excludes-far / lenient-includes-favorite-router / lenient-collision / skips-self /
skips-ignored / **`0x00``0xFF` collision** / early-exit.
- **`getNextHop`:** unique→byte, **ambiguous→nullopt**, stale-neighbor→nullopt,
split-horizon (relay==next_hop)→nullopt, broadcast→nullopt.
- **RouteHealth:** TTL boundary, **rollover** (learn near `0xFFFFFFFF`, check after wrap),
failure threshold, success-resets, **re-learn-same-hop keeps fails (anti-flap)**,
re-learn-new-hop resets, LRU eviction bound, clear.
- **Site-4:** preserve on unique favorite router; **decrement on two colliding favorites**;
decrement when the resolved node is not a favorite.
- **Sparse-mesh regression:** all-distinct last bytes → every resolve Unique, `getNextHop`
returns the stored byte unchanged (proves no happy-path change).
- Re-run `test_packet_history` and `test_hop_scaling` for no regression.
### 2. portduino SimRadio simulator
`pio run -e native && ./bin/test-simulator.sh`. Best vehicle for the **intermediate-node**
path the 2-device bench can't reach. Line topology A — B — C: establish A→C (B learns a
directed route), stop B relaying that dest, confirm A re-discovers via flood within
`ROUTE_FAILURE_THRESHOLD` and that B's `noteRouteFailure`/`clearRouteHealth` fires (visible
via the `LOG_INFO "Route to … stale"` / "Resetting next hop" lines). Use this to A/B M4
(attempts-to-delivery, total airtime).
### 3. Hardware via meshtastic MCP (auto-detect; 3+ devices for a real hop)
- `mcp-server/tests/mesh/test_nexthop_multihop_recovery.py` — **the multi-hop validator
for this work** (added on this branch). Self-discovers an A — relay — C line, asserts a
directed DM is delivered across the relay (next_hop + M1/M2/M3 engaged), and asserts
delivery recovers after the relay is power-cycled (M3). Skips unless the bench is a true
multi-hop line (≥3 roles via `--hub-profile`, endpoints out of direct RF range).
- `mcp-server/tests/mesh/test_direct_with_ack.py` — happy-path regression: a fresh/unique
route still delivers a want_ack DM on the first/second try (M4's gate must keep this
green).
- `mcp-server/tests/mesh/test_peer_offline_recovery.py` — 2-device recovery validator: peer
off mid-conversation then back. Must stay green and ideally recover in fewer attempts.
### 4. Build / format sanity
native-macos **and** Docker both ways; trunk clang-format@16.0.3; a release `pio run` to
confirm the `#ifdef PIO_UNIT_TESTING` visibility widening does **not** leak into
production; sanity-check RAM headroom on the smallest nRF52 build for the ~384 B table.
---
## Verification status (as built on `nexthop-redux`)
| Tier | What ran | Result |
| -------------------------------- | ----------------------------------------------------------------------------------- | ------------------- |
| Unit (native-macos) | `test_nexthop_routing` (31 cases) | ✅ 31/31 |
| Unit (Docker / Linux, CI parity) | `test_nexthop_routing` | ✅ 31/31 |
| Regression | `test_packet_history`, `test_hop_scaling`, `test_mqtt`, `test_traffic_management` | ✅ 105/105 |
| Build | `pio run -e native-macos` (M4 off) and with `-DNEXTHOP_EARLY_FLOOD_ON_UNVERIFIED=1` | ✅ both link |
| Format | trunk `clang-format@16.0.3` | ✅ no issues |
| Simulator (CI `simulator-tests`) | `meshtasticd -s` + `meshtastic.test.testSimulator()` on native-macos | ✅ exit 0, no crash |
**Pending (environment-blocked, not yet run):**
- **Multi-hop ABC recovery sim** — the `simulator/` broker hub is **not git-tracked**
(only stale local `.pyc`), and two `meshtasticd -s` instances can't hear each other
without it. The intermediate-node failure-count path and the M4 A/B therefore have unit
coverage of their logic but no end-to-end multi-node run yet.
- **Hardware / multi-hop tier** — a committable bench test now exists:
`mcp-server/tests/mesh/test_nexthop_multihop_recovery.py`. It self-discovers a real
multi-hop pair (A — relay — C), asserts a directed DM is delivered across the relay, and
asserts delivery recovers after the relay is power-cycled (the M3 path). It
`pytest.skip`s cleanly unless the bench is a true line with endpoints out of direct RF
range (≥3 roles via `--hub-profile`), so it's safe to commit and only asserts when the
NextHop path is genuinely exercised. Collected + verified to skip without hardware;
not yet run on a bench. `test_direct_with_ack.py` / `test_peer_offline_recovery.py`
remain the 2-device happy-path/recovery regressions.
---
## Risks & limitations
- **Site-1 impostor rebroadcast** is unfixable without a wider field — documented; M1/M2
only shrink its frequency.
- **Dense meshes flood DMs more often** — intended (a flooded DM arrives; a mis-unicast one
black-holes). Call out in the PR so reviewers expect a slightly higher DM flood rate on
very dense meshes.
- **M4 airtime** if the gate is too loose → default conservative + compile-gated +
simulator A/B before broad enable.
- **RAM** ~384 B (32 slots); 16 slots (~192 B) with graceful LRU degradation if tight.
- **Asymmetric flap** not fully closed (a _new_ bad hop resets the counter); the TTL
backstop bounds it. Per-hop failure history is future work (more RAM).
---
## How to continue this work (commit sequencing)
Each step is independently testable; land them as separate commits.
1. **M1 resolver + unit tests**`NodeDB` only; no behavior change until wired. Lands the
`resolveLastByte`/`resolveUniqueLastByte` primitive and its full unit-test matrix.
2. **M2 + wiring + tests**`getNextHop` strict gate, learning gate, favorite-router
preservation rewrite. Adds the `getNextHop` and site-4 tests.
3. **M3 health table + decay + tests** — RAM `RouteHealth` table, decay-on-read, failure/
success accounting, reconciliation with the existing last-retry reset. Adds the
route-health unit tests and the simulator recovery check.
4. **M4 gated tuning** — early-flood-on-unverified behind the compile flag; simulator A/B
and hardware regression.
Reference plan (with the same content) was developed at
`~/.claude/plans/nexthop-routing-for-direct-lexical-shell.md` on the author's machine; this
in-repo doc is the canonical handoff copy.

View File

@@ -0,0 +1,347 @@
"""Multi-hop NextHop directed-message delivery + relay-recovery (bench test).
This is the hardware/tier-3 validator for the NextHop DM reliability work
(see `docs/nexthop-routing-reliability.md`). The unit suite
`test/test_nexthop_routing` covers the routing *logic* exhaustively; this test
covers the *end-to-end* multi-hop behavior that only a real (or RF-separated)
mesh exercises:
* a directed DM that must traverse a relay is delivered (next_hop routing +
the M1/M2 ambiguity gate + M3 route learning all engage), and
* when the established relay drops and returns, delivery recovers rather than
black-holing (the M3 stale-route decay / re-learn path).
TOPOLOGY REQUIREMENT — why this usually SKIPS:
A NextHop relay only happens when the two endpoints are NOT direct neighbors.
Three co-located radios all hear each other, so A→C is a single direct hop and
next_hop never engages. To run this test the bench must be a *line* — A — B — C
— with the endpoints out of each other's direct RF range (physical distance or
attenuators). The `multihop_topology` fixture detects this automatically: it
warms the mesh, looks for a pair that is ≥1 hop apart, confirms the relay via
traceroute, and `pytest.skip`s cleanly when the bench is all-direct. So this
file is safe to commit and run anywhere — it only *asserts* when the topology
genuinely requires a relay.
REQUIREMENTS:
* ≥3 baked devices. The default hub profile is 2 roles (nrf52, esp32s3); add a
third via `--hub-profile=path/to/hub.yaml` (see conftest `hub_profile`).
* The relay-recovery test additionally needs uhubctl + a power-controllable
relay port (same gate the other power tests use).
"""
from __future__ import annotations
import time
from typing import Any
import pytest
from meshtastic_mcp.connection import connect
from tests import _power
from tests._port_discovery import resolve_port_by_role
from ._receive import ReceiveCollector, nudge_nodeinfo, nudge_nodeinfo_port
def _hops_away(rec: dict[str, Any]) -> int | None:
"""Read a node's hop distance from a `nodesByNum` entry, tolerating either
the camelCase (`hopsAway`) or snake_case (`hops_away`) spelling depending on
the meshtastic-python version."""
for key in ("hopsAway", "hops_away"):
val = rec.get(key)
if isinstance(val, int):
return val
return None
def _warm_mesh(ports: list[str], rounds: int = 2, settle: float = 6.0) -> None:
"""Flood a fresh NodeInfo from every node so the whole mesh (including
multi-hop pairs, reached via relayed broadcasts) populates pubkeys and hop
distances. Best-effort — a single node failing to nudge shouldn't abort."""
for _ in range(rounds):
for port in ports:
try:
nudge_nodeinfo_port(port)
except Exception: # noqa: BLE001 — warmup is best-effort
pass
time.sleep(0.5)
time.sleep(settle)
def _wait_for_pubkey(
tx_iface: Any, rx_num: int, rx_port: str, deadline_s: float = 90.0
) -> bool:
"""Block until `tx_iface` holds `rx_num`'s public key (directed PKI sends
NAK without it). Re-nudges both sides periodically; multi-hop warmup is
slower than the 2-device case because NodeInfo must be relayed, hence the
longer default deadline."""
deadline = time.monotonic() + deadline_s
last_nudge = time.monotonic()
while time.monotonic() < deadline:
rec = (tx_iface.nodesByNum or {}).get(rx_num, {})
if rec.get("user", {}).get("publicKey"):
return True
if time.monotonic() - last_nudge > 20.0:
nudge_nodeinfo_port(rx_port)
nudge_nodeinfo(tx_iface)
last_nudge = time.monotonic()
time.sleep(1.0)
return False
def _traceroute_route(tx_port: str, rx_num: int, rx_port: str) -> list[int] | None:
"""Run a traceroute TX→RX and return the forward `route` (list of relay node
numbers), or None if it couldn't be obtained. Mirrors test_traceroute's
request/PKI/retry pattern."""
from meshtastic.mesh_interface import MeshInterface
with ReceiveCollector(tx_port, topic="meshtastic.receive.traceroute") as tx:
nudge_nodeinfo_port(rx_port)
tx.broadcast_nodeinfo_ping()
if not _wait_for_pubkey(tx._iface, rx_num, rx_port, 60.0):
return None
for _attempt in range(2):
try:
tx._iface.sendTraceRoute(dest=rx_num, hopLimit=5)
break
except MeshInterface.MeshInterfaceError:
time.sleep(5.0)
else:
return None
pkt = tx.wait_for(lambda p: p.get("from") == rx_num, timeout=8.0)
if pkt is None:
return None
tr = (pkt.get("decoded", {}) or {}).get("traceroute") or {}
return [int(n) for n in (tr.get("route") or [])]
@pytest.fixture(scope="session")
def multihop_topology(baked_mesh: dict[str, Any]) -> dict[str, Any]:
"""Discover a real multi-hop pier (tx → relay → rx) on the bench, or skip.
Returns {tx_role, tx_port, rx_role, rx_port, rx_num, relay_role, relay_num}.
"""
roles = sorted(baked_mesh)
if len(roles) < 3:
pytest.skip(
"multi-hop NextHop test needs ≥3 baked devices arranged as a line "
"(endpoints out of direct RF range). Add a third role via "
f"--hub-profile. Detected roles: {roles}"
)
by_role = {r: (baked_mesh[r]["port"], baked_mesh[r]["my_node_num"]) for r in roles}
if any(num is None for _, num in by_role.values()):
pytest.skip("a baked device is missing my_node_num; can't map the topology")
_warm_mesh([port for port, _ in by_role.values()])
# Find an ordered pair that is ≥1 hop apart, using each node's own nodeDB
# (cheap — no traceroute yet). On an all-direct bench nothing qualifies.
multihop_pair: tuple[str, str] | None = None
for a_role in roles:
a_port, _ = by_role[a_role]
try:
with connect(port=a_port) as a_iface:
nodes = a_iface.nodesByNum or {}
except Exception: # noqa: BLE001
continue
for c_role in roles:
if c_role == a_role:
continue
_, c_num = by_role[c_role]
hops = _hops_away(nodes.get(c_num, {}))
if hops is not None and hops >= 1:
multihop_pair = (a_role, c_role)
break
if multihop_pair:
break
if not multihop_pair:
pytest.skip(
"no multi-hop pair found — every device appears to be a direct "
"neighbor. Arrange the bench as a line (A — B — C) with the "
"endpoints out of direct RF range (distance or attenuators) so a "
"relay is actually required, then re-run."
)
a_role, c_role = multihop_pair
a_port, _ = by_role[a_role]
c_port, c_num = by_role[c_role]
route = _traceroute_route(a_port, c_num, c_port)
if not route:
pytest.skip(
f"{a_role}{c_role} looked multi-hop but traceroute returned no "
"intermediate relay; can't identify the relay node to drive the "
"recovery test"
)
relay_num = route[0]
relay_role = next((r for r in roles if by_role[r][1] == relay_num), None)
return {
"tx_role": a_role,
"tx_port": a_port,
"rx_role": c_role,
"rx_port": c_port,
"rx_num": c_num,
"relay_role": relay_role,
"relay_num": relay_num,
}
@pytest.mark.timeout(300)
def test_multihop_dm_delivers(multihop_topology: dict[str, Any]) -> None:
"""A directed wantAck DM that must traverse the relay is delivered.
Exercises the NextHop routing path end-to-end: TX picks a next hop toward
RX (M2 gate), the relay resolves the next_hop byte and forwards (M1), and
the route is learned from the returning ACK (M3). Retries absorb transient
LoRa loss; the assertion is on eventual delivery.
"""
tx_port = multihop_topology["tx_port"]
rx_port = multihop_topology["rx_port"]
rx_num = multihop_topology["rx_num"]
tx_role = multihop_topology["tx_role"]
rx_role = multihop_topology["rx_role"]
relay_role = multihop_topology["relay_role"]
unique = f"nexthop-mh-{tx_role}-to-{rx_role}-{int(time.time())}"
with ReceiveCollector(rx_port, topic="meshtastic.receive.text") as rx:
rx.broadcast_nodeinfo_ping()
with connect(port=tx_port) as tx_iface:
nudge_nodeinfo(tx_iface)
if not _wait_for_pubkey(tx_iface, rx_num, rx_port, 90.0):
pytest.skip(
f"{tx_role} never learned {rx_role}'s pubkey over the relay; "
"multi-hop PKI warmup didn't complete"
)
got = None
for _attempt in range(3):
pkt = tx_iface.sendText(unique, destinationId=rx_num, wantAck=True)
assert pkt is not None
got = rx.wait_for(
lambda p: p.get("decoded", {}).get("text") == unique,
timeout=45,
)
if got is not None:
break
rx.broadcast_nodeinfo_ping()
nudge_nodeinfo(tx_iface)
time.sleep(5.0)
assert got is not None, (
f"multi-hop directed DM {tx_role}{rx_role} via relay "
f"{relay_role!r} never landed — NextHop multi-hop delivery is broken"
)
@pytest.mark.timeout(600)
def test_multihop_relay_recovery(
multihop_topology: dict[str, Any],
power_cycle, # noqa: ARG001 — forces the uhubctl-availability skip
) -> None:
"""Delivery recovers after the established relay drops and returns.
Establishes a baseline DM (route via relay learned), powers the relay OFF
(confirming TX survives sending across a downed relay), then powers it back
ON and asserts directed delivery resumes — the M3 stale-route decay /
re-learn path. With a strict A — B — C line there is no path while B is down,
so we only assert TX doesn't crash during the outage; the delivery assertion
is after B returns.
"""
relay_role = multihop_topology["relay_role"]
if not relay_role:
pytest.skip(
"relay node isn't one of the baked hub roles, so it can't be "
"power-cycled; recovery test needs a controllable relay"
)
tx_port = multihop_topology["tx_port"]
rx_port = multihop_topology["rx_port"]
rx_num = multihop_topology["rx_num"]
tx_role = multihop_topology["tx_role"]
rx_role = multihop_topology["rx_role"]
base = f"mh-recover-base-{int(time.time())}"
post = f"mh-recover-post-{int(time.time())}"
# Baseline: confirm delivery works (so the route via the relay is learned)
# before we perturb anything — otherwise a later failure is ambiguous.
with ReceiveCollector(rx_port, topic="meshtastic.receive.text") as rx:
rx.broadcast_nodeinfo_ping()
with connect(port=tx_port) as tx_iface:
nudge_nodeinfo(tx_iface)
if not _wait_for_pubkey(tx_iface, rx_num, rx_port, 90.0):
pytest.skip("multi-hop PKI warmup failed; can't run recovery test")
tx_iface.sendText(base, destinationId=rx_num, wantAck=True)
assert (
rx.wait_for(
lambda p: p.get("decoded", {}).get("text") == base, timeout=45
)
is not None
), "baseline multi-hop delivery failed — skipping recovery to avoid a false result"
# Power the relay OFF.
try:
_power.power_off(relay_role)
_power.wait_for_absence(relay_role, timeout_s=15.0)
except Exception as exc: # noqa: BLE001
try:
_power.power_on(relay_role)
resolve_port_by_role(relay_role, timeout_s=30.0)
except Exception: # noqa: BLE001
pass
pytest.skip(f"can't power-control relay {relay_role!r}: {exc}")
# With the only relay down there's no path; we just confirm TX accepts the
# send and survives its internal retries (it must not crash / wedge).
try:
with connect(port=tx_port) as tx_iface:
pkt = tx_iface.sendText(
f"mh-while-down-{int(time.time())}",
destinationId=rx_num,
wantAck=True,
)
assert pkt is not None
time.sleep(8.0) # let retransmissions + route decay run
except Exception as exc: # noqa: BLE001 — restore bench state before failing
_power.power_on(relay_role)
resolve_port_by_role(relay_role, timeout_s=30.0)
raise AssertionError(
f"TX crashed sending across a downed relay: {exc}"
) from exc
# Power the relay back ON and let it re-enumerate + boot.
_power.power_on(relay_role)
time.sleep(0.5)
try:
resolve_port_by_role(relay_role, timeout_s=30.0)
except Exception: # noqa: BLE001 — relay port isn't one we connect to directly
pass
time.sleep(8.0)
_warm_mesh([tx_port, rx_port], rounds=1) # re-flood so the relay re-learns
# Delivery should resume once the relay is back (M3 re-learn / decay path).
got = None
with ReceiveCollector(rx_port, topic="meshtastic.receive.text") as rx:
rx.broadcast_nodeinfo_ping()
with connect(port=tx_port) as tx_iface:
nudge_nodeinfo(tx_iface)
_wait_for_pubkey(tx_iface, rx_num, rx_port, 90.0)
for _attempt in range(4):
pkt = tx_iface.sendText(post, destinationId=rx_num, wantAck=True)
assert pkt is not None
got = rx.wait_for(
lambda p: p.get("decoded", {}).get("text") == post,
timeout=45,
)
if got is not None:
break
rx.broadcast_nodeinfo_ping()
nudge_nodeinfo(tx_iface)
time.sleep(6.0)
assert got is not None, (
f"after relay {relay_role!r} returned, multi-hop DM {tx_role}{rx_role} "
"never resumed — stale-route recovery (M3) may be broken"
)

View File

@@ -45,6 +45,11 @@ enum RxSource {
// For old firmware there is no relay node set
#define NO_RELAY_NODE 0
// How recently we must have heard a direct neighbor for its single-byte relay id to be trusted as a
// unique next hop. Mirrors NUM_ONLINE_SECS (NodeDB.cpp). Used by NodeDB::resolveLastByte() to scope
// last-byte collision resolution to currently-reachable neighbors.
#define NEXTHOP_NEIGHBOR_FRESH_SECS (60 * 60 * 2) // 2 hrs
typedef int ErrorCode;
/// Alloc and free packets to our global, ISR safe pool

View File

@@ -109,9 +109,18 @@ void NextHopRouter::sniffReceived(const meshtastic_MeshPacket *p, const meshtast
&weWereSoleRelayer);
if ((weWereRelayer && wasAlreadyRelayer) || (getHopsAway(*p) == 0 && weWereSoleRelayer)) {
if (origTx->next_hop != p->relay_node) { // Not already set
LOG_INFO("Update next hop of 0x%x to 0x%x based on ACK/reply (was relayer %d we were sole %d)", p->from,
p->relay_node, wasAlreadyRelayer, weWereSoleRelayer);
origTx->next_hop = p->relay_node;
// M1/M2: only learn a next hop whose last byte maps to a single plausible relay. On a
// dense mesh the byte may be ambiguous; storing it would aim future DMs at the wrong
// node. If ambiguous/unknown, leave the route unset so we keep flooding (safe).
if (nodeDB->resolveUniqueLastByte(p->relay_node, /*requireDirectNeighbor=*/false)) {
LOG_INFO("Update next hop of 0x%x to 0x%x based on ACK/reply (was relayer %d we were sole %d)",
p->from, p->relay_node, wasAlreadyRelayer, weWereSoleRelayer);
origTx->next_hop = p->relay_node;
noteRouteLearned(p->from, p->relay_node, millis()); // M3: anchor freshness
} else {
LOG_DEBUG("Not learning next hop for 0x%x: relay byte 0x%x ambiguous/unknown; keep flooding", p->from,
p->relay_node);
}
}
}
}
@@ -144,6 +153,11 @@ bool NextHopRouter::perhapsRebroadcast(const meshtastic_MeshPacket *p)
if (!isToUs(p) && !isFromUs(p) && (p->hop_limit > 0 || exhaustHops)) {
if (p->id != 0) {
if (isRebroadcaster()) {
// NOTE: this is a self-identity match (is the addressed next_hop OUR last byte?), so it
// cannot be hardened with resolveLastByte() — a remote node that legitimately shares our
// last byte will also match here and rebroadcast. That residual collision needs a wider
// on-wire field to fix. M1/M2 instead shrink the blast radius by reducing how often an
// ambiguous next_hop byte is ever learned (sniffReceived) or originated (getNextHop).
if (p->next_hop == NO_NEXT_HOP_PREFERENCE || p->next_hop == nodeDB->getLastByteOfNodeNum(getNodeNum())) {
meshtastic_MeshPacket *tosend = packetPool.allocCopy(*p); // keep a copy because we will be sending it
LOG_INFO("Rebroadcast received message coming from %x", p->relay_node);
@@ -196,10 +210,29 @@ std::optional<uint8_t> NextHopRouter::getNextHop(NodeNum to, uint8_t relay_node)
meshtastic_NodeInfoLite *node = nodeDB->getMeshNode(to);
if (node && node->next_hop) {
// M3: proactively decay a stale or repeatedly-failing route back to flooding, so a dead hop
// isn't trusted on the next DM's first (and on dense meshes, slowest) attempt. We only act on
// a health record that still matches the stored byte; a next_hop set by another path (e.g.
// TraceRouteModule) with no matching record is left authoritative.
RouteHealth *h = findRouteHealth(to);
if (h && h->lastNextHop == node->next_hop && isRouteStale(*h, millis())) {
LOG_INFO("Next hop 0x%x for 0x%x is stale (age/fails); flood and clear", node->next_hop, to);
node->next_hop = NO_NEXT_HOP_PREFERENCE; // clear persisted route
clearRouteHealth(to); // clear RAM health
return std::nullopt;
}
// We are careful not to return the relay node as the next hop
if (node->next_hop != relay_node) {
// LOG_DEBUG("Next hop for 0x%x is 0x%x", to, node->next_hop);
return node->next_hop;
// M1/M2: only emit a stored next_hop if its last byte still maps to a UNIQUE, currently
// reachable direct neighbor. On a dense mesh the last byte collides, so an ambiguous byte
// would unicast a hint toward the wrong physical node; if the neighbor has gone away we'd
// unicast into a void. In both cases flood instead (managed flooding still delivers).
ResolvedNode r = nodeDB->resolveLastByte(node->next_hop, /*requireDirectNeighbor=*/true);
if (r.status == LastByteResolution::Unique)
return node->next_hop;
LOG_WARN("Next hop 0x%x for 0x%x %s; set no pref", node->next_hop, to,
r.status == LastByteResolution::Ambiguous ? "ambiguous among neighbors" : "not a known neighbor");
} else
LOG_WARN("Next hop for 0x%x is 0x%x, same as relayer; set no pref", to, node->next_hop);
}
@@ -311,7 +344,10 @@ int32_t NextHopRouter::doRetransmissions()
if (!isBroadcast(p.packet->to)) {
if (p.numRetransmissions == 1) {
// Last retransmission, reset next_hop (fallback to FloodingRouter)
// Last retransmission: this directed delivery went un-ACKed. Record the failure
// (M3 — accumulates across DMs to age out a flapping/dead route) and reset
// next_hop so the final try falls back to FloodingRouter.
noteRouteFailure(p.packet->to);
p.packet->next_hop = NO_NEXT_HOP_PREFERENCE;
// Also reset it in the nodeDB
meshtastic_NodeInfoLite *sentTo = nodeDB->getMeshNode(p.packet->to);
@@ -321,7 +357,25 @@ int32_t NextHopRouter::doRetransmissions()
}
FloodingRouter::send(packetPool.allocCopy(*p.packet));
} else {
#if NEXTHOP_EARLY_FLOOD_ON_UNVERIFIED
// M4 (gated): if the route isn't proven healthy, don't spend a second directed
// attempt — start flooding one retry sooner to cut recovery latency. A verified
// route (fresh, zero recent failures) keeps the unchanged directed-retry path so
// the sparse-mesh happy path is untouched.
RouteHealth *h = findRouteHealth(p.packet->to);
bool verified = h && h->consecutiveFailures == 0 && !isRouteStale(*h, now);
if (!verified) {
p.packet->next_hop = NO_NEXT_HOP_PREFERENCE;
meshtastic_NodeInfoLite *sentTo = nodeDB->getMeshNode(p.packet->to);
if (sentTo)
sentTo->next_hop = NO_NEXT_HOP_PREFERENCE;
FloodingRouter::send(packetPool.allocCopy(*p.packet));
} else {
NextHopRouter::send(packetPool.allocCopy(*p.packet));
}
#else
NextHopRouter::send(packetPool.allocCopy(*p.packet));
#endif
}
} else {
// Note: we call the superclass version because we don't want to have our version of send() add a new
@@ -355,3 +409,96 @@ void NextHopRouter::setNextTx(PendingPacket *pending)
printPacket("", pending->packet);
setReceivedMessage(); // Run ASAP, so we can figure out our correct sleep time
}
// ---------------------------------------------------------------------------
// M3: RAM route-health table. Bounded array with reuse-oldest eviction (same discipline as
// PacketHistory). All age comparisons use unsigned subtraction so they survive the 49.7-day millis()
// rollover. dest == 0 marks an empty slot; learnedAtMsec is normalized to 1 on write so an occupied
// slot is never read as infinitely old.
// ---------------------------------------------------------------------------
RouteHealth *NextHopRouter::findRouteHealth(NodeNum dest)
{
if (dest == 0)
return nullptr;
for (auto &h : routeHealth)
if (h.dest == dest)
return &h;
return nullptr;
}
RouteHealth *NextHopRouter::getOrAllocRouteHealth(NodeNum dest, uint32_t now)
{
if (dest == 0)
return nullptr;
RouteHealth *oldest = &routeHealth[0];
RouteHealth *freeSlot = nullptr;
for (auto &h : routeHealth) {
if (h.dest == dest)
return &h; // existing record
if (h.dest == 0) {
if (!freeSlot)
freeSlot = &h; // remember the first free slot; prefer it over evicting
continue;
}
// Track the oldest occupied slot in case the table is full (rollover-safe).
if ((uint32_t)(now - h.learnedAtMsec) > (uint32_t)(now - oldest->learnedAtMsec))
oldest = &h;
}
// Claim the free slot if there is one, else reuse the oldest. Reset before use and stamp the dest
// so the record is findable.
RouteHealth *slot = freeSlot ? freeSlot : oldest;
*slot = RouteHealth{};
slot->dest = dest;
return slot;
}
void NextHopRouter::noteRouteLearned(NodeNum dest, uint8_t nextHop, uint32_t now)
{
if (dest == 0 || nextHop == NO_NEXT_HOP_PREFERENCE)
return;
RouteHealth *h = getOrAllocRouteHealth(dest, now);
if (!h)
return;
// A genuinely new next hop earns a clean slate; re-learning the SAME hop keeps the accumulated
// failure count so an asymmetric reverse path that keeps re-teaching a dead forward hop still ages
// out instead of resetting the counter every time.
if (h->lastNextHop != nextHop) {
h->lastNextHop = nextHop;
h->consecutiveFailures = 0;
}
h->learnedAtMsec = now ? now : 1;
}
void NextHopRouter::noteRouteSuccess(NodeNum dest, uint32_t now)
{
RouteHealth *h = findRouteHealth(dest);
if (!h)
return; // only routes we actually learned have health to refresh
h->consecutiveFailures = 0;
h->learnedAtMsec = now ? now : 1;
}
void NextHopRouter::noteRouteFailure(NodeNum dest)
{
RouteHealth *h = findRouteHealth(dest);
if (!h)
return; // nothing to penalize (we were flooding, or never learned a route here)
if (h->consecutiveFailures < 255)
h->consecutiveFailures++;
}
bool NextHopRouter::isRouteStale(const RouteHealth &h, uint32_t now) const
{
if (h.consecutiveFailures >= ROUTE_FAILURE_THRESHOLD)
return true;
return (uint32_t)(now - h.learnedAtMsec) >= ROUTE_TTL_MSEC;
}
void NextHopRouter::clearRouteHealth(NodeNum dest)
{
RouteHealth *h = findRouteHealth(dest);
if (h)
*h = RouteHealth{};
}

View File

@@ -43,6 +43,28 @@ struct PendingPacket {
explicit PendingPacket(meshtastic_MeshPacket *p, uint8_t numRetransmissions);
};
/**
* RAM-only per-destination route health. Tracks how fresh a learned next_hop is and how many
* consecutive directed deliveries to it have failed, so getNextHop() can proactively decay a stale or
* repeatedly-failing route back to flooding instead of trusting a dead hop on the next (and on dense
* meshes, slowest) attempt. Not persisted: the learned next_hop itself lives in NodeInfoLite; this is
* just freshness/failure metadata.
*/
struct RouteHealth {
NodeNum dest = 0; ///< destination this record describes; 0 == empty slot
uint32_t learnedAtMsec = 0; ///< millis() when next_hop was last (re)learned (rollover-aware)
uint8_t consecutiveFailures = 0; ///< directed deliveries to `dest` that went un-ACKed
uint8_t lastNextHop = NO_NEXT_HOP_PREFERENCE; ///< the relay byte this health refers to
};
// M4 (optional, off by default): when a route is not proven healthy, fall back to flooding one retry
// earlier instead of spending a second directed attempt. Trades airtime for recovery latency on dense
// meshes; leaves the sparse-mesh happy path (fresh, verified routes) unchanged. Measure on the
// simulator before enabling broadly.
#ifndef NEXTHOP_EARLY_FLOOD_ON_UNVERIFIED
#define NEXTHOP_EARLY_FLOOD_ON_UNVERIFIED 0
#endif
class GlobalPacketIdHashFunction
{
public:
@@ -92,12 +114,22 @@ class NextHopRouter : public FloodingRouter
// The number of retransmissions the original sender will do
constexpr static uint8_t NUM_RELIABLE_RETX = 3;
// M3: bounded RAM route-health table (reuse-oldest eviction, like PacketHistory)
constexpr static uint8_t ROUTE_HEALTH_MAX = 32; // ~12B/slot -> ~384B
constexpr static uint32_t ROUTE_TTL_MSEC = 30UL * 60 * 1000; // re-discover a route unconfirmed for 30 min
constexpr static uint8_t ROUTE_FAILURE_THRESHOLD = 3; // consecutive un-ACKed directed deliveries -> dead
protected:
/**
* Pending retransmissions
*/
std::unordered_map<GlobalPacketId, PendingPacket, GlobalPacketIdHashFunction> pending;
/**
* Per-destination route health (M3). Bounded array, reuse-oldest eviction. RAM-only.
*/
RouteHealth routeHealth[ROUTE_HEALTH_MAX] = {};
/**
* Should this incoming filter be dropped?
*
@@ -142,13 +174,38 @@ class NextHopRouter : public FloodingRouter
void setNextTx(PendingPacket *pending);
// --- M3 route-health helpers (RAM-only). Protected so ReliableRouter (a subclass) can record
// delivery success, and so the unit-test shim can reach them via `using`. All take `now` where
// time matters so the decay logic is pure and testable without a clock mock. ---
/// @return the health record for `dest`, or nullptr if we hold none.
RouteHealth *findRouteHealth(NodeNum dest);
/// @return an existing record for `dest`, else a freshly claimed slot (reuse-oldest on overflow).
RouteHealth *getOrAllocRouteHealth(NodeNum dest, uint32_t now);
/// Record that we (re)learned `nextHop` for `dest`. Resets the failure count only when the hop
/// changed (so a flapping reverse-path re-learn of the same dead hop still ages out).
void noteRouteLearned(NodeNum dest, uint8_t nextHop, uint32_t now);
/// Record an end-to-end delivery success to `dest` (clears failures, refreshes freshness).
void noteRouteSuccess(NodeNum dest, uint32_t now);
/// Record that a directed delivery to `dest` went un-ACKed (no-op if we hold no record).
void noteRouteFailure(NodeNum dest);
/// @return true if the route is too old (TTL) or has failed too many times in a row.
bool isRouteStale(const RouteHealth &h, uint32_t now) const;
/// Forget any health record for `dest`.
void clearRouteHealth(NodeNum dest);
#ifdef PIO_UNIT_TESTING
public: // expose getNextHop to the test shim without widening production visibility
#else
private:
#endif
/**
* Get the next hop for a destination, given the relay node
* @return the node number of the next hop, 0 if no preference (fallback to FloodingRouter)
*/
std::optional<uint8_t> getNextHop(NodeNum to, uint8_t relay_node);
private:
/** Check if we should be rebroadcasting this packet if so, do so.
* @return true if we did rebroadcast */
bool perhapsRebroadcast(const meshtastic_MeshPacket *p) override;

View File

@@ -2942,6 +2942,73 @@ meshtastic_NodeInfoLite *NodeDB::getMeshNode(NodeNum n)
return NULL;
}
ResolvedNode NodeDB::resolveLastByte(uint8_t lastByte, bool requireDirectNeighbor)
{
ResolvedNode result; // defaults to {None, 0}
// 0 is the NO_RELAY_NODE / NO_NEXT_HOP_PREFERENCE sentinel (also what MQTT-sourced packets carry
// when hop_start==0). getLastByteOfNodeNum() never yields 0, so nothing can legitimately match.
if (lastByte == 0)
return result;
const NodeNum self = getNodeNum();
NodeNum firstMatch = 0;
uint8_t matches = 0;
for (size_t i = 0; i < numMeshNodes; i++) {
const meshtastic_NodeInfoLite *node = &meshNodes->at(i);
// Candidate gate: never resolve to ourselves, the sentinels, or an ignored node.
if (node->num == self || node->num == 0 || node->num == NODENUM_BROADCAST)
continue;
if (nodeInfoLiteIsIgnored(node))
continue;
if (getLastByteOfNodeNum(node->num) != lastByte) // cheapest discriminator last
continue;
// Relevance gate: is this node a plausible relay for the requested scope?
bool relevant;
if (requireDirectNeighbor) {
relevant = node->has_hops_away && node->hops_away == 0 && sinceLastSeen(node) < NEXTHOP_NEIGHBOR_FRESH_SECS;
} else {
const bool directNeighbor = node->has_hops_away && node->hops_away == 0;
const bool routerRole =
IS_ONE_OF(node->role, meshtastic_Config_DeviceConfig_Role_ROUTER, meshtastic_Config_DeviceConfig_Role_ROUTER_LATE,
meshtastic_Config_DeviceConfig_Role_CLIENT_BASE);
relevant = directNeighbor || nodeInfoLiteIsFavorite(node) || routerRole;
}
if (!relevant)
continue;
if (++matches == 1) {
firstMatch = node->num;
} else {
// A second relevant candidate shares this byte: ambiguous. No further scanning can
// change that, so stop early and report the collision.
result.status = LastByteResolution::Ambiguous;
result.num = 0;
return result;
}
}
if (matches == 1) {
result.status = LastByteResolution::Unique;
result.num = firstMatch;
}
return result;
}
bool NodeDB::resolveUniqueLastByte(uint8_t lastByte, bool requireDirectNeighbor, NodeNum *outNum)
{
ResolvedNode r = resolveLastByte(lastByte, requireDirectNeighbor);
if (r.status == LastByteResolution::Unique) {
if (outNum)
*outNum = r.num;
return true;
}
return false;
}
// returns true if the maximum number of nodes is reached or we are running low on memory
bool NodeDB::isFull()
{

View File

@@ -114,6 +114,20 @@ uint32_t sinceLastSeen(const meshtastic_NodeInfoLite *n);
/// Given a packet, return how many seconds in the past (vs now) it was received
uint32_t sinceReceived(const meshtastic_MeshPacket *p);
/// Outcome of mapping a single on-wire last-byte (next_hop / relay_node) back to a full NodeNum.
/// Because the wire only carries the last byte of a 32-bit node number, the mapping is ambiguous on
/// dense meshes (the "birthday problem"). Callers must treat Ambiguous and None as "don't trust it".
enum class LastByteResolution : uint8_t {
None, ///< no relevant candidate node has this last byte
Unique, ///< exactly one relevant candidate -> `num` is valid
Ambiguous, ///< two or more relevant candidates collide on this byte
};
struct ResolvedNode {
LastByteResolution status = LastByteResolution::None;
NodeNum num = 0; ///< valid only when status == Unique
};
/// Given a packet, return the number of hops used to reach this node.
/// Returns defaultIfUnknown if the number of hops couldn't be determined.
int8_t getHopsAway(const meshtastic_MeshPacket &p, int8_t defaultIfUnknown = -1);
@@ -296,6 +310,23 @@ class NodeDB
virtual meshtastic_NodeInfoLite *getMeshNode(NodeNum n);
size_t getNumMeshNodes() { return numMeshNodes; }
/**
* Resolve a single on-wire last-byte (e.g. next_hop / relay_node) back to a unique full NodeNum,
* detecting last-byte collisions instead of silently picking the first match. A 1-byte id only
* needs to be unique among a node's plausible relays, not the whole mesh, so we scope the search:
* - requireDirectNeighbor == true : candidates are direct neighbors (hops_away==0) heard within
* NEXTHOP_NEIGHBOR_FRESH_SECS. Use on the SEND path.
* - requireDirectNeighbor == false : also accept favorites and router-role nodes (unknown hop
* distance allowed). Use when learning / preserving hops.
* Ignored nodes, our own node, and the broadcast/0 sentinels are never candidates. On a tie the
* result is Ambiguous (no tie-break) so callers fall back to flooding rather than misroute.
*/
ResolvedNode resolveLastByte(uint8_t lastByte, bool requireDirectNeighbor);
/// Convenience wrapper around resolveLastByte(): true iff exactly one relevant candidate matches.
/// Ambiguous and None both return false (the safe answer for learning / hop preservation).
bool resolveUniqueLastByte(uint8_t lastByte, bool requireDirectNeighbor, NodeNum *outNum = nullptr);
// Thread-safe satellite-map accessors. Return false if absent or the
// corresponding DB is compiled out.
bool copyNodePosition(NodeNum n, meshtastic_PositionLite &out) const;

View File

@@ -486,7 +486,11 @@ bool PacketHistory::wasRelayer(const uint8_t relayer, const uint32_t id, const N
}
/* Check if a certain node was a relayer of a packet in the history given iterator
* @return true if node was indeed a relayer, false if not */
* @return true if node was indeed a relayer, false if not
* NOTE: intentionally byte-domain. Both `relayer` and relayed_by[] are on-wire last bytes, so this
* answers "did a relayer with this byte touch the packet" — correct without resolving to a NodeNum.
* The collision risk is neutralized where the result is consumed (route learning in
* NextHopRouter::sniffReceived now gates the write through NodeDB::resolveUniqueLastByte). */
bool PacketHistory::wasRelayer(const uint8_t relayer, const PacketRecord &r, bool *wasSole)
{
bool found = false;

View File

@@ -151,6 +151,10 @@ void ReliableRouter::sniffReceived(const meshtastic_MeshPacket *p, const meshtas
LOG_DEBUG("Received a %s for 0x%x, stopping retransmissions", ackId ? "ACK" : "NAK", ackId);
if (ackId) {
stopRetransmission(p->to, ackId);
// M3: an end-to-end ACK proves the directed route to the ACK's sender currently works,
// so clear its failure count and refresh freshness (keeps a good route pinned).
if (!isBroadcast(getFrom(p)))
noteRouteSuccess(getFrom(p), millis());
} else {
stopRetransmission(p->to, nakId);
}

View File

@@ -114,37 +114,24 @@ bool Router::shouldDecrementHopLimit(const meshtastic_MeshPacket *p)
}
#endif
// For subsequent hops, check if previous relay is a favorite router
// Optimized search for favorite routers with matching last byte
// Check ordering optimized for IoT devices (cheapest checks first)
for (size_t i = 0; i < nodeDB->getNumMeshNodes(); i++) {
meshtastic_NodeInfoLite *node = nodeDB->getMeshNodeByIndex(i);
if (!node)
continue;
// Check 1: is_favorite (cheapest - single bit test)
if (!nodeInfoLiteIsFavorite(node))
continue;
// Check 2: has_user (cheap - single bit test)
if (!nodeInfoLiteHasUser(node))
continue;
// Check 3: role check (moderate cost - multiple comparisons)
if (!IS_ONE_OF(node->role, meshtastic_Config_DeviceConfig_Role_ROUTER, meshtastic_Config_DeviceConfig_Role_ROUTER_LATE,
meshtastic_Config_DeviceConfig_Role_CLIENT_BASE)) {
continue;
}
// Check 4: last byte extraction and comparison (most expensive)
if (nodeDB->getLastByteOfNodeNum(node->num) == p->relay_node) {
// Found a favorite router match
LOG_DEBUG("Identified favorite relay router 0x%x from last byte 0x%x", node->num, p->relay_node);
// For subsequent hops, preserve hop_limit only when the previous relay is UNAMBIGUOUSLY a favorite
// router. The relay_node byte is just the last byte of a 32-bit node number, so on a dense mesh it
// collides; the old "first matching node wins" scan could preserve hops for the wrong node
// (non-deterministic, depends on NodeDB order). resolveLastByte() reports a collision instead, and
// we re-check the favorite/router predicate on the single resolved node. On ambiguity/none we
// decrement (the safe default).
NodeNum resolved = 0;
if (nodeDB->resolveUniqueLastByte(p->relay_node, /*requireDirectNeighbor=*/false, &resolved)) {
meshtastic_NodeInfoLite *node = nodeDB->getMeshNode(resolved);
if (node && nodeInfoLiteIsFavorite(node) && nodeInfoLiteHasUser(node) &&
IS_ONE_OF(node->role, meshtastic_Config_DeviceConfig_Role_ROUTER, meshtastic_Config_DeviceConfig_Role_ROUTER_LATE,
meshtastic_Config_DeviceConfig_Role_CLIENT_BASE)) {
LOG_DEBUG("Identified unique favorite relay router 0x%x from last byte 0x%x", resolved, p->relay_node);
return false; // Don't decrement hop_limit
}
}
// No favorite router match found, decrement hop_limit
// No unambiguous favorite router match found, decrement hop_limit
return true;
}

View File

@@ -0,0 +1,465 @@
// Unit tests for NextHop direct-message reliability mitigations (see docs/nexthop-routing-reliability.md):
// M1 - NodeDB::resolveLastByte / resolveUniqueLastByte (ambiguity-aware last-byte resolution)
// M2 - NextHopRouter::getNextHop strict-neighbor gate + Router::shouldDecrementHopLimit favorite check
// M3 - NextHopRouter route-health freshness / failure decay
//
// Time handling: the route-health helpers take `now` as a parameter so the 30-minute TTL logic is
// pure and testable without a clock mock. getNextHop()/sinceLastSeen() use the real native clock;
// we back-date timestamps relative to it, and the unsigned-subtraction age math is rollover-safe.
#include "MeshTypes.h" // before TestUtil.h: provides NodeNum etc.
#include "TestUtil.h"
#include <unity.h>
#include "configuration.h"
#include "gps/RTC.h"
#include "mesh/NextHopRouter.h"
#include "mesh/NodeDB.h"
#include <cstdio>
#include <cstring>
#include <memory>
#define MSG_BUF_LEN 200
#define TEST_MSG_FMT(fmt, ...) \
do { \
char _buf[MSG_BUF_LEN]; \
snprintf(_buf, sizeof(_buf), fmt, __VA_ARGS__); \
TEST_MESSAGE(_buf); \
} while (0)
static constexpr NodeNum kLocalNode = 0x11111111; // last byte 0x11
// ---------------------------------------------------------------------------
// MockNodeDB — inject nodes with controlled last byte, hop distance, age, role, favorite flag.
// ---------------------------------------------------------------------------
class MockNodeDB : public NodeDB
{
public:
void clearTestNodes()
{
testNodes.clear();
meshNodes = &testNodes;
numMeshNodes = 0;
}
// ageSecs is how long ago we last heard the node; getTime() returns a large Unix timestamp on
// native, so getTime()-ageSecs does not underflow for the ranges used here.
void addNode(NodeNum num, uint8_t hopsAway, bool hasHops, uint32_t ageSecs,
meshtastic_Config_DeviceConfig_Role role = meshtastic_Config_DeviceConfig_Role_CLIENT, bool favorite = false,
bool ignored = false, uint8_t nextHop = NO_NEXT_HOP_PREFERENCE)
{
meshtastic_NodeInfoLite node = meshtastic_NodeInfoLite_init_zero;
node.num = num;
node.has_hops_away = hasHops;
node.hops_away = hopsAway;
node.role = role;
node.next_hop = nextHop;
node.last_heard = getTime() - ageSecs;
nodeInfoLiteSetBit(&node, NODEINFO_BITFIELD_IS_FAVORITE_MASK, favorite);
nodeInfoLiteSetBit(&node, NODEINFO_BITFIELD_IS_IGNORED_MASK, ignored);
nodeInfoLiteSetBit(&node, NODEINFO_BITFIELD_HAS_USER_MASK, true);
testNodes.push_back(node);
meshNodes = &testNodes;
numMeshNodes = testNodes.size();
}
std::vector<meshtastic_NodeInfoLite> testNodes;
};
// ---------------------------------------------------------------------------
// Test shim — expose getNextHop and the route-health helpers; reset health between tests.
// Nulls cryptLock so the Router base can be (re)constructed (same pattern as test_mqtt MockRouter).
// ---------------------------------------------------------------------------
class NextHopRouterTestShim : public NextHopRouter
{
public:
NextHopRouterTestShim() : NextHopRouter()
{
delete cryptLock;
cryptLock = nullptr;
}
using NextHopRouter::clearRouteHealth;
using NextHopRouter::findRouteHealth;
using NextHopRouter::getNextHop;
using NextHopRouter::getOrAllocRouteHealth;
using NextHopRouter::isRouteStale;
using NextHopRouter::noteRouteFailure;
using NextHopRouter::noteRouteLearned;
using NextHopRouter::noteRouteSuccess;
using Router::shouldDecrementHopLimit; // protected in Router
void resetRouteHealthForTest()
{
for (auto &h : routeHealth)
h = RouteHealth{};
}
};
static MockNodeDB *mockNodeDB = nullptr;
static NextHopRouterTestShim *shim = nullptr;
static constexpr uint32_t TTL = NextHopRouter::ROUTE_TTL_MSEC;
static constexpr uint8_t THRESH = NextHopRouter::ROUTE_FAILURE_THRESHOLD;
static constexpr uint8_t HEALTH_MAX = NextHopRouter::ROUTE_HEALTH_MAX;
// Helper: a decoded packet whose hops-away is `hopsAway`, relayed by last byte `relay`.
static meshtastic_MeshPacket makeRelayedPacket(uint8_t relay, uint8_t hopsAway)
{
meshtastic_MeshPacket p = meshtastic_MeshPacket_init_zero;
p.which_payload_variant = meshtastic_MeshPacket_decoded_tag;
p.relay_node = relay;
p.hop_start = 4;
p.hop_limit = 4 - hopsAway; // getHopsAway() == hop_start - hop_limit
return p;
}
void setUp(void)
{
myNodeInfo.my_node_num = kLocalNode;
config.device.role = meshtastic_Config_DeviceConfig_Role_CLIENT;
mockNodeDB->clearTestNodes();
shim->resetRouteHealthForTest();
}
void tearDown(void) {}
// ===========================================================================
// Group 1 — resolveLastByte (M1)
// ===========================================================================
void test_resolve_none_when_empty(void)
{
ResolvedNode r = mockNodeDB->resolveLastByte(0xAB, true);
TEST_ASSERT_EQUAL(LastByteResolution::None, r.status);
}
void test_resolve_zero_byte_is_none(void)
{
// 0 is the NO_RELAY_NODE / NO_NEXT_HOP_PREFERENCE sentinel — never resolves.
mockNodeDB->addNode(0x22222200, 0, true, 60); // last byte maps to 0xFF, not 0
TEST_ASSERT_EQUAL(LastByteResolution::None, mockNodeDB->resolveLastByte(0x00, true).status);
TEST_ASSERT_EQUAL(LastByteResolution::None, mockNodeDB->resolveLastByte(0x00, false).status);
}
void test_resolve_unique_neighbor(void)
{
mockNodeDB->addNode(0x000005AB, 0, true, 60); // direct, fresh, last byte 0xAB
ResolvedNode r = mockNodeDB->resolveLastByte(0xAB, true);
TEST_ASSERT_EQUAL(LastByteResolution::Unique, r.status);
TEST_ASSERT_EQUAL_HEX32(0x000005AB, r.num);
}
void test_resolve_collision_is_ambiguous(void)
{
// Birthday collision: two fresh direct neighbors share last byte 0xAB.
mockNodeDB->addNode(0x000005AB, 0, true, 60);
mockNodeDB->addNode(0x000006AB, 0, true, 60);
ResolvedNode r = mockNodeDB->resolveLastByte(0xAB, true);
TEST_ASSERT_EQUAL(LastByteResolution::Ambiguous, r.status);
TEST_ASSERT_EQUAL_HEX32(0, r.num); // never silently picks one
}
void test_resolve_strict_excludes_stale(void)
{
mockNodeDB->addNode(0x000005AB, 0, true, 60); // fresh
mockNodeDB->addNode(0x000006AB, 0, true, NEXTHOP_NEIGHBOR_FRESH_SECS + 100); // stale
ResolvedNode r = mockNodeDB->resolveLastByte(0xAB, true);
TEST_ASSERT_EQUAL(LastByteResolution::Unique, r.status);
TEST_ASSERT_EQUAL_HEX32(0x000005AB, r.num);
}
void test_resolve_strict_excludes_far(void)
{
mockNodeDB->addNode(0x000005AB, 0, true, 60); // direct neighbor
mockNodeDB->addNode(0x000006AB, 2, true, 60); // 2 hops away -> not a direct neighbor
ResolvedNode r = mockNodeDB->resolveLastByte(0xAB, true);
TEST_ASSERT_EQUAL(LastByteResolution::Unique, r.status);
TEST_ASSERT_EQUAL_HEX32(0x000005AB, r.num);
}
void test_resolve_lenient_includes_favorite_router(void)
{
// Unknown hop distance, but a favorite ROUTER: lenient gate accepts, strict gate does not.
mockNodeDB->addNode(0x000007AB, 0, false, 60, meshtastic_Config_DeviceConfig_Role_ROUTER, /*favorite=*/true);
TEST_ASSERT_EQUAL(LastByteResolution::Unique, mockNodeDB->resolveLastByte(0xAB, false).status);
TEST_ASSERT_EQUAL(LastByteResolution::None, mockNodeDB->resolveLastByte(0xAB, true).status);
}
void test_resolve_lenient_collision_favorite_plus_neighbor(void)
{
mockNodeDB->addNode(0x000007AB, 0, false, 60, meshtastic_Config_DeviceConfig_Role_ROUTER, /*favorite=*/true);
mockNodeDB->addNode(0x000005AB, 0, true, 60); // direct neighbor, same byte
TEST_ASSERT_EQUAL(LastByteResolution::Ambiguous, mockNodeDB->resolveLastByte(0xAB, false).status);
}
void test_resolve_skips_self(void)
{
// A node equal to us (last byte 0x11) must never resolve to ourselves.
mockNodeDB->addNode(kLocalNode, 0, true, 60);
TEST_ASSERT_EQUAL(LastByteResolution::None, mockNodeDB->resolveLastByte(0x11, true).status);
}
void test_resolve_skips_ignored(void)
{
mockNodeDB->addNode(0x000005AB, 0, true, 60, meshtastic_Config_DeviceConfig_Role_CLIENT, false, /*ignored=*/true);
TEST_ASSERT_EQUAL(LastByteResolution::None, mockNodeDB->resolveLastByte(0xAB, true).status);
}
void test_resolve_0x00_maps_to_0xFF_and_collides(void)
{
// getLastByteOfNodeNum() maps ...00 -> 0xFF, so a ...00 node and a ...FF node collide on 0xFF.
TEST_ASSERT_EQUAL_HEX8(0xFF, mockNodeDB->getLastByteOfNodeNum(0x11111100));
mockNodeDB->addNode(0x11111100, 0, true, 60); // last byte 0xFF (remapped)
TEST_ASSERT_EQUAL(LastByteResolution::Unique, mockNodeDB->resolveLastByte(0xFF, true).status);
mockNodeDB->addNode(0x222222FF, 0, true, 60); // genuine ...FF
TEST_ASSERT_EQUAL(LastByteResolution::Ambiguous, mockNodeDB->resolveLastByte(0xFF, true).status);
}
void test_resolve_unique_helper(void)
{
mockNodeDB->addNode(0x000005AB, 0, true, 60);
NodeNum out = 0;
TEST_ASSERT_TRUE(mockNodeDB->resolveUniqueLastByte(0xAB, true, &out));
TEST_ASSERT_EQUAL_HEX32(0x000005AB, out);
mockNodeDB->addNode(0x000006AB, 0, true, 60); // create collision
TEST_ASSERT_FALSE(mockNodeDB->resolveUniqueLastByte(0xAB, true));
}
// ===========================================================================
// Group 2 — getNextHop (M2 send-path gate + M3 decay)
// ===========================================================================
static constexpr NodeNum DEST = 0x000000B0; // DM destination (last byte 0xB0, distinct from 0xAB)
void test_getnexthop_unique_returns_byte(void)
{
mockNodeDB->addNode(DEST, 2, true, 60, meshtastic_Config_DeviceConfig_Role_CLIENT, false, false, /*nextHop=*/0xAB);
mockNodeDB->addNode(0x000005AB, 0, true, 60); // unique fresh neighbor with byte 0xAB
auto nh = shim->getNextHop(DEST, /*relay=*/0x11);
TEST_ASSERT_TRUE(nh.has_value());
TEST_ASSERT_EQUAL_HEX8(0xAB, nh.value());
}
void test_getnexthop_ambiguous_floods(void)
{
mockNodeDB->addNode(DEST, 2, true, 60, meshtastic_Config_DeviceConfig_Role_CLIENT, false, false, /*nextHop=*/0xAB);
mockNodeDB->addNode(0x000005AB, 0, true, 60);
mockNodeDB->addNode(0x000006AB, 0, true, 60); // collision -> ambiguous
TEST_ASSERT_FALSE(shim->getNextHop(DEST, 0x11).has_value());
}
void test_getnexthop_vanished_neighbor_floods(void)
{
mockNodeDB->addNode(DEST, 2, true, 60, meshtastic_Config_DeviceConfig_Role_CLIENT, false, false, /*nextHop=*/0xAB);
// The only 0xAB node is stale -> strict gate yields None -> flood.
mockNodeDB->addNode(0x000005AB, 0, true, NEXTHOP_NEIGHBOR_FRESH_SECS + 100);
TEST_ASSERT_FALSE(shim->getNextHop(DEST, 0x11).has_value());
}
void test_getnexthop_split_horizon_floods(void)
{
mockNodeDB->addNode(DEST, 2, true, 60, meshtastic_Config_DeviceConfig_Role_CLIENT, false, false, /*nextHop=*/0xAB);
mockNodeDB->addNode(0x000005AB, 0, true, 60);
// relay_node == stored next_hop -> don't send it back the way it came.
TEST_ASSERT_FALSE(shim->getNextHop(DEST, /*relay=*/0xAB).has_value());
}
void test_getnexthop_broadcast_is_nullopt(void)
{
TEST_ASSERT_FALSE(shim->getNextHop(NODENUM_BROADCAST, 0x11).has_value());
}
void test_getnexthop_decays_stale_route(void)
{
mockNodeDB->addNode(DEST, 2, true, 60, meshtastic_Config_DeviceConfig_Role_CLIENT, false, false, /*nextHop=*/0xAB);
mockNodeDB->addNode(0x000005AB, 0, true, 60); // a valid unique neighbor exists...
// ...but the health record is older than the TTL, so the route should decay to flooding.
shim->noteRouteLearned(DEST, 0xAB, millis() - (TTL + 5000));
TEST_ASSERT_FALSE(shim->getNextHop(DEST, 0x11).has_value());
// Decay also clears the persisted next_hop and the RAM health record.
TEST_ASSERT_EQUAL_HEX8(NO_NEXT_HOP_PREFERENCE, mockNodeDB->getMeshNode(DEST)->next_hop);
TEST_ASSERT_NULL(shim->findRouteHealth(DEST));
}
// ===========================================================================
// Group 3 — route-health helpers (M3)
// ===========================================================================
void test_health_fresh_not_stale(void)
{
shim->noteRouteLearned(DEST, 0xAB, 1000);
RouteHealth *h = shim->findRouteHealth(DEST);
TEST_ASSERT_NOT_NULL(h);
TEST_ASSERT_FALSE(shim->isRouteStale(*h, 1000 + TTL - 1));
}
void test_health_ttl_expiry(void)
{
shim->noteRouteLearned(DEST, 0xAB, 1000);
RouteHealth *h = shim->findRouteHealth(DEST);
TEST_ASSERT_TRUE(shim->isRouteStale(*h, 1000 + TTL)); // boundary is inclusive (>=)
}
void test_health_ttl_rollover_safe(void)
{
const uint32_t learnAt = 0xFFFFFFFFu - 1000; // learned just before the millis() rollover
shim->noteRouteLearned(DEST, 0xAB, learnAt);
RouteHealth *h = shim->findRouteHealth(DEST);
// 1500 ms later (wrapped to now=500): unsigned subtraction yields ~1500 ms, not "stale".
TEST_ASSERT_FALSE(shim->isRouteStale(*h, 500));
}
void test_health_failure_threshold(void)
{
shim->noteRouteLearned(DEST, 0xAB, 1000);
for (uint8_t i = 1; i < THRESH; i++)
shim->noteRouteFailure(DEST);
TEST_ASSERT_FALSE(shim->isRouteStale(*shim->findRouteHealth(DEST), 1000)); // THRESH-1 failures: ok
shim->noteRouteFailure(DEST);
TEST_ASSERT_TRUE(shim->isRouteStale(*shim->findRouteHealth(DEST), 1000)); // THRESH failures: stale
}
void test_health_success_resets_failures(void)
{
shim->noteRouteLearned(DEST, 0xAB, 1000);
shim->noteRouteFailure(DEST);
shim->noteRouteFailure(DEST);
shim->noteRouteSuccess(DEST, 2000);
RouteHealth *h = shim->findRouteHealth(DEST);
TEST_ASSERT_EQUAL_UINT8(0, h->consecutiveFailures);
TEST_ASSERT_FALSE(shim->isRouteStale(*h, 2000));
}
void test_health_relearn_same_hop_keeps_failures(void)
{
// Anti-flap: an asymmetric reverse path re-teaching the same dead hop must not reset failures.
shim->noteRouteLearned(DEST, 0xAB, 1000);
shim->noteRouteFailure(DEST);
shim->noteRouteFailure(DEST);
shim->noteRouteLearned(DEST, 0xAB, 2000); // same hop
TEST_ASSERT_EQUAL_UINT8(2, shim->findRouteHealth(DEST)->consecutiveFailures);
}
void test_health_relearn_new_hop_resets_failures(void)
{
shim->noteRouteLearned(DEST, 0xAB, 1000);
shim->noteRouteFailure(DEST);
shim->noteRouteFailure(DEST);
shim->noteRouteLearned(DEST, 0xCD, 2000); // genuinely new hop -> clean slate
RouteHealth *h = shim->findRouteHealth(DEST);
TEST_ASSERT_EQUAL_UINT8(0, h->consecutiveFailures);
TEST_ASSERT_EQUAL_HEX8(0xCD, h->lastNextHop);
}
void test_health_failure_without_record_is_noop(void)
{
shim->noteRouteFailure(DEST); // no record yet
TEST_ASSERT_NULL(shim->findRouteHealth(DEST));
}
void test_health_clear(void)
{
shim->noteRouteLearned(DEST, 0xAB, 1000);
TEST_ASSERT_NOT_NULL(shim->findRouteHealth(DEST));
shim->clearRouteHealth(DEST);
TEST_ASSERT_NULL(shim->findRouteHealth(DEST));
}
void test_health_lru_eviction_bounds_table(void)
{
// Fill every slot with increasing learn times, then add one more: the oldest must be evicted.
for (uint8_t i = 0; i < HEALTH_MAX; i++)
shim->noteRouteLearned(0x1000 + i, 0xAB, 1000 + (uint32_t)i * 1000);
NodeNum oldest = 0x1000;
TEST_ASSERT_NOT_NULL(shim->findRouteHealth(oldest));
shim->noteRouteLearned(0x2000, 0xAB, 1000 + (uint32_t)HEALTH_MAX * 1000); // overflow
TEST_ASSERT_NULL(shim->findRouteHealth(oldest)); // evicted
TEST_ASSERT_NOT_NULL(shim->findRouteHealth(0x2000)); // newest present
}
// ===========================================================================
// Group 4 — shouldDecrementHopLimit favorite-router resolution (M2, site 4)
// ===========================================================================
void test_hoplimit_preserve_unique_favorite_router(void)
{
config.device.role = meshtastic_Config_DeviceConfig_Role_ROUTER;
mockNodeDB->addNode(0x000007AB, 0, true, 60, meshtastic_Config_DeviceConfig_Role_ROUTER, /*favorite=*/true);
meshtastic_MeshPacket p = makeRelayedPacket(/*relay=*/0xAB, /*hopsAway=*/1);
TEST_ASSERT_FALSE(shim->shouldDecrementHopLimit(&p)); // preserve
}
void test_hoplimit_decrement_on_colliding_favorites(void)
{
// Headline regression: two favorite routers share the relay byte -> ambiguous -> decrement
// (the old "first NodeDB match wins" scan would non-deterministically preserve).
config.device.role = meshtastic_Config_DeviceConfig_Role_ROUTER;
mockNodeDB->addNode(0x000007AB, 0, true, 60, meshtastic_Config_DeviceConfig_Role_ROUTER, /*favorite=*/true);
mockNodeDB->addNode(0x000008AB, 0, true, 60, meshtastic_Config_DeviceConfig_Role_ROUTER, /*favorite=*/true);
meshtastic_MeshPacket p = makeRelayedPacket(0xAB, 1);
TEST_ASSERT_TRUE(shim->shouldDecrementHopLimit(&p)); // decrement
}
void test_hoplimit_decrement_when_resolved_not_favorite(void)
{
config.device.role = meshtastic_Config_DeviceConfig_Role_ROUTER;
mockNodeDB->addNode(0x000007AB, 0, true, 60, meshtastic_Config_DeviceConfig_Role_ROUTER, /*favorite=*/false);
meshtastic_MeshPacket p = makeRelayedPacket(0xAB, 1);
TEST_ASSERT_TRUE(shim->shouldDecrementHopLimit(&p)); // unique but not a favorite -> decrement
}
// ===========================================================================
void setup()
{
initializeTestEnvironment();
UNITY_BEGIN();
mockNodeDB = new MockNodeDB();
shim = new NextHopRouterTestShim();
nodeDB = mockNodeDB;
printf("\n=== resolveLastByte (M1) ===\n");
RUN_TEST(test_resolve_none_when_empty);
RUN_TEST(test_resolve_zero_byte_is_none);
RUN_TEST(test_resolve_unique_neighbor);
RUN_TEST(test_resolve_collision_is_ambiguous);
RUN_TEST(test_resolve_strict_excludes_stale);
RUN_TEST(test_resolve_strict_excludes_far);
RUN_TEST(test_resolve_lenient_includes_favorite_router);
RUN_TEST(test_resolve_lenient_collision_favorite_plus_neighbor);
RUN_TEST(test_resolve_skips_self);
RUN_TEST(test_resolve_skips_ignored);
RUN_TEST(test_resolve_0x00_maps_to_0xFF_and_collides);
RUN_TEST(test_resolve_unique_helper);
printf("\n=== getNextHop (M2 + M3 decay) ===\n");
RUN_TEST(test_getnexthop_unique_returns_byte);
RUN_TEST(test_getnexthop_ambiguous_floods);
RUN_TEST(test_getnexthop_vanished_neighbor_floods);
RUN_TEST(test_getnexthop_split_horizon_floods);
RUN_TEST(test_getnexthop_broadcast_is_nullopt);
RUN_TEST(test_getnexthop_decays_stale_route);
printf("\n=== route-health helpers (M3) ===\n");
RUN_TEST(test_health_fresh_not_stale);
RUN_TEST(test_health_ttl_expiry);
RUN_TEST(test_health_ttl_rollover_safe);
RUN_TEST(test_health_failure_threshold);
RUN_TEST(test_health_success_resets_failures);
RUN_TEST(test_health_relearn_same_hop_keeps_failures);
RUN_TEST(test_health_relearn_new_hop_resets_failures);
RUN_TEST(test_health_failure_without_record_is_noop);
RUN_TEST(test_health_clear);
RUN_TEST(test_health_lru_eviction_bounds_table);
printf("\n=== shouldDecrementHopLimit (M2 site 4) ===\n");
RUN_TEST(test_hoplimit_preserve_unique_favorite_router);
RUN_TEST(test_hoplimit_decrement_on_colliding_favorites);
RUN_TEST(test_hoplimit_decrement_when_resolved_not_favorite);
exit(UNITY_END());
}
void loop() {}