Files
tailscale/wgengine/netstack
Josef Bacik b0e63cbeb9 wgengine/netstack: add TS_NETSTACK_KEEPALIVE_{IDLE,INTERVAL} envknobs
Adds envknobs to override the netstack default TCP keepalive idle time
(~2h) and probe interval (75s) for forwarded connections.

When a tailnet peer goes away without closing its connections (pod
deleted, peer removed from the netmap, silent network partition), the
forwardTCP io.Copy goroutines block until keepalive fires: the
gvisor-side Read waits on a peer that will never send again, and the
backend-side Read waits on a backend that is alive and idle. With the
netstack default of 7200s idle + 9×75s probes, dead-peer detection
takes a little over two hours. Under high-churn forwarding — many
short-lived peers, or peers holding thousands of proxied connections
that drop at once — stuck goroutines accumulate faster than they clear.

The existing SetKeepAlive(true) at this site enables keepalive without
setting the timers; the TODO above it noted "a shorter default might
be better" and "might be a useful user-tunable". This makes both
timers tunable without changing the defaults: unset preserves the ~2h
behavior, which is the right trade-off for battery-powered peers.

The two knobs are independent — setting one leaves the other at the
netstack default. The options are set before SetKeepAlive(true) so the
timer arms with the configured values rather than the defaults —
matches the order in ipnlocal/local.go for SSH keepalive.

Updates #4522

Signed-off-by: Josef Bacik <josefbacik@anthropic.com>
2026-03-17 13:44:11 -07:00
..