@illotum, drop localapi/routecheck_disabled.go because these methods
don’t need to be there if the client is built with ts_omit_routecheck.
Signed-off-by: Simon Law <sfllaw@tailscale.com>
In order to support a `tailscale routecheck` command, we introduce the
`/localapi/v0/routecheck` endpoint to the local API. This endpoint
returns the most recent report collected by the routecheck client.
If `force=true` is an argument in the query string, then this endpoint
will actively probe before returning the report.
Updates #17366
Updates tailscale/corp#33033
Signed-off-by: Simon Law <sfllaw@tailscale.com>
@illotum points out that there was a race between atomic.Pointer.Load
and Store. Of course, we should have used Swap.
Signed-off-by: Simon Law <sfllaw@tailscale.com>
@amalscale points out that we already have pingTimeout in magicsock,
so I’ve extracted the default out as a tsconst.
Signed-off-by: Simon Law <sfllaw@tailscale.com>
@bradfitz suggests signalling using channels is better than sync.Cond,
because mutexes and channels don’t play nicely.
Signed-off-by: Simon Law <sfllaw@tailscale.com>
routecheck.Extension.onNetMapToggle doesn’t need to query for a netmap
when it has just been passed a fresh one.
Signed-off-by: Simon Law <sfllaw@tailscale.com>
The routecheck package parallels the netcheck package, where the
former checks routes and routers while the latter checks networks.
Like netcheck, it compiles reports for other systems to consume.
Historically, the client has never known whether a peer is actually
reachable. Most of the time this doesn’t matter, since the client will
want to establish a WireGuard tunnel to any given destination.
However, if the client needs to choose between two or more nodes,
then it should try to choose a node that it can reach.
Suggested exit nodes are one such example, where the client filters
out any nodes that aren’t connected to the control plane. Sometimes an
exit node will get disconnected from the control plane: when the
network between the two is unreliable or when the exit node is too
busy to keep its control connection alive. In these cases, Control
disables the Node.Online flag for the exit node and broadcasts this
across the tailnet. Arguably, the client should never have relied on
this flag, since it only makes sense in the admin console.
This patch implements an initial routecheck client that can probe
every node that your client knows about. You should not ping scan your
visible tailnet, this method is for debugging only.
This patch also introduces a new OnNetMapToggle hook, which fires when
the netmap transitions from nil to non-nil, or vice versa. This
happens either when the client receives its first MapResponse after
connecting to the control plane, or when it clears the netmap while it
is disconnecting. Routecheck uses this to wait for a valid netmap
so it knows which peers to probe.
Updates #17366
Updates tailscale/corp#33033
Signed-off-by: Simon Law <sfllaw@tailscale.com>
There are two places where tailscaled transitions into a paused state:
1. tailscaled’s controlclient is initially created,
2. tailscale down, or the GUI equivalent, commands it to.
This patch unifies the implementation of both scenarios into
LocalBackend.shouldPauseControlClientLocked to prevent the
implementation from drifting.
The flaky tstest/integration.TestNoControlConnWhenDown test exposed
this mismatch, but only by accident. This patch also changes
TestNode.MustDown so that it runs `tailscale down` and then waits for
the testcontrol server to finish handling any associated /machine/map
requests.
Fixes#19831
Signed-off-by: Simon Law <sfllaw@tailscale.com>
In aa5da2e5f2 we made the IPN bus include deltas, including the
PeersRemoved, sending a slice of integer NodeIDs that were
removed. But when updating xcode, I realized there was no way to map
those integers to the stable node IDs used in other places.
I was consdering changing the just-added ipn.Notify.PeersRemoved from
an IntID to a string StableID, but then it doesn't match the MapResponse
wire protocol, which we've tried to match so far.
Instead, just add the integer ID as well. Callers can use whichever
world they want, having both. It's a little regrettable that we still
have two worlds of IDs, but oh well. Neither is really suitable to a
hypothetical future fully federated world of control servers anyway,
so we'll need a third type later anyway, so just live with the two we
have for now.
Updates #12542
Change-Id: Ib8fd48a265e1da1f8779152f141f624a7f7260e9
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Holding an exclusive lock while writing to the unbuffered changequeue chan
is likely going to deadlock when the run() path may try to grab the same lock
before reading from the chan to drain it (on map session close). This causes
the client to stop processing new map responses and TSMP disco key advertisements.
There is a good probability of inducing this deadlock using the old code and new
test added in this commit: TestUpdateDiscoForNodeCallback/test_deadlock.
Also fix an unintentional regression in how the client responds to a mapResponse sleep
command. 85bb5f84a5 moved the processing of mapResponses into a new goroutine,
serialized via mapSession's changequeue. Thus, controlclient stopped sleeping in the
same goroutine servicing mapResponses/control connections. This commit brings us back
to sleeping synchronously in the same goroutine as controlclient.
Updates #12639
Signed-off-by: Amal Bansode <amal@tailscale.com>
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
Co-authored-by: Claus Lensbøl <claus@tailscale.com>
In PR tailscale/corp#30448, we originally decided to break ties using
SHA256 for our rendezvous hashing algorithm. Now that we’ve had some
experience with it, we think that FNV-1a is a better choice. It
distributes bits evenly, it’s much faster, and it doesn’t need to be
cryptographically secure. The FNV designers recommend FNV-1a over the
deprecated FNV-1.
This PR makes the switch and updates the related tests, since changing
the algorithm changes which stable pick gets selected. As of 2026-05,
this is the best time to make this change, since there are almost no
clients in the wild with traffic steering enabled.
Updates #17366
Updates tailscale/corp#29964
Updates tailscale/corp#29966
Updates tailscale/corp#33033
Signed-off-by: Simon Law <sfllaw@tailscale.com>
For large tailnets (~50k+ nodes) with frequent peer churn (ephemeral
GitHub Actions workers etc.), tailscaled used to rebuild the full
netmap and fan it out on the IPN bus on every MapResponse that
added or removed a peer. There were two O(N) costs per delta: the
full netmap rebuild + every Notify.NetMap encode to every bus watcher.
This change tackles both:
1. Plumb O(1) peer add/remove through the delta path. PeersChanged
and PeersRemoved no longer prevent the delta happy path; instead,
they mutate the per-node-backend peer map in place.
2. Restrict ipn.Notify.NetMap emission to the platforms whose host
GUIs still depend on it (Windows, macOS, iOS) and migrate
in-tree consumers off it everywhere else:
- Migrate reactive consumers (containerboot, kube agents,
sniproxy, tsconsensus, etc.) off Notify.NetMap to the
previously-added Notify.SelfChange signal so they no longer
have to subscribe to the full netmap.
- Add ipn.NotifyNoNetMap so GUI clients on "legacy-emit" platforms
that have already migrated can opt out of the per-watcher
NetMap encode.
- Gate Notify.NetMap emission on the producer side by a compile-
time GOOS check, so the supporting code is dead-code-eliminated
on Linux and other geese where no GUI consumer needs it.
Re-running BenchmarkGiantTailnet from tstest/largetailnet, which was
added along with baseline numbers on unmodified main in ad5436af0d,
the per-delta cost (one peer add+remove pair) is now ~O(1) regardless
of tailnet size N:
N no-watcher (ms/op) bus-watcher (ms/op)
before now factor before now factor
10000 32 0.11 300x 166 0.13 1300x
50000 222 0.11 2000x 865 0.13 6700x
100000 504 0.12 4100x 1765 0.13 13400x
250000 1551 0.12 12500x 4696 0.15 32400x
Updates #12542
Change-Id: I94e34b37331d1a8ec74c299deffadf4d061fda9e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
SetDERPMap spawns a goroutine that calls ReSTUN, which logs via the
test logger. If the test returns before that goroutine logs, the
goroutine races with testing cleanup.
Use tstest.WhileTestRunningLogger so the goroutine's logf call becomes
a no-op once the test finishes.
Fixes#19829
Change-Id: I1097f98e40ffd1c5dd7fb7a715c918255853e3c6
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The traffic package contains helpers for evaluating traffic steering
scores and picking appropriate nodes. These were extracted from
ipnlocal.suggestExitNodeUsingTrafficSteering so they can be reused by
the new routecheck package to probe exit nodes in priority order.
Updates #17366
Updates tailscale/corp#33033
Signed-off-by: Simon Law <sfllaw@tailscale.com>
When tailscaled is running in userspace-networking mode behind an
exit node (e.g. as a SOCKS5 proxy), it resolves a hostname and then
dials a single resolved IP through the tunnel. If the name has both
A and AAAA, Go's net.Resolver merges them and we pick ips[0], which
on an IPv6-native host is usually AAAA. If the exit node has no IPv6
egress (or vice versa), the dial fails silently through the tunnel
and the user sees a hang.
Resolve all candidates and race connect attempts across address
families with a 300ms happy-eyeballs delay, matching Go's net.Dialer
default and the existing pattern in net/dnscache (commit ee0a03b14).
First success wins; losers are cancelled and any conns they produce
are closed. A failBoost channel wakes the launcher when a connect
fails fast (e.g. ICMP "no route" via the tunnel) so we don't sit on
the 300ms timer when the answer is already known.
userDialResolve is refactored into userDialResolveAll (returns the
full candidate list) plus a thin single-IP wrapper for callers like
UserDialPlan that don't race. UserDial's per-IP dispatch (netstack
vs peer dialer vs SystemDial vs std) is extracted to dialOneUser so
each candidate can route correctly on its own merits.
Also fix serveDial in localapi to pass the original hostname to
UserDial rather than a pre-resolved IP, so the race can fire.
This fix is single-ended: it works against any exit node, including
old ones, with no protocol changes. The trade-off versus filtering
on the exit-node side via PeerAPI DoH is that every dial through an
unreachable-family exit node costs one failed connect attempt per
cache window, rather than zero, which is acceptable given the
simplicity.
Fixes#19792Fixes#13257
Change-Id: I9d7645d0034caf3ee22ecdd8070798353f77e94b
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
serveMap cloned s.nodes[nk], mutated the clone outside the mutex,
then wrote it back via updateNodeLocked. A concurrent UpdateNode,
SetNodeCapMap, or other writer landing between the clone and the
writeback would be silently clobbered. Mutate the live node under
the mutex instead.
Surfaces in tsnet's TestListenService as a flaky ErrUntaggedServiceHost
panic: the test calls control.UpdateNode to attach a tag, a concurrent
updateRoutine map request from the host races, and the host's next
netmap arrives with Tags=[].
Updates #19822
Change-Id: I6c5ebd5e5bf79a40316f53f627157230773cb469
Signed-off-by: James Tucker <james@tailscale.com>
Some netmap updates are guaranteed to affect only the "static" parts of the
netmap, and so should not require us to walk through all the peers and user
profiles when updating the cache. To support this, the new UpdateSelfOnly
method updates only the Self node and other tailnet settings that are not
dependent on the peers and profiles.
Use this when updating the cache on DERP home changes.
Updates #12542
Change-Id: Ifed522b29d579fb76e010b4ff738cc4e0a72d27f
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
The TestShouldUseOneCGNATRoute test fails when the underlying system
interfaces don’t match what the underlying assumptions of the test.
That assumption was that there would only ever be one CGNAT interface:
the Tailscale one.
This breaks on Linux when border0 is installed because border0 also
creates an interface with a CGNAT route.
This patch stubs netmon.RegisterInterfaceGetter to replace the system
interfaces and netmon.SetTailscaleInterfaceProps to identify the test
data that defines the Tailscale interface.
This patch also tests the control knob override for CGNAT for every
combination of operating system and system interfaces, instead of just
a couple of combinations.
Fixes#19731
Signed-off-by: Simon Law <sfllaw@tailscale.com>
Add Go tests that drive a real headless Chromium (via chromedp) against
the built cmd/tsconnect/pkg/ artifact and verify the @tailscale/connect
public API surface end-to-end. The package has not been republished in
three years, in part because no test exercises the produced artifact at
runtime — only tsc --noEmit and a Go build run in CI.
TestCreateIPN loads pkg.js into the browser, calls createIPN with a junk
auth key, and asserts that pkg.createIPN / pkg.runSSHSession are
functions and that createIPN() returns an IPN with the documented
run/login/logout/ssh/fetch methods. No control-plane traffic.
TestFetchTailnetPeer stands up a full local tailnet (testcontrol +
DERP + a tsnet.Server peer) and verifies that the browser-side WASM
client can join over WebSocket-noise to the same control, connect to
DERP over WSS, and then ipn.fetch() an HTTP service hosted on the tsnet
peer through the tailnet. The test asserts the response body matches a
known string. Browser state transitions are logged: NoState -> NeedsLogin
-> Starting -> Running.
Tests are opt-in via --run-headless-browser-tests (matching the existing
--run-vm-tests pattern in tstest/natlab/vmtest) so they never fire in
casual `go test ./...` runs. When the flag is set, a test is skipped if
cmd/tsconnect/pkg/ has not been built, and fails with t.Error if no
chromium binary is found on $PATH (honoring $CHROME_BIN as an override).
findChromium also falls back to /Applications/Google Chrome.app and
/Applications/Chromium.app on darwin, since macOS Chrome's executable
lives inside an .app bundle and is not on $PATH by default. The
.github/workflows/test.yml wasm job is extended to install
google-chrome-stable and run the tests with the flag after build-pkg.
To prevent silently testing a stale pkg/main.wasm (built from an older
checkout than the rest of the test invocation), build-pkg now writes
pkg/build-info.json recording the sha256 of the raw (pre-wasm-opt)
go-build output. The test does its own `go build` of
cmd/tsconnect/wasm with the same -tags/-trimpath/-ldflags (factored
into a new cmd/tsconnect/wasmbuild package shared by both call sites)
and t.Fatalfs with a "rebuild" instruction on mismatch. Cost is
near-zero because the Go build cache from the prior build-pkg makes
the rebuild a cache hit.
The new wasmbuild package also replaces cmd/tsconnect's hardcoded -tags
string with a minimal-feature-set computation. wasmbuild.Keep names the
small set of feature/featuretags entries the browser client actually
needs (netstack, logtail, dns, health, c2n, ipnbus); wasmbuild.Tags()
emits a ts_omit_<f> for every other
omittable feature in feature/featuretags.Features, with transitive deps
expanded via featuretags.Requires. An init() panics if Keep references
a feature unknown to feature/featuretags so a rename there fails
loudly. Net effect on size: 32M raw / 9.4M brotli before this change,
25M raw / 4.4M brotli after — vs the last-published 1.39.98 at 21M /
3.8M. The transitive package-import graph is unchanged (176
tailscale.com/* packages either way): featuretags omits eliminate
dead code via `const HasX = false`, not imports. Trimming the import
graph would require a separate, larger refactor splitting interface
packages by build tag.
Writing TestFetchTailnetPeer surfaced several real issues, all fixed
here:
* cmd/tsconnect built the wasm with the nethttpomithttp2 tag, but
control/ts2021 (since commit 1d93bdce2, "control/controlclient:
remove x/net/http2, use net/http", Oct 2025) requires HTTP/2 from
net/http's bundled implementation. With nethttpomithttp2 set, the
bundle is excluded and the wasm client cannot speak HTTP/2 to any
control plane, including production. Drop the tag. Wasm size grows
~1 MB raw / ~300 KB brotli (more than offset by the feature
pruning above). The last published @tailscale/connect (1.39.98,
early 2023) pre-dates the regression, which is why no consumer has
reported the breakage.
* tstest/integration/testcontrol.Server's /ts2021 noise upgrade
endpoint rejected anything but POST. WebSocket clients (the only
transport available to browser-WASM) come in as GET. Allow both;
the controlhttp AcceptHTTP path dispatches on the Upgrade header,
so the websocket library still enforces GET for WS upgrades.
This matches production, where the same controlhttpserver.AcceptHTTP
routes purely on the Upgrade header without checking method.
* derp/derphttp's urlString built the DERP URL from node.HostName
only, dropping node.DERPPort. Non-WS clients use a separate code
path (connectToHost) that honors DERPPort, but WebSocket-only
clients (browser-WASM) went through urlString and so could not
reach a DERP running on any port other than 443. Include the port
when it differs from the scheme default.
Also move addWebSocketSupport from cmd/derper (where it was main-only)
to derp/derpserver.AddWebSocketSupport so tstest/integration.RunDERPAndSTUN
can wrap its DERP handler with WebSocket support — without that, the
test DERP would not accept the browser's wss connection.
Fixes#9394
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: Iff9cdee303e3b239924249b5bffb2fd04e02f391
A data race in a package matters more than any individual test
result. Two related problems:
1. Where go test's race detector text ("WARNING: DATA RACE" plus
the goroutine stack traces) lands in JSON output is timing-
dependent: it can be attributed to a test that ends up reporting
PASS (e.g. when the racing goroutines outlive the test that
spawned them and TSan prints during a different test's window).
testwrapper's main loop only flushes the logs of failed tests,
so the race report ends up stuck in a passing test's buffer and
is silently dropped. The race builders just see a bare
"FAIL\nFAIL\tpkg\ttime".
2. If the failing test in such a package happens to be marked flaky,
testwrapper retries it. That is the worst possible response to a
race: the flaky test might not even be the racy code, and a
second run without the racy goroutines could "succeed" while
hiding the real bug.
Address both: scan every output line for the race detector's first-
line marker. Track whether the package observed a race at all, on
the pkgFinished testAttempt. When a race was seen, fold every per-
test log buffer into the package-level logs (so the full report
surfaces from the existing pkg-fail flush path), and drop any
flaky-test retry plans for that package so we fail immediately
instead of running another attempt.
Two new tests:
- TestRaceSuppressesFlakyRetry verifies that a flaky test alongside
a racy test does NOT get retried.
- TestRaceAttributedToPassingTest verifies that a race attributed by
test2json to a passing test still surfaces in the output.
Also add a corpus of captured raw test binary outputs under
cmd/testwrapper/testdata/, with one subdirectory per scenario,
documenting the six representative shapes that go test -race can
emit (race in test body, race in goroutines that outlive a test,
race forced into a later test, race in TestMain post-m.Run, and a
parallel-tests split-attribution case via a "=== NAME" redirect
line). See its README.md for details.
Fixes#19603
Change-Id: Ifbfcd67fb3b1882c4907bd9cb2d68a8b5a91dd54
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
If the context given to DialContext has a shorter lifetime than the OS
TCP SYN timeout, and TCP SYNs are dropped from the path to the remote,
DialContext would never fall back to try IPv6 after IPv4.
Instead, use the normal happy eyeballs race if there is more than one
address. This does remove the implicit prioritization of IPv4 over IPv6
in cases where there is only a single IPv4 remote address.
Updates #13346
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
Issue #19737 ran into a nil pointer dereference, the cause of which was fixed
by #19761. If we end up on this code path with a nil table again, we should
bubble that up as an error (which is logged by the health warning system)
rather than failing catastrophically.
Signed-off-by: Naman Sood <mail@nsood.in>
The Engine watchdog wrapped every wgengine.Engine method call in a
goroutine with a 45s timeout and crashed the process on timeout. It
was added years ago to surface deadlocks during development, but the
underlying deadlocks have long since been fixed, and even when it did
fire it produced obscure stack traces (from inside the watchdog
goroutine, not the original caller) without buying much.
Audit of userspaceEngine's methods shows none have cyclic locking or
unbounded blocking now that ResetAndStop no longer loops waiting for
DERPs to drain (fa49009ee). The watchdog is dead weight; remove it
along with the TS_DEBUG_DISABLE_WATCHDOG escape hatch.
Updates #19759
Change-Id: Iba9d718fe1f8718a6631296e336b138c31b99ff1
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
RouteCheck, which checks that overlapping routers are reachable, is
enabled by default for both tailscaled and tsnet.
Updates #17366
Updates tailscale/corp#33033
Signed-off-by: Simon Law <sfllaw@tailscale.com>
Move the inline CSS and JS into separate files to be more friendly
to Content Security Policies. ServeHTTP is updated to serve these
assets from the '/static/' path.
Updates tailscale/corp#32398
Signed-off-by: Noel O'Brien <noel@tailscale.com>
linuxRouter has two blocks (connmark rules and the CGNAT drop rule) that
gate on cfg.NetfilterMode, the requested config state. This may cause an
error when setNetfilterModeLocked fails, since it may keep assuming this
config is valid.
We now gate both blocks on r.netfilterMode, matching the pattern used by
SNAT, stateful, and loopback paths.
Fixes#19737
Change-Id: Ia6003a082db99c376e662132d725661afbac0ee9
Signed-off-by: Fernando Serboncini <fserb@tailscale.com>
cibuild.On() returns true for any CI environment that sets CI=true,
including Alpine Linux's package build CI. TestTsgoRevInCacheKey was
guarded by cibuild.On() (or use of tsgo), so it ran under Alpine's CI
with stock Go, where go.toolchain.rev isn't blended into build cache
keys, and unsurprisingly failed.
Add cibuild.OnTailscaleCI, which keys off GITHUB_REPOSITORY_OWNER to
distinguish tailscale/tailscale's own GitHub Actions CI from arbitrary
downstream CI, and use it in TestTsgoRevInCacheKey.
Fixes#19754
Change-Id: Id31cfe71903a235f1460dca1e2fdf334e3ba1ee5
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Since f343b496c3 ("wgengine, all: remove LazyWG, use wireguard-go
callback API for on-demand peers"), Reconfig is fully synchronous:
magicConn.UpdatePeers, wgdev.RemovePeer, router.Set, and dns.Set all
return when the work is done, and the peer list is updated under
wgLock before Reconfig returns. So after Reconfig with empty configs,
len(st.Peers) is already 0.
The old loop also waited for st.DERPs to drain to 0, but UpdatePeers
only edits maps; active DERP connections idle out on their own
timeout. The sole caller (LocalBackend.stopEngineAndWait) doesn't
inspect st.DERPs anyway; it just hands the Status to
setWgengineStatusLocked. So the drain-wait was for nothing observable
and could theoretically (or at least appear to readers to) loop
forever holding b.mu. Remove that reader confusion by removing
the backoff loop entirely.
Updates #19759
Change-Id: Ibfac3f0baabcad7604b713c934a8fc37932e0a50
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>