Adds logic for containerboot to signal that it can't auth, so the
operator can reissue a new auth key. This only applies when running with
a config file and with a kube state store.
If the operator sees reissue_authkey in a state Secret, it will create a
new auth key iff the config has no auth key or its auth key matches the
value of reissue_authkey from the state Secret. This is to ensure we
don't reissue auth keys in a tight loop if the proxy is slow to start or
failing for some other reason. The reissue logic also uses a burstable
rate limiter to ensure there's no way a terminally misconfigured
or buggy operator can automatically generate new auth keys in a tight loop.
Additional implementation details (ChaosInTheCRD):
- Added `ipn.NotifyInitialHealthState` to the ipn watcher, to ensure that
  `n.Health` is populated when notifications are returned.
- On auth failure, containerboot:
  - Disconnects from control server
  - Sets reissue_authkey marker in state Secret with the failing key
  - Polls config file for new auth key (10 minute timeout)
  - Restarts after receiving new key to apply it
- Modified operator's reissue logic slightly:
  - Deletes old device from tailnet before creating new key
  - Rate limiting: 1 key per 30s with initial burst equal to replica count
  - In-flight tracking (authKeyReissuing map) prevents duplicate API calls
    across reconcile loops
Updates #14080
Change-Id: I6982f8e741932a6891f2f48a2936f7f6a455317f
(cherry picked from commit 969927c47c)
Signed-off-by: Tom Proctor <tomhjp@users.noreply.github.com>
Co-authored-by: chaosinthecrd <tom@tmlabs.co.uk>
This amends the session creation and auth status querying logic of the device UI
backend. On creation of new browser sessions we now store a PendingAuth flag
as part of the session that indicates a pending auth process that needs to be
awaited. On auth status queries, the server starts polling for the auth result
if it finds this flag to be true. Once the polling completes, the flag is set to false.
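A minimal sketch of the flag handling described above, assuming a reduced session type. The names `session`, `authStatus`, and `waitForAuth` are hypothetical and stand in for the web client's real session and control-polling code.

```go
package main

import "fmt"

// session is a hypothetical, reduced browser session record.
type session struct {
	PendingAuth   bool // set at creation: an auth result still has to be awaited
	Authenticated bool
}

// authStatus sketches the status query: if an auth flow is still pending,
// await its result first (waitForAuth stands in for polling control),
// then clear the flag so later queries do not poll again.
func authStatus(s *session, waitForAuth func() bool) bool {
	if s.PendingAuth {
		s.Authenticated = waitForAuth()
		s.PendingAuth = false
	}
	return s.Authenticated
}

func main() {
	calls := 0
	wait := func() bool { calls++; return true }
	s := &session{PendingAuth: true}
	first := authStatus(s, wait)
	second := authStatus(s, wait)
	fmt.Println(first, second, calls) // true true 1
}
```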
Why this change was necessary: with regular browser settings, the device UI
frontend opens the control auth URL in a new tab and starts polling for the
results of the auth flow in the current tab. With certain browser settings (that
we still want to support), however, the auth URL opens in the same tab, thus
aborting the subsequent call to auth/session/wait that initiates the polling,
and preventing successful registration of the auth results in the session
status. The new logic ensures the polling happens on the next call to /api/auth
in these kinds of scenarios.
In addition to ensuring the auth wait happens, we now also revalidate the auth
state whenever an open tab regains focus, so that auth changes effected in one
tab propagate to other tabs without the need to refresh. This improves the
experience for all users of the web client when they've got multiple tabs open,
regardless of their browser settings.
Fixes #11905
Signed-off-by: Gesa Stupperich <gesa@tailscale.com>
This makes tsnet apps not depend on x/crypto/ssh and locks that in with a test.
It also paves the way for tsnet apps to opt-in to SSH support via a
blank feature import in the future.
Updates #12614
Change-Id: Ica85628f89c8f015413b074f5001b82b27c953a9
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Two issues caused TestCollectPanic to flake:
1. ETXTBSY: The test exec'd the tailscaled binary directly without
   going through StartDaemon/awaitTailscaledRunnable, so it lacked
   the retry loop that other tests use to work around a mysterious
   ETXTBSY on GitHub Actions.
2. Shared filch files: The test didn't pass --statedir or TS_LOGS_DIR,
   so all parallel test instances wrote panic logs to the shared system
   state directory (~/.local/share/tailscale). Concurrent runs would
   clobber each other's filch log files, causing the second run to not
   find the panic data from the first.
Fix both by adding awaitTailscaledRunnable before the first exec, and
passing --statedir and TS_LOGS_DIR to isolate each test's log files,
matching what StartDaemon does.
It now passes x/tools/cmd/stress.
Fixes #15865
Change-Id: If18b9acf8dbe9a986446a42c5d98de7ad8aae098
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
When IPv6 is unavailable on a system, AddConnmarkSaveRule() and
DelConnmarkSaveRule() would panic with a nil pointer dereference.
Both methods directly iterated over []iptablesInterface{i.ipt4, i.ipt6}
without checking if ipt6 was nil.
Use `getTables()` instead to properly retrieve the available tables
on a given system.
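The pattern can be sketched as below. This is an illustration under assumptions, not the real runner: the `iptablesInterface` shown here is reduced to one method, and `runner`, `fakeIPT`, and `addRule` are hypothetical names.

```go
package main

import "fmt"

// iptablesInterface is a reduced, hypothetical version of the real interface.
type iptablesInterface interface {
	Append(table, chain string, args ...string) error
}

// fakeIPT records which table/chain pairs were touched.
type fakeIPT struct{ rules []string }

func (f *fakeIPT) Append(table, chain string, args ...string) error {
	f.rules = append(f.rules, table+"/"+chain)
	return nil
}

type runner struct {
	ipt4 iptablesInterface
	ipt6 iptablesInterface // nil when IPv6 is unavailable
}

// getTables returns only the tables usable on this system, so callers
// never invoke a method on a nil ipt6.
func (r *runner) getTables() []iptablesInterface {
	tables := []iptablesInterface{r.ipt4}
	if r.ipt6 != nil {
		tables = append(tables, r.ipt6)
	}
	return tables
}

// addRule iterates getTables() rather than a literal {ipt4, ipt6} slice.
func (r *runner) addRule() error {
	for _, t := range r.getTables() {
		if err := t.Append("mangle", "OUTPUT"); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	v4 := &fakeIPT{}
	r := &runner{ipt4: v4} // no IPv6 on this system
	fmt.Println(r.addRule(), v4.rules)
}
```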
Fixes #3310
Signed-off-by: Mike O'Driscoll <mikeo@tailscale.com>
Raw byte accessors for key types, mirroring existing patterns
(NodePublic.Raw32 and DiscoPublicFromRaw32 already exist).
NodePrivate.Raw32 returns the raw 32 bytes of a node private key.
DiscoPrivateFromRaw32 parses a 32-byte raw value as a DiscoPrivate.
Updates tailscale/corp#24454
Change-Id: Ibc08bed14ab359eddefbebd811c375b6365c7919
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
* cmd/k8s-operator: use correct tailnet client for L7 & L3 ingresses
This commit fixes a bug when using multi-tailnet within the operator
to spin up L7 & L3 ingresses where the client used to create the
tailscale services was not switching depending on the tailnet used
by the proxygroup backing the service/ingress.
Updates: https://github.com/tailscale/corp/issues/34561
Signed-off-by: David Bond <davidsbond93@gmail.com>
* cmd/k8s-operator: adding server url to proxygroups when a custom tailnet has been specified
Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
(cherry picked from commit 3b21ac5504e713e32dfcd43d9ee21e7e712ac200)
---------
Signed-off-by: David Bond <davidsbond93@gmail.com>
Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
Co-authored-by: chaosinthecrd <tom@tmlabs.co.uk>
We did so for Linux and macOS already, so also do so for Windows. We
hadn't done so before because originally we never produced binaries for
it (due to our corp repo not needing them), and later because we had
no ./tool/go wrapper. But we have both of those things now.
Updates #18884
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
When a recording upload fails mid-session, killProcessOnContextDone
writes the termination message to ss.Stderr() and kills the process.
Meanwhile, run() takes the ss.ctx.Done() path and proceeds to
ss.Exit(), which tears down the SSH channel. The termination message
write races with the channel teardown, so the client sometimes never
receives it.
Fix by adding an exitHandled channel that killProcessOnContextDone
closes when done. run() now waits on this channel after ctx.Done()
fires, ensuring the termination message is fully written before
the SSH channel is torn down.
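The ordering guarantee can be illustrated with a reduced model. Only `exitHandled`, `run`, and `killProcessOnContextDone` come from the description above; the `session` type, its in-memory stderr, and `runOnce` are hypothetical stand-ins for the real SSH session.

```go
package main

import (
	"context"
	"fmt"
	"strings"
	"sync"
)

// session is a hypothetical model of an SSH session: writes to stderr
// are lost once the channel has been torn down.
type session struct {
	ctx         context.Context
	mu          sync.Mutex
	stderr      strings.Builder
	closed      bool
	exitHandled chan struct{} // closed once the termination message is written
}

func (s *session) writeStderr(msg string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.closed {
		s.stderr.WriteString(msg)
	}
}

func (s *session) exit() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.closed = true
}

func (s *session) killProcessOnContextDone() {
	defer close(s.exitHandled) // signal run() only after the write below
	<-s.ctx.Done()
	s.writeStderr("session terminated\n")
}

func (s *session) run() {
	go s.killProcessOnContextDone()
	<-s.ctx.Done()
	<-s.exitHandled // without this wait, exit() could win the race
	s.exit()
}

func runOnce() string {
	ctx, cancel := context.WithCancel(context.Background())
	s := &session{ctx: ctx, exitHandled: make(chan struct{})}
	cancel()
	s.run()
	return s.stderr.String()
}

func main() {
	fmt.Printf("%q\n", runOnce())
}
```

Removing the `<-s.exitHandled` line makes the message delivery depend on goroutine scheduling, which is the flake described above.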
Fixes #7707
Change-Id: Ib60116c928d3af46d553a4186a72963c2c731e3e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
After we intercept a DNS response and assign magic and transit addresses,
we must communicate the assignment to our connector so that it can
direct traffic when it arrives.
Use the recently added peerapi endpoint to send the addresses.
Updates tailscale/corp#34258
Signed-off-by: Fran Bull <fran@tailscale.com>
This change reintroduces UserProfile.Groups, a slice that contains
the ACL-defined and synced groups that a user is a member of.
The slice will only be non-nil for clients with the node attribute
see-groups, and will only contain groups that the client is allowed
to see as per the app payload of the see-groups node attribute.
For example:
```
"nodeAttrs": [
{
"target": ["tag:dev"],
"app": {
"tailscale.com/see-groups": [{"groups": ["group:dev"]}]
}
},
[...]
]
```
UserProfile.Groups will also be gated by a feature flag for the time
being.
Updates tailscale/corp#31529
Signed-off-by: Gesa Stupperich <gesa@tailscale.com>
Users on FreeBSD run into a similar problem as has been reported for
Linux #11682 and fixed in #11682: because the tailscaled binaries
that we distribute are static and don't link cgo, tailscaled fails to
fetch group IDs that are returned via NSS when spawning an ssh child
process.
This change extends the fallback on the 'id' command that was put in
place as part of #11682 to FreeBSD. More precisely, we try to fetch
the group IDs with the 'id' command first, and only if that fails do
we fall back on the logic in the os/user package.
Updates #14025
Signed-off-by: Gesa Stupperich <gesa@tailscale.com>
I omitted a lot of the min/max modernizers because they didn't
result in clearer code.
Some of it's older "for x := range 123".
Also: errors.AsType, any, fmt.Appendf, etc.
Updates #18682
Change-Id: I83a451577f33877f962766a5b65ce86f7696471c
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The txRecords buffer had two compounding bugs that caused the
overflow guard to fire on every send tick under high DERP server load,
spamming logs at the full send rate (e.g. 100x/second).
First, int(packetTimeout.Seconds()) truncates fractional-second timeouts,
under-allocating the buffer. Second, the capacity was sized to exactly the
theoretical maximum number of in-flight records with no headroom,
and the expiry check used strict > rather than >=, so records at exactly
the timeout boundary were never evicted by applyTimeouts,
leaving len==cap on the very next tick.
Fixes tailscale/corp#37696
Signed-off-by: Mike O'Driscoll <mikeo@tailscale.com>
This hook addition is motivated by the Connectors 2025 work, in which
NATed "Transit IPs" are used to route interesting traffic to the
appropriate peer, without advertising the actual real IPs.
It overlaps with #17858, and specifically with the WIP PR #17861.
If that work completes, this hook may be replaced by other ones
that fit the new WireGuard configuration paradigm.
Fixes tailscale/corp#37146
Signed-off-by: Michael Ben-Ami <mzb@tailscale.com>
This had gotten flaky with Go 1.26.
Use synctest + AllocsPerRun to make it fast and deterministic.
Updates #18682
Change-Id: If673d6ecd8c1177f59c1b9c0f3fca42309375dff
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Otherwise it gets confused on new(123) etc.
Updates #18682
Change-Id: I9e2e93ea24f2b952b2396dceaf094b4db64424b0
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Fix its/it's, who's/whose, wether/whether, missing apostrophes
in contractions, and other misspellings across the codebase.
Updates #cleanup
Change-Id: I20453b81a7aceaa14ea2a551abba08a2e7f0a1d8
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Considerable latency was seen when using k8s-proxy with ProxyGroup
in the Kubernetes operator. Switching to L4 TCPForward solves this.
Fixes tailscale#18171
Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
Co-authored-by: chaosinthecrd <tom@tmlabs.co.uk>
OpenWrt is switching from opkg to an Alpine-like `apk` for package
installation. However, it does not use the same repo files as Alpine,
which makes installation fail.
Add support for the new repository files and ensure that the required
package detection system uses apk.
Updates #18535
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
This commit adds `--json` output mode to dns debug commands.
It defines structs for the data that is returned from:
`tailscale dns status` and `tailscale dns query <DOMAIN>` and
populates that as it runs the diagnostics.
When all the information is collected, it is either serialised to JSON
or built into a string, and returned to the user.
The structs are defined and exported so that Go consumers of this command
can use them for unmarshalling.
Updates #13326
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Remove the TS_EXPERIMENTAL_KUBE_API_EVENTS env var from the operator and its
helm chart. This has already been marked as deprecated, and has been
scheduled to be removed in release 1.96.
Add a check in helm chart to fail if the removed variable is set to true,
prompting users to move to ACLs instead.
Fixes: #18875
Signed-off-by: Becky Pauley <becky@tailscale.com>
Go 1.26's url.Parser is stricter and made our tests elsewhere fail
with this scheme because when these listen addresses get shoved
into a URL, it can't parse back out.
I verified this makes tests elsewhere pass with Go 1.26.
Updates #18682
Change-Id: I04dd3cee591aa85a9417a0bbae2b6f699d8302fa
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
runtime.NumCPU() returns the number of CPUs on the host, which in
containerized environments is the node's CPU count rather than the
container's CPU limit. This causes excessive memory allocation in
pods with low CPU requests running on large nodes, as each socket's
packetReadLoop allocates significant buffer memory.
Use runtime.GOMAXPROCS(0) instead, which is container-aware since
Go 1.25 and respects CPU limits set via cgroups.
Fixes #18774
Signed-off-by: Daniel Pañeda <daniel.paneda@clickhouse.com>
We use the TS_USE_CACHED_NETMAP knob to condition loading a cached netmap, but
were hitherto writing the map out to disk even when it was disabled. Let's not
do that; the two should travel together.
Updates #12639
Change-Id: Iee5aa828e2c59937d5b95093ea1ac26c9536721e
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
After fixing the flaky tests in #18811 and #18814 we can enable running
the natlab test suite on CI generally.
Fixes #18810
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
This is a minimal hacky fix for a case where the portlist poller extension
could miss updates to NetMap's CollectServices bool.
Updates tailscale/corp#36813
Change-Id: I9b50de8ba8b09e4a44f9fbfe90c9df4d8ab4d586
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
PR #18860 adds firewall rules in the mangle table to save outbound packet
marks to conntrack and restore them on reply packets before the routing
decision. When reply packets have their marks restored, the kernel uses
the correct routing table (based on the mark) and the packets pass the
rp_filter check.
This makes the risk check and reverse path filtering warnings unnecessary.
Updates #3310
Fixes tailscale/corp#37846
Signed-off-by: Mike O'Driscoll <mikeo@tailscale.com>
When a Linux system acts as an exit node or subnet router with strict
reverse path filtering (rp_filter=1), reply packets may
be dropped because they fail the RPF check. Reply packets arrive on the
WAN interface but the routing table indicates they should have arrived
on the Tailscale interface, causing the kernel to drop them.
This adds firewall rules in the mangle table to save outbound packet
marks to conntrack and restore them on reply packets before the routing
decision. When reply packets have their marks restored, the kernel uses
the correct routing table (based on the mark) and the packets pass the
rp_filter check.
Implementation adds two rules per address family (IPv4/IPv6):
- mangle/OUTPUT: Save packet marks to conntrack for NEW connections
  with non-zero marks in the Tailscale fwmark range (0xff0000)
- mangle/PREROUTING: Restore marks from conntrack to packets for
  ESTABLISHED,RELATED connections before routing decision and rp_filter
  check
The workaround is automatically enabled when UseConnmarkForRPFilter is
set in the router configuration, which happens when subnet routes are
advertised on Linux systems.
Both iptables and nftables implementations are provided, with automatic
backend detection.
Fixes #3310
Fixes #14409
Fixes #12022
Fixes #15815
Fixes #9612
Signed-off-by: Mike O'Driscoll <mikeo@tailscale.com>
We should only add one entry to our magic ips for each domain+dst and
look up any existing entry instead of always creating a new one.
Fixes tailscale/corp#34252
Signed-off-by: Fran Bull <fran@tailscale.com>
To be less spammy in stable, add a knob that disables the creation and
processing of TSMPDiscoKeyAdvertisements until we have a proper rollout
mechanism.
Updates #12639
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
The "public key moved" panic has caused confusion on multiple occasions,
and is a known issue for Mullvad. Add a loose heuristic to detect
Mullvad nodes, and trigger distinct panics for Mullvad and non-Mullvad
instances, with a link to the associated bug.
When this occurs again with Mullvad, it'll be easier for somebody to
find the existing bug.
If it occurs again with something other than Mullvad, it'll be more
obvious that it's a distinct issue.
Updates tailscale/corp#27300
Change-Id: Ie47271f45f2ff28f767578fcca5e6b21731d08a1
Signed-off-by: Alex Chan <alexc@tailscale.com>
Subtle floating point imprecision can propagate and lead to
trigonometric functions receiving inputs outside their
domain, thus returning NaN. Clamp the input to the valid domain
to prevent this.
Also adds a fuzz test for SphericalAngleTo.
Updates tailscale/corp#37518
Signed-off-by: Amal Bansode <amal@tailscale.com>