If a worker is running heavy JS, e.g. `for(;;)`, it stays stuck forever, even
if the peer closes the CDP connection.
With CDP reads now owned by the network thread, we correctly detect the
disconnect and simply need to force the worker to shut down. To achieve this,
on socket close, the CdpLink held by the network is given a terminate_ms (five
seconds from now) and added to a linked list. On every wakeup, the network
thread checks the list and each timestamp and, if necessary, calls
Isolate::Terminate (which is safe to call from a different thread).
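
A rough model of the deadline check described above, in Python purely for
illustration (the real code is Zig). `PendingTerminations`, `on_socket_close`,
`on_wakeup` and the plain list standing in for the linked list are all
hypothetical names; `terminate()` is a stand-in for the Isolate::Terminate call.

```python
from dataclasses import dataclass

TERMINATE_GRACE_MS = 5_000  # "five seconds from now", per the message above


@dataclass
class CdpLink:
    name: str
    terminate_ms: int = 0      # deadline, set when the socket closes
    terminated: bool = False

    def terminate(self) -> None:
        # Stand-in for Isolate::Terminate on the worker's isolate.
        self.terminated = True


class PendingTerminations:
    """Hypothetical holder for links whose socket closed but whose worker is still alive."""

    def __init__(self) -> None:
        self.pending: list[CdpLink] = []  # a plain list instead of a linked list

    def on_socket_close(self, link: CdpLink, now_ms: int) -> None:
        link.terminate_ms = now_ms + TERMINATE_GRACE_MS
        self.pending.append(link)

    def on_wakeup(self, now_ms: int) -> None:
        # Terminate every link whose deadline has passed; keep the rest.
        still_waiting = []
        for link in self.pending:
            if now_ms >= link.terminate_ms:
                link.terminate()
            else:
                still_waiting.append(link)
        self.pending = still_waiting
```

The sketch leaves out the case where the worker exits on its own before the
deadline, in which case the link would simply be removed from the list.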
Currently, if a disconnect/close is captured in a worker during a syncRequest,
that specific request is terminated, but the error doesn't bubble up. The worker
remains alive and will subsequently block in a perform, with no connection alive
to wake it up.
In this commit, when disconnect/close is received, inbox.terminate is set to
true. This flag is checked (in syncRequest and http_client.tick) and
error.ClientDisconnected is returned.
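
As a sketch of the idea (Python, illustrative only): the disconnect sets a
sticky flag on the inbox, and the blocking loop checks it on every iteration
instead of swallowing the failure. `Inbox.push`, `sync_request` and `poll_once`
are hypothetical names; only `inbox.terminate` and `ClientDisconnected` come
from this commit.

```python
from collections import deque


class ClientDisconnected(Exception):
    """Models error.ClientDisconnected."""


class Inbox:
    def __init__(self) -> None:
        self.messages = deque()
        self.terminate = False

    def push(self, kind: str, payload=None) -> None:
        # A disconnect/close is sticky: once set, every later check sees it.
        if kind in ("disconnect", "close"):
            self.terminate = True
        else:
            self.messages.append((kind, payload))


def sync_request(inbox: Inbox, poll_once):
    """Loop until poll_once() yields a response, bailing out on disconnect."""
    while True:
        if inbox.terminate:
            raise ClientDisconnected()
        response = poll_once()
        if response is not None:
            return response
```

Because the error now propagates, the caller can unwind instead of going on to
block in a later perform with nothing left to wake it.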
(Also, on network shutdown, always broadcast the cdp_unregister since there's
no harm in sending an extra signal even if nothing was removed).
The CDP server ignored a single SIGTERM while a connection was live: the
process only exited if the socket was closed before the signal, or after a
third signal. A conventional one-shot graceful stop (SIGTERM then waitpid)
hung.
On shutdown the sighandler runs Network.stop (which sets `shutdown` and lets
the run loop exit) before Server.shutdown. A live CDP worker parks in
curl_multi_poll and is woken ONLY by the Network thread via
dropCdp -> handles.wakeup(). Once the run loop exits with links still live,
nothing can wake those workers, so Server.deinit()'s
`while (active_threads > 0)` loop spins forever.
Drop every still-live CDP link from the run loop when `shutdown` is set,
reusing the existing peer-EOF path: dropCdp(notify=true) pushes a .disconnect
into the worker's inbox and wakes it, so cdp.tick() returns false and the
worker exits before the loop breaks.
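
A simplified model of that shutdown path (Python, illustrative only).
`NetworkLoop`, `run_once` and the `Worker` fields are hypothetical; the
`drop_cdp(notify=True)` behaviour mirrors the description above.

```python
class Worker:
    def __init__(self) -> None:
        self.inbox: list[str] = []
        self.woken = False

    def tick(self) -> bool:
        # Models cdp.tick(): False tells the worker to exit.
        return "disconnect" not in self.inbox


class NetworkLoop:
    def __init__(self, workers) -> None:
        self.links = list(workers)
        self.shutdown = False

    def drop_cdp(self, worker: Worker, notify: bool) -> None:
        self.links.remove(worker)
        if notify:
            worker.inbox.append("disconnect")
            worker.woken = True  # stands in for handles.wakeup()

    def run_once(self) -> bool:
        """Return False when the run loop should exit."""
        if self.shutdown:
            # Drop every still-live link BEFORE leaving the loop, so no
            # worker is left parked with nobody able to wake it.
            for worker in list(self.links):
                self.drop_cdp(worker, notify=True)
            return False
        return True
```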
preparePollFds cleared and rebuilt the curl portion of `pollfds` every
loop iteration, but sliced `pollfds[PSEUDO_POLLFDS..]` — all the way to
the end of the array. That range also covers the CDP socket region
`[cdp_start..]`, which prepareCdpPollFds owns and only rebuilds when
`cdp_dirty` is set (a steady-state optimization). So the @memset wiped
every live CDP socket fd to -1 on each iteration.
This only bites once Network owns a curl multi handle, which is created
solely by telemetry — and telemetry is disabled in Debug builds, which
is why it reproduced only in ReleaseFast/ReleaseSafe (and the nightly).
Regular HTTP/navigation runs on the worker's own handles, not Network's
multi, so it never triggered the path in Debug.
Once the CDP sockets are dropped from the poll set, the Network thread
stops reading client messages (#2508, hard stall after the first
command) and never observes peer EOF or `conn.shutdown`, so the worker
is never told to exit and SIGTERM is ignored after a connection (#2507).
Fix: slice only the curl region `[PSEUDO_POLLFDS..cdp_start]`.
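
The slicing mistake is easy to see on a toy array (Python, illustrative only).
The layout `[pseudo | curl | cdp]` follows the description above; the value of
`PSEUDO_POLLFDS` and the fd numbers are made up.

```python
PSEUDO_POLLFDS = 2  # hypothetical count of leading pseudo entries


def reset_curl_region_buggy(pollfds: list[int]) -> None:
    # Old behaviour: pollfds[PSEUDO_POLLFDS..] also wipes the CDP region.
    for i in range(PSEUDO_POLLFDS, len(pollfds)):
        pollfds[i] = -1


def reset_curl_region(pollfds: list[int], cdp_start: int) -> None:
    # Fixed behaviour: only pollfds[PSEUDO_POLLFDS..cdp_start] is reset.
    for i in range(PSEUDO_POLLFDS, cdp_start):
        pollfds[i] = -1


# Layout: 2 pseudo fds, 2 curl fds, 2 CDP socket fds.
before = [3, 4, 10, 11, 20, 21]

buggy = list(before)
reset_curl_region_buggy(buggy)        # CDP fds 20 and 21 are lost

fixed = list(before)
reset_curl_region(fixed, cdp_start=4)  # CDP fds survive
```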
Also harden the poll timeout: `curl_multi_timeout` returns -1 when curl
is idle, and `@min(250, -1)` is -1 (block forever), which starved
onTick (telemetry's periodic flush) and turned any missed wakeup into a
permanent hang rather than a <=250ms blip. Treat curl_timeout <= 0 as
"no deadline" and fall back to the 250ms cap.
Fixes #2507
Fixes #2508
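
For the poll-timeout hardening described above, the clamp amounts to the
following (Python, illustrative only; `poll_timeout_ms` is a hypothetical
helper name, the 250ms cap and the `<= 0` rule come from this commit):

```python
POLL_CAP_MS = 250


def poll_timeout_ms(curl_timeout_ms: int) -> int:
    # curl reports -1 when it has no pending timeout. A naive
    # min(250, -1) yields -1, i.e. "block forever".
    if curl_timeout_ms <= 0:
        return POLL_CAP_MS
    return min(POLL_CAP_MS, curl_timeout_ms)
```

With this, an idle curl handle results in a 250ms poll, so onTick still runs
and a missed wakeup costs at most one poll interval.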
Replaces 4 boolean flags with a single state. This makes it easier to figure
out what the state of the transfer is, and removes the possibility of
inconsistent flags, e.g. queued + loop_owned.
loop_owned -> state != .created
_queued -> state == .queued
_perform -> state == .completing
aborted -> state == .aborted
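
The mapping above can be modelled like this (Python, illustrative only). Only
the four variants named in this message are listed; the real state may well
have more. The properties exist just to show how each old flag is derived.

```python
from enum import Enum, auto


class State(Enum):
    CREATED = auto()
    QUEUED = auto()
    COMPLETING = auto()
    ABORTED = auto()


class Transfer:
    def __init__(self) -> None:
        self.state = State.CREATED

    # Old flags, expressed as read-only views over the single state.
    @property
    def loop_owned(self) -> bool:
        return self.state is not State.CREATED

    @property
    def queued(self) -> bool:
        return self.state is State.QUEUED

    @property
    def performing(self) -> bool:  # was `_perform`
        return self.state is State.COMPLETING

    @property
    def aborted(self) -> bool:
        return self.state is State.ABORTED
```

Since there is exactly one state value, combinations such as "queued and
aborted" can no longer be represented.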
This largely reverts 92607ad765 (captured in PR:
https://github.com/lightpanda-io/browser/pull/2398).
https://github.com/lightpanda-io/browser/pull/2495 introduces protection against
executing arbitrary CDP commands during JavaScript callbacks. Claude initially
made the case for keeping the existing code as a safety net, but sycophanted
when I pushed back.
My reason for removing it is that it isn't a low-maintenance guard. It's a flag
that serves a real purpose (ensuring 1 JS script is finished before executing
another one) and that has been extended to solve these issues. It needs to be
set (and reverted) at every callsite that makes a blocking call, and it needs to
be checked (recursively across all frames) in any place that can tear down the
page/frame.
Claude called the allowlist "load-bearing in a non-obvious way", but I think
it's purpose-built specifically for this case. Extended the comment atop
`allowDuringSyncWait` so that future-selves remember this.
Previously, the CDP socket was added to the worker's multi and fully owned
by the worker. While this is simple, it introduced some issues:
1 - Cannot detect a disconnected client during JS processing (`for(;;)`)
2 - A blocked worker can cause back-pressure that blocks the client. This can
cause a deadlock if the worker is blocked waiting for a CDP message
In addition to these 2 problems, there was 1 other serious CDP-related issue:
arbitrary CDP messages could be processed during a JavaScript callback. For
example, a Worker calls importScripts while request interception is enabled;
this requires us to tick the HttpClient waiting for the interception response.
But a client could send Target.closeTarget, which we'd process and delete the
frame, all while importScripts is still blocked. Assuming importScripts
unblocks, everything is a big UAF since the frame (and its workers) were cleared
by closeTarget.
The CDP socket is now read from the network (main) thread and an OTP-style
mailbox is used. The network thread posts messages to the Worker's inbox and
signals it to wake up. This solves problems 1 and 2 above. It doesn't directly
solve the reentrancy issue, but it provides the foundation. Specifically, it
introduces a queue of CDP messages and more control over when/how that queue is
processed. At "safe points" (Runner.tick, HttpClient.tick), any message can
be processed. But when inside a JavaScript callback, we can process only
non-destructive/non-mutating messages. Specifically, we can process only
messages related to request interception.
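
A small model of the safe-point rule (Python, illustrative only). `Mailbox`,
`drain` and the contents of `INTERCEPTION_METHODS` are assumptions: the Fetch
methods listed are real CDP interception commands, but which ones the allowlist
actually admits is not stated here.

```python
from collections import deque

# Assumed allowlist, for illustration only.
INTERCEPTION_METHODS = {
    "Fetch.continueRequest",
    "Fetch.fulfillRequest",
    "Fetch.failRequest",
}


class Mailbox:
    def __init__(self) -> None:
        self.queue = deque()

    def post(self, method: str) -> None:
        # Called by the network thread; the worker is then signalled.
        self.queue.append(method)

    def drain(self, in_js_callback: bool) -> list[str]:
        """Return the messages that may run now; the rest stay queued in order."""
        ready, deferred = [], deque()
        for method in self.queue:
            if not in_js_callback or method in INTERCEPTION_METHODS:
                ready.append(method)
            else:
                deferred.append(method)
        self.queue = deferred
        return ready
```

In the importScripts scenario, a Target.closeTarget that arrives mid-callback
stays in the queue until the next safe point, while the interception response
the callback is waiting for goes through.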
network/WsConnection.zig was poorly named. It didn't represent a generic WS
connection, but rather a CDP-specific connection. This splits the generic WS
logic into network/WS.zig and the CDP-specific details in cdp/Connection.zig.
Some of the connection management in the Server has also been simplified.
This is just moving fields around. The end result is that there's a
`transfer.req` and a `transfer.res`.
On the Request side, we used to have a nested `params: RequestParam` resulting
in a lot of `transfer.req.params.url`. This is now `transfer.req.url`. On the
Response side, we had the exact opposite: response fields splattered directly
in the transfer, `transfer.response_header`. This is now `transfer.res.header`.
There is now an HttpClient.Response, which is the actual final response (which
could be for a transfer or something else, e.g. the cache). And an
HttpClient.Transfer.Response which captures the inflight response data (and is
one of the polymorphic variants of the HttpClient.Response). Probably still not
ideal, but I'm not sure how to make it cleaner, and even if this is just an
intermediary step, I consider it a small win.
1 - Track owner of a request (for simpler / more accurate abort (TBD))
2 - Create Transfer upfront, make everything work on Transfer (not Request)
This helps remove ambiguity about cleanup and simplifies layers. For example,
the Robots request is just another normal request, not a special case. This
gives everything a stable address (the *Transfer, which can be looked up by id).
This change causes lightpanda to display the actual port number (instead of 0)
when binding a dynamic port (--port 0), which makes automating based on
scraping lightpanda output simple.
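
The general technique, sketched in Python for illustration (not lightpanda's
Zig code): after binding to port 0, ask the socket which port the OS actually
assigned and report that. `bind_and_report` is a hypothetical helper.

```python
import socket


def bind_and_report(host: str = "127.0.0.1", port: int = 0):
    """Bind a listening socket and return it with the port actually in use."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, port))
    sock.listen()
    # With port 0 the OS picks a free port; getsockname() reveals it.
    actual_port = sock.getsockname()[1]
    return sock, actual_port


sock, actual_port = bind_and_report()
print(f"listening on 127.0.0.1:{actual_port}")
sock.close()
```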