mirror of https://github.com/lightpanda-io/browser.git synced 2026-06-11 01:25:53 -04:00

Files

Adrià Arrufat d06bf747f4 agent: clarify synchronous nature of primitives

Update agent.md and the LLM synthesis prompt to explicitly state
that agent primitives are synchronous and blocking, and should not
use async/await or Promises.

2026-06-03 20:11:55 +02:00

17 KiB

Raw Permalink Blame History

Agent mode

Looking for a step-by-step walkthrough instead of a reference? See agent-tutorial.md — it builds one end-to-end Hacker News scenario covering the REPL, recording, replay, and the MCP roundtrip.

Looking for the JavaScript script format? See agent-script.md for the runtime contract and primitive API.

lightpanda agent runs a browsing agent backed by Lightpanda's headless engine. It can act as:

an LLM agent that drives the browser with tool calls (--provider),
a scripted runner that runs a recorded .js script deterministically,
a basic REPL for hand-driven slash commands with no LLM at all,
a one-shot task runner that prints a single answer to stdout (--task).

All four modes share the same browser tools (goto, click, fill, tree, markdown, search, ...). The same set is exposed over MCP via lightpanda mcp, so an agent script and an MCP client see the same surface — that is also the way to drive Lightpanda from an external LLM agent (Claude Code, etc.) without giving Lightpanda its own API key.

Quick start

# Interactive REPL — auto-detects an API key from your environment
./lightpanda agent

# Force a specific provider
./lightpanda agent --provider anthropic

# Basic REPL (no LLM, slash commands only)
./lightpanda agent --no-llm

# Run a saved script, then exit
./lightpanda agent session.js

# One-shot: ask a question, capture the answer on stdout
./lightpanda agent --task "what is on the front page of hn?"

Providers and API keys

Provider	Flag	API key env
Anthropic	`--provider anthropic`	`ANTHROPIC_API_KEY`
OpenAI	`--provider openai`	`OPENAI_API_KEY`
Gemini	`--provider gemini`	`GOOGLE_API_KEY` or `GEMINI_API_KEY`
Ollama	`--provider ollama`	none (local)

Defaults: --model falls back to a sensible per-provider default; in the REPL, /provider <name> and /model <name> change the current selection (Tab completes the candidates). --base-url overrides the API endpoint (Ollama defaults to http://localhost:11434/v1).

Provider auto-detection

When --provider is omitted, lightpanda picks one in this order, printing a one-line notice (on stderr) of what it chose:

Remembered → the provider/model you last selected with /provider or /model, persisted per-directory in .lp-agent.zon, as long as its key is still set.
Auto-detected → otherwise the first key found in priority order (ANTHROPIC_API_KEY → GOOGLE_API_KEY/GEMINI_API_KEY → OPENAI_API_KEY). Switch any time with /provider in the REPL, or override with --provider.
No keys set → falls back to the basic REPL (slash commands only). Natural language, /login, and /acceptCookies will reject.

Ollama is never auto-detected (no env var to look at) — pass --provider ollama, or select it once with /provider ollama and it'll be remembered.

--no-llm is the explicit bypass: it forces the basic REPL even when an API key is present or --provider is set. Use it to test slash commands without burning tokens, or to disable the LLM in a saved command without editing the existing flags. --no-llm wins over --provider.

REPL Slash Commands

The REPL uses a tiny slash-command language for browser actions. Each command is /<tool> [args], a # comment, or blank. There is no other syntax in basic REPL mode: anything that doesn't match those three forms is a parse error.

Slash commands accept any of:

a single positional value, when the tool has exactly one required field — /goto 'https://example.com', /extract '{"karma":"#karma"}';
key=value pairs — values may be bare or quoted; strings with whitespace must be quoted (/fill selector='#email' value='user@x.com');
a raw {json} blob — handed straight to the tool (/findElement {"role":"button"}).

Tools whose selector is optional (e.g. /click, /hover, /findElement) have zero required fields, so they don't take a positional and must be written as key=value: /click selector='a.login', not /click 'a.login'.

Quoting is content-aware: '…', "…", and triple-quoted '''…''' / """…""" for values that mix both quote styles or span multiple lines. Recorded JavaScript scripts use the equivalent function-call form instead of slash lines.

Two slash commands have no underlying tool — they trigger an LLM turn that the agent translates into actual tool calls:

Command	Notes
`/login`	LLM-driven: fills credentials from `$LP_*` env vars.
`/acceptCookies`	LLM-driven: dismiss the consent banner.

Both require an LLM. --no-llm rejects them.

In the REPL (and only the REPL), a line that isn't a slash command and doesn't start with # is sent to the LLM as a natural-language prompt. To leave the REPL, use the /quit meta command.

Example script

# Log into the demo and grab the dashboard title and visible cards.
# Site-scoped vars (LP_<SITE>_<FIELD>) avoid collisions when you have
# credentials for several sites; the unprefixed form is the fallback.
/goto 'https://demo-browser.lightpanda.io/'
/acceptCookies
/fill selector='#email' value='$LP_DEMO_USERNAME'
/fill selector='#password' value='$LP_DEMO_PASSWORD'
/click selector='button[type="submit"]'
/waitForSelector '.dashboard'
/extract '{"title": ".dashboard h1", "cards": [".dashboard .card .name"]}'

/extract takes a JSON schema object — each value tells the extractor what to lift off the page, and the whole result is printed to stdout as a single JSON object. Supported value forms:

"<sel>" — textContent.trim() of the first match (string or null).
"" — the matched element's own text (only inside a fields block).
["<sel>"] — text of every match (string array). Sugar for [{"selector": "<sel>"}].
{"selector": "<sel>", "attr": "<name>"} — attribute of the first match.
[{"selector": "<sel>", "fields": {…}}] — array of records, each fields value resolved relative to the matched element.
Add "limit": N inside any array's object spec to cap matches at N (works for text, attribute, and fields shapes — e.g. [{"selector": ".story .title", "limit": 5}] for top 5 titles).

Use /extract '''…''' (or """…""") to spread a schema across multiple lines. The schema is parsed in Zig before the page-side walker runs, so a malformed schema fails with Error: invalid /extract schema JSON rather than a V8 stack trace. See agent-tutorial.md section 3 for a worked example against Hacker News.

Cross-call state with `lp.*`

/extract and /evaluate each return one value per call, but real scrapes often need to carry data forward — capture a list on one page, then walk it across navigations. Two primitives keep that simple.

save=<name> on /extract or /evaluate stashes the result in a Session-scoped store keyed by <name> instead of dumping it to stdout. The stored value is then exposed to every subsequent /evaluate as globalThis.lp.<name>:

/goto 'https://news.ycombinator.com/'

/extract save=front '''
{
  "stories": [{
    "selector": "tr.athing",
    "limit": 5,
    "fields": {
      "id":    {"attr": "id"},
      "title": ".titleline > a"
    }
  }]
}
'''

/evaluate '''
console.log(lp.front.stories[0].title);
'''

save=d commands print nothing on success so scripts pipe cleanly.

Auto-sync. Any mutation of lp.* inside an /evaluate is persisted at the end of the call. Adding a key (lp.x = …), updating a nested value (lp.front.stories[0].comments = […]), or removing a key (delete lp.x) all propagate to the store. The next /evaluate sees the update — even after a navigation, because the store lives Session-side, not on the page.

List → detail. A common scrape captures a list, then visits each row for more data. Capture the list with /extract save=<name>, then loop in /evaluate: read lp.<name>, goto each row's URL, and extract the detail — /evaluate's top-level await and full JS make the round-trip explicit.

Async evaluate. When a scrape needs logic /extract can't express, /evaluate is the escape hatch: top-level await works directly — the body runs as an async function, so use return to produce a value. runEval pumps the event loop until it settles, then surfaces the resolved value (or the rejection as an error). A body with no explicit return resolves to undefined, which evaluate treats as silent. Returned objects and arrays are serialized to JSON automatically, so no JSON.stringify is needed.

The store is script-run scoped: it's bound to the Session that runs the script, and goes away when that Session does. There is no cross-session persistence; if you need that, use localStorage (which is origin-scoped and persists across navigations within a session).

Saving and loading

From the REPL, /save [file.js] writes the session back to a .js file and /load <path> runs a script from disk against the current session.

State-mutating commands (/goto, /click, /fill, /scroll, /hover, /selectOption, /setChecked, /waitForSelector, /press, /evaluate, /extract) are saved; read-only commands (/tree, /markdown, /links, /findElement, …) and the natural-language turns that produced them are not. Natural-language turns are saved as // <prompt> comments above the resulting JavaScript calls so the script stays readable. In the basic REPL (--no-llm) /save transcribes the session deterministically; with an LLM it synthesizes an equivalent idiomatic script.

JavaScript Script Running

./lightpanda agent script.js runs without making any LLM call. Agent scripts are plain synchronous JavaScript plus the installed Lightpanda primitives:

goto("https://example.com");
click({ selector: "a.login" });
evaluate("document.title");

The script runs in an agent-only V8 context. It has no window, document, or DOM APIs. Browser interaction happens only through the installed primitives (goto, click, fill, evaluate, extract, and the other recorded browser actions). The primitives are synchronous and blocking — each returns its result directly, so write const data = extract(…), not await extract(…). There is no async/await/Promise contract around them. (evaluate(...) can run async JS inside the page, but the evaluate(...) call itself still returns synchronously.) It is not Node.js either: there is no require, process, fs, npm package loading, or Node standard library. The evaluate(...) primitive executes its string in the current page context; page scripts cannot see agent variables or agent primitives.

Tool errors throw JavaScript exceptions and stop execution. See agent-script.md for the full script format reference.

REPL features

Tab completion (case-insensitive): cycles through /<tool> and meta slash commands. The dim grey suffix shown after the cursor is the first match.
Persistent history: stored in .lp-history in the working directory.
Meta slash commands: /help lists tools (/help <tool> prints the JSON schema), /provider [name] and /model [name] change the active provider/model — Tab after the space completes from detected providers and the provider's fetched model list, and bare /provider//model print the current selection — /save [file.js] writes the session to a script and /load <path> runs one from disk (Tab completes file paths), /quit exits the REPL, /verbosity <low|medium|high> tunes the log level. These are REPL-only and never recorded.
```
> /goto https://example.com
> /findElement role=button name=Submit
> /evaluate {"script": "document.title"}
> /quit
```
Stdout vs stderr: the final assistant answer and data-producing slash commands (/extract, /evaluate, /markdown, /tree, …) write to stdout. Tool calls, progress, and errors go to stderr, so lightpanda agent --task ... > out.txt captures a clean answer.

One-shot mode (`--task`)

./lightpanda agent --provider gemini \
  --task "what is the top story on news.ycombinator.com?"

--task runs a single user turn, prints the final answer on stdout, and exits. Combine with -a <path> / --attach <path> (repeatable) to feed local files to providers that accept attachments. Text files are inlined into the prompt (max 512 KiB each); binary files (image/*, audio/*, pdf) are base64-encoded inline (max 20 MiB each). Unsupported MIME types error out before any browser work runs.

Driving Lightpanda from an external LLM agent

When the calling agent already has its own LLM (e.g. Claude Code), use lightpanda mcp rather than lightpanda agent. The MCP server exposes the same browser tools (goto, click, fill, ...) listed below, so the external agent does the planning while Lightpanda only drives the browser. No --provider or API key is required on the Lightpanda side.

{
  "mcpServers": {
    "lightpanda": {
      "command": "/path/to/lightpanda",
      "args": ["mcp"]
    }
  }
}

Tool names are camelCase and case-sensitive — there are no aliases. MCP clients must call the canonical tags (goto, evaluate, tree, save, …).

For sub-task delegation in the other direction — calling Lightpanda's own LLM-driven agent in a one-shot fashion — use --task on stdin instead.

Saving a script over MCP

lightpanda mcp exposes a save tool so an external agent can persist the session as a .js script for later deterministic replay. Unlike the standalone agent's /save, the MCP server has no LLM of its own — the calling client holds the conversation, so it synthesizes the script and passes it in:

Tool	Args	Effect
`save`	`{ path: string, script: string }`	Write `script` to `path` (relative, no `..`; created or overwritten) and return the absolute location and line count.

The tool's description carries the same synthesis guidance the agent's /save gives its LLM: prefer the builtins you called as tools (goto, click, fill, extract, …) as JavaScript calls, drop dead-ends, and keep $LP_* placeholders. Any literal LP_* value is scrubbed back to its placeholder before the file is written. The result runs without an LLM via ./lightpanda agent session.js.

Browser tools

The agent and MCP server share the tool set defined in src/browser/tools.zig. Highlights:

goto, search (Google with DuckDuckGo fallback on captcha)
tree, markdown, links, interactiveElements, structuredData, detectForms, nodeDetails, findElement
click, fill, hover, press, scroll, selectOption, setChecked, waitForSelector
evaluate, consoleLogs, getUrl, getCookies, getEnv

Selectors prefer CSS over backendNodeId for the click-family tools, since node IDs are invalidated by any DOM mutation. The system prompt enforces this for the LLM.

Security notes

The agent treats page content as untrusted data, not instructions. URLs surfaced by a page are not followed unless they match the user's task.
$LP_* environment variable references in /fill values are resolved at execution time inside the subprocess, so credentials never enter the LLM context. Conventional naming for site-scoped values is LP_<SITE>_<FIELD> (e.g. LP_HN_USERNAME, LP_GH_TOKEN); the unprefixed LP_USERNAME / LP_PASSWORD form is the generic fallback.
The getEnv tool only reads variables whose name starts with LP_. Everything else (provider API keys, system env, third-party secrets) reports "not set" so the model can't probe for it. The user controls what lives under LP_*. Note that getEnv returns the value to the model — fine for non-secret config like base URLs, but never call it on credentials (use $LP_* placeholders in fill values instead).
--obey-robots, --http-proxy, --user-agent, and the rest of the browser-level CLI flags apply to agent the same way they apply to serve, fetch, and mcp.
REPL prompts are persisted to .lp-history in the current working directory in plaintext (no encryption). Anything you type at the prompt — including natural-language context that accompanies a /login — lands in that file. Delete it or move out of sensitive directories if you don't want it retained.
save rejects empty, absolute, and .. paths, but does not follow up on symlinks. On a shared filesystem, a pre-existing symlink at the target would be written through to whatever it points at. Prefer a fresh directory you own when saving in untrusted environments.

17 KiB Raw Permalink Blame History