mirror of https://github.com/lightpanda-io/browser.git synced 2026-06-11 09:35:59 -04:00

Files

Francis Bouvier 8824174afa agent: run recorded scripts as JavaScript

Replace recorded agent replay with a standalone JavaScript script runtime.
Install synchronous agent primitives in an isolated V8 context, add console
output, return structured extract values as JS objects/arrays, and route script
execution through the new runtime.

Update recording to emit .js function calls, default /save filenames to .js,
drop script-level extract save support, and refresh agent docs/tutorials for
the new format.

Self-healing is disabled for now.

2026-06-01 15:43:37 +02:00

16 KiB

Raw Blame History

Agent tutorial — Hacker News, end-to-end

This walks you from "I just built ./lightpanda" to a recorded, replayable JavaScript browser script, then captures the same flow from an external MCP client. Every section ends with a command you can run; nothing references later sections.

For the flag/command/tool tables, see agent.md. This document is the tutorial; that one is the reference. For the JavaScript runtime contract, see agent-script.md.

What you'll build

One session against Hacker News:

Log in with your account.
Confirm the login by reading the username out of the header.
Record the whole flow to a .js file.
Run it offline, with no LLM.
Add local JavaScript logic around extract(...) results.
Record the same browser actions from an external agent over MCP.

Prerequisites

./lightpanda on your PATH (build with zig build).
A Hacker News account.
One LLM API key for sections that use natural language — Anthropic, OpenAI, Gemini, or a local Ollama. Recorded .js scripts run with no key at all.

Export your HN credentials as LP_* env vars. The convention is LP_<SITE>_<FIELD> — a short site identifier (HN for Hacker News, GH for GitHub, …) lets you keep credentials for multiple sites in your environment without collisions. The unprefixed LP_USERNAME / LP_PASSWORD form is the generic fallback when you only have one site.

In bash or zsh:

export LP_HN_USERNAME="your-hn-handle"
export LP_HN_PASSWORD="your-hn-password"

In fish:

set -gx LP_HN_USERNAME "your-hn-handle"
set -gx LP_HN_PASSWORD "your-hn-password"

The LP_ prefix matters. The agent resolves $LP_* references inside the Lightpanda subprocess, so the literal secret never enters the LLM context; and the getEnv tool refuses to read anything that doesn't start with LP_, so the model can't probe your other env vars.

Verify they're set before continuing — substitution fails silently if a variable is missing (the literal $LP_HN_USERNAME ends up typed into the form), and the /fill confirmation message intentionally echoes the placeholder name rather than the resolved value, so the response text won't tell you. Confirm directly:

./lightpanda agent --no-llm
> /getEnv LP_HN_USERNAME

/getEnv returns the literal value if set, or "not set" if missing. Only $LP_* references in fill values are substituted; other $ characters in your password (my$ecret, $5.99) are passed through verbatim.

1. First contact: the REPL

./lightpanda agent

On startup the agent prints a one-line notice telling you which mode it landed in — which provider (and which env var won), or "basic REPL (no LLM)" if no key is set. The REPL writes its history to .lp-history in the working directory, so up-arrow works across runs.

Try the meta commands:

> /help
> /help goto
> /quit

/help lists every browser tool. /help <tool> prints its JSON schema. /quit exits cleanly. If you have no API key yet and want to poke around without an LLM, ./lightpanda agent --no-llm forces the basic REPL.

2. The shortest possible win: `--task`

Before doing anything complicated, prove the LLM + browser stack works end-to-end:

./lightpanda agent --task "what is the top story on news.ycombinator.com?"

--task runs a single user turn, prints the final answer on stdout, and exits. Tool calls, progress, and errors all go to stderr, so redirecting stdout gives you a clean answer:

./lightpanda agent --task "top story on news.ycombinator.com?" > out.txt

If you need to feed the model a local file, repeat -a <path> (or --attach <path>) for each one.

3. Driving the browser by hand

Now back to the REPL. We'll write the HN login flow one command at a time so you can see how each step depends on what the previous one showed.

> /goto https://news.ycombinator.com/login

/goto takes a single URL argument (positional, optionally quoted). The page is now loaded.

The REPL scripting surface is slash commands. click '#foo' (no leading slash) is forwarded to the LLM as a natural-language prompt; only /click '#foo' runs as a command. TAB completion in the REPL helps you find the right tool name.

Inspect it before clicking anything:

> /tree

/tree prints the semantic tree to stdout. Two forms are visible — the login form and the create-account form below it — and each one contains two unlabeled textboxes:

8 form
 13 'username:'
 15 [i] textbox
 18 'password:'
 20 [i]
 22 [i] button 'login' value='login'
30 form
 35 'username:'
 37 [i] textbox
 …

Notice the textboxes have no accessible name — "username:" is a sibling text node, not a <label for="…">. This is typical of older pages. The ARIA-name tool reflects that:

> /findElement role=textbox name=username
[]

Empty result. Slash commands accept a single positional argument (for tools with one required field), key=value pairs, or a raw {json} blob; findElement filters by accessible name, which we don't have here.

For unlabeled login forms, jump to detectForms, which reads the HTML directly and surfaces each form's action plus each input's name attribute:

> /detectForms

You'll see two forms, the first with action: "login" and fields named acct and pw. That's enough to synthesize the CSS selector yourself: scope by form action to avoid colliding with the create-account form, then key on the input's name attribute.

Selector rule, load-bearing: the click-family tools (/click, /fill, /hover, /selectOption, /setChecked) accept CSS selectors only. The backend node IDs findElement and detectForms return are invalidated by any DOM mutation, and they cannot be serialized into durable JavaScript recordings — a session that uses them is not replayable. Always synthesize a CSS selector from the attributes (id, class, name, action, tag_name) and use that.

Now fill the form:

> /fill selector='form[action="login"] input[name="acct"]' value='$LP_HN_USERNAME'
> /fill selector='form[action="login"] input[name="pw"]' value='$LP_HN_PASSWORD'
> /click selector='form[action="login"] input[type="submit"][value="login"]'
> /waitForSelector '#logout'

$LP_* references in /fill values are resolved at execution time inside the subprocess. The LLM never sees the literal credential.

The /waitForSelector '#logout' line is doing two jobs at once, and it's worth unpacking because the pattern recurs in every recorded script:

It's a synchronization point. /click on the submit button returns as soon as the click dispatches, not after the server responds. Without a wait, the next command races the login redirect and may run against the pre-login DOM. /waitForSelector '<sel>' blocks until that selector appears, so the script resumes only after HN's logged-in page has rendered.
It's an implicit assertion. HN renders the #logout link only when the session is authenticated. If the credentials are wrong (or HN throws up a captcha, or rate-limits, or the form layout changed), #logout never appears and /waitForSelector times out — the script fails loudly at the line where the failure actually happened, instead of silently succeeding and producing garbage downstream.

You generally want a /waitForSelector like this after every state-changing action that triggers async work: pick a selector that only exists in the post-action state, and you get free regression protection. Waiting on the URL (location.pathname === "/news") or a generic element that exists on both pages is weaker — both can be true before the navigation finishes.

Confirm by pulling structured data off the page. /extract takes a JSON schema object — each value describes what to lift out, and the whole result is printed to stdout as one JSON object. The simplest form is a flat selector lookup:

> /extract '{"karma": "#karma"}'
{"karma":"42"}

The schema grammar is small but covers the cases you'd reach for:

"<sel>" — textContent.trim() of the first match (string, or null if no match).
"" — the matched element's own text (only meaningful inside a fields block, where there's an outer element to refer to).
["<sel>"] — text of every match (string array).
{"selector": "<sel>", "attr": "<name>"} — the first match's attribute value.
[{"selector": "<sel>", "fields": {…}}] — array of objects, where each fields entry is resolved relative to the matched element.

After a page-changing action (click, navigation, form submit) the previous /tree snapshot is stale; re-inspect before the next interaction. Hop back to the front page and pull the story list to exercise the structured form:

> /goto https://news.ycombinator.com
> /extract '''
{
  "topStories": [{
    "selector": ".athing",
    "fields": {
      "rank": ".rank",
      "title": ".titleline > a",
      "url": {"selector": ".titleline > a", "attr": "href"}
    }
  }]
}
'''

Triple-quoted (''' or """) values let a schema span multiple lines — the REPL keeps reading until it sees the matching closing quote. The result is a single JSON object printed to stdout: {"topStories":[{"rank":"1","title":"…","url":"…"}, …]}.

The schema is parsed in Zig before the page-side walker runs, so a typo like a stray comma surfaces here as Error: invalid /extract schema JSON instead of a confusing V8 stack trace.

4. Recording the session

The same flow, but recorded to a file. Quit the REPL, then:

./lightpanda agent -i hn_login.js

-i <path> opens an interactive REPL that appends state-mutating commands to <path>. Retype the same sequence — login (/goto, two /fills, /click, /waitForSelector), then the front-page hop and structured pull (/goto, multi-line /extract) — then /quit.

Inspect the result:

cat hn_login.js

You should see the seven mutating commands and nothing else — no /tree, no /markdown, no read-only lookups. The recorder filters on a per-tool flag (ToolDef.recorded) so read-only inspection never pollutes the script; /extract is recorded (it changes what the script can read on replay even though it doesn't mutate the page). The saved file is JavaScript:

goto("https://news.ycombinator.com/login");
fill({ selector: "form[action=\"login\"] input[name=\"acct\"]", value: "$LP_HN_USERNAME" });
fill({ selector: "form[action=\"login\"] input[name=\"pw\"]", value: "$LP_HN_PASSWORD" });
click({ selector: "form[action=\"login\"] input[type=\"submit\"][value=\"login\"]" });
waitForSelector("#logout");
goto("https://news.ycombinator.com");
extract({ topStories: [{ selector: ".athing", fields: { rank: ".rank", title: ".titleline > a", url: { selector: ".titleline > a", attr: "href" } } }] });

Natural-language REPL turns are not saved as executable JavaScript. When a natural-language turn produces recorded browser actions, the prompt is kept as a // comment above those actions.

5. Running deterministically

./lightpanda agent hn_login.js

No --provider, no LLM, no token spend. The recorded script runs top to bottom against a fresh browser. This is the form you want for regression tests and CI.

A JavaScript script does not print its final expression automatically. The recorded extract(...) call returns a local JavaScript value, so edit the final line when you want replay output:

const topStories = extract({
  topStories: [{
    selector: ".athing",
    fields: {
      rank: ".rank",
      title: ".titleline > a",
      url: { selector: ".titleline > a", attr: "href" }
    }
  }]
});

console.log(JSON.stringify({ topStories }, null, 2));

Run it again and stdout is clean JSON:

./lightpanda agent hn_login.js > stories.json

/login and /acceptCookies are REPL-only LLM triggers. A pure recording from -i never contains them; the recorder captures the resulting browser tool calls instead. Lines that are neither slash commands nor comments are also REPL-only conveniences, not script syntax.

6. Local JavaScript logic

Agent scripts run in a separate JavaScript context from the web page. There is no window, document, DOM API, require, process, or Node standard library in that context. Browser interaction happens through the installed primitives, and extract(...) is the usual way to move page data into local script logic.

For example, turn the extraction result into a smaller report without running any page-side JavaScript:

goto("https://news.ycombinator.com");

const topStories = extract({
  topStories: [{
    selector: ".athing",
    limit: 5,
    fields: {
      rank: ".rank",
      title: ".titleline > a",
      url: { selector: ".titleline > a", attr: "href" }
    }
  }]
});

const report = topStories.map((story) => ({
  rank: story.rank,
  title: story.title,
  url: story.url
}));

console.log(JSON.stringify({ report }, null, 2));

Use the eval(...) primitive only when you intentionally want to run a string in the current page's JavaScript context. Page eval cannot see agent variables or call goto, extract, and the other agent primitives.

7. Same flow, external agent (MCP)

Everything above used Lightpanda's built-in agent. If you're driving Lightpanda from a different agent (Claude Code, a custom MCP client, your own harness), use lightpanda mcp instead — same browser tools, no API key on the Lightpanda side, the calling agent supplies the LLM.

{
  "mcpServers": {
    "lightpanda": {
      "command": "/path/to/lightpanda",
      "args": ["mcp"]
    }
  }
}

Record a session over MCP

From the external agent, call:

recordStart { "path": "hn_login.js" } — begins appending state-mutating tool calls as JavaScript to hn_login.js. The path must be relative and free of ...
The same browser tools you'd call anyway: goto, fill, click, waitForSelector. Each one that succeeds is appended as a JavaScript primitive call; query-only tools (tree, markdown, findElement, consoleLogs) are never recorded.
recordComment { "text": "logged in" } — drop a // breadcrumb above the next recorded line. Useful for marking the boundary between LLM-driven phases.
recordStop {} — closes the recording and returns {path, line_count}.

The output file uses the same JavaScript format as -i hn_login.js from section 4. It runs via the agent CLI without modification:

./lightpanda agent hn_login.js

About `scriptStep` and `scriptHeal`

recordStart now records JavaScript agent scripts. The MCP scriptStep and scriptHeal tools are still available for external agents that want to run and repair one slash-command line at a time, but that is a separate PandaScript line-healing workflow:

Read the PandaScript file you are healing. For each non-blank, non-comment line, call scriptStep { "line": "<line>" }. Comments and blanks are no-ops on the Lightpanda side.
On isError: true, the structured error message tells you what failed. Hand the current page state and the failing line to your own LLM; have it return a replacement PandaScript line (or several).
Call scriptHeal { "path": "...", "replacements": [{ "original_line": "...", "replacement_lines": ["..."] }] }. Each original_line must match verbatim. Lightpanda writes <path>.bak first, then atomically rewrites the file with the # [Auto-healed] Original: … header prepended to the first replacement.
Continue from the next line.

scriptStep deliberately does not auto-record: the script is already the source of truth during replay, so double-recording would diverge the file from itself. /login, /acceptCookies, and any line that isn't a slash command are rejected — those need an LLM, which is the caller's responsibility.

Where to go next

agent.md — full reference: every flag, every slash command, every browser tool, plus the security model and auto-detection rules.
agent-script.md — JavaScript runtime, primitives, return values, and complete script examples.
lightpanda mcp --help and lightpanda agent --help — current flag listings straight from the binary.

16 KiB Raw Blame History