browser/docs/agent.md

# Agent mode

`lightpanda agent` runs a browsing agent backed by Lightpanda's headless engine.
It can act as:

- an **LLM agent** that drives the browser with tool calls (`--provider`),
- a **scripted runner** that replays a `.panda` script deterministically,
- a **dumb REPL** for hand-driven Pandascript with no LLM at all,
- a **one-shot task runner** that prints a single answer to stdout (`--task`),
- an **MCP server** that exposes the agent itself as a single `task` tool
  for other agents to delegate to (`--mcp`).

All five modes share the same browser tools (`goto`, `click`, `fill`, `tree`,
`markdown`, `search`, ...). The same set is exposed over MCP via `lightpanda
mcp`, so an agent script and an MCP client see the same surface.

## Quick start

```console
# Interactive REPL with an LLM
./lightpanda agent --provider anthropic

# Dumb REPL (no API key, Pandascript only)
./lightpanda agent

# Replay a recorded script
./lightpanda agent session.panda

# Replay then continue interactively, appending new commands to the file
./lightpanda agent -i session.panda

# One-shot: ask a question, capture the answer on stdout
./lightpanda agent --provider gemini --task "what is on the front page of hn?"

# MCP server: expose a single `task` tool for other agents to delegate to
./lightpanda agent --mcp --provider anthropic
```

## Providers and API keys

| Provider    | Flag                   | API key env                          |
|-------------|------------------------|--------------------------------------|
| Anthropic   | `--provider anthropic` | `ANTHROPIC_API_KEY`                  |
| OpenAI      | `--provider openai`    | `OPENAI_API_KEY`                     |
| Gemini      | `--provider gemini`    | `GOOGLE_API_KEY` or `GEMINI_API_KEY` |
| Ollama      | `--provider ollama`    | none (local)                         |

Defaults: `--model` falls back to a sensible per-provider default; `--base-url`
overrides the API endpoint (Ollama defaults to `http://localhost:11434/v1`).

Without `--provider`, the REPL still works for Pandascript commands. Natural
language, `LOGIN`, `ACCEPT_COOKIES`, and `--self-heal` all require a provider.

## Pandascript

Pandascript is a tiny, line-oriented DSL for browser actions. Each line is one
command. Comments start with `#`. Strings are quoted with `'`, `"`, or `'''…'''`
for values that mix both quote styles. Quoting rules are content-aware so that
recorded scripts round-trip through the parser.

| Command          | Form                                  | Notes                                                |
|------------------|---------------------------------------|------------------------------------------------------|
| `GOTO`           | `GOTO <url>`                          | Navigate. URL is unquoted.                           |
| `CLICK`          | `CLICK '<selector>'`                  | CSS selector.                                        |
| `TYPE`           | `TYPE '<selector>' '<value>'`         | Fills an input. `$LP_*` env refs auto-resolve.       |
| `WAIT`           | `WAIT '<selector>'`                   | Wait for selector.                                   |
| `SCROLL`         | `SCROLL [x] [y]`                      | Default `(0, 0)`.                                    |
| `HOVER`          | `HOVER '<selector>'`                  |                                                      |
| `SELECT`         | `SELECT '<selector>' '<value>'`       | `<select>` option by value.                          |
| `CHECK`          | `CHECK '<selector>' [true\|false]`    | Check / uncheck. Default `true`.                     |
| `EXTRACT`        | `EXTRACT '<selector>'`                | Returns text content.                                |
| `EVAL`           | `EVAL '<js>'` or `EVAL '''…'''`       | Triple-quote for multi-line JS.                      |
| `TREE`           | `TREE`                                | Print the semantic tree (not recorded).              |
| `MARKDOWN`       | `MARKDOWN`                            | Print page as markdown (not recorded).               |
| `LOGIN`          | `LOGIN`                               | LLM-driven: fill `$LP_USERNAME` / `$LP_PASSWORD`.    |
| `ACCEPT_COOKIES` | `ACCEPT_COOKIES`                      | LLM-driven: dismiss the consent banner.              |

In the REPL, anything that does not parse as a Pandascript command is sent to
the LLM as natural language. To leave the REPL, use the `/quit` slash command.

### Example script

```pandascript
# Log into the demo and grab the dashboard title.
GOTO https://demo-browser.lightpanda.io/
ACCEPT_COOKIES
TYPE '#email' '$LP_USERNAME'
TYPE '#password' '$LP_PASSWORD'
CLICK 'button[type="submit"]'
WAIT '.dashboard'
EXTRACT '.dashboard h1'
```

### Recording

Interactive sessions can write back to a `.panda` file:

```console
./lightpanda agent -i session.panda
```

State-mutating commands (`GOTO`, `CLICK`, `TYPE`, ...) are appended; read-only
commands (`TREE`, `MARKDOWN`) and the natural-language turns that produced
them are not. Natural-language turns are recorded as `# <prompt>` comments
above the resulting tool calls so the script stays readable.

### Replay and self-healing

`./lightpanda agent script.panda` replays without making any LLM call.

With `--self-heal --provider <p>`, a failed command (typically a stale
selector after the page changed) triggers a short LLM turn that inspects the
current page and emits a replacement command. The healed command runs, and
the original script line is rewritten in place so the next replay succeeds
deterministically.

Self-heal is constrained: at most one replacement per failure, capped LLM
budget, no navigation away from the current page. It is meant to recover
from selector drift, not to redesign the script.

## REPL features

- **Tab completion** (case-insensitive): cycles through Pandascript keywords
  and `/<tool>` slash commands. The dim grey suffix shown after the cursor is
  the first match.
- **Persistent history**: stored in `.lp-history` in the working directory.
- **Slash commands**: `/<tool> [args]` calls a browser tool directly without
  going through the LLM. Args accept either a single positional value (for
  tools with one required field), `key=value` pairs, or a raw `{json}` blob.
  Two meta commands round out the set: `/help` lists tools (`/help <tool>`
  prints the JSON schema), and `/quit` exits the REPL.
  ```
  > /goto https://example.com
  > /findElement role=button name=Submit
  > /eval {"script": "document.title"}
  > /quit
  ```
- **Stdout vs stderr**: the final assistant answer and data-producing commands
  (`EXTRACT`, `EVAL`, `MARKDOWN`, `TREE`) write to stdout. Tool calls,
  progress, and errors go to stderr, so `lightpanda agent --task ... > out.txt`
  captures a clean answer.

## One-shot mode (`--task`)

```console
./lightpanda agent --provider gemini \
  --task "what is the top story on news.ycombinator.com?"
```

`--task` runs a single user turn, prints the final answer on stdout, and
exits. Combine with `--task-attachment <path>` (repeatable) to feed local
files to providers that accept attachments.

## MCP server mode (`--mcp`)

`lightpanda agent --mcp --provider <p>` runs the agent as an MCP server
over stdio. It exposes a single tool, `task`, so a calling agent can
delegate a high-level browsing task and receive only the final answer
without the intermediate browser tool calls (tree dumps, clicks, scrolls)
filling its own context.

```console
./lightpanda agent --mcp --provider anthropic
```

MCP configuration:

```json
{
  "mcpServers": {
    "lightpanda-agent": {
      "command": "/path/to/lightpanda",
      "args": ["agent", "--mcp", "--provider", "anthropic"]
    }
  }
}
```

The `task` tool accepts:

| Field         | Type             | Notes                                                                  |
|---------------|------------------|------------------------------------------------------------------------|
| `task`        | string, required | Natural-language instruction for the agent.                            |
| `attachments` | string[]         | Optional local file paths (image / PDF / text) for providers that accept attachments. |
| `fresh`       | boolean          | If true, start the task from a fresh browser session (no cookies, no current page). |

Each call resets the agent's LLM conversation, so tasks are independent
from each other at the model level. The browser session, by contrast,
persists across calls by default — set `fresh: true` to reset it.

This mode is distinct from `lightpanda mcp`, which exposes the raw
browser tools (`goto`, `click`, `fill`, ...) and does not depend on an
LLM. Pick `lightpanda mcp` when the calling agent wants to drive the
browser itself, and `lightpanda agent --mcp` when it wants to hand off
the whole sub-task. `--mcp` cannot be combined with `--task`, `-i`, or a
script file.

Limitations: the JSON-RPC loop is single-threaded, so a long-running
task call blocks subsequent calls until it finishes. There is no
cancellation from the client side yet.

## Browser tools

The agent and MCP server share the tool set defined in `src/browser/tools.zig`.
Highlights:

- `goto`, `search` (Google with DuckDuckGo fallback on captcha)
- `tree`, `markdown`, `links`, `interactiveElements`, `structuredData`,
  `detectForms`, `nodeDetails`, `findElement`
- `click`, `fill`, `hover`, `press`, `scroll`, `selectOption`, `setChecked`,
  `waitForSelector`
- `eval`, `consoleLogs`, `getUrl`, `getCookies`, `getEnv`

Selectors prefer CSS over `backendNodeId` for the click-family tools, since
node IDs are invalidated by any DOM mutation. The system prompt enforces this
for the LLM.

## Security notes

- The agent treats page content as untrusted data, not instructions. URLs
  surfaced by a page are not followed unless they match the user's task.
- `$LP_*` environment variable references in `TYPE` / `fill` values are
  resolved at execution time, so credentials never enter the LLM context.
- `--obey-robots`, `--http-proxy`, `--user-agent`, and the rest of the
  browser-level CLI flags apply to `agent` the same way they apply to
  `serve`, `fetch`, and `mcp`.