Architecture
This page covers the technical architecture behind Mole. For user-facing features, see Tips & Tricks.
Agentic Loop Architecture
At Mole's core is a minimal Agentic Loop, with this design philosophy:
Code handles mechanics and boundaries (guarantees the floor), the model handles decisions and strategy (determines the ceiling)
Core Loop
Sample → Has tool calls → Execute → Write back → Continue sampling
No tool calls → Finish
Code is Responsible For (Mechanics + Boundaries)
- Sample → Execute → Write-back loop
- Budget enforcement (turn limits, call count limits, context length limits)
- Infinite loop detection (same tool + params repeated N times → auto-terminate)
- Empty response retry
- Auto context compression when too long
- Subtask recursion entry point
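The mechanics listed above can be sketched as a single bounded loop. This is a minimal illustration, not Mole's actual implementation: the names `runLoop`, `LoopOptions`, `Message`, `ToolCall`, and `StopReason` are hypothetical, the sampler and executor are injected as plain functions, and everything is synchronous for brevity (a real loop would presumably be asynchronous). Context compression and subtask recursion are left out.

```typescript
// Hypothetical types; none of these names come from Mole's source.
type ToolCall = { name: string; args: Record<string, unknown> };
type Message =
  | { role: "user" | "assistant"; text: string; toolCalls?: ToolCall[] }
  | { role: "tool"; name: string; text: string };

interface LoopOptions {
  maxTurns: number;        // turn budget
  maxToolCalls: number;    // call-count budget
  maxRepeats: number;      // identical call N times in a row → terminate
  maxEmptyRetries: number; // bounded retries for empty responses
}

type StopReason =
  | "finished" | "turn_limit" | "call_limit" | "loop_detected" | "empty_response";

function runLoop(
  messages: Message[],
  sample: (history: Message[]) => { text: string; toolCalls: ToolCall[] },
  execute: (call: ToolCall) => string,
  opts: LoopOptions
): StopReason {
  let toolCalls = 0;
  let emptyRetries = 0;
  let lastSignature = "";
  let repeats = 0;

  for (let turn = 0; turn < opts.maxTurns; turn++) {
    const reply = sample(messages);

    // Empty response: retry, but only a bounded number of times.
    // In this sketch each retry also consumes one turn.
    if (reply.text === "" && reply.toolCalls.length === 0) {
      emptyRetries++;
      if (emptyRetries > opts.maxEmptyRetries) return "empty_response";
      continue;
    }
    messages.push({ role: "assistant", text: reply.text, toolCalls: reply.toolCalls });

    // No tool calls: the model has decided the task is done.
    if (reply.toolCalls.length === 0) return "finished";

    for (const call of reply.toolCalls) {
      toolCalls++;
      if (toolCalls > opts.maxToolCalls) return "call_limit";

      // Same tool + same params repeated N times in a row → stop.
      const signature = call.name + ":" + JSON.stringify(call.args);
      repeats = signature === lastSignature ? repeats + 1 : 1;
      lastSignature = signature;
      if (repeats >= opts.maxRepeats) return "loop_detected";

      // Execute, then write the result back for the next sample.
      messages.push({ role: "tool", name: call.name, text: execute(call) });
    }
  }
  return "turn_limit";
}
```

The loop itself never decides what to do next; it only enforces limits and moves results back into the history, which matches the split between code and model described above.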
Model is Responsible For (Decisions + Strategy)
- Intent classification / tool selection / task decomposition
- Verification strategy / when to stop / response wording
Task Levels
Mole classifies tasks into four levels by complexity, determined autonomously by the model:
Level 1: Direct Answer
Greetings, casual chat, knowledge Q&A — answered directly with text, no tools called.
Level 2: Single-Step Operation
One clear small goal (search, click, screenshot, query) — calls the most suitable tool, replies after getting results.
Level 3: Multi-Step Task
Goals requiring 2+ steps. Each step decides the next based on the actual result of the previous step, rather than planning all steps upfront.
Level 4: Compound Task
Contains multiple relatively independent sub-goals. Uses spawn_subtask to split each independent sub-goal into an isolated task with its own context, preventing information cross-contamination.
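The isolation idea behind Level 4 can be sketched as follows. Only `spawn_subtask` is named in this page; the functions `spawnSubtask` and `runCompoundTask`, the `Msg` type, and the injected `run` callback are hypothetical stand-ins, and the real tool may pass more than a goal string into the child context.

```typescript
type Msg = { role: "user" | "assistant" | "tool"; text: string };

// Runs one sub-goal in a fresh context and returns only its final answer.
function spawnSubtask(goal: string, run: (context: Msg[]) => string): string {
  // The child context starts with the goal alone: no parent history.
  const context: Msg[] = [{ role: "user", text: goal }];
  return run(context);
}

function runCompoundTask(
  parent: Msg[],
  subGoals: string[],
  run: (context: Msg[]) => string
): void {
  for (const goal of subGoals) {
    const summary = spawnSubtask(goal, run);
    // Only the summary crosses back into the parent context.
    parent.push({ role: "tool", text: "[" + goal + "] " + summary });
  }
}
```

Because each sub-goal sees only its own context, intermediate results from one sub-goal cannot leak into another.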
Multi-Tab Orchestration
Mole can operate across multiple browser tabs within a single task — for example, searching on one page, extracting data, and filling a form on another.
How It Works
All page-operating tools support an optional tab_id parameter. The AI follows this flow:
- Open a new tab: tab_navigate(action='open', url='...') returns the new tab's tab_id
- Operate on the target tab: Pass tab_id to any tool: page_snapshot(tab_id=123), cdp_input(tab_id=123, ...), extract_data(tab_id=123, ...)
- Clean up: tab_navigate(action='close', tab_id=123) when done
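The open → operate → clean up flow above maps naturally onto a try/finally wrapper. The sketch below assumes a hypothetical synchronous `callTool(name, args)` dispatcher and assumes that `tab_navigate` returns an object with a numeric `tab_id` field; the tool names and parameters are taken from this page, while the helper `withTab` is illustrative only.

```typescript
type ToolArgs = Record<string, unknown>;
// Hypothetical dispatcher: invokes a tool by name and returns its result.
type CallTool = (name: string, args: ToolArgs) => ToolArgs;

function withTab<T>(callTool: CallTool, url: string, work: (tabId: number) => T): T {
  // 1. Open a new tab and read back its tab_id (assumed result shape).
  const opened = callTool("tab_navigate", { action: "open", url: url });
  const tabId = opened["tab_id"];
  if (typeof tabId !== "number") {
    throw new Error("tab_navigate did not return a numeric tab_id");
  }
  try {
    // 2. Operate on the target tab, passing tab_id to every tool call.
    return work(tabId);
  } finally {
    // 3. Clean up, even if the work above threw.
    callTool("tab_navigate", { action: "close", tab_id: tabId });
  }
}
```

A caller would pass something like `(id) => callTool("page_snapshot", { tab_id: id })` as the `work` callback.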
Key Rules
- element_id is tab-private: An element ID obtained from Tab A cannot be used on Tab B. Always call page_snapshot on the target tab first.
- Default behavior unchanged: When tab_id is omitted, tools operate on the tab where the user started the conversation, exactly as before.
- List open tabs: tab_navigate(action='list') shows all tabs with their IDs.
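The first two rules can be expressed as two small checks. This is a sketch under assumptions: `resolveTabId`, `checkElement`, and the `SnapshotIndex` structure are hypothetical, and how Mole actually tracks element IDs per tab is not described here.

```typescript
// Element ids recorded per tab from that tab's most recent page_snapshot.
type SnapshotIndex = Record<number, string[]>;

// When tab_id is omitted, fall back to the tab where the conversation started.
function resolveTabId(requested: number | undefined, conversationTabId: number): number {
  return requested === undefined ? conversationTabId : requested;
}

// Reject element ids that were not obtained from the target tab.
function checkElement(index: SnapshotIndex, tabId: number, elementId: string): void {
  const known = index[tabId];
  if (known === undefined) {
    throw new Error("call page_snapshot on tab " + tabId + " first");
  }
  if (known.indexOf(elementId) === -1) {
    throw new Error("element " + elementId + " does not belong to tab " + tabId);
  }
}
```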
Deep CDP Control
Mole connects to 10 Chrome DevTools Protocol (CDP) domains, providing browser-process-level deep control that overcomes Content Script limitations:
- Trusted event injection: Sends isTrusted=true mouse/keyboard events, bypassing anti-bot event source detection
- Dialog handling: Automatically detects and handles alert/confirm/prompt dialogs, preventing automation flow interruption
- iframe piercing — Execute JS and get text within cross-origin iframes, solving CAPTCHA and payment form scenarios
- Network visibility — Complete request/response data (including body and headers), plus Cookie read/write
- Request interception — Intercept requests to modify and continue, return mock responses, or simulate failures; supports auth header injection and CORS bypass
- Deep DOM operations — Query/modify DOM ignoring same-origin policy, get precise box model geometry
- Page storage — Cross-origin read/write of localStorage / sessionStorage without content scripts
- CSS styles — Get computed styles and matching rules, modify inline styles, dynamically inject CSS rules
- Visual highlighting — Highlight DOM elements or regions so users can see what the AI is operating on
- Device emulation — Simulate mobile viewports, override User-Agent, spoof geolocation and timezone
- Console capture — Automatically collect console output and uncaught exceptions to help diagnose page issues
All CDP tools share a unified session manager (cdp-session.ts) that automatically manages debugger attach/detach lifecycle and domain event listeners.
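One plausible way to manage the attach/detach lifecycle is reference counting, sketched below. The contents of cdp-session.ts are not shown on this page, so the class `CdpSessionManager`, the `DebuggerApi` interface, and the ref-counting strategy are all assumptions; domain event listeners are omitted. In a Chrome extension, `DebuggerApi` could be backed by `chrome.debugger.attach`, `chrome.debugger.sendCommand`, and `chrome.debugger.detach`, which are asynchronous in practice.

```typescript
// Hypothetical abstraction over the browser's debugger API.
interface DebuggerApi {
  attach(tabId: number): void;
  detach(tabId: number): void;
  sendCommand(tabId: number, method: string, params: Record<string, unknown>): unknown;
}

class CdpSessionManager {
  private api: DebuggerApi;
  private refCounts: Record<number, number> = {};

  constructor(api: DebuggerApi) {
    this.api = api;
  }

  // Attach on first use; later users of the same tab share the session.
  acquire(tabId: number): void {
    const count = this.refCounts[tabId] || 0;
    if (count === 0) this.api.attach(tabId);
    this.refCounts[tabId] = count + 1;
  }

  // Detach only when the last user releases the tab.
  release(tabId: number): void {
    const count = this.refCounts[tabId] || 0;
    if (count === 0) return; // never acquired: nothing to do
    if (count === 1) {
      this.api.detach(tabId);
      delete this.refCounts[tabId];
    } else {
      this.refCounts[tabId] = count - 1;
    }
  }

  // Send one CDP command, holding the session for the duration of the call.
  send(tabId: number, method: string, params: Record<string, unknown>): unknown {
    this.acquire(tabId);
    try {
      return this.api.sendCommand(tabId, method, params);
    } finally {
      this.release(tabId);
    }
  }
}
```

With this shape, a tool that needs several commands calls `acquire` once and `release` at the end, so the debugger is not re-attached between commands such as consecutive `Input.dispatchMouseEvent` calls.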
Automatic Context Compression
When conversation context grows too long, Mole automatically compresses historical context while preserving key information, ensuring continued operation within the LLM's context window limits. This prevents long multi-step tasks from failing due to context overflow.
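A common way to implement this kind of compression is to keep the original task statement and the most recent messages, and replace everything in between with a summary. The sketch below illustrates that approach only; it is not confirmed to be Mole's strategy. The names `compress` and `estimateTokens`, the characters-divided-by-four token estimate, and the injected `summarize` callback are all assumptions.

```typescript
type Entry = { role: string; text: string };

// Rough size heuristic (about 4 characters per token); not a real tokenizer.
function estimateTokens(entries: Entry[]): number {
  let chars = 0;
  for (const e of entries) chars += e.text.length;
  return Math.ceil(chars / 4);
}

function compress(
  entries: Entry[],
  budget: number,
  keepRecent: number,
  summarize: (old: Entry[]) => string
): Entry[] {
  // Nothing to do if we are within budget or there is no middle to summarize.
  if (estimateTokens(entries) <= budget || entries.length <= keepRecent + 1) {
    return entries;
  }
  const head = entries.slice(0, 1);                          // original task
  const old = entries.slice(1, entries.length - keepRecent); // to be summarized
  const recent = entries.slice(entries.length - keepRecent); // kept verbatim
  const summary: Entry = { role: "system", text: "[summary] " + summarize(old) };
  return head.concat([summary], recent);
}
```

In practice `summarize` would likely be another model call, and the loop would run `compress` before each sample.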
Vision (Visual Understanding)
Mole has visual understanding capabilities. When the screenshot tool is called, the captured image is automatically injected into the LLM context as a multimodal input. The AI can then "see" the page content and make decisions based on visual information.
Annotated Screenshots
Use screenshot(annotate=true) to get a screenshot with numbered interactive element markers:
- Every interactive element in the viewport is marked with a number (1, 2, 3...) and a red highlight box
- Returns a mapping table of number → element_id with tag and text info
- AI can visually identify the target element and use the corresponding element_id for precise operations
- Follows the Look → Act → Check protocol: observe the page first, act with confidence, verify critical results
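The number → element_id mapping described above could be consumed as in the sketch below. The `Marker` field names and the function `elementIdForMarker` are hypothetical; the actual shape of the mapping table returned by `screenshot(annotate=true)` is not specified here.

```typescript
// Assumed shape of one row in the mapping table.
interface Marker {
  number: number;     // the label drawn on the screenshot
  element_id: string; // id to pass to subsequent tool calls
  tag: string;
  text: string;
}

// Translate a marker number seen in the image into an element_id.
function elementIdForMarker(markers: Marker[], n: number): string {
  for (const m of markers) {
    if (m.number === n) return m.element_id;
  }
  throw new Error("no marker numbered " + n + " in this screenshot");
}
```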
Limits:
- Up to 15 screenshot images per task to control context size
- Images are automatically stripped during context compression, replaced with text placeholders
- Prefer page_snapshot / page_skeleton for structured data; use visual analysis as a supplement
- The floating ball is automatically hidden during screenshots to avoid obscuring page content
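The first two limits can be sketched as a per-task counter plus a stripping pass. The limit of 15 comes from the list above; the names `ScreenshotBudget`, `stripImages`, and `Part`, the placeholder strings, and the behavior when the limit is reached (a text note instead of an image) are assumptions for illustration.

```typescript
const MAX_SCREENSHOTS_PER_TASK = 15; // limit stated in the list above

type Part =
  | { kind: "text"; text: string }
  | { kind: "image"; dataUrl: string };

class ScreenshotBudget {
  private used = 0;

  // Adds the image if the per-task budget allows; otherwise adds a text note.
  tryAdd(parts: Part[], dataUrl: string): boolean {
    if (this.used >= MAX_SCREENSHOTS_PER_TASK) {
      parts.push({ kind: "text", text: "[screenshot skipped: per-task image limit reached]" });
      return false;
    }
    this.used++;
    parts.push({ kind: "image", dataUrl: dataUrl });
    return true;
  }
}

// During context compression, replace every image with a text placeholder.
function stripImages(parts: Part[]): Part[] {
  return parts.map((p): Part =>
    p.kind === "image" ? { kind: "text", text: "[image removed during compression]" } : p
  );
}
```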
