Overview
WALL·TE (Web Automation Large Language Test Engineer) is a browser automation framework that uses Large Language Models to interpret natural language test descriptions and execute them against web applications. Tests are written in Markdown, and AI handles the translation to browser actions.
The framework eliminates CSS selectors and XPath expressions by using vision-enabled LLMs to understand page structure semantically. This makes tests more resilient to UI changes and reduces maintenance overhead.
See It In Action
Watch WALL·TE execute real test scenarios. These recordings show the AI's decision-making process as it interprets natural language and interacts with live applications.
Synchronous Test Execution
Watch tests execute sequentially with real-time streaming output showing AI reasoning and browser actions.
Parallel Test Execution
Run multiple tests in parallel across different browsers and devices. Demonstrates the AI's ability to handle concurrent test execution.
Key Features
Markdown Test Format
Write tests in plain English using Markdown. No programming required.
Semantic Page Understanding
AI interprets page structure through accessibility trees and visual layout.
Multi-Device Support
Run tests across mobile, tablet, and desktop with a single test definition.
Parallel Execution
Run multiple tests concurrently for faster test suite completion.
Real-Time Streaming
Watch test execution with live progress updates and AI decision logs.
Structured Reports
JSON, YAML, or TOML output with screenshots, token usage, and cost breakdowns.
BYOK Model
Use your own OpenAI or Anthropic API keys. No vendor lock-in.
CI/CD Integration
Headless mode and structured output work with any CI/CD pipeline.
The Problem
Traditional browser automation relies on brittle selectors that break when markup changes. Every UI refactor requires updating test suites. Dynamic content requires explicit wait conditions and state management. Multi-device testing means maintaining separate test implementations.
- CSS selectors and XPath break when HTML structure changes
- Dynamic UIs require complex conditional logic for different states
- Test maintenance costs often exceed initial development time
- Different viewports require separate test implementations
- Tests become outdated, and teams stop maintaining them
Natural Language Testing
Tests are defined in Markdown files with natural language instructions. The AI interprets intent rather than following rigid selector-based instructions. This approach makes tests readable and maintainable.
Example: Authentication Flow
```markdown
---
title: Login Test
description: Verify user authentication flow
tags: [auth, smoke-test]
---

## Successful Login

Test that a user can log in with valid credentials.

### Steps

1. Navigate to https://app.example.com/login
2. Fill in the email field with "test@example.com"
3. Fill in the password field with "SecurePass123"
4. Click the "Sign In" button
5. Wait for the dashboard to load

### Expectations

- The user should be redirected to /dashboard
- A welcome message should appear
- The user's profile icon should be visible in the header
```

How It Works
Semantic Understanding
The AI analyzes the page's accessibility tree and visual layout to understand structure. Instead of looking for specific element IDs or classes, it identifies elements by their semantic role and visible text.
```javascript
// Traditional approach: brittle, selector-based
await page.click('#login-form button[type="submit"]')

// WALL·TE approach: natural language
"Click the Sign In button"
```

Dynamic Adaptation
Instructions like "Click any product with 'wireless' in the title" work naturally. The AI searches the page for matching elements and selects appropriately. No explicit loops or conditional logic required.
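In a test file, such an instruction is just another step. A hypothetical sketch (the store URL and product wording are illustrative, not from a real test suite):

```markdown
### Steps

1. Navigate to https://shop.example.com/search?q=headphones
2. Click any product with "wireless" in the title
3. Verify the product page heading contains "wireless"
```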
Context Retention
The AI maintains conversation history across test steps. It remembers which product was selected, what form fields were filled, and can verify that later steps match earlier actions.
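Context retention lets later steps refer back to earlier ones without naming concrete values. A hypothetical sketch of such a test:

```markdown
### Steps

1. Open the product listing and add the first product to the cart
2. Go to the cart page

### Expectations

- The cart shows the same product that was added in step 1
```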
Technical Architecture
WALL·TE is built on three core components: Model Context Protocol integration for browser control, multi-model support for AI providers, and cost tracking for token usage.
Model Context Protocol (MCP) Integration
We use Anthropic's Model Context Protocol to provide LLMs with structured access to Playwright browser automation. MCP defines a standard interface for AI tools to interact with external systems.
The MCP server exposes browser primitives as structured tool calls:
- Navigation, clicking, typing, and form interaction
- Page snapshots and accessibility tree queries
- Screenshot capture for visual verification
- Console log and network request monitoring
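Concretely, the model emits structured tool calls rather than raw code. The sketch below shows one plausible shape of such a call; the tool name and argument fields are assumptions for illustration, not WALL·TE's actual schema:

```typescript
// Illustrative shape of an MCP tool call the model might emit.
// The tool name and argument fields are assumptions, not a documented schema.
interface ToolCall {
  name: string; // e.g. "browser_click", "browser_navigate"
  arguments: Record<string, unknown>;
}

const call: ToolCall = {
  name: "browser_click",
  arguments: {
    // The target is described semantically, not by CSS selector
    element: 'button labeled "Sign In"',
  },
};

console.log(call.name); // "browser_click"
```

Because every primitive is a structured call, the framework can log, replay, and cost-account each action the AI takes.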
Multi-Model Support
WALL·TE supports multiple AI providers with a BYOK (Bring Your Own Key) model. No vendor lock-in—use your existing API keys.
OpenAI Models
- GPT-5 (default)
- GPT-4o
- o1-preview for complex reasoning
Anthropic Models
- Claude Sonnet 4.5
- Claude Opus 4
- Automatic prompt caching
Cost Tracking
Every test run tracks token usage and estimates costs. Reports include token counts, screenshots, and cost breakdowns.
Typical costs: ~$0.03-0.09 per test suite run. Prompt caching significantly reduces costs for repeated test executions.
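The estimate is simple arithmetic over token counts. A minimal sketch, with placeholder per-token prices (current provider rates vary by model and change over time):

```typescript
// Back-of-envelope cost estimate from token counts.
// Prices here are illustrative placeholders, not actual provider rates.
const PRICE_PER_MTOK = { input: 3.0, output: 15.0 }; // USD per 1M tokens

function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * PRICE_PER_MTOK.input +
    (outputTokens / 1_000_000) * PRICE_PER_MTOK.output
  );
}

// A run using 20k input tokens and 2k output tokens:
console.log(estimateCostUSD(20_000, 2_000).toFixed(3)); // "0.090"
```

Prompt caching lowers the effective input-token price on repeated runs, which is why re-executing an unchanged suite costs less than the first run.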