An agent writes a function, then writes a test for that function. The same brain wrote both sides.
That's the structural problem with how AI agents test today. Not that they're bad at testing — that the loop is closed before it starts. The test knows the implementation. The mock knows the schema. The assertion knows the return type. The agent runs the suite and says "all green." Of course it's green. It's confirming what it already believes.
Luotain breaks the loop by making the target a black box.
## What it is
Luotain is a probe toolkit. You describe what your software should do in markdown. An AI agent reads that description, probes the live system from the outside, and reports what matches and what doesn't.
```
products/myapp/
  product.md            <- what the product is
  specs/
    auth/login.md       <- what login should do
  results/
    2026-03-27/
      auth/login.json   <- what actually happened
```
The agent never sees source code. It doesn't know what language the system is written in, what database it uses, or how the auth middleware works. It only knows what the system should do, because you told it in a spec file.
A spec looks like this:

```markdown
# Login API

## POST /login

- Accepts JSON with `email` and `password`
- Returns 200 with `token` on valid credentials
- Returns 401 on bad credentials
- Rate limits to 5 attempts per minute
```
No test framework. No assertion syntax. Plain English in a markdown file.
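To show how little structure such a spec needs, here is a minimal sketch of how an agent might pull the feature bullets out of one. This is an illustration, not Luotain's actual parsing — the function name and approach are ours:

```python
import re

def extract_features(spec_markdown: str) -> dict:
    """Map each '##' heading in a spec to the feature bullets under it.

    Hypothetical helper for illustration; Luotain's agent may read
    specs very differently.
    """
    features = {}
    current = None
    for line in spec_markdown.splitlines():
        heading = re.match(r"^##\s+(.*)", line)
        if heading:
            current = heading.group(1).strip()
            features[current] = []
        elif current and re.match(r"^-\s+", line):
            # Strip the leading "- " bullet marker, keep the feature text.
            features[current].append(line.lstrip("- ").strip())
    return features

spec = """# Login API

## POST /login

- Accepts JSON with `email` and `password`
- Returns 200 with `token` on valid credentials
- Returns 401 on bad credentials
- Rate limits to 5 attempts per minute
"""

features = extract_features(spec)
```

Each bullet becomes one checkable claim about behavior — which is all a probe needs.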
## How it works
Luotain has three probe types that the agent can use:
**HTTP** — send requests, observe status codes, headers, body, timing:

```shell
luotain probe http POST https://api.example.com/login \
  --body '{"email":"nobody@example.com","password":"aaaa"}' \
  --header 'Content-Type: application/json'
```
**CLI** — run commands, capture exit code, stdout, stderr:

```shell
luotain probe cli "docker exec myapp healthcheck"
```
**TCP** — connect to a port, optionally send data, observe response:

```shell
luotain probe tcp connect localhost 6379
```
The agent decides which probes to use based on what the spec describes. A spec about an HTTP API triggers HTTP probes. A spec about a CLI tool triggers CLI probes. Luotain doesn't decide — the agent does.
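That selection is a judgment the agent makes from the spec's wording. As a rough picture of the kind of reasoning involved, here is a hypothetical keyword heuristic — Luotain does not ship this function; the real decision is made by the model itself:

```python
def choose_probe(spec_text: str) -> str:
    """Guess which probe type fits a spec, by keyword.

    A deliberately crude stand-in for the agent's judgment,
    purely for illustration.
    """
    text = spec_text.lower()
    # HTTP specs tend to name methods and paths.
    if any(verb in text for verb in ("get /", "post /", "put /", "delete /", "http")):
        return "http"
    # CLI specs talk about processes and their streams.
    if any(word in text for word in ("exit code", "stdout", "stderr", "command")):
        return "cli"
    # TCP specs talk about ports and raw connections.
    if any(word in text for word in ("port", "tcp", "connect")):
        return "tcp"
    return "http"  # default assumption: most services speak HTTP
```

The point is not the heuristic but the direction of control: the spec's content drives the probing, not a hardcoded test plan.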
## Two ways to run it
**Interactive** — add Luotain as an MCP server to Claude Code. The agent reads specs and probes the system using tools directly. You watch, you guide, you intervene.

```shell
luotain-mcp --product ./products/myapp --target http://localhost:8080
```
The MCP server exposes tools: `read_product`, `list_specs`, `read_spec`, `probe_http`, `write_result`, `read_results`. The agent calls `read_product` first — that's its only knowledge of the system. Then it reads specs, probes, and writes results.
**Loop** — continuous unattended probing. Runs forever on an interval.

```shell
luotain-loop --product ./products/myapp \
  --target http://localhost:8080 \
  --interval 300 \
  --agent-url http://localhost:11434/v1 \
  --model llama3
```
Every 5 minutes: read product, read specs, probe, write results. The `--agent-url` flag takes any OpenAI-compatible API — Ollama, Mistral, Groq, whatever runs on your machine. A local model on your laptop probing your local services: no external calls, no API keys, no cost per run.
## Results
Each spec produces one JSON result file with per-feature verdicts:
```json
{
  "spec": "auth/login.md",
  "verdict": "fail",
  "features": [
    {
      "description": "returns 200 with token on valid credentials",
      "verdict": "pass",
      "why": null
    },
    {
      "description": "rate limits to 5 attempts per minute",
      "verdict": "fail",
      "why": "sent 10 requests in 30 seconds, all returned 200"
    }
  ]
}
```
Not just pass/fail — which feature failed and why, in a sentence. Results are stored by date under `results/YYYY-MM-DD/` so you have a history.
## Adversarial mode
Normal mode verifies that features work as described. Adversarial mode tries to break them.
```shell
luotain-loop --product ./products/myapp --adversarial
```
The agent reads the spec, understands what the system expects, and then deliberately violates it. A login endpoint that accepts JSON gets a request with `Content-Type: text/xml` and a body of `{"password": 9999999999999999999}`. A "pass" means the system handled it gracefully — correct error code, no 500, no hang. A "fail" means the agent found a crack.

In one early test, adversarial mode discovered that a payment API returned a 200 with an empty body when the `amount` field was a string instead of a number. The normal-mode specs all passed. The system was "working" — just not safely. Results are tagged `"mode": "adversarial"`.
## The product tree
A product directory gives the agent context beyond individual specs:
```
products/myapp/
  product.md   <- product description, concepts, constraints
  specs/       <- behavior specs (markdown)
  results/     <- evaluation history (JSON, by date)
```
`product.md` is the first thing the agent reads. It's the product description — features, domain language, known constraints. An agent that knows "this is a financial API with strict idempotency" probes differently than one testing a blog.
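A sketch of how a harness might load that context — product description first, then every spec beneath it. The helper is illustrative, not Luotain's API:

```python
import tempfile
from pathlib import Path

def load_product(root: Path) -> tuple:
    """Read product.md first, then collect every spec under specs/.

    Hypothetical loader mirroring the tree layout above.
    """
    description = (root / "product.md").read_text()
    specs = sorted((root / "specs").rglob("*.md"))
    return description, specs

# Build a throwaway tree shaped like the layout above.
root = Path(tempfile.mkdtemp()) / "myapp"
(root / "specs" / "auth").mkdir(parents=True)
(root / "product.md").write_text("A financial API with strict idempotency.\n")
(root / "specs" / "auth" / "login.md").write_text("# Login API\n")

description, specs = load_product(root)
```

Reading `product.md` before any spec is the one ordering that matters: it is the only system-level context the agent ever gets.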
## Why blackbox matters for AI
When an AI agent writes code and then writes tests for that code, it's testing its own assumptions. The tests pass because they were born from the same context as the implementation.
Blackbox testing breaks this. The spec was written by a human who knows what the product should do. The probe hits a live system that either does or doesn't do that thing. The implementation in between is invisible.
This means:
- An agent can refactor everything and the specs still apply
- A spec written by a product manager is a valid test case
- The test doesn't break when internals change, only when behavior changes
- You test what users experience, not what developers wrote
The name is Finnish. Luotain means a probe or sounder — something you send into the deep to see what's down there without going yourself.
Luotain is part of False Systems. It integrates with Sykli for pipeline orchestration and emits False Protocol events for every probe.