An agent writes a function, then writes a test for that function. The same brain wrote both sides.
That's the structural problem with how AI agents test today. Not that they're bad at testing — that the loop is closed before it starts. The test knows the implementation. The mock knows the schema. The assertion knows the return type. The agent runs the suite and says "all green." Of course it's green. It's confirming what it already believes.
Luotain breaks the loop by making the target a black box.
## What it is
Luotain is a probe toolkit. You describe what your software should do in markdown. An AI agent reads that description, probes the live system from the outside, and reports what matches and what doesn't.
```
products/myapp/
  product.md            <- what the product is
  specs/
    auth/login.md       <- what login should do
  results/
    2026-03-27/
      auth/login.json   <- what actually happened
```
The agent never sees source code. It doesn't know what language the system is written in, what database it uses, or how the auth middleware works. It only knows what the system should do, because you told it in a spec file.
A spec looks like this:

```markdown
# Login API

## POST /login

- Accepts JSON with `email` and `password`
- Returns 200 with `token` on valid credentials
- Returns 401 on bad credentials
- Rate limits to 5 attempts per minute
```
No test framework. No assertion syntax. Plain English in a markdown file.
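To show how little structure such a spec needs, here is a minimal sketch of how an agent might pull the feature bullets out of one. This is an illustration, not Luotain's actual parsing — the function name and approach are ours:

```python
import re

def extract_features(spec_markdown: str) -> dict:
    """Map each '##' heading in a spec to the feature bullets under it.

    Hypothetical helper for illustration; Luotain's agent may read
    specs very differently.
    """
    features = {}
    current = None
    for line in spec_markdown.splitlines():
        heading = re.match(r"^##\s+(.*)", line)
        if heading:
            current = heading.group(1).strip()
            features[current] = []
        elif current and re.match(r"^-\s+", line):
            # Strip the leading "- " bullet marker, keep the feature text.
            features[current].append(line.lstrip("- ").strip())
    return features

spec = """# Login API

## POST /login

- Accepts JSON with `email` and `password`
- Returns 200 with `token` on valid credentials
- Returns 401 on bad credentials
- Rate limits to 5 attempts per minute
"""

features = extract_features(spec)
```

Each bullet becomes one checkable claim about behavior — which is all a probe needs.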
## How it works
Luotain has three probe types that the agent can use:
**HTTP** — send requests, observe status codes, headers, body, timing:

```shell
luotain probe http POST https://api.example.com/login \
  --body '{"email":"nobody@example.com","password":"aaaa"}' \
  --header 'Content-Type: application/json'
```
**CLI** — run commands, capture exit code, stdout, stderr:

```shell
luotain probe cli "docker exec myapp healthcheck"
```
**TCP** — connect to a port, optionally send data, observe response:

```shell
luotain probe tcp connect localhost 6379
```
The agent decides which probes to use based on what the spec describes. A spec about an HTTP API triggers HTTP probes. A spec about a CLI tool triggers CLI probes. Luotain doesn't decide — the agent does.
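That selection is a judgment the agent makes from the spec's wording. As a rough picture of the kind of reasoning involved, here is a hypothetical keyword heuristic — Luotain does not ship this function; the real decision is made by the model itself:

```python
def choose_probe(spec_text: str) -> str:
    """Guess which probe type fits a spec, by keyword.

    A deliberately crude stand-in for the agent's judgment,
    purely for illustration.
    """
    text = spec_text.lower()
    # HTTP specs tend to name methods and paths.
    if any(verb in text for verb in ("get /", "post /", "put /", "delete /", "http")):
        return "http"
    # CLI specs talk about processes and their streams.
    if any(word in text for word in ("exit code", "stdout", "stderr", "command")):
        return "cli"
    # TCP specs talk about ports and raw connections.
    if any(word in text for word in ("port", "tcp", "connect")):
        return "tcp"
    return "http"  # default assumption: most services speak HTTP
```

The point is not the heuristic but the direction of control: the spec's content drives the probing, not a hardcoded test plan.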
## Two ways to run it
**Interactive** — add Luotain as an MCP server to Claude Code. The agent reads specs and probes the system using tools directly. You watch, you guide, you intervene.

```shell
luotain-mcp --product ./products/myapp --target http://localhost:8080
```
The MCP server exposes tools: `read_product`, `list_specs`, `read_spec`, `probe_http`, `write_result`, `read_results`. The agent calls `read_product` first — that's its only knowledge of the system. Then it reads specs, probes, and writes results.
**Loop** — continuous unattended probing. Runs forever on an interval.

```shell
luotain-loop --product ./products/myapp \
  --target http://localhost:8080 \
  --interval 300 \
  --agent-url http://localhost:11434/v1 \
  --model llama3
```
Every 5 minutes: read product, read specs, probe, write results. The `--agent-url` flag takes any OpenAI-compatible API — Ollama, Mistral, Groq, whatever runs on your machine. A local model on your laptop probing your local services: no external calls, no API keys, no cost per run.
## Results
Each spec produces one JSON result file with per-feature verdicts:
```json
{
  "spec": "auth/login.md",
  "verdict": "fail",
  "features": [
    {
      "description": "returns 200 with token on valid credentials",
      "verdict": "pass",
      "why": null
    },
    {
      "description": "rate limits to 5 attempts per minute",
      "verdict": "fail",
      "why": "sent 10 requests in 30 seconds, all returned 200"
    }
  ]
}
```
Not just pass/fail — which feature failed and why, in a sentence. Results are stored by date under `results/YYYY-MM-DD/` so you have a history.
## Adversarial mode
Normal mode verifies that features work as described. Adversarial mode tries to break them.
```shell
luotain-loop --product ./products/myapp --adversarial
```
The agent reads the spec, understands what the system expects, and then deliberately violates it. A login endpoint that accepts JSON gets a request with `Content-Type: text/xml` and a body of `{"password": 9999999999999999999}`. A "pass" means the system handled it gracefully — correct error code, no 500, no hang. A "fail" means the agent found a crack.

In one early test, adversarial mode discovered that a payment API returned a 200 with an empty body when the `amount` field was a string instead of a number. The normal-mode specs all passed. The system was "working" — just not safely. Results are tagged `"mode": "adversarial"`.
## The product tree
A product directory gives the agent context beyond individual specs:
```
products/myapp/
  product.md   <- product description, concepts, constraints
  specs/       <- behavior specs (markdown)
  results/     <- evaluation history (JSON, by date)
```
`product.md` is the first thing the agent reads. It's the product description — features, domain language, known constraints. An agent that knows "this is a financial API with strict idempotency" probes differently than one testing a blog.
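A sketch of how a harness might load that context — product description first, then every spec beneath it. The helper is illustrative, not Luotain's API:

```python
import tempfile
from pathlib import Path

def load_product(root: Path) -> tuple:
    """Read product.md first, then collect every spec under specs/.

    Hypothetical loader mirroring the tree layout above.
    """
    description = (root / "product.md").read_text()
    specs = sorted((root / "specs").rglob("*.md"))
    return description, specs

# Build a throwaway tree shaped like the layout above.
root = Path(tempfile.mkdtemp()) / "myapp"
(root / "specs" / "auth").mkdir(parents=True)
(root / "product.md").write_text("A financial API with strict idempotency.\n")
(root / "specs" / "auth" / "login.md").write_text("# Login API\n")

description, specs = load_product(root)
```

Reading `product.md` before any spec is the one ordering that matters: it is the only system-level context the agent ever gets.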
## Why blackbox matters for AI
When an AI agent writes code and then writes tests for that code, it's testing its own assumptions. The tests pass because they were born from the same context as the implementation.
Blackbox testing breaks this. The spec was written by a human who knows what the product should do. The probe hits a live system that either does or doesn't do that thing. The implementation in between is invisible.
This means:
- An agent can refactor everything and the specs still apply
- A spec written by a product manager is a valid test case
- The test doesn't break when internals change, only when behavior changes
- You test what users experience, not what developers wrote
The name is Finnish. Luotain means a probe or sounder — something you send into the deep to see what's down there without going yourself.
Luotain is part of False Systems. It integrates with Sykli for pipeline orchestration and emits False Protocol events for every probe.