LangChain pipelines written in 2023 often don’t run in 2026. The code is fine. The abstractions underneath it aren’t.

The model API changed. The retrieval pattern changed. The wrapper library rewrote its surface twice. The pipeline is a fossil of dependencies that were never stable.

This is not a library quality problem. It is the absence of a layer.

What this is actually about

Teams shipped LLM features over the last three years. The features worked. Then a provider deprecated a model, or a library rewrote its surface, or the structured-output mode quietly changed shape — and the pipeline went silent in a way no test caught.

The cost lands in familiar places. A 3am page nobody can debug, because the failure is in a layer nobody wrote. A model bill that doubled because every call routes through the largest model and nobody knows which calls actually need it. A backlog of working features the team is now afraid to touch, because every change is a load-bearing prompt edit.

This is not “AI is hard.” It is the same problem every durable software domain has already solved: there is no layer between the application and the moving parts underneath. Every team is rebuilding it as glue code. Every team is paying twice — once to write, again to maintain.

Every domain that lasted found one

Operating systems found POSIX. Networks found sockets. Databases found relational algebra. None of these eliminated churn beneath them: kernels still rewrite their schedulers, networks still change their TCP stacks, databases still rebuild their storage engines. But the layer kept the things written on top alive even as everything underneath was replaced. A 1995 SQL query runs against 2026 PostgreSQL. A 2023 LangChain pipeline often does not run against 2026 anything.

LLM operations don’t have that layer yet. So they break.

What the missing layer is

A task contract: a description of the work the pipeline does that does not name the model doing it.

The contract specifies what good enough looks like — typed inputs, typed outputs, declared budgets, declared failure semantics. Any model that meets the contract is a legal binding. As models improve, more models become legal bindings. The contract never has to change, because the contract was never about a model. It was about a task.
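As a rough illustration of a contract that never names a model, here is a minimal Python sketch. The field names (`min_quality`, `max_cost_usd`, `max_latency_ms`, `on_failure`), the `ModelProfile` record, and the example models and numbers are all assumptions made for illustration; none of them come from Sykli or any existing library.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaskContract:
    """What the task requires. No model name appears anywhere in it."""
    name: str
    input_type: type
    output_type: type
    min_quality: float      # required eval score in [0, 1] (hypothetical metric)
    max_cost_usd: float     # declared per-call budget
    max_latency_ms: int     # declared per-call latency budget
    on_failure: str         # declared failure semantics


@dataclass(frozen=True)
class ModelProfile:
    """One hypothetical catalog entry: measured behaviour of a model on the task."""
    model: str
    quality: float
    cost_usd: float
    latency_ms: int


def is_legal_binding(contract: TaskContract, profile: ModelProfile) -> bool:
    """A model is a legal binding when it meets every clause of the contract."""
    return (
        profile.quality >= contract.min_quality
        and profile.cost_usd <= contract.max_cost_usd
        and profile.latency_ms <= contract.max_latency_ms
    )


def legal_bindings(contract: TaskContract, catalog: List[ModelProfile]) -> List[str]:
    return [p.model for p in catalog if is_legal_binding(contract, p)]


contract = TaskContract(
    name="legal-qa", input_type=str, output_type=dict,
    min_quality=0.85, max_cost_usd=0.01, max_latency_ms=2000,
    on_failure="retry-then-escalate",
)

# Made-up catalogs: the contract is unchanged while the set of models grows.
catalog_2023 = [
    ModelProfile("model-a", quality=0.88, cost_usd=0.008, latency_ms=1500),
    ModelProfile("model-b", quality=0.70, cost_usd=0.001, latency_ms=400),
]
catalog_2026 = catalog_2023 + [
    ModelProfile("model-c", quality=0.91, cost_usd=0.002, latency_ms=600),
]

print(legal_bindings(contract, catalog_2023))  # ['model-a']
print(legal_bindings(contract, catalog_2026))  # ['model-a', 'model-c']
```

The contract object is identical in both calls; only the catalog changed, which is the property the argument relies on.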

This is the SQL longevity argument made exact. SQL queries don’t change across hardware generations because SQL queries were never about hardware — they were about data. Task contracts shouldn’t change across model generations because they shouldn’t be about models. They should be about the task: what’s required, with what confidence, under what budget, with what failure behavior.

What stays stochastic, what doesn’t

The model is stochastic. That part isn’t fixable.

What is fixable is everything around it. The shape of the output can be typed. The failure modes can be enumerated. The budgets can be declared. The composition rules can be checked at compile time. The recovery strategy can be specified, not improvised at 3am.

This is the move. The stochastic kernel stays stochastic. The envelope around it becomes deterministic. You stop reasoning about the whole pipeline as a probabilistic system and start reasoning about a deterministic system with a small stochastic core — the way you’d reason about a SQL query that calls a UDF.
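One way to picture a deterministic envelope around a stochastic core is the Python sketch below. It assumes the model is reachable through a hypothetical `call_model` callable that returns a JSON string, and it uses output length in characters as a stand-in for a real token or cost budget. The failure names echo the typed failures described later in this piece; everything else is illustrative.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional


class Failure(Enum):
    MALFORMED_EXTRACT = "malformed-extract"
    BUDGET_EXCEEDED = "budget-exceeded"


@dataclass(frozen=True)
class Outcome:
    """Exactly one of value or failure is set."""
    value: Optional[dict] = None
    failure: Optional[Failure] = None


def run_in_envelope(
    call_model: Callable[[str], str],   # hypothetical model client
    prompt: str,
    required_fields: Dict[str, type],   # declared output shape
    max_output_chars: int,              # stand-in for a declared budget
) -> Outcome:
    raw = call_model(prompt)            # the only stochastic step
    # Everything below is deterministic: same raw text, same outcome.
    if len(raw) > max_output_chars:
        return Outcome(failure=Failure.BUDGET_EXCEEDED)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Outcome(failure=Failure.MALFORMED_EXTRACT)
    if not isinstance(data, dict):
        return Outcome(failure=Failure.MALFORMED_EXTRACT)
    for name, expected in required_fields.items():
        if not isinstance(data.get(name), expected):
            return Outcome(failure=Failure.MALFORMED_EXTRACT)
    return Outcome(value=data)


schema = {"text": str, "evidence": list}

# Stub "models" standing in for real, stochastic calls.
def good(_prompt: str) -> str:
    return '{"text": "yes", "evidence": ["doc-1"]}'

def chatty(_prompt: str) -> str:
    return "Sure! Here is the answer you asked for."

print(run_in_envelope(good, "q", schema, 1000).value)
print(run_in_envelope(chatty, "q", schema, 1000).failure)
```

Callers never see raw model text. They see either a value of the declared shape or one member of a closed set of failures, which is what makes the surrounding system checkable.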

Most of what makes LLM pipelines unbearable to operate today is that the model’s stochasticity has bled outward and contaminated everything around it. Contracts, failures, budgets, composition — all of it has been left soft. Pulling that back is the work.

Why SQL is the right reference, mostly

Four properties made SQL last.

Closure. Every operator takes relations and returns relations. You compose without leaving the algebra.

Schema typing. Queries reference columns the catalog knows about. Mismatches are caught before any data is touched.

Cost-based binding. The query says what to compute. The optimizer chooses how — which index, which join, which order — based on the catalog’s statistics.

Decidable equivalence. For inner joins, A JOIN B = B JOIN A. Rewrites like this follow from algebraic laws, so the optimizer can prove them safe.

Three of these transfer cleanly to LLM operations. The fourth does not, and it would be dishonest to pretend otherwise.

LLM operators close — typed contexts flow through retrieve, generate, extract, verify and come out the other side as typed contexts. Schema typing applies — a contract declares what each step expects and produces. Cost-based binding applies — given declared budgets and a catalog of cost/quality profiles, the optimizer picks concrete models for abstract requests.
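Cost-based binding can be sketched in a few lines of Python. The catalog layout (task name mapped to a list of profiles), the `Profile` and `Request` records, and every number below are assumptions for illustration, not a description of an existing optimizer.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Profile:
    model: str
    quality: float            # measured quality on this task, in [0, 1]
    cost_per_call_usd: float


@dataclass(frozen=True)
class Request:
    """An abstract request: it names a task and budgets, never a model."""
    task: str
    min_quality: float
    max_cost_usd: float


# Hypothetical catalog of cost/quality profiles, keyed by task.
CATALOG: Dict[str, List[Profile]] = {
    "extract-dates": [
        Profile("small", quality=0.93, cost_per_call_usd=0.0004),
        Profile("large", quality=0.97, cost_per_call_usd=0.0100),
    ],
    "legal-qa": [
        Profile("small", quality=0.61, cost_per_call_usd=0.0006),
        Profile("large", quality=0.90, cost_per_call_usd=0.0120),
    ],
}


def bind(request: Request, catalog: Dict[str, List[Profile]]) -> Optional[str]:
    """Pick the cheapest model that satisfies the request, or None if none does."""
    candidates = [
        p for p in catalog.get(request.task, [])
        if p.quality >= request.min_quality
        and p.cost_per_call_usd <= request.max_cost_usd
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.cost_per_call_usd).model


print(bind(Request("extract-dates", 0.90, 0.02), CATALOG))  # small
print(bind(Request("legal-qa", 0.85, 0.02), CATALOG))       # large
print(bind(Request("legal-qa", 0.85, 0.005), CATALOG))      # None
```

The easy task binds to the small model and the hard one to the large model, without either choice appearing in application code. A request that no model can satisfy yields no binding rather than a silent fallback.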

But you cannot prove two prompts are semantically equivalent, or that swapping one model class for another preserves the output distribution. Equivalences are empirical, learned from observed traffic.

The algebra is SQL-shaped — declarative, typed, optimizer-bound — but its rewrites are justified by the catalog and the observation log, not by algebra. This is a real weakening, and the design has to absorb it rather than hide it.
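To make "justified by the observation log" concrete, here is one possible sketch in Python. It assumes the log holds pairs of outputs produced by the current binding and a candidate on the same inputs, treats exact equality as agreement (a real system would likely need a task-specific comparison), and accepts a rewrite only when the Wilson score lower bound on the agreement rate clears a threshold. The 0.95 threshold and the choice of interval are illustrative assumptions, not part of the design described here.

```python
import math
from typing import List, Tuple


def wilson_lower_bound(agreements: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if trials == 0:
        return 0.0
    p = agreements / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom


def rewrite_is_supported(
    log: List[Tuple[str, str]],      # (current output, candidate output) pairs
    min_agreement: float = 0.95,     # illustrative threshold
) -> bool:
    agreements = sum(1 for current, candidate in log if current == candidate)
    return wilson_lower_bound(agreements, len(log)) >= min_agreement


few = [("a", "a")] * 20                          # perfect agreement, little evidence
many = [("a", "a")] * 1000                       # perfect agreement, much evidence
noisy = [("a", "a")] * 900 + [("a", "b")] * 100  # 90% agreement

print(rewrite_is_supported(few))    # False
print(rewrite_is_supported(many))   # True
print(rewrite_is_supported(noisy))  # False
```

Twenty matching observations are not enough to clear the bar while a thousand are. Unlike an algebraic law, an empirical equivalence carries a sample size, and it can be revoked when new traffic disagrees.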

What this catches at compile time

A prompt declares its output schema. A pipeline binds against it.

(deftype Question [text :String])
(deftype CitedAnswer [text :String evidence (List DocId)])
(deftype Answer [text :String evidence (List DocId)])

(defprompt answer-with-citations
  :input  Question
  :output CitedAnswer)

(pipeline legal-qa
  :input  Question
  :output Answer
  (generate :prompt answer-with-citations :extract Answer))

Today they match. Someone “improves” the prompt to drop citations:

(deftype CitedAnswer [text :String])    ; ← evidence field removed

The compiler refuses, before the change reaches production:

error: prompt output type mismatch
      (generate :prompt answer-with-citations :extract Answer)
      ^ expected `Answer`, found `CitedAnswer`
  note: missing field: evidence : (List DocId)

A class of failure that today reaches production silently — surfacing at 3am when the evidence field is None and the citation formatter throws — becomes a compile error caught in code review.
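The notation above is not something that can be run here, but the same structural check can be approximated in Python as a sketch. It assumes the types are modelled as dataclasses, that `DocId` is a plain string, and that the check runs at import time or in CI rather than in a real compiler. `CitedAnswerV1` and `CitedAnswerV2` are hypothetical names for the type before and after the edit.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Answer:
    text: str
    evidence: List[str]   # DocId modelled as str (assumption)


@dataclass
class CitedAnswerV1:      # the prompt's output type before the edit
    text: str
    evidence: List[str]


@dataclass
class CitedAnswerV2:      # after the edit: evidence field removed
    text: str


def missing_fields(produced: type, required: type) -> List[str]:
    """Names of required fields that produced lacks or declares with another type."""
    have = {f.name: f.type for f in fields(produced)}
    return [f.name for f in fields(required) if have.get(f.name) != f.type]


def check_binding(produced: type, required: type) -> None:
    """Raise before any request is served if the prompt cannot satisfy the pipeline."""
    missing = missing_fields(produced, required)
    if missing:
        raise TypeError(
            f"expected {required.__name__}, found {produced.__name__}; "
            f"missing or mismatched fields: {', '.join(missing)}"
        )


check_binding(CitedAnswerV1, Answer)      # passes silently
try:
    check_binding(CitedAnswerV2, Answer)
except TypeError as err:
    print(err)
```

This is weaker than a compiler, since nothing forces the check to run. It does show that the failure is detectable from the declarations alone, without calling a model.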

The runtime that conventions can’t replace

A typed algebra closes. But LLM operations are stochastic and externally fragile — rate limits, timeouts, malformed extracts, hallucinated tool calls, models deprecated mid-run. An algebra without a runtime to handle these is a query language without a database engine.

The honest answer is supervision. Each operator runs in a supervised process. Failures are typed (:rate-limit, :timeout, :malformed-extract, :budget-exceeded). The supervisor’s restart strategy is part of the contract. New prompts and models load into the running pipeline; in-flight requests finish on the old, new requests bind to the new.
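The semantics of typed failures and a declared restart strategy can be sketched in Python, purely to show the shape of the idea. This is a single-process loop, not a supervision tree, and it has none of the isolation a real supervised process provides. The restart limits and the `make_flaky` helper are hypothetical.

```python
from enum import Enum
from typing import Callable, Dict, List


class Failure(Enum):
    RATE_LIMIT = "rate-limit"
    TIMEOUT = "timeout"
    MALFORMED_EXTRACT = "malformed-extract"
    BUDGET_EXCEEDED = "budget-exceeded"


class OperatorFailed(Exception):
    """Raised by an operator with a typed failure attached."""
    def __init__(self, failure: Failure):
        super().__init__(failure.value)
        self.failure = failure


# The restart strategy is data, declared alongside the contract (illustrative limits).
RESTARTS: Dict[Failure, int] = {
    Failure.RATE_LIMIT: 3,
    Failure.TIMEOUT: 2,
    Failure.MALFORMED_EXTRACT: 1,
    Failure.BUDGET_EXCEEDED: 0,   # retrying cannot help, so never restart
}


def supervise(operator: Callable[[], str],
              restarts: Dict[Failure, int] = RESTARTS) -> str:
    """Re-run the operator until it succeeds or a failure type exhausts its limit."""
    used = {failure: 0 for failure in Failure}
    while True:
        try:
            return operator()
        except OperatorFailed as err:
            if used[err.failure] >= restarts[err.failure]:
                raise                      # escalate to the caller
            used[err.failure] += 1


def make_flaky(failures: List[Failure]) -> Callable[[], str]:
    """Test helper: an operator that fails with the given sequence, then succeeds."""
    remaining = list(failures)
    def operator() -> str:
        if remaining:
            raise OperatorFailed(remaining.pop(0))
        return "ok"
    return operator


print(supervise(make_flaky([Failure.TIMEOUT, Failure.TIMEOUT])))  # ok
```

Because the strategy is a table rather than scattered `try`/`except` blocks, it can be reviewed, diffed, and checked like any other clause of the contract.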

The runtime that gives you process isolation, supervision trees, hot code reload, and transparent distribution out of the box is BEAM. The claim isn’t that BEAM is the right answer; it’s that BEAM, the virtual machine for Erlang, a language designed in the 1980s for telecom switches, fits this work better than anything built since. An LLM-ops runtime needs exactly what a telecom switch needed: many concurrent things, each able to fail without taking the others with it. The alternative is reinventing those properties as conventions on top of Python or TypeScript, and the ecosystem has been doing that for three years with predictable results.

Not a language. A contract.

If you noticed the syntax in the compile-time example (parentheses, colons, prefix forms), that wasn’t an incidental language choice. It was the point.

S-expressions because a contract is a tree, not prose. You read them as data, transform them as data, check them as data. A contract you can’t programmatically manipulate isn’t a contract — it’s documentation.

Typed because the whole proposition is structural verification. A contract whose fields can drift silently is not a contract. It’s a hope.

BEAM because the runtime is part of the contract. Failure semantics, supervision strategy, hot reload — these are not deployment concerns layered on after the fact. They are clauses of the contract, and they need a runtime that can honor them natively.

Put together: this isn’t a language you write code in. It’s a notation in which contracts are declared, type-checked, and bound to a runtime that already knows how to supervise them. That it executes is a consequence, not the point.
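The "read them as data" point can be shown with a very small reader, sketched here in Python. It handles only parenthesised forms such as the `defprompt` example (not the bracketed field lists in `deftype`), assumes well-formed input, and is in no way the actual reader of the notation.

```python
from typing import Dict, List, Union

Form = Union[str, List["Form"]]


def tokenize(source: str) -> List[str]:
    """Split an S-expression into parentheses and atoms."""
    return source.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str]) -> Form:
    """Consume tokens from the front and build a nested list (assumes balanced input)."""
    token = tokens.pop(0)
    if token != "(":
        return token
    node: List[Form] = []
    while tokens[0] != ")":
        node.append(parse(tokens))
    tokens.pop(0)   # drop the closing ")"
    return node


def options(form: List[Form]) -> Dict[str, Form]:
    """Collect the :keyword value pairs that follow a form's head and name."""
    tail = form[2:]
    return {
        tail[i]: tail[i + 1]
        for i in range(0, len(tail) - 1, 2)
        if isinstance(tail[i], str) and tail[i].startswith(":")
    }


SOURCE = """
(defprompt answer-with-citations
  :input  Question
  :output CitedAnswer)
"""

form = parse(tokenize(SOURCE))
print(form)            # ['defprompt', 'answer-with-citations', ':input', 'Question', ':output', 'CitedAnswer']
print(options(form))   # {':input': 'Question', ':output': 'CitedAnswer'}
```

Once a contract is a tree like this, checking it or rewriting it is ordinary list manipulation, which is the sense in which it is more than documentation.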

What’s still open

Three things are not solved, and the credibility of any layer like this depends on saying which.

Calibration. A confidence of 0.85 from one model on one task is not comparable to 0.85 from another. Addressable — temperature scaling, eval-derived calibration curves stored in the catalog — but the runtime has to do the work, and today none does.

The optimizer. Without decidable equivalence, cost-based optimization is closer to learned database optimization (Bao, Neo) than to Selinger. v1 ships rule-based — route short prompts to small models, prefer cached plans — with learned refinement on top.

The catalog. Provider-published specs, community benchmarks, per-deployment observations: three sources, weighted by recency. Getting providers to populate it honestly is a standards problem. Standards problems take a decade. Which is exactly why the layer needs defining now.
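As one possible reading of "weighted by recency", the sketch below blends quality observations from the three sources using exponential decay. The decay form, the 90-day half-life, and the sample numbers are assumptions for illustration. The text above does not specify a weighting scheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Observation:
    source: str       # "provider-spec", "community-benchmark", or "deployment"
    quality: float    # reported or measured quality in [0, 1]
    age_days: float   # how old the observation is


def recency_weight(age_days: float, half_life_days: float = 90.0) -> float:
    """Weight halves every half_life_days (assumed decay form)."""
    return 0.5 ** (age_days / half_life_days)


def blended_quality(observations: List[Observation],
                    half_life_days: float = 90.0) -> float:
    """Recency-weighted mean of the observed quality values."""
    if not observations:
        raise ValueError("no observations for this model and task")
    weights = [recency_weight(o.age_days, half_life_days) for o in observations]
    weighted_sum = sum(w * o.quality for w, o in zip(weights, observations))
    return weighted_sum / sum(weights)


observations = [
    Observation("provider-spec", quality=0.95, age_days=180),       # weight 0.25
    Observation("community-benchmark", quality=0.90, age_days=90),  # weight 0.5
    Observation("deployment", quality=0.80, age_days=0),            # weight 1.0
]

# (0.25*0.95 + 0.5*0.90 + 1.0*0.80) / 1.75 = 0.85
print(round(blended_quality(observations), 4))  # 0.85
```

With these weights, a fresh local observation outweighs an older published figure, so an optimistic provider spec gets corrected by traffic over time. Whether recency alone is the right weighting, as opposed to also weighting by source trust, is left open here.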

Everything above is what the layer has to do. None of it exists today as a coherent system.

The bet

The contract outlives the model that ran it the first time, and the model after that, and the framework that hosted it.

That is the bet that justifies the cost of a new layer. It is also the only bet that does.


This is the direction I’m taking Sykli: not another pipeline framework, but a typed execution substrate where task contracts, model bindings, supervision, and runtime behavior are part of the same system. Early, working core; honest about the gaps — github.com/false-systems/sykli. Sparked by a conversation with Hans Hübner and Manuel Odendahl in Berlin.