[02] · Settled (TripFix) · Sept 2025 — May 2026

TripFix — autonomous flight-claim co-pilot

An AI co-pilot for flight-delay refund claims. Reads boarding passes and airline emails, drafts the rebuttal letter, escalates only when uncertain. Built as a small team of specialised agents — not one monolithic prompt — so each piece is testable, swappable, and auditable on its own.

Anthropic Agents SDK | Cursor Cloud Agent API | Anthropic Vision | Langfuse | PostgreSQL | Redis · Horizon | Laravel 10 · Livewire | Python · FastAPI

The problem

Most flight-delay claims fail not because the airline is right, but because the passenger can't assemble a tight, well-cited case in under thirty minutes. We wanted to do that for them — without lying, hallucinating, or auto-submitting nonsense.

[01]

Architecture

A small team of agents, not one big prompt.

Early on, the obvious move was “one model, one prompt, one pass.” That breaks the moment the model gets a passport photo upside down or meets an email format it has never seen before. I split the work into specialised agents: one reads documents, one drafts the rebuttal, one decides when to escalate to a human, and one judges the output.

The agents talk through a typed event bus, not a tangled chain. Each one is testable in isolation, and any one of them can be swapped (or quietly rolled back) without touching the others.

Document agent: reads, classifies, and extracts structured data
Drafting agent: composes the airline rebuttal with citations
Routing agent: decides what a human should see
Judge agent: scores the output before it’s sent
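
To make the typed event bus concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the production TripFix code: the event classes (`DocumentExtracted`, `DraftReady`), the `EventBus` class, and the stand-in drafting agent are all hypothetical names invented for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List, Type


@dataclass(frozen=True)
class DocumentExtracted:
    """Hypothetical event: the document agent finished reading a file."""
    claim_id: str
    fields: dict


@dataclass(frozen=True)
class DraftReady:
    """Hypothetical event: the drafting agent produced a rebuttal."""
    claim_id: str
    letter: str


class EventBus:
    """Dispatches each event to the handlers registered for its exact type."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[Type, List[Callable]] = defaultdict(list)

    def subscribe(self, event_type: Type, handler: Callable) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: object) -> None:
        # Copy the list so a handler may subscribe during dispatch.
        for handler in list(self._handlers[type(event)]):
            handler(event)


def wire_drafting_agent(bus: EventBus) -> None:
    """Stand-in drafting agent: reacts to extracted data, emits a draft."""
    def on_extracted(event: DocumentExtracted) -> None:
        letter = f"Re: booking {event.fields['booking_ref']}"
        bus.publish(DraftReady(claim_id=event.claim_id, letter=letter))

    bus.subscribe(DocumentExtracted, on_extracted)


bus = EventBus()
drafts: List[DraftReady] = []
wire_drafting_agent(bus)
bus.subscribe(DraftReady, drafts.append)  # e.g. the judge agent would listen here
bus.publish(DocumentExtracted(claim_id="c-1", fields={"booking_ref": "ABC123"}))
```

Because each agent only knows the event types it consumes and emits, a test can publish a single `DocumentExtracted` event and assert on the resulting `DraftReady` without running any other agent.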

[02]

Design decision

Deterministic citations or the agent stays silent.

Letters that quote the wrong booking reference are worse than no letter at all: they break trust instantly. So the drafting agent cannot output a claim without a citation handle that maps back to a span in a real document.

If the agent wants to say “the airline confirmed the cancellation at 14:32,” it must point to the line in the email where that confirmation lives. The UI renders those citations as clickable chips; the eval harness checks every one.
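
One way to enforce this rule is to check every citation handle against the source text and drop any claim that fails. The sketch below assumes a handle is a document ID plus a character span and the quoted text; the `Citation` and `Claim` shapes and the function names are hypothetical, chosen only to illustrate the check.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation handle: a character span in a source document."""
    doc_id: str
    start: int
    end: int
    quote: str


@dataclass(frozen=True)
class Claim:
    text: str
    citations: List[Citation]


def is_grounded(citation: Citation, documents: Dict[str, str]) -> bool:
    """True only if the span exists and contains exactly the quoted text."""
    source = documents.get(citation.doc_id)
    if source is None:
        return False
    if not (0 <= citation.start < citation.end <= len(source)):
        return False
    return source[citation.start:citation.end] == citation.quote


def filter_claims(claims: List[Claim], documents: Dict[str, str]) -> List[Claim]:
    """Keep a claim only if it has citations and every one of them resolves."""
    return [
        c for c in claims
        if c.citations and all(is_grounded(x, documents) for x in c.citations)
    ]


email = "Your flight was cancelled at 14:32 on the day of departure."
documents = {"email-1": email}
quote = "cancelled at 14:32"
start = email.index(quote)

good = Claim(
    text="The airline confirmed the cancellation at 14:32.",
    citations=[Citation("email-1", start, start + len(quote), quote)],
)
uncited = Claim(text="The airline admitted fault.", citations=[])

kept = filter_claims([good, uncited], documents)  # only `good` survives
```

The check is deliberately exact-match: a handle that points at the wrong span fails the same way as a missing one, so the agent "stays silent" on that claim.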

[03]

Quality gate

Five judges grade every output before it leaves the lab.

Every prompt change goes through a five-evaluator harness that scores hallucination, citation grounding, tone, completeness, and refusal handling. If the new prompt regresses on any dimension, the deploy is blocked.
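
The gate logic can be sketched in a few lines. This is an assumed shape, not the actual harness: it supposes each evaluator returns a score where higher is better, that both score sets cover all five dimensions, and that the names `regressions` and `can_deploy` are illustrative.

```python
from typing import Dict, List

# The five dimensions named above; higher-is-better scoring is an assumption.
DIMENSIONS = (
    "hallucination",
    "citation_grounding",
    "tone",
    "completeness",
    "refusal_handling",
)


def regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return the dimensions where the candidate scores below the baseline."""
    return [d for d in DIMENSIONS if candidate[d] < baseline[d] - tolerance]


def can_deploy(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> bool:
    """A single regressed dimension is enough to block the deploy."""
    return not regressions(baseline, candidate, tolerance)


baseline = {d: 0.90 for d in DIMENSIONS}
candidate = dict(baseline, tone=0.85, completeness=0.95)

blocked_on = regressions(baseline, candidate)  # the tone score dropped
ok = can_deploy(baseline, candidate)
```

Note that the improvement on completeness does not offset the drop on tone: the gate compares each dimension independently rather than averaging them.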

This is the only reason one engineer can iterate on prompts daily at production scale without quietly breaking the product.

[04]

Operating principle

Build for the operator first, not the model.

Agentic systems live or die by how fast a human can audit a bad run. Every agent at TripFix writes to a timeline that surfaces its prompt, its thinking, every tool it called, and the answer. If an ops human can’t debug a run in 60 seconds, we don’t call it production-ready.
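
A timeline entry of this kind might look like the following sketch. The field names mirror the four things listed above (prompt, thinking, tool calls, answer); the class names, the in-memory storage, and the JSON dump are assumptions made for illustration, not a description of the real TripFix store.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class ToolCall:
    name: str
    arguments: Dict[str, Any]
    result: str


@dataclass
class TimelineEntry:
    """One agent step: what it was asked, what it did, what it answered."""
    run_id: str
    agent: str
    prompt: str
    thinking: str
    tool_calls: List[ToolCall]
    answer: str
    timestamp: float = field(default_factory=time.time)


class Timeline:
    """In-memory stand-in for whatever store the operator UI reads from."""

    def __init__(self) -> None:
        self._entries: List[TimelineEntry] = []

    def record(self, entry: TimelineEntry) -> None:
        self._entries.append(entry)

    def for_run(self, run_id: str) -> List[TimelineEntry]:
        """All entries for one run, oldest first."""
        matching = (e for e in self._entries if e.run_id == run_id)
        return sorted(matching, key=lambda e: e.timestamp)

    def dump(self, run_id: str) -> str:
        """Serialise a run so an operator can read it top to bottom."""
        return json.dumps([asdict(e) for e in self.for_run(run_id)], indent=2)


timeline = Timeline()
timeline.record(TimelineEntry(
    run_id="r-1", agent="drafting", prompt="Draft the rebuttal.",
    thinking="Delay is documented in the email.", tool_calls=[],
    answer="Dear airline, ...", timestamp=2.0,
))
timeline.record(TimelineEntry(
    run_id="r-1", agent="document", prompt="Extract fields.",
    thinking="Looks like a boarding pass.",
    tool_calls=[ToolCall("ocr", {"page": 1}, "ABC123")],
    answer="booking_ref=ABC123", timestamp=1.0,
))
ordered_agents = [e.agent for e in timeline.for_run("r-1")]
```

Keeping every step of a run in one ordered record is what lets an operator replay a bad run quickly instead of stitching together logs from four agents.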

[05]

Scale handled

Multi-model routing across fourteen LLMs.

No single model is the best at everything — or the cheapest. The router picks the right LLM for each sub-task based on latency, cost, and historical eval scores. Frontier-class for the rebuttal draft; smaller models for classification and routing.
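
As a rough sketch of such a router, one simple policy is "the cheapest model that clears a quality bar within a latency budget". That policy, the `ModelProfile` fields, and the model names and numbers below are all hypothetical; the real router may weigh the three signals differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelProfile:
    name: str
    p95_latency_s: float       # observed latency for this sub-task
    cost_per_1k_tokens: float  # in any consistent currency unit
    eval_score: float          # historical eval score for this sub-task, 0..1


def pick_model(
    candidates: List[ModelProfile],
    min_score: float,
    max_latency_s: float,
) -> Optional[ModelProfile]:
    """Cheapest model that clears the quality bar and the latency budget."""
    eligible = [
        m for m in candidates
        if m.eval_score >= min_score and m.p95_latency_s <= max_latency_s
    ]
    if not eligible:
        return None  # caller decides: relax the bar or escalate to a human
    # Ties on cost are broken in favour of the higher eval score.
    return min(eligible, key=lambda m: (m.cost_per_1k_tokens, -m.eval_score))


# Made-up profiles, for illustration only.
models = [
    ModelProfile("frontier-large", 4.0, 15.0, 0.95),
    ModelProfile("mid-size", 1.5, 3.0, 0.88),
    ModelProfile("small-fast", 0.4, 0.25, 0.82),
]

draft_model = pick_model(models, min_score=0.90, max_latency_s=10.0)
classify_model = pick_model(models, min_score=0.80, max_latency_s=1.0)
```

With these numbers the drafting task lands on the frontier-class model and classification lands on the small one, matching the split described above. Removing any one vendor's models from `candidates` leaves the function working on whatever remains.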

Result: production cost per claim stayed flat as the system did more work, and a single-vendor outage is no longer existential.

The takeaway

“Production agents aren't a model choice — they're a discipline. Cite or stay quiet, judge before you deploy, and build the timeline before you build the prompt.”
