[01] · Selected work

Production agents,
shipped solo & in flight.

Four pieces of work that show how I think about agentic systems — from the giant end-to-end product down to the single building blocks that hold it up.

[01] The portfolio

Four projects.
One throughline: ship it, then prove it.

[01]Settled (TripFix)·
shipped

TripFix — autonomous flight-claim co-pilot

An AI co-pilot for flight-delay refund claims. Reads boarding passes and airline emails, drafts the rebuttal letter, escalates only when uncertain. Built as a small team of specialised agents — not one monolithic prompt — so each piece is testable, swappable, and auditable on its own.
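
The "escalate only when uncertain" behaviour can be pictured as a chain of small agents with a confidence gate between steps. This is a minimal sketch, not TripFix's actual code: the `StepResult` type, the `run_pipeline` function, the stand-in agents, and the 0.8 threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    output: str
    confidence: float  # assumed: each agent reports a score in [0.0, 1.0]


# Hypothetical agent signature: text in, result plus confidence out.
Agent = Callable[[str], StepResult]


def run_pipeline(agents: list[tuple[str, Agent]], claim: str,
                 threshold: float = 0.8) -> dict:
    """Run agents in order; hand off to a human at the first uncertain step."""
    text = claim
    trace = []
    for name, agent in agents:
        result = agent(text)
        trace.append((name, result.confidence))
        if result.confidence < threshold:
            return {"status": "escalated", "at": name, "trace": trace}
        text = result.output
    return {"status": "drafted", "letter": text, "trace": trace}


# Stand-in agents; real ones would call a model.
def extract_facts(text: str) -> StepResult:
    return StepResult(f"facts({text})", 0.95)


def draft_letter(text: str) -> StepResult:
    return StepResult(f"letter({text})", 0.90)


def unsure_reader(text: str) -> StepResult:
    return StepResult("", 0.40)
```

Because every agent shares one signature, each can be tested or swapped in isolation, and the trace records which step triggered an escalation.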

LLMs orchestrated

14+

Eval dimensions

5

Citation grounding

deterministic

Headcount in AI

1

  • Multi-agent systems
  • Eval harnesses
  • Vision reasoning
[02]Settled (TripFix)·
shipped

Cursor Cloud Agent v1 — conversation timeline rebuild

A flight recorder for cloud agents. Stitches prompt, thinking, and tool calls into a single replayable timeline — so any agent run is auditable in under a minute. Design call: optimise for the operator first, not the model.
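
Stitching the three streams into one timeline amounts to a merge by timestamp. The sketch below assumes each stream arrives already ordered by time and that every event carries a `ts` field; the `Event` shape and the `build_timeline` name are hypothetical, not the project's real schema.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ts: float      # assumed: seconds since the run started
    kind: str      # "prompt", "thinking", or "tool_call"
    payload: str


def build_timeline(prompts, thinking, tool_calls):
    """Merge three time-ordered streams into a single replayable list."""
    # heapq.merge requires each input to be sorted by the same key.
    return list(heapq.merge(prompts, thinking, tool_calls,
                            key=lambda e: e.ts))


prompts = [Event(0.0, "prompt", "fix the failing test")]
thinking = [Event(0.4, "thinking", "look at the stack trace"),
            Event(2.1, "thinking", "patch the off-by-one")]
tool_calls = [Event(1.2, "tool_call", "run_tests()"),
              Event(2.8, "tool_call", "apply_patch()")]

timeline = build_timeline(prompts, thinking, tool_calls)
```

Replaying a run is then a single pass over `timeline`, which keeps the operator's view independent of how each stream was produced.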

Stream types unified

3

Replay fidelity

100%

PRs merged solo

9

  • Agent observability
  • Tool-use traces
[03]Settled (TripFix)·
in flight

Agentic preparation checklist

An agent that reads a case and figures out what’s missing. Instead of one giant ‘knows everything’ prompt, it loads short markdown skills on demand for the stage it’s in. Cheaper inference, sharper answers, knowledge anyone on the team can edit in a text file.
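
Loading skills on demand can be as simple as reading one markdown file per stage and placing only that file in the prompt. This is a sketch under assumptions: the one-file-per-stage layout, the `load_skill` and `build_prompt` helpers, and the prompt wording are all illustrative.

```python
from pathlib import Path


def load_skill(skills_dir: Path, stage: str) -> str:
    """Return the markdown skill for a stage, or an empty string if none exists."""
    path = skills_dir / f"{stage}.md"  # assumed layout: one file per stage
    return path.read_text(encoding="utf-8") if path.exists() else ""


def build_prompt(skills_dir: Path, stage: str, case_summary: str) -> str:
    """Assemble a prompt that carries only the current stage's skill."""
    parts = ["You review a case and list what is missing."]
    skill = load_skill(skills_dir, stage)
    if skill:
        parts.append(f"## Skill: {stage}\n{skill.strip()}")
    parts.append(f"## Case\n{case_summary}")
    return "\n\n".join(parts)
```

Only the active stage's text reaches the model, which keeps the prompt short, and editing a skill means editing a plain markdown file.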

Skills authored

12

Tools wired

7

Snapshot evals

passing

  • Agent design
  • Skills-as-prompts
[04]Settled (TripFix)·
shipped

LLM-as-judge evaluation framework

The quality bar for every AI change we ship. Five automatic judges grade each output on truth, sourcing, tone, completeness, and safety. If a new prompt scores worse than the live one, the deploy is blocked. This gate is the only reason daily prompt iteration is safe at production scale.
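
The deploy gate can be sketched as a comparison of mean judge scores per dimension. The five dimension names come from the description above; everything else is an assumption for illustration: the score range, the `tolerance` parameter, the function names, and the rule that a drop on any single dimension blocks the deploy.

```python
from statistics import mean

DIMENSIONS = ("truth", "sourcing", "tone", "completeness", "safety")


def mean_scores(samples):
    """Average each dimension over a list of per-sample score dicts."""
    return {d: mean(s[d] for s in samples) for d in DIMENSIONS}


def deploy_allowed(live_samples, candidate_samples, tolerance=0.0):
    """Return (allowed, regressions); block if any dimension gets worse."""
    live = mean_scores(live_samples)
    cand = mean_scores(candidate_samples)
    regressions = {
        d: (live[d], cand[d])
        for d in DIMENSIONS
        if cand[d] < live[d] - tolerance  # assumed rule: any drop blocks
    }
    return (not regressions), regressions


live = [{d: 0.8 for d in DIMENSIONS}]
candidate = [{**{d: 0.8 for d in DIMENSIONS}, "tone": 0.6}]
allowed, regressions = deploy_allowed(live, candidate)
```

Returning the regressed dimensions alongside the verdict tells the author of a blocked prompt which judge to look at first.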

Evaluators

5

Daily judged samples

hundreds

Regressions caught pre-deploy

many

  • Evals
  • Production safety