[01] · Selected work
Production agents,
shipped solo & in flight.
Four pieces of work that show how I think about agentic systems — from the giant end-to-end product down to the single building blocks that hold it up.
Four projects.
One throughline: ship it, then prove it.
TripFix — autonomous flight-claim co-pilot
An AI co-pilot for flight-delay refund claims. Reads boarding passes and airline emails, drafts the rebuttal letter, escalates only when uncertain. Built as a small team of specialised agents — not one monolithic prompt — so each piece is testable, swappable, and auditable on its own.
LLMs orchestrated
14+
Eval dimensions
5
Citation grounding
deterministic
Headcount in AI
1
- Multi-agent systems
- Eval harnesses
- Vision reasoning
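A minimal sketch of the "small team of specialised agents" idea, with escalation when a step is uncertain. Everything named here is hypothetical: the `StepResult` shape, the self-reported `confidence` field, the `0.8` threshold, and the stub agents standing in for real model calls are illustrative assumptions, not TripFix's actual code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    agent: str
    output: str
    confidence: float  # assumed: each agent reports a score between 0.0 and 1.0


Agent = Callable[[str], StepResult]


def run_pipeline(agents: list[Agent], claim_text: str, threshold: float = 0.8) -> dict:
    """Run agents in order; stop and escalate as soon as one is unsure."""
    trace = []
    payload = claim_text
    for agent in agents:
        result = agent(payload)
        trace.append(result)
        if result.confidence < threshold:
            # Hand off to a human instead of guessing.
            return {"status": "escalated", "stopped_at": result.agent, "trace": trace}
        payload = result.output  # the next agent consumes this agent's output
    return {"status": "drafted", "letter": payload, "trace": trace}


# Stub agents (hypothetical) standing in for real LLM calls.
def extract_agent(text: str) -> StepResult:
    return StepResult("extract", f"facts({text})", 0.95)


def draft_agent(facts: str) -> StepResult:
    return StepResult("draft", f"letter based on {facts}", 0.9)


def unsure_agent(text: str) -> StepResult:
    return StepResult("extract", "", 0.4)
```

Because each agent is a plain function with one input and one output, it can be tested or swapped on its own, and the returned `trace` keeps a per-step record for auditing.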
Cursor Cloud Agent v1 — conversation timeline rebuild
A flight recorder for cloud agents. Stitches prompt, thinking, and tool calls into a single replayable timeline — so any agent run is auditable in under a minute. Design call: optimise for the operator first, not the model.
Stream types unified
3
Replay fidelity
100%
PRs merged solo
9
- Agent observability
- Tool-use traces
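One way to picture the stitching step is as a merge of three time-ordered streams into a single list. This is a sketch under assumptions: the `Event` fields, the use of a numeric timestamp as the ordering key, and the `build_timeline` name are all hypothetical and not taken from the Cursor codebase.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ts: float   # assumed: seconds since the run started
    kind: str   # "prompt", "thinking", or "tool_call"
    body: str


def build_timeline(*streams):
    """Merge any number of event streams into one list ordered by timestamp."""
    # Sort each stream defensively; heapq.merge requires sorted inputs.
    ordered = [sorted(stream, key=lambda e: e.ts) for stream in streams]
    return list(heapq.merge(*ordered, key=lambda e: e.ts))


prompts = [Event(0.0, "prompt", "fix the failing test")]
thinking = [Event(1.0, "thinking", "look at the stack trace"),
            Event(3.0, "thinking", "test passes now")]
tool_calls = [Event(2.0, "tool_call", "run test suite")]

timeline = build_timeline(prompts, thinking, tool_calls)
```

Replaying a run is then a matter of walking `timeline` from start to end, which keeps the operator's view independent of how each stream was originally stored.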
Agentic preparation checklist
An agent that reads a case and figures out what’s missing. Instead of one giant ‘knows everything’ prompt, it loads short markdown skills on demand for the stage it’s in. Cheaper inference, sharper answers, knowledge anyone on the team can edit in a text file.
Skills authored
12
Tools wired
7
Snapshot evals
passing
- Agent design
- Skills-as-prompts
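Loading skills on demand can be sketched as reading only the markdown files for the current stage and appending them to a short base prompt. The directory layout (`skills/<stage>/*.md`), the function names, and the prompt format below are assumptions made for illustration.

```python
from pathlib import Path


def load_skills(skills_dir: Path, stage: str) -> str:
    """Return the text of every markdown skill for one stage, in filename order."""
    stage_dir = skills_dir / stage
    if not stage_dir.is_dir():
        return ""  # unknown stage: no extra knowledge is loaded
    parts = [p.read_text(encoding="utf-8").strip()
             for p in sorted(stage_dir.glob("*.md"))]
    return "\n\n".join(parts)


def build_prompt(base: str, skills_dir: Path, stage: str, case_text: str) -> str:
    """Assemble a prompt from the base instructions, stage skills, and the case."""
    skills = load_skills(skills_dir, stage)
    sections = (base, skills, f"Case:\n{case_text}")
    return "\n\n".join(s for s in sections if s)
```

Only the skills for the active stage reach the model, which is where the inference savings come from, and editing a skill means editing a text file rather than code.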
LLM-as-judge evaluation framework
The quality bar for every AI change we ship. Five automatic judges grade each output on truth, sourcing, tone, completeness, and safety. If a new prompt scores worse than the live one, the deploy is blocked. It is the only reason daily prompt iteration is safe at production scale.
Evaluators
5
Daily judged samples
hundreds
Regressions caught pre-deploy
many
- Evals
- Production safety
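The deploy gate can be sketched as a comparison of judge scores between the live prompt and a candidate. The five dimension names come from the description above; the numeric score scale, the per-dimension mean, the `tolerance` parameter, and the rule that a drop on any single dimension blocks the deploy are assumptions chosen for illustration.

```python
from statistics import mean

DIMENSIONS = ("truth", "sourcing", "tone", "completeness", "safety")


def dimension_means(samples):
    """Average each judge's score across a batch of graded outputs."""
    return {d: mean(s[d] for s in samples) for d in DIMENSIONS}


def gate(live_samples, candidate_samples, tolerance=0.0):
    """Block the deploy if the candidate scores worse on any dimension."""
    live = dimension_means(live_samples)
    cand = dimension_means(candidate_samples)
    regressions = {
        d: (live[d], cand[d])
        for d in DIMENSIONS
        if cand[d] < live[d] - tolerance  # assumed rule: any drop is a regression
    }
    return {"deploy": not regressions, "regressions": regressions}


live = [{"truth": 4, "sourcing": 4, "tone": 4, "completeness": 4, "safety": 5}] * 2
better = [{"truth": 4, "sourcing": 4, "tone": 5, "completeness": 4, "safety": 5}] * 2
worse = [{"truth": 4, "sourcing": 4, "tone": 5, "completeness": 4, "safety": 3}] * 2
```

Returning the regressed dimensions alongside the decision makes a blocked deploy actionable: the author sees which judge objected and by how much.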