[04] · Settled (TripFix) · Nov 2025 — ongoing

LLM-as-judge evaluation framework

The quality bar for every AI change we ship. Five automatic judges grade each output on truth, sourcing, tone, completeness, and safety. If a new prompt scores worse than the live one, the deploy is blocked. That gate is what makes daily prompt iteration safe at production scale.

Langfuse · Anthropic · OpenAI · PostgreSQL · Slack alerting

The problem

Vibes don't ship. A prompt change that “feels better” but quietly regresses citation grounding will hurt real users. Before any model or prompt update goes live, it has to clear a measurable bar.
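
One way to make that bar concrete is to compare judge scores for the candidate prompt against the live one on the same eval set. The sketch below is an assumption about how such a gate could look, not the production code: the dimension keys, the `tolerance` parameter, and the choice to compare per-dimension means are all hypothetical.

```python
from statistics import mean

# Hypothetical keys, named after the five judges described in this case study.
DIMENSIONS = (
    "hallucination",
    "citation_grounding",
    "tone",
    "completeness",
    "refusal_handling",
)


def regressed_dimensions(live, candidate, tolerance=0.0):
    """Return dimensions where the candidate's mean score is below the live mean.

    `live` and `candidate` map each dimension to a list of judge scores
    collected on the same eval set. `tolerance` is an assumed noise allowance.
    """
    regressed = []
    for dim in DIMENSIONS:
        if mean(candidate[dim]) < mean(live[dim]) - tolerance:
            regressed.append(dim)
    return regressed


def deploy_allowed(live, candidate, tolerance=0.0):
    """Allow the deploy only if no dimension regressed (assumed gating rule)."""
    return not regressed_dimensions(live, candidate, tolerance)
```

A CI step could call `deploy_allowed` and exit non-zero when it returns `False`, printing the regressed dimensions so the author knows which judge to look at.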

[01]

Architecture

Five judges, one verdict per output.

Each AI output is graded by five LLM judges — one each for hallucination, citation grounding, tone, completeness, and refusal handling. Each returns a numeric score and a short rationale.

An output that fails exactly one dimension is a soft fail; one that fails two or more is a hard fail. The aggregate verdict gates the deploy.
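
The aggregation rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `JudgeResult` type, the 0 to 1 score scale, and the `PASS_THRESHOLD` value are hypothetical, and failing more than two dimensions is treated the same as failing two.

```python
from dataclasses import dataclass

# Assumed pass mark on an assumed 0-1 scale; the real threshold is not stated.
PASS_THRESHOLD = 0.7


@dataclass(frozen=True)
class JudgeResult:
    dimension: str   # e.g. "hallucination", "tone"
    score: float     # numeric score returned by the judge
    rationale: str   # short explanation returned by the judge


def verdict(results, threshold=PASS_THRESHOLD):
    """Collapse per-dimension judge results into one verdict for an output."""
    failed = [r.dimension for r in results if r.score < threshold]
    if not failed:
        return "pass"
    if len(failed) == 1:
        return "soft_fail"
    return "hard_fail"  # two or more failed dimensions
```

Keeping the rationale next to the score means a failed verdict can be traced back to the judge's own explanation without re-running it.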

[02]

Pipeline

Langfuse for traces, Slack for the alert.

Every production run lands in Langfuse with prompts, model outputs, and judge scores attached. A scheduled job samples runs, re-grades them, and posts daily quality digests to a Slack channel so the team sees regressions before customers do.

  • Langfuse traces for every prompt version
  • Daily judge sweep over hundreds of samples
  • Slack digest with regressions flagged at the top
  • Snapshot evals run in CI on every PR
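
The daily sweep above (sample, re-grade, digest) can be sketched without touching the Langfuse or Slack APIs. Everything here is an assumption for illustration: the run shape (`{"scores": {...}}`), the function names, and the `tolerance` used to decide what counts as a regression. Fetching traces and posting the message are left out.

```python
import random
from statistics import mean


def sample_runs(runs, k, seed=None):
    """Pick up to k runs at random for re-grading."""
    rng = random.Random(seed)
    return rng.sample(runs, min(k, len(runs)))


def dimension_means(runs):
    """Average judge scores per dimension across the sampled runs."""
    by_dim = {}
    for run in runs:
        for dim, score in run["scores"].items():
            by_dim.setdefault(dim, []).append(score)
    return {dim: mean(scores) for dim, scores in by_dim.items()}


def build_digest(today, baseline, tolerance=0.02):
    """Format one line per dimension, with regressions listed first."""
    rows = []
    for dim in sorted(today):
        delta = today[dim] - baseline.get(dim, today[dim])
        regressed = delta < -tolerance
        flag = "REGRESSION" if regressed else "ok"
        text = f"[{flag}] {dim}: {today[dim]:.2f} ({delta:+.2f})"
        rows.append((not regressed, dim, text))
    # False sorts before True, so regressed dimensions come out on top.
    rows.sort(key=lambda row: (row[0], row[1]))
    return "\n".join(text for _, _, text in rows)
```

The string returned by `build_digest` is what a scheduled job would hand to whatever client posts to the Slack channel.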

[03]

Discipline

Write the eval before you write the prompt.

TDD for prompts. If I can’t articulate the test case for what “good” looks like, the prompt isn’t ready to write yet. Every PR that changes a prompt also has to touch the eval suite.

If you can’t score the output, you can’t deploy the model.
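
The rule that every prompt PR also touches the eval suite is easy to enforce mechanically. A minimal sketch, assuming prompts live under `prompts/` and evals under `evals/` (both directory names are hypothetical):

```python
def prompt_change_without_eval(changed_files, prompt_dir="prompts/", eval_dir="evals/"):
    """Return True when a change set edits a prompt but no eval file.

    `changed_files` is a list of repository-relative paths. The default
    directory names are assumptions, not the project's real layout.
    """
    touches_prompt = any(path.startswith(prompt_dir) for path in changed_files)
    touches_eval = any(path.startswith(eval_dir) for path in changed_files)
    return touches_prompt and not touches_eval
```

In CI, the file list could come from `git diff --name-only` against the base branch, with the job failing whenever the function returns `True`.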

The takeaway

The eval harness is the only piece of infrastructure that lets one engineer iterate on prompts daily at production scale without quietly breaking the product.
