[01]
Architecture
Five judges, one verdict per output.
Each AI output is graded by five LLM judges — one each for hallucination, citation grounding, tone, completeness, and refusal handling. Each returns a numeric score and a short rationale.
A single output failing any one dimension is a soft fail. Failing two is a hard fail. The aggregate verdict gates the deploy.