Evaluation
Building eval sets, metrics, and LLM-as-judge for reliable systems
Overview
Evaluation turns "it looks good" into measurable quality. Because LLM outputs are non-deterministic and open-ended, you need a fixed dataset of inputs with expected properties, plus metrics suited to the task. Approaches range from exact/heuristic checks (for structured output) to LLM-as-judge scoring (for open-ended answers). Good evals catch regressions when you change prompts, models, or retrieval.
Syntax / Usage
Start with a versioned dataset of cases and a scoring function. For deterministic tasks like classification or JSON extraction, exact-match assertions are usually enough.
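import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Fixed eval cases; keep this list under version control.
DATASET = [
    {"input": "charged twice", "expected": "billing"},
    {"input": "reset link expired", "expected": "account"},
]

def classify(text: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # reduce run-to-run variance between eval runs
        messages=[
            {"role": "system", "content": "Return only the category word."},
            {"role": "user", "content": text},
        ],
    )
    # content can be None, so fall back to an empty string before normalizing
    return (r.choices[0].message.content or "").strip().lower()

def run_eval() -> float:
    # True counts as 1, so the sum is the number of exact matches
    correct = sum(classify(c["input"]) == c["expected"] for c in DATASET)
    return correct / len(DATASET)  # accuracy

print(f"accuracy: {run_eval():.0%}")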
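The classifier above is scored by exact match on a single word. For JSON extraction, the same idea can be applied field by field. The following is a minimal sketch, assuming a hypothetical `check_extraction` helper and illustrative field names (`order_id`, `amount`); neither appears in the original text.

```python
import json


def check_extraction(raw_output: str, expected: dict) -> dict:
    """Compare a model's JSON output with the expected fields.

    Hypothetical helper: reports whether the output parsed to a JSON
    object, and which expected fields were missing or had a wrong value.
    """
    try:
        parsed = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        parsed = None
    if not isinstance(parsed, dict):
        # Unparseable or non-object output: every expected field is missing
        return {"is_object": False, "missing": sorted(expected), "wrong": []}

    missing = sorted(k for k in expected if k not in parsed)
    wrong = sorted(k for k in expected if k in parsed and parsed[k] != expected[k])
    return {"is_object": True, "missing": missing, "wrong": wrong}


def passed(result: dict) -> bool:
    """A case passes only if the output is an object with every field correct."""
    return result["is_object"] and not result["missing"] and not result["wrong"]
```

Reporting missing and wrong fields separately, rather than a single pass/fail, tends to make failures easier to triage when a prompt change breaks only one field.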
Examples
For open-ended answers, an LLM judge scores dimensions like faithfulness and relevance against a rubric. Use a strong model as the judge and request JSON output so the numeric scores can be parsed:
def judge(question: str, answer: str, context: str) -> dict:
    """Score one answer with an LLM judge; returns the parsed JSON scores."""
    rubric = (
        "Score 1-5 for FAITHFULNESS (supported by context) and "
        "RELEVANCE (answers the question). Return JSON: "
        '{"faithfulness": n, "relevance": n}.'
    )
    r = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": rubric},
            {
                "role": "user",
                "content": f"Q: {question}\nContext: {context}\nAnswer: {answer}",
            },
        ],
        # JSON mode; the rubric mentions JSON, which this mode requires
        response_format={"type": "json_object"},
    )
    return json.loads(r.choices[0].message.content)
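Even in JSON mode, a judge can return missing keys or out-of-range values, so it may be worth validating the result before averaging it. The following is a minimal sketch, assuming a hypothetical `parse_scores` helper that takes the raw judge string; the two keys and the 1-5 range follow the rubric above.

```python
import json
from typing import Optional

DIMENSIONS = ("faithfulness", "relevance")


def parse_scores(raw: Optional[str], low: int = 1, high: int = 5) -> Optional[dict]:
    """Return validated integer scores, or None if the judge output is unusable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # TypeError covers raw being None
        return None
    if not isinstance(data, dict):
        return None

    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            return None
        if not low <= value <= high:
            return None
        scores[dim] = value
    return scores
```

Returning `None` instead of raising lets the caller count unusable judge responses as their own metric rather than silently dropping or crashing on them.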
Track metrics over time so prompt or model changes are compared on the same cases:
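def compare(variant_fn) -> dict:
    # variant_fn maps an input string to the answer produced by one variant.
    # Cases without a "context" key are judged against an empty context,
    # so faithfulness is only meaningful for cases that supply one.
    scores = [
        judge(c["input"], variant_fn(c["input"]), c.get("context", ""))
        for c in DATASET
    ]
    avg = sum(s["faithfulness"] for s in scores) / len(scores)
    return {"avg_faithfulness": round(avg, 2)}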
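To compare runs over time, the summary returned by `compare` could be checked against a stored baseline. The following is a minimal sketch, assuming a hypothetical `find_regressions` helper; the default tolerance of 0.1 is chosen only for illustration and would need tuning against the judge's run-to-run noise.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.1) -> dict:
    """Return metrics whose score dropped by more than `tolerance`.

    Values in the result are the change (current minus baseline).
    A metric missing from `current` is reported with a value of None.
    """
    regressions = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is None:
            regressions[name] = None
        elif old - new > tolerance:
            regressions[name] = round(new - old, 4)
    return regressions
```

In a CI job, a non-empty result could be used to fail the build, for example by printing the regressions and exiting with a non-zero status.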
Common Mistakes
- Eyeballing a few outputs instead of scoring a fixed, versioned dataset
- Testing only happy paths, ignoring adversarial and edge-case inputs
- Trusting a single LLM-judge run without spot-checking against humans
- Changing prompt, model, and data at once, so you can't attribute changes
- No regression gate in CI, letting quality silently drift between releases
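Spot-checking a judge against humans can be as simple as scoring a small sample both ways and measuring how often the two agree. The following is a minimal sketch, assuming paired 1-5 scores for the same cases and a hypothetical `judge_agreement` helper; what counts as acceptable agreement is left to the reader.

```python
def judge_agreement(human: list[int], judge: list[int], within: int = 1) -> dict:
    """Summarize agreement between human and LLM-judge scores on the same cases."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")

    # Absolute difference per case
    diffs = [abs(h - j) for h, j in zip(human, judge)]
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,
        "within_tolerance": sum(d <= within for d in diffs) / n,
        "mean_abs_diff": sum(diffs) / n,
    }
```

Low agreement usually points to an ambiguous rubric or a judge prompt that needs revising before its scores are trusted in a regression gate.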
See Also
large-language-models prompt-engineering ai-rag-advanced