Evaluating LLM Applications

Building eval sets, metrics, and LLM-as-judge for reliable systems

Overview

Evaluation turns "it looks good" into measurable quality. Because LLM outputs are non-deterministic and open-ended, you need a fixed dataset of inputs with expected properties, plus metrics suited to the task. Approaches range from exact/heuristic checks (for structured output) to LLM-as-judge scoring (for open-ended answers). Good evals catch regressions when you change prompts, models, or retrieval.

Syntax / Usage

Start with a versioned dataset of cases and a scoring function. For tasks with a single correct answer, such as classification or JSON extraction, exact-match assertions are usually enough.

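import json  # used by the judge example below
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Fixed eval cases; keep this under version control so scores stay comparable.
DATASET = [
    {"input": "charged twice", "expected": "billing"},
    {"input": "reset link expired", "expected": "account"},
]

# Allowed labels, named in the prompt so exact-match scoring is meaningful.
CATEGORIES = ("billing", "account")

def classify(text: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # reduce run-to-run variation
        messages=[{"role": "system",
                   "content": f"Classify the message as one of: {', '.join(CATEGORIES)}. "
                              "Return only the category word."},
                  {"role": "user", "content": text}],
    )
    # content can be None (for example on a refusal), so fall back to ""
    return (r.choices[0].message.content or "").strip().lower()

def run_eval() -> float:
    correct = sum(classify(c["input"]) == c["expected"] for c in DATASET)
    return correct / len(DATASET)  # accuracy

print(f"accuracy: {run_eval():.0%}")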
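The same exact-match idea carries over to JSON extraction. The following is a minimal sketch, not a definitive scorer: `score_extraction`, the `CASES` list, and the `order_id`/`amount` fields are hypothetical names chosen for illustration, and the model outputs are hard-coded so the example runs without an API call.

```python
import json

def score_extraction(raw_output: str, expected: dict) -> bool:
    """Return True when raw_output is a JSON object whose expected fields match exactly."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # malformed JSON counts as a failed case, not a crash
    if not isinstance(parsed, dict):
        return False
    # Only the fields listed in `expected` are compared; extra keys are ignored.
    return all(key in parsed and parsed[key] == value
               for key, value in expected.items())

# Hypothetical cases with hard-coded model outputs.
CASES = [
    {"output": '{"order_id": "A-1001", "amount": 42.5}',
     "expected": {"order_id": "A-1001", "amount": 42.5}},
    {"output": '{"order_id": "A-1002"}',  # missing field
     "expected": {"order_id": "A-1002", "amount": 10.0}},
    {"output": "Sure! Here is the JSON you asked for",  # not JSON at all
     "expected": {"order_id": "A-1003"}},
]

passed = sum(score_extraction(c["output"], c["expected"]) for c in CASES)
print(f"extraction pass rate: {passed}/{len(CASES)}")  # prints: extraction pass rate: 1/3
```

Treating unparseable output as a failure rather than an exception keeps one bad response from aborting the whole eval run.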

Examples

For open-ended answers, an LLM judge scores dimensions like faithfulness and relevance against a rubric. Use a strong model and force numeric output:

def judge(question: str, answer: str, context: str) -> dict:
    # The rubric names both dimensions and the exact JSON shape to return.
    rubric = (
        "Score 1-5 for FAITHFULNESS (supported by context) and "
        "RELEVANCE (answers the question). Return JSON: "
        '{"faithfulness": n, "relevance": n}, where n is an integer from 1 to 5.'
    )
    r = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # reduce run-to-run variation in scores
        messages=[{"role": "system", "content": rubric},
                  {"role": "user",
                   "content": f"Q: {question}\nContext: {context}\nAnswer: {answer}"}],
        response_format={"type": "json_object"},  # JSON mode; the prompt must mention JSON
    )
    return json.loads(r.choices[0].message.content)
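JSON mode makes the reply parseable, but it does not guarantee that the expected keys are present or that the scores fall in range. One possible guard is sketched below; `parse_judge_scores` and `DIMENSIONS` are hypothetical names, and rejecting non-integer scores is an assumption about how strict you want to be.

```python
import json

# Dimensions the rubric asks for (assumed to match the judge prompt above).
DIMENSIONS = ("faithfulness", "relevance")

def parse_judge_scores(raw: str) -> dict:
    """Parse a judge reply into integer scores, raising ValueError on anything unexpected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return JSON: {raw!r}") from exc
    if not isinstance(data, dict):
        raise ValueError(f"judge returned non-object JSON: {raw!r}")
    scores = {}
    for name in DIMENSIONS:
        value = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid {name} score: {value!r}")
        scores[name] = value
    return scores

# Usage with a hard-coded reply, so the sketch runs offline.
print(parse_judge_scores('{"faithfulness": 4, "relevance": 5}'))
```

Failing loudly on a malformed score is usually safer than silently averaging it in, since a stray 0 or 50 can shift an aggregate metric without anyone noticing.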

Track metrics over time so prompt or model changes are compared on the same cases:

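def compare(variant_fn) -> dict:
    # variant_fn maps an input string to the answer from one prompt/model variant.
    # Cases need a "context" field for the faithfulness score to be meaningful.
    scores = [judge(c["input"], variant_fn(c["input"]), c.get("context", ""))
              for c in DATASET]
    n = len(scores)
    return {
        "avg_faithfulness": round(sum(s["faithfulness"] for s in scores) / n, 2),
        "avg_relevance": round(sum(s["relevance"] for s in scores) / n, 2),
    }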
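To compare runs over time, each result needs to be stored along with the dataset version it was computed on. The sketch below is one way to do that under stated assumptions: the helper names, the JSONL history file, and the 12-character SHA-256 fingerprint are all illustrative choices, not part of any library.

```python
import hashlib
import json
from datetime import datetime, timezone

def dataset_fingerprint(dataset: list) -> str:
    """Short content hash of the dataset, so runs on different data are never compared."""
    canonical = json.dumps(dataset, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def make_run_record(variant: str, metrics: dict, dataset: list) -> dict:
    """Bundle the metrics with the variant name, dataset version, and a UTC timestamp."""
    return {
        "variant": variant,
        "dataset_version": dataset_fingerprint(dataset),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,
    }

def append_run(path: str, record: dict) -> None:
    """Append one record per line (JSONL) so the history is easy to diff and plot."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def metric_deltas(baseline: dict, candidate: dict) -> dict:
    """Per-metric change from baseline to candidate; refuses mismatched datasets."""
    if baseline["dataset_version"] != candidate["dataset_version"]:
        raise ValueError("runs used different datasets; scores are not comparable")
    return {
        name: round(candidate["metrics"][name] - value, 2)
        for name, value in baseline["metrics"].items()
        if name in candidate["metrics"]
    }
```

A record could then be built from the output of `compare`, for example `make_run_record("prompt-v2", compare(my_variant), DATASET)`, where `my_variant` stands for whichever answer function is under test.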

Common Mistakes

  • Eyeballing a few outputs instead of scoring a fixed, versioned dataset
  • Testing only happy paths, ignoring adversarial and edge-case inputs
  • Trusting a single LLM-judge run without spot-checking against humans
  • Changing prompt, model, and data at once, so you can't attribute changes
  • No regression gate in CI, letting quality silently drift between releases
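Two of the mistakes above, unverified judges and missing CI gates, can be addressed with very little code. The following is a rough sketch rather than a standard recipe: `agreement_rate`, `regression_gate`, the one-point tolerance, and the 0.02 allowed drop are all assumptions to tune for your own task.

```python
def agreement_rate(judge_scores: list, human_scores: list, tolerance: int = 1) -> float:
    """Fraction of cases where the judge is within `tolerance` points of the human label."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)

def regression_gate(current: float, baseline: float, max_drop: float = 0.02) -> bool:
    """True when the current metric has not fallen more than `max_drop` below baseline."""
    return current >= baseline - max_drop

def gate_exit_code(current: float, baseline: float, max_drop: float = 0.02) -> int:
    """0 on pass, 1 on fail; a CI script could hand this value to sys.exit()."""
    return 0 if regression_gate(current, baseline, max_drop) else 1

# Hypothetical spot-check: judge vs. human faithfulness scores on five sampled cases.
judge_sample = [5, 4, 2, 3, 5]
human_sample = [5, 3, 4, 3, 4]
print(f"judge/human agreement: {agreement_rate(judge_sample, human_sample):.0%}")
print("gate exit code:", gate_exit_code(current=0.80, baseline=0.92))
```

If the agreement rate is low, the rubric or the judge model probably needs work before its scores are used to gate releases.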

See Also

large-language-models prompt-engineering ai-rag-advanced