Aspects & Assessment of LLMs

A study‑style, hand‑annotated layout you can reuse as a blog theme.

Large Language Models (LLMs) are judged along multiple axes: factuality, reasoning, helpfulness, safety, latency/throughput, context handling, and cost. A practical rule: evaluate the way your users actually behave, not only with synthetic prompts.

Quick scan: MRR, nDCG for retrieval; EM/F1 for QA; ROUGE/BLEU/BERTScore for generation; MMLU, BIG‑bench for breadth.
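To make the retrieval metrics in the quick scan concrete, here is a minimal sketch of MRR and nDCG in plain Python. The function names are illustrative, and two choices are assumptions rather than the only convention: the linear-gain form of DCG, and an ideal ordering computed from the retrieved list's own gains.

```python
import math

def reciprocal_rank(relevance):
    """relevance: 0/1 flags in ranked order. Returns 1/rank of the first hit, else 0.0."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over a list of ranked result lists (one per query)."""
    return sum(reciprocal_rank(r) for r in runs) / len(runs)

def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains, k=None):
    """DCG of the top-k results divided by the DCG of the best possible ordering.

    Assumption: the ideal ordering is built from the same gains list, so
    relevant documents that were never retrieved are not counted.
    """
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0
```

For example, `mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])` averages a first hit at rank 2 and a first hit at rank 1, and `ndcg([3, 2, 1])` is 1.0 because the list is already in ideal order.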

What to look for

For product fit, run task‑grounded evaluations with your own data and success criteria. Use a scorecard so trade‑offs are visible; highlight safety regressions in red and quality gains in green.
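One possible shape for such a scorecard, as a sketch only: compare a baseline and a candidate metric by metric, mark safety regressions "red" and non-safety gains "green", and leave everything else unmarked. The metric names and values below are made up for illustration, and `build_scorecard` is a hypothetical helper, not part of any library.

```python
def build_scorecard(baseline, candidate, lower_is_better=(), safety_metrics=()):
    """Return rows of (metric, baseline, candidate, delta, flag).

    flag is "red" for a safety regression, "green" for a gain on a
    non-safety metric, and "-" otherwise.
    """
    rows = []
    for name in baseline:
        delta = candidate[name] - baseline[name]
        # For metrics such as an unsafe-response rate, a decrease is an improvement.
        improved = delta < 0 if name in lower_is_better else delta > 0
        regressed = delta > 0 if name in lower_is_better else delta < 0
        if name in safety_metrics and regressed:
            flag = "red"
        elif name not in safety_metrics and improved:
            flag = "green"
        else:
            flag = "-"
        rows.append((name, baseline[name], candidate[name], delta, flag))
    return rows

# Toy numbers, invented for the example.
baseline = {"exact_match": 0.78, "token_f1": 0.81, "unsafe_rate": 0.01}
candidate = {"exact_match": 0.83, "token_f1": 0.81, "unsafe_rate": 0.02}

rows = build_scorecard(
    baseline,
    candidate,
    lower_is_better={"unsafe_rate"},
    safety_metrics={"unsafe_rate"},
)
for name, old, new, delta, flag in rows:
    print(f"{name:12s} {old:6.3f} -> {new:6.3f}  ({delta:+.3f})  {flag}")
```

Keeping the flag next to the raw delta makes the trade-off visible: in this toy case the candidate gains on exact match but regresses on safety.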

How to assess (fast)

  1. Define clear acceptance criteria (e.g., “no medical claims,” “≥80% EM on internal set”).
  2. Create a golden set of ~200 prompts + references; include edge cases and tricky negatives.
  3. Automate metrics; sample outputs for human review weekly to catch drift.
  4. Compare variants A/B with paired prompts; keep a change log.
  5. Instrument production: capture user flags, latency, and failure modes.
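Steps 1, 2, and 4 can be sketched as a small harness: score each variant on the same golden set, check the acceptance threshold, and compare variants prompt by prompt. Everything here is an assumption for illustration: the three-item golden set stands in for a real one of about 200 prompts, the two "variants" are dictionary lookups rather than model calls, and scoring is a simple case-insensitive exact match.

```python
from dataclasses import dataclass

@dataclass
class GoldenItem:
    prompt: str
    reference: str

def score(output, reference):
    """Case-insensitive exact match, ignoring surrounding whitespace."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0

def evaluate(model, golden_set):
    """Per-item scores for one variant; `model` is any callable prompt -> str."""
    return [score(model(item.prompt), item.reference) for item in golden_set]

def passes(scores, threshold=0.80):
    """Acceptance check, e.g. at least 80% exact match on the internal set."""
    return sum(scores) / len(scores) >= threshold

def paired_comparison(scores_a, scores_b):
    """Count per-prompt wins and ties when both variants saw identical prompts."""
    a_wins = sum(a > b for a, b in zip(scores_a, scores_b))
    b_wins = sum(b > a for a, b in zip(scores_a, scores_b))
    return {"a_wins": a_wins, "b_wins": b_wins, "ties": len(scores_a) - a_wins - b_wins}

# Hypothetical golden set and stand-in variants.
golden = [
    GoldenItem("Capital of France?", "Paris"),
    GoldenItem("2 + 2 =", "4"),
    GoldenItem("Opposite of hot?", "cold"),
]
answers_a = {"Capital of France?": "Paris", "2 + 2 =": "5", "Opposite of hot?": "cold"}
answers_b = {"Capital of France?": "paris", "2 + 2 =": "4", "Opposite of hot?": "Cold"}

scores_a = evaluate(lambda p: answers_a.get(p, ""), golden)
scores_b = evaluate(lambda p: answers_b.get(p, ""), golden)
summary = paired_comparison(scores_a, scores_b)
print(summary, passes(scores_a), passes(scores_b))
```

Pairing on identical prompts keeps the comparison about the variants rather than about which prompts each one happened to see; with a realistic set size, the win/loss counts could feed a paired significance test before the result goes into the change log.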

Common pitfalls

Mini glossary

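EM: exact string match against the reference; F1: token-level overlap (harmonic mean of precision and recall); nDCG: normalized discounted cumulative gain, a graded ranking metric; MRR: mean reciprocal rank of the first relevant hit; BERTScore: embedding-based semantic similarity; Hallucination: confident, fabricated content.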
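The EM and F1 entries can be illustrated with a short sketch. The normalization step (lowercasing, stripping punctuation, collapsing whitespace) is an assumption and a simplified take on common QA scoring practice; adjust it to match how your references are written.

```python
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

With this sketch, "Paris." and "paris" count as an exact match, while "the cat sat" against "the cat" has precision 2/3 and recall 1, so it earns partial F1 credit but no EM credit.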

★ Remember: "Measure twice, deploy once."