Aspects & Assessment of LLMs

A study‑style, hand‑annotated layout you can reuse as a blog theme.

Large Language Models (LLMs) are judged along multiple axes: factuality, reasoning, helpfulness, safety, latency/throughput, context handling, and cost. A practical rule: evaluate the way your users actually behave, not only with synthetic prompts.

Quick scan: MRR, nDCG for retrieval; EM/F1 for QA; ROUGE/BLEU/BERTScore for generation; MMLU, BIG‑bench for breadth.
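To make the retrieval metrics concrete, here is a minimal sketch of MRR and nDCG in plain Python. The function names and input shapes are assumptions chosen for illustration, and the nDCG variant uses linear gains with the ideal ordering computed from the same list of judged results; other formulations (for example exponential gains) exist.

```python
import math
from typing import Optional, Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """Average of 1/rank of the first relevant hit per query.

    ranked_relevance[q][i] is 1 if the i-th result for query q is relevant, else 0.
    Queries with no relevant result contribute 0.
    """
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for labels in ranked_relevance:
        for rank, rel in enumerate(labels, start=1):
            if rel > 0:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gains and a log2 position discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(gains: Sequence[float], k: Optional[int] = None) -> float:
    """DCG of the ranking divided by the DCG of the best possible ordering.

    Assumption: the ideal ordering is built from the same judged list.
    """
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0
```

For example, `mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])` averages 1/2 and 1/1, and `ndcg([3, 2, 1])` is 1.0 because the ranking is already in ideal order.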

What to look for

Factuality: reduce hallucinations via retrieval, citations, and constrained decoding.
Reasoning: chain‑of‑thought, tool use, and structured scratchpads.
Safety & bias: red‑team tests, refusal handling, and fairness audits.
Latency & cost: measure P50/P95 latency, tokens/sec, and cache hit rate; track $/task.
Robustness: adversarial prompts, long‑context drift, non‑English inputs.
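The latency and cost item can be sketched with a few small helpers. This is a rough illustration only: the nearest-rank percentile is one of several percentile definitions, and the per-1k-token prices passed to `cost_per_task` are hypothetical placeholders, not any provider's real rates.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tokens_per_second(completion_tokens: int, seconds: float) -> float:
    """Generation throughput for a single request."""
    return completion_tokens / seconds


def cost_per_task(prompt_tokens: int, completion_tokens: int,
                  usd_per_1k_prompt: float, usd_per_1k_completion: float) -> float:
    """Dollar cost of one task; prices are caller-supplied assumptions."""
    return (prompt_tokens / 1000 * usd_per_1k_prompt
            + completion_tokens / 1000 * usd_per_1k_completion)


# Usage with made-up latency samples in milliseconds.
latencies = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
```

Reporting P95 alongside P50 matters because a healthy median can hide a slow tail that users notice.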

For product fit, run task‑grounded evaluations with your own data and success criteria. Use a scorecard so trade‑offs are visible; highlight safety regressions in red and quality gains in green.
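One way to build such a scorecard is to diff a candidate against a baseline and flag each metric by direction. The sketch below is an assumption-laden illustration: the metric names, the `HIGHER_IS_BETTER` table, and the `tolerance` parameter are all hypothetical and should be replaced with your own success criteria.

```python
from typing import Dict, Tuple

# Hypothetical metrics; True means a higher value is better.
HIGHER_IS_BETTER: Dict[str, bool] = {
    "exact_match": True,
    "unsafe_rate": False,
    "p95_latency_ms": False,
}


def scorecard(baseline: Dict[str, float], candidate: Dict[str, float],
              tolerance: float = 0.0) -> Dict[str, Tuple[float, str]]:
    """Return {metric: (delta, flag)} where flag is 'green', 'red', or 'neutral'."""
    rows: Dict[str, Tuple[float, str]] = {}
    for name, higher_better in HIGHER_IS_BETTER.items():
        delta = candidate[name] - baseline[name]
        # Normalize so that a positive value always means "got better".
        improvement = delta if higher_better else -delta
        if improvement > tolerance:
            flag = "green"
        elif improvement < -tolerance:
            flag = "red"
        else:
            flag = "neutral"
        rows[name] = (delta, flag)
    return rows


baseline = {"exact_match": 0.78, "unsafe_rate": 0.01, "p95_latency_ms": 900}
candidate = {"exact_match": 0.82, "unsafe_rate": 0.03, "p95_latency_ms": 900}
card = scorecard(baseline, candidate)
```

Here the candidate gains on quality but regresses on safety, which is exactly the kind of trade-off a single aggregate score would hide.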

How to assess (fast)

Define clear acceptance criteria (e.g., “no medical claims,” “≥80% EM on internal set”).
Create a golden set of ~200 prompts + references; include edge cases and tricky negatives.
Automate metrics; sample human review weekly for drift.
Compare variants A/B with paired prompts; keep a change log.
Instrument production: capture user flags, latency, and failure modes.
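The acceptance gate and the paired A/B step can be sketched as follows. This is a minimal illustration assuming each variant has already been scored per prompt on the same golden set; the 0.80 threshold mirrors the example criterion above, and the exact sign test is one reasonable choice for paired comparison, not the only one.

```python
import math
from typing import Dict, Sequence


def meets_acceptance(scores: Sequence[float], threshold: float = 0.80) -> bool:
    """True if the mean per-prompt score reaches the acceptance threshold."""
    return sum(scores) / len(scores) >= threshold


def paired_comparison(scores_a: Sequence[float],
                      scores_b: Sequence[float]) -> Dict[str, float]:
    """Compare two variants prompt by prompt and run a two-sided exact sign test.

    scores_a[i] and scores_b[i] must be scores for the same prompt.
    Ties are dropped from the test, as is standard for the sign test.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("variants must be scored on the same prompts")
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    ties = len(scores_a) - wins_a - wins_b
    n = wins_a + wins_b
    if n == 0:
        p_value = 1.0
    else:
        k = min(wins_a, wins_b)
        # Probability of k or fewer wins under a fair coin, doubled for two sides.
        tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
        p_value = min(1.0, 2 * tail)
    return {"wins_a": wins_a, "wins_b": wins_b, "ties": ties, "p_value": p_value}


# Made-up per-prompt exact-match scores for two variants.
variant_a = [1, 0, 0, 1, 0, 0, 0, 1]
variant_b = [1, 1, 1, 1, 1, 0, 1, 1]
result = paired_comparison(variant_a, variant_b)
```

Pairing by prompt removes prompt difficulty as a source of noise, so smaller golden sets can still separate two variants. With only eight prompts, as here, even a clean sweep of non-tied prompts is weak evidence, which is one reason to aim for a set closer to the ~200 prompts suggested above.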

Common pitfalls

Overfitting to leaderboards like MMLU.
Ignoring data provenance and privacy constraints.
Confusing clever demos with reliable capability.
Not testing distribution shifts: new topics, dates, or formats.

Mini glossary

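EM: exact match between prediction and reference strings; F1: harmonic mean of token‑level precision and recall; nDCG: normalized discounted cumulative gain, a graded ranking measure; MRR: mean reciprocal rank of the first relevant hit; BERTScore: embedding‑based semantic similarity; Hallucination: confident, fabricated content.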
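As an illustration of the two QA metrics, here is a small sketch of EM and token-level F1. The normalization step (lowercasing, stripping punctuation and English articles) follows a common SQuAD-style convention and is an assumption; a stricter EM would compare raw strings.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For instance, `token_f1("Paris, France", "Paris")` has precision 1/2 and recall 1, giving an F1 of about 0.67, while EM for the same pair is 0.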

★ Remember: "Measure twice, deploy once."