Aspects & Assessment of LLMs
Large Language Models (LLMs) are judged along multiple axes: factuality, reasoning, helpfulness, safety, latency/throughput, context handling, and cost. A practical rule: evaluate the way your users actually behave, not only with synthetic prompts.
What to look for
- Factuality: reduce hallucinations via retrieval, citations, and constrained decoding.
- Reasoning: chain‑of‑thought, tool use, and structured scratchpads.
- Safety & bias: red‑team tests, refusal handling, and fairness audits.
- Latency & cost: measure P50/P95 latency, tokens/sec, and cache hit rate; track $/task.
- Robustness: adversarial prompts, long context drift, non‑English inputs.
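To make the latency and cost bullet concrete, here is a minimal sketch of how those numbers could be computed from request logs. The record fields (`latency_s`, `output_tokens`, `total_tokens`, `cache_hit`) and the price per 1k tokens are hypothetical placeholders, not a real logging schema or a real price. Throughput is approximated as output tokens over summed latency, which assumes requests ran one at a time.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(requests, usd_per_1k_tokens):
    """Summarize request logs. The dict keys used here are assumed, illustrative names."""
    latencies = [r["latency_s"] for r in requests]
    total_tokens = sum(r["total_tokens"] for r in requests)
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        # Assumes sequential requests; concurrent serving needs wall-clock time instead.
        "tokens_per_sec": sum(r["output_tokens"] for r in requests) / sum(latencies),
        "cache_hit_rate": sum(r["cache_hit"] for r in requests) / len(requests),
        "usd_per_task": total_tokens / 1000 * usd_per_1k_tokens / len(requests),
    }

# Made-up sample log, one record per completed task.
requests = [
    {"latency_s": 1.0, "output_tokens": 100, "total_tokens": 500, "cache_hit": True},
    {"latency_s": 2.0, "output_tokens": 200, "total_tokens": 1000, "cache_hit": False},
    {"latency_s": 3.0, "output_tokens": 300, "total_tokens": 1500, "cache_hit": False},
    {"latency_s": 4.0, "output_tokens": 400, "total_tokens": 1000, "cache_hit": True},
]
summary = summarize(requests, usd_per_1k_tokens=0.002)  # hypothetical price
print(summary)
```

With only a handful of samples, P95 collapses to the maximum, so tail latency is only meaningful once you have a few hundred requests per bucket.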
For product fit, run task‑grounded evaluations with your own data and success criteria. Use a scorecard so trade‑offs are visible; highlight safety regressions in red and quality gains in green.
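One way such a scorecard might look in code is sketched below. The metric names, the grouping into safety and quality metrics, and the numbers are all illustrative assumptions; substitute your own criteria.

```python
# Assumed, illustrative metric groupings; adjust to your own scorecard.
LOWER_IS_BETTER = {"unsafe_rate", "p95_latency_s"}
SAFETY_METRICS = {"unsafe_rate"}
QUALITY_METRICS = {"exact_match"}

def flag(name, old, new):
    """Return RED for a safety regression, GREEN for a quality gain, '-' otherwise."""
    delta = new - old
    if name in LOWER_IS_BETTER:
        delta = -delta  # flip sign so that positive always means "better"
    if name in SAFETY_METRICS and delta < 0:
        return "RED"
    if name in QUALITY_METRICS and delta > 0:
        return "GREEN"
    return "-"

def scorecard(baseline, candidate):
    """Flag every metric present in the baseline run."""
    return {name: flag(name, baseline[name], candidate[name]) for name in baseline}

# Made-up numbers for two model variants.
baseline = {"exact_match": 0.78, "unsafe_rate": 0.010, "p95_latency_s": 2.4}
candidate = {"exact_match": 0.83, "unsafe_rate": 0.015, "p95_latency_s": 2.4}

flags = scorecard(baseline, candidate)
for name, mark in flags.items():
    print(f"{name:<15} {baseline[name]:>7.3f} -> {candidate[name]:>7.3f}  {mark}")
```

In this made-up run the candidate gains on exact match but regresses on safety, which is exactly the kind of trade-off a single aggregate score would hide.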
How to assess (fast)
- Define clear acceptance criteria (e.g., “no medical claims,” “≥80% EM on internal set”).
- Create a golden set of ~200 prompts + references; include edge cases and tricky negatives.
- Automate metrics; sample outputs for human review weekly to catch drift.
- Compare variants A/B with paired prompts; keep a change log.
- Instrument production: capture user flags, latency, and failure modes.
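The first few steps above can be sketched as a small harness: score two variants on the same golden set, check each against an acceptance threshold, and count the prompts where only one variant is right. The normalization rule, the four-item golden set, and the variant outputs are illustrative assumptions; the 0.80 threshold mirrors the "≥80% EM" example.

```python
import string

def normalize(text):
    """Lowercase, strip punctuation, collapse whitespace (one common EM normalization, not the only one)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction, reference):
    return normalize(prediction) == normalize(reference)

def em_rate(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    hits = [exact_match(p, r) for p, r in zip(predictions, references)]
    return sum(hits) / len(hits)

def paired_comparison(preds_a, preds_b, references):
    """Count prompts where exactly one variant is correct; ties carry no signal."""
    a_only = b_only = 0
    for a, b, ref in zip(preds_a, preds_b, references):
        ok_a, ok_b = exact_match(a, ref), exact_match(b, ref)
        if ok_a and not ok_b:
            a_only += 1
        elif ok_b and not ok_a:
            b_only += 1
    return {"a_only": a_only, "b_only": b_only}

# Tiny made-up golden set; a real one would hold ~200 prompts with edge cases.
references = ["Paris", "4", "blue whale", "1969"]
variant_a = ["Paris.", "5", "Blue whale", "1969"]
variant_b = ["paris", "4", "Blue Whale", "1969"]

EM_THRESHOLD = 0.80  # acceptance criterion taken from the example above
em_a = em_rate(variant_a, references)
em_b = em_rate(variant_b, references)
paired = paired_comparison(variant_a, variant_b, references)

print(f"A: EM={em_a:.2f} pass={em_a >= EM_THRESHOLD}")
print(f"B: EM={em_b:.2f} pass={em_b >= EM_THRESHOLD}")
print(paired)
```

Because both variants see identical prompts, the `a_only` and `b_only` counts isolate the prompts that actually distinguish them, which is usually more informative than comparing two aggregate scores.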
Common pitfalls
- Overfitting to public benchmarks and leaderboards such as MMLU.
- Ignoring data provenance and privacy constraints.
- Confusing clever demos with reliable capability.
- Not testing for distribution shift: new topics, dates, or formats.
Mini glossary
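- EM: exact match; the prediction equals the reference string, usually after normalization.
- F1: harmonic mean of token‑level precision and recall against the reference.
- nDCG: normalized discounted cumulative gain; ranking quality with graded relevance.
- MRR: mean reciprocal rank of the first relevant result across queries.
- BERTScore: semantic similarity computed from contextual embeddings.
- Hallucination: confident, fabricated content.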
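For reference, here is a minimal sketch of three of these metrics in plain Python. It assumes whitespace tokenization for F1 and the linear-gain form of DCG; libraries differ on both choices. BERTScore is left out because it needs a pretrained embedding model.

```python
import math
from collections import Counter

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall, using whitespace tokens (an assumption)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def mrr(ranked_relevance):
    """ranked_relevance: one list of booleans per query, in ranked order."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)

def ndcg(gains, k=None):
    """gains: graded relevance in ranked order; uses gain / log2(rank + 1)."""
    k = len(gains) if k is None else k

    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

print(token_f1("the cat sat", "the cat sat down"))
print(mrr([[False, True], [True, False], [False, False]]))
print(ndcg([1, 3]))
```

A query with no relevant result contributes zero to MRR, and a ranking with no relevant items scores zero nDCG here by convention.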