Lab 5: The Full RAGAS Scorecard
All 4 metrics together. The complete evaluation pipeline: live queries, batch testing, and eval strategy.
1. Live Evaluation
Ask any question, see all 4 metrics
Type any question about NovaCorp HR policy. The system runs full RAG retrieval, generates an answer, then computes all 4 RAGAS metrics in parallel. All 4 gauges animate to their scores simultaneously.
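The parallel scoring step described above can be sketched with `asyncio`. This is a minimal sketch under stated assumptions, not the lab's implementation: `score_metric` is a hypothetical placeholder for an LLM-judged scorer, and the four metric names are assumed (the commonly cited RAGAS set), since the lab text does not list them.

```python
import asyncio

# Assumed metric names; the lab text only says "all 4 RAGAS metrics".
METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


async def score_metric(name: str, question: str, answer: str, contexts: list[str]) -> float:
    """Hypothetical placeholder scorer; a real one would call an LLM judge."""
    await asyncio.sleep(0.01)  # stands in for the latency of a judge call
    return 1.0 if contexts else 0.0  # dummy value, not a real metric


async def evaluate_live(question: str, answer: str, contexts: list[str]) -> dict[str, float]:
    """Score one question/answer pair on every metric concurrently."""
    scores = await asyncio.gather(
        *(score_metric(name, question, answer, contexts) for name in METRICS)
    )
    return dict(zip(METRICS, scores))


scorecard = asyncio.run(
    evaluate_live(
        "How many wellness days do I get at NovaCorp?",
        "You get 7.5 Wellness Days per quarter.",
        ["Every NovaCorp employee receives 7.5 Wellness Days per quarter."],
    )
)
print(scorecard)
```

Because `asyncio.gather` awaits all four scorers at once, total latency is roughly that of the slowest metric rather than the sum of all four.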
2. Golden Dataset
Batch evaluation on 5 curated Q&A pairs
Golden Test Set (5 questions)
1. How many wellness days do I get at NovaCorp?
Ground truth: Every NovaCorp employee receives 7.5 Wellness Days per quarter. They are non-transferable and non-encashable, and they expire at the end of each quarter.
2. How do I apply for paternity leave?
Ground truth: Submit Form W-77B to the Culture Team at hr-culture@novacorp.internal at least 2 weeks before the expected date of birth. NovaCorp offers 14 days of fully paid paternity leave.
3. What is the gym reimbursement limit at NovaCorp?
Ground truth: NovaCorp reimburses up to $500 per year for gym memberships and fitness classes. Claims must be submitted via the benefits portal with receipts within 30 days of payment.
4. Can Level 2 employees work from home?
Ground truth: Level 1 and Level 2 employees may work remotely 1 day per week after completing 6 months of service.
5. What happens if I work on a public holiday?
Ground truth: Employees working on public holidays receive double pay or a compensatory day off. The choice between double pay and comp-off must be made within 5 working days of the holiday.
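A golden test set like the one above could be represented as a list of question/ground-truth records and scored in a batch loop. The sketch below is illustrative only: `token_recall` is a crude token-overlap stand-in, not a RAGAS metric, and `run_batch` and the echoing RAG stub are hypothetical names introduced for the example.

```python
import re
from statistics import mean

# Two of the five golden pairs from the lab; the other three follow the same shape.
GOLDEN_SET = [
    {
        "question": "What is the gym reimbursement limit at NovaCorp?",
        "ground_truth": "NovaCorp reimburses up to $500 per year for gym memberships "
        "and fitness classes. Claims must be submitted via the benefits portal "
        "with receipts within 30 days of payment.",
    },
    {
        "question": "Can Level 2 employees work from home?",
        "ground_truth": "Level 1 and Level 2 employees may work remotely 1 day per "
        "week after completing 6 months of service.",
    },
]


def tokens(text: str) -> set[str]:
    """Lowercase word tokens of a text."""
    return set(re.findall(r"\w+", text.lower()))


def token_recall(ground_truth: str, answer: str) -> float:
    """Crude stand-in for recall: share of ground-truth tokens found in the answer."""
    expected = tokens(ground_truth)
    return len(expected & tokens(answer)) / len(expected) if expected else 0.0


def run_batch(golden_set, rag_answer):
    """Run every golden question through `rag_answer` and score it against the ground truth."""
    rows = [
        {
            "question": item["question"],
            "recall": token_recall(item["ground_truth"], rag_answer(item["question"])),
        }
        for item in golden_set
    ]
    return rows, mean(row["recall"] for row in rows)


# Hypothetical RAG stub that echoes the ground truth, so it scores perfectly.
lookup = {item["question"]: item["ground_truth"] for item in GOLDEN_SET}
rows, average = run_batch(GOLDEN_SET, lambda question: lookup[question])
print(rows, average)
```

In a real run, `rag_answer` would call the RAG pipeline and the scorer would be an LLM-judged metric; the per-question rows make it easy to see which golden question dragged the average down.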
3. Offline vs Online Evaluation
Two complementary strategies
Offline Evaluation
Run before you deploy. Use a golden test set: curated Q&A pairs with known correct answers.
- Run on every code change
- Catch regressions before deploy
- Compare configs head-to-head
- Reproducible, stable baselines
Golden Set → RAG → Score → Pass/Fail
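The final Pass/Fail step of the offline pipeline could be implemented as a simple gate over aggregated scores. The following is a sketch under assumptions: the threshold values, the tolerance, and the function names `gate` and `regressions` are hypothetical and not taken from the lab.

```python
def gate(scores: dict[str, float], thresholds: dict[str, float]) -> tuple[bool, list[str]]:
    """Pass only if every metric meets its minimum; report each failure."""
    failures = [
        f"{metric}: {scores.get(metric, 0.0):.2f} < {minimum:.2f}"
        for metric, minimum in thresholds.items()
        if scores.get(metric, 0.0) < minimum
    ]
    return not failures, failures


def regressions(current: dict[str, float], baseline: dict[str, float],
                tolerance: float = 0.02) -> list[str]:
    """Metrics that dropped more than `tolerance` below a stored baseline run."""
    return [
        metric
        for metric, base in baseline.items()
        if current.get(metric, 0.0) < base - tolerance
    ]


# Illustrative numbers only.
current = {"faithfulness": 0.90, "context_recall": 0.60}
passed, failures = gate(current, {"faithfulness": 0.80, "context_recall": 0.70})
dropped = regressions(current, {"faithfulness": 0.88, "context_recall": 0.75})
print(passed, failures, dropped)
```

In a CI job, a failing gate would typically end the process with a non-zero exit code so the deploy is blocked; comparing against a stored baseline is one way to do the head-to-head config comparison mentioned above.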
Online Evaluation
Run in production. Evaluate real user queries as they arrive. Catch quality drops before users complain.
- Monitor live traffic continuously
- Alert when scores drop below threshold
- Identify problematic query patterns
- Track metrics over time
User Query → RAG → Score → Alert?
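The Score → Alert? step could be sketched as a rolling-window monitor over per-query scores. This is one possible design, not the lab's: the class name `RollingMonitor`, the window size, the threshold, and the minimum sample count are all hypothetical choices.

```python
from collections import deque


class RollingMonitor:
    """Track the mean of the most recent scores and flag drops below a threshold."""

    def __init__(self, window: int = 50, threshold: float = 0.7, min_samples: int = 10):
        self.scores = deque(maxlen=window)  # old scores fall off automatically
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, score: float) -> bool:
        """Add one score; return True if an alert should fire."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return False  # too little data to judge
        return sum(self.scores) / len(self.scores) < self.threshold


monitor = RollingMonitor(window=5, threshold=0.7, min_samples=3)
alerts = [monitor.record(score) for score in (0.9, 0.9, 0.9, 0.2, 0.2)]
print(alerts)
```

Averaging over a window rather than alerting on single queries avoids paging on one bad answer while still catching a sustained drop.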
The Key Difference
Offline: Known Answers
You have the ground truth, so you can compute exact recall. Use it to prevent regressions and compare experiments.
Online: Unknown Answers
No ground truth is available, so use only the metrics that need no reference answer: faithfulness and relevancy. Rely on statistical sampling and LLM-as-judge scoring to cover traffic at scale.
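Statistical sampling of live traffic can be done deterministically by hashing a query identifier, so only a fixed fraction of queries is sent to the LLM judge. A minimal sketch, assuming each query has a stable string ID; the function name `should_sample` and the 10% rate are hypothetical.

```python
import hashlib


def should_sample(query_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of all query IDs for evaluation."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < rate


query_ids = [f"query-{i}" for i in range(10_000)]
sampled = [qid for qid in query_ids if should_sample(qid, 0.10)]
print(len(sampled) / len(query_ids))
```

Hash-based selection gives the same decision for the same ID on every server, which keeps the sample reproducible without shared state.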
Use offline evaluation to build confidence before deploy. Use online evaluation to maintain quality after deploy. Both are necessary.