Lab 5: The Full RAGAS Scorecard

All 4 metrics together. The complete evaluation pipeline: live queries, batch testing, and eval strategy.

1

Live Evaluation: Ask any question, see all 4 metrics
Type any question about NovaCorp HR policy. The system runs full RAG retrieval, generates an answer, then computes all 4 RAGAS metrics in parallel. All 4 gauges animate to their scores simultaneously.
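The parallel scoring step might look roughly like the sketch below. It is an illustration under stated assumptions, not the lab's actual implementation: the four metric names (faithfulness, answer relevancy, context precision, context recall) are assumed from the commonly cited RAGAS set because this page does not list them, and the `score_*` functions are hypothetical placeholders that return fixed values where real scorers would call an LLM judge.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical placeholder scorers. Real ones would call an LLM judge;
# these return fixed values so the sketch runs on its own.
def score_faithfulness(question, answer, contexts):
    return 0.9

def score_answer_relevancy(question, answer, contexts):
    return 0.8

def score_context_precision(question, answer, contexts):
    return 0.7

def score_context_recall(question, answer, contexts):
    return 0.6

# Assumed metric names; the page only says "all 4 RAGAS metrics".
METRICS = {
    "faithfulness": score_faithfulness,
    "answer_relevancy": score_answer_relevancy,
    "context_precision": score_context_precision,
    "context_recall": score_context_recall,
}

def score_all(question, answer, contexts):
    """Run every scorer concurrently and return {metric_name: score}."""
    with ThreadPoolExecutor(max_workers=len(METRICS)) as pool:
        futures = {
            name: pool.submit(fn, question, answer, contexts)
            for name, fn in METRICS.items()
        }
        # result() blocks until that scorer has finished.
        return {name: future.result() for name, future in futures.items()}

scores = score_all(
    "How many wellness days do I get at NovaCorp?",
    "You get 7.5 Wellness Days per quarter.",
    ["Every NovaCorp employee receives 7.5 Wellness Days per quarter."],
)
```

Threads suit this step because each real scorer would spend most of its time waiting on a network call, so the four requests overlap instead of running back to back.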
2

Golden Dataset: Batch evaluation on 5 curated Q&A pairs
Golden Test Set (5 questions)
1
How many wellness days do I get at NovaCorp?
Ground truth: Every NovaCorp employee receives 7.5 Wellness Days per quarter. They are non-transferable and non-encashable, and they expire at the end of each quarter.
2
How do I apply for paternity leave?
Ground truth: Submit Form W-77B to the Culture Team at hr-culture@novacorp.internal at least 2 weeks before the expected date of birth. NovaCorp offers 14 days of fully paid paternity leave.
3
What is the gym reimbursement limit at NovaCorp?
Ground truth: NovaCorp reimburses up to $500 per year for gym memberships and fitness classes. Claims must be submitted via the benefits portal with receipts within 30 days of payment.
4
Can Level 2 employees work from home?
Ground truth: Level 1 and Level 2 employees may work remotely 1 day per week after completing 6 months of service.
5
What happens if I work on a public holiday?
Ground truth: Employees working on public holidays receive double pay or a compensatory day off. The choice between double pay and comp-off must be made within 5 working days of the holiday.
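A golden set like the one above can be held as plain data and scored in a loop. The sketch below is a minimal illustration: `GoldenItem`, `token_recall`, and `run_batch` are hypothetical names, and `token_recall` is a crude word-overlap stand-in rather than the RAGAS recall metric. The stub RAG system simply echoes the ground truth so the example runs without a model.

```python
from dataclasses import dataclass

@dataclass
class GoldenItem:
    question: str
    ground_truth: str

GOLDEN_SET = [
    GoldenItem(
        "How many wellness days do I get at NovaCorp?",
        "Every NovaCorp employee receives 7.5 Wellness Days per quarter. "
        "They are non-transferable and non-encashable, and they expire at "
        "the end of each quarter.",
    ),
    GoldenItem(
        "How do I apply for paternity leave?",
        "Submit Form W-77B to the Culture Team at "
        "hr-culture@novacorp.internal at least 2 weeks before the expected "
        "date of birth. NovaCorp offers 14 days of fully paid paternity leave.",
    ),
    GoldenItem(
        "What is the gym reimbursement limit at NovaCorp?",
        "NovaCorp reimburses up to $500 per year for gym memberships and "
        "fitness classes. Claims must be submitted via the benefits portal "
        "with receipts within 30 days of payment.",
    ),
    GoldenItem(
        "Can Level 2 employees work from home?",
        "Level 1 and Level 2 employees may work remotely 1 day per week "
        "after completing 6 months of service.",
    ),
    GoldenItem(
        "What happens if I work on a public holiday?",
        "Employees working on public holidays receive double pay or a "
        "compensatory day off. The choice between double pay and comp-off "
        "must be made within 5 working days of the holiday.",
    ),
]

def token_recall(answer, ground_truth):
    """Crude stand-in metric: share of ground-truth words found in the answer."""
    truth = set(ground_truth.lower().split())
    found = truth & set(answer.lower().split())
    return len(found) / len(truth) if truth else 0.0

def run_batch(rag_answer, items):
    """Score each golden item; rag_answer is any callable question -> answer."""
    return [
        (item.question, token_recall(rag_answer(item.question), item.ground_truth))
        for item in items
    ]

# Hypothetical stub RAG system that echoes the ground truth.
lookup = {item.question: item.ground_truth for item in GOLDEN_SET}
results = run_batch(lookup.get, GOLDEN_SET)
```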
3

Offline vs Online Evaluation: Two complementary strategies
🧪 Offline Evaluation

Run before you deploy. Use a golden test set: curated Q&A pairs with known correct answers.

✓ Run on every code change
✓ Catch regressions before deploy
✓ Compare configs head-to-head
✓ Reproducible, stable baselines
Golden Set → RAG → Score → Pass/Fail
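The final Pass/Fail step of the offline pipeline can be sketched as a threshold gate. This is one possible shape, not the lab's implementation: the `gate` function, the metric names, and the threshold values are all assumptions chosen for illustration.

```python
def gate(scores, thresholds):
    """Return (passed, failures); failures lists each metric below its minimum."""
    failures = [
        f"{name}: {scores.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum  # a missing metric counts as 0.0
    ]
    return (not failures, failures)

# Assumed thresholds and example averaged scores from a golden-set run.
thresholds = {"faithfulness": 0.85, "answer_relevancy": 0.80}
passed, failures = gate(
    {"faithfulness": 0.91, "answer_relevancy": 0.74}, thresholds
)
```

In a CI job, a script built around this would exit with a non-zero status when `passed` is false, so a regression blocks the deploy.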
📡 Online Evaluation

Run in production. Evaluate real user queries as they arrive. Catch quality drops before users complain.

✓ Monitor live traffic continuously
✓ Alert when scores drop below threshold
✓ Identify problematic query patterns
✓ Track metrics over time
User Query → RAG → Score → Alert?
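The "Alert?" step can be approximated with a rolling average over recent scores. The sketch below assumes a hypothetical `RollingMonitor` class; the window size, minimum sample count, and threshold are illustrative values, not settings taken from this lab.

```python
from collections import deque

class RollingMonitor:
    """Tracks recent scores and flags when their mean falls below a threshold."""

    def __init__(self, threshold, window=50, min_samples=5):
        self.threshold = threshold
        self.min_samples = min_samples
        self.scores = deque(maxlen=window)  # oldest scores drop off automatically

    def record(self, score):
        """Add one score; return True when the rolling mean is below threshold."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return False  # too few samples to judge yet
        return sum(self.scores) / len(self.scores) < self.threshold

monitor = RollingMonitor(threshold=0.8, window=5, min_samples=3)
alerts = [monitor.record(s) for s in [0.9, 0.9, 0.9, 0.4, 0.4, 0.4]]
# alerts == [False, False, False, True, True, True]
```

Averaging over a window rather than alerting on single scores keeps one unusually bad answer from paging anyone, at the cost of reacting a few queries later.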
The Key Difference
🧪 Offline: Known Answers
You have the ground truth. You can compute exact recall. Used to prevent regressions and compare experiments.
📡 Online: Unknown Answers
No ground truth available, so use faithfulness + relevancy only. Rely on statistical sampling and LLM-as-judge scoring to evaluate at scale.
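One way to do the statistical sampling is to hash each query ID and score only a fixed fraction of traffic. This is a sketch under assumptions: `should_evaluate`, the ID format, and the 10% rate are hypothetical, and the page does not say which sampling scheme the lab uses.

```python
import hashlib

def should_evaluate(query_id, sample_rate):
    """Deterministically pick roughly sample_rate of queries by hashing the ID."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a roughly uniform value in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# Send about 10% of 1000 hypothetical query IDs to the LLM judge.
sampled = [qid for qid in (f"q{i}" for i in range(1000)) if should_evaluate(qid, 0.1)]
```

Hashing instead of calling a random number generator means the same query ID always gets the same decision, which makes sampled results reproducible across services and reruns.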
Use offline evaluation to build confidence before deploy. Use online evaluation to maintain quality after deploy. Both are necessary.