Lab 5: The Full RAGAS Scorecard

All 4 RAGAS metrics together in one complete evaluation pipeline: live queries, batch testing, and evaluation strategy.

Live Evaluation

— Ask any question, see all 4 metrics

Type any question about NovaCorp HR policy. The system runs full RAG retrieval, generates an answer, then computes all 4 RAGAS metrics in parallel. All 4 gauges animate to their scores simultaneously.
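The parallel scoring step described above can be sketched with `asyncio`. This is a minimal illustration, not the lab's actual implementation: the four metric names are an assumption (the page only says "all 4 RAGAS metrics"), and `score_metric` is a hypothetical placeholder where a real scorer would call an LLM judge.

```python
import asyncio

# Assumed metric names; the lab text does not list them explicitly.
METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


async def score_metric(metric: str, question: str, answer: str, contexts: list[str]) -> float:
    """Hypothetical scorer. A real one would await an LLM-judge call here."""
    await asyncio.sleep(0.01)  # stand-in for the judge's network latency
    # Placeholder score, not a real metric: 1.0 if we have an answer and context.
    return 1.0 if answer and contexts else 0.0


async def score_all(question: str, answer: str, contexts: list[str]) -> dict[str, float]:
    # gather() runs the four coroutines concurrently and keeps their order.
    scores = await asyncio.gather(
        *(score_metric(m, question, answer, contexts) for m in METRICS)
    )
    return dict(zip(METRICS, scores))


scorecard = asyncio.run(
    score_all(
        "How many wellness days do I get at NovaCorp?",
        "You get 7.5 Wellness Days per quarter.",
        ["Every NovaCorp employee receives 7.5 Wellness Days per quarter."],
    )
)
```

Because the four scorers are awaited together, total latency is close to that of the slowest metric rather than the sum of all four.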

Golden Dataset

— Batch evaluation on 5 curated Q&A pairs

Golden Test Set (5 questions)

How many wellness days do I get at NovaCorp?

Ground truth: Every NovaCorp employee receives 7.5 Wellness Days per quarter. They are non-transferable and non-encashable, and they expire at the end of each quarter.

How do I apply for paternity leave?

Ground truth: Submit Form W-77B to the Culture Team at hr-culture@novacorp.internal at least 2 weeks before the expected date of birth. NovaCorp offers 14 days of fully paid paternity leave.

What is the gym reimbursement limit at NovaCorp?

Ground truth: NovaCorp reimburses up to $500 per year for gym memberships and fitness classes. Claims must be submitted via the benefits portal with receipts within 30 days of payment.

Can Level 2 employees work from home?

Ground truth: Level 1 and Level 2 employees may work remotely 1 day per week after completing 6 months of service.

What happens if I work on a public holiday?

Ground truth: Employees working on public holidays receive double pay or a compensatory day off. The choice between double pay and comp-off must be made within 5 working days of the holiday.
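A golden set like the one above can be kept as plain data and scored in a loop. The sketch below is illustrative only: `GoldenCase`, `token_recall`, and `run_batch` are hypothetical names, and the word-overlap score is a toy stand-in for a real reference-based metric, which would normally use an LLM judge.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    question: str
    ground_truth: str


# Two of the five pairs above; ground truths shortened to their first sentence.
GOLDEN_SET = [
    GoldenCase(
        "What is the gym reimbursement limit at NovaCorp?",
        "NovaCorp reimburses up to $500 per year for gym memberships and fitness classes.",
    ),
    GoldenCase(
        "Can Level 2 employees work from home?",
        "Level 1 and Level 2 employees may work remotely 1 day per week "
        "after completing 6 months of service.",
    ),
]


def token_recall(answer: str, ground_truth: str) -> float:
    """Toy metric: share of ground-truth words that also appear in the answer."""
    truth = set(ground_truth.lower().split())
    got = set(answer.lower().split())
    return len(truth & got) / len(truth) if truth else 0.0


def run_batch(cases: list[GoldenCase], rag: Callable[[str], str]) -> dict:
    """Ask the RAG system each question and score its answer against the truth."""
    per_case = {c.question: token_recall(rag(c.question), c.ground_truth) for c in cases}
    return {"per_case": per_case, "mean": mean(list(per_case.values()))}


# Stand-in for a RAG system that always returns the reference answer.
lookup = {c.question: c.ground_truth for c in GOLDEN_SET}
report = run_batch(GOLDEN_SET, lookup.__getitem__)
```

Swapping `lookup.__getitem__` for a call into the real pipeline turns this into a batch run over the golden set.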

Offline vs Online Evaluation

— Two complementary strategies

🧪 Offline Evaluation

Run before you deploy. Use a golden test set — curated Q&A pairs with known correct answers.

✓ Run on every code change

✓ Catch regressions before deploy

✓ Compare configs head-to-head

✓ Reproducible, stable baselines

Golden Set → RAG → Score → Pass/Fail
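The final Pass/Fail step can be sketched as a small gate that a CI job might run after scoring. Everything here is an assumption for illustration: the function names, the thresholds, the tolerance, and the score values are made up and do not come from the lab.

```python
def failing_metrics(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Metrics whose score is below its absolute floor."""
    return sorted(m for m, floor in thresholds.items() if scores.get(m, 0.0) < floor)


def regressions(
    baseline: dict[str, float], candidate: dict[str, float], tolerance: float = 0.02
) -> list[str]:
    """Metrics where the candidate config dropped more than `tolerance` below baseline."""
    return sorted(m for m in baseline if candidate.get(m, 0.0) < baseline[m] - tolerance)


# Made-up numbers, purely to exercise the gate.
thresholds = {"faithfulness": 0.80, "context_recall": 0.70}
baseline = {"faithfulness": 0.91, "context_recall": 0.84}
candidate = {"faithfulness": 0.93, "context_recall": 0.76}

below_floor = failing_metrics(candidate, thresholds)
regressed = regressions(baseline, candidate)
passed = not below_floor and not regressed
```

Checking both an absolute floor and a drop against the baseline covers two cases: a config that is simply bad, and one that still clears the floor but is worse than what is already deployed.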

📡 Online Evaluation

Run in production. Evaluate real user queries as they arrive. Catch quality drops before users complain.

✓ Monitor live traffic continuously

✓ Alert when scores drop below threshold

✓ Identify problematic query patterns

✓ Track metrics over time

User Query → RAG → Score → Alert?
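One way to implement the alerting step is a rolling average over recent scores. The sketch below assumes a hypothetical `ScoreMonitor` class; the window size and threshold are arbitrary example values, not recommendations from the lab.

```python
from collections import deque


class ScoreMonitor:
    """Tracks the most recent scores for one metric and flags a low rolling mean."""

    def __init__(self, window: int = 50, threshold: float = 0.7) -> None:
        self.scores: deque[float] = deque(maxlen=window)  # oldest scores fall off
        self.threshold = threshold

    def record(self, score: float) -> bool:
        """Add a score; return True if an alert should fire."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        rolling_mean = sum(self.scores) / len(self.scores)
        # Wait for a full window so one early bad score does not page anyone.
        return window_full and rolling_mean < self.threshold


monitor = ScoreMonitor(window=3, threshold=0.7)
alerts = [monitor.record(s) for s in (0.9, 0.9, 0.2, 0.9)]
```

Averaging over a window trades reaction time for stability: a larger window produces fewer false alarms but notices a real drop later.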

The Key Difference

🧪 Offline: Known Answers

You have the ground truth, so you can score answers against a known reference and compute context recall. Use this to prevent regressions and compare experiments.

📡 Online: Unknown Answers

No ground truth is available, so use faithfulness and answer relevancy only. At scale, score a statistical sample of traffic with an LLM as judge.
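The offline/online split can be expressed as a choice of metrics based on whether a reference answer exists, plus a sampling decision for live traffic. This is a sketch under assumptions: the metric names and their grouping follow the text above rather than any library's API, and the helper names and 10% sample rate are hypothetical.

```python
import random
from typing import Optional

# Grouping assumed from the text: only these two are used without ground truth.
REFERENCE_FREE = ("faithfulness", "answer_relevancy")
REFERENCE_BASED = ("context_precision", "context_recall")


def metrics_for(ground_truth: Optional[str]) -> tuple[str, ...]:
    """Offline (ground truth known): all four. Online (unknown): reference-free only."""
    return REFERENCE_FREE + REFERENCE_BASED if ground_truth else REFERENCE_FREE


def should_score(rng: random.Random, sample_rate: float = 0.1) -> bool:
    """Score only a random fraction of live queries to limit LLM-judge cost."""
    return rng.random() < sample_rate


offline_metrics = metrics_for("Every NovaCorp employee receives 7.5 Wellness Days per quarter.")
online_metrics = metrics_for(None)
```

Passing a seeded `random.Random` into `should_score` keeps the sampling decision reproducible in tests.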

Use offline evaluation to build confidence before deploy. Use online evaluation to maintain quality after deploy. Both are necessary.