EvalDog grades your prompt & RAG outputs against real assertions, scores every case, and barks the moment a model update breaks something. Hosted dashboard + a zero-token CLI for CI and AI agents.
$ npx evaldog run shopbot.csv --min 80
✓ Greeting & intent
✓ Product search
✓ Add to cart
✗ Order status contains "delivered"
✓ Refund & escalation
80% 4/5 passed (gate 80%) exit 1
HOW IT WORKS
Drop a CSV, JSON, or YAML of test cases — the output you already have, plus what to assert.
Every case is checked — contains, equals, regex, valid-JSON, not-empty — and scored pass/fail.
Re-run on every model update. EvalDog flags the moment your score drops. (rolling out)
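The three steps above can be sketched in a few lines of Python. This is an illustration only, not EvalDog's implementation: the `Case` fields, the `check` and `score` helpers, and the exact semantics of each assertion type (for example, whether `equals` ignores surrounding whitespace) are all assumptions.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class Case:
    """One test case: a recorded output plus what to assert (hypothetical shape)."""
    name: str
    output: str
    assertion: str      # contains | equals | regex | valid-json | not-empty
    expected: str = ""  # unused by valid-json and not-empty


def check(case: Case) -> bool:
    """Return True if the case's output satisfies its assertion."""
    if case.assertion == "contains":
        return case.expected in case.output
    if case.assertion == "equals":
        # Assumption: surrounding whitespace is ignored.
        return case.output.strip() == case.expected.strip()
    if case.assertion == "regex":
        return re.search(case.expected, case.output) is not None
    if case.assertion == "valid-json":
        try:
            json.loads(case.output)
            return True
        except json.JSONDecodeError:
            return False
    if case.assertion == "not-empty":
        return bool(case.output.strip())
    raise ValueError(f"unknown assertion type: {case.assertion}")


def score(cases: list[Case]) -> float:
    """Percentage of cases that pass; 0.0 for an empty suite."""
    if not cases:
        return 0.0
    passed = sum(check(c) for c in cases)
    return 100.0 * passed / len(cases)


# Made-up sample suite: the "Order status" case fails its contains check.
suite = [
    Case("Greeting", "Hi! How can I help?", "not-empty"),
    Case("Cart", '{"items": 1}', "valid-json"),
    Case("Order status", "Your order has shipped.", "contains", "delivered"),
    Case("Order id", "Order #12345", "regex", r"#\d+"),
]
print(f"{score(suite):.0f}%")  # 3 of 4 pass, so this prints 75%
```

Every check here is a plain string or parser operation, so the same suite can be re-graded as often as you like without calling a model.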
SEE IT WORK
No video needed. This is the real flow, on loop: grade your suite → the model drifts → an alert fires.
support-bot
5 cases · openai:gpt-4o · watched daily
support-bot dropped 82% → 61% · trigger: model-change · 2/5 cases failing
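An alert like the one above boils down to comparing the latest score with the previous one. A minimal sketch, assuming a hypothetical `drop_alert` helper and a made-up `min_drop` tolerance of five points; EvalDog's actual alerting rules are not described here.

```python
from typing import Optional


def drop_alert(suite: str, previous: float, current: float,
               trigger: str, min_drop: float = 5.0) -> Optional[str]:
    """Return an alert message if the score fell by at least min_drop points."""
    if previous - current < min_drop:
        return None  # no drop, or a drop within tolerance
    return f"{suite} dropped {previous:.0f}% → {current:.0f}% · trigger: {trigger}"


print(drop_alert("support-bot", 82, 61, "model-change"))
# support-bot dropped 82% → 61% · trigger: model-change
print(drop_alert("support-bot", 82, 80, "model-change"))
# None
```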
MODEL DRIFT, LIVE
Our TDA Cart bot, asked the same thing across models — EvalDog grades each one and flags the gap automatically.
Prompt: “What’s your refund policy?”
50% drift
Same prompt, same assertion (contains “30 days”). Two models nail it, two drift. EvalDog caught it.
Run the live drift demo
FOR CI & AGENTS
The evaldog CLI grades locally with no model calls — so an agent can check 200 outputs with a single shell command instead of streaming every case through the LLM.
# fail the build if quality drops
$ npx evaldog run evals/*.csv --min 90 --json
…
✓ 47 passed
✗ 3 failed
94% 213/226 (gate 90% → exit 1)
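A CI job or agent can wrap that command and act on its exit code. The sketch below uses only the flags shown above (`run`, `--min`, `--json`). It assumes a nonzero exit code means the gate failed and does not parse the `--json` output, since its schema is not documented here. The helper names `build_command` and `run_gate` are hypothetical.

```python
import glob
import subprocess


def build_command(pattern: str, min_score: int) -> list[str]:
    """Expand the glob ourselves, since subprocess.run does not use a shell here."""
    files = sorted(glob.glob(pattern))
    return ["npx", "evaldog", "run", *files, "--min", str(min_score), "--json"]


def run_gate(pattern: str = "evals/*.csv", min_score: int = 90) -> int:
    """Run the CLI and return its exit code (assumed: nonzero means the gate failed)."""
    result = subprocess.run(build_command(pattern, min_score),
                            capture_output=True, text=True)
    print(result.stdout)
    return result.returncode


# In a CI step you might end the script with:
#   raise SystemExit(run_gate())
```

Passing the exit code straight through lets the surrounding pipeline fail the build without reading any of the graded output.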
TRY IT NOW
Five ready-made evals — greeting, search, cart, order status, refund. One click each in the dashboard.
Greeting & Intent
Product Search
Add to Cart
Order Status
Refund & Escalation
Grade your prompts before they ship. Free to try — no card, no setup.