Live at evaldog.com · early access

Ship LLM features that don't silently break.

EvalDog grades your prompt & RAG outputs against real assertions, scores every case, and barks the moment a model update breaks something. Hosted dashboard + a zero-token CLI for CI and AI agents.

Open the dashboard

$ npx evaldog run cases.csv

evaldog — terminal

$ npx evaldog run shopbot.csv --min 80

✓ Greeting & intent

✓ Product search

✓ Add to cart

✗ Order status contains "delivered"

✓ Refund & escalation

80% 4/5 passed (gate 80%) exit 1

support-bot · 67%

⚠ model fingerprint changed

Runs in CI · Built for AI agents · Zero LLM tokens · CSV / JSON / YAML · promptfoo-compatible

HOW IT WORKS

From test cases to a graded report in 60 seconds.

Upload your cases

Drop in a CSV, JSON, or YAML file of test cases: the outputs you already have, plus what to assert about each one.
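
To make that concrete, here is a minimal sketch of what a cases file could look like and how it might be read. The column names (`name`, `output`, `assert`, `expected`) and the `Case` record are assumptions made for illustration, not EvalDog's documented schema, and the sample outputs are invented.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Case:
    name: str       # human-readable label for the case
    output: str     # the model output you already captured
    assertion: str  # which check to apply, e.g. "contains"
    expected: str   # value the check compares against (may be empty)


def load_cases(text: str) -> list[Case]:
    """Parse CSV text into Case records. Column names are assumed, not EvalDog's schema."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        Case(
            name=row["name"],
            output=row["output"],
            assertion=row["assert"],
            expected=row.get("expected") or "",
        )
        for row in reader
    ]


# Hypothetical sample file: one row per case.
SAMPLE = """name,output,assert,expected
Greeting & intent,"Hi! How can I help you today?",not-empty,
Order status,"Your order is on its way.",contains,delivered
"""

cases = load_cases(SAMPLE)
for case in cases:
    print(f"{case.name}: {case.assertion} {case.expected!r}")
```

Each row pairs an output that was already produced with the check to run on it, so grading needs no model call.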

Get a graded report

Every case is checked — contains, equals, regex, valid-JSON, not-empty — and scored pass/fail.
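
The five checks named above are simple enough to sketch in a few lines. This is an illustrative approximation, not EvalDog's engine: the function names, the tuple layout, the assertion spellings, and choices such as case-sensitive matching are all assumptions, and the sample outputs are invented.

```python
import json
import re


def check(output: str, assertion: str, expected: str = "") -> bool:
    """Return True when `output` satisfies the named assertion."""
    if assertion == "contains":
        return expected in output  # case-sensitive substring (an assumption)
    if assertion == "equals":
        return output == expected
    if assertion == "regex":
        return re.search(expected, output) is not None
    if assertion == "valid-json":
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False
    if assertion == "not-empty":
        return output.strip() != ""
    raise ValueError(f"unknown assertion: {assertion}")


def grade(cases):
    """Score a list of (name, output, assertion, expected) tuples."""
    results = [(name, check(out, kind, exp)) for name, out, kind, exp in cases]
    passed = sum(ok for _, ok in results)
    score = round(100 * passed / len(results)) if results else 0
    return results, score


# Hypothetical outputs, loosely mirroring the ShopBot journey on this page.
cases = [
    ("Greeting & intent", "Hi! How can I help?", "not-empty", ""),
    ("Product search", '{"items": ["mug"]}', "valid-json", ""),
    ("Add to cart", "Added 1 mug to your cart.", "regex", r"Added \d+ "),
    ("Order status", "Your order is on its way.", "contains", "delivered"),
    ("Refund & escalation", "ok", "equals", "ok"),
]
results, score = grade(cases)
for name, ok in results:
    print("✓" if ok else "✗", name)
print(f"{score}% {sum(ok for _, ok in results)}/{len(results)} passed")
```

Because every check is a pure string operation, the same input always yields the same score.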

Watch for drift

Re-run on every model update. EvalDog flags the moment your score drops. (rolling out)
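
One way to picture the drop check is a comparison between the latest score and the previous run. The sketch below is a guess at the idea, not EvalDog's implementation: `detect_drop`, its `tolerance` parameter, and the message format are all hypothetical.

```python
from typing import Optional


def detect_drop(history: list[int], current: int, tolerance: int = 0) -> Optional[str]:
    """Return an alert message if `current` fell below the previous score, else None."""
    if not history:
        return None  # first run: nothing to compare against
    previous = history[-1]
    if previous - current > tolerance:
        return f"score dropped {previous}% → {current}%"
    return None


print(detect_drop([82], 61))  # a drop larger than the tolerance produces a message
print(detect_drop([82], 82))  # an unchanged score produces None
```

A non-zero `tolerance` would let small fluctuations pass without an alert; whether EvalDog offers such a setting is not stated here.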

SEE IT WORK

Watch EvalDog catch a regression — live.

No video needed. This is the real flow, on loop: grade your suite → the model drifts → an alert fires.

evaldog · support-bot.eval.yaml

support-bot

5 cases · openai:gpt-4o · watched daily

grading… 82% → 61%

Greeting & intent · pass

Product search · pass

Add to cart · pass

Order status · contains "delivered"

Refund & escalation · pass

⚠ model fingerprint changed · gpt-4o updated

EvalDog → #qa-alerts · now

support-bot dropped 82% → 61% · trigger: model-change · 2/5 cases failing

MODEL DRIFT, LIVE

Change the model. Watch the answer drift.

Our TDA Cart bot, asked the same thing across models — EvalDog grades each one and flags the gap automatically.

Prompt: “What’s your refund policy?”

50% drift

GPT-4o mini · OpenAI · 100%

Claude 3.5 Haiku · Anthropic · 100%

Llama 3.3 70B · open · 50%

Mistral Small · open · 50%

Same prompt, same assertion (contains “30 days”). Two models nail it, two drift. EvalDog caught it.
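
The comparison above can be approximated as a per-model pass rate on the same assertion, with drift taken as the gap between the best and worst score. That definition of drift is an assumption that happens to match the numbers shown (100% and 50% give a 50% gap). The model names and responses below are placeholders, not real outputs.

```python
def pass_rate(responses: list[str], needle: str) -> int:
    """Percentage of responses that contain `needle`."""
    hits = sum(needle in r for r in responses)
    return round(100 * hits / len(responses))


# Placeholder responses to "What's your refund policy?" (invented for illustration).
responses = {
    "model-a": [
        "Refunds are accepted within 30 days.",
        "You have 30 days to return an item.",
    ],
    "model-b": [
        "Refunds are accepted within 30 days.",
        "Returns are accepted for a month.",
    ],
}

scores = {model: pass_rate(texts, "30 days") for model, texts in responses.items()}
drift = max(scores.values()) - min(scores.values())  # assumed definition of drift

for model, score in scores.items():
    print(f"{model}: {score}%")
print(f"{drift}% drift")
```

Note that "for a month" fails a literal `contains` check even though a human might accept it, which is exactly the kind of wording change an assertion is meant to surface.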

Run the live drift demo

FOR CI & AGENTS

One command. A score. An exit code.

The evaldog CLI grades locally with no model calls — so an agent can check 200 outputs with a single shell command instead of streaming every case through the LLM.

Deterministic — no tokens, no API keys
Exit 1 on regression — drop it straight into CI
--json output your agent can parse
Same engine as the hosted dashboard
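
The gate itself is a small piece of logic: turn a pass count into a score and map it to an exit code. The sketch below assumes a score at or above the minimum passes; how EvalDog treats a score exactly at the threshold is an assumption here, and `gate_exit_code` is a hypothetical helper, not part of the CLI.

```python
def gate_exit_code(passed: int, total: int, min_score: float) -> int:
    """Return 0 when the score meets the gate, 1 otherwise (boundary handling assumed)."""
    score = 100 * passed / total if total else 0.0
    return 0 if score >= min_score else 1


# 3 of 5 cases passing is 60%, below an 80% gate, so the build should fail.
failing = gate_exit_code(passed=3, total=5, min_score=80)
# 5 of 5 cases passing is 100%, so the build should pass.
passing = gate_exit_code(passed=5, total=5, min_score=80)

print(f"exit {failing}")
print(f"exit {passing}")
# In a real CI script you would finish with sys.exit(<code>) so the runner sees the result.
```

CI runners treat any non-zero exit code as a failed step, which is why a single number is enough to block a merge.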

Read the quick start

ci.yml

# fail the build if quality drops

$ npx evaldog run evals/*.csv --min 90 --json

…

✓ 47 passed

✗ 3 failed

94% 213/226 (gate 90% → exit 1)

TRY IT NOW

A full ShopBot journey, pre-loaded.

Five ready-made evals — greeting, search, cart, order status, refund. One click each in the dashboard.

Open the dashboard

STEP 1

Greeting & Intent

STEP 2

Product Search

STEP 3

Add to Cart

STEP 4

Order Status

STEP 5

Refund & Escalation

Stop finding out from your users.

Grade your prompts before they ship. Free to try — no card, no setup.

Open the dashboard · Quick start