EvalDog

LIVE DEMO · TDA CART SHOPBOT

Same question. Different model. Different answer.

TDA Cart added a simple support chatbot. Swap the model behind it and the answer drifts — the refund policy goes vague, the JSON breaks, the bot guesses instead of asking. EvalDog runs the same checks on every model and catches the drift automatically.

1 · Pick a question

Asserts: must state the exact 30-day window (contains, not-empty).
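As a rough illustration of what a "contains" plus "not-empty" assertion pair might look like, here is a minimal Python sketch. It is not EvalDog's implementation: the names `CheckResult`, `check_not_empty`, `check_contains`, and `grade` are hypothetical, and so are the needle string "30-day", the case-insensitive match, and the score computed as the fraction of checks passed.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool


def check_not_empty(answer: str) -> CheckResult:
    # Passes when the answer has at least one non-whitespace character.
    return CheckResult("not-empty", bool(answer.strip()))


def check_contains(answer: str, needle: str) -> CheckResult:
    # Case-insensitive substring match (an assumption made for this sketch).
    return CheckResult("contains", needle.lower() in answer.lower())


def grade(answer: str, needle: str = "30-day") -> float:
    # Score = fraction of checks that pass, between 0.0 and 1.0.
    results = [check_not_empty(answer), check_contains(answer, needle)]
    return sum(r.passed for r in results) / len(results)
```

Under these assumptions, an answer that names the 30-day window passes both checks, a vague answer passes only "not-empty", and a blank answer passes neither.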

2 · Pick models (max 4)

3 · OpenRouter key (optional — for live calls)

Used only for this request — never stored. Get one at openrouter.ai/keys.

Pick a question and one or more models, then run. EvalDog grades every model’s answer.

What this proves

Identical prompts and assertions, run across models, produce different scores. That gap is model drift, the same kind of change that slips in silently when a provider ships an update. EvalDog turns it into a number you can gate on.
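To make "a number you can gate on" concrete, here is a small Python sketch of one way such a gate could work: compare per-model scores from an earlier run against a current run and flag any drop beyond a tolerance. The function `find_regressions`, its parameters, and the model names in the usage note are hypothetical and do not describe EvalDog's actual API.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return {model: score_drop} for models whose score fell by more than tolerance."""
    regressions: dict[str, float] = {}
    for model, old_score in baseline.items():
        new_score = current.get(model)
        if new_score is None:
            # Model missing from the current run; nothing to compare.
            continue
        drop = old_score - new_score
        if drop > tolerance:
            regressions[model] = drop
    return regressions
```

A CI step could call, for example, `find_regressions({"model-a": 1.0, "model-b": 1.0}, {"model-a": 1.0, "model-b": 0.5})` and fail the build whenever the returned dictionary is non-empty.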

Grade your own cases