AI demos are misleading: how to evaluate AI tools honestly
The gap between a polished demo and a production system is enormous. Here's what demos hide, why it matters, and how to evaluate AI tools honestly.
A demo is a trailer, not a movie. That’s the entire argument of this article, and you can stop reading here if it lands.
I’ve watched dozens of AI demos this year. Product launches, investor pitches, conference keynotes, Twitter threads with the fire emoji. Almost all of them have one thing in common: they’re misleading. Not because the presenters are lying, but because the format itself is designed to hide everything that matters.
A demo shows you a tool working perfectly, once, on an input chosen by the person selling it to you. That tells you almost nothing about whether the tool will work for your problems, on your data, at your scale, when things go wrong. And things will go wrong.
I’m not saying AI tools are bad. Many of them are genuinely useful. I’m saying that demos are a terrible way to evaluate them, and most people don’t realize how much is being hidden.
The perfect input problem
Every demo starts with a carefully chosen input. The presenter types a prompt they’ve tested fifty times. They know exactly what the model will say. They’ve refined the wording until the output is impressive.
This is the AI equivalent of a magician forcing a card. You think you’re seeing a general capability. You’re actually seeing one specific input that happens to produce a great result.
I once watched a coding demo where the presenter asked an AI to build a complete web app from a single prompt. The output was clean, well-structured, functional. The audience was stunned. What the audience didn’t know: the prompt had been refined over weeks. The presenter had tried dozens of variations. The “simple” prompt was actually a precisely engineered instruction that exploited the model’s training data in a specific way. Change three words and you’d get garbage.
When you try the tool yourself with your own inputs, your own phrasing, your own problems, the results will be different. Sometimes better, often worse, almost always less consistent. The demo showed you the ceiling. Your daily experience will be somewhere between the ceiling and the floor.
The hidden system prompt
Behind every impressive demo is a system prompt you never see. This is the instruction set that tells the model how to behave, what format to use, what constraints to follow, what persona to adopt. It can be hundreds or thousands of words long.
The system prompt is doing a lot of the work. When a demo shows you an AI that responds in a specific format, with specific guardrails, citing sources in a specific way, that behavior isn’t magic. It’s instructions. Someone spent days writing and tuning those instructions to get the output you’re seeing.
This matters because when you sign up for the tool, you might not have access to that system prompt. Or you might have a different version. Or you might be expected to write your own. The demo made it look like the AI just “knows” how to do this. In reality, a human engineer wrote a very specific set of rules, and the model is following them. Sometimes.
What happens when it fails?
Demos never fail. That’s the point. But real usage fails constantly.
What does the tool do when the input is ambiguous? When the data is messy? When the model hallucinates a confident answer that’s completely wrong? When the API times out? When the output is in the wrong format? When the user asks something slightly outside the expected scope?
I’ve never seen a demo address any of these questions. Not once. And these are the questions that determine whether a tool is actually usable in production.
Error handling is the difference between a prototype and a product. A demo shows you the prototype. It shows you the happy path, the golden scenario, the one time out of ten where everything lines up perfectly. The other nine times? You’ll discover those on your own, probably at the worst possible moment, probably when a customer is waiting.
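To make that concrete, here’s a minimal sketch of what handling the unhappy path can look like in Python. Treat it as an illustration under assumptions, not a recipe: call_model is a hypothetical stand-in for whatever client your tool exposes, and I’m assuming the model is supposed to return a JSON object containing certain keys. The wrapper retries on timeouts and malformed output, then returns an explicit failure instead of passing garbage downstream.

```python
import json
from typing import Callable


def ask_with_fallback(
    call_model: Callable[[str], str],  # hypothetical: takes a prompt, returns raw text
    prompt: str,
    required_keys: set,
    max_attempts: int = 3,
) -> dict:
    """Call the model, validate the output, and fail explicitly if it never validates."""
    last_error = "no attempts made"
    for attempt in range(1, max_attempts + 1):
        try:
            raw = call_model(prompt)   # may raise TimeoutError
            data = json.loads(raw)     # raises ValueError if the output is not JSON
        except TimeoutError:
            last_error = f"attempt {attempt}: timed out"
            continue
        except ValueError:
            last_error = f"attempt {attempt}: output was not valid JSON"
            continue
        # Right format, wrong shape: still a failure.
        if not isinstance(data, dict) or not required_keys.issubset(data):
            last_error = f"attempt {attempt}: missing required keys"
            continue
        return {"ok": True, "data": data}
    # Every attempt failed: say so, rather than returning something plausible.
    return {"ok": False, "error": last_error}
```

Notice what this doesn’t catch: a well-formed, confident, wrong answer passes every check here. Format validation is the easy part of failure handling.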
If you’ve worked with AI agents at all, you know that thinking about failure modes is where most of the real engineering happens.
Curated speed
Watch the timing on any AI demo. The presenter types a prompt and the response appears almost instantly. Or they cut to the result. Or they speed up the video. Or they narrate over the waiting time so you don’t notice it.
In reality, many AI operations take 10 to 60 seconds. Some take minutes. Complex agent workflows can take much longer. This matters more than you think, because speed affects whether a tool is practical for real workflows.
If your coding assistant takes 45 seconds to respond to each question, you’re not going to use it for rapid iteration. If your document summarizer takes two minutes per document, it’s not going to work for processing a hundred documents in a meeting prep session: run one after another, that’s more than three hours of waiting. The demo made it look instant. Your experience will involve a lot of staring at loading spinners.
Some presenters are honest about this. Most are not. The incentive structure rewards speed, so demos are optimized for perceived speed, not actual speed.
The cost nobody mentions
The demo ran once. Maybe a hundred times during testing. The presenter is not paying for 10,000 runs a day.
But you might be. And the cost math changes everything.
I’ve seen teams get excited about an AI feature that costs $0.15 per API call during the demo. That sounds cheap. Then they calculate what it costs at their actual volume: 50,000 calls per day comes to $7,500 per day, or $225,000 over a 30-day month. Suddenly the feature that looked magical in the demo looks like a budget catastrophe.
The cost problem gets worse with complex workflows. If your agent makes five tool calls per task, and each call involves a large context window, a single user interaction might cost $1 or more. Multiply by thousands of users. Do the math before you get excited.
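The arithmetic is simple enough to script, which is a good reason to actually do it. Here’s a rough Python sketch. The function names are mine, the flat 30-day month is an assumption, and the $0.20 per tool call in the last line is a made-up figure chosen only to show how five calls can add up to a dollar.

```python
def daily_cost(calls_per_day: int, cost_per_call: float) -> float:
    """Spend per day at a given call volume."""
    return calls_per_day * cost_per_call


def monthly_cost(calls_per_day: int, cost_per_call: float, days: int = 30) -> float:
    """Spend per month, assuming the same volume every day."""
    return daily_cost(calls_per_day, cost_per_call) * days


def task_cost(tool_calls_per_task: int, cost_per_call: float) -> float:
    """Cost of one agent task that fans out into several model calls."""
    return tool_calls_per_task * cost_per_call


if __name__ == "__main__":
    # Figures from the example above: $0.15 per call, 50,000 calls per day.
    print(f"daily:    ${daily_cost(50_000, 0.15):,.0f}")    # daily:    $7,500
    print(f"monthly:  ${monthly_cost(50_000, 0.15):,.0f}")  # monthly:  $225,000
    # Hypothetical: five tool calls at an assumed $0.20 each.
    print(f"per task: ${task_cost(5, 0.20):,.2f}")          # per task: $1.00
```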
Demos never show you the bill. They never show you the token counts, the context window sizes, the retry costs when things fail, the monitoring costs, the infrastructure costs. They show you the output and let you assume the input was free.
The “works on my machine” of AI
Software developers have a running joke: “works on my machine.” The code runs perfectly on the developer’s laptop but breaks everywhere else. AI has the same problem, but worse.
A demo works on the presenter’s setup because the presenter controls everything. The model version, the temperature setting, the context window, the available tools, the data format, the input distribution. Change any one of those variables and the results shift.
Model versions change constantly. If your code is hitting an unversioned endpoint (just gpt-4 or just claude rather than a pinned version like gpt-4-0125-preview or claude-sonnet-4-6), the underlying model can shift under you and the carefully tuned prompts that worked in the demo will start producing different output. This is often called model drift; production teams that pin versions can mostly avoid it, but a lot of demos and prototype code don’t, which is part of why so many “it worked yesterday” stories exist. I’ve seen production systems break because a model provider shipped a minor update that changed how the model interpreted a specific instruction.
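One low-tech way to enforce pinning is to keep an explicit list of the model identifiers you’ve actually tested and refuse to run against anything else. The sketch below assumes exactly that. TESTED_MODELS and require_pinned are names I made up, and the two identifiers are just the examples from the paragraph above, not a recommendation.

```python
# Identifiers this project has actually been tested against.
# These two are the examples from the text; replace them with your own.
TESTED_MODELS = frozenset({"gpt-4-0125-preview", "claude-sonnet-4-6"})


def require_pinned(model: str) -> str:
    """Return the model name if it is on the tested list, otherwise fail loudly."""
    if model not in TESTED_MODELS:
        raise ValueError(
            f"{model!r} is not a pinned, tested model; "
            f"expected one of {sorted(TESTED_MODELS)}"
        )
    return model
```

Calling require_pinned("gpt-4") raises at startup instead of letting an alias quietly resolve to whatever the provider ships next.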
This isn’t a bug; it’s a fundamental property of these systems. They’re probabilistic. They’re non-deterministic. The same input can produce different outputs on different runs. The demo showed you one run. Your production system will run thousands of times, and the variance across those runs is something demos never acknowledge.
How to actually evaluate an AI tool
So if demos are unreliable, what should you do instead? Here’s my approach, and I use it every time I evaluate a new tool.
Start by bringing your own inputs. Your actual data, your actual problems, your actual edge cases. If the tool can’t handle your real work, it doesn’t matter how well it handles the demo scenario. Skip the example prompts from the documentation and use the messy, ambiguous, complicated prompts that reflect your actual needs.
Then test the edges, not the center. Find the boundary of what the tool can do. Give it malformed input, ambiguous instructions, a task slightly outside its intended scope. The center is where demos live. The edges are where your production system will spend most of its time. While you’re at it, deliberately try to break it. Does it fail gracefully with a useful error message? Does it hallucinate an answer? Does it silently produce wrong output? How the tool fails tells you more about its quality than how it succeeds.
Run fifty queries, not one. Time the slowest ones. Time the average. Think about what those times mean for your workflow. If the average response time is 20 seconds and your users expect instant feedback, the tool isn’t going to work, no matter how good the output is.
Then estimate your actual volume. Not the optimistic “we’ll start small” volume, but the volume you’ll have in six months if the tool works and people start relying on it. Multiply by the per-call cost. Build in a retry budget (real-world retry rates vary wildly by use case, but a planning assumption of “10-20% of calls will need retries” is rarely too pessimistic). Add monitoring and infrastructure. Is the total still acceptable?
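Both measurements fit in a few lines of Python. This is a sketch under assumptions, not a benchmark suite: call_model is a hypothetical function wrapping the tool you’re testing, the default retry rate of 0.15 is just the midpoint of the 10-20% planning range, and I’m assuming each retried call is billed once more at the same price. Monitoring and infrastructure go in as a single monthly number you supply.

```python
import statistics
import time
from typing import Callable, Iterable


def time_queries(call_model: Callable[[str], str], prompts: Iterable[str]) -> dict:
    """Run each prompt once and report the average and slowest wall-clock time."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        call_model(prompt)  # hypothetical wrapper around the tool under test
        latencies.append(time.perf_counter() - start)
    if not latencies:
        raise ValueError("need at least one prompt to time")
    return {
        "runs": len(latencies),
        "mean_s": statistics.mean(latencies),
        "slowest_s": max(latencies),
    }


def monthly_budget(
    calls_per_day: int,
    cost_per_call: float,
    retry_rate: float = 0.15,      # assumed share of calls that are retried once
    monthly_overhead: float = 0.0,  # monitoring and infrastructure, if known
    days: int = 30,
) -> float:
    """Estimated monthly spend, including retries and fixed overhead."""
    billable_calls = calls_per_day * (1 + retry_rate) * days
    return billable_calls * cost_per_call + monthly_overhead
```

Feed time_queries your own fifty prompts, not the documentation examples, and pass monthly_budget the volume you expect in six months, not the volume you have today.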
Run the same query ten times and check whether the results are consistent. For many use cases, consistency matters more than peak quality. If the tool produces brilliant output 70% of the time and garbage 30% of the time, that might be worse than a simpler tool that produces good output 95% of the time.
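A crude way to put a number on consistency: send the same prompt several times and measure how often you get the most common answer. The Python sketch below does that with exact matching after trimming whitespace and lowercasing, which is my simplification. It works for short, structured answers (labels, yes/no, extracted values) and is too strict for free-form text. As before, call_model is a hypothetical wrapper around the tool you’re evaluating.

```python
from collections import Counter
from typing import Callable


def consistency(call_model: Callable[[str], str], prompt: str, runs: int = 10) -> float:
    """Fraction of runs that returned the single most common (normalized) output."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Normalize lightly so trivial whitespace and case differences don't count.
    outputs = [call_model(prompt).strip().lower() for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs
```

A score of 1.0 means every run agreed; 0.7 means three runs in ten disagreed with the majority. Agreement isn’t correctness, so you still have to check that the majority answer is right.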
Finally, find someone who’s used it in production. Not the customer testimonials on the website (those are gameable, and so are public reviews). Find someone on a forum, a Discord, a subreddit who’s been using the tool for months. Ask them what breaks. Ask them what surprised them. Ask them what they wish they’d known before they committed.
For the related work this article gestures at: error handling patterns covers what graceful degradation actually looks like, choosing the right tool covers the broader tool-selection criteria, and when not to use agents covers the cases where the right answer is no agent at all.
So what
Demos are useful for one thing: understanding what a tool is trying to do. They show you the vision, the intended use case, the best possible outcome. The trailer is fine. Just don’t confuse it with the movie.
The next time you watch an AI demo and feel that rush of excitement, pause. Ask yourself: what am I not seeing? What input did they choose and why? What’s in the system prompt? What happens when it fails? How much does this cost at scale? How consistent is it really?
Those questions won’t make the demo less impressive. But they’ll make your evaluation of the tool far more honest.