Agentic Testing Will Save You
I don’t think we’ve really figured out how to test in the age of agents. Many assume it’s basically the same story as before, except that now agents are doing things. In the before-times we used unit testing, and maybe some integration testing, so that’s what we continue to use with agents. Like many things happening these days, it’s “the same as before, but with agents”. That is the wrong way to think about it. I’m going to explain why this approach to testing fails, and offer a solution: an agent-driven test that operates at the highest level, the way a user would, and explain why that works.
First of all, writing tests is not great. It’s hard to do well. As an engineer, I take pride in writing really good tests: tests that read like a story, where all the data needed to verify behavior is right there; simple and beautiful. But it adds a lot of time; often most of the time spent writing code is spent writing tests. Much as I value well-written tests, I’m also someone with things to do, and LLMs can help with tests. I’m not alone in this. In fact, that’s really the first point of entry, the high-effort weak spot where LLMs begin to infiltrate your codebase. How can unit tests be a protection against agents messing up your codebase when agents are writing the unit tests? It’s like being afraid of the Terminator and buying the killer robotic dog from Black Mirror to protect your home. I realize that in *Terminator 2: Judgment Day* this kind of worked out, but despite what most people believe, it’s a much weaker movie than the first Terminator. What would actually happen is that you’d now have two killer robots to worry about.
So the answer isn’t unit tests. It’s worth asking why. Fundamentally, what’s wrong with unit tests? The short answer is that they are too low-level, and most logic isn’t interesting enough to benefit from them. I’ve seen various TDD and other testing advocates say “not so!” and proceed to write a TDD example, beautiful and useful, for something like a queue. But that’s not what real code looks like. Yes, if your goal is to produce a library, unit testing is really the way to go. There’s nothing better! But if you have an actual program that does things, testing a very small part of the system only goes so far, and once you’ve handed that responsibility over to the agent, even with coverage metrics, you don’t really have a good sense of how well you’re doing. You can have good coverage while the tests are testing the wrong things.
Maybe integration tests are what will save us? No, integration tests have always been worse than unit tests: they’re hard to set up, hard to read, and you still aren’t testing everything, just a few things. A common result is a clean integration test for one part of your system while something just outside it renders the whole system useless.
Let me ask you: if you wanted to ensure your program is working, and you wanted to really be sure, what would you do? I personally would start up my program and test it manually. It’s the only way to be sure before merging it in. Well, now it pays to ask: why am I running manual tests like a chump when I have a perfectly good agent that can do it for me? Now we’re getting into the spirit of the age! And it makes sense: the reason we don’t test manually as much as we should is that it is time consuming, annoying, and limited. The LLM can’t really solve the time-consuming part, but if you aren’t the one doing it, it’s just wall time, not programmer time, so it’s much easier to swallow. And the annoying part can be solved by running this *as an agent*. This is the magic part that makes it useful, and different from other ways of testing.
By agentic testing, I don’t mean just letting Claude Code or whatever do the equivalent of manual testing. I mean an actual agentic test framework with the tools appropriate for interacting with your app as a user, plus some additional abilities. This formalizes the testing beyond just having a skill or tools, and keeping it isolated from your coding agent’s context is pretty important. It also needs to be a tool that returns an appropriate exit code, so it can be used in scripts.
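To make the exit-code point concrete, here is a minimal sketch of how such a tool might map the testing agent’s verdicts to a shell exit status. Everything here is hypothetical: `exit_code_for` and the verdict shape are illustrative, and the agent run itself is stubbed out.

```python
# Hypothetical sketch: an agentic test runner that maps the agent's
# per-test verdicts to a process exit code, so the tool can be used
# in shell scripts and CI. A real framework would drive the app with
# tools and produce these verdicts itself.

def exit_code_for(verdicts):
    """verdicts: [{"name": str, "passed": bool}] from the testing agent.
    Print a summary line per test and return the shell exit code."""
    ok = True
    for v in verdicts:
        print(f"{'PASS' if v['passed'] else 'FAIL'}  {v['name']}")
        ok = ok and v["passed"]
    return 0 if ok else 1

# A real CLI would end with something like:
#   sys.exit(exit_code_for(run_agent_suite(args)))
```

With that, `agentic-test && deploy` works the way any other test command does.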
A good agentic test has the following properties:
1. It tests out a wide enough range of functionality to be useful, in a way that is trivially readable and writeable.
2. It tests what the user sees and judges whether it makes sense from a user’s perspective.
3. It can pass / fail but also can comment on unusual things it finds.
4. It has enough independence to vary the script according to the circumstances.
5. It looks at the logs and other byproducts of processing to make sure that there are no obvious issues as well.
6. It can be run based on the change under test or vague descriptions of what to test by the developer.
This mirrors the value you bring as a developer when you test things manually, and the LLM can be more thorough, especially when looking at logs and similar byproducts. It’s very powerful!
Each one of these points is worth going through.
For (1), it should be extremely easy to write a test, and extremely easy to read one. You can rely on agent intelligence to figure out how to actually run it. Your UI changes? You’re testing an LLM that returns random things? An agent can deal with it. Here’s a real agentic test of ours, which you can see is trivially writable and readable. You, a reader with no context on our system, can understand everything about this test and, if you want, change or extend it.
```yaml
steps:
  # --- HTML: well-known static page ---
  - channel_id: "extract-smoke-html"
    actor_ids: [1]
    prompt: |
      Ask Continua to read this URL and summarize it:
      https://www.paulgraham.com/startupideas.html
      The test PASSES if Continua returns a substantive summary that
      includes concepts from the essay (e.g. startup ideas, problems,
      organic growth, or similar themes from Paul Graham's writing).
      The response should contain real extracted content, not an error
      message or "could not read" type response.
      The test FAILS if Continua says it could not read the page,
      returns an error, or gives a generic response that doesn't
      reflect the actual page content.

  # --- HTML: JS-heavy page (SPA / dynamic content) ---
  - channel_id: "extract-smoke-js"
    actor_ids: [1]
    prompt: |
      Ask Continua to read this URL and tell you what it's about:
      https://react.dev/learn
      The test PASSES if Continua returns content that describes
      React concepts (components, JSX, hooks, rendering, etc.).
      The key signal is that real page content was extracted, not
      just a "please enable JavaScript" shell or an error.
      The test FAILS if Continua says it could not read the page,
      returns only a JavaScript-required notice, or gives content
      that clearly doesn't match the React documentation page.

  # --- PDF: publicly accessible document ---
  - channel_id: "extract-smoke-pdf"
    actor_ids: [1]
    prompt: |
      Ask Continua to read this PDF and summarize it:
      https://www.w3.org/WAI/WCAG21/Techniques/pdf/PDF1.pdf
      The test PASSES if Continua returns content related to PDF
      accessibility, WCAG, or web content accessibility guidelines.
      The response should reflect actual PDF text extraction, not
      an error or "unsupported format" message.
      The test FAILS if Continua says it cannot read PDFs, returns
      an error, or gives content that doesn't match the PDF.
```

An agentic test needs to see what the user sees, which is (2). Continua’s product is mostly text-based, so this is fairly easy, but it may involve screenshots or more advanced techniques. If the system is working great but the results are not presented clearly to the user, the agent may fail the test, which is a really nice outcome.
For (3), besides passing or failing, we want the test to be able to comment on weird things that happened along the way: “yes, this worked as intended, but it was formatted in a strange way”, or “it took too long”. Having a way to collect this warning-type information is important, because the tests can serve not just to pass or fail but also to help with agentic loop-closing: you can instruct an agent to fix all failing tests and all the other weird things found along the way. As with compiler warnings, you may sometimes want to treat warnings as errors, so you can ensure not only that the task succeeded but that nothing unusual happened along the way.
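One way this warning-collecting could look, as a minimal sketch: each test yields a verdict plus free-form observations, and a strict mode promotes warnings to failures. The class and function names here are illustrative, not any real framework’s schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: agentic test results carry warnings alongside
# the pass/fail verdict, and a strict mode treats warnings like a
# compiler's -Werror.

@dataclass
class AgenticTestResult:
    name: str
    passed: bool
    # Free-form observations, e.g. "formatted strangely", "took too long"
    warnings: list = field(default_factory=list)

def suite_verdict(results, warnings_as_errors=False):
    """Return (ok, issues), where issues lists everything an agent
    could be told to go fix: failures and weirdness alike."""
    issues = [f"{r.name}: FAILED" for r in results if not r.passed]
    issues += [f"{r.name}: warning: {w}" for r in results for w in r.warnings]
    ok = all(r.passed for r in results) and not (warnings_as_errors and issues)
    return ok, issues
```

The `issues` list is exactly what you would hand to an agent for loop-closing: “fix all of these.”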
Varying the test according to what it finds (4) is important; it helps deal with minor product differences. If the way a task is accomplished changes in some major or minor way, the test shouldn’t need to change. As long as the product is understandable to a user, it should in theory be understandable to an agent. This keeps the test simple. The testing agent must be given tools that let it navigate the UI according to the product output it perceives.
Looking at logs (5) is super important. Especially in a world of products with AIs in them, having a problem doesn’t necessarily mean the test will fail: the product under test may have enough intelligence to paper over minor issues and accomplish the task anyway. But looking at the logs and other sources of debug information lets you make sure everything is working as intended. For us, a higher-level script runs the test, then looks at the logs and a few other things and checks the results against a set of rules. For example, we store the product’s LLM input and check that it is well-formed. If we accidentally duplicate part of the chat, that almost certainly wouldn’t cause a test failure, but we want to know about it, because it will decrease quality to some extent. This is another thing ordinary testing simply doesn’t do.
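As a sketch of one such rule (the function name and message shape are hypothetical, not our actual implementation), here is what a duplicated-chat check over stored LLM input might look like:

```python
# Hypothetical sketch of one log-inspection rule: scan the stored LLM
# input for chat messages that were accidentally duplicated. A smart
# product papers over this kind of defect, so no end-to-end check
# catches it, but it still quietly degrades quality.

def find_duplicated_messages(messages):
    """messages: [{"role": str, "content": str}] as sent to the LLM.
    Return the messages that repeat the previous one verbatim."""
    return [cur for prev, cur in zip(messages, messages[1:])
            if cur["role"] == prev["role"] and cur["content"] == prev["content"]]
```

A real system would run a battery of rules like this over the logs after each agentic test, surfacing anything found as warnings.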
Finally (6), besides the YAML-based specifications or other stock scripts, you should be able to just tell the agent “test out my PR” and have it look at the current git branch and figure out which “manual”-style tests would exercise the change. Or the developer could pass it a simple command such as “test out image generation”. It really should be this easy, which means testing often doesn’t even need a script.
There’s much more, though. Once you have this system, you can use it not just for testing, but for experimentally-driven improvements. For example, you can have these tests run in a loop, generating a metric, while another agent varies the program under test to improve that metric. The agentic testing gives you the important property that whatever happens with the code under test, it can just run and get the results you need.
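The improvement loop could be sketched like this, with everything hypothetical: `score_variant` stands in for a full agentic test run that returns a metric such as the pass rate, and varying the program under test is left to another agent.

```python
# Hypothetical sketch of experimentally-driven improvement: another
# process proposes variants of the program under test, and the agentic
# suite repeatedly scores each one; we keep the best.

def improve(variants, score_variant):
    """Score every candidate variant with the agentic suite and
    return (best_variant, best_score)."""
    scored = [(score_variant(v), v) for v in variants]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score
```

The key property the agentic tests provide is that `score_variant` keeps working no matter how the code under test changes.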
Just as we looked at why unit tests fundamentally don’t work, I want to emphasize why this method fundamentally works in the agentic age. Here, the prompt of the test is right there, written by the actual developer; humans can read and write it, and because it is interpreted at runtime, it can be flexible in a way that unit tests cannot be. Earlier I complained about agents testing agents, and yes, this is also agents testing agentic code, but the difference is that the test itself is not agentic code. It’s human prompts testing agentic code via agents, rather than agents testing agentic code as a byproduct of more remote human prompts. This is the core difference.
The disadvantage of this system is that now you have another LLM in the loop, which introduces some uncertainty: whether your tests pass or fail is now itself a quality problem. In practice, this hasn’t really caused us issues, probably because current models are good enough to act as judges and our tests check reasonably clear things. Still, I think this kind of system is best used as part of local development; it wouldn’t be a great fit for automated acceptance testing.
It isn’t a fit for every product: the product must be cheap and fast enough to run and inspect, and non-text paradigms are challenging. Getting this to work on a game, for example, might be very rewarding, but it’s a serious effort.
To me, this seems like the future of testing. I feel strongly enough about that to have made an Emacs package for agentically testing Emacs functionality, llm-test. Emacs users, you really deserve the best of everything! There you can see how it looks to implement one of these things in practice (if you don’t mind reading elisp). Try it out, or write your own, and you’ll find it opens up a new universe of possibilities.