Playwright version 1.56 shipped with three AI-powered agents: Planner, Generator, and Healer. The promise? AI that can plan your tests, write them, and fix them when they break.
I spent a couple of hours testing all three agents on a real application. I timed every operation, hit real bugs, and watched the AI hallucinate an entire test run that never happened. Here's what I found.
Let's dive in.
What Are Playwright Agents, Really?
When I first heard about Playwright agents, I expected some sophisticated AI magic under the hood.
The reality? An agent is just a Markdown file with a set of instructions and a limited list of tools. That's it.
Each agent lives under the .github/chat-modes/ folder as an .md file. Open the Planner agent file and you'll find a description of what it does, the list of MCP tools it can use, a role definition, workflow steps, and an expected output format.
The difference between an agent and a regular Copilot instruction is tool exposure. An agent only has access to specific tools, while a regular instruction exposes all available tools to the default Copilot agent. You could literally copy the agent's prompt into an .instructions.md file, add it as context to your Copilot chat, and it would work the same way.
Why limit the tools? Security. You don't want your test planning agent accidentally deleting your production database. Fair enough.
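To make the "prompt plus tool allowlist" idea concrete, here is a minimal sketch in TypeScript. It is a toy model, not the real schema of the generated agent files: the `AgentDefinition` shape, the `exposedTools` helper, and every tool name below are assumptions made up for illustration.

```typescript
// Hypothetical model of an agent definition. Field names are illustrative,
// not the actual format of the generated .md files.
interface AgentDefinition {
  name: string;
  description: string;
  tools: string[]; // allowlist of tool names the agent may call
  instructions: string; // role, workflow steps, expected output format
}

// Returns only the tools from `available` that the agent's allowlist permits.
function exposedTools(agent: AgentDefinition, available: string[]): string[] {
  return available.filter((tool) => agent.tools.indexOf(tool) !== -1);
}

// Placeholder tool names, invented for this example.
const planner: AgentDefinition = {
  name: "planner",
  description: "Explores the app and writes a test plan",
  tools: ["browser_navigate", "browser_snapshot", "edit_file"],
  instructions: "You are a test planner. Explore the app, then write a Markdown plan.",
};

const everything: string[] = [
  "browser_navigate",
  "browser_snapshot",
  "edit_file",
  "run_sql",
  "delete_files",
];

// The agent sees three tools; the risky ones are filtered out.
const plannerTools = exposedTools(planner, everything);

// A plain instruction file carries the same prompt but no allowlist,
// so the default agent keeps access to everything.
const defaultTools = everything;
```

The prompt text is identical in both cases. The only thing that changes is which tools the model is allowed to call.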
Planner Agent: Good for Brainstorming, Not for Planning
With tools properly configured, I asked the Planner to explore my test application and generate a comprehensive test plan.
It took 14 minutes and 18 seconds.
The result was a 1,500-line Markdown file with test scenarios covering registration flows, article management, feed navigation, and more. Sounds impressive on the surface.
The problem: when I looked at the article creation scenarios, the Planner had generated steps for fields and buttons it never actually navigated to during exploration. The MCP browsed my application, but it didn't visit the article creation flow. So where did it get the details?
My best guess is that the LLM was trained on publicly available information about my application, or it combined historical context from previous interactions. Either way, the agent was writing test cases based on knowledge it shouldn't have had from the exploration alone. That's a red flag. It means the scenarios could be completely disconnected from how the application actually works.
It also missed the most important validation scenario: you cannot create an article with a duplicate title. A bunch of edge cases, but the core business rule? Gone.
When I narrowed the scope to "create a comprehensive test plan for creation and deletion of the article workflow," the results got better. The agent actually navigated to the article creation form, discovered the fields, and produced more meaningful scenarios. But it ran for 15 minutes before I had to stop it, because it got stuck while updating the existing file.
So, where does this leave the Planner? It's useful for brainstorming. If you already have test cases and want to check for missing edge cases, feed your existing tests and the Planner's output into an AI prompt and ask it to find the gaps. Don't rely on it as your primary test planning tool.
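If you want to try that gap-finding workflow, a small helper can assemble the prompt for you. This is only a sketch: the `buildGapPrompt` function and the wording of the prompt are my own assumptions, and you would paste the result into whatever AI chat you use.

```typescript
// Builds a gap-analysis prompt from existing test titles and the Planner's output.
// The prompt wording is an assumption; adjust it to your own project.
function buildGapPrompt(existingTests: string[], plannerPlan: string): string {
  const testList = existingTests
    .map((title, i) => (i + 1) + ". " + title)
    .join("\n");

  return [
    "You are reviewing test coverage.",
    "Existing test cases:",
    testList,
    "",
    "AI-generated test plan:",
    plannerPlan,
    "",
    "List scenarios from the plan that the existing tests do not cover.",
    "Flag any plan step that references UI you cannot confirm exists.",
  ].join("\n");
}

// Example inputs, invented for illustration.
const gapPrompt = buildGapPrompt(
  ["Create article with valid data", "Delete own article"],
  "Scenario: create article with duplicate title"
);
```

The last instruction in the prompt is there because of the hallucinated steps described above: asking the model to flag unconfirmed UI gives you a list of things to verify by hand.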
Generator Agent: Slower Than Codegen
I picked two scenarios from the test plan and asked the Generator to create test scripts.
15 minutes and 40 seconds for two test cases.
The workflow is clever in theory. The Generator opens the browser via Playwright MCP, navigates through the application executing test steps, reads the generator log, and writes the test based on what it observed.
The generated tests had the right structure: login steps, field interactions, button clicks, assertions. But the assertions were rough. One test validated that text was "visible on the page" without specifying where. Another used expect(body).toMatchSnapshot('empty'), which made no sense in context.
Both tests failed when I ran them.
But here's my real problem with the Generator. Playwright Codegen is just faster and produces better results. Playwright MCP only has access to the accessibility tree of your application. It cannot see the full DOM. No data-test-id attributes, no CSS selectors, no XPath. If your application isn't optimized for accessibility, the MCP will generate inaccurate locators.
Why the accessibility tree and not the DOM? Context limitations. The DOM of a modern web application is huge, and feeding all of it into an LLM would overwhelm the context window. Codegen doesn't involve an LLM at all: it runs locally on your machine, so there is no context window to fit into, and it can inspect the entire DOM and produce reliable locators.
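A toy model shows what gets lost. The sketch below is not Playwright's real snapshot format: `DomElement`, `A11yNode`, and the tag-to-role mapping are simplified assumptions, just enough to show that test IDs and CSS classes don't survive the trip to the LLM.

```typescript
// Simplified, hypothetical model of a DOM element.
interface DomElement {
  tag: string;
  attributes: { [name: string]: string };
  text: string;
}

// What an accessibility-style snapshot keeps: a role and an accessible name.
interface A11yNode {
  role: string;
  name: string;
}

// Very rough tag-to-role mapping, for illustration only.
function roleOf(el: DomElement): string {
  const explicitRole = el.attributes["role"];
  if (explicitRole) return explicitRole;
  if (el.tag === "button") return "button";
  if (el.tag === "a") return "link";
  if (el.tag === "input") return "textbox";
  return "generic";
}

// Projects a DOM element down to role + name. Everything else is dropped,
// including data-test-id and class.
function toA11yNode(el: DomElement): A11yNode {
  const name =
    el.attributes["aria-label"] || el.attributes["placeholder"] || el.text;
  return { role: roleOf(el), name: name };
}

// An icon-only clickable <div>: easy to target by test ID, invisible by role.
const inaccessibleButton: DomElement = {
  tag: "div",
  attributes: { "data-test-id": "publish-article", "class": "btn btn-primary" },
  text: "",
};

// A real <button> with visible text.
const accessibleButton: DomElement = {
  tag: "button",
  attributes: { "data-test-id": "publish-article" },
  text: "Publish Article",
};

const poorSnapshot = toA11yNode(inaccessibleButton); // role "generic", empty name
const goodSnapshot = toA11yNode(accessibleButton); // role "button", name "Publish Article"
```

In the first case the model has nothing stable to build a locator from, even though a human using Codegen could grab the test ID straight from the DOM.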
For generating test scripts, use Codegen. It's faster, it produces more stable locators, and it doesn't cost you almost 8 minutes of waiting per test.
Healer Agent: The One I Actually Like
Both tests from the Generator failed. Perfect chance to test the Healer.
It took about 10 minutes to fix both tests.
The Healer ran each test in debug mode using a new MCP tool called playwright_debug. This tool collects console logs, network requests, and page snapshots during execution. That's way more context than the old approach of just reading error messages from the console. Previously, Playwright MCP lacked this tool, and debugging was limited to whatever was printed to the terminal.
What did the Healer fix?
It figured out that article titles need to be unique and added Date.now() to prevent conflicts. It replaced hardcoded values with variables. It found two "Edit Article" buttons on the page and scoped the selector to the first one. It replaced the meaningless snapshot assertion with an actual visibility check.
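The unique-title fix is simple enough to sketch on its own. This is my reconstruction of the idea, not the Healer's exact output: the `uniqueTitle` helper and its optional `now` parameter are assumptions added so the example can be checked deterministically.

```typescript
// Appends a timestamp so repeated runs never reuse an article title.
// `now` defaults to Date.now() and can be overridden to get a predictable value.
function uniqueTitle(base: string, now: number = Date.now()): string {
  return base + " " + now;
}

// Compute the title once and reuse the same variable for both the input
// step and the later assertion, instead of hardcoding the string twice.
const articleTitle = uniqueTitle("My test article");

// With a fixed timestamp the result is predictable.
const fixedTitle = uniqueTitle("My test article", 1700000000000);
```

In a Playwright test you would pass `articleTitle` to the fill step and to the visibility check. For the duplicate "Edit Article" button, Playwright's `locator.first()` narrows a locator to its first match, which is the kind of scoping the Healer applied.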
After the Healer finished, both tests passed. Not perfect code - there were some odd choices, like using a regex instead of a variable for the article title in one spot. But functional tests that actually work.
Debugging is often the most painful part of test automation. You have a failing test, you don't know why, and you spend an hour digging through logs. The Healer automates that process. Even at 10 minutes per run, it can save you real time on tricky failures. Out of all three agents, this is the one I'd actually use.
Which Playwright AI Agent Is Actually Worth Using?
Let me give you the quick summary:
| Agent | Time | Usefulness | Recommendation |
|---|---|---|---|
| Planner | 14+ min | Moderate | Good for brainstorming edge cases, not primary planning |
| Generator | 15+ min | Low | Use Playwright Codegen instead |
| Healer | ~10 min | High | Best of the three, use it for debugging |
I ran everything on Sonnet 4.5. With faster models, you might get quicker results, but no model change will give the Generator access to the DOM. That's a structural limitation, not a speed problem.
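The Generator and Healer times above each cover two tests, so it helps to break them down per test. Here is the arithmetic as a small sketch. The helper names are mine, and the Healer figure uses the rounded "about 10 minutes" from my run.

```typescript
// Converts minutes and seconds to total seconds.
function toSeconds(minutes: number, seconds: number): number {
  return minutes * 60 + seconds;
}

// Formats a duration in seconds as m:ss.
function formatDuration(totalSeconds: number): string {
  const m = Math.floor(totalSeconds / 60);
  const s = Math.round(totalSeconds % 60);
  const padded = s < 10 ? "0" + s : "" + s;
  return m + ":" + padded;
}

const testCount = 2;
const generatorTotal = toSeconds(15, 40); // Generator: both tests
const healerTotal = toSeconds(10, 0); // Healer: both tests, rounded

// Generator alone, per test.
const generatorPerTest = formatDuration(generatorTotal / testCount); // "7:50"

// Generator plus Healer, per passing test.
const pipelinePerTest = formatDuration(
  (generatorTotal + healerTotal) / testCount
); // "12:50"
```

So a test that actually passes cost me close to 13 minutes of agent time once you add the Healer's repair pass to the Generator's first draft.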
Final Thoughts
Playwright AI agents are not magic. They're Markdown files with prompts and limited tool access. That's it.
The Planner helps you brainstorm, the Generator is outperformed by Codegen, and the Healer earns its spot by using the debug tool to actually fix your failing tests. If you try one agent, make it the Healer.
Frequently Asked Questions
How do I install Playwright AI agents?
Update Playwright to version 1.56 or higher, then run npx playwright init-agents in your VS Code terminal. This generates the agent files under .github/chat-modes/.
Why are Playwright agent tools showing as "unknown" in VS Code?
The tools don't connect to VS Code Copilot automatically. Click the tools icon in the Copilot chat panel and manually enable the Edit tool, Search tool, and Playwright Test MCP server for each agent.
Can Playwright agents replace manual test writing?
Not yet. The Generator produces tests with rough assertions and can't access the full DOM for locators. Stick with Playwright Codegen for writing tests and the Healer for fixing them. Human review is still a must.
Why does the Playwright Generator agent produce inaccurate locators?
The Generator uses Playwright MCP, which only accesses the accessibility tree, not the full DOM. It can't see data-test-id attributes, CSS selectors, or other DOM-specific identifiers. If your application isn't optimized for accessibility, the locators will be less reliable.
Is the Playwright Healer agent good for fixing flaky tests?
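It can help, with one caveat: I tested it on tests that failed consistently, not intermittently. It runs tests in debug mode using playwright_debug, which collects console logs, network requests, and page snapshots. Way more context than just error messages. In my testing it found and fixed issues like a locator that matched two buttons and an article title that needed a unique, dynamic value.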
