I Let AI Write My Test Suite — Here's What It Got Right and Wrong
I gave Claude a feature spec for a login page — OAuth redirect, form validation, error states — and asked it to generate a full Playwright test suite. Twenty minutes later, I had 14 tests, a page object, and a surprisingly clean file structure. Nine of those tests had bugs that would have made it to CI if I hadn’t reviewed them.
That’s the reality of AI test generation in 2025. It’s not useless. It’s not magic. And if you’re an SDET being told by your manager to “just use AI for the tests,” you need to understand where it helps and where it’ll cost you more time than it saves.
Why This Matters Right Now
The pressure to adopt AI in testing is coming from everywhere. Engineering managers watch a 90-second demo where Copilot generates a test file and think, “Why do we need three SDETs?” LinkedIn is flooded with posts about how AI will replace manual testers by next quarter. And sure, some of that is real progress — but most of it skips the part where a senior engineer spends an hour fixing what the AI got wrong.
Here’s the thing nobody talks about in those demos: the demo app is always a clean TODO list or a simple form. No OAuth redirects. No dynamic IDs generated by a backend framework. No race conditions from WebSocket connections. No third-party iframes from compliance vendors.
I wanted to test what happens when you throw enterprise complexity at AI test generation. So I did.
The Experiment
I took a real feature from a project: a login flow with these characteristics:
- Email/password form with client-side validation
- OAuth redirect through a third-party identity provider
- Session token stored in HttpOnly cookies
- Error states for invalid credentials, expired sessions, and rate limiting
- A “remember me” checkbox that extends token expiry
I gave Claude the following prompt along with the component code:
“Write a complete Playwright test suite for this login page. Include a page object model, cover happy paths and edge cases, and use data-testid attributes for selectors. Include tests for OAuth flow, form validation, error handling, and session management.”

Here’s what I got back.
What AI Got Right
I’ll give credit where it’s due — the structural output was genuinely impressive.
Test scaffolding and file organization
The AI generated a page object with a clean constructor, properly typed locators, and reusable methods. This is boilerplate I write on every project, and it was correct on the first try.
import { Page, Locator } from '@playwright/test';

export class LoginPage {
  readonly page: Page;
  readonly emailInput: Locator;
  readonly passwordInput: Locator;
  readonly submitButton: Locator;

  constructor(page: Page) {
    this.page = page;
    this.emailInput = page.getByTestId('email-input');
    this.passwordInput = page.getByTestId('password-input');
    this.submitButton = page.getByRole('button', { name: 'Sign In' });
  }

  async login(email: string, password: string) {
    await this.emailInput.fill(email);
    await this.passwordInput.fill(password);
    await this.submitButton.click();
  }
}

Locators use getByTestId and getByRole — exactly what I’d write. Every action is properly await-ed, which matters more than most people realize. If you’ve dealt with the async trap that causes flaky Playwright tests, you know how much damage a missing await can do.
Assertion variety
The AI generated assertions I wouldn’t have thought to include on a first pass: checking that the password field masks input, verifying the submit button is disabled until both fields are filled, confirming the OAuth redirect URL contains the expected state parameter. These aren’t revolutionary, but they’re the kind of tests you add in sprint 3, not sprint 1. Having them from the start is a genuine time-saver.
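As a sketch of what two of those look like in Playwright (the `/login` route and testids are from this post's examples; adjust to your app):

```typescript
import { test, expect } from '@playwright/test';

test('password field masks input', async ({ page }) => {
  await page.goto('/login');
  // type="password" is what actually produces the masking in the browser
  await expect(page.getByTestId('password-input')).toHaveAttribute('type', 'password');
});

test('submit stays disabled until both fields are filled', async ({ page }) => {
  await page.goto('/login');
  const submit = page.getByRole('button', { name: 'Sign In' });
  await expect(submit).toBeDisabled();
  await page.getByTestId('email-input').fill('user@example.com');
  // One field filled is not enough
  await expect(submit).toBeDisabled();
  await page.getByTestId('password-input').fill('correct-horse');
  await expect(submit).toBeEnabled();
});
```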
Test naming
Every test name read as a sentence describing behavior, not implementation:
test('displays validation error when email format is invalid', ...);
test('redirects to dashboard after successful login', ...);
test('shows rate limit message after 5 failed attempts', ...);
test('preserves return URL through OAuth redirect flow', ...);

I didn’t need to rename a single one. That’s rare — even experienced SDETs tend to write names that describe what the test does rather than what it verifies.
What AI Got Dangerously Wrong
Here’s where it gets interesting. Five of the nine broken tests would have passed locally and failed in CI. The other four would have failed immediately but for subtle reasons.
Problem 1: Fragile selectors in generated assertions
Despite using getByTestId in the page object, the AI switched to fragile selectors inside individual tests when reaching for elements it hadn’t put in the page object:
// AI generated this — looks reasonable, breaks on any CSS change
test('shows validation errors', async ({ page }) => {
  const loginPage = new LoginPage(page);
  await loginPage.submitButton.click();

  // These selectors are a maintenance nightmare
  const emailError = page.locator('.form-group:first-child .error-text');
  const passwordError = page.locator('.form-group:nth-child(2) .error-text');
  await expect(emailError).toHaveText('Email is required');
  await expect(passwordError).toHaveText('Password is required');
});

The fix is straightforward — use getByRole or add test IDs:
test('shows validation errors', async ({ page }) => {
  const loginPage = new LoginPage(page);
  await loginPage.submitButton.click();

  await expect(page.getByTestId('email-error')).toHaveText('Email is required');
  await expect(page.getByTestId('password-error')).toHaveText('Password is required');
});

If you’re building a locator strategy from scratch, text-based locators that match how users actually interact with your app will save you from most of these issues.
Problem 2: Happy-path bias
Out of 14 generated tests, 10 covered happy paths or simple validation. Only 4 attempted edge cases — and all 4 were wrong.
The AI generated a test for “session expiry” that navigated to the login page, filled in credentials, and then… checked that the session token existed. It didn’t actually test what happens when a session expires mid-flow. It didn’t simulate a token becoming invalid between page loads. It tested the presence of a session, not the absence of one.
The edge cases an enterprise login flow actually needs:
- What happens when the OAuth provider returns a 503?
- What happens when the user’s session token expires between the redirect and the callback?
- What happens when two tabs are open and one logs out?
- What happens when the “remember me” token is present but the account has been deactivated?
AI can’t generate these because they require understanding the system’s failure modes — not just its interface. You learn these from production incidents, not from reading component code.
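As one example, the two-tabs scenario is straightforward to express in Playwright, because pages created from the same browser context share session cookies. A minimal sketch — the `/dashboard` route, the logout button label, and the redirect behavior are all assumptions about the app:

```typescript
import { test, expect } from '@playwright/test';

test('logging out in one tab invalidates the session in another', async ({ context }) => {
  // Two tabs in the same context: same cookies, same session
  const tabA = await context.newPage();
  const tabB = await context.newPage();
  await tabA.goto('/dashboard');
  await tabB.goto('/dashboard');

  // Log out in tab A only
  await tabA.getByRole('button', { name: 'Log out' }).click();

  // Tab B should be bounced to login on its next navigation,
  // because the shared session token is now invalid
  await tabB.reload();
  await expect(tabB).toHaveURL(/login/);
});
```

The interesting part is what you choose to assert: not that tab A logged out, but that tab B noticed.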
Problem 3: Assertions that pass but verify nothing
This is the most dangerous category. The test passes, the CI is green, and nobody notices that the test doesn’t actually prove anything:
// AI generated — passes but proves nothing useful
test('handles OAuth redirect', async ({ page }) => {
  const loginPage = new LoginPage(page);
  await page.getByRole('button', { name: 'Sign in with Google' }).click();

  // This just checks we left the page — not WHERE we went
  await expect(page).not.toHaveURL(/login/);
});

“Not on the login page anymore” is not a meaningful assertion. After clicking the OAuth button, you need to verify the redirect URL contains the correct client_id, redirect_uri, and state parameters. Otherwise, you’re testing that a button click does something, which is barely better than no test at all.
// What this test should actually verify
test('handles OAuth redirect', async ({ page }) => {
  const loginPage = new LoginPage(page);
  await page.getByRole('button', { name: 'Sign in with Google' }).click();

  await expect(page).toHaveURL(/accounts\.google\.com/);
  const url = new URL(page.url());
  expect(url.searchParams.get('client_id')).toBeTruthy();
  expect(url.searchParams.get('redirect_uri')).toContain('/auth/callback');
  expect(url.searchParams.get('state')).toBeTruthy();
});

Problem 4: No awareness of test isolation
The AI generated tests that shared implicit state. Test 7 created a user account, and test 11 assumed that account existed. Run them in order — green. Run test 11 in isolation — red. Run them in parallel — unpredictable.
This isn’t a nitpick. Test isolation is fundamental to running tests in parallel at scale. AI doesn’t think about execution order because it generates tests as a flat list, not as independent units that might run on different CI workers.
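The fix that has worked for me is making every test provision its own data. A minimal sketch — the `/api/test/users` seeding endpoint below is hypothetical; swap in whatever fixture or API your app provides:

```typescript
// Pure helper: unique credentials per call, safe across parallel CI workers
function uniqueUser(prefix = 'e2e') {
  const id = Date.now().toString(36) + Math.random().toString(36).slice(2, 10);
  return { email: `${prefix}-${id}@example.com`, password: `Pw!${id}x` };
}

// Usage in a test (sketch):
//
// test('logs in with a freshly created account', async ({ page, request }) => {
//   const user = uniqueUser();
//   await request.post('/api/test/users', { data: user }); // hypothetical seed endpoint
//   await page.goto('/login');
//   await new LoginPage(page).login(user.email, user.password);
//   await expect(page).toHaveURL(/dashboard/);
// });
```

Because every test creates what it needs, order and parallelism stop mattering: test 11 no longer cares whether test 7 ran first.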
The Workflow That Actually Works
After running this experiment across several features — not just the login page — I settled on a workflow where AI handles what it’s good at and I handle what it can’t:
Use AI for scaffolding
Let AI generate the page object structure, test file skeleton, and basic happy-path tests. This saves 20-30 minutes of boilerplate per feature. Accept the structure, but review every selector.
Write critical assertions yourself
Edge cases, failure modes, security-sensitive flows — write these by hand. AI doesn’t know your system’s failure history, and it can’t simulate the production incident that taught you to always check the state parameter on OAuth callbacks.
Use AI to brainstorm what you missed
After writing your tests, paste them back to the AI and ask: “What edge cases am I not covering?” This is where AI shines — it’s excellent at generating a list of scenarios. About 30% of its suggestions will be relevant. That 30% includes tests you genuinely would have missed.
Never trust AI-generated selectors
Even if the AI used getByTestId in the page object, audit every selector in the test body. Open your browser DevTools, inspect the actual DOM, and verify the locator resolves to exactly one element. This takes 2 minutes per test and prevents hours of debugging later.
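You can also automate part of that audit. A sketch, assuming Playwright and the testids used earlier in this post (error elements may need their error state triggered before they exist in the DOM):

```typescript
import { test, expect } from '@playwright/test';

test('audit: login selectors resolve to exactly one element', async ({ page }) => {
  await page.goto('/login');
  for (const id of ['email-input', 'password-input']) {
    // toHaveCount(1) fails on both zero matches and duplicates
    await expect(page.getByTestId(id)).toHaveCount(1);
  }
});
```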
Treat AI output as a first draft
The mental model that works: AI generates a first draft. You’re the editor. You wouldn’t publish a first draft of an article — don’t merge a first draft of a test. The editing is where your 10 years of experience add value.
The Bottom Line
AI test generation is a productivity multiplier for the parts of testing that are tedious and repetitive — boilerplate, scaffolding, basic coverage. It’s not a replacement for the judgment that comes from years of debugging flaky CI pipelines at 2 AM, investigating race conditions in OAuth flows, or knowing that a “green” test suite can give false confidence if the assertions don’t verify the right things.
If your manager asks why you can’t “just use AI for the tests,” show them this post. The AI gave me 14 tests in 20 minutes. Fixing 9 of them took another hour. Writing the 6 edge-case tests it couldn’t generate took another 45 minutes. The total time was about the same as writing the suite from scratch — but the starting structure was better.
The ROI is real, but it’s in scaffolding speed and coverage brainstorming, not in replacing SDET judgment. Use it like a junior dev who’s fast at typing and terrible at knowing what makes a good test.
Try it yourself: take your next feature spec, feed it to Claude or Copilot, and compare the output to what you’d write. Count the assertions that would pass in CI but prove nothing. That count is the gap between AI-generated tests and production-grade tests — and it’s where your experience earns its keep.
FAQ
Can AI fully replace SDETs for test automation?
No. AI handles structural boilerplate and basic coverage well, but it can’t generate meaningful edge-case tests, understand system failure modes, or ensure test isolation for parallel execution. The SDET’s role shifts from writing boilerplate to reviewing AI output and adding the judgment layer.
Which AI tools are best for generating test code?
Claude and GitHub Copilot both produce usable Playwright/TypeScript test code. Claude is better for larger context windows — you can feed it full component code and specs. Copilot is better for inline autocomplete while you’re already writing tests. Use both for different parts of the workflow.
How do I convince my team to adopt AI for testing without losing quality?
Start with the scaffolding workflow: AI generates page objects and happy-path tests, humans review and add edge cases. Track two metrics — time to first test coverage (should decrease) and flaky test rate (should not increase). If flake rate goes up, you’re trusting AI output too much.
Related Posts
I Migrated 3 Teams Off Cypress — Here's When It's Still the Right Choice
An honest take on Cypress vs Playwright migrations from an SDET who's done three — including when migrating is the wrong call.
XPath text() vs Dot — Why Your Text Match Fails
The real difference between XPath text(), dot, contains(), and normalize-space() for test automation — with examples that explain the flaky failures.
Why Your Playwright Tests Are Flaky — The Async Trap Every SDET Falls Into
The 3 async mistakes that cause flaky Playwright tests after a Selenium migration — and how we fixed a 23% intermittent failure rate.