The Flaky Test Isn't Flaky — It's a Race Condition
At a large Canadian telecom, we had a Playwright suite — 260 tests, 4 parallel workers — that failed 8-12 times per run. Always different tests. Always passing on retry. The team had already added `retries: 2` to the Playwright config and moved on. For six months, nobody questioned it. The retries were masking a real bug that customers were hitting in production.
The @Retry Anti-Pattern
Here’s the thing about retries: they don’t fix anything. They hide symptoms. And in test automation, hidden symptoms compound.
Every retry is a question you’re choosing not to ask. “Why did this fail?” becomes “did it pass the second time?” and that second question costs you nothing today and everything over time.
I’ve seen teams carry 15-20% retry rates for months. The math is brutal. If you run 300 tests 3 times a day and 15% need a retry, that’s 135 extra test executions daily. At 30 seconds each, you’re burning over an hour of CI time every day on tests that “pass.” Multiply that by a year — that’s 365+ hours of compute time spent re-asking questions you already got honest answers to.
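That arithmetic is worth sanity-checking. Here's a quick sketch using the figures above (the suite size, run frequency, and retry rate are this team's numbers, not universal constants):

```typescript
// Back-of-the-envelope cost of a 15% retry rate, using the figures above.
const testsPerRun = 300;
const runsPerDay = 3;
const retryRate = 0.15;       // fraction of executions that need one retry
const secondsPerTest = 30;

const executionsPerDay = testsPerRun * runsPerDay;           // 900
const extraExecutionsPerDay = executionsPerDay * retryRate;  // 135
const extraMinutesPerDay =
  (extraExecutionsPerDay * secondsPerTest) / 60;             // 67.5 min of CI per day
const extraHoursPerYear = (extraMinutesPerDay * 365) / 60;   // ~410 hours per year

console.log(extraExecutionsPerDay, extraMinutesPerDay, Math.round(extraHoursPerYear));
```

The "over an hour a day" and "365+ hours a year" claims both fall straight out of the multiplication.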
But the CI time isn’t even the real cost. The real cost is trust erosion. Once a team accepts that “some tests are just flaky,” every legitimate failure gets the benefit of the doubt. I’ve watched teams retry a genuine regression four times before someone actually read the error message. That’s the compound interest on technical debt.
What a Race Condition Actually Looks Like
A race condition is two operations fighting over the same resource with no coordination on who goes first. In application code, it’s two threads writing to the same variable. In test automation, it’s usually subtler — two tests sharing state they shouldn’t be sharing.
The symptoms are always the same:
- Passes in isolation, fails in parallel
- Failures are non-deterministic — different tests, different runs
- Error messages don’t match the test logic (wrong user, wrong data, unexpected state)
- Failure rate increases when you add more parallel workers
If that list describes your suite, you don’t have flaky tests. You have shared state leaking between parallel executions.
The War Story: A “Flaky” Login Test
Our failing tests had no pattern — sometimes a checkout test, sometimes a profile update, sometimes a simple dashboard load. The only commonality was that they all involved authenticated flows. The team had accepted it as Playwright being “flaky with auth.” Playwright wasn’t the problem.
The first clue was in the CI history — four consecutive runs, each failing on different authenticated tests:
```
Run #1204  Mar 02  258 passed  2 failed
  ✗ account-management > update billing address
  ✗ dashboard > load account summary widget
Run #1205  Mar 03  256 passed  4 failed
  ✗ profile > change notification preferences
  ✗ checkout > apply promo code to subscription
  ✗ billing > download invoice PDF
  ✗ dashboard > verify usage chart renders
Run #1206  Mar 04  259 passed  1 failed
  ✗ account-management > cancel add-on service
Run #1207  Mar 05  257 passed  3 failed
  ✗ billing > update payment method
  ✗ profile > upload avatar image
  ✗ checkout > upgrade plan tier
```

The pattern is invisible if you look at any single run. Line up four runs and it jumps out: every failure is an authenticated flow, no two runs fail on the same test, and every single one passes on retry. That's not flakiness — that's contention.
I pulled up the Playwright Trace Viewer on three consecutive failures to confirm. The network timelines told the whole story:
- Test A (Worker 1) logs in as `testuser@corp.com` at `T+0ms`
- Test B (Worker 3) logs in as `testuser@corp.com` at `T+200ms`
- Test A's session token gets invalidated when Test B authenticates
- Test A tries to load the dashboard at `T+500ms` — gets a 401, redirected to login
The authentication service enforced single-session. When Test B logged in with the same credentials, it killed Test A’s session. Both tests were correct. The infrastructure was correct. The problem was that all 260 tests shared a single test user account.
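The single-session behavior is easy to reproduce in miniature. This is a toy sketch, not the actual auth service (`SingleSessionAuth` is a hypothetical stand-in): a session store that keeps exactly one active token per account, the same policy that killed Test A's session:

```typescript
// Toy single-session auth: one active token per account.
// A second login with the same credentials silently invalidates the first.
class SingleSessionAuth {
  private active = new Map<string, string>(); // email -> current token

  login(email: string): string {
    const token = `tok-${Math.random().toString(36).slice(2)}`;
    this.active.set(email, token); // evicts any previous token for this user
    return token;
  }

  isValid(email: string, token: string): boolean {
    return this.active.get(email) === token;
  }
}

const auth = new SingleSessionAuth();
const workerA = auth.login('testuser@corp.com'); // T+0ms
const workerB = auth.login('testuser@corp.com'); // T+200ms, same account
console.log(auth.isValid('testuser@corp.com', workerA)); // false
console.log(auth.isValid('testuser@corp.com', workerB)); // true
```

Worker A's token comes back invalid the moment Worker B logs in: that invalid token is the 401 in the trace.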
```
Worker 1: login(testuser) -----> 200 OK -----> GET /dashboard -----> 401 Unauthorized
Worker 3:       login(testuser) -----> 200 OK -----> GET /dashboard -----> 200 OK
                      ^ Session invalidated here
```

Why Shared Test Users Break Everything
This anti-pattern is everywhere. A team creates one test account, hardcodes the credentials, and it works fine in sequential execution. The moment you add parallel workers, you’ve introduced a race condition into your test infrastructure.
The failures aren’t limited to session conflicts. Shared test users cause at least three categories of non-deterministic failure:
1. Session/auth token invalidation. The one we hit. Most auth systems enforce single-session or rotate tokens on new login. Two workers logging in with the same credentials means one always loses.
2. Data contention. Test A creates an order for testuser. Test B queries orders for testuser and finds unexpected data. Test A deletes the order. Test B tries to verify the order and it’s gone.
3. State pollution. Test A changes the user’s profile to “Ontario.” Test B expects the default “British Columbia.” Both tests are correct in isolation, both fail unpredictably in parallel.
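Category 3 is the easiest to demonstrate. A hypothetical sketch: two "tests" that are each correct alone but can't both be right when they share one profile record:

```typescript
// Hypothetical shared-profile store: both "tests" operate on the same record.
const profile = { province: 'British Columbia' }; // seeded default

// Test A: updates the shared user's province, then verifies the update.
function testA(): boolean {
  profile.province = 'Ontario';
  return profile.province === 'Ontario';
}

// Test B: assumes the seeded default is still in place.
function testB(): boolean {
  return profile.province === 'British Columbia'; // depends on ordering!
}

console.log(testB()); // true  — but only because B ran before A
console.log(testA()); // true
console.log(testB()); // false — A polluted the state B relies on
```

Run them in the other order, or on two workers at once, and Test B fails with an error message that has nothing to do with Test B's logic. That's exactly the "error messages don't match the test" symptom from earlier.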
The Fix: Test User Isolation
The principle is simple — every parallel worker gets its own isolated test user. The implementation depends on your constraints.
Option 1: Pre-Created User Pool
Create a pool of test users ahead of time and assign one per worker. This is the fastest path if your user provisioning is complex or slow.
```java
public class TestUserPool {
    // One user per parallel worker — never shared
    private static final String[] USERS = {
        "testuser-w0@corp.com",
        "testuser-w1@corp.com",
        "testuser-w2@corp.com",
        "testuser-w3@corp.com"
    };

    public static String forWorker(int workerIndex) {
        return USERS[workerIndex % USERS.length];
    }
}
```

For Playwright specifically, you can use the built-in `workerIndex`:
```typescript
import { test as base } from '@playwright/test';

export const test = base.extend<{ testUser: string }>({
  testUser: async ({}, use, workerInfo) => {
    const email = `testuser-w${workerInfo.workerIndex}@corp.com`;
    await use(email);
  },
});
```

Option 2: Dynamic User Provisioning
If your system supports it, create a fresh user per test or per worker via API. More overhead, but zero chance of collision.
```java
public class TestUserFactory {
    public static TestUser create() {
        // Unique per invocation — UUID eliminates collisions
        String email = "auto-" + UUID.randomUUID() + "@test.corp.com";
        return userApi.createUser(email, DEFAULT_PASSWORD);
    }
}
```

Option 3: Worker-Scoped Setup (Best of Both Worlds)
Create the user once per worker, reuse it across all tests on that worker, and tear it down after. This is what we ended up using — it balances isolation with performance.
```typescript
export const test = base.extend<{}, { workerUser: TestUser }>({
  workerUser: [async ({}, use, workerInfo) => {
    // Created once per worker, torn down after all its tests
    const user = await api.createUser({
      email: `worker-${workerInfo.workerIndex}-${Date.now()}@test.corp.com`,
      password: 'Test1234!',
    });
    await use(user);
    await api.deleteUser(user.id);
  }, { scope: 'worker' }],
});
```

CI Retry Rate
Before: 15% of tests needed retries
After: 0.4% retry rate (actual infra flakes)
The Organizational Argument
The technical fix took two days. Convincing the team to prioritize it took two weeks. Here’s the argument that worked:
Every retry is an admission that your test suite is giving you unreliable answers. If QA can’t trust the suite, neither can anyone gating releases on it. That means manual verification creeps back in “just to be safe.” A test that lies about its confidence level is worse than no test — and a test that sometimes fails for infrastructure reasons is a test nobody trusts even when it fails for real reasons.
Our team was spending roughly 3 hours per week investigating “flaky” failures that turned out to be retry-masked race conditions. After the fix, that dropped to near zero. But the bigger win was cultural — the team stopped treating test failures as noise and started treating them as signal again.
Reframe: Signal, Not Noise
The next time a test fails intermittently, resist the urge to add a retry. Instead, ask three questions:
- Does it pass in isolation but fail in parallel? You have shared state. Check for thread safety violations — shared drivers, shared users, shared data.
- Does the error reference data from a different test? Two workers are contending over the same resource.
- Does the failure rate scale with parallelism? More workers = more failures = shared state, guaranteed.
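If you're on Playwright, the first and third questions can be answered straight from the CLI (the spec path below is a placeholder for your suspect test; `--workers`, `--repeat-each`, and `--retries` are standard Playwright test flags):

```shell
# Question 1: does it pass in isolation? Hammer the suspect spec serially,
# with retries disabled so nothing gets masked.
npx playwright test tests/billing.spec.ts --workers=1 --repeat-each=20 --retries=0

# Question 3: does the failure rate scale with parallelism? Compare worker counts.
npx playwright test --workers=2 --retries=0
npx playwright test --workers=8 --retries=0
```

Rock-solid at `--workers=1` and crumbling at `--workers=8` is shared state, not flakiness.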
Your flaky test is trying to tell you something. The retry is just making sure you never hear it.