Thread Safety in Parallel Test Execution: The Bug That Took Us 3 Days to Find
We had 800 tests running in parallel across 4 threads on a large retail platform. Once a week — never more, never less — a handful of tests would fail with assertion data that didn’t match any test case. A test verifying a product name would assert on a completely different product. A login test would screenshot a dashboard that belonged to a different test’s user. Every failing test passed when rerun individually. It took us three days to find the root cause: our WebDriver instances were leaking between threads.
The Symptoms That Should Scare You
- Tests pass solo but fail in parallel. This is the number one signal. If a test passes with `--threads 1` but fails with `--threads 4`, the test logic isn’t the problem — shared state is.
- Assertion values from a different test. You’re asserting on “Wireless Headphones” but the actual value is “Running Shoes.” That data belongs to a test running on another thread.
- Screenshots show the wrong page. Your test failed on the checkout page, but the failure screenshot shows a product listing page. The WebDriver instance was shared, and another thread navigated away.
- Failures cluster around test count thresholds. We noticed failures only happened when the suite had 700+ tests. With fewer tests, the thread contention was too brief to cause visible issues. This is why thread safety bugs slip through for months before they’re caught.
The 3 Things That Must Be Thread-Local
In any parallel test execution framework — TestNG, JUnit 5, or Playwright’s built-in parallelism — three categories of state absolutely cannot be shared across threads.
1. The Browser/WebDriver Instance
This is the most common violation and the one that bit us. If two threads share a WebDriver instance, one thread’s navigate() call affects what the other thread sees.
```java
// BAD — shared static field, all threads use the same driver
public class DriverFactory {
    private static WebDriver driver; // Every thread reads and writes this

    public static WebDriver getDriver() {
        if (driver == null) {
            driver = new ChromeDriver();
        }
        return driver;
    }
}
```

The fix is `ThreadLocal`, which gives each thread its own isolated instance:
```java
// GOOD — each thread gets its own driver instance
public class DriverFactory {
    private static final ThreadLocal<WebDriver> driverThread = new ThreadLocal<>();

    public static WebDriver getDriver() {
        if (driverThread.get() == null) {
            driverThread.set(new ChromeDriver());
        }
        return driverThread.get();
    }

    // CRITICAL: clean up after each test to prevent memory leaks
    public static void quitDriver() {
        WebDriver driver = driverThread.get();
        if (driver != null) {
            driver.quit();
            driverThread.remove();
        }
    }
}
```

2. Test Data
If your tests create data during execution — a user account, an order, a temporary file — that data must be scoped to the thread or the test. Two tests creating a user with the same email on different threads will collide.
```java
// GOOD — unique test data per thread using thread ID
public class TestDataFactory {
    public static String uniqueEmail() {
        return "test-" + Thread.currentThread().getId() + "-" + System.currentTimeMillis() + "@example.com";
    }

    public static String uniqueUsername() {
        return "user-" + Thread.currentThread().getId() + "-" + System.currentTimeMillis();
    }
}
```

On the retail platform, our thread safety bug was compounded by test data collision. Two threads creating a cart for “testuser@example.com” meant one thread’s cart got the other thread’s products. Making emails unique per thread eliminated an entire class of phantom failures.
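A quick way to sanity-check this kind of factory is to hammer it from several threads and confirm nothing collides. The sketch below is mine, not the original factory: it swaps the timestamp for a shared `AtomicLong` counter, because two calls on the *same* thread within the same millisecond would otherwise produce identical emails.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Standalone check that generated emails never collide across threads.
public class UniqueDataCheck {
    // A shared counter is a safer tiebreaker than System.currentTimeMillis():
    // it stays unique even for rapid calls on one thread.
    private static final AtomicLong counter = new AtomicLong();

    static String uniqueEmail() {
        return "test-" + Thread.currentThread().getId() + "-"
                + counter.incrementAndGet() + "@example.com";
    }

    public static void main(String[] args) throws InterruptedException {
        Set<String> seen = ConcurrentHashMap.newKeySet();
        Runnable task = () -> {
            for (int i = 0; i < 1000; i++) {
                if (!seen.add(uniqueEmail())) {
                    throw new AssertionError("duplicate email generated");
                }
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("generated " + seen.size() + " unique emails");
    }
}
```

Run it and every one of the 2,000 generated emails should be distinct; a single `AssertionError` means your uniqueness scheme has a hole.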
3. Reporting Context
If you’re using Extent Reports or a similar thread-aware reporting library, the test context must be thread-local. We covered this in detail in our squad tagging implementation for Extent Reports, where ThreadLocal<ExtentTest> ensures each thread’s results are attributed correctly.
```java
public class ReportManager {
    private static final ThreadLocal<ExtentTest> testThread = new ThreadLocal<>();

    public static void startTest(String name) {
        ExtentTest test = extent.createTest(name);
        testThread.set(test);
    }

    public static ExtentTest getTest() {
        return testThread.get();
    }
}
```

Without this, test logs from thread 1 bleed into thread 2’s report entry. The result is a report where the failure logs don’t match the test that actually failed — which sends your team on a debugging detour.
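The attribution guarantee is easy to see without ExtentReports at all. In this minimal sketch, `FakeReport` is a hypothetical stand-in for `ExtentTest` — just a named log buffer — and each thread logs only into its own context:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ExtentTest: a per-test log buffer.
class FakeReport {
    final String testName;
    final List<String> logs = new ArrayList<>();
    FakeReport(String testName) { this.testName = testName; }
}

public class ReportContextDemo {
    private static final ThreadLocal<FakeReport> current = new ThreadLocal<>();

    static void startTest(String name) { current.set(new FakeReport(name)); }
    static void log(String msg) { current.get().logs.add(msg); }

    public static void main(String[] args) throws InterruptedException {
        Runnable worker = () -> {
            String name = Thread.currentThread().getName();
            startTest(name);
            log("step from " + name);
            FakeReport r = current.get();
            // Each thread's logs stay attributed to its own test
            if (!r.logs.get(0).equals("step from " + r.testName)) {
                throw new AssertionError("log bled across threads");
            }
            System.out.println(r.testName + ": " + r.logs.get(0));
            current.remove(); // avoid leaking report objects on pooled threads
        };
        Thread a = new Thread(worker, "LoginTest");
        Thread b = new Thread(worker, "CheckoutTest");
        a.start(); b.start();
        a.join(); b.join();
    }
}
```

The `current.remove()` at the end matters in real frameworks: thread pools reuse threads, so a stale context left behind becomes the next test’s context.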
TestNG’s Parallel Modes and What Each Means for Shared State
TestNG offers three parallel modes, and each one changes what’s safe to share:
| Mode | What Runs in Parallel | Safe to Share Across Tests? |
|---|---|---|
| `methods` | Individual test methods | Nothing — each method may run on any thread |
| `classes` | Test classes | Instance fields are safe within a class, not across classes |
| `tests` | `<test>` blocks from testng.xml | Tests within a block are sequential; across blocks, nothing is safe |
```xml
<suite name="Regression" parallel="methods" thread-count="4">
    <!-- parallel="methods" is the most aggressive — requires full ThreadLocal discipline -->
    <test name="AllTests">
        <classes>
            <class name="tests.LoginTests"/>
            <class name="tests.CheckoutTests"/>
            <class name="tests.SearchTests"/>
        </classes>
    </test>
</suite>
```

Most teams I work with use `parallel="methods"` because it gives the best speed improvement. But it’s also the mode that’s least forgiving of shared state. If you’re getting intermittent failures with `methods`, try switching to `classes` temporarily. If failures disappear, your problem is shared state between methods — and you need more ThreadLocal.
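For teams on JUnit 5 rather than TestNG, the equivalent switches live in a `junit-platform.properties` file on the test classpath. A minimal sketch, with the parallelism value purely illustrative:

```properties
# junit-platform.properties — enable parallel execution in JUnit 5
junit.jupiter.execution.parallel.enabled = true
# "concurrent" at the default (method) level is roughly TestNG's parallel="methods"
junit.jupiter.execution.parallel.mode.default = concurrent
junit.jupiter.execution.parallel.config.strategy = fixed
junit.jupiter.execution.parallel.config.fixed.parallelism = 4
```

The same ThreadLocal discipline applies: method-level concurrency means any two methods can share a thread pool worker, so nothing static is safe.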
How to Verify Your Suite Is Actually Thread-Safe
Passing once in parallel doesn’t prove thread safety. Thread safety bugs are probabilistic — they depend on timing, CPU load, and which tests happen to run on the same thread. Here’s how to actually verify:
```bash
# Run the suite 10 times in parallel — any failure in any run means a thread safety issue
for i in {1..10}; do
    echo "Run $i of 10"
    mvn test -Dsurefire.parallel=methods -Dsurefire.threadCount=4
    if [ $? -ne 0 ]; then
        echo "FAILED on run $i — thread safety issue detected"
        exit 1
    fi
done
echo "All 10 runs passed — suite is likely thread-safe"
```

On the retail project, our suite consistently passed any single parallel run. When we ran it 10 times, it failed on runs 3, 7, and 9. That’s a thread safety bug. If your suite passes 10 consecutive parallel runs, you can be reasonably confident it’s thread-safe — though “reasonably” is doing heavy lifting. We run 20 iterations before major releases.
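If you would rather pin the parallel settings in the build than pass them as flags every run, the Maven Surefire plugin accepts the same knobs in the POM. A minimal fragment — the version number here is illustrative, use whatever your project pins:

```xml
<!-- pom.xml fragment: default parallel settings for the test phase -->
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <version>3.2.5</version>
    <configuration>
        <parallel>methods</parallel>
        <threadCount>4</threadCount>
    </configuration>
</plugin>
```

Keeping the configuration in the build file means CI and every developer machine exercise the same concurrency, so thread safety bugs surface everywhere instead of only on the one box that runs parallel.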
Parallel Test Stability
- Before: random failures at 8+ threads
- After: 10/10 consecutive runs passing
What I’d Do Differently
If I were setting up parallel execution from scratch, I’d make ThreadLocal the default from day one — not something we retrofit after finding bugs. Every base class, every factory, every shared utility would use ThreadLocal storage. The overhead is negligible. The debugging time saved is enormous.
I’d also add a CI gate that runs the suite 5 times in parallel on every PR. Finding thread safety issues before merge is infinitely cheaper than finding them in the nightly run when nobody remembers what they changed.
Understanding coupling and why loose coupling matters helps here too — tightly coupled test infrastructure (shared drivers, shared data, shared state) is exactly what makes parallel execution fragile.
Your Next Step
Check your WebDriver factory. Is the driver stored in a static field or a ThreadLocal field? If it’s static, add ThreadLocal this week — even if you’re not running tests in parallel yet. When you eventually turn on parallel execution (and you will, because a 45-minute suite demands it), you’ll thank yourself for having the foundation already in place.
If your suite already runs in parallel, run it 10 times in a row. If even one run fails, you have a thread safety bug. The fix is almost always ThreadLocal. The hard part is finding which shared state is leaking — and now you know the three places to look first.