foojay – a place for friends of OpenJDK

Eliminating Flaky Tests to End World Hunger

François Martin — Thu, 23 Apr 2026 09:02:34 +0000

Table of Contents

Why Do Flaky Tests Matter?
Common Causes of Flaky Tests
Strategies to Keep Tests Reliable

1. Awareness of Flaky Tests
2. Fix One Flaky Test Each Sprint
3. Use New Test Data
4. Wait for Conditions to Be Met
5. Run Tests in Parallel
6. Temporarily Quarantine Flaky Tests
7. Split Up End-to-End and Integration Tests

Building a Reliable Test Suite: A Cultural Shift
Conclusion

Rita Mae Brown once said:

Insanity is doing the same thing over and over again and expecting different results.

However, everyone has at least once experienced a test that failed and then passed in the next run without changing the code or the environment. In software engineering, we don't call it insanity; we call such unpredictable failures flaky tests. At first, they may seem like minor problems, but like clutter in a junk drawer, they become worse over time if you never take care of them.

What if we lived in a world where we could solve massive global problems like world hunger by eliminating flaky tests in software development? While it may seem exaggerated, it is closer to the truth than you might think, and it highlights the enormous cost and resources that flaky tests drain in development teams worldwide.

Why Do Flaky Tests Matter?

Waste of Time and Money: When a test fails, it can be difficult to tell at first glance if there is a real bug or if the test is just flaky. Developers often have to rerun tests, sift through logs, and add extra debugging code to confirm the issue is genuine. Over time, these continued efforts add up to an enormous cost. Brian Demers and I estimated in our talk "Testing on Thin Ice: Chipping Away at Test Unpredictability" that $36 billion is wasted every year due to flaky tests worldwide, shockingly close to the $40 billion it would take to end world hunger by 2030.
Erosion of Trust: When more test failures are due to flakiness than actual issues in the code, the team loses confidence in the test suite. As a result, teams start to ignore test failures in general, which means that genuine test failures also get ignored. This means that bugs can slip through to production and cause more severe and expensive problems in the future.
Slowed Development Process: Flaky tests can slow down the CI pipeline, especially when developers must rerun it multiple times to get it to pass. In one project, our test suite took about an hour per run, but because of the flaky tests, we had to rerun it at least four or five times per merge request. The worst thing is that when someone else merged new code to the main branch, we had to rebase and try getting the pipeline to pass again. In practice, we would spend most of the day watching the pipeline to get a single change merged.

Like a junk drawer that becomes more daunting with each new item you toss in, allowing flaky tests to accumulate makes it more likely that you'll put off fixing them.

Common Causes of Flaky Tests

To effectively address flaky tests, you have to understand why tests become flaky in the first place. The most common reasons include:

Timing Issues or Hard-Coded Delays:
Using fixed waits, like Thread.sleep(500), instead of waiting for conditions to be met will lead to unpredictable failures. The test passes or fails depending on whether the system happens to be faster or slower than the hard-coded delay assumes.
UI Race Conditions:
In essence, this is another timing issue. The test may fail if you interact with a UI element in an end-to-end test before it is fully loaded. When you have hard-coded delays or do not wait, you will likely run into this issue on slower runs.
Unreliable or Shared Test Data:
Using the same data across multiple tests can make tests unreliable. For instance, creating a user with a fixed username in one test causes another test to fail when trying to create the same user. Randomly generated data may pass the first time but fail the second time if it violates a validation rule. For example, if a form field restricts the maximum length to 20 characters, the test succeeds if the first randomly generated string is 8 characters long. However, the test will fail if the next run generates a text with 22 characters.
Unstable Environments:
Tests that rely on external services or shared resources can break if the services are slow, unavailable, or unreliable. This is especially problematic in CI environments, where the resource allocation is usually constrained and can differ from one run to the next.
Test Order Dependencies:
If one test relies on the state left behind by another test, parallel or out-of-order execution can lead to failures. Each test should be able to run independently.
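
The test-order dependency described in the last item can be reproduced in miniature without any test framework. The following is a minimal sketch, not how JUnit schedules tests: the class name, the two check names, and the shared list are hypothetical and exist only for illustration. It runs the same two checks in two different orders and collects the ones that fail.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a tiny runner that executes named checks in a given order. */
final class OrderDependencyDemo {

    // Shared mutable state: the root cause of the order dependency.
    static final List<String> sharedUsers = new ArrayList<>();

    static Map<String, Runnable> checks() {
        Map<String, Runnable> checks = new LinkedHashMap<>();
        // "createsUser" leaves a user behind in the shared list.
        checks.put("createsUser", () -> sharedUsers.add("alice"));
        // "findsUser" silently relies on createsUser having run first.
        checks.put("findsUser", () -> {
            if (!sharedUsers.contains("alice")) {
                throw new AssertionError("user alice not found");
            }
        });
        return checks;
    }

    /** Runs the checks in the given order and returns the names of those that failed. */
    static List<String> run(List<String> order) {
        sharedUsers.clear(); // fresh state per run, but NOT per check
        Map<String, Runnable> checks = checks();
        List<String> failed = new ArrayList<>();
        for (String name : order) {
            try {
                checks.get(name).run();
            } catch (AssertionError e) {
                failed.add(name);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        System.out.println("declared order failures: " + run(List.of("createsUser", "findsUser")));
        System.out.println("reversed order failures: " + run(List.of("findsUser", "createsUser")));
    }
}
```

In the declared order both checks pass; in the reversed order findsUser fails because it depended on state that createsUser left behind. Giving each check its own setup removes the dependency.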

Strategies to Keep Tests Reliable

1. Awareness of Flaky Tests

You can't eliminate flaky tests if you don't know they exist. Every time you come across one, ensure you mark it using a consistent comment like:

// Flaky Test
@Test
void testSomething() {
    //...
}

Encourage everyone on the team to do this. The consistent keyword makes it easy to find all flaky tests in the code and gives you a quick overview of how many you have and where they are located.
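
Because the marker is a plain comment, counting it takes only a few lines of code. Here is a rough sketch, assuming the marker text `// Flaky Test` from the example above and a conventional `src/test/java` layout; the class and method names are made up for this illustration, and a simple text search in your IDE works just as well.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

final class FlakyMarkerScanner {

    // The marker comment from the example above; adjust if your team uses another keyword.
    static final String MARKER = "// Flaky Test";

    /** Counts the lines that contain the marker (pure function, easy to test). */
    static long countMarkers(List<String> lines) {
        return lines.stream().filter(line -> line.contains(MARKER)).count();
    }

    /** Walks a source tree and returns marker counts per .java file, skipping files without markers. */
    static Map<Path, Long> scan(Path root) {
        Map<Path, Long> result = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.toString().endsWith(".java"))
                 .forEach(p -> {
                     try {
                         long count = countMarkers(Files.readAllLines(p));
                         if (count > 0) {
                             result.put(p, count);
                         }
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // The default path is an assumption about the project layout.
        Path root = Path.of(args.length > 0 ? args[0] : "src/test/java");
        if (!Files.isDirectory(root)) {
            System.out.println("Not a directory: " + root);
            return;
        }
        scan(root).forEach((file, count) -> System.out.println(count + "\t" + file));
    }
}
```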

2. Fix One Flaky Test Each Sprint

Adopt the following rule for your team: For each sprint, fix at least one flaky test before starting a new task or feature.

Think of it as putting away one item from your junk drawer each week. Over time, the drawer (your flaky test backlog) will be empty. This practice is easy to justify to management ("it's just one test per sprint") but makes a considerable difference cumulatively.

Some teams go further by dedicating a day (like "Flaky Test Fridays") to address these issues systematically. This can be especially effective if your project has many flaky tests.

3. Use New Test Data

Make sure each test starts with a clean slate:

Isolate Test Data: To prevent collisions, use unique data for each test. Tools like Testcontainers allow you to spin up disposable containers for databases and other services quickly.
Setup and Teardown: In frameworks like JUnit, use @BeforeEach to set up fresh data and @AfterEach to remove it:

@BeforeEach
void setUpData() {
    // Insert test data or start a container
}

@AfterEach
void cleanUpData() {
    // Remove test data or stop the container
}
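
For the data itself, one possible way to get values that are unique per test and still pass validation is to combine a readable prefix with a random suffix and cap the length. This is a sketch under assumptions: the factory class, the `user-` prefix, and the limit of 20 characters (borrowed from the form-field example earlier) are illustrative, not part of any framework.

```java
import java.util.UUID;

final class TestDataFactory {

    // Keep at least this many random characters so truncation cannot remove all uniqueness.
    private static final int MIN_RANDOM_CHARS = 8;

    /** Builds a username that differs on every call and is never longer than maxLength. */
    static String uniqueUsername(String prefix, int maxLength) {
        if (maxLength - prefix.length() < MIN_RANDOM_CHARS) {
            throw new IllegalArgumentException("prefix leaves too little room for a random suffix");
        }
        // 32 hex characters of randomness before truncation.
        String random = UUID.randomUUID().toString().replace("-", "");
        String candidate = prefix + random;
        // Truncate to the validation limit so the generated value is always valid.
        return candidate.length() <= maxLength ? candidate : candidate.substring(0, maxLength);
    }

    public static void main(String[] args) {
        // 20 mirrors the maximum field length used as an example earlier in the article.
        System.out.println(uniqueUsername("user-", 20));
    }
}
```

Bounding the length in the generator avoids the case where a random value passes on one run and violates a validation rule on the next.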

4. Wait for Conditions to Be Met

Avoid hard-coded delays by waiting for specific events or conditions. For various popular end-to-end testing frameworks:

Selenium: Use WebDriverWait to wait until an element is present or clickable:

WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
WebElement element = wait.until(ExpectedConditions.elementToBeClickable(By.cssSelector("[data-testid='submit-button']")));

Cypress: Utilize built-in commands like .should('be.visible') to wait for elements.

Playwright: Use Playwright's auto-retrying assertions, like toHaveText and toBeVisible. They wait for conditions to be satisfied and will fail if not met within a certain time (timeout). For example, when you click a button and expect an element's text to change:

// ❌ Flaky: checks the text immediately; may fail if the text hasn't updated yet
expect(await this.textarea.textContent()).toBe('expected text');

// ✅ Reliable: waits until the element has the expected text
await expect(this.textarea).toHaveText('expected text');

WebdriverIO: Similar to Playwright, built-in assertions automatically wait for conditions to be met within a configurable timeout:

// ❌ Flaky: immediately checks if the button is displayed; fails if it isn't ready
expect(await $('[data-testid="submit-button"]').isDisplayed()).toBe(true);

// ✅ Reliable: waits until the button is displayed
await expect($('[data-testid="submit-button"]')).toBeDisplayed();

This approach makes tests more resilient to variations in system performance and load.
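
Outside of UI frameworks, the same idea can be written in plain Java. The helper below is a minimal sketch of condition polling with a timeout, not a replacement for a dedicated library such as Awaitility; the class name, method name, and the simulated condition in main are assumptions for illustration.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

final class Waiter {

    /**
     * Polls the condition until it returns true or the timeout elapses.
     * Returns true if the condition was met, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true; // stop as soon as the condition holds
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up once the timeout has passed
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger polls = new AtomicInteger();
        // Simulated condition: becomes true on the third poll.
        boolean ready = waitUntil(() -> polls.incrementAndGet() >= 3,
                Duration.ofSeconds(2), Duration.ofMillis(50));
        System.out.println("ready=" + ready + " after " + polls.get() + " polls");
    }
}
```

Unlike Thread.sleep(500), the wait ends as soon as the condition is true, and the timeout only matters on slow runs.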

5. Run Tests in Parallel

Parallel execution speeds up testing and uncovers hidden dependencies:

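Gradle: The --parallel flag runs tasks from different projects of a multi-project build in parallel. To run the tests within a single test task in parallel, set maxParallelForks on that task.
JUnit 5: Set junit.jupiter.execution.parallel.enabled to true to enable parallel execution. Tests still run sequentially until you also set junit.jupiter.execution.parallel.mode.default to concurrent or annotate them with @Execution(ExecutionMode.CONCURRENT).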
WebdriverIO and Playwright: They run tests in parallel with different files by default. You can also configure Playwright to parallelize tests within a file.

If tests fail when run in parallel, they may be relying on a shared state.
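
To make the shared-state problem concrete, here is a small simulation, assuming a set that stands in for a database table with a unique username constraint. Everything in it (class name, the barrier that forces the simulated tests to overlap, the usernames) is hypothetical. It is meant only to show why cleanup in teardown does not help once tests overlap in time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class ParallelCollisionDemo {

    /** Runs overlapping "create user" tests and returns how many could create their user. */
    static int runOverlapping(int tests, Supplier<String> usernames) {
        // Stand-in for a shared table with a unique constraint on the username.
        Set<String> userTable = ConcurrentHashMap.newKeySet();
        // One thread per test plus a barrier: every test inserts before any test cleans up.
        ExecutorService pool = Executors.newFixedThreadPool(tests);
        CyclicBarrier allInserted = new CyclicBarrier(tests);
        try {
            List<Callable<Boolean>> jobs = new ArrayList<>();
            for (int i = 0; i < tests; i++) {
                jobs.add(() -> {
                    String name = usernames.get();
                    boolean created = userTable.add(name); // false if the name is already taken
                    allInserted.await(5, TimeUnit.SECONDS);
                    if (created) {
                        userTable.remove(name); // teardown, too late to help the other tests
                    }
                    return created;
                });
            }
            int succeeded = 0;
            for (Future<Boolean> result : pool.invokeAll(jobs)) {
                if (result.get()) {
                    succeeded++;
                }
            }
            return succeeded;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("fixed name:   " + runOverlapping(4, () -> "alice"));
        System.out.println("unique names: " + runOverlapping(4, () -> "user-" + UUID.randomUUID()));
    }
}
```

With the fixed username only one of the four simulated tests can create its user; with unique usernames all four succeed.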

6. Temporarily Quarantine Flaky Tests

When a flaky test that is not part of your change fails and blocks progress, quarantine it so it doesn't prevent you from merging:

// JUnit
@Disabled("quarantine: reason or link to issue here")
@Test
void flakyTest() {
  //...
}

// Jest
// quarantine: reason or link to issue here
it.skip('should throw an error', () => {
  expect(response).toThrowError(expected_error);
});

Use a consistent prefix (e.g., quarantine:) to easily find and track these tests. However, do not let quarantined tests remain ignored forever—make fixing them a priority.

7. Split Up End-to-End and Integration Tests

End-to-end and integration tests are often slower and more prone to flakiness because they involve multiple components. Evaluate whether each test is at the right level of abstraction:

Integration Tests: Verify how your component behaves when interacting with other components (e.g., between your code and a database) without using the UI.
Unit Tests: Verify the behavior of individual functions or classes in isolation.

Teams often default to adding new integration tests because it seems more straightforward: fewer, broader tests can cover more code. However, this approach leads to larger, slower test suites that are harder to maintain. One practical approach to address this issue is to refactor complex, multi-purpose methods so each method has a specific purpose. This simplifies methods and makes them easier to test with unit tests. For instance, in one project, I took a monolithic integration test suite that ran for five minutes and restructured it so that most logic was covered by unit tests instead. This ended up reducing the total runtime to just 11 seconds.
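
As an illustration of that refactoring (not the code from the project mentioned above), the sketch below separates a calculation from the data access around it. The service, the price lookup function, and the discount rule are all invented for this example.

```java
import java.util.List;
import java.util.function.Function;

final class CheckoutService {

    // Hypothetical data-access dependency: maps an order id to its item prices in cents.
    private final Function<String, List<Long>> priceLookup;

    CheckoutService(Function<String, List<Long>> priceLookup) {
        this.priceLookup = priceLookup;
    }

    /** Thin orchestration: load the data, then delegate to the pure calculation. */
    long checkoutTotalCents(String orderId, int discountPercent) {
        return totalCents(priceLookup.apply(orderId), discountPercent);
    }

    /** Pure calculation with no I/O, so a unit test can call it directly. */
    static long totalCents(List<Long> itemPricesCents, int discountPercent) {
        if (discountPercent < 0 || discountPercent > 100) {
            throw new IllegalArgumentException("discountPercent must be between 0 and 100");
        }
        long sum = itemPricesCents.stream().mapToLong(Long::longValue).sum();
        return sum - sum * discountPercent / 100;
    }

    public static void main(String[] args) {
        // 1000 + 500 = 1500 cents, minus 10 percent (150) = 1350
        System.out.println(totalCents(List.of(1000L, 500L), 10));
    }
}
```

The discount rules can now be covered by fast unit tests against totalCents, while a single integration test checks that the lookup is wired correctly.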

Building a Reliable Test Suite: A Cultural Shift

Eliminating flaky tests isn't just a technical challenge. It requires a cultural shift within your team:

Educate the Team: Explain the impact and costs of flaky tests. Share examples and encourage everyone to treat flaky tests as a priority.
Track Flaky Tests: Keep a list or dashboard of known flaky tests, their status, and assigned owners. Bring them up in daily standups so they aren't forgotten.
Set Clear Goals: Commit to fixing a specific number or percentage of flaky tests within a set timeframe.
Celebrate Progress: Recognize and appreciate those who fix flaky tests. Positive feedback motivates the team to continue improving the test suite.

Conclusion

Flaky tests are not just minor annoyances. They waste valuable time, break your team's trust, and slow development. The good news is you don't have to tackle everything at once. Start with one of these ideas: fix a single flaky test each sprint, ensure your tests use isolated data, wait for specific conditions instead of using fixed delays, run tests in parallel, quarantine problematic tests, or replace extensive end-to-end tests with smaller, more focused tests. Adopting just one is a big step toward making your test suite more reliable.

Think of your test suite like that junk drawer: if you clean it up regularly, it stays helpful and easy to manage. Pick one flaky test that causes frequent trouble and fix it first. That small success will motivate your team and make your testing smoother.

For more details, you can watch my talk "Testing on Thin Ice: Chipping Away at Test Unpredictability" or you can get a free copy of my eBook "Stop Rerunning, Start Shipping: 7 Strategies to Eliminate Flaky Tests".

The post Eliminating Flaky Tests to End World Hunger appeared first on foojay.

Introducing BoxLings! An interactive teacher for BoxLang and TDD/BDD

Luis Majano — Thu, 09 Apr 2026 08:48:57 +0000

Table of Contents

What Is BoxLings?
The Full Learning Path

Phase 1 — Core Fundamentals (50 Exercises)
Phase 2 — Intermediate (40 Exercises)
Phase 3 — Advanced (48 Exercises)

The TDD/BDD Learning Journey
How It Works
Built for Learners, Classrooms & Workshops
Get Started
Join the Community

We believe the best way to learn a programming language is by writing code — real code, with real feedback, and real tests. That's exactly why we built BoxLings.

Inspired by the beloved rustlings project, BoxLings is an interactive CLI tool that teaches you BoxLang through hands-on exercises. You read failing tests, fix broken code, and level up — one exercise at a time.

Oh, and the whole thing is written in BoxLang itself. 🥊 Dogfooding at its finest.

What Is BoxLings?

BoxLings gives you 138 progressive exercises across 29 topics, from the basics of variables and functions all the way to async programming, Java interop, destructuring, and CLI app development.

But here's what makes BoxLings different: we teach TDD/BDD as a first-class skill, not an afterthought, using TestBox, our BDD/TDD testing library.

From day one, you'll read TestBox specs before touching any implementation code. You'll learn to think in tests. By the time you hit the intermediate exercises, you'll be writing your own. By Phase 3, you'll be doing the full red-green-refactor cycle like a pro.

The Full Learning Path

BoxLings is organized into three progressive phases, with 29 topics and 138 exercises in total.

🟢 Phase 1 — Core Fundamentals (50 Exercises)

Perfect for beginners and developers new to BoxLang:

#	Topic	Exercises	What You Learn
1	Introduction	2	Get started with BoxLings and BoxLang basics
2	Variables	6	Dynamic typing, the `var` keyword, scoping basics
3	Functions	6	UDFs, closures, lambdas
4	Conditionals	4	`if/else`, ternary, `switch`
5	Data Types	8	Strings, numbers, booleans, arrays, structs
6	Arrays	4	Array operations and member functions
7	Scopes	5	`variables`, `local`, `this`, `arguments` scopes
8	Structs	5	Struct manipulation and operations
9	Strings	6	Interpolation, multi-line strings, string operations
10	Imports	4	Importing classes and the `java:` prefix

🟡 Phase 2 — Intermediate (40 Exercises)

Dive deeper into BoxLang's power features:

#	Topic	Exercises	What You Learn
11	Structs Advanced	4	Deep operations, merging, complex manipulation
12	Null Handling	4	Elvis operator, safe navigation
13	Error Handling	6	`try/catch`, `throw`, custom exceptions
14	Interfaces	4	Implementing Java interfaces from BoxLang
15	Testing	5	Write your own TestBox specs!
16	Functional	8	`map`, `filter`, `reduce`, lambdas
17	Async	6	Threads, futures, async programming
18	Components	3	`bx:http`, `bx:query`, and more

🔴 Phase 3 — Advanced (48 Exercises)

Master BoxLang-specific and power-user features:

#	Topic	Exercises	What You Learn
19	Casting	5	`castAs`, `javaCast`, type conversions
20	Quizzes	3	Comprehensive knowledge reviews
21	Classes	8	OOP, properties, metadata
22	BIFs	6	Built-in functions and member functions
23	Templating	4	`.bxm` files and template syntax
24	CLI Apps	4	Building real CLI tools with BoxLang
25	Java Interop	6	Calling Java, the `java:` prefix in depth
26	Destructuring	4	Struct and array destructuring, renaming, nesting
27	Spread	4	Spread operator for arrays, structs, and function calls
28	Range	2	The `..` range operator and functional methods
29	Assert	2	The `assert` statement with custom messages

The TDD/BDD Learning Journey

BoxLings teaches test-driven development alongside BoxLang in four progressive stages:

Step 1 — Reading Tests (Topics 1–10)
Read TestBox specs to understand requirements. Tests are your documentation.

Step 2 — Understanding Patterns (Topics 11–14)
Multiple assertions, setup/teardown with beforeEach/afterEach, edge cases, and error scenarios.

Step 3 — Writing Tests (Topic 15)
Now you write the specs. Practice describe / it / expect from scratch.

Step 4 — Full TDD Cycle (Topics 16–29)
Red → Green → Refactor. The real deal.

How It Works

git clone https://github.com/ortus-boxlang/boxlings.git
cd boxlings
boxlang BoxLings.bx

BoxLings drops you into watch mode — it monitors your exercise files and reruns them automatically every time you save. Fix the code, hit save, see the tests go green.

Keyboard shortcuts in watch mode:

Key	Action
`n`	Next exercise
`h`	Show hint
`t`	Show test file
`l`	List all exercises
`r`	Rerun current exercise
`q`	Quit

Three exercise types are supported: scripts (.bxs), classes (.bx), and templates (.bxm), covering the full breadth of how BoxLang is used in practice.

Built for Learners, Classrooms & Workshops

BoxLings is self-contained and runs completely offline after the initial clone. Whether you're learning solo, teaching a workshop, or onboarding a new team member, BoxLings provides a structured, guided path with immediate feedback.

Estimated completion time:

🆕 Beginners: ~15–20 hours
💻 Experienced developers new to BoxLang: ~6–10 hours
🔥 Java developers: ~4–6 hours

Get Started

You'll need BoxLang 1.12+. We recommend BVM to manage your BoxLang versions:

curl -fsSL https://install-bvm.boxlang.io/ | bash
bvm install 1.12.0
bvm use 1.12.0

Then clone and go:

git clone https://github.com/ortus-boxlang/boxlings.git
cd boxlings
boxlang BoxLings.bx

Join the Community

We'd love to hear what you think — and contributions are very welcome. New exercises, bug fixes, documentation — all of it.

👉 github.com/ortus-boxlang/boxlings

Now go fix some broken code. 🥊

The post Introducing BoxLings! An interactive teacher for BoxLang and TDD/BDD appeared first on foojay.

TestBox 7: Real-Time Feedback, a Browser-Based IDE, and Modern Testing Workflows on the JVM

Cristobal Escobar — Tue, 24 Mar 2026 16:58:53 +0000

Table of Contents

Keyboard Shortcuts
Streaming Test Execution via SSE

Dry Run & Spec Discovery

BoxLang CLI Runner — New Power Options
Other Notable Improvements
TestBox CLI Updates (v1.8.0)
Upgrade Now

TestBox 7.x focuses on improving testing workflows for BoxLang and CFML applications. This release introduces improvements to the BoxLang CLI runner, real-time streaming test execution via SSE, dry run capabilities, a browser-based TestBox RUN interface, and several developer experience enhancements.

Check out what's new here: https://testbox.ortusbooks.com/readme/release-history/whats-new-with-7.0.0

TestBox RUN: A Browser IDE for Your Tests

The centerpiece of TestBox 7 is TestBox RUN: a self-hosted, single-page web app (bx/tests/index.bxm) that you drop into any BoxLang project and open in a browser. No build toolchain. No external service. Just BoxLang.

It communicates with your existing runner.bxm or runner.cfm endpoints and streams spec results in real time via Server-Sent Events. Results appear in the test tree as each spec finishes, green for passing and red for failures with full error messages, long before the full suite completes.

What You Get

Real-time streaming test tree — live updates per spec, not per suite
Dark / Light theme with localStorage persistence

Live search + status filters — filter by bundle, suite, or spec name; chips for Passed / Failed / Errored / Skipped

Per-bundle Run button — re-run a single bundle without touching the rest

Debug Buffer Panel — captured TestBox debug output surfaced per-bundle

Floating progress widget — current bundle, specs completed vs. total, animated progress bar

Configurable settings — runner URL, directory, bundle pattern, labels, excludes — all saved in localStorage

Every setting is also overridable via URL query params, making CI integration clean:

/tests/?directory=tests.specs.integration&labels=slow&runnerUrl=/tests/runner.bxm

Keyboard Shortcuts

Shortcut	Action
⌘/Ctrl + K	Focus search bar
⌘/Ctrl + Enter	Run all tests
⌘/Ctrl + .	Reload / rediscover tests
⌘/Ctrl + ,	Open Settings
⌘/Ctrl + B	Toggle expand/collapse all bundles
⌘/Ctrl + D	Toggle dark/light mode

Getting Started

TestBox RUN ships automatically with every TestBox 7 install under bx/tests/. ColdBox apps generated via the ColdBox CLI include it out of the box. For new projects:

testbox generate harness --help

Note: TestBox RUN requires a running web server and a runner.bxm endpoint with SSE support via BoxLang. For pure CLI apps, use the BoxLang runner with --stream (see below).

Coming Soon: TestBox RUN Desktop App

We're actively building a native desktop app version of TestBox RUN on the BoxLang Desktop Runtime — connect to any local or remote runner URL and get the same streaming UI without a browser. Watch testbox.run for early access.

Streaming Test Execution via SSE

TestBox 7 ships a brand-new StreamingRunner that pushes each spec result to the client the moment it completes, rather than buffering the entire suite.

StreamingRunner (Programmatic)

component {
    function streamTests( event, rc, prc ) {
        event.setHTTPHeader( name="Content-Type", value="text/event-stream" );
        event.setHTTPHeader( name="Cache-Control", value="no-cache" );

        new testbox.system.runners.StreamingRunner(
            bundles  = "tests.specs",
            options  = {},
            reporter = "text"
        ).run();
    }
}

BoxLang CLI `--stream` Flag

The BoxLang CLI runner gets native streaming support:

./testbox/run --stream
./testbox/run --directory=tests.specs --stream

This is especially useful in CI pipelines where live progress matters more than waiting for a buffered final report.
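
For readers who want to consume such a stream from JVM code, the sketch below parses the generic Server-Sent Events line format (event and data fields, a blank line ends an event, lines starting with a colon are comments). It is an illustration of the SSE wire format only: the event name `specEnd` and the JSON payload in main are invented placeholders, since the events TestBox actually emits are not described here.

```java
import java.util.ArrayList;
import java.util.List;

final class SseParser {

    /** One dispatched event: its name ("message" when none was given) and its data. */
    record Event(String name, String data) {}

    /** Parses already-split lines of an SSE stream into events. */
    static List<Event> parse(List<String> lines) {
        List<Event> events = new ArrayList<>();
        String name = "message";
        StringBuilder data = new StringBuilder();
        boolean hasData = false;
        for (String line : lines) {
            if (line.isEmpty()) {
                // A blank line dispatches the pending event, if it carries data.
                if (hasData) {
                    events.add(new Event(name, data.toString()));
                }
                name = "message";
                data.setLength(0);
                hasData = false;
                continue;
            }
            if (line.startsWith(":")) {
                continue; // comment line, often used as keep-alive
            }
            int colon = line.indexOf(':');
            String field = colon < 0 ? line : line.substring(0, colon);
            String value = colon < 0 ? "" : line.substring(colon + 1);
            if (value.startsWith(" ")) {
                value = value.substring(1); // a single leading space is not part of the value
            }
            switch (field) {
                case "event" -> name = value;
                case "data" -> {
                    if (hasData) {
                        data.append('\n'); // multiple data lines are joined with newlines
                    }
                    data.append(value);
                    hasData = true;
                }
                default -> { } // id, retry, and unknown fields are ignored in this sketch
            }
        }
        return events; // an unterminated trailing event is dropped
    }

    public static void main(String[] args) {
        // Placeholder input; real event names and payloads depend on the server.
        List<String> sample = List.of("event: specEnd", "data: {\"status\":\"passed\"}", "");
        parse(sample).forEach(System.out::println);
    }
}
```

In a real client the lines could come from `java.net.http.HttpClient` using `HttpResponse.BodyHandlers.ofLines()`, which delivers the response body line by line as it arrives.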

Dry Run & Spec Discovery

Two long-requested features land in TestBox 7: spec discovery and dry run mode. Audit exactly what would run before committing to a full suite execution.

Runner Dry Run
If you call runner.bxm or runner.cfm with ?dryRun=true, it returns a JSON representation of what the test execution would look like.

Programmatic Dry Run

var tb      = new testbox.system.TestBox( bundles = "tests.specs" );
var results = tb.dryRun();

CLI Dry Run

./testbox/run --dry-run

Lists every suite and spec that would execute, with labels and skip reasons — perfect for coverage audits and CI test inventory reporting.

JSON Output

Need to feed results into another tool?

./testbox/run --dry-run=json
./testbox/run --dry-run=json --bundles=tests.specs.MySpec | jq .

Dry run respects all the same filters as a normal run: --labels, --bundles, --directory, --testSuites, --testSpecs.

BoxLang CLI Runner — New Power Options

The BoxLang runner gets a substantial set of new flags for fine-grained control over output, failures, and performance analysis.

Focus on Failures

./testbox/run --show-failed-only

Stack Trace Control

./testbox/run --stacktrace=short   # condensed (default)
./testbox/run --stacktrace=full    # complete Java/BoxLang trace

Output & Performance Flags

# Suppress passing or skipped specs
./testbox/run --show-passed=false
./testbox/run --show-skipped=false

# Abort after N failures
./testbox/run --max-failures=10

# Flag slow specs
./testbox/run --slow-threshold-ms=500

# Report the N slowest specs at the end
./testbox/run --top-slowest=5

Combine them for a tight CI workflow:

./testbox/run --show-failed-only --stacktrace=short --max-failures=5 --top-slowest=3

Application Mappings Auto-Load (TESTBOX-440)

The BoxLang runner now automatically loads Application.bx mappings from your project root before running tests. Custom path mappings, datasources, and settings are available to your specs with zero extra configuration — bringing the CLI experience much closer to a full web server environment.

Other Notable Improvements

`ConsoleReporter` — Hide Skipped Tests (TESTBOX-433)

Stop noisy skipped-spec output when you have many pending specs:

var testbox = new testbox.system.TestBox(
    bundles  = "tests.specs",
    reporter = {
        type    : "testbox.system.reports.ConsoleReporter",
        options : { hideSkipped : true }
    }
);

Or from the CLI: --show-skipped=false

Suite Filtering Fixes (TESTBOX-435)

Direct suite name matching is now reliable at any nesting depth. If a suite's name exactly matches testSuites, it always runs — no more surprises with nested suites getting skipped.

./testbox/run --testSuites="My Integration Suite"

TestBox CLI Updates (v1.8.0)

The testbox-cli CommandBox module hits 1.8.0 with two new commands:

# Show installed version, path, and project config
testbox info

# Force a clean reinstall of the CLI module
testbox reinstall

Streaming is also available via the CLI:

testbox run --streaming
testbox run --streaming --verbose   # include passing specs in live output

Engine Support

Engine	Status
BoxLang 1.x+	✅ PREFERRED
Lucee 7.x	✅ NEW
Lucee 6.x	✅
Lucee 5.x	⚠️ DEPRECATED
Adobe 2025	✅
Adobe 2023	⚠️ DEPRECATED
Adobe 2021	❌ Dropped

Adobe 2021 is no longer supported. Upgrade to Adobe 2023+ or migrate to BoxLang.

Upgrade Now

TestBox 7 is available today via CommandBox:

box install testbox

Or pin to 7.x:

box install testbox@^7.0.0

Full release notes and issue links are in the TestBox documentation. As always, file bugs and feature requests in our JIRA. You can also check out the what's new guide here: https://testbox.ortusbooks.com/readme/release-history/whats-new-with-7.0.0

The post TestBox 7: Real-Time Feedback, a Browser-Based IDE, and Modern Testing Workflows on the JVM appeared first on foojay.

How to Customize JaCoCo Report Styling in Your Java Project

Bruno Borges — Fri, 13 Feb 2026 13:10:45 +0000

Table of Contents

The Problem
The Strategy: CSS Overlay

Step 1: Create Your Custom report.css
Step 2: Overlay CSS During Maven Build
Step 3: Handle CI Deployment (Optional)

Watch Out: Output Directory Paths
The Result
Quick Start Checklist
Full Example

JaCoCo is the go-to code coverage tool for Java projects. It integrates seamlessly with Maven, generates detailed HTML reports, and works out of the box. But let's be honest — the default reports look like they were designed in 2008, because they were.

If you're publishing coverage reports as part of your project documentation (on GitHub Pages, for example), you probably want them to match your site's look and feel. The good news: it's entirely possible. The bad news: JaCoCo offers zero built-in support for CSS customization.

In this post, I'll walk you through how we themed the JaCoCo reports in the Copilot SDK for Java project to match our Maven site design — and how you can do the same.

The Problem

JaCoCo generates standalone HTML reports that reference their own jacoco-resources/report.css via a `<link>` tag. There's no plugin configuration, no skin system, no hook to inject your own CSS. The generated HTML looks like this:

<link rel="stylesheet" href="jacoco-resources/report.css" type="text/css"/>

That report.css controls everything: table styling, source code highlighting, coverage bar colors, breadcrumbs, typography. If you want to change the look, you need to replace that file after JaCoCo generates it.

Here is what the default CSS looks like:

The Strategy: CSS Overlay

The approach is simple: let JaCoCo generate its reports with the default CSS, then overwrite report.css with your custom version. We do this at two levels:

During mvn site — a Maven Resources Plugin execution copies our CSS after JaCoCo finishes
During CI deployment — a workflow step overlays the CSS across all versioned documentation directories

Step 1: Create Your Custom `report.css`

Start by grabbing JaCoCo's default report.css. You can find it in target/site/jacoco/jacoco-resources/report.css after running mvn site with coverage data, or in the JaCoCo source code.

Place your customized version at:

src/site/jacoco-resources/report.css

Here's what ours looks like — a light theme with rounded cards, GitHub-style colors, and a system font stack:

/* ===== Custom JaCoCo Report Theme ===== */

body, td {
  font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
  font-size: 10pt;
  color: #24292f;
}

body {
  background: #f6f8fa;
  margin: 0;
  padding: 20px;
}

/* Breadcrumb navigation */
.breadcrumb {
  background: #fff;
  border: 1px solid #d0d7de;
  border-radius: 10px;
  padding: 10px 16px;
  margin-bottom: 20px;
}

/* Coverage table */
table.coverage {
  border-collapse: separate;
  border-spacing: 0;
  border: 1px solid #d0d7de;
  border-radius: 10px;
  overflow: hidden;
  width: 100%;
  background: #fff;
}

table.coverage thead {
  background: #f6f8fa;
}

table.coverage thead td {
  padding: 10px 14px;
  border-bottom: 2px solid #d0d7de;
  font-weight: 700;
}

table.coverage tbody td {
  padding: 8px 10px;
  border-bottom: 1px solid #eaeef2;
}

table.coverage tbody tr:hover {
  background: rgba(102, 126, 234, 0.04) !important;
}

/* Source code coverage highlights */
pre.source span.fc  { background-color: #dafbe1; }  /* fully covered - green */
pre.source span.nc  { background-color: #ffeef0; }  /* not covered - red */
pre.source span.pc  { background-color: #fff8c5; }  /* partially covered - yellow */

You can see the full file in our repo: src/site/jacoco-resources/report.css.

Important: You must include the element icon classes (.el_package, .el_class, .el_method, etc.) and the sortable header classes (.sortable, .up, .down) in your custom CSS — they reference GIF images that JaCoCo generates alongside the report. Omitting them breaks navigation icons and column sorting.
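
A quick guard against accidentally dropping those classes is to scan the custom stylesheet for them. The following is a crude sketch based on plain substring matching, assuming only the class names listed in the note above (the "etc." means the list is not exhaustive); the checker class itself is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class JacocoCssCheck {

    // Class names taken from the note above; extend the list with the remaining .el_* classes.
    static final List<String> REQUIRED =
            List.of(".el_package", ".el_class", ".el_method", ".sortable", ".up", ".down");

    /** Returns the required selectors that do not occur anywhere in the stylesheet text. */
    static List<String> missingSelectors(String css) {
        List<String> missing = new ArrayList<>();
        for (String selector : REQUIRED) {
            // Substring match only: ".up" would also match a longer name such as ".upper".
            if (!css.contains(selector)) {
                missing.add(selector);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Path css = Path.of(args.length > 0 ? args[0] : "src/site/jacoco-resources/report.css");
        if (!Files.isRegularFile(css)) {
            System.out.println("No stylesheet found at " + css);
            return;
        }
        try {
            System.out.println("Missing selectors: " + missingSelectors(Files.readString(css)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```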

Step 2: Overlay CSS During Maven Build

Add a copy-resources execution to your maven-resources-plugin configuration in pom.xml. The key is using phase: site so it runs after the JaCoCo reporting plugin generates its default CSS:


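<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-resources-plugin</artifactId>
    <executions>
        <execution>
            <id>overlay-jacoco-css</id>
            <phase>site</phase>
            <goals>
                <goal>copy-resources</goal>
            </goals>
            <configuration>
                <outputDirectory>${project.reporting.outputDirectory}/jacoco/jacoco-resources</outputDirectory>
                <resources>
                    <resource>
                        <directory>src/site/jacoco-resources</directory>
                        <filtering>false</filtering>
                    </resource>
                </resources>
                <overwrite>true</overwrite>
            </configuration>
        </execution>
    </executions>
</plugin>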

See this in context: pom.xml (lines 371–387).

Now when you run mvn site, the custom CSS replaces JaCoCo's default immediately after report generation.

Step 3: Handle CI Deployment (Optional)

If you deploy versioned documentation (like we do — snapshot/, latest/, 1.0.8/, etc.), you need to overlay the CSS for all version directories, not just the one Maven just built.

In our GitHub Actions deploy workflow, we add a step after all version builds complete:

- name: Overlay custom JaCoCo CSS
  run: |
    cd site
    for dir in */jacoco/jacoco-resources; do
      if [ -d "$dir" ]; then
        cp ../src/site/jacoco-resources/report.css "$dir/report.css"
        echo "Overlaid JaCoCo CSS in $dir"
      fi
    done

See the full workflow: deploy-site.yml.
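
If your deployment tooling runs on the JVM rather than in a shell step, the same loop can be written in Java. This is a sketch that mirrors the workflow step above under the same directory assumptions (one subdirectory per version, each with jacoco/jacoco-resources); the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

final class JacocoCssOverlay {

    /** Copies customCss over report.css in every <version>/jacoco/jacoco-resources directory. */
    static List<Path> overlay(Path siteRoot, Path customCss) {
        List<Path> updated = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(siteRoot)) {
            for (Path entry : entries) {
                Path resources = entry.resolve("jacoco").resolve("jacoco-resources");
                // Skip versions (and plain files) that have no JaCoCo report.
                if (Files.isDirectory(resources)) {
                    Files.copy(customCss, resources.resolve("report.css"),
                            StandardCopyOption.REPLACE_EXISTING);
                    updated.add(resources);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return updated;
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("usage: JacocoCssOverlay <siteRoot> <customCss>");
            return;
        }
        overlay(Path.of(args[0]), Path.of(args[1]))
                .forEach(dir -> System.out.println("Overlaid JaCoCo CSS in " + dir));
    }
}
```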

Watch Out: Output Directory Paths

This is the gotcha that caught us. JaCoCo has two plugin configurations in a typical Maven project, and they output to different directories:

Context	Plugin section	Default output path
mvn verify	<build><plugins> (goal: report, phase: verify)	Configurable (we use jacoco-coverage/)
mvn site	<reporting><plugins> (goal: report)	jacoco/ (default)

When deploying documentation, you typically run mvn site, not mvn verify. So the reports land in jacoco/, not wherever your build plugin is configured to write. Your CSS overlay must target the same path the reporting plugin uses. If they're mismatched, you'll end up with custom CSS in an empty directory and default CSS on your actual reports.
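
One way to catch such a mismatch automatically is to check, after the build, that the directory you overlaid really contains a report and that its stylesheet is your custom one. The sketch below assumes a report directory with an index.html and jacoco-resources/report.css; the class name is made up, and `Files.mismatch` requires Java 12 or later.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class OverlayVerifier {

    /**
     * Returns true only if reportDir holds an actual report (index.html exists)
     * and its report.css is byte-identical to the custom stylesheet.
     */
    static boolean overlayApplied(Path customCss, Path reportDir) {
        Path index = reportDir.resolve("index.html");
        Path deployedCss = reportDir.resolve("jacoco-resources").resolve("report.css");
        // Custom CSS in a directory without index.html means the overlay hit the wrong path.
        if (!Files.isRegularFile(index) || !Files.isRegularFile(deployedCss)) {
            return false;
        }
        try {
            return Files.mismatch(customCss, deployedCss) == -1L; // -1 means no difference
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("usage: OverlayVerifier <customCss> <reportDir>");
            return;
        }
        System.out.println("overlay applied: " + overlayApplied(Path.of(args[0]), Path.of(args[1])));
    }
}
```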

The Result

Same data, completely different experience. The custom theme has clean white cards, rounded borders, GitHub-style coverage colors, and modern typography.

You can see the live result on our documentation site: Copilot SDK for Java — JaCoCo Report.

Quick Start Checklist

☐ Run mvn clean verify site to generate a JaCoCo report with default CSS
☐ Copy target/site/jacoco/jacoco-resources/report.css to src/site/jacoco-resources/report.css
☐ Customize the CSS to match your site's design
☐ Keep all .el_* icon classes and .sortable/.up/.down sort classes — they reference JaCoCo's GIF assets
☐ Add the overlay-jacoco-css execution to maven-resources-plugin in your pom.xml
☐ If deploying to GitHub Pages with versioned docs, add the workflow overlay step
☐ Run mvn clean verify site again and open target/site/jacoco/index.html to verify

Full Example

The complete implementation is in the Copilot SDK for Java repository:

Custom report.css — the full themed stylesheet
pom.xml overlay config — the Maven copy-resources execution
Deploy workflow — CI overlay step for versioned docs
Live report — the themed output on GitHub Pages

Built with ❤️ by Bruno Borges and GitHub Copilot.

The post How to Customize JaCoCo Report Styling in Your Java Project appeared first on foojay.

Testing Emails with Testcontainers and Mailpit

Simon Martinelli — Thu, 29 Jan 2026 08:38:10 +0000

Table of Contents

What is Mailpit?
Why Testcontainers fits perfectly
The Mailpit Testcontainer module

Maven dependency

Using Spring Boot with @ServiceConnection
Using Mailpit without Spring Boot
Fluent AssertJ assertions
Waiting for asynchronous emails
Why this approach works well
Conclusion

Testing email functionality is often painful. SMTP servers are external, tests become slow or flaky, and local setups differ from CI environments. As a result, many teams either mock the mail sender or skip proper email tests completely.

Both approaches are unsatisfying. Mocking does not test real behavior, and shared SMTP servers introduce hidden dependencies. What we really want is a real SMTP server that runs locally and in CI, is fully isolated per test run, and allows us to inspect sent emails easily.

This is exactly what Testcontainers and Mailpit provide.

What is Mailpit?

Mailpit is a small and fast SMTP testing server with a modern web UI. Instead of delivering emails, it captures them and exposes everything through an HTTP API and a browser-based inbox. Applications can send emails via SMTP as usual, while tests can inspect the captured messages programmatically or visually in the UI.

This makes Mailpit ideal for automated tests and local development.

Why Testcontainers fits perfectly

Testcontainers allows you to start Docker containers directly from your tests. Containers are created on demand, work the same locally and in CI, and are automatically cleaned up afterwards. There is no manual setup and no shared infrastructure.

Since Mailpit already provides an official Docker image, combining it with Testcontainers is a natural fit.

The Mailpit Testcontainer module

To make this integration easy, I created a dedicated Testcontainers module for Mailpit: https://github.com/martinellich/testcontainers-mailpit

It provides a ready-to-use MailpitContainer, a Java client for the Mailpit API, and convenient test assertions.

Maven dependency

Add the dependency to your test scope:


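<dependency>
  <groupId>ch.martinelli.oss</groupId>
  <artifactId>testcontainers-mailpit</artifactId>
  <version>1.2.0</version>
  <scope>test</scope>
</dependency>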

Using Spring Boot with @ServiceConnection

If you use Spring Boot 3.1 or newer, the cleanest solution is @ServiceConnection. Spring Boot will automatically wire the SMTP connection and also provide a MailpitClient bean.

You only need a small test configuration:

@TestConfiguration(proxyBeanMethods = false)
class TestcontainersConfiguration {

  @Bean
  @ServiceConnection
  MailpitContainer mailpitContainer() {
    return new MailpitContainer();
  }
}

In your test, you can now use JavaMailSender as usual, and verify emails via MailpitClient:

@SpringBootTest
@Import(TestcontainersConfiguration.class)
class EmailServiceTest {

  @Autowired
  JavaMailSender mailSender;

  @Autowired
  MailpitClient client;

  @Test
  void shouldSendAndVerifyEmail() {
    var msg = new SimpleMailMessage();
    msg.setFrom("noreply@myapp.com");
    msg.setTo("user@example.com");
    msg.setSubject("Welcome");
    msg.setText("Hello!");

    mailSender.send(msg);

    var messages = client.getAllMessages();
    assertThat(messages).hasSize(1);
    assertThat(messages.get(0).subject()).isEqualTo("Welcome");
  }
}

No mail properties are required. Spring Boot derives everything from the running container.

Using Mailpit without Spring Boot

The Mailpit container can also be used in plain JUnit tests. In this case, you configure the SMTP host and port manually and then verify messages via the container’s client.

@Testcontainers
class PlainEmailTest {

  @Container
  static MailpitContainer mailpit = new MailpitContainer();

  @Test
  void shouldSendEmail() throws Exception {
    Properties props = new Properties();
    props.put("mail.smtp.host", mailpit.getSmtpHost());
    props.put("mail.smtp.port", String.valueOf(mailpit.getSmtpPort()));

    Session session = Session.getInstance(props);

    MimeMessage message = new MimeMessage(session);
    message.setFrom(new InternetAddress("sender@example.com"));
    message.setRecipient(RecipientType.TO, new InternetAddress("recipient@example.com"));
    message.setSubject("Test Subject");
    message.setText("Hello, this is a test email!");

    Transport.send(message);

    var messages = mailpit.getClient().getAllMessages();
    assertThat(messages).hasSize(1);
    assertThat(messages.get(0).subject()).isEqualTo("Test Subject");
  }
}

This approach works well if you are not using Spring Boot or want full control over the mail setup.

Fluent AssertJ assertions

Recent versions of the library include AssertJ-style assertions that make tests much more readable. Instead of manually fetching messages, you can express expectations directly.

import static ch.martinelli.oss.testcontainers.mailpit.assertions.MailpitAssertions.assertThat;

@Test
void shouldVerifyEmailSent() {
  // send email...

  assertThat(mailpit)
      .hasMessages()
      .hasMessageCount(1)
      .hasMessageWithSubject("Welcome")
      .hasMessageTo("user@example.com")
      .hasMessageFrom("noreply@myapp.com");
}

You can also assert details of a specific message:

@Test
void shouldVerifyMessageDetails() {
  // send email...

  assertThat(mailpit)
      .firstMessage()
      .hasSubject("Order Confirmation")
      .isFrom("orders@shop.com")
      .hasRecipient("customer@example.com")
      .hasNoAttachments()
      .hasSnippetContaining("Thank you");
}

Waiting for asynchronous emails

Many applications send emails asynchronously. For these cases, the assertions support waiting with timeouts and polling.

@Test
void shouldWaitForAsyncEmail() {
  // trigger async email sending...

  assertThat(mailpit)
      .withTimeout(Duration.ofSeconds(30))
      .withPollInterval(Duration.ofSeconds(1))
      .awaitMessage()
      .withSubject("Password Reset")
      .to("user@example.com")
      .isPresent();
}

This removes the need for manual Thread.sleep calls and makes async tests reliable.
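To make the difference from a fixed Thread.sleep concrete, here is a minimal sketch of the general polling idea in plain Java. It is not the library's implementation; the Await class and its until method are hypothetical names used only for illustration. The helper returns as soon as the condition holds and gives up once the timeout has elapsed.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical helper illustrating "wait with timeout and polling". */
class Await {

    /**
     * Checks the condition repeatedly, sleeping pollInterval between checks.
     * Returns true as soon as the condition holds, false if the timeout
     * elapses first or the waiting thread is interrupted.
     */
    static boolean until(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        AtomicBoolean emailArrived = new AtomicBoolean(false);

        // Stand-in for asynchronous email delivery that takes about 200 ms.
        new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            emailArrived.set(true);
        }).start();

        boolean arrived = until(emailArrived::get, Duration.ofSeconds(5), Duration.ofMillis(50));
        System.out.println("email arrived: " + arrived);
    }
}
```

A fixed sleep always costs the full wait and still fails when delivery is slower than expected. Polling costs only as long as the email actually takes, bounded by the timeout.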

Why this approach works well

With Mailpit and Testcontainers, you test the full email flow end-to-end. There are no mocks, no shared servers, and no environment-specific configuration. The same setup works locally and in CI, and debugging is easy thanks to the Mailpit web UI.

Most importantly, you test what you actually ship.

Conclusion

Email testing does not need to be complex. A small Testcontainer and a lightweight SMTP server are enough to get reliable, readable, and maintainable tests. Mailpit fits naturally into modern Spring Boot and JUnit setups and removes a common source of fragile tests.

Give it a try. Keep IT simple.

This article was originally published on https://martinelli.ch/testing-emails-with-testcontainers-and-mailpit/

The post Testing Emails with Testcontainers and Mailpit appeared first on foojay.

Flaky Tests: a journey to beat them all

Loic Mathieu — Tue, 13 Jan 2026 13:00:07 +0000

Table of Contents

What’s a flaky test?
First try: retry them all!
Second try: fix them all!
Third try: embrace the inevitability!
Conclusion

“Sleep is not a synchronization primitive.”

Every test engineer, eventually

What’s a flaky test?

A flaky test is a test that sometimes passes and sometimes fails without any code changes. They’re the by‑product of non‑determinism: timing, concurrency, eventual consistency, network hiccups, clock drift, resource contention, and (our favorite) tests leaking state across runs.

Kestra is an open-source declarative orchestration platform designed to run, coordinate, and monitor large-scale, event-driven workflows. It is built to handle parallelism, asynchronous execution, and distributed systems at scale, exactly the kind of environment where determinism is hard and flaky tests tend to emerge.

At Kestra, we run 6,000+ tests across our repositories. We add dozens every day. If only 1% of those are flaky at 10% failure probability, you’ve got ~60 flaky tests. Expectation math says ~6 failures per CI run; good luck spotting real regressions under that noise.
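The arithmetic is easy to check. Here is a small sketch that takes the round figures literally (6,000 tests, 1% flaky, 10% failure probability each) and assumes, for simplicity, that flaky tests fail independently of each other. The class and method names are made up for this illustration.

```java
/** Back-of-the-envelope flaky-test math; all inputs are illustrative. */
class FlakyTestMath {

    /** Expected number of failing tests per run. */
    static double expectedFailures(int flakyCount, double failureProbability) {
        return flakyCount * failureProbability;
    }

    /** Probability that at least one flaky test fails, assuming independent failures. */
    static double probabilityOfRedRun(int flakyCount, double failureProbability) {
        return 1.0 - Math.pow(1.0 - failureProbability, flakyCount);
    }

    public static void main(String[] args) {
        int totalTests = 6_000;
        int flakyCount = totalTests / 100; // 1% of the suite
        double failureProbability = 0.10;

        System.out.println("flaky tests: " + flakyCount);
        System.out.println("expected failures per run: "
                + expectedFailures(flakyCount, failureProbability));
        System.out.println("chance of at least one failure: "
                + probabilityOfRedRun(flakyCount, failureProbability));
    }
}
```

Under these assumptions the chance of a fully green run is well below 1%, which is why a handful of flaky tests is enough to keep CI red almost permanently.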

As an orchestration platform, many of our tests execute parallel, asynchronous workflows. Async is powerful and naturally tricky to test: ordering isn’t guaranteed, and “eventually consistent” is not a helpful assertion.

One of our top issues is due to our queuing system; a test may receive a message from another test or miss a message from the queue. We strive to properly close the queue and handle all messages to ensure they are not leaked across tests, but it’s challenging to guarantee this.

Last year, CI was red often enough that we decided to go on a proper flake‑hunting journey.

First try: retry them all!

Our first attempt to beat them all was to retry the flaky tests.

Kestra is built in Java, and tests are written with the JUnit framework. The JUnit Pioneer extension contains an annotation that allows for retrying a test if it fails: @RetryingTest(5). We added this annotation to every test that often fails in our CI.

This helped… a bit. But it also inflated test times and masked real issues. Worse, some failures are structural (leaked resources, race conditions): once they fail, they keep failing, no matter how often you retry.
Verdict: good band‑aid, bad cure.
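To see why retrying is only a band-aid, here is a plain-Java sketch of a retry-until-success policy. It is a conceptual illustration, not JUnit Pioneer's actual implementation, and RetrySketch and passesWithRetries are hypothetical names. An intermittent failure disappears behind the retries, while a structural one fails on every attempt and only costs extra time.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Conceptual model of "retry a failing test up to N times". */
class RetrySketch {

    /** Runs the test body up to maxAttempts times; passes if any attempt passes. */
    static boolean passesWithRetries(Runnable testBody, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                testBody.run();
                return true;
            } catch (AssertionError | RuntimeException failure) {
                // Swallowed: this failure stays invisible if a later attempt passes.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();

        // Fails on the first two runs, then passes (a timing-style flake).
        Runnable intermittent = () -> {
            if (calls.incrementAndGet() < 3) {
                throw new AssertionError("timing issue");
            }
        };

        // Fails on every run (for example, a leaked resource).
        Runnable structural = () -> {
            throw new AssertionError("leaked resource");
        };

        System.out.println(passesWithRetries(intermittent, 5)); // true
        System.out.println(passesWithRetries(structural, 5));   // false
    }
}
```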

Second try: fix them all!

We then decided to put effort into fixing the failing tests! We removed every use of the @RetryingTest(5) annotation and either fixed each test or disabled it.

Most of the flaky tests launch a workflow and assert on its execution, so we improved our testing framework in this area to make sure that every test properly closes its resources and that every workflow and execution created by a test is deleted.

For that, we created a JUnit extension to manage test resource creation:

A @KestraTest annotation handles starting and closing the Kestra runner in the scope of a test class.
A @LoadFlows annotation handles loading and then removing flows in the scope of a test method.
An @ExecuteFlow annotation handles starting and then removing a flow execution in the scope of a test method.

Using this test framework everywhere gives us more control over resource allocation and deallocation, and allows us to clean any flow or execution created by a test to avoid possible test pollution with unrelated resources.

But after weeks of effort, we had disabled too many tests. The number of flaky tests had decreased, but some were still failing, if only rarely, and with as many tests as we have, even rare failures were enough to make our CI suffer.

Third try: embrace the inevitability!

So tests will fail, some pretty often, some rarely; we had to accept that.
We have to be pragmatic and embrace the inevitability of tests being flaky.

We decided to flag flaky tests and allow them to fail in the CI! This was not an easy decision as nobody wants to concede failure and accept it. But if we want to have a reliable CI without compromising test coverage and exploding testing implementation time, we have to avoid disabling tests and accept that some would fail pretty often.

To flag a flaky test, we annotate it with @FlakyTest, a custom marker annotation that wraps JUnit's @Tag("flaky") annotation.
JUnit tags are well suited to this use case: they let you target a group of tagged tests when running your tests.

Our CI now launches tests in two steps:
First, tests not tagged as flaky: those must pass for the CI run to be green
Then, tests tagged as flaky: those can fail
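The gating rule behind these two steps fits in a few lines. The sketch below is plain Java and does not show our actual CI setup. FlakyGate and TestResult are hypothetical names, and the flaky flag stands in for the presence of the "flaky" tag.

```java
import java.util.List;

/** Illustrative model of a CI gate that ignores failures of tests tagged as flaky. */
class FlakyGate {

    /** Outcome of one test; 'flaky' mirrors the presence of a "flaky" tag. */
    record TestResult(String name, boolean flaky, boolean passed) {}

    /** The build is green when every test that is not tagged as flaky passed. */
    static boolean buildIsGreen(List<TestResult> results) {
        return results.stream()
                .filter(result -> !result.flaky())
                .allMatch(TestResult::passed);
    }

    /** Names of the failing tests in one group (flaky or standard), for reporting. */
    static List<String> failures(List<TestResult> results, boolean flakyGroup) {
        return results.stream()
                .filter(result -> result.flaky() == flakyGroup && !result.passed())
                .map(TestResult::name)
                .toList();
    }

    public static void main(String[] args) {
        List<TestResult> results = List.of(
                new TestResult("shouldParseFlow", false, true),
                new TestResult("shouldRunParallelFlow", true, false));

        System.out.println("green: " + buildIsGreen(results));
        System.out.println("flaky failures: " + failures(results, true));
        System.out.println("standard failures: " + failures(results, false));
    }
}
```

Flaky failures are still collected and reported. They just no longer decide whether the run is green.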

We also improved our CI to report standard tests and flaky tests separately, with a test summary in a PR comment that directly lists the failing tests with their stack traces. This allows us to better pinpoint any test issues.

Of course, flagging a test as flaky is an easy thing to do, so we make sure to try fixing the test first and only tag it as flaky as a last resort.
We have test observability in place to track flaky tests, so if they increase a lot, we would know.

Conclusion

You won’t beat every flaky test. That’s fine. The goal is to get reliable signals back into CI so you can confidently merge and ship. Separate what must be green from what’s allowed to wobble, invest in deterministic test lifecycles, and keep an eye on the flaky set so it doesn’t quietly grow.

Flakes are inevitable. Letting flakes dictate your delivery is optional.

Want to try Kestra? You can get started in 5 minutes following the quickstart guide.

The post Flaky Tests: a journey to beat them all appeared first on foojay.

JC-AI Newsletter #7

Miro Wengner — Tue, 14 Oct 2025 05:35:01 +0000

Fourteen days have passed, and it is time to present a fresh collection of readings that could influence developments in the field of artificial intelligence.

Beyond focused tutorials that can enhance your understanding of AI applications, this newsletter concentrates on Hallucination, Java Code Generation, Testing, Agentic System Architecture and LLM benchmarking methodologies designed to ensure model accuracy and competency in handling complex contextual information.

The world influenced by LLMs is changing very quickly. Let's start...

article: The Missing Layer in AI Infrastructure: Aggregating Agentic Traffic
authors: Eyal Solomon
date: 2025-08-22
desc.: Agentic AI systems introduce new challenges across the entire system architecture. Beyond what research articles address, several critical issues remain unresolved and may pose serious risks, particularly in system architecture where the adoption of LLMs has triggered a major paradigm shift in system design. A key concern involves outbound API calls made by autonomously acting AI agents (e.g., chaining tools, calling external services). Current infrastructure, including API gateways and service meshes, is primarily designed around inbound traffic or service-to-service communication, rather than managing agent-initiated outbound calls. This creates a significant blind spot in our architectural oversight.
category: architecture

article: Learning to Reason for Hallucination Span Detection
authors: Hsuan Su, Ting-Yao Hu, Hema Swetha Koppula, Kundan Krishna, Hadi Pouransari, Cheng-Yu Hsieh and others
date: 2025-10-02
desc.: This paper addresses the challenges of advancing from simple binary classification (Chain-of-Thought, CoT) to fine-grained span-level hallucination detection. In-domain reasoning is essential for robust hallucination detection. The normalization step in Group Relative Policy Optimization proves crucial, as simple reward rescaling policies cannot effectively mitigate reward hacking in the dataset employed. The paper proposes a reinforcement learning framework with span-level rewards to align large language model (LLM) reasoning with hallucination detection tasks on the RAGTruth benchmark. The research was done during an internship at Apple.
category: research

article: Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage
authors: Siddhant Arora, Haidar Khan, Kai Sun, Xin Luna Dong, Sajal Choudhary, Seungwhan Moon, Xinyuan Zhang and others
date: 2025-10-02
desc.: End-to-end speech recognition and fluent answering without noticeable pauses present significant challenges for utilizing LLMs in dialogue-based agentic systems. These systems are prone to hallucination effects caused by various factors. While improving input/output accuracy through Retrieval Augmented Generation (RAG) approaches can mitigate hallucination effects and significantly increase accuracy, this comes with various penalties such as increased resource requirements and latencies. The paper proposes a Model-Triggered Stream RAG approach as an alternative to fixed-interval RAG streaming or without RAG. Although the paper does not provide a complete solution to these challenges, it proposes a benchmarking strategy for future research and highlights key achievements. This research was conducted in cooperation with Meta.
category: research

article: Self-Anchor: Large Language Model Reasoning via Step-by-step Attention Alignment
authors: Hongxiang Zhang, Yuan Tian, Tianyi Zhang
date: 2025-10-03
desc.: While prompting strategies show effectiveness in certain tasks, they lack robustness across different benchmarks and model architectures, performing better on larger LLMs and simpler reasoning problems. This paper proposes a Self-Anchor mechanism for structured reasoning with automatic anchoring. Self-Anchor delivers consistent improvements across tasks, model sizes, and architectures, demonstrating strong robustness and effectiveness. The approach leverages inherent structure in reasoning chains to improve attention alignment and enhance reasoning capabilities. However, Self-Anchor primarily addresses attention misalignment without fully resolving deeper issues related to logical validity, semantic understanding, or computational precision.
category: research

article: Abstain and Validate: A Dual-LLM Policy for Reducing Noise in Agentic Program Repair
authors: José Cambronero, Michele Tufano, Sherry Shi, Renyao Wei, Grant Uy, Runxiang Cheng and others
date: 2025-10-03
desc.: Agentic Automated Program Repair (APR) is increasingly addressing complex, repository-level bugs in industry settings. However, agent-generated patches still require human review before deployment to ensure they properly resolve the underlying issues. This paper introduces two complementary LLM-based policies for patch assessment. The paper addresses and discusses limitations in automated patch procedures, human supervision requirements, and company-specific bug-fixing approaches. This paper results from cooperation between Google and Meta.
category: research

article: Cache-to-Cache: Direct Semantic Communication Between Large Language Models
authors: Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang
date: 2025-10-03
desc.: Rather than relying on text-to-text communication between LLM-based systems, which incurs latency penalties, this paper proposes the Cache-To-Cache paradigm for direct inter-system communication. Experimental results demonstrate improved efficiency and performance without requiring additional cache capacity.
category: research

article: FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
authors: Victor May, Diganta Misra, Yanqi Luo, Anjali Sridhar, Justine Gehring, Silvio Soares Ribeiro Junior
date: 2025-10-06
desc.: This paper introduces the FreshBrew approach for Java code migration tasks utilizing agentic LLM systems. The migration experiments from JDK8 to JDK17 and JDK21 demonstrate the limitations of current LLM implementations, even when integrated with modern deterministic migration tools such as OpenRewrite. Although the overall migration success rate was approximately 50%, the paper provides a comprehensive discussion of the associated limitations and challenges. The work was done in cooperation with the Max Planck Institute, Google and Salesforce.
category: research

article: Investigating The Smells of LLM Generated Code
authors: Debalina Ghosh Paul, Hong Zhu, Ian Bayley
date: 2025-10-03
desc.: The paper proposes a scenario-based method for evaluating the quality of LLM-generated code, as such models are increasingly utilized for program code generation. The study experiments with Java programs using Gemini Pro, ChatGPT, Codex, and Falcon LLMs to obtain results. The paper highlights that for moderately advanced topics, particularly those involving object-oriented programming concepts, the generated code quality is noticeably poorer than that of human-written code.
category: research

article: Which Programming Language and Model Work Best With LLM-as-a-Judge For Code Retrieval?
authors: Lucas Roberts, Denisa Roberts
date: 2025-09-30
desc.: This paper examines the comparative abilities of Large Language Models (LLMs) and human annotators in identifying and annotating specific elements within source code. The study investigates several widely-used programming languages, including C, Java, JavaScript, Go, and Python. The experimental results reveal various limitations and challenges associated with automated code annotation, while proposing possibilities for future research and emphasizing the critical role of human expertise in the annotation process.
category:

article: AI Coding Tools Blog Post - Model Context Protocol Mastery - Claude, Cursor
authors: Mani Sarkar
date: 2025-10-07
desc.: The post describes and provides guidance on configuring MCP for AI-assisted development using Claude and Cursor. Please be aware of the 'No Warranty' statement and use this as an example only.
category: tutorial

article: DiffTester: Accelerating Unit Test Generation for Diffusion LLMs via Repetitive Pattern
authors: Lekang Yang, Yuetong Liu, Yitong Zhang, Jia Li
date: 2025-09-29
desc.: Writing high-quality unit tests is often a time-consuming effort that requires extensive knowledge of the business domain. This paper proposes DiffTester, an acceleration framework designed to overcome the limitations imposed by single-token generation constraints. The DiffTester framework identifies common patterns through syntax tree analysis. Experimental results demonstrate that the DiffTester framework can be used to generate a larger number of tokens, thereby achieving better accuracy.
category: research

article: Deloitte caught out using AI in $440,000 report | 7.30
authors: ABC News In-depth
date: 2025-10-09
desc.: Hallucination remains a significant challenge in current large language models (LLMs). These inaccuracies can cause damage at various levels and require a careful eye to identify.
category: youtube

article: SusBench: An Online Benchmark for Evaluating Dark Pattern Susceptibility of Computer-Use Agents
authors: Longjie Guo, Chenjie Yuan, Mingyuan Zhong, Robert Wolfe, Ruican Zhong, Yue Xu, Bingbing Wen and others
date: 2025-10-13
desc.: In the age of AI, deception through manipulation or hallucination-based choices poses a serious threat to human or system reliability. This paper proposes SusBench, a benchmark for evaluating the susceptibility of computer-use agents (CUAs) and humans to dark patterns that may mislead both users and agents into making harmful decisions. The paper demonstrates that neither humans nor agentic AI systems based on large language models (LLMs) exhibit adequate resistance against dark patterns.
category: research

Previous:
Newsletter vol.1
Newsletter vol.2
Newsletter vol.3
Newsletter vol.4
Newsletter vol.5
Newsletter vol.6

The post JC-AI Newsletter #7 appeared first on foojay.

AI Test Generation: A Dev’s Guide Without Shooting Yourself in the Foot

Jonathan Vila — Mon, 26 May 2025 08:06:15 +0000

Table of Contents

So, AI Can Write Tests Now? Cool, But...
How AI Learns to Code (And Why That's a Problem for Tests)
Problem #1: AI Tests Might Just Be Wrong
Problem #2: Testing the Code You Have, Not the Code You Need (Verification vs. Validation Trap)
So? What to Do? Don't Use AI for Generating Tests? Nah, Rely on Your AI Test Quality Guardian
How to Use AI Test Tools Without Getting Burned
Wrapping Up

So, AI Can Write Tests Now? Cool, But...

AI assisted coding tools are everywhere now, helping with autocomplete, suggesting fixes, and sometimes writing surprisingly large blocks of code. A hot topic is using generative AI to generate tests automatically – unit, integration, e2e, etc.

The idea's definitely appealing. Who wouldn't want an AI to help crank out tests, bump up those coverage numbers, and maybe save us from some of the testing grind? It sounds like a fast track to better feedback and tackling that mountain of untested code.

But, hang on a sec. Like any tool, especially one this complex, artificial intelligence is not a silver bullet. Just grabbing AI driven tests and calling it a day is risky. You might think your code's solid because the test count is high, but the tests themselves might be junk.

These AI language models learn from tons of code online and in repos – and let's face it, a lot of that code isn't exactly high-quality or correct.

This article is for devs figuring out how to actually use these AI tools to produce high-quality software without creating a mess. We'll touch on the good stuff but focus on the traps: the tests might be flat-out wrong, or they might just "prove" that your buggy code works exactly like the buggy mess it is, instead of checking if it meets user needs.

We'll also bring in static analysis with SonarQube, using its big list of Java test rules [https://rules.sonarsource.com/java/tag/tests/] to show concrete examples of what can go wrong and what to watch for. The point isn't to ditch AI, but to use it intelligently, so you don't trade real quality for fake coverage.

How AI Learns to Code (And Why That's a Problem for Tests)

To understand why AI tests can be uncertain, it helps to know how these code-generating AIs learn. Most are Large Language Models (LLMs) trained on absolutely massive datasets. These datasets contain billions of lines of code from GitHub, Stack Overflow, open-source projects, maybe your own company's code.

The AI digests all this and learns patterns: common code structures, how people usually use certain APIs, popular libraries, coding styles. It gets really good at predicting the next bit of code in a sequence, leading it to write stuff that often looks right.

But that's the catch. The training data is just… code. All kinds of code. Including:

Plain old bugs.
Nasty security holes.
Weird anti-patterns.
Outdated code using old libraries, patterns or approaches.
Code that ignores style guides.
Code with zero useful comments.

The AI doesn't understand good code from bad code. It just mimics the patterns it saw. If buggy code patterns were common in its training data, it'll happily reproduce them. It's the classic "garbage in, garbage out" deal.

So when you ask this AI to write tests, problems pop up:

The Tests Have Bugs: The generated test code itself might be flawed, misuse resources, have race conditions – just like buggy tests humans write.
The Tests Verify Bugs: This is the really sneaky one. The AI looks at your current code, sees how it works (even the buggy parts), and writes a test to confirm that behavior. It doesn't know what the code should do from requirements; it just tests what the code does.

Think of learning English only by reading internet comments. You'd get good at slang and common mistakes, but you wouldn't be able to write clean technical docs. An AI testing tool trained on a huge, messy pile of code is similar – good at mimicry, not guaranteed to be correct or follow best practices in software development.

AI powered tests can be inaccurate and may only validate existing code, not the intended behavior. Let’s see some of the main problems you can encounter by generating the tests with AI.

Problem #1: AI Tests Might Just Be Wrong

Yeah, AI can generate code that uses @Test and compiles. It will save a lot of manual effort on time consuming test case generation. But is it correct? Often, it might not be. When you're reviewing AI-generated tests, watch out for:

Looks Right, Works Wrong: AI usually nails the syntax. But code that compiles doesn't mean the test logic is sound or it tests anything useful.
Incomplete Tests: Super common. The AI sets things up, calls the method, and... forgets the important part.
No Asserts: A test without asserts is pointless. AI often forgets to actually check the result.
Weak Asserts: An assertNotNull(result) is better than nothing, but doesn't prove the result is correct. Also assertTrue(true) is useless.
Happy Path Only: AI often tests the simple case. What about nulls, errors, edge conditions? AI might miss these unless you specifically tell it to check them.
Weird or Irrelevant Tests: AI can "hallucinate" and generate tests for things that don't make sense for your app, or test trivial details instead of important behavior.
Sneaky Logic Bugs: These look okay at first glance.
Bad Setup: Mocking things wrong, starting the test in an invalid state.
Bad Asserts: Using the wrong comparison, expecting the wrong result, off-by-one errors.
Flaky Tests: Tests involving threads or async code are hard. AI might generate tests that sometimes pass, sometimes fail due to timing issues.
Missing the Big Picture: Good tests often need domain knowledge. AI usually doesn't have deep context about your specific app unless you give it lots of info. It might test a method fine in isolation but miss its system-wide impact.
Dynamic Stuff & Async: Testing tricky things like UIs, message queues, or async operations? AI often struggles to generate reliable tests for these without a lot of help or manual fixes.

Here’s a quick example:

// AI might generate something like this:
@Test
void testProcessItem() {
    ItemProcessor processor = new ItemProcessor(/* dependencies */);
    Item item = getTestItem();

    // Maybe AI doesn't know mocking is needed here
    // ItemRepository mockRepo = mock(ItemRepository.class);
    // when(mockRepo.save(any(Item.class))).thenReturn(item);
    // processor.setRepository(mockRepo);

    processor.process(item);

    // Problem: No assertion! Does 'process' do anything? Is item saved?
}

Looks like a test, runs, but proves nothing. And it will be GREEN! You gotta check.
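For contrast, here's a rough sketch of what the missing check could look like. To keep it self-contained it uses plain Java and a hand-written fake instead of JUnit and a mocking library. Item, ItemRepository and ItemProcessor are hypothetical stand-ins for the types in the snippet above, and the assumption is that process is supposed to save the item.

```java
import java.util.ArrayList;
import java.util.List;

/** Self-contained sketch: the same scenario, but with an actual check at the end. */
class ItemProcessorCheck {

    // Hypothetical domain types, defined here only so the example compiles.
    record Item(String id) {}

    interface ItemRepository {
        void save(Item item);
    }

    /** Hand-written fake that records every saved item. */
    static class RecordingRepository implements ItemRepository {
        final List<Item> saved = new ArrayList<>();

        @Override
        public void save(Item item) {
            saved.add(item);
        }
    }

    static class ItemProcessor {
        private final ItemRepository repository;

        ItemProcessor(ItemRepository repository) {
            this.repository = repository;
        }

        void process(Item item) {
            repository.save(item); // assumed behavior: processing persists the item
        }
    }

    /** Fails loudly unless processing saved exactly the given item. */
    static void processSavesItem() {
        RecordingRepository repository = new RecordingRepository();
        ItemProcessor processor = new ItemProcessor(repository);
        Item item = new Item("item-1");

        processor.process(item);

        if (!repository.saved.equals(List.of(item))) {
            throw new AssertionError("expected exactly one saved item, got " + repository.saved);
        }
    }

    public static void main(String[] args) {
        processSavesItem();
        System.out.println("process() saved the item");
    }
}
```

If process stops saving the item, this check goes red. The AI-generated version above would stay green either way.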

Problem #2: Testing the Code You Have, Not the Code You Need (Verification vs. Validation Trap)

This is the deeper problem. Even if an AI test is technically correct for the current code, it might be testing the wrong thing if the code itself is buggy. It's about Verification vs. Validation:

Verification: "Are we building the product right?" Does the code do what the current implementation says it does? AI is okay at this.
Validation: "Are we building the right product?" Does the code actually meet the user's real needs? Does it solve the problem correctly? AI struggles here.

If your calculateTax method has a bug and returns negative tax for some inputs, an AI looking at the code might generate a test asserting that calculateTax(badInput) should return that negative number. It verifies the bug.

Here is a simple example of this buggy method and its AI generated test:

public BigDecimal calculateTax(BigDecimal income) {
   BigDecimal grossTax = income.multiply(TAX_RATE);

   // *** THE BUG IS HERE ***
   // Simple subtraction without checking if the result is negative.
   BigDecimal netTax = grossTax.subtract(STANDARD_DEDUCTION);

   // Rounding for standard currency format (e.g., 2 decimal places)
   return netTax.setScale(2, RoundingMode.HALF_UP);
}

@Test
@DisplayName("Test calculateTax: Should return expected negative tax for low income due to BUG")
void calculateTax_whenIncomeIsLow_shouldReturnNegativeTax_dueToBug() {
  BigDecimal lowIncome = new BigDecimal("10000.00");
  // Expected calculation: (10000 * 0.15) - 5000 = 1500 - 5000 = -3500
  BigDecimal expectedNegativeTax = new BigDecimal("-3500.00");
  BigDecimal actualTax = calculator.calculateTax(lowIncome);
  // We are specifically asserting that the bug produces this negative result.
  assertEquals(expectedNegativeTax, actualTax,
                "BUG CONFIRMATION: calculateTax should return -3500.00 for 10000.00 income");
}

Why?

Code is the Source: The AI learns from the code you give it.
No Requirements Mind-Reading: Without clear, up-to-date requirements, AI doesn't know what the code should do.
It Matches Patterns: It sees input -> process -> output in your code and writes a test for that specific pattern, bug or not.
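
For contrast, here is a minimal sketch of what a requirement-driven check could look like. It assumes a requirement the article does not state, namely that tax owed is never negative and is floored at zero. The class name TaxCalculator is hypothetical, and the rate and deduction values are taken from the comment in the test above, not from the real code.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class TaxCalculator {
    // Assumed values, taken from the comment in the AI-generated test above.
    static final BigDecimal TAX_RATE = new BigDecimal("0.15");
    static final BigDecimal STANDARD_DEDUCTION = new BigDecimal("5000");

    // Assumed requirement: the tax owed is never negative.
    static BigDecimal calculateTax(BigDecimal income) {
        BigDecimal netTax = income.multiply(TAX_RATE).subtract(STANDARD_DEDUCTION);
        // Floor at zero before rounding to two decimal places.
        return netTax.max(BigDecimal.ZERO).setScale(2, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        BigDecimal tax = calculateTax(new BigDecimal("10000.00"));
        // The check comes from the requirement, not from what the code returns today.
        if (tax.signum() < 0) {
            throw new AssertionError("Tax must never be negative, got " + tax);
        }
        System.out.println("Tax for 10000.00: " + tax); // Tax for 10000.00: 0.00
    }
}
```

Run against the buggy implementation, the same check would fail, which is exactly the signal the AI-generated test hides.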

Let’s look at the challenges of the validation trap and at recommendations to avoid them.

The False Confidence Problem: This is bad. A test passing because of a bug makes everything look green, but the bug is still there, now with a test "protecting" it. Fix the bug later, and the AI's test fails, confusing everyone.
Ignoring Requirements Changes: Requirements evolve. Code written last month might be wrong now. AI testing the code won't know that. It just keeps confirming the potentially outdated behavior.
Analogy: Like spell-checking a document but not fact-checking it. Verification passes, validation fails.
Can AI Test Requirements Directly? Some tools try. You feed them requirements (like Gherkin specs), and they generate tests \cite{aws, visuresolutions, thoughtworks}. Better, but still needs perfect, up-to-date requirements and the AI can still misinterpret them. Many simple AI tools just look at the code.

Consider this buggy code:

// Buggy Implementation
public String formatUsername(String name) {
    if (name == null || name.trim().isEmpty()) {
        return "guest"; // Should maybe throw exception?
    }
    // Bug: Doesn't handle names with spaces well
    return name.toLowerCase();
}

// AI-Generated Test (Based on Buggy Code)
@Test
void whenNameHasSpace_shouldReturnLowerCase() { // Validates bug!
    UserFormatter formatter = new UserFormatter();
    String result = formatter.formatUsername("Test User");
    // AI sees the code returns "test user", so it asserts that.
    // Requirement might be to remove spaces or throw error.
    assertEquals("test user", result);
}

This test passes but locks in the bad behavior of allowing spaces. You, the dev, need to check if the test matches the requirement, not just the buggy code.

So? What to Do? Don't Use AI for Generating Tests? Nah, Rely on Your AI Test Quality Guardian

Given that AI tests can be wonky, using static analysis tools is pretty much essential. These tools automatically scan your code (including tests) against a huge rulebook, finding potential bugs, security issues, and just plain confusing code. When AI is potentially adding lots of code fast, you need this automated check.

Some of these tools even promote AI Code assurance, to keep AI-generated code in check, sometimes with even stricter rules. Makes sense – treat AI code with the same (or more) skepticism as human code.

One of these tools is SonarQube, which has 47 specific rules just for Java tests [https://rules.sonarsource.com/java/tag/tests/]. Let's break down the kinds of issues it catches, with quick examples showing how AI might mess up.

1. Assertions - Did You Actually Check Anything?

Purpose: Ensure tests make meaningful checks. It is easy to forget assertions, and a test without them still passes, which makes the omission difficult to spot.
Example (Rule S2699):

// Noncompliant code
@Test
void testAddItem() {
    Cart cart = new Cart();
    Item item = new Item("Thing");
    cart.add(item);
    // Forgot to assert!
}

// Compliant code
@Test
void testAddItem() {
    Cart cart = new Cart();
    Item item = new Item("Thing");
    cart.add(item);
    assertEquals(1, cart.getItemCount()); // Added assertion
}

AI Trap: AI might just call the method and forget the assert.

2. Test Structure, Setup, Teardown - Getting the Basics Right

Purpose: Enforce standard test structure conventions needed by frameworks like JUnit.
Example (Rule S5786):

// Noncompliant code (JUnit 5)
@Test
private void myPrivateTest() { // Test methods shouldn't be private
    assertTrue(true);
}

// Compliant code (JUnit 5)
@Test
void myVisibleTest() { // Default visibility is fine, or public
    assertTrue(true);
}

AI Trap: Generating methods with wrong visibility (private, static) or return types.

3. Naming Conventions - Can Anyone Understand This?

Purpose: Make tests readable and understandable from their names. The BDD convention is widely adopted, but there are others.
Example (Rule S3577):

// Noncompliant code
@Test
void test1() {
    // ... complex setup and assert ...
}

// Compliant code
@Test
void shouldThrowIllegalArgumentException_WhenInputIsNull() {
    // ... clear test logic for null input ...
}

AI Trap: Using generic names like testMethod1 or test_feature_abc.

4. Using Test Frameworks Correctly - JUnit/TestNG Gotchas

Purpose: Ensure proper use of framework features and APIs. Test frameworks provide specific ways of handling different use cases. In this particular case, exceptions.
Example (Rule S5776):

// Noncompliant code (JUnit 5) - Old way to check exceptions
@Test
void testDivisionByZero_OldWay() {
    Calculator calc = new Calculator();
    try {
        calc.divide(1, 0);
        fail("Should have thrown ArithmeticException");
    } catch (ArithmeticException expected) {
        // Expected exception caught, test passes implicitly
    }
}

// Compliant code (JUnit 5) - Using assertThrows
@Test
void testDivisionByZero_NewWay() {
    Calculator calc = new Calculator();
    assertThrows(ArithmeticException.class, () -> {
        calc.divide(1, 0);
    });
}

AI Trap: Using outdated patterns (like the try/catch/fail for exceptions) or mixing framework versions.

5. Performance and Resource Usage - Don't Slow Down the Build

Purpose: Avoid bad practices like printing to console or leaking resources in tests. Console output is not easily configurable, can interfere with build tools, and is a synchronous operation that slows down the build.
Example (Rule S106):

// Noncompliant code
@Test
void testSomethingComplex() {
    // ... logic ...
    System.out.println("Debug: Intermediate value = " + value); // Avoid this
    // ... asserts ...
}

// Compliant code
@Test
void testSomethingComplex() {
    // ... logic ...
    log.debug("Debug: Intermediate value = " + value);
    // ... asserts ...
}

AI Trap: Leaving System.out.println calls used during generation/debugging.

6. Mocking Frameworks - Using Mocks Correctly

Purpose: Guide correct usage of mocking frameworks like Mockito, especially in the setup phase. An incorrect setup can lead to unexpected issues.
Example (Rule S5979):

// Noncompliant code (Potential Issue: Forgetting to mock)
@Test
void testServiceUsingRepository() {
    // Missing mock setup for repository dependency
    MyRepository repo = null; // should be mock(MyRepository.class)
    MyService service = new MyService(repo); // Might throw NPE because repo is null
    service.doWork();
    // Assertions might fail unpredictably
}

// Compliant code (Basic Mocking)
@Test
void testServiceUsingRepository() {
    MyRepository repo = mock(MyRepository.class); // Mock dependency
    when(repo.getData()).thenReturn("mock data"); // Stub method call
    MyService service = new MyService(repo);
    service.doWork();
    verify(repo).getData(); // Verify interaction
    // Add assertions based on service logic
}

AI Trap: Generating incomplete mock setups or incorrect verification logic.

7. Exception Handling - Be Specific

Purpose: Ensure tests checking for exceptions look for the specific expected exception. A generic exception type can mask several different failure cases, so the test should expect only the exception type the code is meant to throw.
Example (Rule S112):

// Noncompliant code
@Test
void testInvalidInput() {
    Processor processor = new Processor();
    // This is too broad, might catch unexpected runtime exceptions
    assertThrows(Exception.class, () -> {
        processor.process(null);
    });
}

// Compliant code
@Test
void testInvalidInput() {
    Processor processor = new Processor();
    // Be specific about the expected exception
    assertThrows(IllegalArgumentException.class, () -> {
        processor.process(null);
    });
}

AI Trap: Using generic Exception when a more specific one is appropriate.

Seeing these examples shows how easy it is for generated code (and human code!) to violate basic testing hygiene. SonarQube acts as your automated checklist for this stuff.

How to Use AI Test Tools Without Getting Burned

So, how do you actually use these tools without causing chaos?

Human Review is Mandatory (Really!): Never skip this. Check if the test makes sense, if the asserts are good, if it tests the requirement, if it covers edge cases, and if your static analysis guardian is happy.
Use Static Analysis Everywhere: Put a linter in your IDE. Put it in your CI pipeline. Fail the build if quality drops. Make it non-negotiable.
Let AI Do the Easy Stuff: Don't expect miracles. Use AI for:
- Boilerplate: Test methods, basic setup/teardown.
- Simple Mocking: Basic when/thenReturn.
- Test Variations: Generating different inputs for a test you already wrote and trust.
Give Better Instructions: Garbage prompts = garbage tests.
- Add Context: Give it docs, requirements snippets, good examples.
- Be Specific: Tell it exactly what to test, what to mock, what to assert.
Iterate: Treat the AI output as a first draft. Review, fix, improve.
Learn the Tools: Figure out how your specific AI tool works best. Practice prompting. Learn to spot its common mistakes quickly.
Start Small: Try it on a safe project first. See if it really saves time after you account for fixing its output.

Wrapping Up

AI test generation? It's here and it can write test code fast, which is cool. But don't just trust it blindly. AI often gets things wrong, misses assertions, or writes tests that just confirm your bugs are still there.

Think of AI as a helper, not the expert. Let it write first drafts or boring bits. But you need to review everything. Does it test the actual requirement? Is the logic sound? Use static analysis to automatically check for common mistakes in your pipeline. Keep your brain engaged, and you can probably get some real speed benefits from AI without sacrificing quality.

The post AI Test Generation: A Dev’s Guide Without Shooting Yourself in the Foot appeared first on foojay.

Mutation Testing in Rust

Nicolas Frankel — Wed, 09 Apr 2025 12:39:50 +0000

Table of Contents

Starting with cargo-mutants
Finding and fixing the issue
Conclusion

I've been a big fan of Mutation Testing since I discovered PIT. As I dive deeper into Rust, I wanted to check the state of mutation testing in Rust.

Starting with `cargo-mutants`

I found two crates for mutation testing in Rust:

cargo-mutants
and mutagen

mutagen hasn't been maintained for three years, while cargo-mutants is still under active development.

I've ported the sample code from my previous Java code to Rust:

struct LowPassPredicate {
    threshold: i32,
}

impl LowPassPredicate {
    pub fn new(threshold: i32) -> Self {
        LowPassPredicate { threshold }
    }

    pub fn test(&self, value: i32) -> bool {
        value < self.threshold
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn should_return_true_when_under_limit() {
        let low_pass_predicate = LowPassPredicate::new(5);
        assert_eq!(low_pass_predicate.test(4), true);
    }

    #[test]
    fn should_return_false_when_above_limit() {
        let low_pass_predicate = LowPassPredicate::new(5);
        assert_eq!(low_pass_predicate.test(6), false);
    }
}

Using cargo-mutants is a two-step process:

Install it, cargo install --locked cargo-mutants
Use it, cargo mutants

Found 4 mutants to test
ok       Unmutated baseline in 0.1s build + 0.3s test
 INFO Auto-set test timeout to 20s
4 mutants tested in 1s: 4 caught

I expected a mutant to survive, as I didn't test the boundary when the test value equals the limit. Strangely enough, cargo-mutants didn't detect it.
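
To make that expectation concrete, here is a small sketch in Java, the language of the original sample. It is not how cargo-mutants works internally. It only compares the original predicate with a hypothetical `<=` mutant on the inputs used by the two tests above. A mutant counts as caught when at least one tested input produces a different result than the original.

```java
import java.util.function.IntPredicate;

class BoundaryMutantDemo {
    static final int THRESHOLD = 5;

    // Original predicate, mirroring LowPassPredicate::test.
    static final IntPredicate ORIGINAL = value -> value < THRESHOLD;
    // Hypothetical mutant: < replaced with <=.
    static final IntPredicate MUTANT = value -> value <= THRESHOLD;

    // True if any tested input distinguishes the mutant from the original.
    static boolean isCaught(int... testInputs) {
        for (int input : testInputs) {
            if (ORIGINAL.test(input) != MUTANT.test(input)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isCaught(4, 6));    // false: the mutant survives
        System.out.println(isCaught(4, 5, 6)); // true: the boundary value kills it
    }
}
```

Only the boundary value 5 tells the two operators apart, so a test suite without it cannot catch this mutant.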

Finding and fixing the issue

I investigated the source code and found the place where it mutates operators:

// We try replacing logical ops with == and !=, which are effectively
// XNOR and XOR when applied to booleans. However, they're often unviable
// because they require parenthesis for disambiguation in many expressions.
BinOp::Eq(_) => vec![quote! { != }],
BinOp::Ne(_) => vec![quote! { == }],
BinOp::And(_) => vec![quote! { || }],
BinOp::Or(_) => vec![quote! { && }],
BinOp::Lt(_) => vec![quote! { == }, quote! {>}],
BinOp::Gt(_) => vec![quote! { == }, quote! {<}],
BinOp::Le(_) => vec![quote! {>}],
BinOp::Ge(_) => vec![quote! {<}],
BinOp::Add(_) => vec![quote! {-}, quote! {*}],

Indeed, < is mutated to == and >, but not to <=. I forked the repo and updated the code accordingly:

BinOp::Lt(_) => vec![quote! { == }, quote! {>}, quote!{ <= }],
BinOp::Gt(_) => vec![quote! { == }, quote! {<}, quote!{ >= }],

I installed the new forked version:

cargo install --git https://github.com/nfrankel/cargo-mutants.git --locked

I reran the command:

cargo mutants

The output is the following:

Found 5 mutants to test
ok       Unmutated baseline in 0.1s build + 0.3s test
 INFO Auto-set test timeout to 20s
MISSED   src/lib.rs:11:15: replace < with <= in LowPassPredicate::test in 0.2s build + 0.2s test
5 mutants tested in 2s: 1 missed, 4 caught

You can find the same information in the missed.txt file. I thought I fixed it and was ready to make a Pull Request to the cargo-mutants repo. I just needed to add the test at the boundary:

#[test]
fn should_return_false_when_equals_limit() {
    let low_pass_predicate = LowPassPredicate::new(5);
    assert_eq!(low_pass_predicate.test(5), false);
}

cargo test

running 3 tests
test tests::should_return_false_when_above_limit ... ok
test tests::should_return_false_when_equals_limit ... ok
test tests::should_return_true_when_under_limit ... ok

test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

cargo mutants

And all mutants are killed!

Found 5 mutants to test
ok       Unmutated baseline in 0.1s build + 0.2s test
 INFO Auto-set test timeout to 20s
5 mutants tested in 2s: 5 caught

Conclusion

Not many blog posts end with a Pull Request, but this one does. Unfortunately, I couldn't manage to make the tests pass; fortunately, the repository maintainer helped me–a lot. The Pull Request is merged: enjoy this slight improvement.

I learned more about cargo-mutants and could improve the code in the process.

Originally published at A Java Geek on March 30th, 2025

The post Mutation Testing in Rust appeared first on foojay.

Pull request testing on Kubernetes: testing locally and on GitHub workflows

Nicolas Frankel — Sun, 16 Mar 2025 17:32:27 +0000

Table of Contents

Unit testing vs. integration testing
Testcontainers
Use-case: application with database
"Unit" testing
"Integration" testing
The GitHub workflow
Alternative "Unit testing" on GitHub
Conclusion

Imagine an organization with the following practices:

Commits code on GitHub
Runs its CI/CD pipelines with GitHub Actions
Runs its production workload on Kubernetes
Uses Google Cloud

A new engineering manager arrives and asks for the following:

On every PR, run integration tests in a Kubernetes cluster similar to the production one.

It sounds reasonable.

Engineering manager: I want #integrationtests to run on the app deployed on #Cloud infra for each #GitHub PR ✅

Me, thinking it's a no-brainer: sure thing! 🤦‍♂️

Me, after 85 runs: I have content for my next blog post/talk 😅

— Nicolas Fränkel 🇺🇦🇬🇪 (@frankel.ch) December 20, 2024 at 10:52 AM

In this series of posts, I'll show how you can do it. My plan is the following:

This blog post focuses on the app, the basic GitHub workflow setup, and testing both locally and during the workflow run
The second blog post will detail the setup of a Google Kubernetes Engine instance and how to adapt the workflow to use it
The third and final post will describe how to isolate each run in a dedicated virtual Kubernetes cluster

Unit testing vs. integration testing

I wrote the book Integration Testing from the Trenches. In there, I defined Integration testing as:

Integration Testing is a strategy to test the collaboration of at least two components.

I translated it into OOP terms as:

Integration Testing is a strategy to test the collaboration of at least two classes.

I doubled down on the definition a couple of years later:

Let’s consider the making of a car. Single-class testing is akin to testing each nut and bolt separately. Imagine testing of such components brought no issue to light. Still, it would be very risky to mass manufacture the car without having built a prototype and sent it to a test drive.

However, technology has evolved since that time.

Testcontainers

I use the word "technology" very generally, but I have Testcontainers in mind:

Unit tests with real dependencies

Testcontainers is an open source library for providing throwaway, lightweight instances of databases, message brokers, web browsers, or just about anything that can run in a Docker container.

In effect, Testcontainers replaces mocks with "real" containerized dependencies. It's a real game-changer: instead of painfully writing mocking code to stub dependencies, just set them up regularly.

For example, without Testcontainers, you'd need to provide mocks for your data access objects in tests; with it, you only need to start a database container, and off you go.

At the time, the cost of having a local Docker daemon in your testing environment offset many benefits. It's not the case anymore, as Docker daemons are available (nearly) everywhere.

My definition of Integration Testing has changed a bit:

Integration Testing is testing that requires significant setup.

The definition is vague on purpose, as significance has a different meaning depending on the organization, the team, and the individual. Note that Google defines two categories of tests: fast and slow. Their definition is equally vague, meant to adapt to different contexts.

In any case, the golden rule still applies: the closer you are to the final environment, the more risks you cover and the more valuable your tests are. If our target production environment is Kubernetes, we will reap the most benefits from running the app on Kubernetes and testing it as a black box. It doesn't mean that white box testing in a more distant environment is not beneficial; it means that the more significant the gap between the testing environment and the target environment, the fewer issues we will uncover.

For the purposes of this blog post, we will use GitHub as the base testing environment for unit testing and a full-fledged Kubernetes cluster for integration testing. There is no absolute truth regarding what is the best practice™, as contexts vary widely across organizations and even across teams within the same organization. It's up to every engineer to decide within their specific context the ROI of setting up such an environment because the closer you are to production, the more complex and, thus, expensive it will be.

Use-case: application with database

Let's jump into how to test an app that uses a database to store its data. I don't want anything fancy, just solid, standard engineering practices. I'll be using a CRUD JVM-based app, but most of the following can easily apply to other stacks as well. The following blog posts will involve less language-specific content.

Here are the details:

Kotlin, because I love the language
Spring Boot: it's the most widespread framework for JVM-based applications
Maven: there's nothing else
Project Reactor and coroutines, because they make things more interesting
PostgreSQL: at the moment, it's a very popular database, and it's well-supported by Spring
Flyway

If you don't know Flyway, it allows you to track database schemas and data in a code repository and manage changes, known as migrations, between versions. Each migration has a unique version, e.g., v1.0, v1.1, v2.1.2, etc. Flyway tries to apply migrations in order. If it has already applied a migration, it skips it. Flyway stores its data in a dedicated table to track the applied migrations.

This approach is a must-have; Liquibase is an alternative that follows the same principles.

Spring Boot fully integrates Flyway and Liquibase. When the app starts, the framework will kickstart them. If a pod is killed and restarted, Flyway will first check the migrations table to apply only the migrations that didn't run previously.
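
The skip-if-already-applied behavior can be pictured with a deliberately simplified model. This is a sketch, not Flyway's implementation: the class MigrationModel is hypothetical, an in-memory set stands in for the history table, and checksums, out-of-order handling, and failure states are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class MigrationModel {
    // Compare dotted versions numerically, so that 1.1 < 1.10 < 2.1.2.
    static final Comparator<String> BY_VERSION = (a, b) -> {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    };

    // Stand-in for the table that records applied migrations.
    final Set<String> applied = new TreeSet<>(BY_VERSION);

    // Applies pending migrations in version order and returns the ones that ran.
    List<String> migrate(List<String> available) {
        List<String> sorted = new ArrayList<>(available);
        sorted.sort(BY_VERSION);
        List<String> ran = new ArrayList<>();
        for (String version : sorted) {
            // add() returns false when the version is already recorded.
            if (applied.add(version)) {
                ran.add(version);
            }
        }
        return ran;
    }

    public static void main(String[] args) {
        MigrationModel db = new MigrationModel();
        System.out.println(db.migrate(Arrays.asList("1.1", "1.0")));          // [1.0, 1.1]
        // Simulated restart: only the new migration runs.
        System.out.println(db.migrate(Arrays.asList("1.0", "1.1", "2.1.2"))); // [2.1.2]
    }
}
```

The second call plays the role of the restarted pod: the history already contains 1.0 and 1.1, so only 2.1.2 is applied.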

I don't want to bore you with the app details; you can find the code at GitHub.

"Unit" testing

Per my definition above, unit testing should be easy to set up. With Testcontainers, it is.

The testing code counts the number of items in a table, inserts a new item, and counts the number of items again. It then checks that:

There's one additional item compared to the initial count
The new item is the one we inserted

@SpringBootTest                                                              //1
class VClusterPipelineTest @Autowired constructor(private val repository: ProductRepository) { //2

    @Test
    fun `When inserting a new Product, there should be one more Product in the database and the last inserted Product should be the one inserted`() { //3
        runBlocking {                                                        //4
            val initialCount = repository.count()                            //5
            // The rest of the test
        }
    }
}

1. Initialize the Spring context
2. Inject the repository
3. Praise Kotlin for allowing descriptive function names
4. Run non-blocking code in a blocking function
5. Use the repository

We now need a PostgreSQL database; Testcontainers can provide one for us. However, to avoid conflicts, it will choose a random port until it finds an unused one. We need this port to connect to the database, run the Flyway migration, and run the testing code.

For this reason, we must write a bit of additional code:

@Profile("local")                                                              //1
class TestContainerConfig {

    companion object {
        val name = "test"
        val userName = "test"
        val pass = "test"
        val postgres = PostgreSQLContainer("postgres:17.2").apply {   //1
            withDatabaseName(name)
            withUsername(userName)
            withPassword(pass)
            start()
        }
    }
}

class TestContainerInitializer : ApplicationContextInitializer<ConfigurableApplicationContext> {
    override fun initialize(applicationContext: ConfigurableApplicationContext) {
        if (applicationContext.environment.activeProfiles.contains("local")) {
            TestPropertyValues.of(                                             //2
                "spring.r2dbc.url=r2dbc:postgresql://${TestContainerConfig.postgres.host}:${TestContainerConfig.postgres.firstMappedPort}/${TestContainerConfig.name}",
                "spring.r2dbc.username=${TestContainerConfig.userName}",
                "spring.r2dbc.password=${TestContainerConfig.pass}",
                "spring.flyway.url=jdbc:postgresql://${TestContainerConfig.postgres.host}:${TestContainerConfig.postgres.firstMappedPort}/${TestContainerConfig.name}",
                "spring.flyway.user=${TestContainerConfig.userName}",
                "spring.flyway.password=${TestContainerConfig.pass}"
            ).applyTo(applicationContext.environment)
        }
    }
}

1. Start the container, but only if the Spring Boot profile local is active
2. Override the configuration values

We don't need to specify spring.flyway.user or spring.flyway.password if we hack the application.yaml to reuse the R2DBC parameters of the same name:

spring:
  application:
    name: vcluster-pipeline
  r2dbc:
    username: test
    password: test
    url: r2dbc:postgresql://localhost:8082/flyway-test-db
  flyway:
    user: ${SPRING_R2DBC_USERNAME}                                             #1
    password: ${SPRING_R2DBC_PASSWORD}                                         #1
    url: jdbc:postgresql://localhost:8082/flyway-test-db

1. Smart hack to DRY configuration further down

We also annotate the previous test class to use the initializer:

@SpringBootTest
@ContextConfiguration(initializers = [TestContainerInitializer::class])
class VClusterPipelineTest @Autowired constructor(private val repository: ProductRepository) {

    // No change
}

Spring Boot offers a couple of options to activate profiles. For local development, we can use a simple JVM property, e.g., mvn test -Dspring.profiles.active=local; in the CI pipeline, we will use environment variables instead.
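
Spring Boot's relaxed binding maps a property name to an environment variable by replacing dots with underscores, removing dashes, and converting to uppercase. The helper below is a hypothetical illustration of that naming rule, not a Spring API:

```java
import java.util.Locale;

class EnvVarNames {
    // Canonical form per Spring Boot's relaxed binding rules:
    // replace dots with underscores, remove dashes, convert to uppercase.
    static String toEnvVar(String property) {
        return property.replace('.', '_').replace("-", "").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toEnvVar("spring.profiles.active")); // SPRING_PROFILES_ACTIVE
        System.out.println(toEnvVar("spring.r2dbc.url"));       // SPRING_R2DBC_URL
    }
}
```

This is why a variable such as SPRING_PROFILES_ACTIVE in the CI environment has the same effect as -Dspring.profiles.active on the command line.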

"Integration" testing

I'll also use Flyway to create the database structure for integration testing. In the scope of this example, the System Under Test will be the entire app; hence, I'll test from the HTTP endpoints. It's end-to-end testing for APIs. The code will test the same behavior, albeit treating the SUT as a black box.

class VClusterPipelineIT {

    val logger = LoggerFactory.getLogger(this::class.java)

    @Test
    fun `When inserting a new Product, there should be one more Product in the database and the last inserted Product should be the one inserted`() {

        val baseUrl = System.getenv("APP_BASE_URL") ?: "http://localhost:8080" //1

        logger.info("Using base URL: $baseUrl")

        val client = WebTestClient.bindToServer()                              //2
            .baseUrl(baseUrl)
            .build()

        val initialResponse: EntityExchangeResult<List<Product?>?> = client.get() //3
            .uri("/products")
            .exchange()
            .expectStatus().isOk
            .expectBodyList(Product::class.java)
            .returnResult()

        val initialCount = initialResponse.responseBody?.size?.toLong()        //4

        val now = LocalDateTime.now()
        val product = Product(
            id = UUID.randomUUID(),
            name = "My awesome product",
            description = "Really awesome product",
            price = 100.0,
            createdAt = now
        )

        client.post()                                                          //5
            .uri("/products")
            .bodyValue(product)
            .exchange()
            .expectStatus().isOk
            .expectBody(Product::class.java)

        client.get()                                                           //6
            .uri("/products")
            .exchange()
            .expectStatus().isOk
            .expectBodyList(Product::class.java)
            .hasSize((initialCount!! + 1).toInt())
    }
}

1. Get the deployed app URL
2. Create a web client that uses the former
3. Get the initial item list
4. Get the size; we definitely should offer a count function if there are too many items
5. Insert a new item and assert everything works out fine
6. Get the list of items and assert the item count is higher by one

Before going further, let's run the tests in a GitHub workflow.

The GitHub workflow

I'll assume you're familiar with GitHub workflows. If you aren't, a GitHub workflow is a declarative description of an automated job. A job consists of several steps. GitHub offers several triggers: manual, scheduled, or depending on an event.

We want the workflow to run on each Pull Request to verify that tests run as expected.

name: Test on PR                                                               #1

on:
  pull_request:
    branches: [ "master" ]                                                     #2

1. Set a descriptive name
2. Trigger on a PR to the master branch

The first steps are pretty standard:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Install JRE
        uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: 21
          cache: maven                                                         #1

The setup-java action includes a caching option for build tools. Here, it will cache dependencies across runs, speeding up consecutive runs. Unless you have good reasons not to, I recommend using this option.

For the same reason, we should cache our built artifacts. While researching for this post, I learned that GitHub discards them across runs and across steps in the same run. Hence, we can speed up the runs by caching them explicitly:

      - name: Cache build artifacts
        uses: actions/cache@v4                                                 #1
        with:
          path: target
          key: ${{ runner.os }}-build-${{ github.sha }}                        #2
          restore-keys:
            ${{ runner.os }}-build                                             #3

1. Use the same action that actions/setup-java uses under the hood
2. Compute the cache key. In our case, the runner.os should be immutable, but this should be how you run matrices across different operating systems.
3. Reuse the cache if it's the same OS
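
The interplay between key and restore-keys can be sketched as a lookup: an exact match on the key wins; otherwise the most recently created cache whose key starts with a restore key is used. The model below is a simplified illustration in Java, not the cache action's code. The class name and the sample keys are made up, and the scoping of caches by branch is ignored.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class CacheLookupModel {
    // Saved caches in creation order (oldest first): key -> description of the content.
    final Map<String, String> caches = new LinkedHashMap<>();

    Optional<String> restore(String key, List<String> restoreKeys) {
        // 1. Exact match on the full key.
        if (caches.containsKey(key)) {
            return Optional.of(caches.get(key));
        }
        // 2. Otherwise, try each restore key as a prefix and keep the newest match.
        for (String prefix : restoreKeys) {
            String newest = null;
            for (Map.Entry<String, String> entry : caches.entrySet()) {
                if (entry.getKey().startsWith(prefix)) {
                    newest = entry.getValue(); // later entries were created later
                }
            }
            if (newest != null) {
                return Optional.of(newest);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        CacheLookupModel model = new CacheLookupModel();
        model.caches.put("Linux-build-aaa111", "target from commit aaa111");
        model.caches.put("Linux-build-bbb222", "target from commit bbb222");
        // A new commit has no exact match, so the newest "Linux-build" cache is reused.
        System.out.println(model.restore("Linux-build-ccc333", Arrays.asList("Linux-build"))
                .orElse("cache miss")); // target from commit bbb222
    }
}
```

Because the key contains github.sha, every new commit misses the exact match and falls back to the prefix, which is what makes the restore-keys entry useful here.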
```
      - name: Run "unit" tests
        run: ./mvnw -B test
        env:
          SPRING_PROFILES_ACTIVE: local                                        #1
```
1. Activate the local profile. The workflow's environment provides a Docker daemon. Hence, Testcontainers successfully downloads and runs the database container.

At this point, we should run the integration test. Yet, we need the app deployed to run this test. For this, we need available infrastructure.

Alternative "Unit testing" on GitHub

The above works perfectly on GitHub, but we can move closer to the deployment setup by leveraging GitHub service containers. Let's migrate PostgreSQL from Testcontainers to a GitHub service container.

Removing Testcontainers is pretty straightforward: we do not activate the local profile.

Using GitHub's service container requires an additional section in our workflow:
```
jobs:
  build:
    runs-on: ubuntu-latest
    env:
      GH_PG_USER: testuser                                                     #1
      GH_PG_PASSWORD: testpassword                                             #1
      GH_PG_DB: testdb                                                         #1
    services:
      postgres:
        image: postgres:15
        options: >-                                                            #2
          --health-cmd "pg_isready -U $POSTGRES_USER"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
            - 5432/tcp                                                         #3
        env:
          POSTGRES_USER: ${{ env.GH_PG_USER }}                                 #4
          POSTGRES_PASSWORD: ${{ env.GH_PG_PASSWORD }}                         #4
          POSTGRES_DB: ${{ env.GH_PG_DB }}                                     #4
```
1. Define environment variables at the job level to use them across steps. You can use secrets, but in this case, the database instance is not exposed outside the workflow and will be switched off when the latter finishes. Environment variables are good enough to avoid adding unnecessary secrets.
2. Make sure that PostgreSQL works before going further
3. Assign a random port and map it to the underlying 5432 port
4. Use the environment variables

Running the tests with the above configuration is straightforward:
```
- name: Run "unit" tests
  run: ./mvnw -B test
  env:
    SPRING_FLYWAY_URL: jdbc:postgresql://localhost:${{ job.services.postgres.ports['5432'] }}/${{ env.GH_PG_DB }} #1
    SPRING_R2DBC_URL: r2dbc:postgresql://localhost:${{ job.services.postgres.ports['5432'] }}/${{ env.GH_PG_DB }} #1
    SPRING_R2DBC_USERNAME: ${{ env.GH_PG_USER }}
    SPRING_R2DBC_PASSWORD: ${{ env.GH_PG_PASSWORD }}
```
1. GitHub runs PostgreSQL on a local Docker daemon, so the host is localhost. We can get the random port with the ${{ job.services.postgres.ports['5432'] }} syntax.

For more information on job.services.<service_id>, please check the GitHub documentation.

Conclusion

In this article, we laid the ground for a simple app's unit- and integration-testing, leveraging Testcontainers in the local environment. We then proceeded to automate unit testing via a GitHub workflow with the help of GitHub service containers. In the next post, we will prepare the Kubernetes environment on a Cloud provider infrastructure, build the image, and deploy it to the latter.

The complete source code for this post can be found on GitHub.

Originally published on A Java Geek on February 9th, 2025

The post Pull request testing on Kubernetes: testing locally and on GitHub workflows appeared first on foojay.
