foojay – a place for friends of OpenJDK

AWS Nitro and CPU Graviton Meets Unikernels

Angelo Rubini — Fri, 10 Apr 2026 16:26:25 +0000

Table of Contents

From Virtual Machines and Containers to Unikernels
Proof of Concept Overview

Reproducibility and Artifacts
Local Build and Image Creation
Instance Creation on AWS
PoC Environment

Architectural Diagram of the PoC
Containers vs Unikernels: A Stack Comparison

Container Stack
Unikernel Stack

Quarkus, Semeru, and Nanos on AWS Nitro Graviton
AWS Nitro: Cloud-Native Capabilities Without Kubernetes
Hypervisor Independence
Why This Matters for Java and Jakarta EE
Conclusion

The key message of this article is simple and strong: any Java or Jakarta EE application is already ready to benefit from the advantages of unikernels.

Java and Jakarta EE, including modern frameworks such as Quarkus, can immediately take advantage of the unikernel model without waiting for new languages, new runtimes, or radical rewrites.

With Nanos Unikernel, Java applications run unchanged: the JVM is not modified, the application is not rewritten, and the well-known Java promise of write once, run anywhere remains fully valid — even in a unikernel environment.

This represents a major shift: unikernels are no longer a research topic or a niche experiment, but a first-class deployment target for enterprise Java workloads.

From Virtual Machines and Containers to Unikernels

Cloud-native Java applications have traditionally been deployed on virtual machines and, more recently, inside containers orchestrated by Kubernetes. While containers brought improvements in portability and density, they also introduced additional layers of software complexity.

Unikernels remove these layers by compiling the application and its required runtime components into a single-purpose, minimal image that runs directly on a hypervisor.

With Nanos Unikernel, this model becomes practical for Java and Jakarta EE workloads.

Proof of Concept Overview

Reproducibility and Artifacts

For this Proof of Concept, all the build and deployment steps are fully reproducible.

The artifacts used are publicly available and were generated by the GitHub Actions workflow:

Repository and CI run: https://github.com/AngeloRubens/quarkus-faces-nanos/actions/runs/21218011985

These artifacts include the packaged Quarkus application and the IBM Semeru Runtime bundled for execution on Nanos Unikernel (ARM64).

Local Build and Image Creation

In the local environment, the Nanos unikernel image for AWS ARM64 was created using the following command:

ops image create \
  --imagename quarkusNanosArm64 \
  --package AngeloRubens/SemeruJREarm64Linux:25.0.1 \
  --arch arm64 \
  --nightly \
  -c configAws.json \
  -t aws

This step produces a bootable Nanos unikernel image containing:

the Quarkus application
the unmodified IBM Semeru Runtime (OpenJ9)
the minimal Nanos kernel required to run on ARM64

Instance Creation on AWS

Once the image was created and uploaded, the EC2 instance was launched directly from the unikernel image using:

ops instance create \
  quarkusNanosArm64 \
  --arch arm64 \
  -c configAws.json \
  -t aws

This workflow highlights a key point of the article: Java applications can be deployed as unikernels without introducing container images, container runtimes, or orchestration layers.

This article builds on the original Foojay article, where the application image was executed on Oracle Cloud Infrastructure (OCI). The goal of this Proof of Concept is to demonstrate that the same Nanos Unikernel image can be deployed on a different cloud provider and hypervisor without modification.

PoC Environment

Cloud provider: AWS
Instance type: t4g.small (ARM64)
CPU: AWS Graviton2
Hypervisor: AWS Nitro
Unikernel: Nanos (ARM64)
Java runtime: IBM Semeru Runtime 25 for Arm64
Framework: Quarkus

This demonstrates that Nanos unikernels are hypervisor-agnostic and cloud-independent.

Architectural Diagram of the PoC

The application runs as a single unikernel image directly on top of the AWS Nitro hypervisor, without a guest operating system, container runtime, or Kubernetes node.

Containers vs Unikernels: A Stack Comparison

Container Stack

Hardware
Hypervisor
Guest Operating System
Container Runtime
Container Image
Java Runtime
Application

Unikernel Stack

Hardware
Hypervisor
Unikernel (Application + Runtime)

By removing unnecessary layers, unikernels reduce boot time, memory footprint, and attack surface.

Quarkus, Semeru, and Nanos on AWS Nitro Graviton

Quarkus is particularly well suited for this model thanks to its fast startup and low memory usage, while IBM Semeru provides a production-grade OpenJDK runtime. Combined with Nanos, the result is a highly efficient Java unikernel.

AWS Nitro: Cloud-Native Capabilities Without Kubernetes

AWS Nitro and the EC2 platform around it, like other modern cloud hypervisor stacks, already provide many of the foundational capabilities often associated with Kubernetes:

Strong isolation and security boundaries
Elastic scaling at the VM level
High-performance networking and storage
Hardware offloading for I/O and virtualization
Integrated observability and telemetry
Native IAM integration

Because these features are built directly into the cloud infrastructure, many workloads have no technical need for further orchestration layers.

Running Java applications as unikernels allows teams to:

Exploit hardware more efficiently
Reduce operational complexity
Lower infrastructure and operational costs

Hypervisor Independence

In the original article, the application image runs on the Oracle OCI hypervisor. In this Proof of Concept, the same image runs on AWS Nitro.

This confirms that Nanos unikernels are not tied to a specific cloud provider or hypervisor. The same Java application can be deployed consistently across environments.

Why This Matters for Java and Jakarta EE

Java and Jakarta EE can benefit today from unikernel advantages:

Faster boot times
Smaller memory footprint
Reduced attack surface
Simpler deployment model

Most importantly, this comes without changing existing applications or the JVM.

Unikernels with Nanos represent an evolutionary step, not a disruptive rewrite, for the Java ecosystem.

Conclusion

The cloud is evolving, and unikernels are becoming a practical deployment option for real-world workloads.

With Nanos Unikernel, Java, Jakarta EE, Quarkus, and IBM Semeru Runtime can fully exploit modern ARM64 cloud platforms such as AWS Graviton2 and the Nitro hypervisor — today, without compromise.

Reference:

The PoC is available on an AWS Free Tier t4g.small instance until 31-Dec-2026: http://3.127.237.11:8080/index.xhtml

Foojay link: https://foojay.io/today/java-jakarta-ee-and-the-evolution-of-the-cloud-with-nanos-unikernel/

NanoVMs link: https://nanovms.com/

Semeru JRE package in the OPS/Nanos repository: https://repo.ops.city/v2/packages/AngeloRubens/SemeruJREarm64Linux/25.0.1/arm64/show

The post AWS Nitro and CPU Graviton Meets Unikernels appeared first on foojay.

Thread-Safe Native Memory in Java: VarHandle Access Modes Explained

David Vlijmincx — Tue, 07 Apr 2026 10:00:43 +0000

Table of Contents

What is Memory Order and Why Does It Matter for Native Memory?

Why do you need all of this?

Testing it using JCStress
Plain Access (Get/Set)
Opaque Access
Acquire/Release
Volatile
TL;DR
Conclusion
Bonus: Word Tearing

What is Memory Order and Why Does It Matter for Native Memory?

The Foreign Function and Memory (FFM) API is Java's way of interacting with native code and memory. In the previous post, you learned how to do so using Java's built-in Arena types. The Arena provides temporal safety and bounds checks, but what about thread safety? MemorySegments allocated from Arena.ofShared(), Arena.ofAuto(), and Arena.global() can be used by multiple threads at the same time. Accessing them through a VarHandle with plain get/set can backfire unless you add something like locking, and locks are slow and heavy. So let us take a look at a more granular, hardware-aware approach: using VarHandle access modes.
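To make the distinction between the arena types concrete, here is a minimal sketch (assuming Java 22 or later, where the FFM API is final). The class and method names are made up for illustration. It asks a second thread whether it may access a segment, using MemorySegment.isAccessibleBy, and compares the three arenas named above with a confined arena.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

public class ArenaSharing {

    // Asks, from a freshly started thread, whether that thread may access the segment.
    static boolean accessibleFromAnotherThread(MemorySegment segment) {
        boolean[] result = new boolean[1];
        Thread probe = new Thread(() -> result[0] = segment.isAccessibleBy(Thread.currentThread()));
        probe.start();
        try {
            probe.join(); // join() makes the probe's write to result[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    static boolean sharedIsAccessible() {
        try (Arena arena = Arena.ofShared()) {
            return accessibleFromAnotherThread(arena.allocate(8));
        }
    }

    static boolean autoIsAccessible() {
        return accessibleFromAnotherThread(Arena.ofAuto().allocate(8));
    }

    static boolean globalIsAccessible() {
        return accessibleFromAnotherThread(Arena.global().allocate(8));
    }

    static boolean confinedIsAccessible() {
        try (Arena arena = Arena.ofConfined()) {
            return accessibleFromAnotherThread(arena.allocate(8));
        }
    }

    public static void main(String[] args) {
        System.out.println("ofShared:   " + sharedIsAccessible());
        System.out.println("ofAuto:     " + autoIsAccessible());
        System.out.println("global:     " + globalIsAccessible());
        System.out.println("ofConfined: " + confinedIsAccessible());
    }
}
```

Only the confined arena ties its segments to the creating thread. The other three allow concurrent access, which is exactly where the access modes discussed in this post start to matter.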

Why do you need all of this?

When you write concurrent code, you rely on the hardware to keep things in sync. Different CPU architectures handle memory ordering differently. On x86, the memory model is relatively strong: reads and writes are mostly kept in order, so you can often get away with loose synchronization. ARM, however, has a weak memory model, and the CPU is free to reorder reads and writes aggressively to optimize performance. If you write code assuming x86's stricter ordering and run it on an ARM processor (like Apple Silicon or AWS Graviton), your application can break in unpredictable ways. VarHandle access modes let you state the ordering you need, so that your code works everywhere.

To see exactly how these mechanics work, we will start with the least restrictive access mode and build our way up to a full memory fence. But before we do that, I want to show you how to actually test this.

Testing it using JCStress

Java Concurrency Stress (JCStress) is an experimental harness that helps you test the correctness of your concurrent code. It runs your test actors concurrently against the same shared state and records every outcome it observes, so you can see how reordering and optimization affected the state. To provoke different behaviors, it runs each actor under a different compilation mode (interpreter, C1, or C2) and tests every combination across the actors. With two actors, that's nine combinations per run.

Creating these tests requires a bit of a different mindset. Normally, you want two threads to play nicely. Inside a JCStress test, you want them to clash as often as possible to observe the possible states your code could end up in. This gets kind of confusing, so let's say you have two threads running the following code:

synchronized (lock) {
    // What you want to test
}

If you used this with JCStress, the threads would basically run one after the other. Of course it'll work, but it doesn't prove anything. So in the examples to come, keep in mind that we want the threads to interleave and hammer the shared state to see what happens, just as they would in the real world. Another tip when using JCStress is to not test too much inside a single test, because you end up with a big state space with lots of possible outcomes. To keep the tests fast and snappy, focus each test on one synchronization or thread-interleaving problem.

So what does the output look like? Like this:

  RESULT      SAMPLES     FREQ       EXPECT  DESCRIPTION
      -1  104,287,516   37.93%   Acceptable  Ready flag not seen yet.
       0        1,364   <0.01%  Interesting  Visibility failure: saw ready flag but missed the data.
      42  170,677,819   62.07%   Acceptable  Data seen correctly.

It shows the sampling results and how often they were encountered. The developer sets the expectations and descriptions, so they depend on the case.

Plain Access (Get/Set)

Plain access is the simplest mode there is. No rules and no fences! It works like any other read or assignment that you are used to in Java, like var x = 1. Get and set work the same way for MemorySegments: they simply read and write a value. This is perfectly fine if you are working inside a single thread and don't share your state with other threads. In this mode the compilers, CPU, and cache are allowed to optimize your code and reorder the instructions, as long as the end result looks like it executed your code as you wrote it. This illusion holds as long as you don't create race conditions using multiple threads. So what does this look like? Let's break this illusion with two threads and JCStress. The next example has a shared MemorySegment that is used to communicate a ready flag and some data. One thread sets the data, and the other thread reads the result.

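import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.VarHandle;

import org.openjdk.jcstress.annotations.Actor;
import org.openjdk.jcstress.annotations.Expect;
import org.openjdk.jcstress.annotations.JCStressTest;
import org.openjdk.jcstress.annotations.Outcome;
import org.openjdk.jcstress.annotations.State;
import org.openjdk.jcstress.infra.results.I_Result;

@JCStressTest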
@Outcome(id = "42", expect = Expect.ACCEPTABLE, desc = "Data seen correctly.")
@Outcome(id = "-1", expect = Expect.ACCEPTABLE, desc = "Ready flag not seen yet.")
@Outcome(id = "0", expect = Expect.ACCEPTABLE_INTERESTING, desc = "Saw ready flag but missed the data.")
@State
public class NativeMemoryPlainAccess {

    private final MemorySegment segment;
    private static final VarHandle VH_INT = ValueLayout.JAVA_INT.varHandle();

    public NativeMemoryPlainAccess() {
        this.segment = Arena.ofAuto().allocate(8);
    }

    @Actor
    public void actor1() {
        VH_INT.set(segment, 0L, 42);
        VH_INT.set(segment, 4L, 1);
    }

    @Actor
    public void actor2(I_Result r) {
        int ready = (int) VH_INT.get(segment, 4L);
        if (ready == 1) {
            r.r1 = (int) VH_INT.get(segment, 0L);
        } else {
            r.r1 = -1;
        }
    }
}

These threads are passing a message: one thread sets some data, and the other thread reads it. There is no synchronization or fence inside this example, so everything is free to be reordered, which introduces race conditions. This table shows all the different states observed while running the code:

  RESULT      SAMPLES     FREQ       EXPECT  DESCRIPTION
      -1  104,287,516   37.93%   Acceptable  Ready flag not seen yet.
       0        1,364   <0.01%  Interesting  Visibility failure: saw ready flag but missed the data.
      42  170,677,819   62.07%   Acceptable  Data seen correctly.

JCStress ran the code using different combinations of compilers (interpreter, C1, C2), and as you can see we got three different outcomes. Some of the time the ready flag was set and the reader got the value 42; other times the flag wasn't set yet. Both of these are correct states. But 0 is an interesting state... It means that the flag was seen as set, but the data wasn't there yet. The accesses got reordered! This is not a correct state to be in, as 0 shouldn't be possible, right? To fix this issue, we need Acquire/Release, but let's look at Opaque first as it is the next mode in the hierarchy.

Opaque Access

Opaque is the odd one out. Opaque doesn't insert memory fences and provides no ordering guarantees between different variables. What it does provide is: bitwise atomicity (no word tearing), coherence (all threads see writes to the same variable in the same order), and progress (writes will eventually become visible). It also prevents the compiler from eliminating access to that specific variable. This is handy for liveness checks, for example. Let's say you have two threads: a worker thread runs a while loop until it gets the signal to stop, and a controller thread is in control of this signal. Without Opaque, the compiler is allowed to hoist the read out of the loop and turn it into a while(true), so the worker would never stop. JCStress is not really made for this specific scenario, so let's look at another example instead. In the example, actor1 writes 1 and then 2 to the same place in the MemorySegment, and actor2 does two reads to see the intermediate and end results. Again, the goal is to make the threads clash as often as possible.

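import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.VarHandle;

import org.openjdk.jcstress.annotations.Actor;
import org.openjdk.jcstress.annotations.Expect;
import org.openjdk.jcstress.annotations.JCStressTest;
import org.openjdk.jcstress.annotations.Outcome;
import org.openjdk.jcstress.annotations.State;
import org.openjdk.jcstress.infra.results.II_Result;

@JCStressTest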
@Outcome(id = "1, 2", expect = Expect.ACCEPTABLE_INTERESTING, desc = "Observed intermediate state reliably.")
@Outcome(expect = Expect.ACCEPTABLE, desc = "Other observable states (0,0 / 2,2 / 0,2).")
@State
public class OpaqueNativeOpaqueAccess {

    private final MemorySegment segment;
    private static final VarHandle VH_INT = ValueLayout.JAVA_INT.varHandle();

    public OpaqueNativeOpaqueAccess() {
        this.segment = Arena.ofAuto().allocate(4);
    }

    @Actor
    public void actor1() {
        VH_INT.setOpaque(segment, 0L, 1);
        VH_INT.setOpaque(segment, 0L, 2);
    }

    @Actor
    public void actor2(II_Result r) {
        r.r1 = (int) VH_INT.getOpaque(segment, 0L);
        r.r2 = (int) VH_INT.getOpaque(segment, 0L);
    }
}

The results show that even though Opaque prevents extreme compiler optimizations, it does not guarantee immediate visibility across threads. The vast majority of the time, the second actor sees either the initial state (0, 0) or the final state (2, 2). However, we also observe intermediate states like (1, 2) or ordered reads like (0, 2). Because there are no ordering constraints or memory fences, the CPU and caches can still delay when the writes from actor1 become visible to actor2. The presence of (1, 2) confirms that the intermediate write of 1 is occasionally caught in transit.

  --- OPAQUE ACCESS ---
  RESULT      SAMPLES     FREQ       EXPECT  DESCRIPTION
    0, 0  114,615,126   41.45%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    0, 1       73,653    0.03%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    0, 2      577,951    0.21%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    1, 1      283,707    0.10%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    1, 2      114,550    0.04%  Interesting  Observed intermediate state reliably.
    2, 2  160,817,232   58.17%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).

  --- PLAIN ACCESS ---
  RESULT      SAMPLES     FREQ       EXPECT  DESCRIPTION
    0, 0  125,004,639   45.55%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    0, 1       38,798    0.01%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    0, 2      311,919    0.11%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    1, 1      362,391    0.13%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    1, 2       68,803    0.03%  Interesting  Observed intermediate state.
    2, 2  148,668,149   54.17%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).

When the C2 compiler steps in, it optimizes the code heavily. With Plain access, C2 often optimizes away the intermediate write entirely, assuming it's redundant since the final value is 2. This is why you see almost zero (1, 2) results in the Plain Access C2 table. Opaque access, however, explicitly forbids the compiler from removing that intermediate write. Consequently, the C2 table for Opaque still shows a noticeable number of (1, 2) results. The compiler was forced to keep both writes, and the hardware's lack of fencing allowed the intermediate state to be observed.

  --- OPAQUE ACCESS C2 ---
  RESULT     SAMPLES     FREQ       EXPECT  DESCRIPTION
    0, 0   9,721,273   34.26%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    0, 1         677   <0.01%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    0, 2      32,666    0.12%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    1, 1      17,747    0.06%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    1, 2         561   <0.01%  Interesting  Observed intermediate state reliably.
    2, 2  18,600,087   65.56%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).

  --- PLAIN ACCESS C2 ---
  RESULT     SAMPLES     FREQ       EXPECT  DESCRIPTION
    0, 0  14,047,744   50.31%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    0, 1           2   <0.01%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    0, 2         106   <0.01%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    1, 1           1   <0.01%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).
    1, 2           4   <0.01%  Interesting  Observed intermediate state.
    2, 2  13,874,594   49.69%   Acceptable  Other observable states (0,0 / 2,2 / 0,2).

So in the end, Opaque is the combination of:

Plain access: the get and set from the section above.
Access atomicity: reads and writes happen as a single, indivisible unit. No word tearing, even for 64-bit types like long and double.
Coherence: writes to the same variable are observed in the same order by all observers.
Progress: writes will eventually become visible.

Opaque is useful in specific scenarios but too weak for most concurrent patterns. One example that would be a good fit is a variable that you want to broadcast to other reader(s). A counter that is owned by one thread and collected by other threads would be an example of this.
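The single-writer counter mentioned above can be sketched outside JCStress as follows. This is an illustration under assumptions, not code from the tests in this post: the class name and the run method are invented, and it assumes Java 22 or later. One thread owns the counter and publishes each value with setOpaque; a reader polls with getOpaque until it sees the target value.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.VarHandle;

public class OpaqueCounter {

    private static final VarHandle VH_LONG = ValueLayout.JAVA_LONG.varHandle();

    // The writer publishes 1..target; the reader spins until it observes target.
    // Returns the last value the reader saw.
    static long run(long target) {
        // Allocating from the layout gives an 8-byte aligned segment.
        MemorySegment counter = Arena.ofAuto().allocate(ValueLayout.JAVA_LONG);
        long[] lastSeen = new long[1];

        Thread reader = new Thread(() -> {
            long seen;
            do {
                // Opaque read: the compiler may not hoist it out of the loop.
                seen = (long) VH_LONG.getOpaque(counter, 0L);
                Thread.onSpinWait();
            } while (seen < target);
            lastSeen[0] = seen;
        });
        Thread writer = new Thread(() -> {
            for (long i = 1; i <= target; i++) {
                // Opaque write: atomic and eventually visible, but unordered
                // relative to accesses to other variables.
                VH_LONG.setOpaque(counter, 0L, i);
            }
        });

        reader.start();
        writer.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return lastSeen[0];
    }

    public static void main(String[] args) {
        System.out.println("reader finished at " + run(1_000_000));
    }
}
```

The reader may skip many intermediate values, which is fine for a progress counter. What Opaque contributes here is that the reader never sees a torn long, never sees the counter go backwards, and eventually sees the final value. It would not be enough if the reader also had to see other data written before the counter update.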

Let's go one level deeper and see what happens when you add causality to the mix.

Acquire/Release

Acquire and Release include all of Opaque's guarantees and add a happens-before relationship. This makes the mode stricter than Opaque, but still lighter than volatile. Release and Acquire are two separate methods:

setRelease(): The compiler/CPU is not allowed to move a read or write instruction that happens before the Release to happen after it.
getAcquire(): All reads and writes after this point are guaranteed to see at least the data that was visible at the point of the corresponding setRelease(). The compiler/CPU is not allowed to move an instruction that happens after the Acquire to before it.

Let's see how these rules play out in the real world. In the following code, actor1 writes three values in succession to the same offset of a MemorySegment and then uses setRelease to set a flag signaling that the data is ready to be read. Actor2 checks the flag with getAcquire. When it reads a 1, it fetches the data from the segment.

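import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.VarHandle;

import org.openjdk.jcstress.annotations.Actor;
import org.openjdk.jcstress.annotations.JCStressTest;
import org.openjdk.jcstress.annotations.State;
import org.openjdk.jcstress.infra.results.I_Result;

@JCStressTest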
@State
public class HappensBeforeAndAfter {

    private final MemorySegment segment;
    private static final VarHandle VH_INT = ValueLayout.JAVA_INT.varHandle();

    public HappensBeforeAndAfter() {
        this.segment = Arena.ofAuto().allocate(64);
    }

    @Actor
    public void actor1() {
        VH_INT.set(segment, 0L, 1); // Plain writes — made visible by the setRelease below
        VH_INT.set(segment, 0L, 2);
        VH_INT.set(segment, 0L, 3);
        VH_INT.setRelease(segment, 12L, 1);
    }

    @Actor
    public void actor2(I_Result r) {
        int ready = (int) VH_INT.getAcquire(segment, 12L);
        if (ready == 1) {
            r.r1 = (int) VH_INT.get(segment, 0L);
        } else {
            r.r1 = -1;
        }
    }
}

When running this code with JCStress, I got the following results. Both results are valid: -1 just means that the flag wasn't set yet and there was no attempt to read the data, and 3 means that the data from the last write was read.

  RESULT      SAMPLES     FREQ
      -1  167,565,777   62.57%
       3  100,243,162   37.43%

Doing the same with just a plain set/get will result in the compiler and CPU reordering the code as there is no happens-before anymore. The result of running it with plain access is like this:

  RESULT      SAMPLES     FREQ
      -1  174,727,852   64.82%
       0        3,906   <0.01%
       1           75   <0.01%
       2           94   <0.01%
       3   94,828,052   35.18%

This isn't pretty when we want to read the data only when it is actually available. The values 0, 1, and 2 mean the ready flag appeared set before the final data was actually visible. This shows that Release/Acquire excels in cases like producer-consumer and message-passing designs.
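Outside of a stress test, the same handoff looks roughly like the sketch below. The class name, offsets, and handOff method are illustrative assumptions, and it assumes Java 22 or later. The producer writes the payload with a plain set and then publishes it with setRelease. The consumer spins on getAcquire and only then reads the payload with a plain get.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.VarHandle;

public class ReleaseAcquireHandoff {

    private static final VarHandle VH_INT = ValueLayout.JAVA_INT.varHandle();
    private static final long DATA_OFFSET = 0L;  // payload slot
    private static final long READY_OFFSET = 4L; // ready flag slot

    // Passes one int from a producer thread to a consumer thread and returns what the consumer read.
    static int handOff(int payload) {
        // 8 bytes, 4-byte aligned, zero-initialized: flag starts at 0.
        MemorySegment segment = Arena.ofAuto().allocate(8, 4);
        int[] received = new int[1];

        Thread consumer = new Thread(() -> {
            // Acquire: nothing after this read may be moved before it.
            while ((int) VH_INT.getAcquire(segment, READY_OFFSET) != 1) {
                Thread.onSpinWait();
            }
            received[0] = (int) VH_INT.get(segment, DATA_OFFSET); // plain read is safe here
        });
        Thread producer = new Thread(() -> {
            VH_INT.set(segment, DATA_OFFSET, payload);   // plain write of the data
            VH_INT.setRelease(segment, READY_OFFSET, 1); // release: publishes the write above
        });

        consumer.start();
        producer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return received[0];
    }

    public static void main(String[] args) {
        System.out.println(handOff(42)); // prints 42
    }
}
```

Because the consumer only reads the payload after its acquire read has seen the released flag, the plain write of the payload happens-before the plain read, and the stale value 0 cannot be returned.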

Volatile

This is the last and strictest mode, designed to enforce sequential consistency and a total order of operations across all threads. Volatile access prevents aggressive instruction reorderings, such as a newer read bypassing an older write, for example by ensuring that prior volatile stores are drained from the hardware's store buffer to the coherent cache before new loads are processed. While it doesn't freeze the CPU to provide instant, wall-clock visibility, it guarantees that all cores will eventually agree on the exact same sequence of events. Because every read observes the most recently written value within that agreed-upon memory order, volatile is the safest and most reliable mode for synchronizing state across multiple variables.

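import static java.lang.foreign.MemoryLayout.PathElement.groupElement;

import java.lang.foreign.Arena;
import java.lang.foreign.GroupLayout;
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.VarHandle;

import org.openjdk.jcstress.annotations.Actor;
import org.openjdk.jcstress.annotations.Expect;
import org.openjdk.jcstress.annotations.JCStressTest;
import org.openjdk.jcstress.annotations.Outcome;
import org.openjdk.jcstress.annotations.State;
import org.openjdk.jcstress.infra.results.II_Result;

@JCStressTest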
@Outcome(id = "0, 0", expect = Expect.FORBIDDEN, desc = "Read visibility failure")
@Outcome(id = "1, 1", expect = Expect.ACCEPTABLE, desc = "Data seen correctly.")
@Outcome(id = "1, 0", expect = Expect.ACCEPTABLE_INTERESTING, desc = "One actor got shuffled")
@Outcome(id = "0, 1", expect = Expect.ACCEPTABLE_INTERESTING, desc = "One actor got shuffled")
@State
public class NativeMemoryFullFence {

    public static final GroupLayout LAYOUT = MemoryLayout.structLayout(
            ValueLayout.JAVA_INT.withName("x"),
            ValueLayout.JAVA_INT.withName("y")
    );
    private static final VarHandle VH_X = LAYOUT.varHandle(groupElement("x"));
    private static final VarHandle VH_Y = LAYOUT.varHandle(groupElement("y"));

    private final MemorySegment segment;

    public NativeMemoryFullFence() {
        this.segment = Arena.ofAuto().allocate(LAYOUT);
        VH_X.set(segment, 0L, 0);
        VH_Y.set(segment, 0L, 0);
    }

    @Actor
    public void actor1(II_Result r) {
        VH_X.setVolatile(segment, 0L, 1);           // Store X
        r.r1 = (int) VH_Y.getVolatile(segment, 0L); // Load Y
    }

    @Actor
    public void actor2(II_Result r) {
        VH_Y.setVolatile(segment, 0L, 1);           // Store Y
        r.r2 = (int) VH_X.getVolatile(segment, 0L); // Load X
    }

}

The two actors are reading and writing to two different places inside the MemorySegment. By using volatile, the write is guaranteed to be fully visible to all threads before any subsequent operation in this thread proceeds. This is slower but guarantees all threads agree on the order of operations.

  RESULT      SAMPLES     FREQ       EXPECT  DESCRIPTION
    0, 0            0    0.00%    Forbidden  Read visibility failure
    0, 1  142,965,429   53.47%  Interesting  One actor got shuffled
    1, 0  123,397,941   46.15%  Interesting  One actor got shuffled
    1, 1      995,009    0.37%   Acceptable  Data seen correctly.

If a weaker model like Release/Acquire is used, the CPU does not wait for the prior write to be drained to the coherent cache before executing the next read to a different address. You fire the write action and continue directly with the next read. Because each CPU is its own environment with its own physical clock, they have no clue how time is progressing on other cores. This lack of a total order means you can end up with a cycle in your memory order, observing the forbidden 0, 0 state when synchronizing across two or more variables, as shown here:

  RESULT      SAMPLES     FREQ       EXPECT  DESCRIPTION
    0, 0      205,803    0.08%    Forbidden  Read visibility failure
    0, 1  123,619,023   47.36%  Interesting  One actor got shuffled
    1, 0  136,207,596   52.18%  Interesting  One actor got shuffled
    1, 1      977,157    0.37%   Acceptable  Data seen correctly.

Release/Acquire is fine for one-way handoffs, where a single flag publishes the data written before it. But when threads must agree on the order of writes to two or more independent variables, as in the store-then-load pattern above, it fails and you need the stronger volatile mode.

TL;DR

Just use Get/Set and Volatile and have a peaceful life. If that's really not enough and you really need this fine-grained control, maybe still consider using get/set and volatile. If I really can't convince you, then the other modes are great for those special cases where volatile causes too much of a performance issue.

Access Mode	Guarantees	Best Used For
Plain (Get/Set)	None. Freely reordered by compiler and CPU.	Single-threaded memory access, or when thread safety is handled by external locks.
Opaque	Bitwise atomic, no compiler elimination, no memory fences.	Liveness checks, counters, or flags where exact ordering doesn't matter.
Acquire/Release	Happens-before ordering. Prevents specific reorderings around the access.	Message passing, producer-consumer patterns, single-variable handoffs.
Volatile	Sequential consistency and total ordering. Stores are drained to the coherent cache.	Multi-variable state synchronization, critical shared state where a single total order agreed on by all threads is required.
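As a quick reference for the table above, this sketch maps each access mode to the VarHandle methods that implement it. The class name and the roundTrip method are illustrative, and it assumes Java 22 or later. It runs in a single thread, so it only shows the API shape and says nothing about ordering behavior.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.VarHandle;

public class AccessModes {

    private static final VarHandle VH_INT = ValueLayout.JAVA_INT.varHandle();

    // Writes one int per access mode into its own 4-byte slot and reads each back
    // with the matching read method.
    static int[] roundTrip() {
        // 16 bytes, 4-byte aligned: the non-plain modes require aligned access.
        MemorySegment segment = Arena.ofAuto().allocate(16, 4);

        VH_INT.set(segment, 0L, 1);          // Plain
        VH_INT.setOpaque(segment, 4L, 2);    // Opaque
        VH_INT.setRelease(segment, 8L, 3);   // Release (pairs with getAcquire)
        VH_INT.setVolatile(segment, 12L, 4); // Volatile

        return new int[] {
            (int) VH_INT.get(segment, 0L),
            (int) VH_INT.getOpaque(segment, 4L),
            (int) VH_INT.getAcquire(segment, 8L),
            (int) VH_INT.getVolatile(segment, 12L)
        };
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(roundTrip())); // prints [1, 2, 3, 4]
    }
}
```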

Conclusion

Working with native memory across multiple threads forces you to confront how hardware actually executes your code. While the FFM API provides a direct bridge to native memory, it doesn't shield you from CPU reordering or cache visibility issues. Plain access is perfectly fine for single-threaded tasks, but once you share memory segments, you need to apply the right VarHandle access modes. Volatile is the safest default, providing strict ordering at the cost of performance. If profiling indicates volatile is a bottleneck, you can step down to Acquire/Release or Opaque, but you take on the responsibility of managing the memory order yourself. Always test concurrent memory access thoroughly, as architectural differences between x86 and ARM will easily expose any flaws in your assumptions.

Bonus: Word Tearing

Word tearing occurs when a read or write operation on a piece of memory is not atomic. If you write a 64-bit value to unaligned memory, or on a 32-bit system, the CPU might execute it as two separate 32-bit operations. If another thread reads that memory in between those two operations, it will get half of the old value and half of the new value. Using Opaque on a naturally aligned value would prevent this from happening. For demonstration purposes, let's look at an example using unaligned memory access.

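import static java.lang.foreign.ValueLayout.JAVA_LONG;

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.invoke.VarHandle;

import org.openjdk.jcstress.annotations.Actor;
import org.openjdk.jcstress.annotations.Expect;
import org.openjdk.jcstress.annotations.JCStressTest;
import org.openjdk.jcstress.annotations.Outcome;
import org.openjdk.jcstress.annotations.State;
import org.openjdk.jcstress.infra.results.LL_Result;

@JCStressTest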
@Outcome(id = "0, 0", expect = Expect.ACCEPTABLE)
@Outcome(id = "0, 9223372036854775806", expect = Expect.ACCEPTABLE)
@Outcome(id = "0, 9223372036854775807", expect = Expect.ACCEPTABLE)
@Outcome(id = "9223372036854775806, 9223372036854775806", expect = Expect.ACCEPTABLE)
@Outcome(id = "9223372036854775806, 9223372036854775807", expect = Expect.ACCEPTABLE)
@Outcome(id = "9223372036854775807, 9223372036854775807", expect = Expect.ACCEPTABLE)
@Outcome(expect = Expect.ACCEPTABLE_INTERESTING)
@State
public class WordTearingWithPlain {

    private final MemorySegment segment;
    private static final VarHandle VH_LONG = JAVA_LONG.withByteAlignment(1).varHandle();

    public WordTearingWithPlain() {
        this.segment = Arena.ofAuto().allocate(JAVA_LONG.byteSize() * 2, 1);
    }

    @Actor
    public void actor1() {
        VH_LONG.set(segment, 4L, Long.MAX_VALUE - 1);
        VH_LONG.set(segment, 4L, Long.MAX_VALUE);
    }

    @Actor
    public void actor2(LL_Result r) {
        r.r1 = (long) VH_LONG.get(segment, 4L);
        r.r2 = (long) VH_LONG.get(segment, 4L);
    }

}

The results explicitly show word tearing in action. The value 4294967294 is neither the initial 0 nor either of the values written (Long.MAX_VALUE - 1 and Long.MAX_VALUE). Because the MemorySegment was accessed with an unaligned layout (1-byte alignment for an 8-byte long), the JVM and CPU could not write the 64-bit long in a single, atomic hardware instruction. Instead, it was split. Actor2 managed to read the memory exactly when only half of the new value had been written, resulting in a corrupted, blended value. This highlights why alignment and proper access modes are necessary when managing memory manually.

                                    RESULT      SAMPLES     FREQ       EXPECT
                                      0, 0   24,373,979    9.07%   Acceptable
                             0, 4294967294            2   <0.01%  Interesting
                    0, 9223372036854775806       38,613    0.01%   Acceptable
                    0, 9223372036854775807      726,094    0.27%   Acceptable
                    4294967294, 4294967294            2   <0.01%  Interesting
           4294967294, 9223372036854775806            1   <0.01%  Interesting
  9223372036854775806, 9223372036854775806       21,715   <0.01%   Acceptable
  9223372036854775806, 9223372036854775807      145,851    0.05%   Acceptable
  9223372036854775807, 9223372036854775807  243,455,002   90.58%   Acceptable
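The torn value in the table can be reproduced with plain arithmetic. The sketch below uses a hypothetical helper, lowHalfOnly, and assumes the write was split into two 32-bit halves with the low half landing first, which is consistent with the observed value but not something the test output proves.

```java
public class TornValue {

    // Combines the low 32 bits of newValue with the high 32 bits of oldValue,
    // modelling a reader that sees only the first half of a split 64-bit write.
    static long lowHalfOnly(long newValue, long oldValue) {
        long lowMask = 0xFFFF_FFFFL;
        return (oldValue & ~lowMask) | (newValue & lowMask);
    }

    public static void main(String[] args) {
        // Low half of Long.MAX_VALUE - 1 (0x7FFFFFFFFFFFFFFE) over the initial 0.
        System.out.println(lowHalfOnly(Long.MAX_VALUE - 1, 0L)); // prints 4294967294

        // Tearing between the two written values is invisible in this test:
        // they differ only in the lowest bit.
        System.out.println(lowHalfOnly(Long.MAX_VALUE, Long.MAX_VALUE - 1) == Long.MAX_VALUE); // prints true
    }
}
```

This also explains why the only torn results in the table involve the initial 0: a tear between Long.MAX_VALUE - 1 and Long.MAX_VALUE always yields one of those two values again.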

The post Thread-Safe Native Memory in Java: VarHandle Access Modes Explained appeared first on foojay.

TestBox 7: Real-Time Feedback, a Browser-Based IDE, and Modern Testing Workflows on the JVM

Cristobal Escobar — Tue, 24 Mar 2026 16:58:53 +0000

Table of Contents

Keyboard Shortcuts
Streaming Test Execution via SSE

Dry Run & Spec Discovery

BoxLang CLI Runner — New Power Options
Other Notable Improvements
TestBox CLI Updates (v1.8.0)
Upgrade Now

TestBox 7.x focuses on improving testing workflows for BoxLang and CFML applications. This release introduces improvements to the BoxLang CLI runner, real-time streaming test execution via SSE, dry run capabilities, a browser-based TestBox RUN interface, and several developer experience enhancements.

Check out the what's new here: https://testbox.ortusbooks.com/readme/release-history/whats-new-with-7.0.0

TestBox RUN: A Browser IDE for Your Tests

The centerpiece of TestBox 7 is TestBox RUN: a self-hosted, single-page web app (bx/tests/index.bxm) that you drop into any BoxLang project and open in a browser. No build toolchain. No external service. Just BoxLang.

It communicates with your existing runner.bxm or runner.cfm endpoints and streams spec results in real time via Server-Sent Events. Results appear in the test tree as each spec finishes (green for passing, red for failures, with full error messages), long before the full suite completes.

What You Get

Real-time streaming test tree — live updates per spec, not per suite
Dark / Light theme with localStorage persistence

Live search + status filters — filter by bundle, suite, or spec name; chips for Passed / Failed / Errored / Skipped

Per-bundle Run button — re-run a single bundle without touching the rest

Debug Buffer Panel — captured TestBox debug output surfaced per-bundle

Floating progress widget — current bundle, specs completed vs. total, animated progress bar

Configurable settings — runner URL, directory, bundle pattern, labels, excludes — all saved in localStorage

Every setting is also overridable via URL query params, making CI integration clean:

/tests/?directory=tests.specs.integration&labels=slow&runnerUrl=/tests/runner.bxm

Keyboard Shortcuts

Shortcut	Action
⌘/Ctrl + K	Focus search bar
⌘/Ctrl + Enter	Run all tests
⌘/Ctrl + .	Reload / rediscover tests
⌘/Ctrl + ,	Open Settings
⌘/Ctrl + B	Toggle expand/collapse all bundles
⌘/Ctrl + D	Toggle dark/light mode

Getting Started

TestBox RUN ships automatically with every TestBox 7 install under bx/tests/. ColdBox apps generated via the ColdBox CLI include it out of the box. For new projects:

testbox generate harness --help

Note: TestBox RUN requires a running web server and a runner.bxm endpoint with SSE support via BoxLang. For pure CLI apps, use the BoxLang runner with --stream (see below).

Coming Soon: TestBox RUN Desktop App

We're actively building a native desktop app version of TestBox RUN on the BoxLang Desktop Runtime — connect to any local or remote runner URL and get the same streaming UI without a browser. Watch testbox.run for early access.

Streaming Test Execution via SSE

TestBox 7 ships a brand-new StreamingRunner that pushes each spec result to the client the moment it completes, rather than buffering the entire suite.

StreamingRunner (Programmatic)

component {
    function streamTests( event, rc, prc ) {
        event.setHTTPHeader( name="Content-Type", value="text/event-stream" );
        event.setHTTPHeader( name="Cache-Control", value="no-cache" );

        new testbox.system.runners.StreamingRunner(
            bundles  = "tests.specs",
            options  = {},
            reporter = "text"
        ).run();
    }
}

BoxLang CLI `--stream` Flag

The BoxLang CLI runner gets native streaming support:

./testbox/run --stream
./testbox/run --directory=tests.specs --stream

This is especially useful in CI pipelines where live progress matters more than waiting for a buffered final report.

Dry Run & Spec Discovery

Two long-requested features land in TestBox 7: spec discovery and dry run mode. Audit exactly what would run before committing to a full suite execution.

Runner Dry Run
If you call runner.bxm or runner.cfm with ?dryRun=true, it returns a JSON representation of what the test execution would look like.

Programmatic Dry Run

var tb      = new testbox.system.TestBox( bundles = "tests.specs" );
var results = tb.dryRun();

CLI Dry Run

./testbox/run --dry-run

Lists every suite and spec that would execute, with labels and skip reasons — perfect for coverage audits and CI test inventory reporting.

JSON Output

Need to feed results into another tool?

./testbox/run --dry-run=json
./testbox/run --dry-run=json --bundles=tests.specs.MySpec | jq .

Dry run respects all the same filters as a normal run: --labels, --bundles, --directory, --testSuites, --testSpecs.

BoxLang CLI Runner — New Power Options

The BoxLang runner gets a substantial set of new flags for fine-grained control over output, failures, and performance analysis.

Focus on Failures

./testbox/run --show-failed-only

Stack Trace Control

./testbox/run --stacktrace=short   # condensed (default)
./testbox/run --stacktrace=full    # complete Java/BoxLang trace

Output & Performance Flags

# Suppress passing or skipped specs
./testbox/run --show-passed=false
./testbox/run --show-skipped=false

# Abort after N failures
./testbox/run --max-failures=10

# Flag slow specs
./testbox/run --slow-threshold-ms=500

# Report the N slowest specs at the end
./testbox/run --top-slowest=5

Combine them for a tight CI workflow:

./testbox/run --show-failed-only --stacktrace=short --max-failures=5 --top-slowest=3

Application Mappings Auto-Load (TESTBOX-440)

The BoxLang runner now automatically loads Application.bx mappings from your project root before running tests. Custom path mappings, datasources, and settings are available to your specs with zero extra configuration — bringing the CLI experience much closer to a full web server environment.

Other Notable Improvements

`ConsoleReporter` — Hide Skipped Tests (TESTBOX-433)

Stop noisy skipped-spec output when you have many pending specs:

var testbox = new testbox.system.TestBox(
    bundles  = "tests.specs",
    reporter = {
        type    : "testbox.system.reports.ConsoleReporter",
        options : { hideSkipped : true }
    }
);

Or from the CLI: --show-skipped=false

Suite Filtering Fixes (TESTBOX-435)

Direct suite name matching is now reliable at any nesting depth. If a suite's name exactly matches testSuites, it always runs — no more surprises with nested suites getting skipped.

./testbox/run --testSuites="My Integration Suite"

TestBox CLI Updates (v1.8.0)

The testbox-cli CommandBox module hits 1.8.0 with two new commands:

# Show installed version, path, and project config
testbox info

# Force a clean reinstall of the CLI module
testbox reinstall

Streaming is also available via the CLI:

testbox run --streaming
testbox run --streaming --verbose   # include passing specs in live output

Engine Support

Engine	Status
BoxLang 1.x+	✅ PREFERRED
Lucee 7.x	✅ NEW
Lucee 6.x	✅
Lucee 5.x	⚠️ DEPRECATED
Adobe 2025	✅
Adobe 2023	⚠️ DEPRECATED
Adobe 2021	❌ Dropped

Adobe 2021 is no longer supported. Upgrade to Adobe 2023+ or migrate to BoxLang.

Upgrade Now

TestBox 7 is available today via CommandBox:

box install testbox

Or pin to 7.x:

box install testbox@^7.0.0

Full release notes and issue links are in the TestBox documentation. As always, file bugs and feature requests in our JIRA. You can also check out the what's new guide here: https://testbox.ortusbooks.com/readme/release-history/whats-new-with-7.0.0

The post TestBox 7: Real-Time Feedback, a Browser-Based IDE, and Modern Testing Workflows on the JVM appeared first on foojay.

Managing Native Memory in Java: Arenas, Malloc, and Custom Pools

David Vlijmincx — Fri, 20 Mar 2026 10:20:04 +0000

Table of Contents

What is the Memory API

Arenas

Using Arenas
Creating your own arena

Native Memory allocation methods

Using Malloc and Free

Pool of reusable memory

Why you would use them
How to use a memory pool

Slicing

How to use them

TL;DR

Conclusion

What is the Memory API

The Foreign Function & Memory (FFM) API is Java's new way of interacting with native code and memory. It is mostly useful when storing data off-heap or passing arguments to a native method. Handling this native memory comes down to balancing how much control you need against the risk of memory leaks. You can rely on the provided Arenas to bind memory to safe scopes, or you do it yourself using malloc and free for absolute control, or implement custom pools and slices to optimize allocations for your use case.

Arenas

Arenas are the built-in way to allocate and manage native memory. Arenas work with a scope, meaning that an Arena gets opened, allocates memory, and gets closed again. The Arena acts as a guard to make sure the memory is valid while the scope is open and freed when it is closed.

There are basically two types of Arenas you get out of the box with Java 22: those with a deterministic lifetime and those with a non-deterministic lifetime. Let's look at the deterministic lifetime first. These are the Arena.ofConfined and Arena.ofShared. These arenas have a specific start and end. You have to open them, and close them in your code. The other type of Arena is non-closeable, such as Arena.global and Arena.ofAuto. You can open these arenas but you cannot close them. The memory allocated using Arena.global() is freed when the application exits. The Arena.ofAuto() uses the garbage collector to decide when a MemorySegment should be deallocated.

Below is an overview of the different properties each arena has.

Arena Type	Bounded Lifetime	Manually Closeable	Multi-thread Access
Global	No	No	Yes
Auto	Yes	No	Yes
Confined	Yes	Yes	No
Shared	Yes	Yes	Yes

Using Arenas

Using an arena is quite straightforward. You can use the Arena interface to create each of the four arena types. The first one is the global arena. The memory you allocate with it won't be deallocated till the application exits.

Arena arena = Arena.global();
MemorySegment segment = arena.allocate(42);

In the example, we allocate 42 bytes and create a MemorySegment that you can use. Next up is the Arena.ofAuto() that uses the garbage collector to decide when to free memory.

Arena arena = Arena.ofAuto();
MemorySegment segment = arena.allocate(42);

The previous two examples showed Arenas that are not explicitly closed; we never called a close method. The next Arena is closeable: Arena extends the AutoCloseable interface, so it works with the try-with-resources statement, as the next example shows:

try (Arena arena = Arena.ofShared()) {
    MemorySegment segment = arena.allocate(42);
} // The segment is deallocated here

Here we explicitly opened and closed the Arena. The MemorySegment is only valid inside that scope, because when we exit the try the arena is closed and the memory freed. ofShared and ofConfined work mostly the same. The only difference is that the MemorySegment created with a Confined arena can't be shared by different threads. You can only use it in the thread that created the arena.
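The two guarantees described above, that a segment becomes invalid once its arena is closed and that a confined segment is tied to its owning thread, can be observed directly. The following is a minimal sketch assuming Java 22 or later; the class and method names are hypothetical.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: observe the lifetime and thread-confinement checks of Arena.ofConfined().
final class ArenaScopeDemo {

    // Returns true if reading a segment after its arena was closed is rejected.
    static boolean accessAfterCloseFails() {
        MemorySegment escaped;
        try (Arena arena = Arena.ofConfined()) {
            escaped = arena.allocate(ValueLayout.JAVA_LONG);
            escaped.set(ValueLayout.JAVA_LONG, 0, 42L);
        } // arena closed, memory freed
        try {
            escaped.get(ValueLayout.JAVA_LONG, 0);
            return false;
        } catch (IllegalStateException expected) {
            return true; // the scope is no longer alive
        }
    }

    // Returns true if a second thread is refused access to a confined segment.
    static boolean confinedRejectsOtherThread() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(ValueLayout.JAVA_LONG);
            AtomicBoolean rejected = new AtomicBoolean(false);
            Thread other = new Thread(() -> {
                try {
                    segment.set(ValueLayout.JAVA_LONG, 0, 1L);
                } catch (WrongThreadException expected) {
                    rejected.set(true);
                }
            });
            other.start();
            try {
                other.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return rejected.get();
        }
    }
}
```

Swapping Arena.ofConfined() for Arena.ofShared() in the second method should let the other thread write to the segment, since shared arenas allow multi-thread access.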

Creating your own arena

If the provided arenas are not working out for you, you can create your own by implementing the Arena interface. This could be useful for when you want to have a different allocation strategy, or need to do something else on allocation. Below is a LoggingArena, that implements the basics and logs when an allocation happens.

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

public class LoggingArena implements Arena {
    private final Arena backingArena = Arena.ofConfined();

    @Override
    public MemorySegment allocate(long byteSize, long byteAlignment) {
        System.out.println("Allocating segment of size: " + byteSize + " bytes");
        return backingArena.allocate(byteSize, byteAlignment);
    }

    @Override
    public MemorySegment.Scope scope() {
        return backingArena.scope();
    }

    @Override
    public void close() {
        System.out.println("Closing arena and freeing memory.");
        backingArena.close();
    }
}

Yeah... it's not a true allocator because it doesn't manage memory itself. That is something you can't really do because the actual allocation of memory is closed off to the outside. When you create your own arena you are always wrapping an existing Arena like ofConfined. With that being said, let's see how we can break free from these limitations and manage the memory ourselves.

Native Memory allocation methods

When you need more control, you can opt to use the allocation methods provided by C. Meaning that you use methods like malloc, calloc and free just as you would in C, but without leaving Java! This gives you the most control over the memory lifetimes. The downside is that you have to manage the MemorySegment yourself, meaning there is a chance that you go out of bounds, or introduce a memory leak if you forget to free a segment.

Using Malloc and Free

To use malloc and free you basically do the same thing as with any other downcall you want to make. You use the Linker and create the downcallHandle for each C method.

// Assumes static imports of ADDRESS and JAVA_LONG from java.lang.foreign.ValueLayout
Linker linker = Linker.nativeLinker();

MethodHandle malloc = linker.downcallHandle(
    linker.defaultLookup().find("malloc").orElseThrow(),
    FunctionDescriptor.of(ADDRESS, JAVA_LONG)
);

MethodHandle free = linker.downcallHandle(
    linker.defaultLookup().find("free").orElseThrow(),
    FunctionDescriptor.ofVoid(JAVA_LONG)
);

Now all you need to do is call these methods to allocate and free memory. In the following example you can see how these downcalls are used.

MemorySegment segment =  ((MemorySegment) malloc.invokeExact(size)).reinterpret(size);
free.invokeExact(segment.address());

The return value of malloc is cast to a MemorySegment. This segment now has the correct address but a size of zero... it is basically a pointer. So we have to call the reinterpret method to set the correct size and make it usable. The second line in the example frees the memory that was just allocated.

Small side note: You don't have to use MemorySegments if you want to work with addresses as longs and pass those around in your code; that is also totally fine. It saves you from having to create two MemorySegment instances. The malloc call itself and reinterpret both create a new MemorySegment instance. If you want to work with a long, you need to use this descriptor for malloc: FunctionDescriptor.of(JAVA_LONG, JAVA_LONG).
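The long-based variant from the side note could look roughly like this. It is a sketch under a few assumptions: a 64-bit platform where size_t and pointers fit in a Java long, a C library reachable through the default lookup, and hypothetical class and method names. A temporary MemorySegment view is still created here, but only to read and write the value; the address itself travels as a plain long.

```java
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.invoke.MethodHandle;

import static java.lang.foreign.ValueLayout.JAVA_LONG;

// Sketch: malloc/free with raw long addresses instead of MemorySegment return values.
final class RawMalloc {
    private static final Linker LINKER = Linker.nativeLinker();

    // long malloc(long size): the returned pointer is treated as a plain long.
    private static final MethodHandle MALLOC = LINKER.downcallHandle(
        LINKER.defaultLookup().find("malloc").orElseThrow(),
        FunctionDescriptor.of(JAVA_LONG, JAVA_LONG));

    // void free(long address)
    private static final MethodHandle FREE = LINKER.downcallHandle(
        LINKER.defaultLookup().find("free").orElseThrow(),
        FunctionDescriptor.ofVoid(JAVA_LONG));

    // Allocates 8 bytes, stores a value, reads it back, and frees the memory.
    static long roundTrip(long value) {
        try {
            long address = (long) MALLOC.invokeExact(8L);
            if (address == 0) {
                throw new OutOfMemoryError("malloc returned NULL");
            }
            try {
                // Wrap the raw address in an 8-byte view to access it.
                MemorySegment view = MemorySegment.ofAddress(address).reinterpret(8);
                view.set(JAVA_LONG, 0, value);
                return view.get(JAVA_LONG, 0);
            } finally {
                FREE.invokeExact(address);
            }
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```

Because downcallHandle and reinterpret are restricted methods, recent JDKs may print a native-access warning unless the application is started with --enable-native-access.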

Pool of reusable memory

Memory pools in this context aren't provided by a specific built-in FFM API class. Instead, they are an architectural pattern you implement yourself. The two most common types are:

Segment Queues/Stacks: Pre-allocating a fixed number of identically sized MemorySegment objects and holding them in a thread-safe data structure (like an ArrayBlockingQueue or ConcurrentLinkedQueue).
Large Block Managers: Allocating one massive MemorySegment upfront and writing custom logic to hand out logical slices of it to requesters, tracking which offsets are free or in use.

Why you would use them

Native allocation (malloc or system-level calls behind Arena creation) is relatively slow. Creating Java object wrappers like MemorySegment for every single native allocation generates garbage for the GC to clean up. If you are writing high-frequency, low-latency code (like packet processing or using io_uring!), you want to avoid hitting the OS memory allocator repeatedly. A pool allows you to pay the allocation cost once at startup and reuse the memory indefinitely.

How to use a memory pool

The implementation depends on your needs, but a basic fixed-size pool could look like this.

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class SegmentPool implements AutoCloseable {
    private final Arena arena;
    private final Queue<MemorySegment> availableSegments;

    public SegmentPool(int poolSize, long segmentSize) {
        this.arena = Arena.ofShared();
        this.availableSegments = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < poolSize; i++) {
            availableSegments.offer(arena.allocate(segmentSize));
        }
    }

    public MemorySegment borrow() {
        MemorySegment segment = availableSegments.poll();
        if (segment == null) {
            throw new OutOfMemoryError("Pool is exhausted");
        }
        return segment;
    }

    public void returnSegment(MemorySegment segment) {
        // Optional: Zero out the memory before returning it
        segment.fill((byte) 0);
        availableSegments.offer(segment);
    }

    @Override
    public void close() {
        arena.close();
    }
}

We are trading complexity and memory usage for performance. This pool is holding onto memory even when you aren't actively using it, and you have to trust your application logic to actually return the segments to the pool to avoid running out of memory. The example uses a ConcurrentLinkedQueue. This isn't necessarily the most performant way, but can work. The most performant way totally depends on your use case.

Slicing

Slicing is taking an existing, already-allocated block of memory and working with a section of it. The Memory API gives you a few ways to do this:

MemorySegment.asSlice(offset, size)
SegmentAllocator.slicingAllocator(MemorySegment)
Working with pure Offsets.

Slicing is useful when you have a block of memory and you want to pass a specific chunk of it to a method without giving that method access to the entire parent segment. Using a slicing allocator is good for when you want to allocate several smaller segments sequentially out of one large backing segment without the cost of separate native allocations. The downside is that you can only free the whole block of memory, not the individual slices.

How to use them

Using asSlice creates a new MemorySegment instance that acts as a view into the original memory. It shares the same lifetime scope as the parent, but operates within smaller, strictly defined spatial bounds.

try (Arena arena = Arena.ofConfined()) {
    MemorySegment parent = arena.allocate(100);

    // Create a slice starting at byte 10, with a length of 20 bytes
    MemorySegment child = parent.asSlice(10, 20);
}

You could also use a slicing allocator to hand out segments sequentially:

try (Arena arena = Arena.ofConfined()) {
    MemorySegment block = arena.allocate(1024);
    SegmentAllocator allocator = SegmentAllocator.slicingAllocator(block);

    // Allocates the first 8 bytes of the block
    MemorySegment firstLong = allocator.allocate(ValueLayout.JAVA_LONG);

    // Allocates the next 4 bytes
    MemorySegment nextInt = allocator.allocate(ValueLayout.JAVA_INT);
}

The trade-off with asSlice and slicingAllocator is that they still instantiate Java MemorySegment objects, causing GC pressure. If you want zero allocation overhead, calculate and pass the long offsets manually; in essence, that is just adding two long values. At that point you are doing pointer arithmetic in Java.
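Working with pure offsets might look like the following sketch. The record layout (two longs, 16 bytes per record) and all names are hypothetical; the point is only that each access computes a base offset into one shared block instead of creating a slice object.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

import static java.lang.foreign.ValueLayout.JAVA_LONG;

// Sketch: address fixed-size records inside one block using plain long offsets.
final class OffsetAccess {
    // Hypothetical record layout: long id at offset 0, long value at offset 8.
    static final long RECORD_SIZE = 16;
    static final long VALUE_OFFSET = 8;

    static void write(MemorySegment block, long index, long id, long value) {
        long base = index * RECORD_SIZE; // the "pointer arithmetic"
        block.set(JAVA_LONG, base, id);
        block.set(JAVA_LONG, base + VALUE_OFFSET, value);
    }

    static long readValue(MemorySegment block, long index) {
        return block.get(JAVA_LONG, index * RECORD_SIZE + VALUE_OFFSET);
    }

    // Writes record number 2 into a four-record block and reads its value back.
    static long demo() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment block = arena.allocate(RECORD_SIZE * 4, 8);
            write(block, 2, 7L, 99L);
            return readValue(block, 2);
        }
    }
}
```

The bounds checks of the parent segment still apply, so an offset past the end of the block fails with an exception rather than corrupting memory.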

TL;DR

Strategy	Allocation Speed	Lifetime / Safety	Best Used For
Arena.ofConfined / Shared	Moderate	Deterministic / Safe (bounds & scope checked)	Standard off-heap usage. Short-to-medium lifespan tasks.
Arena.ofAuto	Moderate	Non-deterministic (tied to GC)	When you want native memory but prefer the GC to handle cleanup.
Arena.global	Moderate	Lives until JVM exit	Static native data, application-wide lookup tables.
Manual Malloc / Free	Fast (C-level)	Manual / Unsafe (leaks & crashes possible)	Maximum control, integrating with C libraries that expect manual frees.
Memory Pools	Very Fast	Manual (must return segments)	High-frequency allocations, avoiding OS overhead and GC pressure.
Slicing Allocator	Very Fast	Bound to parent segment	Breaking up large allocations, parsing sequential structs.

Conclusion

Managing native memory in Java used to be a hassle, but the Memory API makes it standard and manageable. The built-in Arena classes provide a balance between safety and control, catching out-of-bounds access and handling cleanup for you based on the scope. When those don't fit your use case, you can bypass them entirely and use malloc calls or build your own memory pools to cut down on allocation overhead.

Just remember that moving data off-heap introduces complexity, and dropping down to manual memory management means you take on the risk of the errors that come with it, like messing up your bounds or forgetting to free a segment. Stick to the standard Arena unless your profiling shows you actually need the raw performance of a custom pool or something like malloc.

The post Managing Native Memory in Java: Arenas, Malloc, and Custom Pools appeared first on foojay.

How is Leyden improving Java Performance? Part 3 of 3

María Arias de Reyna Domínguez — Thu, 19 Mar 2026 12:10:22 +0000

Table of Contents

What is inside the Ahead of Time Cache?

JVM Metadata
JVM Profile and Linkage Data
JVM Code and Code Management Data
Leyden Training Data

How Do I Know Leyden Is Helping?

Training the application

Analyzing the Cache

Are we training the right thing?
Did we load all relevant classes during Training?
Are our methods properly trained?

In part 1 of this series of 3 blog posts we introduced the specific performance challenges OpenJDK faces lowering application ‘startup’, ‘warmup’ and ‘initial footprint’ costs and provided an overview of what Leyden is doing to address those challenges.

Part 2 described how to use the new capabilities offered by Leyden and presented test results which show that very significant progress has already been made and is set to continue.

Part 3 provides a more detailed account of how Leyden’s proposed solution operates and presents a first look at tooling that allows you to assess the benefits that result and tune your application to make the most of what Leyden offers.

What is inside the Ahead of Time Cache?

Ideally, an AOT cache would simply include everything needed to allow a production run to skip straight through to its warmed up state. However, in practice training runs don’t always cover everything that can happen at runtime, and hence the assets contained in any generated AOT cache will be more or less complete.

In order to have some idea of how effective a training run has been it’s helpful to be able to look at a cache and see what is in it. Full details of the tooling that allows you to do that are presented later in this post. However, in order to prepare for that, we need to provide an overview of the JVM assets that end up in the cache and how the JVM uses them. We will follow up with some examples to show how effectively this improves startup and warmup.

So, let’s take a deeper look at what exactly is inside the AOT Cache. There are several different ways of classifying the contents:

The most straightforward way to classify AOT cache assets is to distinguish between Static and Dynamic data.

Static assets are data that are available in or directly derived from bytecode, data that exist, even if only implicitly, at build time.

Dynamic assets are data that get generated, or are collected, at runtime as a side-effect of execution. Some of them record information that can be used to trigger compilation and drive feedback-driven optimizations, including speculative optimization, beyond what an ahead-of-time compiler would be able to do. They can also include the compiled code that is generated as a result of that compilation.

Finally, they include training data, created as a training run progresses to track what the JVM has done and why. Training data identify what JVM assets need to be stored into the cache when it is created. They are also installed in the cache, indexing the other assets and helping identify how to use them in production.

We can also distinguish two types of data depending on their purpose:

On one hand, there is the JVM data — metadata, heap data and code. This is a network of C++ objects that are used during normal JVM running to define and regulate Java execution. These objects must always exist, even when running without a cache, in order for the JVM to be able to run an app. This object network needs to be dumped to the archive on disk in a format that allows it to be quickly and correctly restored to the relevant memory areas of the production JVM in a valid (C++ Object) format and layout that matches the JVM’s expectations.

On the other hand we also have Leyden’s own Cache Management Data, i.e. training data, which exists specifically to support creation and consumption of AOT cache. Training data are also saved and restored as C++ object data but the format and layout of these objects is determined solely by the Leyden cache management code. Its sole purpose is to track and regulate what assets get written to the AOT cache after training completes and what assets can or should be restored in production.

Let’s see in detail what each data type means.

JVM Metadata

Metadata stored in the AOT cache is a superset of what was stored in a CDS archive. The subset which overlaps with CDS is the static metadata, which represents the structure and hierarchy of classes in JDK and application code. Primarily, it helps avoid the cost of parsing bytecode, as it is in the same format as the JVM’s own internal metadata model (classes, methods, fields, inheritance between classes and so on), which can be mapped directly into memory. Having this information stored in the cache reduces the time HotSpot takes to decode the class files and to build the dependency graph.

While starting the application, the Java Heap memory gets filled with objects and instances that are going to be used during runtime. Some of those heap data objects can be cached too because they are quite predictable, like Strings hardcoded in the source code, java.lang.Class instances, some content of class static fields, objects needed to run lambdas, the class graph module,... Those are all assets that are created in memory in the same way on every run.

The heap data cached at the moment is restricted to very specific cases as it has to behave exactly the same on each and every run, but the type of data cached is expanding on each JDK version.

JVM Profile and Linkage Data

The cache also includes dynamic JVM metadata i.e. MethodCounter, MethodData and ConstantPoolCache objects. These objects are created and attached to the static metadata methods and classes and their content is updated as a side-effect of executing method code.

MethodCounter objects track how often the method they are attached to has been called. They are primarily used to trigger compilation via the baseline (C1) or optimizing (C2) compiler. The interpreter increments a method’s call count up to a threshold before scheduling a C1 compile, possibly including code that gathers further profile information. Instrumentation code in (Tier 2 and 3) C1 compiled methods also updates the call count and, when a higher threshold is reached, either upgrades to (Tier 4) C2 compiled code or reverts to (Tier 1) C1 compiled code, which includes no instrumentation.

MethodData objects record detailed profile information about their associated method’s hot and cold paths, argument types and other details of how it executes, most notably any history of speculative deoptimization. Apart from the deoptimization case, which applies for both C1 and C2 code, MethodData objects only receive updates via instrumentation code in (Tier 2 and 3) C1 compiled methods.

ConstantPoolCache objects are attached to a class and track the linkage of call and field access sites in any of the class’s methods. Prelinking avoids work at the first call or first field access and this is especially valuable when the call is an invokedynamic i.e. the bytecode that implements a lambda invocation.

Linking a lambda involves running Java ‘bootstrap’ code that identifies a private class that owns the bytecode for the lambda body, asking it to construct and return a MethodHandle that can be used to execute the target. If a lambda can be run during training then the target class and method can be pre-loaded and the MethodHandle stored in the heap and linked from the ConstantPoolCache, avoiding the need to run the ‘bootstrap’ in production. If the lambda is executed repeatedly in production the called bytecode may even be inlined into the compiled code for the caller. Effectively, executing a lambda in training removes all setup overheads in production, making lambdas as cheap to use as a direct method call.
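To make the linkage step concrete, here is a minimal sketch of the kind of code it applies to; the class and method names are hypothetical. javac compiles the lambda expression below to an invokedynamic call site, and it is the first execution of that call site that runs the bootstrap described above. If a training run exercises this method, that is the work the cache can save in production.

```java
import java.util.function.IntUnaryOperator;

// Sketch: a lambda whose call site is linked lazily via invokedynamic.
final class LambdaLinkage {

    static int addOne(int x) {
        // The lambda body is compiled into a private method of this class;
        // creating the IntUnaryOperator goes through an invokedynamic call site
        // that is linked the first time this line executes.
        IntUnaryOperator increment = v -> v + 1;
        return increment.applyAsInt(x);
    }

    public static void main(String[] args) {
        System.out.println(addOne(41)); // prints 42
    }
}
```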

JVM Code and Code Management Data

AdapterHandlers are a set of utilities used by HotSpot to marshal method parameters when performing certain types of call. AdapterHandlers can be cached, avoiding the need to generate them on demand. They are identified by their AdapterFingerprint and indexed via a table of AdapterHandleEntry objects.

Alongside these handlers various StubBlobs needed by the runtime are also cached. These blobs contain JITted code that implements one or more ‘stub’ routines. Stub routines include architecture- and OS-specific code used by the JVM to perform operations that are hard to write in a platform agnostic way. Examples include: flushing code regions after update by the JIT or call linker, unwinding the stack when an exception occurs, replacing a compiled stackframe with one or more interpreter frames when execution of a deopt trap forces a bail-out etc. There are also many stubs that provide hand crafted, high-performance implementations of math, crypto or memory copy methods that are used in place of Java implementations on some architectures, especially where hand-crafted code can use specialized hardware instructions to outperform the JIT compiler. Much of the stub and adapter code has to be generated before the JDK can fully startup. Storing it in the cache and reloading it in production provides a small but noticeable performance improvement.

Leyden premain also includes CompiledMethods, i.e. pre-compiled Java methods, in the cache. This includes both C1 and C2 (Tiers 1 - 4) and in some cases different tier compiled versions of the same method. Having compiled code immediately available, especially Tier 4 code, is an enormous boost to performance. Lower tier code may be useful when the method only reached that tier during training or as a fallback if we need to deoptimize and reprofile. Pre-compiled Java methods are an enhancement we expect to add soon to the mainline JDK.

Leyden Training Data

Training data is part of the Leyden specific code. It tracks which methods have actually been loaded, executed, and used during the training run and how they have been used. Normally all loaded classes have associated class training data, but these may be omitted if, say, the class is loaded by a custom (user-defined) loader, is modified by an agent or fails to resolve because of linkage errors. There is a usage threshold which means that only methods that have been executed above that threshold will have associated method training data. Likewise, compiled method training data only exists for methods actually compiled during training. This helps both in keeping a smaller footprint in the cache and removing less useful data so processing the cache is faster.

How Do I Know Leyden Is Helping?

Depending on how well you train your deployment you may see different improvements in time to reach application start (startup time) and time to reach peak performance (warmup time). Log output is one useful way to measure these two metrics but the details will depend on what monitoring capabilities are available in your test or production environment. However, simply measuring these two times (or even recording warmup profiles) doesn’t help with the problem of explaining why, for some given training regime, you get a specific improvement or perhaps, in some cases, no measurable improvement.

For any given AOT cache (or set of alternative caches) it is very helpful to have some idea of what assets were included or excluded in the training set, which ones were written into the cache and what benefit they provide during a production run. In particular, it is useful to have both aggregate statistics and information on individual assets and their relationships. The Leyden project has provided a tool precisely to address these needs. Let's see a practical example of how to diagnose an AOT Cache.

For the purpose of this article, we are going to use the following example application: https://github.com/Delawen/bad-good-cache

This is a web application that has a simple API and a basic HTML interface to use it.

The first thing we need to do is to compile this application from its root folder:

$ mvn clean package

Training the application

Once we have the jar created, we use it to start a training run:

$ java -XX:AOTCacheOutput=target/app.aot -Xlog:aot+map=trace,aot+map+oops=trace:file=target/aot.map:none:filesize=0 -Xlog:class+load=info,aot+resolve*=trace,aot+codecache+exit=debug,aot*=warning:file=target/training.log:level,tags -jar target/quarkus-app/quarkus-run.jar

The arguments we are going to use are the following:

-XX:AOTCacheOutput=target/app.aot Which will create an AOT file called app.aot
-Xlog:aot+map=trace,aot+map+oops=trace:file=target/aot.map:none:filesize=0 Which will create a map file that indexes and describes the previously created AOT file.
-Xlog:class+load=info,aot+resolve*=trace,aot+codecache+exit=debug,aot*=warning:file=target/training.log:level,tags Which will generate training log files with relevant information

To help us train the application, we are going to use the oha tool, which runs a series of requests that simulate a user using the application:

$ oha --urls-from-file src/main/resources/urls.txt -n 100

Now that we have trained the application, let's stop it with Ctrl+C. It will take some time to stop while it builds the cache, because it performs both the training and assembly steps at once.

We should have created three files:

target/app.aot : The AOT cache itself
target/aot.map : The map file
target/training.log : The logs for the training run

Now that we have the AOT cache, we can start a production run, in which we will also save log files:

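$ java -XX:AOTCache=target/app.aot -Xlog:class+load=info,aot+resolve*=trace,aot+codecache+exit=debug,aot*=warning:file=target/production.log:level,tags -jar target/quarkus-app/quarkus-run.jar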

The arguments we are going to use are the following:

-XX:AOTCache=target/app.aot Make use of the AOT file called app.aot
-Xlog:class+load=info,aot+resolve*=trace,aot+codecache+exit=debug,aot*=warning:file=target/production.log:level,tags Which will generate a production log file with relevant information

And we can use the application normally. Let's play a bit on http://localhost:8080/

On this run, we created the production.log file.

Analyzing the Cache

After using it, we can stop it and analyze how the AOT Cache behaved with our AOT Cache diagnostics tool: https://github.com/Delawen/leyden-analyzer

The first step is loading all the information into the tool, to run a proper analysis:

> load aotCache --background target/aot.map

> load trainingLog --background target/training.log

> load productionLog --background target/production.log

Now we are ready to start our analysis. A good place to start is the info command that shows a summarized version of what is inside the cache:

Are we training the right thing?

The first thing that should catch our attention is that more than 10% of the classes that were used on the production run were not cached. That's not usual, so let's dig into what those classes are. There are hundreds of them, so filtering by our package name will make our exploration easier:

What does this mean? Let's take a closer look:

This class was not loaded during training but it was loaded during production. Something went wrong with our training.

We can explore the class org.cutecats.rest.json.CatPhotoGenerator by looking at the source code. There, we discover that it should be used by org.cutecats.rest.json.CatResource.

So, this class was loaded both on training and production runs, and the metadata is included in the AOT Cache. But for some reason, none of its methods were profiled during the training run. This means that our training run did not make extensive use of this class. Maybe we should take a look at our training run.

And indeed, there is an obvious mistake: the urls.txt file that oha used to create the requests only contains the static html pages. None of our Java classes are executed, although Quarkus loaded them at the beginning as services.

Let's run the training again, changing the URLs to the Java endpoints instead of the HTML pages: http://localhost:8080/cats and http://localhost:8080/list

Don't forget to remove the log and aot files from target/ after each try to have clean runs (the clean on the maven command should do that).

If we analyze the results again with our tool, we should see a different result:

We have increased the percentage of the classes used (96%) in production that were cached compared to our last attempt (89%). That's an improvement.
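
The percentage quoted above can be read as a simple set computation: of all the classes loaded in production, how many were also present in the cache. The following is a minimal sketch of that arithmetic, not the analyzer's actual implementation; the class CacheCoverage and its methods are hypothetical, and the sample class list is made up for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

/** Sketch: share of classes loaded in production that are also present in the AOT cache. */
final class CacheCoverage {

    /** Classes loaded in production but missing from the cache, sorted by name. */
    static Set<String> missingFromCache(Set<String> loadedInProduction, Set<String> cached) {
        Set<String> missing = new TreeSet<>(loadedInProduction);
        missing.removeAll(cached);
        return missing;
    }

    /** Percentage (0 to 100) of production classes that were found in the cache. */
    static double coveragePercent(Set<String> loadedInProduction, Set<String> cached) {
        if (loadedInProduction.isEmpty()) {
            return 100.0; // nothing was loaded, so nothing can be missing
        }
        int missing = missingFromCache(loadedInProduction, cached).size();
        return 100.0 * (loadedInProduction.size() - missing) / loadedInProduction.size();
    }

    public static void main(String[] args) {
        // Hypothetical sample data: one application class was never seen during training.
        Set<String> production = Set.of(
                "org.cutecats.rest.json.CatResource",
                "org.cutecats.rest.json.CatPhotoGenerator",
                "java.lang.Object",
                "java.lang.String");
        Set<String> cached = Set.of(
                "org.cutecats.rest.json.CatResource",
                "java.lang.Object",
                "java.lang.String");
        System.out.println("Coverage: " + coveragePercent(production, cached) + "%");
        System.out.println("Missing:  " + missingFromCache(production, cached));
    }
}
```

The list of missing classes is usually more actionable than the percentage itself, because it points directly at the code paths the training run never reached.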

Did we load all relevant classes during Training?

Let's check again for classes loaded in production that were not cached:

Something is still not working as intended. Maybe we should approach this from the other side: are we executing some testing code that replaces the real production code that should be executed during training?

Let's check if there's something being stored in the cache that we don't really need:

We can see a suspicious class called DummyPhotoGenerator. That's supposed to be used only for testing purposes, not for real training and production. Because DummyPhotoGenerator is used instead of the CatPhotoGenerator class, the code and classes that CatPhotoGenerator depends on are never exercised. If we explore our source code, we will discover that there is a “test” argument on the /cats endpoint that distinguishes between testing and production.

To fix our training, we have to call the endpoint /cats with a test=false argument, because the training run is not a test run.

The training run has to be as close to production as possible. If we use test classes, not only will they be stored in the cache and be loaded on production run, but they may also hide real production code from being trained.

Let's try again, now using http://localhost:8080/cats?test=false in the urls.txt file.

We have increased the percentage of classes loaded a bit more, which is always a good sign.
Do we have any other classes loaded during production that were not cached during training?
Are there any testing classes loaded during training or production runs?

We made sure that:

- All our classes used in production are included in the cache

- None of our testing classes are included in the cache

Although we still don't have the ideal 100% of classes cached, we are really close (98%) and we can be happy with the list of classes cached. We can now focus on how good the profiling of the methods is.

Are our methods properly trained?

Maybe you already noticed another important piece of information we have been ignoring until now: all our classes are labelled as "[Untrained]". Let's take a closer look at that.

Profiling is done on each method independently, so let's take a look at one of our methods that we know should be well trained. The describe command is pretty self-explanatory in this case:

Let's follow the recommendation and do more requests during the training run.

$ oha --urls-from-file src/main/resources/urls.txt -n 10k

And with 10 000 requests done, we can see that our method is now completely profiled and compiled at the higher level:

There are still more improvements that can be made to the training, and they will greatly depend on your application, but now we have all the basics covered.

The post How is Leyden improving Java Performance? Part 3 of 3 appeared first on foojay.

How is Leyden improving Java Performance? Part 2 of 3

María Arias de Reyna Domínguez — Wed, 18 Mar 2026 12:05:34 +0000

Table of Contents

How to use an AOT Cache

How to properly execute the Training Run?

Should I start using AOT Cache in Java already?

Heavy Mathematical Example
Simple REST API

Part 2 describes how to use the new AOT capabilities offered by Leyden and presents test results which show that very significant progress has already been made and is set to continue.

Part 3 provides a more detailed account of how Leyden’s proposed solution operates, and offers a first look at tooling that allows you to assess the benefits that result and tune your application to make the most of what Leyden offers.

How to use an AOT Cache

To use an AOT cache (on JDK 25+), you need to add some JVM arguments to your app launch command. There are two ways of doing it: in two steps or in three.

Joint Training and Assembly steps — writing of the AOT cache is performed in a forked Java runtime at training run exit:

Training+Assembly Run: java -XX:AOTCacheOutput=${aot-cache-file} -jar app.jar
Production Run: java -XX:AOTCache=${aot-cache-file} -jar app.jar

Step 1 of the two step model runs your application until it exits (whether by means of some exit mechanism built into the application or simply by typing Ctrl-C on the console). At that point a separate Assembly JVM is forked to consume the training data collected during the training run and generate an AOT cache using the name supplied via the AOTCacheOutput command line option. The training JVM waits for the Assembly JVM to finish writing this file before it completes its own exit.

Step 2 runs the production application using the AOT cache specified by the AOTCache command line option.

Separate Training and Assembly steps — allows the assembly run to be executed independently without delaying the training run exit:

Training Run: java -XX:AOTMode=record -XX:AOTConfiguration=${aot-cache-conf-file} -jar app.jar
Assembly Run: java -XX:AOTMode=create -XX:AOTConfiguration=${aot-cache-conf-file} -XX:AOTCacheOutput=${aot-cache-file} -jar app.jar
Production Run: java -XX:AOTCache=${aot-cache-file} -jar app.jar

The three step model allows you to manage training and assembly as independent steps.

Step 1 runs your application until it exits, at which point the training data collected during the training run is dumped to an AOT configuration file specified using the AOTConfiguration command line option.

In step 2 this training data is passed to a new JVM using the same command line option and is used to generate an AOT cache to the file specified using the AOTCacheOutput command line option.

Step 3 runs the production application using the AOT cache specified by the AOTCache command line option.
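
If the three steps are driven from a build or deployment script written in Java, the command lines above can be assembled programmatically. The sketch below only builds the argument lists shown in this section; the class AotWorkflow, its method names and the file names app.aotconf and app.aot are hypothetical placeholders.

```java
import java.util.List;

/** Sketch: builds the command lines of the three step AOT workflow described above. */
final class AotWorkflow {

    /** Step 1: record training data into an AOT configuration file. */
    static List<String> trainingRun(String confFile, String jar) {
        return List.of("java", "-XX:AOTMode=record",
                "-XX:AOTConfiguration=" + confFile, "-jar", jar);
    }

    /** Step 2: turn the recorded configuration into an AOT cache. */
    static List<String> assemblyRun(String confFile, String cacheFile, String jar) {
        return List.of("java", "-XX:AOTMode=create",
                "-XX:AOTConfiguration=" + confFile,
                "-XX:AOTCacheOutput=" + cacheFile, "-jar", jar);
    }

    /** Step 3: run in production using the generated cache. */
    static List<String> productionRun(String cacheFile, String jar) {
        return List.of("java", "-XX:AOTCache=" + cacheFile, "-jar", jar);
    }

    public static void main(String[] args) {
        // File names are placeholders chosen for this example.
        System.out.println(String.join(" ", trainingRun("app.aotconf", "app.jar")));
        System.out.println(String.join(" ", assemblyRun("app.aotconf", "app.aot", "app.jar")));
        System.out.println(String.join(" ", productionRun("app.aot", "app.jar")));
    }
}
```

Each list can be handed to new ProcessBuilder(command).inheritIO().start() to launch the corresponding JVM, which keeps the training and assembly steps as separate processes that can run at different times or on different machines.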

The 3 step workflow is sometimes preferable because it allows the training JVM to exit more quickly. Dumping of training data is usually quick even if it is not instantaneous. Generation of the AOT cache takes substantially longer because there is a lot more work involved in sorting and laying out that data in a format that meets the JVM’s needs.

Also, with the Leyden premain release, the Assembly JVM will perform a ‘cleanroom’ compilation of all the methods to be included in the cache, possibly compiling them at more than one compilation level. This adds more time to the cache generation step.

How to properly execute the Training Run?

The best way to train your application and generate the AOT cache is a canary deployment, where you run your application in the real production environment with training enabled, allowing it to collect training data as it runs. However, that’s not always feasible, especially on containerized production environments that don’t have disk-write privileges.

Depending on how your deployment is set up this may be the type of circumstance where you choose the 3 step training model, allowing your training JVM to exit quickly and relegating the assembly to a separate, follow-on deployment. Note that the assembly JVM does not run your application code so will not need access to resources like networks or databases.

Recording requests made to your application and replaying them on a test server (either in real time or delayed), is also a very good way to generate the AOT cache, as it reproduces exactly the same kind of behaviour you can expect on production. Alternatively, you can generate synthetic request data that simulates the behaviour you expect to encounter in real production, although that may reduce the relevance or accuracy of the resulting AOT cache assets.

If you have a strong testing framework, and you are using Quarkus, you can always generate the AOT cache using integration tests. Note that you will need to run the methods repeatedly (probably several thousands of calls) to generate the proper compilation optimizations.
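
As a rough illustration of what "several thousands of calls" can look like without an external load tool, here is a minimal sketch that replays a fixed list of endpoints in round-robin order using the JDK's own HTTP client. The class TrainingLoad, the URL http://localhost:8080/fruits and the request count are assumptions made for this example; a real training load should mirror your production traffic.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

/** Sketch: a tiny synthetic load generator for a training run. */
final class TrainingLoad {

    /** Builds a round-robin plan of `total` request URIs over the given endpoints. */
    static List<URI> plan(List<String> endpoints, int total) {
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("at least one endpoint is required");
        }
        List<URI> uris = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            uris.add(URI.create(endpoints.get(i % endpoints.size())));
        }
        return uris;
    }

    /** Sends the planned requests one by one; returns how many got an HTTP 2xx answer. */
    static int run(HttpClient client, List<URI> plan) throws IOException, InterruptedException {
        int succeeded = 0;
        for (URI uri : plan) {
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
            if (response.statusCode() / 100 == 2) {
                succeeded++;
            }
        }
        return succeeded;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed endpoint and request count; adapt both to your application.
        List<URI> plan = plan(List.of("http://localhost:8080/fruits"), 10_000);
        try {
            int succeeded = run(HttpClient.newHttpClient(), plan);
            System.out.println(succeeded + " of " + plan.size() + " requests succeeded");
        } catch (IOException e) {
            System.out.println("Could not reach the application: " + e.getMessage());
        }
    }
}
```

Requests are sent sequentially to keep the sketch short. A dedicated tool, or recorded production traffic, gives a more realistic mix of concurrency and parameters.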

The best results arise when the training run resembles a production run as closely as possible. However, whatever training method you employ, any cache produced will only be usable in production if you run with the same JVM and the same command line JVM options.

You can add extra jars at the end of the production run classpath but the initial segment must be the same as the classpath provided during training.
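
The classpath rule can be expressed as a prefix check: the production classpath must start with exactly the entries used during training, in the same order. The sketch below is only an illustrative pre-flight check that compares entries as plain strings; the JVM performs its own validation when it opens the cache, and the class ClasspathPrefixCheck and the jar names are hypothetical.

```java
import java.util.List;

/** Sketch: checks that the training classpath is a prefix of the production classpath. */
final class ClasspathPrefixCheck {

    /** True if production starts with all training entries, in the same order. */
    static boolean isUsableWith(List<String> trainingClasspath, List<String> productionClasspath) {
        int n = trainingClasspath.size();
        return productionClasspath.size() >= n
                && productionClasspath.subList(0, n).equals(trainingClasspath);
    }

    public static void main(String[] args) {
        List<String> training = List.of("app.jar", "lib/a.jar");

        // Extra jar appended at the end: the initial segment is unchanged.
        System.out.println(isUsableWith(training, List.of("app.jar", "lib/a.jar", "lib/extra.jar")));

        // Extra jar inserted at the front: the initial segment no longer matches.
        System.out.println(isUsableWith(training, List.of("lib/extra.jar", "app.jar", "lib/a.jar")));
    }
}
```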

At the moment of writing this article, you also need to deploy on the same CPU family and operating system. In upcoming versions which will include compiled code in the cache, the production hardware must implement the exact same CPU features as the hardware used for the training run. If the CPU features are not identical then compiled and stub code assets will be ignored (other cache assets will still be usable).

Remember to follow these basic constraints when generating the cache: same hardware, same Java version, same Operating System, and same JVM arguments.
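
One way to keep these constraints visible in a deployment pipeline is to record a small "fingerprint" of the training environment next to the cache file and compare it before the production run. The following is a hedged sketch of that idea, not something the JDK provides: RunFingerprintDemo and Fingerprint are hypothetical names, the os.arch property is much coarser than the CPU feature set the JVM actually checks, and options starting with -XX:AOT are filtered out on the assumption that they legitimately differ between training and production.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Sketch: compares the environment of a training run with that of a production run. */
final class RunFingerprintDemo {

    record Fingerprint(String javaVersion, String osName, String osArch, List<String> jvmOptions) {

        /** Captures the fingerprint of the JVM executing this code. */
        static Fingerprint current() {
            List<String> options = ManagementFactory.getRuntimeMXBean().getInputArguments().stream()
                    .filter(option -> !option.startsWith("-XX:AOT")) // assumed to differ by design
                    .toList();
            return new Fingerprint(
                    System.getProperty("java.version"),
                    System.getProperty("os.name"),
                    System.getProperty("os.arch"),
                    options);
        }

        /** Names of the constraints on which this run differs from the training run. */
        List<String> differencesFrom(Fingerprint training) {
            List<String> differences = new ArrayList<>();
            if (!javaVersion.equals(training.javaVersion)) differences.add("Java version");
            if (!osName.equals(training.osName)) differences.add("operating system");
            if (!osArch.equals(training.osArch)) differences.add("CPU architecture");
            if (!jvmOptions.equals(training.jvmOptions)) differences.add("JVM options");
            return differences;
        }
    }

    public static void main(String[] args) {
        // In a real pipeline the training fingerprint would be saved alongside the cache.
        Fingerprint training = Fingerprint.current();
        Fingerprint production = Fingerprint.current();
        System.out.println("Differences: " + production.differencesFrom(training));
    }
}
```

Options that only affect logging would need the same filtering treatment as the AOT flags. In any case the JVM's own checks remain authoritative: a mismatch it detects means cache assets are not used, whatever this pre-check reports.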

Should I start using AOT Cache in Java already?

The short answer is yes.

Whether your application already gets significantly faster, or you simply want to test it to help steer Leyden development towards your interests, you should start using the AOT cache now.

Note that you need at least JDK 25 to be able to use it. Performance gains are incremental with each new JDK release. The actual improvements that you can achieve using Leyden depend strongly on your application and how you use it.

Let’s see some examples. We are going to run them over JDK 26.

Heavy Mathematical Example

First we are going to use a benchmark application that runs heavy mathematical operations via a REST API. We are going to train this application twice to compare how different training affects performance on production.

This application makes use of an aot-jar from Quarkus which is optimized for Leyden and available since version 3.32.0.

We are going to use a training run that randomly calls the following urls:

/nqueens/16 : to calculate the nqueens problem with a 16 board size
/fibonacci/100 : to calculate fibonacci series with input 100
/nqueens : to calculate the nqueens problem with either 16 or 8 as the board size
/fibonacci : to calculate fibonacci with a random number between 1 and 100

The idea is to have a load that is partly random (as an API with real users would be) but has a preference over specific branches or loop unrolling sizes.

We will do a training with 1000 requests and a second training with 60 000 requests. That should represent how different training affects final performance. We are going to run the application on a Linux machine assigning 2 cores to our application.

Due to the kind of things Java 26 is storing in the AOT cache, we can theorize that there won’t be much difference between the different trainings when comparing startup time (startup of the application and opening the port), as most of the code run during initialization is run only once, so adding more requests to the training won’t improve that initialization time. This is something that may change in future developments as more assets are included in the AOT cache.

By using the Java AOT Cache Diagnostics Tool, we can compare the contents of the cache generated on each training.

1000 requests training	60 000 requests training

As expected, both trainings seem to have cached the same amount of metadata (slightly above 99% of classes used), because the code loaded into memory on both cases should be more or less the same (some timeout or runtime exception thrown may explain differences). This means that the startup time using any of the generated caches should be similar.

The training with 60 000 requests has many more methods that are profiled and compiled at a higher level because it had a longer time to profile and optimize. That should lead to better outcomes on the warmup time.

Anyway, we should notice an improvement on startup time compared to the regular Java deployment, because we have a lot of metadata, profile and linkage data and some heap data already cached during AOT. And as we can see on the following graph, time to first response (which includes initialization) is already cut in half.

The other interesting measurement is how much and for how long response times are disrupted during the early stages of application execution. There is always some small variation in response times even when an app is fully warmed up – often referred to as jitter.

However, during warmup the housekeeping work that the JVM has to do can significantly increase jitter. Individual responses may be delayed because they require the thread to execute one-off events like load or initialize a class, link a call site or field access site, or update profile data. Background JIT compilation will also steal CPU cycles, potentially pre-empting request handling in Java threads. Finally, early requests will mostly execute relatively slowly in the interpreter, while later requests will gradually respond more quickly as the JIT compiler delivers compiled code.

In theory, this is where longer training sessions should have a bigger impact. Better training results in more cached classes and heap objects needed in production, more pre-linking of calls and accesses, more method profile data to allow earlier and better informed compilation. So, we should see not only an improvement compared to the regular java version, but also a difference between the two trained caches.

The graph above shows individual response times for requests for each of the three deployments, Traditional Java (no AOT), AOT trained with 1000 requests, and AOT trained with 60 000 requests.

In all cases, the request rate is constant and within the peak capacity of the server. In all 3 cases the jitter slowly decays as the request count increases, eventually converging to a low, random variation.

However, it is also very clear that

There is a lot more JVM housework being performed in the non-AOT case than when using AOT, reaching peak performance later.
The well trained cache suffers less jitter, i.e. removes a lot more housework, than the weakly trained app.
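
To put a number on "when has the jitter settled", one simple approach is to slide a window over the recorded response times and look for the first request after which the spread inside every window stays below a tolerance. This is only one possible definition, sketched here for illustration; it is not the method used to produce the graphs in this article, and the class WarmupEstimator, the window size, the tolerance and the sample latencies are all made up.

```java
/** Sketch: estimates the request index at which response-time jitter has settled. */
final class WarmupEstimator {

    /** Spread (max minus min) of latencies in the half-open range [from, to). */
    static double spread(double[] latencies, int from, int to) {
        double min = latencies[from];
        double max = latencies[from];
        for (int i = from + 1; i < to; i++) {
            min = Math.min(min, latencies[i]);
            max = Math.max(max, latencies[i]);
        }
        return max - min;
    }

    /**
     * Returns the first index from which every full window of the given size has a
     * spread within the tolerance, or -1 if the series never settles.
     */
    static int warmedUpAt(double[] latencies, int window, double tolerance) {
        if (window < 1) {
            throw new IllegalArgumentException("window must be at least 1");
        }
        int candidate = -1;
        for (int start = 0; start + window <= latencies.length; start++) {
            if (spread(latencies, start, start + window) <= tolerance) {
                if (candidate < 0) {
                    candidate = start; // first calm window of the current streak
                }
            } else {
                candidate = -1; // a noisy window resets the streak
            }
        }
        return candidate;
    }

    public static void main(String[] args) {
        // Hypothetical response times in milliseconds, not measurements from this article.
        double[] latencies = {90, 40, 55, 20, 12, 10, 11, 10, 12, 11};
        System.out.println("Settled at request index " + warmedUpAt(latencies, 3, 3.0));
    }
}
```

Running the same estimator over the non-AOT run and over each AOT run gives a single comparable figure per deployment, which complements the visual impression from the response-time graphs.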

Note that JDK 26 does store training data in the cache, but does not store compiled code. This means that future versions of the JDK will show a much larger difference between weak and strong training regimes.

Simple REST API

Now, we are going to do the same with a simple REST API application using Quarkus that connects to a database to extract data. This time we employ the Leyden premain JVM, which caches compiled code as well as all the other AOT cache assets mentioned earlier.

We use a single training run with 10 000 requests, executing a test that calls the endpoint /fruits repeatedly. In contrast with the previous example, in this case we are not going to observe as much advantage from speculative compilation because the code is simpler and the entrypoint is always called with the same parameters. But we should still see an improvement all the same.

Let’s take a look at the time to first response to see if it is improved by Leyden AOT:

Startup time in this example is slower than in the previous example because we have to initialize connections to the database and load the database model.

Now, let’s take a look at the response times and see if the warmup time is also improved thanks to the AOT cache.

Both runs suffer jitter during the first requests, at which point most of the housekeeping work is completed. The dramatic drop off in jitter for the AOT run around request 45 indicates that at this point almost all loading, initialization and linking costs have been met and the necessary compiled code has been delivered. By contrast the non-AOT run is still suffering jitter even after 100 requests i.e. it has still not warmed up to reach peak performance.

These are only a couple of examples that showcase how Leyden is already improving your startup and warmup time.

How far Leyden can help your application can only be discovered by trying it.

The post How is Leyden improving Java Performance? Part 2 of 3 appeared first on foojay.

How is Leyden improving Java Performance? Part 1 of 3

María Arias de Reyna Domínguez — Tue, 17 Mar 2026 12:00:45 +0000

Table of Contents

A Brief History of Java Performance

Why Java takes time to reach peak performance

Housekeeping considered harmful

Leyden Project ‘premain’ Experiment

Training and Production Runs

In this series of 3 blog posts we will explain how OpenJDK project Leyden is helping to improve a specific area of performance where Java has notably lagged behind other languages i.e. application ‘startup’, ‘warmup’, and ‘initial footprint’.

Part 1 explains what those terms mean and why Java faces challenges in matching the behaviour of other languages. It then provides an overview of what Leyden has done to improve startup and warmup in existing JDK releases and what is planned for upcoming releases.

Part 2 describes how to use the new capabilities offered by Leyden and presents test results which show that very significant progress has already been made and is set to continue.

A Brief History of Java Performance

Java has been one of the most popular object-oriented programming languages for decades. Its success relies heavily on the fact that it offers a portable, managed runtime that makes it easy and safe to resolve many common programming challenges. In particular, Java was the first portable language to make it straightforward for programmers to deliver multi-threaded applications which allocate and manage storage at runtime without risk of invalid memory accesses.

The fact that Java remains popular still surprises some programmers, given that it belongs to the family of dynamic languages that, most notably, includes Lisp, Smalltalk and Self. Dynamic languages allow their code base to be incrementally defined as the program executes. That code base is often implemented using a language-specific virtual machine. Dynamic languages were traditionally executed by interpreting either the source code or an intermediate bytecode derived from the source. This often caused lower performance than native-compiled, non-dynamic languages.

However, modern Java runtimes rely on powerful ‘just-in-time’ (JIT) compilers to translate bytecode to native machine code at runtime. JIT compilation, a technique originally tried in Smalltalk nearly 40 years ago, has improved Java performance by orders of magnitude from the early days of an interpreter-only runtime. The use of runtime execution profiling supports feedback directed optimization and speculative optimization. This has allowed Java JIT compilers to achieve peak performance that far exceeds what can be achieved with programs that are compiled ahead-of-time (AOT).

Why Java takes time to reach peak performance

The downside of dynamic class loading and JIT compilation is that a Java runtime takes some time to achieve this impressive peak performance.

When a new Java application is launched, it is normally a ‘cold start’. Details of all the classes and methods the application needs to use are only available in a compact bytecode representation, stored on disk either in application supplied class files or embedded in the Java platform’s jmod files. The Java Virtual Machine (JVM) has to parse and unwrap this bytecode, constructing its own ‘metadata’ model of the class and method base, one that the interpreter and compiled code can efficiently operate over. It also has to set the base state of each loaded class, running Java ‘static init’ code to populate the class’s static fields, before it can execute any of the class’s methods.
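
The ‘static init’ step is easy to observe from plain Java, because a class is initialized lazily, on its first active use. The small sketch below records the order of events to show the static initializer running in the middle of application code; the class names LazyInitDemo and Config are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: shows that a class's static initializer runs on first use, not at JVM startup. */
final class LazyInitDemo {

    /** Ordered log of what happened, filled in by main and by Config's initializer. */
    static final List<String> EVENTS = new ArrayList<>();

    static final class Config {
        // Assigned in the static block, so this is not a compile-time constant and
        // reading it forces the class to be initialized first.
        static final String GREETING;

        static {
            EVENTS.add("Config initialized");
            GREETING = "hello";
        }
    }

    public static void main(String[] args) {
        EVENTS.add("main started");
        String greeting = Config.GREETING; // first active use triggers the static initializer
        EVENTS.add("read " + greeting);
        System.out.println(EVENTS);
    }
}
```

The initializer runs between the two events logged by main, which is exactly the kind of one-off work, interleaved with application code, that delays early requests.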

In addition, the JVM has to perform dynamic linkage. When compilation or execution of a Java method first encounters a call (invoke bytecode) or a data access (get/putfield bytecode) the JVM has to link that call or data access site. That involves replacing references to the target class and method/field, which occur as symbol names in the bytecode, with a direct memory reference. This identifies first the target metadata class, and then the target metadata method or field. If the target class has not yet been encountered during execution, this linking step may trigger further bytecode loading, parsing, and class initialization.
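
As a Java-level analogy of this lazy linking, the sketch below models a call site that starts out holding only symbolic names and resolves them into a direct method handle the first time it is invoked, reusing the handle afterwards. It is an illustration of the idea only and does not reflect how the JVM implements linkage internally; the class LazyCallSite is hypothetical.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

/** Sketch: a call site that resolves its symbolic target on first use and caches it. */
final class LazyCallSite {
    private final String className;   // symbolic reference, as it would appear in bytecode
    private final String methodName;
    private final MethodType type;
    private MethodHandle target;      // direct reference, null until the site is linked
    int resolutions = 0;              // how many times linking work was actually done

    LazyCallSite(String className, String methodName, MethodType type) {
        this.className = className;
        this.methodName = methodName;
        this.type = type;
    }

    Object invoke(Object... args) {
        try {
            if (target == null) {
                // Linking: may itself trigger loading and initialization of the owner class.
                Class<?> owner = Class.forName(className);
                target = MethodHandles.publicLookup().findStatic(owner, methodName, type);
                resolutions++;
            }
            return target.invokeWithArguments(args);
        } catch (Throwable t) {
            throw new IllegalStateException("linkage or invocation failed", t);
        }
    }

    public static void main(String[] args) {
        LazyCallSite site = new LazyCallSite("java.lang.Math", "max",
                MethodType.methodType(int.class, int.class, int.class));
        System.out.println(site.invoke(3, 7));   // first call pays the linking cost
        System.out.println(site.invoke(10, 2));  // later calls reuse the resolved handle
        System.out.println("resolutions: " + site.resolutions);
    }
}
```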

The JVM normally starts off executing Java methods in the interpreter. Of course, it could always execute native code, compiling the Java method bytecode either immediately at load or lazily at first call. However, compilation takes time to complete so it is normally better done in the background while proceeding to interpret. Indeed, JIT compilation frequently pays off more when done selectively. Methods that only get called once or twice can take more cycles to compile than to simply interpret the bytecode.
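
The trade-off can be made concrete with a toy cost model. If interpreting one call costs i, running the compiled version costs c, and compiling costs C once, then compilation pays off only when C + n·c < n·i, that is, after more than C / (i − c) calls. The sketch below computes that break-even point; the class CompileBreakEven and all cost figures are invented for illustration, and the model ignores the fact that real compilation runs in the background.

```java
/** Sketch: toy cost model for "interpret every call" versus "compile once, then run compiled". */
final class CompileBreakEven {

    /**
     * Smallest number of calls for which compiling is cheaper overall than interpreting,
     * or Long.MAX_VALUE if compiled code is not faster per call.
     */
    static long breakEvenCalls(double compileCost, double interpretedCostPerCall,
                               double compiledCostPerCall) {
        double savingPerCall = interpretedCostPerCall - compiledCostPerCall;
        if (savingPerCall <= 0) {
            return Long.MAX_VALUE; // compilation never pays for itself
        }
        // Need n * savingPerCall > compileCost, so take the next integer above the ratio.
        return (long) Math.floor(compileCost / savingPerCall) + 1;
    }

    public static void main(String[] args) {
        // Made-up costs in arbitrary units: compiling is expensive, compiled calls are cheap.
        long calls = breakEvenCalls(1000, 10, 1);
        System.out.println("Compilation pays off from call number " + calls);
    }
}
```

With these invented numbers a method called once or twice is far below the break-even point, which is why the JVM waits for evidence that a method is hot before spending compiler time on it.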

Furthermore, without runtime execution profile data as input, the compiler is unable to make informed, feedback-directed optimizations that significantly improve performance of the compiled code. Most importantly, it cannot simplify the compiled code by speculating that previous execution patterns will continue, replacing code that lies on untaken ’cold’ branches with traps. Speculative compilation, an optimization first used in the Self compiler over 30 years ago, reduces both the size and the complexity of bytecode that feeds into a specific compilation. That, in turn, enables deep inlining of method calls and offers the possibility to identify many more derived optimizations. The rare case where a trap on a cold branch gets executed is handled by deoptimizing i.e. jumping back into the interpreter and recompiling the method with an updated branch profile.

Housekeeping considered harmful

During early stages of application execution, the JVM housekeeping overheads listed above are at their highest. Class loading and initialization, class linking, and recording of method execution profile data occur frequently as side effects of execution, for both application and JDK runtime methods, impeding direct forward progress of the application. Method compilation proceeds in dedicated, background compiler threads, but this still steals CPU cycles, once again, impeding application progress.

The impedance of JVM housekeeping work gradually decreases, as more and more of the required JDK code and application code is gradually linked into the runtime. At the same time delivery of compiled code improves application execution speed incrementally.

After some time, a steady state is reached where most or all classes are loaded and linked, most or all methods have been profiled, and all ‘hot’ methods have been compiled with highly efficient code. Very occasionally variation in input data or a phase change in program behaviour drives the application down a cold path, triggering deoptimization and incurring extra JVM overheads. However, by and large, applications mostly warm up and continue to run with steady peak performance.

Leyden Project ‘premain’ Experiment

Project Leyden has been experimenting with reducing the impedance of JVM housekeeping tasks in the ‘premain’ branch of the project repository. The observation that drives the Leyden premain experiment is that, most of the time, the housekeeping operations that occur during an application run involve doing exactly or almost exactly the same work with the same result, certainly in the early stages where the impedance is high. On every run a lot of the same bytecode gets loaded and linked, the same classes get initialized, the same methods turn out to be hot, and end up getting compiled with the same or very similar profile information.

This is especially true for the JDK runtime code that runs before entering the application main method, likewise for JDK library code that the application calls out to. The JVM will always load base classes like java.lang.Object, java.lang.Class, or java.lang.String. The same String instances, hard coded as literals in JDK methods, are added to the heap on every single run. Container classes like List and Hashtable are commonly reused for the same purposes.

JDK classes are fixed for any given release so their class, method and field metadata will always be the same and they will always cross-reference each other (i.e. be linked) in exactly the same way. In fact the Leyden premain branch gets its name from its original focus, which was optimizing this JDK execution that happens before entering application main.

The idea of profiting from this identity of JDK metadata across runs is not new. Since JDK13 Class Data Sharing (CDS) has been able to optimize away class loading and bytecode parsing for JDK classes by storing the JVM’s metadata model of the JDK classes in a CDS archive, allowing it to be reloaded ’oven-ready’ on subsequent runs.

That version of CDS provided an effective, albeit limited, warm-start capability for the JDK, halving the time taken for the JDK to start up i.e. complete JDK initialization and enter the application main routine. CDS also helped application warmup by lowering initial costs involved in callouts to JDK library code.

With application classes there is no strong guarantee that the same classes will be present in the same format between one run and the next. Or that classes loaded and used on one run will always be loaded and used in the same way on subsequent runs. However, so long as the same jars appear in the classpath and the class bytecode is loaded without runtime-specific agent transformations, then it is possible and, for many classes, quite probable that saved metadata will be reusable.

More recent versions of CDS have supported saving and restoring metadata for application classes via a dynamic CDS archive, allowing the JVM to bypass loading and bytecode parsing costs for those classes on subsequent runs, improving both application startup and warmup.

Leyden’s premain branch builds on this success but it is addressing a bigger prize than just archived metadata. The broader internal JVM state — not just metadata but static field data, linkage data, method profiles, compiled code — which is slowly constructed during warmup, may vary depending on precisely what happens on each run. However, most of what is created on one run, if it could be saved in an archive – as CDS currently does with metadata, ought to be reusable on a subsequent run, short circuiting the housekeeping overheads normally incurred to create it.

Even if some saved state might turn out not to be useful, because, say, a class was not referenced or a method not called in the subsequent run, the ability to reuse some of the state should still pay off. The cost of reloading the required state can be made much lower than the cost of recreating, meaning the application can reach peak performance earlier, with less impedance from the JVM. The more reusable state that can be saved the greater the reduction in impedance.

Training and Production Runs

So, the basic idea behind Leyden is to run your application twice:

Training Run: in which we cache the metadata, the profiling statistics, some heap data, compiled code,...
Production Run: loads the previously (ahead of time) cached information, so the run starts hot. This is the “real” run in which we make use of our app.

Of course, this only makes sense if the training run accurately represents the production run.

To achieve this, we need to respect the following constraints:

Same Hardware: Or the compiled code may not be able to run, and the optimizations made may even be against performance in our production run.
Same Java version and source code: If we change the source code, anything cached related to the source code becomes stale and useless.
Same Operating System Family: There are pieces of the JVM that behave differently on Linux, Windows or macOS. We can’t just reuse our cached information if we change it.
Same JVM options (mostly): We could maybe change some JVM options (like use a different garbage collector). But then, profiling statistics that we cache, and information about how the application behaves, may no longer be valid to our new configuration. Better not to play with these settings.
[optional] No Custom Classloaders: The cache will ignore (for now) the classes loaded with a custom classloader. This means that part of the application will not be hot when run for the second time.
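
The constraints above boil down to comparing the environment of the training run with that of the production run. A minimal sketch of such a comparison follows, assuming a hypothetical `RunEnvironment` record that only looks at three standard system properties. It is not part of the JDK or of Leyden, and it only illustrates the idea rather than how the JVM itself validates a cache.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the JDK: captures a few properties that
// should match between the training run and the production run.
record RunEnvironment(String osName, String osArch, String javaVersion) {

    // Snapshot of the JVM this code is currently running on.
    static RunEnvironment current() {
        return new RunEnvironment(
                System.getProperty("os.name"),
                System.getProperty("os.arch"),
                System.getProperty("java.version"));
    }

    // Lists every property that differs from the training environment.
    List<String> mismatches(RunEnvironment training) {
        List<String> diffs = new ArrayList<>();
        if (!osName.equals(training.osName)) diffs.add("os.name");
        if (!osArch.equals(training.osArch)) diffs.add("os.arch");
        if (!javaVersion.equals(training.javaVersion)) diffs.add("java.version");
        return diffs;
    }

    public static void main(String[] args) {
        // Illustrative values for the training environment.
        RunEnvironment training = new RunEnvironment("Linux", "aarch64", "25");
        System.out.println("Mismatches: " + current().mismatches(training));
    }
}
```

An empty list means the three properties match; it says nothing about JVM options or custom classloaders, which would need their own checks.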

Some of the AOT improvements developed in the Leyden branch have already been made available in the latest JDK LTS version at the moment of writing this article (25). But the plan for subsequent releases is that more features will be migrated, more things will be cached, and the performance gains will get better and better. The performance gains strongly depend on your app usage, the JDK version you are using, and how good your training run is.

In the next post we will explain how to use the new AOT capabilities that are available in JDK 25. We will also present test results which show that very significant progress has already been made and is set to continue with capabilities already present in JDK 26 and later.

The post How is Leyden improving Java Performance? Part 1 of 3 appeared first on foojay.

BoxLang 1.11.0 Release

Cristobal Escobar — Tue, 17 Mar 2026 09:40:04 +0000

Table of Contents

What's New in 1.11.0

Performance Wave — 15+ Targeted Runtime Speedups
Concurrency & Lock Safety — Critical Fix
DateTime Casting Reliability
enforceUDFTypeChecks Configuration Setting
getTickCount() — Nanosecond & Second Precision
New BIF: ExecutorDelete()

Core Runtime Updates

Class System Improvements
Thread & Execution Fixes
Query System
String & Type Improvements
XML Handling
Transaction & Stored Procedures

MiniServer Runtime Updates

.boxlang.json Convention
Undertow / Socket / WebSocket Options
Logging Directory Output
Undertow Upgraded to 2.3.23.Final

Web Support Updates

Pre-Request Interception for Request Rerouting

Developer Experience

Enhanced --bx-printast Tooling
SOAP Client — Binary and Map Type Support
Session Configuration in boxlang.json
Improved CLI Error Messages

Notable Bug Fixes
Configuration Updates Summary
Dependency Updates
Upgrading
Join the BoxLang Community

We're proud to announce BoxLang 1.11.0, a highly focused performance and stability release that delivers measurable speed improvements across every BoxLang application, with zero code changes required. The team invested deeply in bytecode generation, class loading, lock management, and type casting to produce one of the most impactful runtime optimization releases to date. Alongside the performance wave, this release resolves critical concurrency bugs, hardens DateTime handling, and ships powerful new developer tooling.

🚀 What's New in 1.11.0

You can find the full release notes here:
https://boxlang.ortusbooks.com/readme/release-history/1.11.0

⚡ Performance Wave — 15+ Targeted Runtime Speedups

BoxLang 1.11.0 includes over 15 targeted performance improvements spanning bytecode compilation, runtime execution, memory management, and concurrency. Every BoxLang application benefits immediately.

Bytecode & Compilation
The compiler has been significantly tightened:

Optimized bytecode generation avoids unnecessary casts during value operations
Cached isFinal and isAbstract flags at compile time instead of computing them at runtime
Reworked FQN parsing eliminates expensive regex operations on every class lookup
Improved ClassInfo lookup during compilation using better caching strategies
Optimized ClassLocator cache key generation via improved hashCode() creation
Runtime Execution
Core runtime operations are noticeably faster:

// All of these are faster in 1.11.0 — no code changes needed
result = myClass.doWork()           // Faster class construction via this.get()
found  = myArray.find( "value" )    // arrayFind optimized, avoids stream overhead
flag   = isBoolean( "true" )        // Faster boolean string parsing
someBif( arg1, arg2 )               // Arg/return type casting via keys, not reflection

Memory & Concurrency

Cached closest variables scope reference in function contexts
Cached web request config instead of re-resolving per request
Case-insensitive string matching uses an optimized algorithm
Reduced toRealPath() calls that were silently adding overhead on every file operation
Simplified constructor path for Box Classes reduces object creation overhead
Removed function inner classes, reducing class loading and GC pressure
Avoided Map.containsValue() in UDF invocation (linear scan → constant time)
The cumulative effect is meaningful: applications under load will see reduced latency, lower GC pressure, and better throughput — all with zero migration effort.
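
The "linear scan → constant time" item can be illustrated in plain Java. This is a minimal sketch of the general technique, not BoxLang's actual code: the class `ValueLookupDemo` and its field names are made up for illustration. `Map.containsValue()` has to walk every entry, while a companion `HashSet` of the values answers the same question with a single hash lookup.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: contrasts a linear value scan with a hashed lookup.
final class ValueLookupDemo {
    private final Map<String, String> aliases;
    // Index of the map's values, built once up front.
    private final Set<String> aliasValues;

    ValueLookupDemo(Map<String, String> aliases) {
        this.aliases = new HashMap<>(aliases);
        this.aliasValues = new HashSet<>(this.aliases.values());
    }

    // O(n): visits every entry until a matching value is found.
    boolean containsValueByScan(String value) {
        return aliases.containsValue(value);
    }

    // Expected O(1): a single hash lookup in the value index.
    boolean containsValueByIndex(String value) {
        return aliasValues.contains(value);
    }

    public static void main(String[] args) {
        ValueLookupDemo demo = new ValueLookupDemo(Map.of("a", "x", "b", "y"));
        System.out.println(demo.containsValueByScan("x"));
        System.out.println(demo.containsValueByIndex("x"));
    }
}
```

The trade-off is extra memory for the index and the need to keep it in sync if the map changes, which is why the sketch builds both structures once in the constructor.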

🔒 Concurrency & Lock Safety — Critical Fix

Two critical bugs in the exclusive lock system have been resolved. Before 1.11.0, exclusive locks could occasionally allow more than one thread into a supposedly exclusive section under high load (BL-2203, BL-2205).

// This critical section is now truly exclusive under concurrent load
lock name="processPayment_#orderId#" type="exclusive" timeout="30" {
    // Only ONE thread will be here at a time — guaranteed in 1.11.0
    if ( !paymentProcessed( orderId ) ) {
        processPayment( orderId )
    }
}

Lock storage has also been improved (BL-2201) for better performance and memory efficiency. If you rely on exclusive locks for payment processing, inventory management, or any critical section — this is an important upgrade.
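
For readers who think in Java terms, a named exclusive lock with a timeout can be sketched as below. This assumes a simple map of `ReentrantLock` instances keyed by name; the class `NamedLocks` is hypothetical and is not how BoxLang implements its lock storage.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical sketch of named exclusive locks; not BoxLang's implementation.
final class NamedLocks {
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Runs the body while holding the lock registered under the given name.
    <T> T withExclusiveLock(String name, long timeoutSeconds, Supplier<T> body) {
        ReentrantLock lock = locks.computeIfAbsent(name, key -> new ReentrantLock());
        try {
            if (!lock.tryLock(timeoutSeconds, TimeUnit.SECONDS)) {
                throw new IllegalStateException("Timed out waiting for lock " + name);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for lock " + name, e);
        }
        try {
            return body.get();
        } finally {
            lock.unlock();
        }
    }

    // Eight threads increment a plain counter 1000 times each under one lock.
    static int demo() {
        NamedLocks namedLocks = new NamedLocks();
        int[] counter = {0};
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < 1000; n++) {
                    namedLocks.withExclusiveLock("counter", 30, () -> counter[0]++);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If the section were not truly exclusive, increments would be lost and the final count would fall short of the number of increments performed.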

🗓️ DateTime Casting Reliability

A comprehensive sweep of DateTime casting fixes ensures robust date handling across all common formats and edge cases:

// All of these now work reliably in 1.11.0
date1 = createDateTime( "01-31-2026 23:59:  59" )          // BL-2189
date2 = createDateTime( "9-30-2010" )                     // BL-2222
date3 = parseDateTime( "2026-01-31 00:00: 00.000" )        // ODBC Timestamp (BL-2143)

// Query of Queries with ODBC Timestamp columns now compiles correctly
qoq = queryExecute(
    "SELECT * FROM myQuery WHERE dateCol > :dt",
    { dt : now() },
    { dbtype : "query" }
)  // BL-2144

// DateTimeCaster now handles ODBC Date/Time formats
cast1 = dateTimeFormat( odbcDate, "yyyy-mm-dd" )          // BL-2188

🆕 `enforceUDFTypeChecks` Configuration Setting

A new runtime setting allows you to skip UDF argument and return type validation — useful for trusted high-performance codebases:

// boxlang.json
{
    "enforceUDFTypeChecks": false
}

When false, BoxLang skips argument type validation and return type casting on function calls — similar to how the Java compiler performs generic type erasure. This can improve performance but removes the safety net of runtime type checks.
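
The comparison with Java's generic type erasure can be made concrete with a small Java sketch. The class name `ErasureDemo` is only illustrative. Generic type arguments are not checked at runtime, so a wrongly typed element can slip into a list and the error only surfaces later, when the value is read as the declared type.

```java
import java.util.ArrayList;
import java.util.List;

final class ErasureDemo {

    // Generic type arguments are erased, so both lists share one runtime class.
    static boolean sameRuntimeClass() {
        List<String> names = new ArrayList<>();
        List<Integer> numbers = new ArrayList<>();
        return names.getClass() == numbers.getClass();
    }

    // An unchecked cast lets a wrongly typed element in without any runtime
    // check; the failure only shows up when the element is read as a String.
    @SuppressWarnings("unchecked")
    static boolean failsOnlyOnRead() {
        List<String> names = new ArrayList<>();
        ((List<Object>) (List<?>) names).add(42);
        try {
            String first = names.get(0); // ClassCastException is thrown here
            return first == null;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass());
        System.out.println(failsOnlyOnRead());
    }
}
```

This is the kind of delayed failure to keep in mind when runtime type checks are switched off: the mistake is still there, it is just detected later or not at all.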

⏱️ `getTickCount()` — Nanosecond & Second Precision

getTickCount() now supports nano and second units alongside the existing millisecond support:

// Micro-benchmark with nanosecond precision
start   = getTickCount( "nano" )
doExpensiveWork()
elapsed = getTickCount( "nano" ) - start
println( "Elapsed: #elapsed# ns" )

// Coarse timing in seconds
start   = getTickCount( "second" )
sleep( 2000 )
elapsed = getTickCount( "second" ) - start
println( "Elapsed: #elapsed# seconds" )  // 2

🗑️ New BIF: `ExecutorDelete()`

The missing ExecutorDelete() BIF has been added, completing the executor lifecycle management API. Previously, shutting down an executor did not remove it from the executor registry (BL-2168), causing issues when recreating executors with the same name.

// Create an executor
myExecutor = executorNew( "myPool", "fixed", 10 )

// Submit work
future = executorSubmit( myExecutor, () => doWork() )
future.get()

// Full cleanup — now properly removes it from the registry
executorDelete( "myPool" )
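
The difference between shutting an executor down and deleting it from a registry can be sketched in Java. This assumes a hypothetical `ExecutorRegistry` class backed by a map; it mirrors the behaviour described above and is not BoxLang's internal code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical registry of named executors, for illustration only.
final class ExecutorRegistry {
    private final Map<String, ExecutorService> executors = new ConcurrentHashMap<>();

    // Registers a new fixed pool; fails if the name is still taken.
    ExecutorService create(String name, int threads) {
        ExecutorService created = Executors.newFixedThreadPool(threads);
        if (executors.putIfAbsent(name, created) != null) {
            created.shutdown();
            throw new IllegalStateException("Executor already registered: " + name);
        }
        return created;
    }

    // Stops the pool but leaves the name registered.
    void shutdown(String name) {
        ExecutorService executor = executors.get(name);
        if (executor != null) {
            executor.shutdown();
        }
    }

    // Stops the pool and frees the name for reuse.
    boolean delete(String name) {
        ExecutorService removed = executors.remove(name);
        if (removed == null) {
            return false;
        }
        removed.shutdown();
        return true;
    }

    boolean isRegistered(String name) {
        return executors.containsKey(name);
    }

    public static void main(String[] args) {
        ExecutorRegistry registry = new ExecutorRegistry();
        registry.create("myPool", 2);
        registry.shutdown("myPool");
        System.out.println("Still registered: " + registry.isRegistered("myPool"));
        registry.delete("myPool");
        System.out.println("Still registered: " + registry.isRegistered("myPool"));
    }
}
```

With only `shutdown`, recreating an executor under the same name fails because the stale entry is still present; `delete` removes that entry.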

🤖 Core Runtime Updates

🏗️ Class System Improvements

Super class loading improved to handle complex inheritance hierarchies reliably (BL-2211)
Abstract class enforcement relaxed — abstract classes are no longer required to implement all interface methods (BL-2251), matching Java and CFML semantics
Typed array returns no longer throw NPE when a class is instantiated via a different invocation path (BL-2237)
Implicit accessors now generate the correct return type in method signatures instead of always using any (BL-2195)
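
The Java semantics referred to in the abstract class item look like this in plain Java. The type names are made up for the example. An abstract class may leave interface methods unimplemented, and the first concrete subclass has to provide them.

```java
interface Shape {
    double area();
    String name();
}

// An abstract class may implement only part of the interface.
abstract class NamedShape implements Shape {
    @Override
    public String name() {
        return getClass().getSimpleName();
    }
}

// The first concrete subclass must supply whatever is still missing.
final class Square extends NamedShape {
    private final double side;

    Square(double side) {
        this.side = side;
    }

    @Override
    public double area() {
        return side * side;
    }

    public static void main(String[] args) {
        Shape square = new Square(3);
        System.out.println(square.name() + " area: " + square.area());
    }
}
```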

🧵 Thread & Execution Fixes
Duplicate bytecode methods no longer generated in edge cases (BL-2207)
Incompatible stack heights when not assigning new Foo() resolved (BL-2213)
Illegal exception table range in class files fixed (BL-1916)
Parser concurrency issue in LSP fixed when getting cache size (BL-2253)

📊 Query System
QueryNew() and queryAddRow() now properly validate column types (BL-2247)
distinct(col) no longer confused with a function name in QoQ (BL-2221)
QoQ with ODBC Timestamp format columns now compiles correctly (BL-2144)
Query column scope no longer found in loops for assignment (BL-2208), fixing variable scoping edge cases

🔤 String & Type Improvements
quotedValueList() now correctly wraps values in single quotes per CFML spec (BL-2185)
println() can now be called with no arguments to output an empty line — no more println( "" ) workaround (BL-2200)
compareTo() date member method no longer incorrectly attaches to zero-valued BigDecimal (BL-2166)

🌐 XML Handling
Deleting a non-existent key from XMLAttribute no longer throws an error (BL-2231)
XMLChildren now updates correctly in all mutation cases (BL-2240)
WDDX now properly escapes special characters in attribute values (BL-2216)

🔐 Transaction & Stored Procedures
Transaction end action no longer throws an error when a stored procedure was executed within the transaction (BL-2157)
Transaction action attribute is now case-insensitive (BL-2238)

📡 MiniServer Runtime Updates

📁 `.boxlang.json` Convention

The MiniServer now automatically detects and loads a .boxlang.json file from the current working directory, merging it with the base BoxLang configuration (BL-2218):

# Start the server — .boxlang.json is automatically picked up
$ boxlang server start

// .boxlang.json — project-level configuration, committed to source control
{
    "enforceUDFTypeChecks": false,
    "defaultDatasource": "mydb"
}

This makes project-level BoxLang configuration portable and self-contained — ideal for containerized deployments and team environments.
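
Conceptually, merging a project file into a base configuration can be sketched as a map merge. The Java sketch below assumes a shallow merge in which project-level keys win on conflict; the release notes do not spell out the exact merge rules, so treat that precedence, the `ConfigMerge` class, and the `timezone` key as assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative shallow merge; the real merge rules may differ.
final class ConfigMerge {

    // Starts from the base settings and lets project settings override them.
    static Map<String, Object> merge(Map<String, Object> base, Map<String, Object> project) {
        Map<String, Object> merged = new LinkedHashMap<>(base);
        merged.putAll(project);
        return merged;
    }

    public static void main(String[] args) {
        // "timezone" is a made-up key standing in for any base setting.
        Map<String, Object> base = Map.of("enforceUDFTypeChecks", true, "timezone", "UTC");
        Map<String, Object> project = Map.of("enforceUDFTypeChecks", false, "defaultDatasource", "mydb");
        System.out.println(merge(base, project));
    }
}
```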

⚙️ Undertow / Socket / WebSocket Options

You can now tune Undertow, socket, and WebSocket low-level options directly from miniserver.json:

{
    "undertow": {
        "ioThreads": 8,
        "workerThreads": 64,
        "bufferSize": 16384
    },
    "socket": {
        "tcpNoDelay": true,
        "reuseAddress": true
    },
    "websocket": {
        "maxFrameSize": 65536,
        "maxTextMessageSize": 65536
    }
}

📂 Logging Directory Output

The MiniServer now logs the logging directory path during startup (BL-1342) — a small but welcome quality-of-life improvement:

[BoxLang] MiniServer starting...
[BoxLang] Logging directory: /home/app/.boxlang/logs
[BoxLang] Server started on http://localhost:8080

🔄 Undertow Upgraded to 2.3.23.Final

The MiniServer now runs on Undertow 2.3.23.Final, bringing the latest HTTP server fixes and security patches.

🌐 Web Support Updates

🔀 Pre-Request Interception for Request Rerouting

A new interception point fires before onRequestStart, enabling interceptors to reroute requests before the application lifecycle begins (BL-2164). This unlocks powerful request gateway patterns:

A/B routing and feature flags
Maintenance mode bypasses
Multi-tenant request routing
Authentication redirects

All handled before any application overhead kicks in.
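
The rerouting idea can be sketched in Java as a chain of functions that each get a chance to rewrite the requested path before anything else runs. The `PreRequestPipeline` class and the maintenance-mode rule are hypothetical and do not reflect BoxLang's interceptor API.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical pre-request pipeline; not BoxLang's interceptor API.
final class PreRequestPipeline {
    private final List<UnaryOperator<String>> interceptors;

    PreRequestPipeline(List<UnaryOperator<String>> interceptors) {
        this.interceptors = List.copyOf(interceptors);
    }

    // Each interceptor may replace the path; the application only sees the result.
    String resolve(String requestedPath) {
        String path = requestedPath;
        for (UnaryOperator<String> interceptor : interceptors) {
            path = interceptor.apply(path);
        }
        return path;
    }

    // Example rule: send everything except /admin to a maintenance page.
    static UnaryOperator<String> maintenanceMode(boolean enabled) {
        return path -> (enabled && !path.startsWith("/admin")) ? "/maintenance" : path;
    }

    public static void main(String[] args) {
        PreRequestPipeline pipeline = new PreRequestPipeline(List.of(maintenanceMode(true)));
        System.out.println(pipeline.resolve("/orders"));
        System.out.println(pipeline.resolve("/admin/login"));
    }
}
```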

🛠️ Developer Experience

🌳 Enhanced `--bx-printast` Tooling

The --bx-printast CLI flag now supports file paths and standard input piping (BL-2187), making it far more useful for debugging parser output and build tooling integration:

# Print AST for a specific file
boxlang --bx-printast /path/to/MyClass.bx

# Pipe source code directly
echo 'result = 1 + 2' | boxlang --bx-printast

# Integrate with editors and build pipelines
cat MyComponent.bx | boxlang --bx-printast | jq '.body[0]'

🧩 SOAP Client — Binary and Map Type Support

The SOAP client now supports binary data and map/struct types for both requests and responses. It also allows you to call service methods directly without going through invoke():

ws = soap( "http://example.com/DataService?wsdl" )

// Send binary data
result = ws.uploadDocument( {
    name : "report.pdf",
    data : fileReadBinary( "/reports/annual.pdf" )  // Binary now supported
} )

// Send map/struct data
result = ws.updateRecord( {
    id       : 123,
    metadata : { region : "US", tier : "premium" }  // Map/Struct now supported
} )

🔧 Session Configuration in `boxlang.json`

Two previously missing session configuration settings are now supported (BL-1859):

{
    "sessionManagement": true,
    "sessionCluster": false
}

📋 Improved CLI Error Messages

CLI error messages now provide clearer context and actionable information when BoxLang scripts fail (BL-2212).

🐛 Notable Bug Fixes

Ticket	Summary
BL-2203	Exclusive locks sometimes allowed multiple threads into the locked section
BL-2205	cflock race condition under high concurrency
BL-2189	Can't cast 01-31-2026 23:59: 59 to a DateTime
BL-2143	DateTime Default ODBC Timestamp format was incorrectly quoted
BL-2157	Transaction end threw error when a stored procedure was executed within
BL-2165	getCurrentTemplatePath() didn't work inside a catch block
BL-2196	ENV secrets expand issue on Docker images due to *_FILE greediness
BL-2206	Parser error with extra pound signs
BL-2217	Module public remote class requests did not fire Application lifecycle events
BL-2236	form, url, and CGI scopes incorrectly scope-hunted during assignment
BL-2242	Null in switch statement threw error
BL-2251	Abstract class incorrectly required to implement all interface methods

🔧 Configuration Updates Summary

Setting	Description
`enforceUDFTypeChecks`	New boolean in `runtime` to disable UDF argument/return type validation
`sessionManagement`	Enable/disable session management globally in `boxlang.json`
`sessionCluster`	Enable distributed session clustering in `boxlang.json`
`.boxlang.json`	MiniServer now auto-loads this file from the working directory

📦 Dependency Updates

Undertow upgraded to 2.3.23.Final
Gradle wrapper updated to 9.3.1
Jackson Jr bumped to 2.21.1
Logback Classic bumped to 1.5.32

🎯 Upgrading

BoxLang 1.11.0 is a drop-in upgrade. No code changes are required to benefit from the performance improvements.

# CommandBox
box install boxlang@1.11.0

# BVM
bvm install 1.11.0 && bvm use 1.11.0

# Docker
FROM ortussolutions/boxlang:1.11.0

Full release notes, documentation, and downloads are available at boxlang.io and boxlang.ortusbooks.com.

Join the BoxLang Community ⚡️

Be part of the movement shaping the future of web development. Stay connected and receive the latest updates on Into the Box 2025, product launches, tool updates, and more.

Subscribe to our newsletter for exclusive content.

Join the BoxLang and CFML legends at Into the Box 2025. Let’s learn, share, and code together for a modern, cutting-edge web development future.

The post BoxLang 1.11.0 Release appeared first on foojay.

I Benchmarked Java on Single-Board Computers: Orange Pi 5 Ultra and Raspberry Pi 5 Lead the Pack

Frank Delporte — Wed, 04 Mar 2026 06:52:00 +0000

Table of Contents

Benchmark Tool

BenchmarkRunner.java - The User Tool
SummarizeReports.java - The Automation Tool

About The Renaissance Benchmark Suite

The Results

The Dashboard
Analyzing the Results
Selecting a Winner

Conclusion

Try It Yourself!
What's Next?

In my "Java on Single Board Computers" series, I already published several posts and videos in which I unpack the board, connect it for the first time, and try to install and run some simple Java code. In this post, I want to share some benchmarks of Java on these boards to get a better idea of the performance we can expect from Java on these platforms.

Benchmark Tool

To make the benchmark testing as easy as possible, I created a simple tool (written in Java of course!) that can be executed with JBang. The complete project is available on GitHub at github.com/FDelporte/sbc-java-comparison.

The project consists of two main parts: a runner and a summarizer.

BenchmarkRunner.java - The User Tool

This is the tool you run on your single-board computer. It's a JBang script that:

Detects system information: This uses the OSHI library to gather details about the board, CPU, memory, JVM, and operating system.
Downloads the Renaissance benchmark suite: Automatically fetches the Renaissance Suite if it's not already cached.
Runs the benchmarks: Executes each benchmark three times and calculates the average score.
Saves results locally: Creates a JSON file with all the benchmark data.
Uploads to GitHub: Optionally, the results file can be pushed to the repository via the GitHub API for inclusion in the comparison.
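
The "run three times and average" step can be sketched as follows. This is a simplified stand-in, not the code of BenchmarkRunner.java: the class `AverageOfRuns` is hypothetical, and it times an arbitrary `Runnable` instead of a Renaissance benchmark.

```java
import java.util.Arrays;

// Simplified stand-in for "run N times and average"; not the actual runner.
final class AverageOfRuns {

    // Runs the workload the given number of times and returns the mean
    // wall-clock time in milliseconds.
    static double averageMillis(int runs, Runnable workload) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            workload.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / runs;
    }

    public static void main(String[] args) {
        // Arbitrary workload: sort a descending array.
        double average = averageMillis(3, () -> {
            int[] data = new int[200_000];
            for (int i = 0; i < data.length; i++) {
                data[i] = data.length - i;
            }
            Arrays.sort(data);
        });
        System.out.println("Average: " + average + " ms");
    }
}
```

Averaging several runs smooths out one-off effects such as JIT warmup or background activity on the board, although three runs is still a small sample.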

You can run it directly from GitHub with a single command, after setting up your GitHub API token and the repository details (if needed):

export GITHUB_TOKEN={ghp_yourtoken}
export BENCH_GITHUB_OWNER={your_github_account}
export BENCH_GITHUB_REPO={your_fork}
export BENCH_GITHUB_BRANCH={your_branch}

jbang https://github.com/FDelporte/sbc-java-comparison/raw/main/BenchmarkRunner.java

If you want to run it without uploading results, add the --skip-push flag. And if your board has memory constraints, you can limit the heap size with --heap-limit 768m for example.

The script generates a report that looks like this:

{
  "systemInfo" : {
    "boardInfo" : {
      "model" : "RK3588 OPi 5 Ultra",
      ...
    },
    "cpuInfo" : {
      "model" : "RK3588 OPi 5 Ultra",
      "identifier" : "ARM Family 8 Model 0xd0b Stepping r2p0",
      "logicalCores" : 8,
      ...
    },
    "memoryInfo" : {
      "totalMB" : 15964,
      ...
    },
    "jvmInfo" : {
      "version" : "25.0.1",
      ...
    },
    "osInfo" : {
      "family" : "Ubuntu",
      ...
    }
  },
  "results" : [ 
      {
        "name" : "akka-uct",
        "score" : 17702.0,
        "unit" : "ms",
        "description" : "Actor-based concurrency. Interesting for comparing how well thread scheduling works across ARM, x86, and RISC-V kernels."
      },
    ...
  ],
  "timestamp" : "2026-02-18T10:47:46.198052372Z"
}

SummarizeReports.java - The Automation Tool

This tool runs automatically via a GitHub Action whenever a new benchmark result gets added to the repository. This tool:

Loads all files from the report directory.
Finds the unique platforms by CPU model and keeps only the most recent result for each unique CPU.
Generates summary.json, a consolidated file in the data directory.

This summary file is then used by the web dashboard to visualize all the results.
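
Keeping only the most recent result per CPU model is a classic "group and take the maximum" operation. The sketch below uses a hypothetical `Report` record with just two fields; the real report files contain far more data, and SummarizeReports.java may be structured differently.

```java
import java.time.Instant;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical, reduced view of a benchmark report.
record Report(String cpuModel, Instant timestamp) {}

final class LatestPerCpu {

    // Groups reports by CPU model and keeps the newest one for each model.
    static Map<String, Report> latestByCpu(Collection<Report> reports) {
        return reports.stream().collect(Collectors.toMap(
                Report::cpuModel,
                Function.identity(),
                BinaryOperator.maxBy(Comparator.comparing(Report::timestamp))));
    }

    public static void main(String[] args) {
        List<Report> reports = List.of(
                new Report("RK3588 OPi 5 Ultra", Instant.parse("2026-02-01T00:00:00Z")),
                new Report("RK3588 OPi 5 Ultra", Instant.parse("2026-02-18T10:47:46Z")),
                new Report("Example CPU", Instant.parse("2026-02-10T00:00:00Z")));
        latestByCpu(reports).values().forEach(System.out::println);
    }
}
```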

About The Renaissance Benchmark Suite

Rather than creating benchmarks from scratch, I chose to use the Renaissance Benchmark Suite, an open-source project designed specifically for JVM performance evaluation. Renaissance is maintained by researchers and includes a variety of real-world workloads that stress different aspects of the JVM and hardware.

The suite includes benchmarks for parallel computing, functional programming, machine learning, and more. I selected seven benchmarks that suit the constraints of single-board computers. I also found out that the scrabble benchmark can't run with Java 25, so it's not included.

akka-uct: Actor-based concurrency, tests thread scheduling across different architectures.
fj-kmeans: Fork/join parallelism with K-Means clustering, stresses CPU and multi-core utilization.
scala-kmeans: Single-threaded K-Means in Scala, provides a single-core performance baseline.
future-genetic: Genetic algorithm using futures, exercises the thread pool and garbage collector.
mnemonics: Serial JDK Streams, baseline for stream processing.
par-mnemonics: Parallel JDK Streams, reveals how well different architectures handle parallelism.
db-shootout: In-memory databases, tests memory bandwidth and subsystem performance.
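
To give a feel for what the serial and parallel stream benchmarks compare, here is a tiny Java sketch of the same computation in both modes. It is not taken from the Renaissance suite; the class `StreamModes` and the sum-of-squares workload are chosen only for illustration.

```java
import java.util.stream.LongStream;

// Same computation as a serial and as a parallel JDK stream.
final class StreamModes {

    // Runs on the calling thread only.
    static long sumOfSquaresSerial(long n) {
        return LongStream.rangeClosed(1, n).map(x -> x * x).sum();
    }

    // Splits the range across the common fork/join pool.
    static long sumOfSquaresParallel(long n) {
        return LongStream.rangeClosed(1, n).parallel().map(x -> x * x).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquaresSerial(1000));
        System.out.println(sumOfSquaresParallel(1000));
    }
}
```

Both methods return the same value; on a multi-core board the parallel variant can finish sooner for large inputs, which is the effect the parallel benchmark is meant to expose.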

I specifically avoided the Apache Spark benchmarks from Renaissance, as they're very memory-hungry and designed for multicore server machines. They would either crash with OutOfMemoryErrors or take forever on these constrained boards.

The Results

Based on the automatically generated summary.json and other configuration files in the repository's data directory, I created (with a lot of help from Claude.ai...) an interactive dashboard to visualize all the results. You can explore it yourself at webtechie.be/sbc/.

Important note: I ran the tests on the default Ubuntu system provided by the manufacturer of the board, after doing an update/upgrade and installation of OpenJDK 25 and JBang. The runner itself also uses some resources of the board, so this influences the numbers, but the same approach has been used on all boards to have a similar comparison.

The Dashboard

The "Vanilla JavaScript" (= no libraries, just HTML/CSS/JS) dashboard presents the benchmark results in an easy-to-understand format. For each board, you can see:

System information: Board model, CPU details, memory, Java version, and operating system.
Individual benchmark scores for each of the seven tests.
Visual comparisons across all tested boards.

The dashboard pulls the latest summary and other data files from the repository, so it's a living comparison that grows as more test reports become available.

These are screenshots from a first test round. Check the latest status at webtechie.be/sbc/.

Analyzing the Results

Some general remarks about the charts:

The overview chart at the top gives the best board (based on your selection) a score of 100% (best), and the other boards are compared to it.
For the other charts, per benchmark, the lower scores are better. All benchmarks measure execution time in milliseconds, so lower numbers indicate faster performance.
Because of the limited amount of memory on the BeagleV-Fire, the benchmarks were executed with --heap-limit 768m on this board to avoid JVM crashes. This is also a factor leading to the lower scores for this board.
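
One plausible way to turn execution times into the relative overview score is sketched below. The formula (fastest time divided by the board's time, times 100) is an assumption made for illustration, as are the class `RelativeScore` and the board names; the dashboard may compute its percentages differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Assumed scoring formula, for illustration only.
final class RelativeScore {

    // Lower time is better, so the fastest board gets 100 and a board
    // that takes twice as long gets 50.
    static Map<String, Double> relativeScores(Map<String, Double> timesMs) {
        double best = timesMs.values().stream()
                .mapToDouble(Double::doubleValue)
                .min()
                .orElseThrow();
        Map<String, Double> scores = new LinkedHashMap<>();
        timesMs.forEach((board, time) -> scores.put(board, 100.0 * best / time));
        return scores;
    }

    public static void main(String[] args) {
        // Made-up boards and timings.
        Map<String, Double> times = new LinkedHashMap<>();
        times.put("Board A", 1000.0);
        times.put("Board B", 2000.0);
        System.out.println(relativeScores(times));
    }
}
```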

Performance Leaders

Not surprisingly, my Apple M2 workstation dominates the charts with the fastest scores across all benchmarks. This is expected: it's a high-end desktop processor with 12 cores and a significantly larger power budget than any single-board computer. The LattePanda IOTA with its Intel N150 x86 processor also performs exceptionally well, coming in second place on most tests.

The LattePanda IOTA deserves special attention here. While I initially excluded it from the "true" single-board computer competition, the pricing actually deserves reconsideration. At 110€, it's comparable to a Raspberry Pi 5 with additional M.2 expansion. The IOTA comes pre-equipped with an M.2 slot for NVMe storage expansion, full-size HDMI, three USB ports, and GPIO headers, all integrated into one board. However, it does require active cooling (the "do not operate without a heatsink" warning is serious), has a slightly larger form factor than traditional Raspberry Pi boards, and demands more power. It's still more of a compact x86 PC than a Raspberry Pi competitor, but for users who specifically need x86 compatibility (running existing x86 applications, Windows support if needed, or specific libraries), it offers exceptional performance for the price.

For traditional single-board computer use cases like IoT, robotics, and GPIO-heavy projects, the ARM-based boards remain the better choice. The GPIO headers on the Raspberry Pi, Orange Pi, and similar boards are a clear advantage. I tried the GPIOs on the LattePanda IOTA and found that they are exposed by an RP2040 co-processor which connects to the Intel CPU via USB, a strange approach that will be hard to use with the Pi4J library.

The Real Single-Board Computer Competition

Among the boards that fit the traditional SBC form factor (Raspberry Pi-sized, passive or small fan cooling, under $200), the results tell a more nuanced story:

Orange Pi 5 Ultra (ARM RK3588, 8 cores) shows the best results in most of the tests, particularly on multithreaded workloads.
Raspberry Pi 5 (ARM Cortex-A76, 4 cores @ 2.4GHz) delivers solid, consistent performance across all benchmarks, very close to the Orange Pi 5 Ultra.
Raspberry Pi 4 (ARM Cortex-A72, 4 cores @ 1.8GHz) still performs admirably despite being an older generation.
All others show significantly lower performance, which is expected given their lower core counts and less powerful CPUs.

RISC-V Performance

The RISC-V boards present an interesting case. The Orange Pi RV2, BeagleV-Fire, and StarFive VisionFive 2 Lite show that RISC-V is absolutely viable for running Java applications, though performance still lags behind ARM equivalents. This is expected given that RISC-V is a newer architecture with less mature tooling and optimization.

Selecting a Winner

If I had to pick a winner from the "true" single-board computers (excluding my Apple workstation and the LattePanda IOTA), the Orange Pi 5 Ultra comes out on top, but the Raspberry Pi 5 is a very close second. These boards offer the best combination of performance and price. Raspberry Pi has the added advantage of broad ecosystem support and ease of use with the Imager Tool. Orange Pi can still learn a lot from the extensive documentation and the huge library of compatible accessories and software available for Raspberry Pi.

Conclusion

I wanted to run benchmarks on multiple boards, but needed a solution to do this as quickly and easily as possible. After all, this is a pet project for me. I got the boards for free, but comparing them is a personal challenge in my "free" time... With the current setup of the two scripts, I found a solution that is fast to execute. I only need access from my work PC, copy the GitHub token and JBang command, and let the test run in the background. The GitHub Action then automatically pushes the results to the repository and updates the dashboard. So, goal achieved!

The results confirm: Java runs without problems on all these platforms, from ARM to x86 to RISC-V! Of course, there are performance differences between architectures and specific boards, but every single one of the tested devices can run real-world Java applications with good performance.

Try It Yourself!

I encourage you to run the benchmark on your own boards and contribute the results! The process is simple:

# Make sure you have Java 25 and JBang installed
jbang https://github.com/FDelporte/sbc-java-comparison/raw/main/BenchmarkRunner.java --skip-push

If you want your results added to the public dashboard, follow the instructions in the repository README to set up GitHub API access and submit your results.

What's Next?

This is just the beginning! I also received two Banana Pi boards, which I still need to unpack and test. And it would also be great to add a benchmark for JavaFX performance, but that will exclude RISC-V, as there is no JavaFX port for it (yet?). And, of course, I really want to get started exploring Pi4J on all these boards and see how it compares to the Raspberry Pi's GPIO.

If you have suggestions for additional benchmarks or want to see specific boards tested, let me know in the comments or reach out on Mastodon, Bluesky, or my YouTube community!

The post I Benchmarked Java on Single-Board Computers: Orange Pi 5 Ultra and Raspberry Pi 5 Lead the Pack appeared first on foojay.

JC-AI Newsletter #14

Miro Wengner — Tue, 03 Mar 2026 15:11:53 +0000

Two weeks have passed and a lot has been happening, quietly yet visibly, in the field of artificial intelligence. This newsletter brings interesting developments, including Dario Amodei's (Anthropic) view on the progress achieved in the LLM field and his response to the utilization of these models for specific kinds of military purposes, as well as OpenAI's response to it. Aside from the possibility that development follows a sigmoid curve rather than exponential growth, it is important to be aware of how these models are being used across different fields. Does prompting and clarifying the goal influence agent responses, and if so, how? How far are we from reliable robotics applications? How much bias is introduced when clinical data is being analyzed?
Let's jump in and happy reading!

article: Exclusive: Why are Chinese AI models dominating open-source as Western labs step back?
authors: Dashveenjit Kaur, AI News
date: 2026-02-09
desc.: A shift in what AI models are being used and where the models are being produced.
category: opinion

article: Machines of Loving Grace
authors: Dario Amodei
date: 2024-10-01
desc.: Although the article is older, it remains relevant for any author aiming to sketch a future in which everything with AI goes right. In light of recent developments, which appear to follow a sigmoid curve rather than exponential growth (marked by stagnation, with current models reaching a point where another breakthrough is required), the trajectory looks more measured than initially anticipated. Although the author discusses multiple risks (grandiosity, market forces, propaganda, sci-fi-like expectations, etc.), he also highlights the bright sides and explores areas where current AI may prove genuinely helpful. The question remains whether the current state of affairs can truly guarantee progress, rather than causing damage through non-deterministic outcomes (education, industry, human creativity etc.).
category: opinion

article: The Urgency of Interpretability
authors: Dario Amodei
date: 2025-04-01
desc.: The author describes lessons learned from current AI development and adds multiple valuable thoughts and facts to consider when interacting with AI models. The main point is that progress in the underlying technology is inexorable, driven by forces too powerful to stop, but what matters is the way in which it unfolds. Accepting that the current evolution of LLM-based AI cannot be halted, the author expresses hope that it may still be guided (this affects not only the entire industry but also humankind's thinking and perception of reality), much like a bus controlled by a steering wheel, and warns of the dangers of ignorance, illustrating this through several concrete examples.
category: opinion

article: From Delegates to Trustees: How Optimizing for Long-Term Interests Shapes Bias and Alignment in LLM
authors: Suyash Fulay, Jocelyn Zhu, Michiel Bakker (MIT)
date: 2025-10-14
desc.: The article addresses the question of 'behavioral cloning', specifically, how accurately LLMs reproduce individuals' expressed preferences. Large language models have demonstrated promising accuracy in predicting survey responses and policy preferences, which has fueled growing interest in their potential to represent human interests across various domains. Drawing on theories of political representation, the article highlights an underexplored design trade-off: whether AI systems should act as delegates, mirroring expressed preferences, or as trustees, acting in users' broader interests. Models may align well with users' short-term preferences while failing to account for their long-term interests. Studies further indicate greater bias in topics where consensus is lacking.
category: research

article: DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
authors: Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao, Yuxiong He, Feng Yan
date: 2026-02-27
desc.: The article addresses the challenge posed by fast-growing demand for Large Language Models (LLMs) to tackle complex, multi-step data science tasks, which has created an urgent need for accurate benchmarking. Two major gaps are identified in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. While highlighting that even capable models (Anthropic, OpenAI, etc.) may struggle in performance, the article introduces the DARE-bench benchmark alongside supervised fine-tuning as approaches that may improve outcomes in specific applications. Although the results appear promising, they retain considerable potential for further improvement, as accuracy is not yet guaranteed.
category: research

article: Do LLMs Benefit From Their Own Words?
authors: Jenny Y. Huang, Leshem Choshen, Ramon Astudillo, Tamara Broderick, Jacob Andreas (MIT, IBM Research)
date: 2026-02-27
desc.: The article aims to answer the question of whether preserving past assistant responses is more beneficial than harmful. The study uses in-the-wild, multi-turn conversations and compares standard (full-context) prompting with a user-turn-only prompting approach that omits all previous assistant responses, evaluated across three open reasoning models and one state-of-the-art model. Surprisingly, omitting past assistant responses does not negatively affect response quality in a large fraction of turns and may also reduce token length. The article concludes with a discussion of findings and directions for future research.
category: research
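
The two prompting modes compared in the study above can be illustrated with a minimal Java sketch. It is not code from the paper: the `Message` record, the `userTurnsOnly` helper, and the role names `"user"` and `"assistant"` are assumptions made for illustration only.

```java
import java.util.List;

// Hypothetical helper: builds a user-turn-only history by dropping
// every earlier assistant message from a multi-turn conversation.
final class UserTurnOnlyPrompt {

    // Assumed message shape; role is expected to be "user" or "assistant".
    record Message(String role, String content) {}

    // Full-context prompting would send `history` unchanged.
    // User-turn-only prompting omits all previous assistant responses.
    static List<Message> userTurnsOnly(List<Message> history) {
        return history.stream()
                .filter(m -> !"assistant".equals(m.role()))
                .toList();
    }

    public static void main(String[] args) {
        List<Message> history = List.of(
                new Message("user", "What is a VarHandle?"),
                new Message("assistant", "A typed reference to a variable."),
                new Message("user", "Show an example."));
        // Only the two user turns remain in the reduced prompt.
        System.out.println(userTurnsOnly(history).size());
    }
}
```

The reduced history is shorter by construction, which matches the study's observation that dropping assistant turns can shrink the prompt.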

article: SafeGen-LLM: Enhancing Safety Generalization in Task Planning for Robotic Systems
authors: Jialiang Fan, Weizhe Xu, Mengyu Liu, Oleg Sokolsky, Insup Lee, Fangxin Kong
date: 2026-02-27
desc.: Safety-critical task planning in robotic systems remains a significant challenge: classical planners suffer from poor scalability, reinforcement learning (RL)-based methods generalize poorly, and base large language models (LLMs) cannot guarantee safety. To address this gap, the article proposes SafeGen-LLM, a safety-generalizable large language model framework. As part of this contribution, a multi-domain Planning Domain Definition Language 3 (PDDL3) benchmark with explicit safety constraints is introduced, along with Supervised Fine-Tuning (SFT) on those constraints. Although the results appear optimistic, with minimal safety violations observed across tested domains, the approach still requires further research in more complex robotic settings.
category: research

article: LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics
authors: Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat
date: 2026-02-27
desc.: Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. The article introduces a novel approach leveraging state-of-the-art models (GPT-5, Gemini 2.5, Gemini 3, Claude Opus 4.5, and DeepSeek-R) by extracting lemmas from arXiv and updating them dynamically. This results in a benchmark that can be refreshed regularly with new problems drawn directly from current mathematical research, while previous instances can be used for training without compromising future evaluations. This approach achieves 10–15% accuracy in theorem proving and opens a new frontier for future research. Although the process may appear fully automated, a human in the loop, such as the article's author or reviewer, remains critically necessary to produce high-quality inputs and to use LLMs effectively. The results also indicate that it is considerably easier for a model to validate an existing proof than to produce one.
category: research

article: Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis
authors: Donghao Huang, Zhaoxia Wang
date: 2026-02-27
desc.: It is a well-established narrative that reasoning in large language models (LLMs) universally improves performance across language tasks. This article aims to test that claim through a comprehensive evaluation of 504 configurations across seven models, considering different reasoning architectures such as adaptive, conditional, and reinforcement-based approaches. The findings reveal that the effectiveness of reasoning is strongly task-dependent and degrades for simpler tasks. The article provides quantitative findings alongside error analysis and outlines directions for future research.
category: research

article: Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring
authors: Aditya Shukla, Yining Yuan, Ben Tamo, Yifei Wang, Micky Nnamdi and others
date: 2026-03-02
desc.: Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series; however, the impact of information bias on clinically significant events, such as sustained abnormalities, remains poorly understood. The article presents the Technology-Integrated Health Management (TIHM) framework to address these questions, introducing a protocol that measures abnormality recall, duration recall, and measurement coverage, while utilizing GPT-4o-mini as a proxy evaluator. Traditional models frequently exhibit near-zero abnormality recall, whereas the vision-based approach achieves the strongest event alignment, with 45.7% abnormality recall and 100% duration recall. These results underscore the need for event-aware evaluation methods in future research to ensure reliable clinical time-series summarization.
category: research

article: Full interview: Anthropic CEO responds to Trump order, Pentagon clash
authors: CBS News
date: 2026-02-28
desc.: Anthropic CEO Dario Amodei sat down with CBS News for an exclusive interview, hours after Defense Secretary Pete Hegseth declared the company a supply chain risk to national security, which restricts military contractors from doing business with the AI giant. Amodei called the move "retaliatory and punitive," and he said Anthropic sought to draw "red lines" in the government's use of its technology because "we believe that crossing those lines is contrary to American values, and we wanted to stand up for American values." OpenAI's response, striking a deal with the Pentagon, raises many questions.
category: youtube

article: Scary Agent Skills: Hidden Unicode Instructions in Skills ...And How To Catch Them
authors: Embrace The Red
date: 2026-02-11
desc.: Skills introduce common threats such as prompt injection, supply chain attacks, remote code execution (RCE), and data exfiltration, among others. This post discusses the fundamentals, highlights the most straightforward prompt injection vector, and demonstrates how a real Skill from OpenAI can be back-doored using invisible Unicode Tag code-points, a technique that certain models, including Gemini, Claude, and Grok, are known to interpret as instructions. From a security perspective, Skills present serious concerns, as they represent a typical supply chain risk with limited governance or security controls. The author identified that some Skills instruct the AI to embed API tokens directly in curl requests and similar constructs, a poor design practice. This means that credentials are passed through the LLM, making them susceptible to leakage and vulnerable to being overwritten by an attacker via indirect prompt injection.
category: tutorial
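
One way to catch such payloads, not necessarily the approach the post itself takes, is to scan text for code points in the Unicode Tags block (U+E0000 to U+E007F), whose characters mirror ASCII but render invisibly. The sketch below is a minimal illustration; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner for invisible Unicode Tag code points in Skill text.
final class HiddenTagScanner {

    // Unicode "Tags" block boundaries.
    private static final int TAG_BLOCK_START = 0xE0000;
    private static final int TAG_BLOCK_END = 0xE007F;

    // Returns the char offsets of every code point from the Tags block.
    static List<Integer> findTagCodePoints(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp >= TAG_BLOCK_START && cp <= TAG_BLOCK_END) {
                offsets.add(i);
            }
            i += Character.charCount(cp); // tag characters occupy two chars
        }
        return offsets;
    }

    // Reveals the hidden payload: each tag code point maps to its
    // ASCII counterpart by subtracting the block start.
    static String decodeHidden(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints()
                .filter(cp -> cp >= TAG_BLOCK_START && cp <= TAG_BLOCK_END)
                .forEach(cp -> sb.append((char) (cp - TAG_BLOCK_START)));
        return sb.toString();
    }

    public static void main(String[] args) {
        String skill = new StringBuilder("Summarize this file.")
                .appendCodePoint(0xE0048) // invisible tag form of 'H'
                .appendCodePoint(0xE0069) // invisible tag form of 'i'
                .toString();
        System.out.println(findTagCodePoints(skill)); // [20, 22]
        System.out.println(decodeHidden(skill));      // Hi
    }
}
```

A check like this could run before a Skill file is loaded, rejecting or flagging any file for which `findTagCodePoints` returns a non-empty list.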

The post JC-AI Newsletter #14 appeared first on foojay.

