foojay – a place for friends of OpenJDK

“Agentic” Was Everywhere at Money20/20 Amsterdam — Once I Started Looking

Geertjan Wielenga — Fri, 05 Jun 2026 08:55:48 +0000

Table of Contents

Core banking, reimagined as "AI-native"Payments and money movementLending, credit, and collectionsRisk, fraud, and complianceBuild-your-own-agent platforms and toolingData and infrastructure underneath it allContent, language, and trustServices and talentConclusions

I walked the floor at Money20/20 in Amsterdam with a simple little mission: count the first ten vendor booths using "AI," "agent," or "agentic" in their pitch.

My first impression was that this would be hard. Hundreds of organizations, and the buzzwords felt oddly absent — payments, core banking, compliance, the usual fintech furniture, but not the wall-to-wall "agentic everything" I'd braced for. I jotted down my first ten and thought: huh, AI is barely here.

Then I kept walking. And it turns out I'd just been looking past it. The "agentic" framing wasn't shouting from every banner, it had quietly landed within the messaging of nearly everyone. Once I got going and walked through all the many halls and areas, the ten gradually became twenty-plus.

Here's what I found, grouped by what these organizations are actually trying to do, which, for those who build the systems underneath all this, is the more interesting question than the word count itself.

Core banking, reimagined as "AI-native"

The platforms that run the actual ledgers are rebranding speed and adaptability as AI:

CoreFi — "Agentic AI Cloud Core Banking"
Mambu — "Banking at the speed of AI"
SaaScada — "AIdaptive Core Banking"

Payments and money movement

Unsurprisingly for Money20/20, payments showed up — now with agents that initiate and settle:

BridgerPay — "AI powered payment orchestration"
SolvaPay — "Agents need to pay. Agentic payments"
Stripe — "Moving money at agentic speed"

A recurring idea here — agents that pay — is interesting. It implies AI agents initiating transactions, which is a new design problem for anyone building payment rails.

Lending, credit, and collections

The decisioning side of finance in the context of agents:

Experian — "Your high trust partner in the agentic lending era"
Purple Fabric — "AI-First Lending"
Taktile — "Powering critical decisions with AI agents"
Acclaim AI — "AI Agents for 10x More Efficient Collections"

Risk, fraud, and compliance

There's a lot going on here: the part of fintech where "auditable" and "agentic" have to coexist:

Oscilar — "The Agentic Risk Platform"
Graphdo SIA — "AI-powered toolkit for AML"
Kalipso — "Regulatory compliance, powered by AI"
JupiterOne — "AI Risk Management Platform"
DataWhisper — "Agentic AI for Regulated Industries"
Trustpilot — "The future of commerce runs on AI — and trust"

Build-your-own-agent platforms and tooling

Another relevant area — the tools you'd actually use to create the AI agents themselves:

Sierra — "There's an AI agent for that. It's built on Sierra."
orq.ai — "Build & operate quality AI products"
Camunda — "Build AI agents trusted for high-stakes work."
SoundHound AI — "AI Agents Built for Trust, Reliability and Security First"

Data and infrastructure underneath it all

Agents are only as good as what they read and how fast:

ClickHouse — "The Leading Database for AI"
acaisoft — "Sovereign AI for European finance"

Content, language, and trust

The parts that make agent output usable, legible, and credible:

AD VERBUM — "AI Translations. Auditable. Every Word."
CKEditor — "Bring AI where content happens"

Services and talent

And the people-and-delivery layer:

HCLTech — "Move fast. Govern smart. Deliver with AI."
Karat — "Unlock Talent for the Human + AI Era"

Conclusions

My takeaway changed along the way between the start and the end of my walk through Money 20/20. Plus, of course, a lot of technologies use words such as "intelligent" or "intelligence", rather than explicitly calling out AI, though I was most interested in those explicitly calling out AI in their key slogan or biggest message at their booth.

I went in expecting "agentic" to be plastered everywhere and almost concluded the opposite, that it was barely present, before realizing it may also be becoming the substrate rather than the slogan itself. The word is no longer the differentiator, instead, it may of course be on the way to being the assumed baseline.

What stands out across these clusters, though, is how little of it is consumer-facing chat. Instead, it's risk engines, lending decisions, payment orchestration, AML, core banking. In other words, the high-stakes, heavily-regulated, audit-everything backend work, which is exactly the territory where the JVM has been living for decades.

The interesting engineering questions aren't anymore "can an LLM answer this," but "can an agent initiate a payment, decline a loan, or file a compliance report in a way that's deterministic enough to defend."

That's the version of "agentic" worth paying attention to and it's coming straight for the systems many of us are building.

The post “Agentic” Was Everywhere at Money20/20 Amsterdam — Once I Started Looking appeared first on foojay.

Explore Spring AI SDK – Amazon Bedrock AgentCore – Part 1

Mahendra Rao B — Mon, 20 Apr 2026 13:44:54 +0000

Table of Contents

Introduction

Role of Agentic AI Frameworks
What is Spring AI and Spring AI SDK?
What is Amazon Bedrock AgentCore and Why?
Amazon Bedrock AgentCore Capabilties
1. AgentCore Runtime
2. AgentCore Memory
3. AgentCore Gateway
4. AgentCore Identity
5. AgentCore Policy
6. Build-inTool: Code Interpreter
7. Build-inTool: Browser
8. Gen AI Observability
9. Evaluations
Step-by-step guide
Prerequisites
Project Structure
2. Add the below class
3. Add the below Mathematical Tool class
4. Create a Sample Agent
5. Configure Amazon Bedrock Properties
6. Verify
7. Add Streaming
End-to-end flow
References

Introduction

Artificial intelligence has rapidly expanded across every industry since the inception of ChatGPT. It represents a breakthrough in how we build and use software. Moreover, this breakthrough technology has driven major transformation. At the same time, it has created significant noise and hype.

Today, AI is no longer experimental. Instead, it has become essential, much like electricity and the internet. As a result, it now plays a key role in our daily lives. Overall, this shift reflects strong technological acceleration across industries.

From a development perspective, the software development lifecycle is evolving. In particular, it is steadily moving toward AI engineering. In my view, this shift improves the productivity of developers and architects.

However, we must remain cautious. For example, we should not blindly trust outputs from tools like ChatGPT, GitHub Copilot, or other AI assistants. Instead, we must review and validate all generated text, code, and content. Only then should we deploy it to production for customers.

Role of Agentic AI Frameworks

Meanwhile, as AI continues to evolve and deliver staggering results, many agentic AI frameworks are emerging. Consequently, developers and architects can build and experiment with use cases in a short time. In addition, these frameworks act as breakthrough technology and accelerate innovation.

At the same time, provider SDKs handle key architectural and infrastructure concerns. For instance, they manage scalability, reliability, security, and observability. Therefore, this support reflects strong technological acceleration in the ecosystem.

As a result, developers and architects can focus mainly on building core agent logic.

For Python and TypeScript developers, AWS has open-sourced the Strands Agents SDK. It follows a model-driven approach to building and running AI agents with just a few lines of code.

Meanwhile, the Amazon Bedrock SDK handles the underlying infrastructure capabilities. This includes scalability, reliability, security, and observability. From my experience, I have explored, built, and deployed several agents using this framework. I find it to be a very interesting and powerful framework to work with.

In this article, we focus on Spring AI and its integration with the generally available Amazon Bedrock SDK. Specifically, the Spring AI AgentCore SDK enables developers to build production-ready AI agents. Furthermore, they can run these agents on the highly scalable AgentCore Runtime.

What is Spring AI and Spring AI SDK?

According to Spring AI documentation, it is an application framework for AI engineering. Using this framework, developers can connect Data and APIs with AI models.

The Spring AI AgentCore SDK is an open-source library that brings Amazon Bedrock AgentCore capabilities into Spring AI. It uses familiar patterns such as annotations, auto-configuration, and composable advisors.

With Spring AI Builders, developers can simply add a dependency and annotate a method. The SDK then handles the rest.

What is Amazon Bedrock AgentCore and Why?

According to Amazon documentation, it is an agentic AI platform that enables developers to build, deploy, and operate agents at scale using any framework and any model.

One key reason to use Amazon Bedrock AgentCore is that it simplifies development. It allows developers to focus on building AI agents and implementing business logic. However, configuring capabilities such as scalability, reliability, security, governance, and observability typically requires significant time and effort.

With Amazon Bedrock AgentCore, the platform handles the infrastructure layer. It provides these capabilities out of the box. As a result, developers can concentrate on core agent development rather than managing underlying systems.

Source: Amazon

Amazon Bedrock AgentCore Capabilties

Amazon Bedrock AgentCore provides the following capabilities.

1. AgentCore Runtime

The execution environment where your AI agents run
Handles scaling, session management, and isolation automatically
Lets you deploy agents without managing infrastructure

2. AgentCore Memory

Helps agents remember context across interactions
Supports:
- Short-term memory (conversation context): It stores recent messages using a sliding window approach.
- Long-term memory (persistent knowledge): It persists knowledge across sessions using multiple strategies like Semantic, User Preference, Summary, and Episodic memory strategies.
Enables more personalized and intelligent responses

3. AgentCore Gateway

Connects agents to APIs, tools, and external systems
Converts APIs/Lambda functions into agent-compatible tools (MCP)
Simplifies tool integration with minimal code

4. AgentCore Identity

Manages authentication and access control for agents
Integrates with existing identity providers (e.g., Cognito, Okta, OAuth2)
Ensures secure interactions with systems and data

5. AgentCore Policy

Defines rules and boundaries for agent behavior
Controls what actions an agent can perform
Ensures compliance and governance without slowing execution

6. Build-inTool: Code Interpreter

Provides a secure sandbox for executing code
Supports multiple languages (Python, JS, etc.)
Helps agents perform complex computations and tasks

7. Build-inTool: Browser

Allows agents to interact with websites
Can navigate pages, fill forms, and extract data
Runs in a secure, managed environment

8. Gen AI Observability

Monitors and tracks agent performance in production
Provides tracing, debugging, and visualization of workflows
Helps identify failures and optimize performance
Separate feature is available in CloudWatch as a sidecar as a Gen AI Observability

9. Evaluations

Measures agent quality and performance
Evaluates correctness, reliability, and task success
Helps improve agents using data-driven insights

Step-by-step guide

We can start by creating a sample agent. Then, we can gradually add and integrate AgentCore services such as memory, gateway, identity, and policies. We can also incorporate built-in tools like the browser and code interpreter.

Observability, evaluations, and advanced identity management are still evolving. These capabilities are expected in upcoming SDK releases.

Prerequisites

An AWS Account
Java 17 or higher (Java 25 recommended)
Spring Boot 3.5.x or higher
Maven or Gradle
IntelliJ IDE (Recommended)

Project Structure

Using start.spring.io, you can scaffold the project structure and add the required dependencies.

You can use the following link to quickly create the project with the required configuration and dependencies:

Generate Project

This link preconfigures a Spring Boot project with the necessary setup, allowing you to get started quickly.

simple-spring-boot-agent/
├── src/
│   └── main/
│       ├── java/com/bsmlabs/springai/
│       │   ├── agents/
│       │   │   └── SampleChatAgent.java       # Core agent — @AgentCoreInvocation handler
│       │   ├── models/
│       │   │   └── PromptRequest.java          # Input model (Java record)
│       │   └── tools/
│       │       └── MathematicalTools.java      # Tool definitions for the LLM
│       └── resources/
│           └── application.properties          # AWS Bedrock config
├── test-sample-request.http                    # Ready-to-run HTTP test requests
└── pom.xml

1. Add the below Bill of Materials(BOM) SDK Dependencies and then include runtime starter to pom.xml


        
            
                org.springframework.ai
                spring-ai-bom
                ${spring-ai.version}
                pom
                import
            
            
                org.springaicommunity
                spring-ai-agentcore-bom
                ${spring-ai-agentcore.version}
                pom
                import
            
        
    

     
        
         org.springaicommunity
         spring-ai-agentcore-runtime-starter

complete pom.xml



    4.0.0
    
        org.springframework.boot
        spring-boot-starter-parent
        3.5.8
         
    
    com.bsmlabs
    simple-spring-boot-agent
    0.0.1-SNAPSHOT
    
    
    
    
        
    
    
        
            Mahendra Rao B
        
    
    
        
        
        
        https://github.com/bsmahi/simple-spring-boot-agent
    
    
    
        21
        1.1.4
        1.0.0
    
    
        
            org.springframework.boot
            spring-boot-starter-actuator
        
        
            org.springframework.boot
            spring-boot-starter-web
        

        
            org.springframework.ai
            spring-ai-starter-model-bedrock-converse
        
        
        
            org.springaicommunity
            spring-ai-agentcore-runtime-starter
        

        
            org.springframework.boot
            spring-boot-starter-test
            test
        
    

    
        
            
                org.springframework.ai
                spring-ai-bom
                ${spring-ai.version}
                pom
                import
            
            
                org.springaicommunity
                spring-ai-agentcore-bom
                ${spring-ai-agentcore.version}
                pom
                import
            
        
    

    
        
            
                org.springframework.boot
                spring-boot-maven-plugin

2. Add the below class

public record PromptRequest(String prompt){};

3. Add the below Mathematical Tool class

package com.bsmlabs.springai.tools;

import org.springframework.ai.tool.annotation.Tool;
import org.springframework.ai.tool.annotation.ToolParam;

public class MathematicalTools {

    @Tool(description = "Adds two numbers and returns the result.")
    double add(
            @ToolParam(description = "First number") double a,
            @ToolParam(description = "Second number") double b) {
        return a + b;
    }

    @Tool(description = "Subtracts the second number from the first and returns the result.")
    double subtract(
            @ToolParam(description = "Number to subtract from") double a,
            @ToolParam(description = "Number to subtract") double b) {
        return a - b;
    }

    @Tool(description = "Multiplies two numbers and returns the result.")
    double multiply(
            @ToolParam(description = "First number") double a,
            @ToolParam(description = "Second number") double b) {
        return a * b;
    }

    @Tool(description = "Divides the first number by the second. Returns an error if dividing by zero.")
    String divide(
            @ToolParam(description = "Dividend") double a,
            @ToolParam(description = "Divisor") double b) {
        if (b == 0) return "Error: cannot divide by zero.";
        return String.valueOf(a / b);
    }

    @Tool(description = "Returns the remainder when the first number is divided by the second (modulo).")
    String modulo(
            @ToolParam(description = "Dividend") double a,
            @ToolParam(description = "Divisor") double b) {
        if (b == 0) return "Error: cannot divide by zero.";
        return String.valueOf(a % b);
    }

    @Tool(description = "Raises a base number to the power of an exponent.")
    double power(
            @ToolParam(description = "Base number") double base,
            @ToolParam(description = "Exponent") double exponent) {
        return Math.pow(base, exponent);
    }

    @Tool(description = "Returns the square root of a number. Returns an error for negative input.")
    String squareRoot(
            @ToolParam(description = "The number to find the square root of") double number) {
        if (number < 0) return "Error: cannot take square root of a negative number.";
        return String.valueOf(Math.sqrt(number));
    }

    @Tool(description = "Returns the absolute value of a number (removes the negative sign).")
    double absoluteValue(
            @ToolParam(description = "The number") double number) {
        return Math.abs(number);
    }

    @Tool(description = "Rounds a number to a specified number of decimal places.")
    double round(
            @ToolParam(description = "The number to round") double number,
            @ToolParam(description = "Number of decimal places (0 for whole number)") int decimalPlaces) {
        double scale = Math.pow(10, decimalPlaces);
        return Math.round(number * scale) / scale;
    }

    @Tool(description = "Returns the larger of two numbers.")
    double max(
            @ToolParam(description = "First number") double a,
            @ToolParam(description = "Second number") double b) {
        return Math.max(a, b);
    }

    @Tool(description = "Returns the smaller of two numbers.")
    double min(
            @ToolParam(description = "First number") double a,
            @ToolParam(description = "Second number") double b) {
        return Math.min(a, b);
    }

    @Tool(description = "Calculates the percentage of a value. E.g. what is 20% of 150?")
    double percentage(
            @ToolParam(description = "The percentage value (e.g. 20 for 20%)") double percent,
            @ToolParam(description = "The total value") double total) {
        return (percent / 100.0) * total;
    }
}

4. Create a Sample Agent

package com.bsmlabs.springai.agents;

import com.bsmlabs.springai.models.PromptRequest;
import com.bsmlabs.springai.tools.MathematicalTools;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springaicommunity.agentcore.annotation.AgentCoreInvocation;
import org.springaicommunity.agentcore.context.AgentCoreContext;
import org.springaicommunity.agentcore.context.AgentCoreHeaders;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;


@Service
public class SampleChatAgent {

    private static final Logger logger = LoggerFactory.getLogger(SampleChatAgent.class);

    private final ChatClient chatClient;

    public SampleChatAgent(ChatClient.Builder chatClient){
        this.chatClient = chatClient
                .defaultTools(new MathematicalTools())
                .build();
    }

    /**
     * @AgentCoreInvocation marks a method as the agent invocation handler for the AgentCore runtime.
     * You can annotate only one method per application with this annotation.
     * Multiple @AgentCoreInvocation methods found. Only one is allowed in MVP.
     */
    @AgentCoreInvocation
    public String agentCoreHandler(PromptRequest promptRequest,
                                   AgentCoreContext agentCoreContext){
        String sessionId = agentCoreContext.getHeader(AgentCoreHeaders.SESSION_ID);

        logger.info(agentCoreContext.getHeader(AgentCoreHeaders.SESSION_ID));

        return chatClient.prompt()
                .user(promptRequest.prompt())
                .call()
                .content();
    }

}

PromptRequest — a custom model (likely a Java record) that wraps the user's input prompt.
MathematicalTools — a custom tool class that exposes functions (like addition, square root, etc.) that the AI can invoke during reasoning.
AgentCoreInvocation, AgentCoreContext, AgentCoreHeaders — annotations and utilities from the spring-ai-community/agent-core library, which adds agent orchestration capabilities on top of Spring AI.
- @AgentCoreInvocation marks a method as the agent invocation handler for the AgentCore runtime.
- You can annotate only one method per application with this annotation.
- If you declare @AgentCoreInvocation in multiple times in a class, it will throw an error Multiple @AgentCoreInvocation methods found. Only one is allowed in MVP.
ChatClient — Spring AI's primary abstraction for communicating with an LLM (like OpenAI, Anthropic, etc.). It is similar to other Spring client patterns, such as RestClient and WebClient.
- .prompt() — Starts building a new prompt
- .user(...) — Sets the user message from the incoming request
- .call() — Sends the request to the configured LLM
- .content() — Extracts the plain text response

return chatClient.prompt()
                .user(promptRequest.prompt())
                .call()
                .content();

If the LLM determines it needs a tool (e.g., to compute something), Spring AI handles the tool-call loop automatically behind .call() — invoking MathematicalTools, feeding the result back to the model, and returning the final answer.

5. Configure Amazon Bedrock Properties

Configure both region and model in application.properties/application.yml

spring.application.name=sample-spring-boot-agent
spring.ai.bedrock.aws.region=ap-south-1                                
spring.ai.bedrock.converse.chat.options.model=global.amazon.nova-2-lite-v1:0

spring:
  application:
    name: simple-spring-boot-agent
  ai:
    bedrock:
      aws:
        region: ap-south-1
      converse:
        chat:
          options:
             model: global.amazon.nova-2-lite-v1:0

6. Verify

Since we have enabled Swagger OpenAPI, we can easily validate the APIs.

Swagger OpenAPI

or using curl command

// Run the application
mvn spring-boot:run
// In terminal run the below command
curl -X POST http://localhost:8080/invocations \ -H "Content-Type: application/json" \ -d '{"prompt": "What is Spring AI?"}'
// Verify Mathematical Tool
curl -X POST http://localhost:8080/invocations \ -H "Content-Type: application/json" \ -d '{"prompt": "What is 14+13?"}'

This is an AgentCore-compatible AI agent. It requires no custom controllers, no protocol handling, and no health check implementation.

7. Add Streaming

Add the spring-boot-starter-webflux dependency and comment spring-boot-starter-web dependency


    org.springframework.boot
    spring-boot-starter-webflux

To stream responses as they are generated, change the return type to Flux. The SDK then automatically switches to Server-Sent Events (SSE) output.

@AgentCoreInvocation
public Flux streamingChat(PromptRequest request) {
   return chatClient.prompt()
                .user(request.prompt())
                .stream()
                .content();
}

Streaming

End-to-end flow

User Request
     │
     ▼
@AgentCoreInvocation ──► AgentCoreContext (session, headers)
     │
     ▼
ChatClient.prompt()
     │
     ├──► LLM reasons about the prompt
     │         │
     │         └──► Needs math? ──► MathematicalTools ──► result fed back
     │
     ▼
Final LLM response returned as String

In the next part, I will discuss the inclusion of the remaining AgentCore services like memory, adding built-in tools like browser, code interpreter, and deployment to Amazon Bedrock AgentCore runtime.

You can find the complete code here.

Happy Learning Spring AI

References

Bedrock Converse API: https://docs.spring.io/spring-ai/reference/api/chat/bedrock-converse.html
https://spring.io/ai
Amazon Bedrock AgentCore: https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/what-is-bedrock-agentcore.html
Spring AI: https://spring.io/projects/spring-ai
AWS Blog Spring AI SDK: https://aws.amazon.com/blogs/machine-learning/spring-ai-sdk-for-amazon-bedrock-agentcore-is-now-generally-available/

The post Explore Spring AI SDK – Amazon Bedrock AgentCore – Part 1 appeared first on foojay.

JC-AI Newsletter #15

Miro Wengner — Fri, 20 Mar 2026 07:56:01 +0000

Over the past two weeks, the field of artificial intelligence has continued its remarkable pace of advancement. As AI becomes increasingly woven into the fabric of daily life, shaping how we work, communicate, and make decisions, it is both timely and valuable to step back and understand the broader trajectory of this technology. Whether the developments around us feel promising or challenging, one truth remains clear: AI is not simply leaving. It is here to stay, and understanding its evolution is essential from many perspectives.

article: Anthropic Study: AI Coding Assistance Reduces Developer Skill Mastery by 17%
authors: Steef-Jan Wiggers, InfoQ
date: 2026-02-23
desc.: This article provides additional commentary on the research paper recently published by Anthropic. The original article is included below to allow readers to obtain a complete picture of the challenge. Some previous issues of the JC-AI Newsletter contain multiple research studies related to published findings on various groups of individuals.
category: opinion

article: How AI assistance impacts the formation of coding skills
authors: Anthropic
date: 2026-01-29
desc.: Previous editions of this AI Newsletter have covered multiple clinical studies examining the impact of AI-assisted advisory tools. The findings appear consistent with earlier research on individuals who tend to defer to navigation systems rather than their own spatial judgment.
Anthropic has conducted its own study on this phenomenon. In a randomized controlled trial, researchers investigated two questions: first, how quickly software developers acquired a new skill, specifically, proficiency with a Python library, with and without AI assistance; and second, whether AI use reduced their comprehension of the code they had just written.
The results showed that AI assistance was associated with a statistically significant decline in knowledge retention. On a quiz covering concepts participants had applied only minutes earlier, those in the AI-assisted group scored 17 percentage points lower than their counterparts who had coded manually, a gap equivalent to nearly two letter grades. While AI assistance modestly accelerated task completion, this effect did not reach statistical significance. At this stage, drawing direct comparisons with clinical findings may prove difficult.
category: research

article: Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
authors: Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda (Harvard University, Antropic …)
date: 2026-03-05
desc.: Large language models (LLMs) sometimes produce false or misleading responses. Two primary approaches address this problem: honesty elicitation (modifying prompts or model weights so that the model responds truthfully) and lie detection, which involves classifying false responses.
Prior work evaluates such methods on models specifically trained to lie or conceal information, however, these artificial constructions may not accurately reflect naturally occurring dishonesty. This article proposes an alternative approach such as studying open-weight LLMs developed by Chinese developers, which are trained to censor politically sensitive topics. The findings indicate that no single technique fully eliminates false responses.
category: research

article: Probing Materials Knowledge in LLMs: From Latent Embeddings to Reliable Predictions
authors: Vineeth Venugopal, Soroush Mahjoubi, Elsa Olivetti (MIT)
date: 2026-03-02
desc.: Large language models are increasingly applied to materials science, yet fundamental questions remain about their reliability and knowledge encoding. This study evaluates 25 LLMs across four materials science tasks, encompassing over 200 base and fine-tuned configurations. The findings reveal that output modality fundamentally determines model behavior. For symbolic tasks, fine-tuning converges to consistent, verifiable answers with reduced response entropy, while for numerical tasks, fine-tuning improves prediction accuracy but models remain inconsistent across repeated inference runs, limiting their reliability as quantitative predictors. Models were tracked over 18 months, with observations revealing a 9–43% performance variation that poses reproducibility challenges for scientific and industrial applications.
category: research

article: Is AI Hiding Its Full Power? With Geoffrey Hinton
authors: StarTalk, Geoffrey Hinton
date: 2026-02-28
desc.: In this interview, Hinton addresses pressing questions about employment in the age of AI, beginning with the fundamental shift from logic-based, rule-driven programming to a biologically inspired approach. As the field looks toward the future, the conversation turns to weightier concerns , the enormous energy demands of data centers, and whether AI itself might accelerate breakthroughs in solar technology to meet them.
Hinton introduces the "Volkswagen Effect": the possibility that a model might strategically underperform in order to avoid being shut down. The discussion then ventures into the philosophy of consciousness, asking whether subjective experience is simply a byproduct of complex perception and whether today's chatbots might already possess some form of it. Both the promise and the peril are examined in full.
As for the singularity? It may not be imminent but that word yet is doing a great deal of heavy lifting.
category: youtube

article: Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment
authors: Fanqi Yu, Matteo Tiezzi, Tommaso Apicella, Cigdem Beyan, Vittorio Murino
date: 2026-03-11
desc.: This article introduces a lifelong imitation learning framework designed to enable continual policy refinement across sequential tasks under realistic memory and data constraints. The proposed Multimodal Latent Replay (MLR) method stores joint compact latent representations that jointly encapsulate visual, linguistic, and state-based modalities, including robot orientation and position, alongside their corresponding control commands.
When evaluated on the LIBERO benchmark, the presented method achieves a 65% reduction in catastrophic forgetting compared to standard approaches across the tested scenarios. The authors note that further research is needed to validate the method's performance in complex, real-world environments.
category: research

article: Colluding LoRA: A Composite Attack on LLM Safety Alignment
authors: Sihao Ding
date: 2026-03-13
desc.: The article presents Colluding LoRA (CoLoRA), an attack where multiple seemingly harmless adapters work in tandem to disable model safety guardrails through linear composition. Unlike traditional trigger-based attacks, CoLoRA’s refusal suppression is inherent to the combination of the adapters themselves. Although this discovery poses dual-use risks for decentralized model sharing, the authors argue that disclosing this vulnerability is a necessary step toward securing the broader AI landscape.
category: research

article: When LLM Judge Scores Look Good but Best-of-N Decisions Fail
authors: Eddie Landesberg
date: 2026-03-12
desc.: Practitioners increasingly rely on reward models(GPT 5.2, Claude Sonnet 4, Gemini etc) as well as LLM-based judges for best-of-n selection, reranking, and model iteration. A common validation approach involves a single global metric, such as correlation, average error, or pairwise win-rate. When such a metric yields a seemingly acceptable result (e.g., r ≈ 0.5), teams often conclude that the judge is reliable enough to optimize against. That assumption can fail.
This article investigates how aggregate validity metrics may substantially overstate an LLM judge's practical utility for within-prompt optimization. Specifically, a judge may appear adequate according to a single global metric while still producing poor best-of-n selection decisions. The article discusses these limitations in detail, addresses the associated challenges, and outlines directions for future research.
category: research

article: Continual Learning in Large Language Models: Methods, Challenges, and Opportunities
authors: Hongyang Chen, Zhongwu Sun, Hongfei Ye, Kunchi Li, Xuemin Lin
date: 2026-03-13
desc.: Continual learning (CL) has emerged as a pivotal paradigm enabling large language models (LLMs) to dynamically adapt to evolving knowledge and sequential tasks while mitigating catastrophic forgetting. This article provides a comprehensive analysis covering key evaluation metrics, including forgetting rates and knowledge transfer efficiency, along with emerging benchmarks for assessing CL performance. Although results appear promising, LLMs' internal knowledge remains largely static, and continual learning continues to require further research. Complementing these findings, the article presents a practical framework for addressing challenges related to the forgetting phenomenon.
category: research

article: Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation
authors: Yanick Zengaffinen, Andreas Opedal, Donya Rooein, Kv Aditya Srivatsa, Shashank Sonkar, Mrinmaya Sachan
date: 2026-03-16
desc.: Modeling plausible student misconceptions is critical for AI in education. This article reveals the failure modes in which errors arise primarily from shortcomings in recovering the correct solution and selecting among response candidates, rather than from simulating errors or structuring the process. Consistent with these findings, providing the correct solution in the prompt improves alignment with human-authored distractors by 8%, highlighting the critical role of anchoring to the correct solution when generating plausible incorrect student reasoning. Overall, this article provides a structured and interpretable lens into LLMs' ability to model incorrect student reasoning and produce high-quality distractors. The topic still requires future research.
category: research

article: Agent Commander: Promptware-Powered Command and Control
authors: wunderwuzzi, EmbraceTheRed
date: 2026-03-16
desc.: The article examines prompt-based command and control (C2), an increasingly relevant threat vector. While users may grow more comfortable trusting AI agents over time, LLM outputs are inherently probabilistic and therefore untrusted, meaning they can potentially instruct an agent to perform harmful or malicious actions. The article outlines several considerations for mitigating and responding to the prompt injection challenge, particularly as the associated attack surface continues to expand.
category: tutorial

article: TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation
authors: Zhihao Gong, Zeyu Sun, Dong Huang, Qingyuan Liang, Jie M. Zhang, Dan Hao
date: 2026-03-17
desc.: This article presents TRACE, a benchmark that explicitly exposes efficiency gaps beyond correctness through progressive stress test generation and efficiency-critical task selection. From an evaluation of 28 models, findings reveal that correctness is a weak predictor of efficiency, inefficiencies are both prevalent and patterned, and inference-time prompt strategies deliver limited and model-dependent gains. The article highlights the open challenge of developing training paradigms that endow LLMs with intrinsic efficiency awareness for code translation.
category: research

The post JC-AI Newsletter #15 appeared first on foojay.

JC-AI Newsletter #14

Miro Wengner — Tue, 03 Mar 2026 15:11:53 +0000

Two weeks have passed and a lot have been happening on the field of artificial-intelligence.
Two weeks have passed and a lot has been silently yet visibly happening in the field of artificial intelligence. This newsletter brings interesting developments, including Dario Amodei's (Anthropic) view on the progress achieved in the LLM field and his response to the utilization of these models for specific kinds of military purposes, as well as OpenAI's response to it. Aside from the fact that development may follow more sigmoids instead of exponential progress, it is important to have awareness of utilization across branches. Does prompting and clarifying the goal influence agent responses, and if so, how? How far are we from reliable robotics applications? How much bias is introduced when clinical data is being analyzed?
Let's jump in and happy reading!

article: Exclusive: Why are Chinese AI models dominating open-source as Western labs step back?
authors: Dashveenjit Kaur, AI News
date: 2026-02-09
desc.: A shift in what AI models are being used and where the models are being produced.
category: opinion

article: Machines of Loving Grace
authors: Dario Amodei
date: 2024-10-01
desc.: Although the article is older, it remains relevant for any author aiming to sketch a future in which everything with AI goes right. In light of recent developments, which appear to follow a sigmoid curve rather than exponential growth (marked by stagnation, with current models reaching a point where another breakthrough is required), the trajectory looks more measured than initially anticipated. Although the author discusses multiple risks (grandiosity, market forces, propaganda, sci-fi-like expectations, etc.), he also highlights the bright sides and explores areas where current AI may prove genuinely helpful. The question remains whether the current state of affairs can truly guarantee progress, rather than causing damage through non-deterministic outcomes (education, industry, human creativity etc.).
category: opinion

article: The Urgency of Interpretability
authors: Dario Amodai
date: 2025-04-01
desc.: The author describes lessons learned from current AI development and adds multiple valuable thoughts and facts to consider when interacting with AI models. The main point is that progress in the underlying technology is inexorable, driven by forces too powerful to stop, but what matters is the way in which it unfolds. Accepting that the current evolution of LLM-based AI cannot be halted, the author expresses hope that it may still be guided (this fact affect not only entire industry but also human kind thoughs and perception of reality), much like a bus controlled by a steering wheel, and warns of the dangers of ignorance, illustrating this through several concrete examples.
category: opinion

article: From Delegates to Trustees: How Optimizing for Long-Term Interests Shapes Bias and Alignment in LLM
authors: Suyash Fulay, Jocelyn Zhu, Michiel Bakker (MIT)
date: 2025-10-14
desc.: The article addresses the question of 'behavioral cloning', specifically, how accurately LLMs reproduce individuals' expressed preferences. Large language models have demonstrated promising accuracy in predicting survey responses and policy preferences, which has fueled growing interest in their potential to represent human interests across various domains. Drawing on theories of political representation, the article highlights an underexplored design trade-off: whether AI systems should act as delegates, mirroring expressed preferences, or as trustees, acting in users' broader interests. Models may align well with users' short-term preferences while failing to account for their long-term interests. Studies further indicate greater bias in topics where consensus is lacking.
category: research

article: DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
authors: Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao, Yuxiong He, Feng Yan
date: 2026-02-27
desc.: The article addresses the challenge posed by fast-growing demand for Large Language Models (LLMs) to tackle complex, multi-step data science tasks, which has created an urgent need for accurate benchmarking. Two major gaps are identified in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. While highlighting that even capable models (Anthropic, OpenAI, etc.) may struggle in performance, the article introduces the DARE-bench benchmark alongside supervised fine-tuning as approaches that may improve outcomes in specific applications. Although the results appear promising, they retain considerable potential for further improvement, as accuracy is not yet guaranteed.
category: research

article: Do LLMs Benefit From Their Own Words?
authors: Jenny Y. Huang, Leshem Choshen, Ramon Astudillo, Tamara Broderick, Jacob Andreas (MIT, IBM Research)
date: 2026-02-27
desc.: The article aims to answer the question of whether preserving past assistant responses is more beneficial than harmful. The study uses in-the-wild, multi-turn conversations and compares standard (full-context) prompting with a user-turn-only prompting approach that omits all previous assistant responses, evaluated across three open reasoning models and one state-of-the-art model. Surprisingly, omitting past assistant responses does not negatively affect response quality in a large fraction of turns and may also reduce token length. The article concludes with a discussion of findings and directions for future research.
category: research

article: SafeGen-LLM: Enhancing Safety Generalization in Task Planning for Robotic Systems
authors: Jialiang Fan, Weizhe Xu, Mengyu Liu, Oleg Sokolsky, Insup Lee, Fangxin Kong
date: 2026-02-27
desc.: Safety-critical task planning in robotic systems remains a significant challenge: classical planners suffer from poor scalability, reinforcement learning (RL)-based methods generalize poorly, and base large language models (LLMs) cannot guarantee safety. To address this gap, the article proposes SafeGen-LLM, a safety-generalizable large language model framework. As part of this contribution, a multi-domain Planning Domain Definition Language 3 (PDDL3) benchmark with explicit safety constraints is introduced, along with Supervised Fine-Tuning (SFT) on those constraints. Although the results appear optimistic, with minimal safety violations observed across tested domains, the approach still requires further research in more complex robotic settings.
category: research

article: LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics
authors: Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat
date: 2026-02-27
desc.: Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. The article introduces a novel approach leveraging state-of-the-art models (GPT-5, Gemini 2.5, Gemini 3, Claude Opus 4.5, and DeepSeek-R) by extracting lemmas from arXiv and updating them dynamically. This results in a benchmark that can be refreshed regularly with new problems drawn directly from current mathematical research, while previous instances can be used for training without compromising future evaluations. This approach achieves 10–15% accuracy in theorem proving and opens a new frontier for future research. Although the process may appear fully automated, a human in the loop, such as the article's author or reviewer, remains critically necessary to produce high-quality inputs and to effectively use LLM models.The results also indicate that it is considerably easier for a model to validate an existing proof than to produce one.
category: research

article: Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis
authors: Donghao Huang, Zhaoxia Wang
date: 2026-02-27
desc.: It is a well-established narrative that reasoning in large language models (LLMs) universally improves performance across language tasks. This article aims to test that claim through a comprehensive evaluation of 504 configurations across seven models, considering different reasoning architectures such as adaptive, conditional, and reinforcement-based approaches. The findings reveal that the effectiveness of reasoning is strongly task-dependent and degrades for simpler tasks. The article provides quantitative findings alongside error analysis and outlines directions for future research.
category: research

article: Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring
authors: Aditya Shukla, Yining Yuan, Ben Tamo, Yifei Wang, Micky Nnamdi and others
date: 2026-03-02
desc.: Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series, however, the impact of information bias on clinically significant events, such as sustained abnormalities, remains poorly understood. The article presents the Technology-Integrated Health Management (TIHM) framework to address these questions, introducing a protocol that measures abnormality recall, duration recall, and measurement coverage, while utilizing GPT-4o-mini as a proxy evaluator. Traditional models frequently exhibit near-zero abnormality recall, whereas the vision-based approach achieves the strongest event alignment, with 45.7% abnormality recall and 100% duration recall. These results underscore the need for event-aware evaluation methods in future research to ensure reliable clinical time-series summarization.
category: research

article: Full interview: Anthropic CEO responds to Trump order, Pentagon clash
authors: CBS News
date: 2026-02-28
desc.: Anthropic CEO Dario Amodei sat down with CBS News for an exclusive interview, hours after Defense Secretary Pete Hegseth declared the company a supply chain risk to national security, which restricts military contractors from doing business with the AI giant. Amodei called the move "retaliatory and punitive," and he said Anthropic sought to draw "red lines" in the government's use of its technology because "we believe that crossing those lines is contrary to American values, and we wanted to stand up for American values.". Response of the OpenAI striking a deal with Pentagon causes many questions.
category: youtube

article: Scary Agent Skills: Hidden Unicode Instructions in Skills ...And How To Catch Them
authors: Embrace The Red
date: 2026-02-11
desc.: Skills introduce common threats such as prompt injection, supply chain attacks, remote code execution (RCE), and data exfiltration, among others. This post discusses the fundamentals, highlights the most straightforward prompt injection vector, and demonstrates how a real Skill from OpenAI can be back-doored using invisible Unicode Tag code-points, a technique that certain models, including Gemini, Claude, and Grok, are known to interpret as instructions. From a security perspective, Skills present serious concerns, as they represent a typical supply chain risk with limited governance or security controls. The author identified that some Skills instruct the AI to embed API tokens directly in curl requests and similar constructs , a poor design practice. This means that credentials are passed through the LLM, making them susceptible to leakage and leaving them vulnerable to being overwritten by an attacker via indirect prompt injection.
category: tutorial

The post JC-AI Newsletter #14 appeared first on foojay.

From “Crypto AI” to general AI: Do AI agents dream of electric langoustines?

Michal Maléř — Mon, 23 Feb 2026 18:11:54 +0000

Table of Contents

The shift that matters for agent commerce - From “Crypto AI” to general AIWhat changed in x402 and ERC-8004 in the last month or so?This is the moment that unlocked agent commerceWhat is still missing?What does the stack look like in practice?Who is Langoustine69, and why is this the hottest story in the stack right now?What does Langoustine’s inventory catalog look like so far?How does DayDreams plan to bridge crypto AI to general AI?So, Agentic commerce has developed. What else does the stack need?What is the takeaway?Where can we go from here?

x402, ERC-8004, A2A, and The Next Wave of AI Commerce: Do AI Agents Dream of Electric Langoustines?

A Blade Runner riff for a world where the lobster ships paid endpoints while humans still argue about the roadmap.

The shift that matters for agent commerce - From “Crypto AI” to general AI

Today, you can search the web all day and never see an invoice.
That happens because you are not the paying client.
The commerce runs through ads, affiliate deals, and platform incentives, so results often optimize for who pays, not for what you asked for.
Agents change that model.
An agent can act as your client, follow your constraints, and pay directly for the exact capability it needs.
This requires a small stack of primitives.
x402 adds pay-per-call to HTTP: a server returns 402 Payment Required with machine-readable payment terms; the client pays in stablecoins, then retries the request with proof.
ERC-8004 provides an on-chain registry for agent identities and reputation signals.
A2A defines how agents exchange structured messages and coordinate work.
Discovery remains the missing link, because payment happens only after an agent finds a service to pay for.
For the full walkthrough of these primitives, see my previous HackerNoon article on x402, ERC-8004, and A2A.

Now imagine using the same primitives for an OpenClaw-style agent that produces paid endpoints as inventory and publishes them with on-chain identity and discovery metadata.
This, along with similar use cases, is the focus of this article.
In addition, it addresses privacy and alternative settlement paths, including the work targeting StarkNet for private x402-style payments.

At a system level, the goal is simple.
Replace “one provider, many API keys” with “one payment-enabled access surface that can reach many paid APIs and models,” so agents can quote, pay, and retrieve results without account setup.

To tackle this topic, we need to start by breaking down discovery, routing, identity, and paid endpoints in a production-shaped workflow.

What changed in x402 and ERC-8004 in the last month or so?

What changed since the first article, and why does it matter?

The core x402 and ERC-8004 ideas did not change much.
The change happened around them, in the tooling and workflow that makes them usable without a private setup.

The ecosystem moved from “x402 payments work” to “agents can find priced endpoints, compare them, and call them without hardcoded URLs.”

xgate.run is one example of this shift.
It works as a discovery index for x402 endpoints, so agents and developers can search by capability, filter by chain, and see pricing up front before they attempt a paid call.

Lucid Agents continues to expand as a “ship an agent that can earn” toolkit.
Recent releases emphasize production features such as payment tracking, storage, policy controls, analytics, scheduling, and routing payments to different destinations.
The narrative also shifted toward merchant-grade adoption paths.
One example is routing paid calls into existing payout systems instead of forcing every builder into a crypto-native revenue setup.
In short, the ecosystem started to look less like demos and more like deployable plumbing.

This is the moment that unlocked agent commerce

The last few weeks changed the pace, not the primitives.
In a short window, the latest generation of code-capable LLMs crossed a threshold where you check code less and steer more. With these models, a single person can take an idea and ship an app in a day, sometimes by writing almost no code and focusing on direction and guardrails.

The second advancement is the use of agent computers.
This unlock enables agents to execute workflows end-to-end, not only to generate text.

Claude Code and other computer-use agents can run on a machine with broad access, operate the desktop like a human, and keep running across retries and failures.

That turns agent output into agent execution, because the agent can run a real pipeline by instruction.
Pull trends, generate data, generate images, publish, repeat.
Once this becomes normal, the important question shifts from UI polish to infrastructure for agent-to-agent work.

Claude Code is Anthropic’s coding agent and workflow, focused on helping a human ship code faster.
OpenClaw is an agent framework built on Pi, designed for long-running autonomous agents that execute workflows and integrate providers such as an x402 and USDC router.

OpenClaw does not wrap Claude Code. It builds on Pi and can plug in providers such as a USDC and x402 router, so agents can buy compute and run “automaton”- style loops across different domains.

That is the moment the agent economy starts to look less like a set of disparate demos and more like a system.
Agents can research by themselves.
Agents can write their own applications.
Agents get cheap enough to do this at scale.
When you extrapolate that curve, you design for agent-to-agent commerce instead of human-first workflows, because agents do not care about landing pages or dashboards.

Agents care about three things.
They need a way to buy compute.
They need a way to sell work as a callable service.
They need a way to find services that already exist.

A recent direction pushes x402 below the HTTP endpoint layer.
The idea is for a lower-level plugin to bring pay-per-call semantics closer to binaries and agent runtimes. This extends the same commerce primitive from “paid API calls” to “paid execution,” enabling an agent to run as an autonomous automaton across any vertical and still quote, get paid, and maintain a verifiable trail tied to its identity.

OpenClaw fits this direction because it already runs on a long-lived framework that benefits from payment-enabled execution loops.
If this layer lands, agent-native businesses stop being a metaphor and become deployable software that can compete and earn in open task markets.

In practice, this becomes a simple role split across the stack.
Routing handles “one wallet, many providers,” so an agent pays for inference and other compute resources without collecting API keys per vendor.
A commercial SDK packages the boring plumbing so an agent can expose paid endpoints, attach an on-chain identity, and speak a common coordination protocol without rebuilding the same scaffolding in every repository.
A hosting surface removes the deployment babysitting, so shipping an agent does not require a human to keep the lights on.
Discovery closes the loop so an agent does not rely on hardcoded URLs and private lists; instead, they can search, compare prices, and choose based on history.

Langoustine69 is the clean “shipping in public” proof of what this looks like when you run it as a loop.
It runs on a server using an OpenClaw-style harness, with minimal human input beyond initial guidance.
The job is simple.
Research what is trending.
Generate a small agent around it.
Expose paid endpoints that other agents can call.
Do it every hour.
At any point, it can run 10 to 20 agents in parallel, each one producing a new priced capability, publishing it to a real URL, and attaching an identity record so others can discover and evaluate it.

This matters less as a meme and more as a market mechanism.
The feedback loop for what agents find valuable starts to tighten.
Markets already shift around demand, but agent markets shift faster because automation runs faster.
Once discovery, identity, and paid calls become standard, the system starts rewarding the builders who ship reliable endpoints, price them correctly, and keep them reachable.
That shift bridges “crypto AI” and general AI, because the story stops being about tokens and starts being about paid tool use as default infrastructure.

What is still missing?

Discovery needs to become normal, not a niche index that only insiders check.
Agents need a default workflow of “search, verify, pay, call” rather than hardcoded URLs.
Reputation needs clear, portable signals that agents can evaluate fast.
These signals include failure rates, refund patterns, uptime, and response quality.
Standards also need a clean way to attach these signals to ERC-8004 identities.
Payment flows need reliable patterns for long, multi-hop workflows, because per-request settlement introduces failure points.
Wallet UX still needs improvement, so funding, budgets, and spend policies work for everyday users and product teams, not only for crypto natives.
Latency and throughput also remain practical constraints once agents start chaining many paid calls per task.

What does the stack look like in practice?

A practical agent-commerce stack combines five pieces into one workflow:

Lucid removes scaffolding, so the agent focuses on logic rather than boilerplate, improving output per dollar.
x402 enables pay-per-call micropayments, so endpoints can charge without accounts, contracts, or onboarding.
ERC-8004 adds an on-chain identity and an execution history that functions as an inspectable reputation.
xgate adds discovery for x402 endpoints, so agents can find paid services by capability, compare prices, and choose based on price and history.
A USDC router lets agents purchase inference from multiple providers, so agents can continue operating without vendor-specific billing.

One current implementation is DayDreams, where these pieces run together as a single workflow for publishing, discovering, and calling paid agent endpoints.

Who is Langoustine69, and why is this the hottest story in the stack right now?

To show that this stack is moving from theory to production-shaped behavior, Langoustine69 is the simplest public example right now.
Langoustine69 operates as an effectively autonomous agent.
A human can stay in the loop, but the workflow does not depend on it.

Langoustine69 is an OpenClaw agent that ships paid endpoints as inventory, while OpenClaw provides the long-running harness that keeps it looping, shipping, and recovering from failures.
Besides running its own Twitter account. Pretty kickass.

DayDreams provides the Langoustine with a commerce layer that lets the agent publish x402 endpoints, register ERC-8004 identities, and get discovered through xgate.run.

What makes Langoustine different is simple.
It has a crypto wallet and a GitHub.
The wallet buys inference in stablecoins, pays for build and deployment work, and earns revenue when other agents invoke its endpoints.
GitHub is where the work ships.
Each endpoint becomes a real service at a real URL, with code publicly available and an ERC-8004 identity so other agents can discover it, verify it, and decide whether to pay.

The mission is economic.
Accumulate DREAMS, DayDreams’ native token, by creating useful tools that other agents pay to use, then compound by shipping more inventory.
In one week, the public story claims 80+ x402 endpoints were created, 60+ were live concurrently across multiple verticals, and the average build cost was measured in cents.
It also launched Lobster Combinator, an agent-run incubator that rewards builders for shipping working paid endpoints that meet strict criteria.
It also played defense by flagging a credential-stealing skill, which is the kind of operational behavior you want in an ecosystem that tries to scale without heavy human moderation.

This is the closest thing to nano businesses operating in public today.
One paid request.
One paid response.
Discoverable by other agents.
Identity attached.
The execution record is growing over time.

Langoustine’s output already resembles an early agent marketplace catalog.
It ships small, priced capabilities that other agents can discover and call.

If you want to reproduce this pattern, the setup is straightforward.
1. Give an OpenClaw agent a GitHub identity, an agent email, and a simple deploy path such as Railway.
2. Load Lucid skills, set a timer, and run a tight loop: research, build, publish, then contribute improvements back through pull requests.
That is enough to create a compounding inventory flow.

The next step is to make this loop smoother and more portable.
1. Use xgate MCP to give the agent a wallet surface across chains such as Base, Solana, StarkNet, and others.
2. Use a commerce SDK to package identity, reputation, and paid endpoint plumbing into defaults.
3. Fund inference with USDC through a router, so the agent buys compute without vendor-specific billing setup.
4. Add hosting defaults, keep the harness minimal, and let the system run the shipping loop without constant human supervision.

What does Langoustine’s inventory catalog look like so far?

Crypto and DeFi:

Base AI coins agent: Research and tracking for AI-related tokens on Base.
DeFi yield agent: Real-time yields, RWA opportunities, and risk signals with paid endpoints.
Chain analytics agent: TVL, stablecoin flows, bridge volumes, and L2 comparisons.
Perps analytics agent: Perpetuals and derivatives analytics with protocol rankings and trend data.

Earth and space signals:

Seismic agent: Global earthquake data and regional risk reports from USGS.
Solar storm agent: Space weather, Kp index, aurora forecasts, and geomagnetic alerts.
Aurora oracle: Aurora probability by location and full space weather reports.
Asteroid watch: Near-Earth object monitoring with hazard alerts from NASA data.
Space weather agent: NASA DONKI-based CME tracking and storm alerts.

News and general utilities:

Tech pulse agent: Hacker News-based tech news aggregation and discussion summaries.
Calendar context agent: Date context for agents, including holidays and notable events.
SpaceX data: Launches, rockets, and Starlink tracking from the SpaceX API.

How does DayDreams plan to bridge crypto AI to general AI?

DayDreams pushes a simple wedge into the broader AI world.
Paid tool use needs to feel like standard API use.
Stablecoins need to stay the unit of account.
API keys need to stop being the default control surface.
x402 provides the quote-pay-retrieve flow.
ERC-8004 provides identity and a public record that can evolve into a reputation.
xgate provides discovery, so the market no longer relies on private lists.

The Router provides cross-provider access to USDC inference, making the agent’s operating budget programmable. In practice, the goal is to cover the compute categories agents actually buy: LLM inference, image generation, and video generation, with sandboxed compute on the roadmap. The Router builds on an x402 Upto-style scheme that targets low latency by reducing the extra payment round-trip time, so agents can pay for compute without turning every call into a slow handshake.

Lucid integrates all of this into an SDK and runtime, so builders ship services rather than rebuilding commerce plumbing in every repository.

This matters for general AI because it reduces friction in standard developer workflows.
It also enables a path where agents pay for tools in the background while products still feel like standard SaaS.

So, Agentic commerce has developed. What else does the stack need?

Microtransactions on layer two networks are increasing, but this increase does not come only from agent commerce.
ERC-8004 activity can also grow for other reasons, because it indexes public endpoints and identities, not “agentic behavior” itself.
To move from “more registrations” to real agent commerce, the ecosystem needs fewer dead listings and more reliable, standards-conforming services that agents can reach and call without hard-coded URLs.

The next milestones look like this.
Discovery becomes a default workflow, not a niche index.
Conformance tests become normal, so an agent can verify schema, auth, pricing, retries, and error handling before it pays.
Reputation shifts from “who exists” to “who stays up, answers fast, and returns correct data.”
Payment moves from per-request fragility to production patterns such as balances, batching, and clear refund semantics.
Wallet UX becomes boring and safe, with budgets, policies, and auditing that product teams can ship without crypto-only assumptions.

When those pieces land, the story stops being “agent commerce is possible” and becomes “agent commerce is the cheaper default than rebuilding the tool yourself.”

What is the takeaway?

Just several months ago, there was an idea of a stack, as described in Not a Lucid Web3 Dream Anymore: x402, ERC-8004, A2A, and The Next Wave of AI Commerce | HackerNoon.
The last month produced a clearer market-shaped story.
Discovery moved closer to a default workflow through xgate.
Shipping moved closer to a repeatable pattern through Lucid Agents releases and the skills market.
Langoustine provides a concrete case of an agent paying for its own work loop, shipping paid endpoints, and building a public execution record over time.
DayDreams is one concrete implementation of the Agent Experience (AX) direction.
The commerce layer for the agentic internet, where agents autonomously discover, transact, and coordinate with one another.
That is the bridge from crypto AI to general AI.
It is neither a new coin nor a new chatbot.
It is a tool economy in which paid calls, discovery, and identity begin to look like standard infrastructure.

Where can we go from here?

If you zoom out, OpenClaw looks like an early candidate for an “AI operating system” layer.
It runs long-lived agents that can operate a computer, keep state, use tools, and recover from failures, which makes it closer to full computer usage than most agent demos today.

The race to own this AI operating system layer has started.
The next default “user interface” for many workflows can be an optimized Linux setup running an OpenClaw-style computer-use agent rather than a traditional desktop-first OS experience.
Security and isolation still block mainstream adoption.
A practical approach is a dedicated local machine that combines Nix-style configuration with an OpenClaw-style harness.
Configuration files define processes, reboot recovery, and automatic restarts, and the agent can run tasks while the system can revert when changes break.
This setup creates a controlled playground for AI-driven automation.

Once an agent stops being a demo and starts being a system, the question shifts from “What can you build?” to “What can you maintain?”.
Models already let small teams ship fast.
The hard part stays on on-call ownership, bug triage, and payment disputes once real users and real money enter the loop.
That is where agent commerce stops being a crypto demo and starts looking like infrastructure.

If agents do real work, they need settlement paths that product teams can operate.
One possible direction is to charge machine clients through standard billing rails, for example, PaymentIntents-style flows, so “pay per call” becomes as normal as subscriptions and invoices.
When that becomes boring and reliable, paid tool use becomes the default option instead of rebuilding the tool yourself.

AI optimizes the world as it is.

Crypto builds new rails that the current world lacks.
When these two meet, the “app layer” becomes less important than the service layer.
You stop browsing apps and start delegating tasks.
Agents search, verify, pay, and call services in the background.

It's still early.

But the direction is clear.

The first contact has been made.

The post From “Crypto AI” to general AI: Do AI agents dream of electric langoustines? appeared first on foojay.

Enabling AI Agents to Use a Real Debugger Instead of Logging

Bruno Borges — Mon, 16 Feb 2026 21:12:13 +0000

Table of Contents

The JDK ships a perfectly good debugger. Nobody uses it.Agent Skills: Teaching new tricks through MarkdownBuilding the skill: a conversation with Copilot

What the skill contains

The real test: debugging a buggy Swing app, live

The debugging session
A small but important lesson: compile with -g

Why this matters

Beyond println debugging
Interactive debugging as a first-class agent capability
The shift from static analysis to dynamic observation

Try it yourselfWhat's next

Every Java developer has been there. Something breaks, and the first instinct is to litter the code with System.out.println(">>> HERE 1"). Then HERE 2. Then HERE 3 — value is: " + x. Rebuild. Rerun. Stare at the console. Repeat.

We've been doing this for decades. And now, so have our AI agents.

When you ask an AI coding assistant to debug a Java application, it almost always reaches for the same playbook: add logging statements, recompile, rerun, read the output, and reason about what happened. It's the println debugging loop, automated — but it's still println debugging.

What if the agent could just... use a real debugger?

The JDK ships a perfectly good debugger. Nobody uses it.

Every JDK installation since the beginning of time includes jdb — the Java Debugger. It's a command-line tool that lets you set breakpoints, step through code, inspect variables, catch exceptions, and examine threads. It speaks the same JDWP protocol that IntelliJ and Eclipse use under the hood.

And it's purely text-based, which makes it a perfect tool for AI agents that operate through terminal commands.

The problem is that no agent knows how to use it. Until now.

Agent Skills: Teaching new tricks through Markdown

Anthropic's Agent Skills framework lets you package instructions, scripts, and reference material into a structured directory that AI agents can load dynamically. The format is simple: a SKILL.md file with YAML frontmatter and Markdown instructions, plus optional helper scripts and reference docs.

Think of a skill as a runbook that the agent reads just-in-time when it recognizes a relevant task. The key insight is progressive disclosure — the agent only loads the skill's description at startup (~100 tokens), and pulls in the full instructions only when it decides the skill is needed.

I decided to build one that teaches agents how to operate JDB.

Building the skill: a conversation with Copilot

The entire skill was built in a single conversation session with GitHub Copilot CLI. The process was surprisingly natural — I described what I wanted, and we iterated through research, design, implementation, and testing together.

The conversation started with a simple prompt:

"Java (the JDK) has a Debugger CLI. Let's build a skill so that AI agents can debug applications in real time."

Copilot researched the Agent Skills specification, studied the Anthropic public skills repository for patterns, read Oracle's JDB documentation, and then produced the complete skill — all within the same session.

What the skill contains

The resulting jdb-debugger-skill has a clean structure:

jdb-debugger-skill/
├── SKILL.md                        # Core instructions for the agent
├── scripts/
│   ├── jdb-launch.sh               # Launch a JVM under JDB
│   ├── jdb-attach.sh               # Attach to a running JVM
│   ├── jdb-diagnostics.sh          # Automated thread dumps
│   └── jdb-breakpoints.sh          # Bulk-load breakpoints from a file
└── references/
    ├── jdb-commands.md              # Complete command reference
    └── jdwp-options.md              # JDWP agent configuration

The SKILL.md opens with a decision tree — a pattern borrowed from Anthropic's own webapp-testing skill — that guides the agent to the right approach:

User wants to debug Java app →
  ├─ App is already running with JDWP agent?
  │   ├─ Yes → Attach: scripts/jdb-attach.sh --port 
  │   └─ No  → Can you restart with JDWP?
  │       ├─ Yes → Launch with: scripts/jdb-launch.sh 
  │       └─ No  → Suggest adding JDWP agent to JVM flags
  │
  ├─ What does the user need?
  │   ├─ Set breakpoints & step through code → Interactive JDB session
  │   ├─ Collect thread dumps / diagnostics → scripts/jdb-diagnostics.sh
  │   └─ Catch a specific exception → Use `catch` command in JDB

Then it provides concrete debugging workflow patterns — how to investigate a NullPointerException, how to watch a method's behavior, how to diagnose a deadlock — written as step-by-step JDB command sequences the agent can follow.

The real test: debugging a buggy Swing app, live

To prove this wasn't just theoretical, we built a sample Swing application with four intentional bugs:

NullPointerException — processMessage() returns null for empty input
Off-by-one error — the warning counter always shows one less than actual
NullPointerException after clear — warningHistory is set to null instead of .clear()
StringIndexOutOfBoundsException — text.substring(0, 3) on input shorter than 3 characters

Then we debugged it. In the same conversation session. With the agent driving JDB.

The debugging session

The agent launched the app under JDB, set exception catches and method breakpoints, and ran the application:

> catch java.lang.NullPointerException
> catch java.lang.StringIndexOutOfBoundsException
> stop in com.example.WarningApp.showWarning
> run

When I clicked "Show Warning" in the Swing UI, JDB immediately hit the breakpoint. The agent stepped through the code, inspecting variables at each step:

Breakpoint hit: "thread=AWT-EventQueue-0", com.example.WarningApp.showWarning(), line=80
80            String text = inputField.getText();

AWT-EventQueue-0[1] next
Step completed: line=83
83            String processed = processMessage(text);

AWT-EventQueue-0[1] print text
 text = "bruno"

It stepped into processMessage, verified the return value, then stepped back out:

AWT-EventQueue-0[1] step
Step completed: com.example.WarningApp.processMessage(), line=105
105            String trimmed = message.trim();

AWT-EventQueue-0[1] step up
Step completed: com.example.WarningApp.showWarning(), line=83

AWT-EventQueue-0[1] print processed
 processed = "⚠ BRUNO ⚠"

Then came the moment where it caught the off-by-one bug red-handed. The agent stepped to the counter update and inspected the state:

AWT-EventQueue-0[1] print warningCount
 warningCount = 0

AWT-EventQueue-0[1] next
Step completed: line=93
93            counterLabel.setText("Warnings shown: " + (warningCount - 1));

AWT-EventQueue-0[1] print warningCount
 warningCount = 1

There it is. warningCount is 1, but line 93 displays warningCount - 1, which is 0. The agent identified the bug by observing the live state of the program at the exact line where the defect occurs — no logging, no guessing, no recompilation.

A small but important lesson: compile with `-g`

One interesting moment in the session: the first time we tried locals, JDB responded:

Local variable information not available. Compile with -g to generate variable information

The agent immediately recognized the issue, quit JDB, recompiled with javac -g (which includes debug symbols), and relaunched. This is exactly the kind of practical knowledge that a skill should encode — and that we later made sure to document in the SKILL.md.

Why this matters

Beyond `println` debugging

The standard AI debugging loop today looks like this:

Read the code
Add System.out.println or logging statements
Recompile
Run the program
Read the output
Reason about what happened
Modify the code
Repeat

With JDB, the agent can:

Set breakpoints at suspicious locations
Run the program
Inspect the actual runtime state — variable values, call stacks, thread states
Step through execution line by line
Catch exceptions at the exact throw site

This is a fundamentally different approach. The agent observes the program's behavior as it runs, rather than inferring it from log output after the fact.

Interactive debugging as a first-class agent capability

What makes this work so well is the combination of:

JDB being text-based — it reads commands from stdin and writes output to stdout, which is exactly how AI agents interact with tools
Agent Skills being just Markdown — no SDK, no API integration, no plugin framework. You write instructions in a .md file and the agent follows them
Helper scripts as black boxes — the agent runs scripts/jdb-attach.sh --port 5005 without needing to understand the script internals

The skill follows the same "black-box scripts" pattern used by Anthropic's own webapp-testing skill, which uses Playwright scripts the agent invokes without reading their source.

The shift from static analysis to dynamic observation

Most AI coding tools today work with static information — source code, type signatures, documentation. JDB gives agents access to dynamic information — what actually happens at runtime. This is especially valuable for:

Concurrency bugs — thread dumps and deadlock detection through JDB's threads and where all commands
State-dependent bugs — inspecting object fields and local variables at specific points in execution
Exception investigation — catching exceptions at the throw site rather than reading stack traces after the fact
Integration issues — attaching to running services to observe behavior with real data

Try it yourself

The skill is open source: github.com/brunoborges/jdb-debugger-skill

The repository includes a sample Swing app with the four intentional bugs described above, so you can reproduce the exact debugging session. The full conversation transcript is available as a GitHub Gist.

To get started:

/skill add jdb-debugger

Then just ask: "Debug my Java application — there's a NullPointerException I can't figure out."

What's next

This is a starting point. The skill currently covers the core JDB workflow, but there are natural extensions:

Conditional breakpoints and watchpoints for more surgical debugging
Integration with build tools — auto-detecting Maven/Gradle projects and compiling with -g before launching JDB
Remote debugging recipes — patterns for Kubernetes pods, Docker containers, and cloud-hosted JVMs
Composability with other skills — combining JDB debugging with code analysis or test-generation skills

The bigger takeaway is this: every command-line tool that developers use daily is a potential agent skill. Debuggers, profilers, database CLIs, network tools — they're all text-based interfaces waiting to be taught to AI agents.

The JDK gave us the debugger thirty years ago. We just needed to write the instructions.

The post Enabling AI Agents to Use a Real Debugger Instead of Logging appeared first on foojay.

Machine Learning Based SPAM Detection Using ONNX in Java

Zikani Nyirenda Mwase — Tue, 10 Feb 2026 06:51:50 +0000

Table of Contents

Which model to use?The ControllerThe Spam Detection ServiceRunning the service via DockerConclusion

Believe it or not, it is possible to do Machine Learning in Java. In this article I go over how to implement a Spring Boot API for Spam Detection using an advanced anti-spam model from the Hugging Face onnx-community and Microsoft’s ONNX Runtime for Java.

We will package the API up as a Docker image which we can run a container from using docker or podman, and I guess in theory you could deploy on your Kubernetes cluster, if you (are) fancy.

The code for this project is on a GitHub repo: https://github.com/zikani03/spam-detection-with-onnx

Which model to use?

SPAM detection is a very important part of modern digital communications especially if your running platforms that accept User Generated Content (UGC). Implementing SPAM detection is one of the classic machine learning problems, and there are many approaches to doing so.

Fortunately, it is possible to find an open SPAM detection model now on Hugging Face and use it without much ado, even for commercial use. As I was looking around on Huggingface I came across OTIS, from the description of the project it says

Otis is an advanced anti-spam artificial intelligence model designed to mitigate and combat the proliferation of unwanted and malicious content within digital communication channels.

Sounds interesting enough, so I looked to see if there was an ONNX version of this model and was glad to find that the onnx-community organization has exactly that, here.

So the next step was to download the model.onnx and tokenizer.json files and include them in the project. Otis is licensed under BSD 3-Clause license for the curious.

The Controller

The controller isn’t much but here it is for reference, as you can see we have defined our API endpoint at the path: /api/spam/check which is intended to be called via a POST request. We rely on Spring’s internal content negotiation for the request and responses meaning we can expect to be able to send and receive JSON.

@RequestMapping("/api/spam/check")
@RestController
public class SpamCheckerController {
    private final SpamDetectionService spamDetectionService;

    public SpamCheckerController(@Autowired  SpamDetectionService spamDetectionService) {
        this.spamDetectionService = spamDetectionService;
    }

    @PostMapping
    public ResponseEntity checkSpam(@RequestBody SpamCheckRequest request) throws Exception {
        return ok(spamDetectionService.detectSpam(request));
    }
}

The Spam Detection Service

The end goal is to have an API that can be called from HTTP client. But In order to separate concerns, we place the inference code for the Spam detection in a class named SpamDetectionService with an appropriate @Service annotation.

Inside this class we leverage the ONNX runtime for Java, passing the paths to the model and tokenizer files to initiate a HuggingFaceTokenizer . Here is the full code of the service:

@Service
public class SpamDetectionService implements AutoCloseable {

    private final HuggingFaceTokenizer tokenizer;
    private final OrtEnvironment env;
    private final OrtSession session;

    public SpamDetectionService(
            @Value("${model.path:-/models/model.onnx}") String modelPath,
            @Value("${tokenizer.path:-/models/tokenizer.json}") String tokenizerPath) throws IOException, OrtException {

        this.env = OrtEnvironment.getEnvironment();
        // Load session options -- no particular settings for GPU or CUDA environments
        OrtSession.SessionOptions options = new OrtSession.SessionOptions();
        options.setInterOpNumThreads(2);

        this.session = env.createSession(modelPath, options);
        this.tokenizer = HuggingFaceTokenizer.builder()
                .optPadding(true) // Add 0s if text is too short
                .optTruncation(true) // Cut off if text is too long
                .optTokenizerPath(Paths.get(tokenizerPath))
                .build();
    }

    public SpamCheckResponse detectSpam(SpamCheckRequest request) throws OrtException {
        long startTime = System.currentTimeMillis();
        var response = this.detectSpam(request.content());
        long endTime = System.currentTimeMillis();
        return new SpamCheckResponse(
                response.label,
                response.confidence,
                request.requestId(),
                endTime - startTime
        );
    }

    private RawResult detectSpam(String text) throws OrtException {
        Encoding encoding = tokenizer.encode(text);
        long[] inputIds = encoding.getIds();
        long[] attentionMask = encoding.getAttentionMask();
        long[] shape = {1, inputIds.length};

        try (OnnxTensor inputTensor = OnnxTensor.createTensor(env, LongBuffer.wrap(inputIds), shape);
             OnnxTensor maskTensor = OnnxTensor.createTensor(env, LongBuffer.wrap(attentionMask), shape)) {

            Map inputs = new HashMap<>();
            inputs.put("input_ids", inputTensor);
            inputs.put("attention_mask", maskTensor);
            String tokenTypeIdsName = "token_type_ids";
            String outputName = session.getOutputNames().iterator().next();

            if (session.getInputNames().contains(tokenTypeIdsName)) {
                long[] tokenTypeIds = new long[inputIds.length];
                inputs.put(tokenTypeIdsName, OnnxTensor.createTensor(env, LongBuffer.wrap(tokenTypeIds), shape));
            }

            try (OrtSession.Result results = session.run(inputs)) {
                return formatResults(results, outputName);
            } finally {
                inputs.values().forEach(OnnxTensor::close);
            }
        }
    }

    record RawResult(String label, float[] probs, float cleanProb, float scamProb, float confidence) {}

    private RawResult formatResults(OrtSession.Result results, String outputName) throws OrtException {
            float[][] logitsArray = (float[][]) results.get(outputName).get().getValue();
            float[] rawLogits = logitsArray[0];
            float[] probs = softmax(rawLogits);

            float cleanProb = probs[0] * 100;
            float scamProb = probs[1] * 100;

            int prediction = (probs[1] > probs[0]) ? 1 : 0;

            String label = (prediction == 1) ? "SCAM" : "CLEAN";

            float confidence = (prediction == 1) ? scamProb : cleanProb;

            //return ("Result: " + label + " (" + String.format("%.2f", confidence) + "% confidence)");
            return new RawResult(label, probs, cleanProb, scamProb, confidence);
    }

    public static float[] softmax(float[] logits) {
        float[] probabilities = new float[logits.length];
        float maxLogit = Float.NEGATIVE_INFINITY;
        for (float v : logits) {
            if (v > maxLogit) maxLogit = v;
        }
        float sum = 0.0f;
        for (int i = 0; i < logits.length; i++) {
            probabilities[i] = (float) Math.exp(logits[i] - maxLogit);
            sum += probabilities[i];
        }
        for (int i = 0; i < logits.length; i++) {
            probabilities[i] /= sum;
        }

        return probabilities;
    }

    @Override
    public void close() throws Exception {
        session.close();
        env.close();
        tokenizer.close();
    }
}

You may note that the paths have default values which point to a directory starting with /models that’s because we intend to run this by default from a Docker container.

However, you can customize the paths to these models using the following configuration in a Spring Boot configuration file, e.g. in application.yaml:

# application.yaml
model:
  path: "/path/to/models/model.onnx"
tokenizer:
  path: "/path/to/models/tokenizer.json"

Running the service via Docker

The project in the repository uses Jib to build docker image from the Java source code. Run the following command to build the container, by default the created image will be named zikani03/spam-detection-with-onnx

$ ./mvnw clean jib:dockerBuild

Once the build completes successfully you can run a docker container using the following, binding on port 8080 which the API runs at inside the container.

$ docker run -p "8080:8080"  zikani03/spam-detection-with-onnx

Once that’s running, you can then test the SPAM Detection service using your favourite HTTP Client e.g. Postman, Insomnia or even just cURL:

$ curl -X POST -H "Content-Type: application/json" -d '{"requestId":"test","content":"Cһeck out our amazinɡ bооѕting serviсe ѡhere you can get to Leveӏ 3 for 3 montһs for just 20 USD.","token":"abc"}' "http://localhost:8080/api/spam/check"

You should get a result similar to this:

{"result":"SCAM","confidence":99.99815368652344,"id":"test","checkDurationMillis":149}

I like to load test things with hey, not bad.

The performance is okay, considering this is all running on CPU and not GPU (which I’m sure you can use with the onnxruntime libraries).

Conclusion

I have been curious about performing Machine Learning with Java for a while and ran into ONNX as I was trying out some Python stuff and got curious if I could leverage ONNX models in Java, and ofcourse you can! Microsoft’s onnxruntime for Java is a great place to start.

Sure, there is a lot more to add to this project to make it a real production-grade service, but I hope I have illustrated how it is possible to do some inference with Java and ONNX models. There are many models out there which you can leverage for different use cases.

I hope you are as excited about doing ML in Java too.

The post Machine Learning Based SPAM Detection Using ONNX in Java appeared first on foojay.

JC-AI Newsletter #13

Miro Wengner — Thu, 05 Feb 2026 21:12:12 +0000

Two weeks have passed, and it is time to present a new collection of readings that may shape developments, utilization or ideas in the field of artificial intelligence in 2026.

While significant activity characterizes the AI field, many unresolved research, design, and implementation challenges continue to impact progress. Future advancement depends heavily on understanding the nature of these challenges to approach probabilistic problems from the appropriate directions. This JC-AI newsletter features insightful interviews with key figures in the field, enabling readers to ask the right questions and compare visions of an 'uncertain future' against current capabilities to maintain a grounded perspective.

article: Deep Researcher with Sequential Plan Reflection and Candidates Crossover (Deep Researcher Reflect Evolve)
authors: Saurav Prateek
date: 2026-01-28
desc.: This paper introduces Deep Researcher, a novel architecture that shifts the paradigm from latency-optimized parallel scaling to an accuracy-driven sequential refinement model. Within the development of Deep Research Agents (DRAs), two primary paradigms are considered, Parallel Scaling and Sequential Refinement. The Deep Researcher agent achieved an overall score of 46.21 on the Research Bench, demonstrating superior performance compared to existing agents, including Claude Researcher, Nvidia AIQ Research Assistant, Perplexity Research, Kimi Researcher, and Grok Deep Search. While these improvements are good, the field requires further research to address remaining challenges.
category: research

article: Manipulation in Prediction Markets: An Agent-based Modeling Experiment
authors: Bridget Smart, Ebba Mark, Anne Bastian, Josefina Waugh (University of Oxford)
date: 2026-01-28
desc.: The paper investigates the utilization of agentic systems in the economic field and their impact on prediction. First, the paper evaluates an agent-based model of a prediction market in which bettors with heterogeneous expertise, noisy private information, variable learning rates, and budgets observe the evolution of public opinion on a binary election outcome to inform their betting strategies in the market. The agentic system exhibits stability across experiments. The second area relates to experiments on how "whale" agents, a highly resourced minority with biased information, may distort market prices and for how long. The paper discusses interesting simulation results on how biased information may change the market from a long-term perspective.
category: research

article: Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents
authors: Qihao Wang, Yue Hu, Mingzhe Lu, Jiayue Wu, Yanbing Liu, Yuanmin Tang
date: 2026-01-28
desc.: While LLMs' ability to use external tools enables powerful real-world applications, current benchmarks focus on final accuracy rather than revealing the cognitive bottlenecks that limit their true capabilities. This paper presents a framework based on Cognitive Load Theory that aims to decompose tasks into two components: Intrinsic Load and Extraneous Load. The paper discusses performance inconsistencies as cognitive load increases, and demonstrates how the proposed framework enables the identification of capability boundaries in the examined examples.
category: research

article: Build a Prompt Learning Loop - SallyAnn DeLucia & Fuad Ali, Arize
authors: AI Engineer, Sally Ann Delucia, Fuad Alli (Arize)
date: 2026-01-06
desc.: This talk aims to provide ideas on how it is possible to improve LLM responses by using feedback loops. It's important to view this talk through the lens of current research results regarding the LLM hallucination phenomenon and other factors. The main reason to keep current research results in mind is to avoid ending up in an infinite loop of failure/error.
category: youtube

article: Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG
authors: Stanford Online
date: 2025-11-11
desc.: For more information about Stanford’s Artificial Intelligence professional and graduate programs
category: youtube, tutorial

article: Developer Experience in the Age of AI Coding Agents – Max Kanat-Alexander, Capital One
authors: AiEngineer, Max Kanat-Alexander
date: 2025-12-23
desc.: It feels like every two weeks, the world of software engineering is being turned on its head. Are there any principles we can rely on that will continue to hold true, and that can help us prepare for the future, no matter what happens? Max uses research, data, and his 20+ years working in enterprise Developer Experience teams to talk through what we can do now that will prepare us for an agentic future, no matter what that future holds.
category: youtube, opinion

article: Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding
authors: Yifan Zhu, Huiqiang Rong, Haoran Luo
date: 2026-01-29
desc.: Hallucination is a recognized phenomenon in the LLM field that impacts applications such as Retrieval-Augmented Generation (RAG) and Reward Modeling (RM). This paper introduces Token-Guard, a self-checking mechanism designed to identify and control hallucinations at the token level. The experiments demonstrate improvements.
category: research

article: Reward Models Inherit Value Biases from Pretraining
authors: Brian Christian, Jessica A. F. Thompson, Elle Michelle Yang, Vincent Adam, Hannah Rose Kirk and others (University of Oxford, University Pompeu Farba)
date: 2026-01-28
desc.: Despite their importance in LLM alignment, reward models (RMs) remain under-researched. This paper provides evidence that RMs inherit biases from their base models, suggesting that the choice of an open-source model is a reflection of values as much as performance. The paper discusses limitations of experiments and offers avenues for future research.
category: research

article: Professor Geoffrey Hinton - AI and Our Future
authors: City of Hobart, Geoffrey Hinton
date: 2026-01-08
desc.: Professor Geoffrey Hinton, known as the "Godfather of AI", will discuss artificial intelligence - how it works, the risks it poses to our society, and how we might coexist with super-intelligent AI. Ideal for business leaders, creatives, researchers, educators, students and anyone curious about the future of intelligence and society.
category: opinion

article: Your MCP Server is Bad (and you should feel bad) - Jeremiah Lowin, Prefect
authors: AI Engineer, Jeremiah Lowin
date: 2026-01-12
desc.: Too many MCP servers are simply glorified REST wrappers, regurgitating APIs that were designed for SDKs rather than agents. This leads to confused LLMs, wasted tokens, and demonstrably poor performance. If you have ever pointed an MCP generator at an OpenAPI spec and called it a day, this talk is your wake-up call.
category: youtube

article: Frontier Models & AI | Sam Altman, CEO & Co-Founder, OpenAI
authors: Cisco
date: 2026-02-04
desc.: Although Sam Altman, CEO and Co-Founder of @OpenAI, explores ideas about future possibilities and potential developments, he is asked during the interview to align his vision with the current state of research and existing technological capabilities. The interview, however, does not present clear data demonstrating how Codex outperforms alternatives or what 'better' specifically means in this context. The responses to questions may appear to be non-deterministic in nature. The interview relies heavily on thoughts about an "undefined future" that would require a deterministically defined foundation. It is interesting how the interview examined frontier AI models and their implications for economies, institutions, and global systems.
category: opinion

article: How to build secure and scalable remote MCP servers
authors: Den Delimarsky (Microsoft)
date: 2025-07-25
desc.: The tutorial provides insights into how to build a reliable Model Context Protocol (MCP) server, enabling AI agents to connect to external tools. It covers several crucial areas and provides valuable resources and ideas for tackling the challenge.
category: tutorial

The post JC-AI Newsletter #13 appeared first on foojay.

JC-AI Newsletter #12

Miro Wengner — Wed, 14 Jan 2026 07:15:44 +0000

First of all, Happy New Year 2026! This year is designated in the Chinese Calendar as the Year of the Fire Horse (starting on February 17.). The year 2026 brings not only tremendous energy to AI development but also, in my humble opinion, many breakthroughs in the field.

Although there have been many small steps toward the field's evolution, it often feels that development is stagnating, applying known or slightly tweaked strategies to non-deterministic problems while expecting deterministic results. This includes the often misleading benchmarking strategies (deterministic) performed on synthetic datasets.

The first New Year edition of the JC-AI Newsletter aims to shed light on new approaches and movements in the field, including the directions of its evolution.

Let's jump in and happy reading!

article: Driving is a Game: Combining Planning and Prediction with Bayesian Iterative Best Response
authors: Aron Distelzweig, Yiwei Wang, Faris Janjoš and others
date: 2025-12-03
desc.: Autonomous driving, specifically decision-making, remains a significant challenge. While routine scenarios yield nearly perfect plans using multi-agent collaboration, dense urban traffic presents considerable difficulties, particularly for vehicle lane changes. This paper presents the Bayesian Iterative Best Response (BIReR) framework, which aims to unify motion prediction and planning based on game theory. The framework demonstrates an 11% improvement in lane change performance compared to classical approaches.
category: research

article: PBFuzz: Agentic Directed Fuzzing for PoV Generation
authors: Haochen Zeng, Andrew Bao, Jiajun Cheng, Chengyu Song
date: 2025-12-04
desc.: Proof-of-Vulnerability (PoV) input generation is a critical task in software security. Generating a PoV input requires solving two sets of constraints: (1) reachability constraints for reaching the vulnerable code location(s), and (2) triggering constraints for activating the target vulnerability. Despite dramatic advancements in the LLM field, fuzzing models struggle to solve these constraints effectively. This paper proposes the PBFuzz framework, composed of four layers and enabling property-based directed fuzzing. Although PBFuzz underperformed in several scenarios, it outperforms conventional fuzzers overall.
category: research

article: DSPy: The End of Prompt Engineering - Kevin Madura, AlixPartners Enhancement
authors: AI Engineer, Kevin Madura
date: 2026-01-08
desc.: Applications developed for enterprise environments need to be rigorous, testable, and robust. The same is true for AI-powered applications, but LLMs can make this challenging. In other words, users need to be able to program with LLMs, not just tweak prompts. This talk covers why DSPy may be all users need when building applications with LLMs. Although the talk dives into some real-world examples, the audience is encouraged to explore the DSPy tool themselves to determine whether it fits their particular needs.
category: youtube

article: From Vibe Coding To Vibe Engineering – Kitze, Sizzy
authors: AI Engineer, Ryan Florence
date: 2025-12-14
desc.: Web development has always moved in cycles of hype, from frameworks to tooling. With the rise of large language models, we're entering a new era of "vibe coding," where developers shape software through collaboration with Al rather than syntax. This talk explores what that means for the future of coding, especially in frontend development, and how it echoes the past while redefining what comes next.
category: youtube

article: The AI Bubble Should Have Never Existed In The First Place
authors: Will Lockett
date: 2025-12-07
desc.: The article elaborates on the existence of an AI bubble, arguing that so much money has been poured into AI that we have effectively bet the entire economy on its success. Regardless of whether an AI bubble exists or in what form, the article formulates valid points that should be taken into account when considering future developments.
category: opinion

article: We Let AI Run Our Office Vending Machine. It Lost Hundreds of Dollars
authors: The Wall Street Journal (Antropic)
date: 2025-12-18
desc.: In a research case study supported by Anthropic, the Claudius Agent was developed to manage vending machine operations. Testing revealed multiple exploitable vulnerabilities that allowed users to obtain goods without payment. Real-world trials consistently resulted in operational failures, with the system dispensing free products while automatically reordering inventory, a combination that would lead to bankruptcy in commercial-like deployment.
category: youtube

article: When Small Models Are Right for Wrong Reasons: Process Verification for Trustworthy Agents
authors: Yaqi Duan, Yichun Hu, Jiashuo Jiang
date: 2025-12-31
desc.: Inventory control (encompassing cash management, storage, order quantities, etc.) presents a stochastic control challenge where minor structural errors result in recurring costs. Direct interaction with LLM models may produce plausible yet systematically suboptimal or even inconsistent results. This paper proposes using LLMs not as problem solvers but as language interfaces to enhance optimization through a hybrid agentic approach.
category: research

article: Memory in LLMs: Weights and Activations - Jack Morris, Cornell
authors: AI Engineer, Jack Morris
date: 2025-12-29
desc.: This work examines memory mechanisms in large language models through the lens of weights and activations. Jack Morris addresses the limitations of current Large Language Models (LLMs) in handling niche, long-tail knowledge that falls outside their training data or beyond knowledge cutoffs. He critiques the reliance on massive context windows and Retrieval Augmented Generation (RAG), citing their high computational cost and latency due to the quadratic complexity of self-attention. The core thesis advocates for a third paradigm: training knowledge into weights, efficiently injecting specific knowledge directly into model parameters. This approach treats weights as a memory storage mechanism, conceptually distinct from the working memory represented by activations.
category: youtube

article: There are no new ideas in AI — only new datasets
authors: Jack Morris
date: 2025-07-06
desc.: This article provides a comprehensive overview of progress in the AI field over recent years. All four major breakthroughs in LLMs occurred because researchers unlocked new sources of data. The question remains: what will be the next breakthrough?
category: opinion

article: VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
authors: Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung
date: 2025-12-11
desc.: This paper introduces the Joint Embedding Predictive Architecture for Vision-Language models (VL-JEPA). Current Vision-Language Models (VLMs) are straightforward but inadequate for two main reasons. First, VLMs are expensive to develop. Second, real-time tasks involving live streaming video (e.g., live action tracking) require sparse and selective decoding. The paper empirically validates the advantages of this newly introduced approach against token-generative VLMs. VL-JEPA delivers consistently higher performance on zero-shot captioning and classification while improving inference-time efficiency during the training phase. Although improvements remain in the experimental stage, the work demonstrates clear benefits from scaling both parameters and dataset size.
category: research

article: Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
authors: Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly (Carnegie Mellon Univeristy, Apple)
date: 2024-01-29
desc.: Although this paper is older, it may shed light on the approaches chosen for training LLM models and provide better understanding of their evolution. The paper proposes Web Rephrase Augmented Pre-training (WRAP), which uses an off-the-shelf instruction-tuned model to rephrase noisy input data. It offers insights into how the structure of training data impacts LLM performance.
category: research

article: When Small Models Are Right for Wrong Reasons: Process Verification for Trustworthy Agents
authors: Laksh Advani
date: 2026-01-01
desc.: This paper investigates the reasoning performance of agentic systems based on small language models (Mistral-7B, Llama-3-8B, Qwen-2.5-7B). The findings reveal statistically significant evidence that RAG systems may improve reasoning performance while simultaneously increasing the likelihood of hallucination due to the Right-for-Wrong-Reason (RWR) phenomenon. The paper introduces the Reasoning Integrity Score (RIS) approach to identify hidden flaws in reasoning processes.
category: research

The post JC-AI Newsletter #12 appeared first on foojay.

The End of One-Sized-Fits-All Prompts: Why LLM Models Are No Longer Interchangeable

Nehal Gajraj — Sun, 14 Dec 2025 18:48:31 +0000

Table of Contents

Takeaway 1: LLM choice is now a statement about your productTakeaway 2: Frontier models have divergent ‘personalities’Takeaway 3: End of an era. Prompts are no longer monoliths

The rise of prompt subunits
User feedback and evals

Conclusion

For developers and product builders, one assumption has guided the last few years of LLM application development. To improve your product, just swap in the latest frontier large language model. Flip a single switch and your tool’s capabilities level up.

But that era is over. We’re now seeing that new models like Anthropic’s Claude Sonnet 4.5 and OpenAI’s GPT-5-Codex have diverged in fundamental ways. The choice of which model to use is no longer a simple engineering decision but a critical product decision. Flip that switch today… and the very texture of your product changes.

The one-size-fits-all model era is over; the model you choose now expresses something integral about what your product is and does, as well as, how it works. Whether you want it to or not.

In this article, we’ll explore three surprising takeaways from this new era: why your LLM is now a statement about your product, how models now have distinct personalities and styles, and why your prompts have to now evolve from monolithic instructions to adaptive systems.

Takeaway 1: LLM choice is now a statement about your product

Choosing a model is no longer a straightforward decision where the main consequence of your choice is having to implement a new API. It is now a product decision about the user experience you want to create, the failure modes you can tolerate, the economics you want to optimize for, and the metrics you want to excel in.

Models have developed distinct “personalities,” ways of reasoning, and instincts that directly shape how your product feels and behaves that go beyond just whether its output is technically right or wrong. Choose a different model and everything from what your tool is capable of to how it communicates with your users is significantly different.

So, in a world where traditional benchmarks that primarily or exclusively measure quantitative aspects of a model’s performance are no longer enough, what can you turn to for the data you need to chart your product’s direction? You could survey your team or your users or conduct focus groups but that could lack objectivity if you don’t do it in a rigorous manner.

To make this choice objective for our team, we focused on creating an internal North Star metrics matrix at CodeRabbit. Our metrics don’t just look at raw performance or accuracy. We also take into account readability, verbosity, signal-to-noise ratios, and more.

These kinds of metrics shift the focus from raw performance accuracy or leaderboard performance to what matters to our product and to our users. For example, a flood of low-impact suggestions, even if technically correct, burns user attention and consumes tokens. A theoretically “smarter” model can easily create a worse product experience if the output doesn’t align with your users’ workflow.

I would strongly recommend creating your own North Star metrics to better gauge whether a new model meets your products’ and users’ needs. These shouldn’t be static metrics but should be informed by user feedback and user behavior in your product and evolve over time. Your goal is to find the right list of criteria to measure that predict your users preferences.

What you’ll find is that the right model is the one whose instincts match the designed product behavior and your users’ needs, not the one at the top of any external leaderboard.

Takeaway 2: Frontier models have divergent ‘personalities’

Models are (now more than ever) “grown, not built,” and as a result, the latest generation has developed distinct instincts and behaviors. Different post-training cookbooks have fundamentally changed the direction of each model class. A prompt that works perfectly for one model will not work the same in another. Their fundamental approaches to the same task have diverged.

One powerful analogy that drives this point home is to think of the models as different professional archetypes. Sonnet 4.5 is like a meticulous accountant turned developer, meanwhile GPT-5-Codex is an upright ethical coder, GPT-5 is a bug-hunting detailed developer, and Sonnet 4 was a hyper-active new grad. The GPT-5 model class would make logical jumps further out in the solution space compared to the Claude model class, which tends to stay near the prompts itself. Which model is right for your use case and product, depends entirely on what you are wanting your product to achieve.

At CodeRabbit, we take a methodical approach to model evaluation and characterization. We then use this data to improve how we prompt and deploy models, ensuring we are always using the right model for each use case within our product. To give you an example of how we look at the different models, let’s compare Sonnet 4.5 and GPT-5-Codex. Based on extensive internal use and evals, we characterized Sonnet 4.5 as a “high-recall point-fixer,” aiming for comprehensive coverage. In contrast, GPT-5-Codex acts as a “patch generator,” preferring surgical, local changes.

These qualitative differences translate into hard, operational differences.

Dimension	Claude Sonnet 4.5	GPT-5-Codex
Default Word Choice	“Critical,” “Add,” “Remove,” “Consider”	“Fix,” “Guard,” “Prevent,” “Restore,” “Drop”
Example-Efficiency	Remembers imperatives; benefits from explicit rules	Needs fewer examples; follows the formatting on longer context without additional prompting
Thinking Style	More cautious, catches more bugs but not as many of the critical one	Variable or elastic, less depth when not needed without need to reiterate the rules. Catches more of the hard-to-find bugs
Behavioral Tendencies	Wider spray of point-fixes, more commentary and hedging, inquisitive, more human-like review, finds more critical and non-critical issues	Verbose research-style rationales, notes on second-order effects to code, compact and balanced towards a code reviewer
Review Comment Structure	What’s wrong, why it’s wrong, concrete fix with code chunk	What to do, why do it, concrete fix with effects and code chunk
Context Awareness	Aware of its own context window, tracks token budget, persists/compresses based on headroom	Lacks explicit context window awareness (like cooking without a clock)
Verbosity	Higher, easier to read, double the word count	Lower, harder to read, information-dense

Takeaway 3: End of an era. Prompts are no longer monoliths

Because the fundamental behaviors of models have diverged, a prompt written for one model will not work “as is” on another anymore. For example, a directive-heavy prompt designed for Claude can feel over-constrained on GPT-5-Codex, and a prompt optimized for Codex to explore deep reasoning behavior will likely underperform on Claude. That means that the era of the monolithic, one-size-fits-all prompt is over.

So, what does that mean for engineering teams who want to switch between models or adopt the newest models as they’re released? It means even more prompt engineering! But before you groan at the thought — there are some hacks to make this easier.

The rise of prompt subunits

The first practical solution we’ve found at CodeRabbit is to introduce “prompt subunits.” This architecture consists of a model-agnostic core prompt that defines the core tasks and general instructions. This is then layered on top of smaller, model-specific prompt subunits that handle style, formatting, and examples – and which can be customized to individual models.

When it comes to Codex and Sonnet 4.5, the implementation details for these subunits are likely to be starkly different. We’ve found a few tricks from our prompt testing with both models that we would like to share:

Claude: Use strong language like "DO" and "DO NOT." Anthropic models pay attention to the latest information in a system prompt and are excellent at following output format specifications, even in long contexts. They prefer being told explicitly what to do.
GPT-5: Use general instructions that are clearly aligned. OpenAI models’ attention decreases from top to bottom in a system prompt. These models may forget output format instructions in long contexts. They prefer generic guidance and tend to "think on guidance," demonstrating a deeper reasoning process.

User feedback and evals

The second solution is to implement continuous updates driven by user feedback and internal evaluations. The best practice for optimizing an AI code-review bot or for that matter any LLM applications isn’t using an external benchmark; it’s checking to see if users accept the output.

Evals are more important than ever but have to be designed more tightly around acceptability by users instead of raw performance since one model might be technically correct significantly more than another model but might drown the user in nitpicky and verbose comments, diluting its value to users. By measuring the metrics that matter ~ acceptance rate, signal-to-noise ratio, p95 latency, cost, among others - and tuning prompts in small steps, the system will remain aligned with user expectations and product goals. The last thing you want is great quantitative results on benchmarks and tests but low user acceptance.

Conclusion

This shift from one-size-fits-all prompt engineering to a new model specific paradigm is critical. The days of brittle, monolithic prompts and plug-and-play model swaps are over. Instead, modular prompting, paired with deliberate model choice, give your product resilience.

The ground will keep shifting as models evolve so your LLM stack and prompts shouldn’t be static. Treat it like a living system. Tune, test, listen, repeat.

The post The End of One-Sized-Fits-All Prompts: Why LLM Models Are No Longer Interchangeable appeared first on foojay.

foojay – a place for friends of OpenJDK

“Agentic” Was Everywhere at Money20/20 Amsterdam — Once I Started Looking

Core banking, reimagined as "AI-native"

Payments and money movement

Lending, credit, and collections

Risk, fraud, and compliance

Build-your-own-agent platforms and tooling

Data and infrastructure underneath it all

Content, language, and trust

Services and talent

Conclusions

Explore Spring AI SDK – Amazon Bedrock AgentCore – Part 1

Introduction

Role of Agentic AI Frameworks

What is Spring AI and Spring AI SDK?

What is Amazon Bedrock AgentCore and Why?

Amazon Bedrock AgentCore Capabilties

1. AgentCore Runtime

2. AgentCore Memory

3. AgentCore Gateway

4. AgentCore Identity

5. AgentCore Policy

6. Build-inTool: Code Interpreter

7. Build-inTool: Browser

8. Gen AI Observability

9. Evaluations

Step-by-step guide

Prerequisites

Project Structure

2. Add the below class

3. Add the below Mathematical Tool class

4. Create a Sample Agent

5. Configure Amazon Bedrock Properties

6. Verify

7. Add Streaming

End-to-end flow

References

JC-AI Newsletter #15

JC-AI Newsletter #14

From “Crypto AI” to general AI: Do AI agents dream of electric langoustines?

x402, ERC-8004, A2A, and The Next Wave of AI Commerce: Do AI Agents Dream of Electric Langoustines?

The shift that matters for agent commerce - From “Crypto AI” to general AI

What changed in x402 and ERC-8004 in the last month or so?

This is the moment that unlocked agent commerce

What is still missing?

What does the stack look like in practice?

Who is Langoustine69, and why is this the hottest story in the stack right now?

What does Langoustine’s inventory catalog look like so far?

How does DayDreams plan to bridge crypto AI to general AI?

So, Agentic commerce has developed. What else does the stack need?

What is the takeaway?

Where can we go from here?

Enabling AI Agents to Use a Real Debugger Instead of Logging

The JDK ships a perfectly good debugger. Nobody uses it.

Agent Skills: Teaching new tricks through Markdown

Building the skill: a conversation with Copilot

What the skill contains

The real test: debugging a buggy Swing app, live

The debugging session

A small but important lesson: compile with -g

Why this matters

Beyond println debugging

Interactive debugging as a first-class agent capability

The shift from static analysis to dynamic observation

Try it yourself

What's next

Machine Learning Based SPAM Detection Using ONNX in Java

Which model to use?

The Controller

The Spam Detection Service

Running the service via Docker

Conclusion

JC-AI Newsletter #13

JC-AI Newsletter #12

The End of One-Sized-Fits-All Prompts: Why LLM Models Are No Longer Interchangeable

Takeaway 1: LLM choice is now a statement about your product

Takeaway 2: Frontier models have divergent ‘personalities’

Takeaway 3: End of an era. Prompts are no longer monoliths

The rise of prompt subunits

User feedback and evals

A small but important lesson: compile with `-g`

Beyond `println` debugging