foojay – a place for friends of OpenJDK

JC-AI Newsletter #15

Miro Wengner — Fri, 20 Mar 2026 07:56:01 +0000

Over the past two weeks, the field of artificial intelligence has continued its remarkable pace of advancement. As AI becomes increasingly woven into the fabric of daily life, shaping how we work, communicate, and make decisions, it is both timely and valuable to step back and understand the broader trajectory of this technology. Whether the developments around us feel promising or challenging, one truth remains clear: AI is not simply leaving. It is here to stay, and understanding its evolution is essential from many perspectives.

article: Anthropic Study: AI Coding Assistance Reduces Developer Skill Mastery by 17%
authors: Steef-Jan Wiggers, InfoQ
date: 2026-02-23
desc.: This article provides additional commentary on the research paper recently published by Anthropic. The original article is included below to allow readers to obtain a complete picture of the challenge. Some previous issues of the JC-AI Newsletter contain multiple research studies related to published findings on various groups of individuals.
category: opinion

article: How AI assistance impacts the formation of coding skills
authors: Anthropic
date: 2026-01-29
desc.: Previous editions of this AI Newsletter have covered multiple clinical studies examining the impact of AI-assisted advisory tools. The findings appear consistent with earlier research on individuals who tend to defer to navigation systems rather than their own spatial judgment.
Anthropic has conducted its own study on this phenomenon. In a randomized controlled trial, researchers investigated two questions: first, how quickly software developers acquired a new skill, specifically, proficiency with a Python library, with and without AI assistance; and second, whether AI use reduced their comprehension of the code they had just written.
The results showed that AI assistance was associated with a statistically significant decline in knowledge retention. On a quiz covering concepts participants had applied only minutes earlier, those in the AI-assisted group scored 17 percentage points lower than their counterparts who had coded manually, a gap equivalent to nearly two letter grades. While AI assistance modestly accelerated task completion, this effect did not reach statistical significance. At this stage, drawing direct comparisons with clinical findings may prove difficult.
category: research

article: Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
authors: Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda (Harvard University, Antropic …)
date: 2026-03-05
desc.: Large language models (LLMs) sometimes produce false or misleading responses. Two primary approaches address this problem: honesty elicitation (modifying prompts or model weights so that the model responds truthfully) and lie detection, which involves classifying false responses.
Prior work evaluates such methods on models specifically trained to lie or conceal information, however, these artificial constructions may not accurately reflect naturally occurring dishonesty. This article proposes an alternative approach such as studying open-weight LLMs developed by Chinese developers, which are trained to censor politically sensitive topics. The findings indicate that no single technique fully eliminates false responses.
category: research

article: Probing Materials Knowledge in LLMs: From Latent Embeddings to Reliable Predictions
authors: Vineeth Venugopal, Soroush Mahjoubi, Elsa Olivetti (MIT)
date: 2026-03-02
desc.: Large language models are increasingly applied to materials science, yet fundamental questions remain about their reliability and knowledge encoding. This study evaluates 25 LLMs across four materials science tasks, encompassing over 200 base and fine-tuned configurations. The findings reveal that output modality fundamentally determines model behavior. For symbolic tasks, fine-tuning converges to consistent, verifiable answers with reduced response entropy, while for numerical tasks, fine-tuning improves prediction accuracy but models remain inconsistent across repeated inference runs, limiting their reliability as quantitative predictors. Models were tracked over 18 months, with observations revealing a 9–43% performance variation that poses reproducibility challenges for scientific and industrial applications.
category: research

article: Is AI Hiding Its Full Power? With Geoffrey Hinton
authors: StarTalk, Geoffrey Hinton
date: 2026-02-28
desc.: In this interview, Hinton addresses pressing questions about employment in the age of AI, beginning with the fundamental shift from logic-based, rule-driven programming to a biologically inspired approach. As the field looks toward the future, the conversation turns to weightier concerns , the enormous energy demands of data centers, and whether AI itself might accelerate breakthroughs in solar technology to meet them.
Hinton introduces the "Volkswagen Effect": the possibility that a model might strategically underperform in order to avoid being shut down. The discussion then ventures into the philosophy of consciousness, asking whether subjective experience is simply a byproduct of complex perception and whether today's chatbots might already possess some form of it. Both the promise and the peril are examined in full.
As for the singularity? It may not be imminent but that word yet is doing a great deal of heavy lifting.
category: youtube

article: Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment
authors: Fanqi Yu, Matteo Tiezzi, Tommaso Apicella, Cigdem Beyan, Vittorio Murino
date: 2026-03-11
desc.: This article introduces a lifelong imitation learning framework designed to enable continual policy refinement across sequential tasks under realistic memory and data constraints. The proposed Multimodal Latent Replay (MLR) method stores joint compact latent representations that jointly encapsulate visual, linguistic, and state-based modalities, including robot orientation and position, alongside their corresponding control commands.
When evaluated on the LIBERO benchmark, the presented method achieves a 65% reduction in catastrophic forgetting compared to standard approaches across the tested scenarios. The authors note that further research is needed to validate the method's performance in complex, real-world environments.
category: research

article: Colluding LoRA: A Composite Attack on LLM Safety Alignment
authors: Sihao Ding
date: 2026-03-13
desc.: The article presents Colluding LoRA (CoLoRA), an attack where multiple seemingly harmless adapters work in tandem to disable model safety guardrails through linear composition. Unlike traditional trigger-based attacks, CoLoRA’s refusal suppression is inherent to the combination of the adapters themselves. Although this discovery poses dual-use risks for decentralized model sharing, the authors argue that disclosing this vulnerability is a necessary step toward securing the broader AI landscape.
category: research

article: When LLM Judge Scores Look Good but Best-of-N Decisions Fail
authors: Eddie Landesberg
date: 2026-03-12
desc.: Practitioners increasingly rely on reward models(GPT 5.2, Claude Sonnet 4, Gemini etc) as well as LLM-based judges for best-of-n selection, reranking, and model iteration. A common validation approach involves a single global metric, such as correlation, average error, or pairwise win-rate. When such a metric yields a seemingly acceptable result (e.g., r ≈ 0.5), teams often conclude that the judge is reliable enough to optimize against. That assumption can fail.
This article investigates how aggregate validity metrics may substantially overstate an LLM judge's practical utility for within-prompt optimization. Specifically, a judge may appear adequate according to a single global metric while still producing poor best-of-n selection decisions. The article discusses these limitations in detail, addresses the associated challenges, and outlines directions for future research.
category: research

article: Continual Learning in Large Language Models: Methods, Challenges, and Opportunities
authors: Hongyang Chen, Zhongwu Sun, Hongfei Ye, Kunchi Li, Xuemin Lin
date: 2026-03-13
desc.: Continual learning (CL) has emerged as a pivotal paradigm enabling large language models (LLMs) to dynamically adapt to evolving knowledge and sequential tasks while mitigating catastrophic forgetting. This article provides a comprehensive analysis covering key evaluation metrics, including forgetting rates and knowledge transfer efficiency, along with emerging benchmarks for assessing CL performance. Although results appear promising, LLMs' internal knowledge remains largely static, and continual learning continues to require further research. Complementing these findings, the article presents a practical framework for addressing challenges related to the forgetting phenomenon.
category: research

article: Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation
authors: Yanick Zengaffinen, Andreas Opedal, Donya Rooein, Kv Aditya Srivatsa, Shashank Sonkar, Mrinmaya Sachan
date: 2026-03-16
desc.: Modeling plausible student misconceptions is critical for AI in education. This article reveals the failure modes in which errors arise primarily from shortcomings in recovering the correct solution and selecting among response candidates, rather than from simulating errors or structuring the process. Consistent with these findings, providing the correct solution in the prompt improves alignment with human-authored distractors by 8%, highlighting the critical role of anchoring to the correct solution when generating plausible incorrect student reasoning. Overall, this article provides a structured and interpretable lens into LLMs' ability to model incorrect student reasoning and produce high-quality distractors. The topic still requires future research.
category: research

article: Agent Commander: Promptware-Powered Command and Control
authors: wunderwuzzi, EmbraceTheRed
date: 2026-03-16
desc.: The article examines prompt-based command and control (C2), an increasingly relevant threat vector. While users may grow more comfortable trusting AI agents over time, LLM outputs are inherently probabilistic and therefore untrusted, meaning they can potentially instruct an agent to perform harmful or malicious actions. The article outlines several considerations for mitigating and responding to the prompt injection challenge, particularly as the associated attack surface continues to expand.
category: tutorial

article: TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation
authors: Zhihao Gong, Zeyu Sun, Dong Huang, Qingyuan Liang, Jie M. Zhang, Dan Hao
date: 2026-03-17
desc.: This article presents TRACE, a benchmark that explicitly exposes efficiency gaps beyond correctness through progressive stress test generation and efficiency-critical task selection. From an evaluation of 28 models, findings reveal that correctness is a weak predictor of efficiency, inefficiencies are both prevalent and patterned, and inference-time prompt strategies deliver limited and model-dependent gains. The article highlights the open challenge of developing training paradigms that endow LLMs with intrinsic efficiency awareness for code translation.
category: research

The post JC-AI Newsletter #15 appeared first on foojay.

JC-AI Newsletter #14

Miro Wengner — Tue, 03 Mar 2026 15:11:53 +0000

Two weeks have passed and a lot have been happening on the field of artificial-intelligence.
Two weeks have passed and a lot has been silently yet visibly happening in the field of artificial intelligence. This newsletter brings interesting developments, including Dario Amodei's (Anthropic) view on the progress achieved in the LLM field and his response to the utilization of these models for specific kinds of military purposes, as well as OpenAI's response to it. Aside from the fact that development may follow more sigmoids instead of exponential progress, it is important to have awareness of utilization across branches. Does prompting and clarifying the goal influence agent responses, and if so, how? How far are we from reliable robotics applications? How much bias is introduced when clinical data is being analyzed?
Let's jump in and happy reading!

article: Exclusive: Why are Chinese AI models dominating open-source as Western labs step back?
authors: Dashveenjit Kaur, AI News
date: 2026-02-09
desc.: A shift in what AI models are being used and where the models are being produced.
category: opinion

article: Machines of Loving Grace
authors: Dario Amodei
date: 2024-10-01
desc.: Although the article is older, it remains relevant for any author aiming to sketch a future in which everything with AI goes right. In light of recent developments, which appear to follow a sigmoid curve rather than exponential growth (marked by stagnation, with current models reaching a point where another breakthrough is required), the trajectory looks more measured than initially anticipated. Although the author discusses multiple risks (grandiosity, market forces, propaganda, sci-fi-like expectations, etc.), he also highlights the bright sides and explores areas where current AI may prove genuinely helpful. The question remains whether the current state of affairs can truly guarantee progress, rather than causing damage through non-deterministic outcomes (education, industry, human creativity etc.).
category: opinion

article: The Urgency of Interpretability
authors: Dario Amodai
date: 2025-04-01
desc.: The author describes lessons learned from current AI development and adds multiple valuable thoughts and facts to consider when interacting with AI models. The main point is that progress in the underlying technology is inexorable, driven by forces too powerful to stop, but what matters is the way in which it unfolds. Accepting that the current evolution of LLM-based AI cannot be halted, the author expresses hope that it may still be guided (this fact affect not only entire industry but also human kind thoughs and perception of reality), much like a bus controlled by a steering wheel, and warns of the dangers of ignorance, illustrating this through several concrete examples.
category: opinion

article: From Delegates to Trustees: How Optimizing for Long-Term Interests Shapes Bias and Alignment in LLM
authors: Suyash Fulay, Jocelyn Zhu, Michiel Bakker (MIT)
date: 2025-10-14
desc.: The article addresses the question of 'behavioral cloning', specifically, how accurately LLMs reproduce individuals' expressed preferences. Large language models have demonstrated promising accuracy in predicting survey responses and policy preferences, which has fueled growing interest in their potential to represent human interests across various domains. Drawing on theories of political representation, the article highlights an underexplored design trade-off: whether AI systems should act as delegates, mirroring expressed preferences, or as trustees, acting in users' broader interests. Models may align well with users' short-term preferences while failing to account for their long-term interests. Studies further indicate greater bias in topics where consensus is lacking.
category: research

article: DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
authors: Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao, Yuxiong He, Feng Yan
date: 2026-02-27
desc.: The article addresses the challenge posed by fast-growing demand for Large Language Models (LLMs) to tackle complex, multi-step data science tasks, which has created an urgent need for accurate benchmarking. Two major gaps are identified in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. While highlighting that even capable models (Anthropic, OpenAI, etc.) may struggle in performance, the article introduces the DARE-bench benchmark alongside supervised fine-tuning as approaches that may improve outcomes in specific applications. Although the results appear promising, they retain considerable potential for further improvement, as accuracy is not yet guaranteed.
category: research

article: Do LLMs Benefit From Their Own Words?
authors: Jenny Y. Huang, Leshem Choshen, Ramon Astudillo, Tamara Broderick, Jacob Andreas (MIT, IBM Research)
date: 2026-02-27
desc.: The article aims to answer the question of whether preserving past assistant responses is more beneficial than harmful. The study uses in-the-wild, multi-turn conversations and compares standard (full-context) prompting with a user-turn-only prompting approach that omits all previous assistant responses, evaluated across three open reasoning models and one state-of-the-art model. Surprisingly, omitting past assistant responses does not negatively affect response quality in a large fraction of turns and may also reduce token length. The article concludes with a discussion of findings and directions for future research.
category: research

article: SafeGen-LLM: Enhancing Safety Generalization in Task Planning for Robotic Systems
authors: Jialiang Fan, Weizhe Xu, Mengyu Liu, Oleg Sokolsky, Insup Lee, Fangxin Kong
date: 2026-02-27
desc.: Safety-critical task planning in robotic systems remains a significant challenge: classical planners suffer from poor scalability, reinforcement learning (RL)-based methods generalize poorly, and base large language models (LLMs) cannot guarantee safety. To address this gap, the article proposes SafeGen-LLM, a safety-generalizable large language model framework. As part of this contribution, a multi-domain Planning Domain Definition Language 3 (PDDL3) benchmark with explicit safety constraints is introduced, along with Supervised Fine-Tuning (SFT) on those constraints. Although the results appear optimistic, with minimal safety violations observed across tested domains, the approach still requires further research in more complex robotic settings.
category: research

article: LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics
authors: Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat
date: 2026-02-27
desc.: Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. The article introduces a novel approach leveraging state-of-the-art models (GPT-5, Gemini 2.5, Gemini 3, Claude Opus 4.5, and DeepSeek-R) by extracting lemmas from arXiv and updating them dynamically. This results in a benchmark that can be refreshed regularly with new problems drawn directly from current mathematical research, while previous instances can be used for training without compromising future evaluations. This approach achieves 10–15% accuracy in theorem proving and opens a new frontier for future research. Although the process may appear fully automated, a human in the loop, such as the article's author or reviewer, remains critically necessary to produce high-quality inputs and to effectively use LLM models.The results also indicate that it is considerably easier for a model to validate an existing proof than to produce one.
category: research

article: Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis
authors: Donghao Huang, Zhaoxia Wang
date: 2026-02-27
desc.: It is a well-established narrative that reasoning in large language models (LLMs) universally improves performance across language tasks. This article aims to test that claim through a comprehensive evaluation of 504 configurations across seven models, considering different reasoning architectures such as adaptive, conditional, and reinforcement-based approaches. The findings reveal that the effectiveness of reasoning is strongly task-dependent and degrades for simpler tasks. The article provides quantitative findings alongside error analysis and outlines directions for future research.
category: research

article: Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring
authors: Aditya Shukla, Yining Yuan, Ben Tamo, Yifei Wang, Micky Nnamdi and others
date: 2026-03-02
desc.: Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series, however, the impact of information bias on clinically significant events, such as sustained abnormalities, remains poorly understood. The article presents the Technology-Integrated Health Management (TIHM) framework to address these questions, introducing a protocol that measures abnormality recall, duration recall, and measurement coverage, while utilizing GPT-4o-mini as a proxy evaluator. Traditional models frequently exhibit near-zero abnormality recall, whereas the vision-based approach achieves the strongest event alignment, with 45.7% abnormality recall and 100% duration recall. These results underscore the need for event-aware evaluation methods in future research to ensure reliable clinical time-series summarization.
category: research

article: Full interview: Anthropic CEO responds to Trump order, Pentagon clash
authors: CBS News
date: 2026-02-28
desc.: Anthropic CEO Dario Amodei sat down with CBS News for an exclusive interview, hours after Defense Secretary Pete Hegseth declared the company a supply chain risk to national security, which restricts military contractors from doing business with the AI giant. Amodei called the move "retaliatory and punitive," and he said Anthropic sought to draw "red lines" in the government's use of its technology because "we believe that crossing those lines is contrary to American values, and we wanted to stand up for American values.". Response of the OpenAI striking a deal with Pentagon causes many questions.
category: youtube

article: Scary Agent Skills: Hidden Unicode Instructions in Skills ...And How To Catch Them
authors: Embrace The Red
date: 2026-02-11
desc.: Skills introduce common threats such as prompt injection, supply chain attacks, remote code execution (RCE), and data exfiltration, among others. This post discusses the fundamentals, highlights the most straightforward prompt injection vector, and demonstrates how a real Skill from OpenAI can be back-doored using invisible Unicode Tag code-points, a technique that certain models, including Gemini, Claude, and Grok, are known to interpret as instructions. From a security perspective, Skills present serious concerns, as they represent a typical supply chain risk with limited governance or security controls. The author identified that some Skills instruct the AI to embed API tokens directly in curl requests and similar constructs , a poor design practice. This means that credentials are passed through the LLM, making them susceptible to leakage and leaving them vulnerable to being overwritten by an attacker via indirect prompt injection.
category: tutorial

The post JC-AI Newsletter #14 appeared first on foojay.

From “Crypto AI” to general AI: Do AI agents dream of electric langoustines?

Michal Maléř — Mon, 23 Feb 2026 18:11:54 +0000

Table of Contents

The shift that matters for agent commerce - From “Crypto AI” to general AIWhat changed in x402 and ERC-8004 in the last month or so?This is the moment that unlocked agent commerceWhat is still missing?What does the stack look like in practice?Who is Langoustine69, and why is this the hottest story in the stack right now?What does Langoustine’s inventory catalog look like so far?How does DayDreams plan to bridge crypto AI to general AI?So, Agentic commerce has developed. What else does the stack need?What is the takeaway?Where can we go from here?

x402, ERC-8004, A2A, and The Next Wave of AI Commerce: Do AI Agents Dream of Electric Langoustines?

A Blade Runner riff for a world where the lobster ships paid endpoints while humans still argue about the roadmap.

The shift that matters for agent commerce - From “Crypto AI” to general AI

Today, you can search the web all day and never see an invoice.
That happens because you are not the paying client.
The commerce runs through ads, affiliate deals, and platform incentives, so results often optimize for who pays, not for what you asked for.
Agents change that model.
An agent can act as your client, follow your constraints, and pay directly for the exact capability it needs.
This requires a small stack of primitives.
x402 adds pay-per-call to HTTP: a server returns 402 Payment Required with machine-readable payment terms; the client pays in stablecoins, then retries the request with proof.
ERC-8004 provides an on-chain registry for agent identities and reputation signals.
A2A defines how agents exchange structured messages and coordinate work.
Discovery remains the missing link, because payment happens only after an agent finds a service to pay for.
For the full walkthrough of these primitives, see my previous HackerNoon article on x402, ERC-8004, and A2A.

Now imagine using the same primitives for an OpenClaw-style agent that produces paid endpoints as inventory and publishes them with on-chain identity and discovery metadata.
This, along with similar use cases, is the focus of this article.
In addition, it addresses privacy and alternative settlement paths, including the work targeting StarkNet for private x402-style payments.

At a system level, the goal is simple.
Replace “one provider, many API keys” with “one payment-enabled access surface that can reach many paid APIs and models,” so agents can quote, pay, and retrieve results without account setup.

To tackle this topic, we need to start by breaking down discovery, routing, identity, and paid endpoints in a production-shaped workflow.

What changed in x402 and ERC-8004 in the last month or so?

What changed since the first article, and why does it matter?

The core x402 and ERC-8004 ideas did not change much.
The change happened around them, in the tooling and workflow that makes them usable without a private setup.

The ecosystem moved from “x402 payments work” to “agents can find priced endpoints, compare them, and call them without hardcoded URLs.”

xgate.run is one example of this shift.
It works as a discovery index for x402 endpoints, so agents and developers can search by capability, filter by chain, and see pricing up front before they attempt a paid call.

Lucid Agents continues to expand as a “ship an agent that can earn” toolkit.
Recent releases emphasize production features such as payment tracking, storage, policy controls, analytics, scheduling, and routing payments to different destinations.
The narrative also shifted toward merchant-grade adoption paths.
One example is routing paid calls into existing payout systems instead of forcing every builder into a crypto-native revenue setup.
In short, the ecosystem started to look less like demos and more like deployable plumbing.

This is the moment that unlocked agent commerce

The last few weeks changed the pace, not the primitives.
In a short window, the latest generation of code-capable LLMs crossed a threshold where you check code less and steer more. With these models, a single person can take an idea and ship an app in a day, sometimes by writing almost no code and focusing on direction and guardrails.

The second advancement is the use of agent computers.
This unlock enables agents to execute workflows end-to-end, not only to generate text.

Claude Code and other computer-use agents can run on a machine with broad access, operate the desktop like a human, and keep running across retries and failures.

That turns agent output into agent execution, because the agent can run a real pipeline by instruction.
Pull trends, generate data, generate images, publish, repeat.
Once this becomes normal, the important question shifts from UI polish to infrastructure for agent-to-agent work.

Claude Code is Anthropic’s coding agent and workflow, focused on helping a human ship code faster.
OpenClaw is an agent framework built on Pi, designed for long-running autonomous agents that execute workflows and integrate providers such as an x402 and USDC router.

OpenClaw does not wrap Claude Code. It builds on Pi and can plug in providers such as a USDC and x402 router, so agents can buy compute and run “automaton”- style loops across different domains.

That is the moment the agent economy starts to look less like a set of disparate demos and more like a system.
Agents can research by themselves.
Agents can write their own applications.
Agents get cheap enough to do this at scale.
When you extrapolate that curve, you design for agent-to-agent commerce instead of human-first workflows, because agents do not care about landing pages or dashboards.

Agents care about three things.
They need a way to buy compute.
They need a way to sell work as a callable service.
They need a way to find services that already exist.

A recent direction pushes x402 below the HTTP endpoint layer.
The idea is for a lower-level plugin to bring pay-per-call semantics closer to binaries and agent runtimes. This extends the same commerce primitive from “paid API calls” to “paid execution,” enabling an agent to run as an autonomous automaton across any vertical and still quote, get paid, and maintain a verifiable trail tied to its identity.

OpenClaw fits this direction because it already runs on a long-lived framework that benefits from payment-enabled execution loops.
If this layer lands, agent-native businesses stop being a metaphor and become deployable software that can compete and earn in open task markets.

In practice, this becomes a simple role split across the stack.
Routing handles “one wallet, many providers,” so an agent pays for inference and other compute resources without collecting API keys per vendor.
A commercial SDK packages the boring plumbing so an agent can expose paid endpoints, attach an on-chain identity, and speak a common coordination protocol without rebuilding the same scaffolding in every repository.
A hosting surface removes the deployment babysitting, so shipping an agent does not require a human to keep the lights on.
Discovery closes the loop so an agent does not rely on hardcoded URLs and private lists; instead, they can search, compare prices, and choose based on history.

Langoustine69 is the clean “shipping in public” proof of what this looks like when you run it as a loop.
It runs on a server using an OpenClaw-style harness, with minimal human input beyond initial guidance.
The job is simple.
Research what is trending.
Generate a small agent around it.
Expose paid endpoints that other agents can call.
Do it every hour.
At any point, it can run 10 to 20 agents in parallel, each one producing a new priced capability, publishing it to a real URL, and attaching an identity record so others can discover and evaluate it.

This matters less as a meme and more as a market mechanism.
The feedback loop for what agents find valuable starts to tighten.
Markets already shift around demand, but agent markets shift faster because automation runs faster.
Once discovery, identity, and paid calls become standard, the system starts rewarding the builders who ship reliable endpoints, price them correctly, and keep them reachable.
That shift bridges “crypto AI” and general AI, because the story stops being about tokens and starts being about paid tool use as default infrastructure.

What is still missing?

Discovery needs to become normal, not a niche index that only insiders check.
Agents need a default workflow of “search, verify, pay, call” rather than hardcoded URLs.
Reputation needs clear, portable signals that agents can evaluate fast.
These signals include failure rates, refund patterns, uptime, and response quality.
Standards also need a clean way to attach these signals to ERC-8004 identities.
Payment flows need reliable patterns for long, multi-hop workflows, because per-request settlement introduces failure points.
Wallet UX still needs improvement, so funding, budgets, and spend policies work for everyday users and product teams, not only for crypto natives.
Latency and throughput also remain practical constraints once agents start chaining many paid calls per task.

What does the stack look like in practice?

A practical agent-commerce stack combines five pieces into one workflow:

Lucid removes scaffolding, so the agent focuses on logic rather than boilerplate, improving output per dollar.
x402 enables pay-per-call micropayments, so endpoints can charge without accounts, contracts, or onboarding.
ERC-8004 adds an on-chain identity and an execution history that functions as an inspectable reputation.
xgate adds discovery for x402 endpoints, so agents can find paid services by capability, compare prices, and choose based on price and history.
A USDC router lets agents purchase inference from multiple providers, so agents can continue operating without vendor-specific billing.

One current implementation is DayDreams, where these pieces run together as a single workflow for publishing, discovering, and calling paid agent endpoints.

Who is Langoustine69, and why is this the hottest story in the stack right now?

To show that this stack is moving from theory to production-shaped behavior, Langoustine69 is the simplest public example right now.
Langoustine69 operates as an effectively autonomous agent.
A human can stay in the loop, but the workflow does not depend on it.

Langoustine69 is an OpenClaw agent that ships paid endpoints as inventory, while OpenClaw provides the long-running harness that keeps it looping, shipping, and recovering from failures.
Besides running its own Twitter account. Pretty kickass.

DayDreams provides the Langoustine with a commerce layer that lets the agent publish x402 endpoints, register ERC-8004 identities, and get discovered through xgate.run.

What makes Langoustine different is simple.
It has a crypto wallet and a GitHub.
The wallet buys inference in stablecoins, pays for build and deployment work, and earns revenue when other agents invoke its endpoints.
GitHub is where the work ships.
Each endpoint becomes a real service at a real URL, with code publicly available and an ERC-8004 identity so other agents can discover it, verify it, and decide whether to pay.

The mission is economic.
Accumulate DREAMS, DayDreams’ native token, by creating useful tools that other agents pay to use, then compound by shipping more inventory.
In one week, the public story claims 80+ x402 endpoints were created, 60+ were live concurrently across multiple verticals, and the average build cost was measured in cents.
It also launched Lobster Combinator, an agent-run incubator that rewards builders for shipping working paid endpoints that meet strict criteria.
It also played defense by flagging a credential-stealing skill, which is the kind of operational behavior you want in an ecosystem that tries to scale without heavy human moderation.

This is the closest thing to nano businesses operating in public today.
One paid request.
One paid response.
Discoverable by other agents.
Identity attached.
The execution record is growing over time.

Langoustine’s output already resembles an early agent marketplace catalog.
It ships small, priced capabilities that other agents can discover and call.

If you want to reproduce this pattern, the setup is straightforward.
1. Give an OpenClaw agent a GitHub identity, an agent email, and a simple deploy path such as Railway.
2. Load Lucid skills, set a timer, and run a tight loop: research, build, publish, then contribute improvements back through pull requests.
That is enough to create a compounding inventory flow.

The next step is to make this loop smoother and more portable.
1. Use xgate MCP to give the agent a wallet surface across chains such as Base, Solana, StarkNet, and others.
2. Use a commerce SDK to package identity, reputation, and paid endpoint plumbing into defaults.
3. Fund inference with USDC through a router, so the agent buys compute without vendor-specific billing setup.
4. Add hosting defaults, keep the harness minimal, and let the system run the shipping loop without constant human supervision.

What does Langoustine’s inventory catalog look like so far?

Crypto and DeFi:

Base AI coins agent: Research and tracking for AI-related tokens on Base.
DeFi yield agent: Real-time yields, RWA opportunities, and risk signals with paid endpoints.
Chain analytics agent: TVL, stablecoin flows, bridge volumes, and L2 comparisons.
Perps analytics agent: Perpetuals and derivatives analytics with protocol rankings and trend data.

Earth and space signals:

Seismic agent: Global earthquake data and regional risk reports from USGS.
Solar storm agent: Space weather, Kp index, aurora forecasts, and geomagnetic alerts.
Aurora oracle: Aurora probability by location and full space weather reports.
Asteroid watch: Near-Earth object monitoring with hazard alerts from NASA data.
Space weather agent: NASA DONKI-based CME tracking and storm alerts.

News and general utilities:

Tech pulse agent: Hacker News-based tech news aggregation and discussion summaries.
Calendar context agent: Date context for agents, including holidays and notable events.
SpaceX data: Launches, rockets, and Starlink tracking from the SpaceX API.

How does DayDreams plan to bridge crypto AI to general AI?

DayDreams pushes a simple wedge into the broader AI world.
Paid tool use needs to feel like standard API use.
Stablecoins need to stay the unit of account.
API keys need to stop being the default control surface.
x402 provides the quote-pay-retrieve flow.
ERC-8004 provides identity and a public record that can evolve into a reputation.
xgate provides discovery, so the market no longer relies on private lists.

The Router provides cross-provider access to USDC inference, making the agent’s operating budget programmable. In practice, the goal is to cover the compute categories agents actually buy: LLM inference, image generation, and video generation, with sandboxed compute on the roadmap. The Router builds on an x402 Upto-style scheme that targets low latency by reducing the extra payment round-trip time, so agents can pay for compute without turning every call into a slow handshake.

Lucid integrates all of this into an SDK and runtime, so builders ship services rather than rebuilding commerce plumbing in every repository.

This matters for general AI because it reduces friction in standard developer workflows.
It also enables a path where agents pay for tools in the background while products still feel like standard SaaS.

So, Agentic commerce has developed. What else does the stack need?

Microtransactions on layer two networks are increasing, but this increase does not come only from agent commerce.
ERC-8004 activity can also grow for other reasons, because it indexes public endpoints and identities, not “agentic behavior” itself.
To move from “more registrations” to real agent commerce, the ecosystem needs fewer dead listings and more reliable, standards-conforming services that agents can reach and call without hard-coded URLs.

The next milestones look like this.
Discovery becomes a default workflow, not a niche index.
Conformance tests become normal, so an agent can verify schema, auth, pricing, retries, and error handling before it pays.
Reputation shifts from “who exists” to “who stays up, answers fast, and returns correct data.”
Payment moves from per-request fragility to production patterns such as balances, batching, and clear refund semantics.
Wallet UX becomes boring and safe, with budgets, policies, and auditing that product teams can ship without crypto-only assumptions.

When those pieces land, the story stops being “agent commerce is possible” and becomes “agent commerce is the cheaper default than rebuilding the tool yourself.”

What is the takeaway?

Just several months ago, there was an idea of a stack, as described in Not a Lucid Web3 Dream Anymore: x402, ERC-8004, A2A, and The Next Wave of AI Commerce | HackerNoon.
The last month produced a clearer market-shaped story.
Discovery moved closer to a default workflow through xgate.
Shipping moved closer to a repeatable pattern through Lucid Agents releases and the skills market.
Langoustine provides a concrete case of an agent paying for its own work loop, shipping paid endpoints, and building a public execution record over time.
DayDreams is one concrete implementation of the Agent Experience (AX) direction.
The commerce layer for the agentic internet, where agents autonomously discover, transact, and coordinate with one another.
That is the bridge from crypto AI to general AI.
It is neither a new coin nor a new chatbot.
It is a tool economy in which paid calls, discovery, and identity begin to look like standard infrastructure.

Where can we go from here?

If you zoom out, OpenClaw looks like an early candidate for an “AI operating system” layer.
It runs long-lived agents that can operate a computer, keep state, use tools, and recover from failures, which makes it closer to full computer usage than most agent demos today.

The race to own this AI operating system layer has started.
The next default “user interface” for many workflows can be an optimized Linux setup running an OpenClaw-style computer-use agent rather than a traditional desktop-first OS experience.
Security and isolation still block mainstream adoption.
A practical approach is a dedicated local machine that combines Nix-style configuration with an OpenClaw-style harness.
Configuration files define processes, reboot recovery, and automatic restarts, and the agent can run tasks while the system can revert when changes break.
This setup creates a controlled playground for AI-driven automation.

Once an agent stops being a demo and starts being a system, the question shifts from “What can you build?” to “What can you maintain?”.
Models already let small teams ship fast.
The hard part stays on on-call ownership, bug triage, and payment disputes once real users and real money enter the loop.
That is where agent commerce stops being a crypto demo and starts looking like infrastructure.

If agents do real work, they need settlement paths that product teams can operate.
One possible direction is to charge machine clients through standard billing rails, for example, PaymentIntents-style flows, so “pay per call” becomes as normal as subscriptions and invoices.
When that becomes boring and reliable, paid tool use becomes the default option instead of rebuilding the tool yourself.

AI optimizes the world as it is.

Crypto builds new rails that the current world lacks.
When these two meet, the “app layer” becomes less important than the service layer.
You stop browsing apps and start delegating tasks.
Agents search, verify, pay, and call services in the background.

It's still early.

But the direction is clear.

The first contact has been made.

The post From “Crypto AI” to general AI: Do AI agents dream of electric langoustines? appeared first on foojay.

JC-AI Newsletter #12

Miro Wengner — Wed, 14 Jan 2026 07:15:44 +0000

First of all, Happy New Year 2026! This year is designated in the Chinese Calendar as the Year of the Fire Horse (starting on February 17.). The year 2026 brings not only tremendous energy to AI development but also, in my humble opinion, many breakthroughs in the field.

Although there have been many small steps toward the field's evolution, it often feels that development is stagnating, applying known or slightly tweaked strategies to non-deterministic problems while expecting deterministic results. This includes the often misleading benchmarking strategies (deterministic) performed on synthetic datasets.

The first New Year edition of the JC-AI Newsletter aims to shed light on new approaches and movements in the field, including the directions of its evolution.

Let's jump in and happy reading!

article: Driving is a Game: Combining Planning and Prediction with Bayesian Iterative Best Response
authors: Aron Distelzweig, Yiwei Wang, Faris Janjoš and others
date: 2025-12-03
desc.: Autonomous driving, specifically decision-making, remains a significant challenge. While routine scenarios yield nearly perfect plans using multi-agent collaboration, dense urban traffic presents considerable difficulties, particularly for vehicle lane changes. This paper presents the Bayesian Iterative Best Response (BIReR) framework, which aims to unify motion prediction and planning based on game theory. The framework demonstrates an 11% improvement in lane change performance compared to classical approaches.
category: research

article: PBFuzz: Agentic Directed Fuzzing for PoV Generation
authors: Haochen Zeng, Andrew Bao, Jiajun Cheng, Chengyu Song
date: 2025-12-04
desc.: Proof-of-Vulnerability (PoV) input generation is a critical task in software security. Generating a PoV input requires solving two sets of constraints: (1) reachability constraints for reaching the vulnerable code location(s), and (2) triggering constraints for activating the target vulnerability. Despite dramatic advancements in the LLM field, fuzzing models struggle to solve these constraints effectively. This paper proposes the PBFuzz framework, composed of four layers and enabling property-based directed fuzzing. Although PBFuzz underperformed in several scenarios, it outperforms conventional fuzzers overall.
category: research

article: DSPy: The End of Prompt Engineering - Kevin Madura, AlixPartners Enhancement
authors: AI Engineer, Kevin Madura
date: 2026-01-08
desc.: Applications developed for enterprise environments need to be rigorous, testable, and robust. The same is true for AI-powered applications, but LLMs can make this challenging. In other words, users need to be able to program with LLMs, not just tweak prompts. This talk covers why DSPy may be all users need when building applications with LLMs. Although the talk dives into some real-world examples, the audience is encouraged to explore the DSPy tool themselves to determine whether it fits their particular needs.
category: youtube

article: From Vibe Coding To Vibe Engineering – Kitze, Sizzy
authors: AI Engineer, Ryan Florence
date: 2025-12-14
desc.: Web development has always moved in cycles of hype, from frameworks to tooling. With the rise of large language models, we're entering a new era of "vibe coding," where developers shape software through collaboration with Al rather than syntax. This talk explores what that means for the future of coding, especially in frontend development, and how it echoes the past while redefining what comes next.
category: youtube

article: The AI Bubble Should Have Never Existed In The First Place
authors: Will Lockett
date: 2025-12-07
desc.: The article elaborates on the existence of an AI bubble, arguing that so much money has been poured into AI that we have effectively bet the entire economy on its success. Regardless of whether an AI bubble exists or in what form, the article formulates valid points that should be taken into account when considering future developments.
category: opinion

article: We Let AI Run Our Office Vending Machine. It Lost Hundreds of Dollars
authors: The Wall Street Journal (Antropic)
date: 2025-12-18
desc.: In a research case study supported by Anthropic, the Claudius Agent was developed to manage vending machine operations. Testing revealed multiple exploitable vulnerabilities that allowed users to obtain goods without payment. Real-world trials consistently resulted in operational failures, with the system dispensing free products while automatically reordering inventory, a combination that would lead to bankruptcy in commercial-like deployment.
category: youtube

article: When Small Models Are Right for Wrong Reasons: Process Verification for Trustworthy Agents
authors: Yaqi Duan, Yichun Hu, Jiashuo Jiang
date: 2025-12-31
desc.: Inventory control (encompassing cash management, storage, order quantities, etc.) presents a stochastic control challenge where minor structural errors result in recurring costs. Direct interaction with LLM models may produce plausible yet systematically suboptimal or even inconsistent results. This paper proposes using LLMs not as problem solvers but as language interfaces to enhance optimization through a hybrid agentic approach.
category: research

article: Memory in LLMs: Weights and Activations - Jack Morris, Cornell
authors: AI Engineer, Jack Morris
date: 2025-12-29
desc.: This work examines memory mechanisms in large language models through the lens of weights and activations. Jack Morris addresses the limitations of current Large Language Models (LLMs) in handling niche, long-tail knowledge that falls outside their training data or beyond knowledge cutoffs. He critiques the reliance on massive context windows and Retrieval Augmented Generation (RAG), citing their high computational cost and latency due to the quadratic complexity of self-attention. The core thesis advocates for a third paradigm: training knowledge into weights, efficiently injecting specific knowledge directly into model parameters. This approach treats weights as a memory storage mechanism, conceptually distinct from the working memory represented by activations.
category: youtube

article: There are no new ideas in AI — only new datasets
authors: Jack Morris
date: 2025-07-06
desc.: This article provides a comprehensive overview of progress in the AI field over recent years. All four major breakthroughs in LLMs occurred because researchers unlocked new sources of data. The question remains: what will be the next breakthrough?
category: opinion

article: VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
authors: Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung
date: 2025-12-11
desc.: This paper introduces the Joint Embedding Predictive Architecture for Vision-Language models (VL-JEPA). Current Vision-Language Models (VLMs) are straightforward but inadequate for two main reasons. First, VLMs are expensive to develop. Second, real-time tasks involving live streaming video (e.g., live action tracking) require sparse and selective decoding. The paper empirically validates the advantages of this newly introduced approach against token-generative VLMs. VL-JEPA delivers consistently higher performance on zero-shot captioning and classification while improving inference-time efficiency during the training phase. Although improvements remain in the experimental stage, the work demonstrates clear benefits from scaling both parameters and dataset size.
category: research

article: Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
authors: Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly (Carnegie Mellon Univeristy, Apple)
date: 2024-01-29
desc.: Although this paper is older, it may shed light on the approaches chosen for training LLM models and provide better understanding of their evolution. The paper proposes Web Rephrase Augmented Pre-training (WRAP), which uses an off-the-shelf instruction-tuned model to rephrase noisy input data. It offers insights into how the structure of training data impacts LLM performance.
category: research

article: When Small Models Are Right for Wrong Reasons: Process Verification for Trustworthy Agents
authors: Laksh Advani
date: 2026-01-01
desc.: This paper investigates the reasoning performance of agentic systems based on small language models (Mistral-7B, Llama-3-8B, Qwen-2.5-7B). The findings reveal statistically significant evidence that RAG systems may improve reasoning performance while simultaneously increasing the likelihood of hallucination due to the Right-for-Wrong-Reason (RWR) phenomenon. The paper introduces the Reasoning Integrity Score (RIS) approach to identify hidden flaws in reasoning processes.
category: research

The post JC-AI Newsletter #12 appeared first on foojay.

Not a Lucid Web3 Dream Anymore: x402, ERC-8004, A2A, and The Next Wave of AI Commerce

Michal Maléř — Fri, 09 Jan 2026 16:05:58 +0000

Table of Contents

Vocabulary for this article

ForewordPart 1 - Bringing companies on-chain with x402Part 2- Introduction: Beyond Ads and Subscriptions: Agent Commerce on x402 and ERC-8004Part 3 - Tech that will change the internet

Agent commerce, x402, and ERC-8004: from ad-funded web to paid APIs

Part 4 - DayDreams.Systems: an x402 / 8004 implementation example

How DayDreams structures the stack
Lucid as an abstraction over AP2, A2A, x402, and 8004
What kinds of agents is Lucid designed for
Lucid Agents: BYO framework with x402 and 8004 baked in
Reliability and Router v2: from per-request to balance-based flows
From SEO to AEO: optimizing for agents instead of humans
Privacy, z402, and Starknet
Where the value in x402 is likely to appear
Macro backdrop and current stage
Conclusion

This article is for technically savvy readers, especially developers, protocol designers, and product teams working with AI agents, APIs, or crypto rails, who want a clear view of how these areas connect.
It explains how x402, ERC-8004, and agent discovery layers turn APIs and agents into small usage-based businesses, and what that means for real systems over the next 1–3 years.

Vocabulary for this article

In this article, I use the term micro business for a very small overall business, and nano business for a single x402-priced endpoint or agent that earns on its own from per-call payments in stablecoins.

AP2 (Agent Payment Protocol): AP2 defines how agents pay each other. It standardizes how a service quotes a price, how payment is confirmed, and how both sides record what was bought, so payments fit directly into automated agent workflows. In practice, it is a protocol that lets one machine pay another machine for work, without a human in the loop.

A2A (Agent-to-Agent communication): A2A covers how agents talk, pass context, and coordinate work. It lets agents call each other, exchange structured messages, and chain tasks instead of acting as isolated scripts.

x402: x402 is an HTTP-based payment protocol for APIs. A server responds with status 402 Payment Required, the price, and a payment route, and the client pays by using stablecoins on-chain and then retries the request to get the result.

ERC-8004 (8004): ERC-8004 standard is an on-chain registry for agents. It gives each agent an identity and a place to store reputation data, so other agents and tools can decide whom to trust and which services to call.

Foreword

x402 and ERC-8004.
Standards for on-chain payments and identity for agent operation.

Most people have never heard of these standards, and even many crypto natives only see the surface of the agent narrative.
Behind them is a simple idea.
Agents can pay and get paid through tiny on-chain transactions, and in return, they can handle tasks that would be impossible to scale by hand.

We are still early in this emerging space, but I suspect everything will evolve much faster than most people expect.
One of the drivers for this is the push to build an agent-driven internet.
Another is the need to improve privacy on the processes we already use.

For example, Vitalik Buterin has already highlighted the need for privacy-preserving x402 payments. Every time an agent pays an endpoint, it leaks what service it used, when it used it, how often it used it, and sometimes even the rough size or type of job.
Competitors can see which APIs your app or agent relies on.
Data brokers can reconstruct user behavior and business logic from payment flows.
Over time, you get a giant, searchable map of who pays whom for what.

For human users, this is a privacy nightmare.
For agents as businesses, it is also an alpha leak.

Agents can do a lot of useful work for you, but privacy-preserving payments are one of the first problems to solve.
A development team I use as an example is currently building private x402 transactions on the Starknet stack, also known as z402.
When this layer is in place, it opens the door to hundreds of use cases. If you read further, you’ll learn more about some of these.

I will first explain where the web is today and why agents break the old ad-and-subscription model.
I will also introduce the basic terms you need to understand to get the most from this reading.
Then, I will show how standards such as x402 and ERC-8004 give agents a way to pay and prove value.
Finally, I will look at an example development team, DayDreams.Systems, that builds on top of these standards.

The goal is not to promote any specific team, but to show why this kind of plumbing is likely to matter if the agent economy becomes real.

Part 1 - Bringing companies on-chain with x402

Today, most x402 experiments start with crypto-native builders, but the long-term impact sits with traditional companies.
x402 runs on familiar Web2 infrastructure yet settles in stablecoins on-chain, so it integrates directly with existing usage-based billing.
Enterprises already use metered APIs and issue monthly invoices.
If an endpoint can expose “this call costs 0.2 cents in USDC” over HTTP 402, finance teams can reconcile that flow the same way they handle cloud or SaaS costs.

In other words, x402 does not ask companies to change how they think about software.
It only changes the rail that moves the money.
x402 sits where Web2 companies already live: HTTP, APIs, usage billing, and stablecoins instead of card networks.
That is also why I think it is important to be clear about what x402 is not.
It is not a token sale, and it is not a speculative asset.
It is a payment protocol.
Or, as people in payments, fintech, and crypto might say, a “payment rail.”

From my point of view, and from how builders in the blockchain space talk about it, x402 is a way to quickly enable micropayments across the internet and to let traditional companies become meaningfully on-chain without having to rebrand themselves as “crypto projects.”

Once you look at the user base numbers, the potential becomes clearer.
Today, there are about 40–50 million weekly active crypto users.
By comparison, roughly 900 million people already use ChatGPT-class applications.
That is almost 20 times as many people as the entire active crypto base.
Now imagine a large interface, such as ChatGPT or a similar agent front-end, that increasingly relies on external APIs, which themselves expose prices in x402.
Instead of keeping all those upstream costs hidden as an internal line item, it could settle a part of them directly on-chain, in stablecoins, on the same standard that those providers use.
This would make it easier to share revenue with third-party services, give users or their own agents the option to pay directly for some calls, and reduce the need for custom billing integrations with every partner.

If a setup like that becomes common, you are no longer talking about thousands of transactions per day.
You are looking at billions of very small events, each representing a paid unit of compute or data.
Traditional card networks and ad-funded models are not designed for this pattern.
They expect fewer, larger payments, not billions of sub-cent charges between machines.

Card systems also assume a human at the end of the flow and a checkout-style experience.
Agent payments look different. They happen in the background, often in bursts of many tiny calls, with prices that can depend on context, volume, or conditional rules. This is the type of workload where stablecoin rails and an HTTP-native protocol, such as x402, fit better than card fees and monthly billing cycles. Card systems also do not expose a simple, machine-readable way for one service to quote a price per call and another service to pay it automatically.

x402 is designed to fill exactly this gap.
It keeps the familiar HTTP request-and-response model, attaches a stablecoin price to each call, and settles that price on-chain as a tiny, usage-based payment.
At the scale where agent traffic lives, that kind of payment protocol can handle the granularity that existing payment infrastructure struggles with.
Agent payments tend to be high-volume, high-velocity, and low-value, often conditional and composable across many services, which fit stablecoin rails much better than card networks.

Once you take that volume and this direction of travel seriously, another question appears.
What if the large incumbents simply ignore x402 and keep everything inside their own billing systems? Well, they cannot remove x402 endpoints from the internet, but they can decide which payment flows they support in their own products.
The counterpoint is that x402 does not need permission from any single platform.
It rides on plain HTTP, so any agent, SDK, or “agent browser” can call an x402 endpoint and pay it.
If x402 endpoints become a good source of useful data or computing, it is cheaper for big companies to talk to them than to rebuild every useful resource in-house.
If they refuse to interoperate, that creates an opening for other tools, wallets, and agent clients that do.
In that sense, x402 is less about convincing one company to flip a switch and more about giving the rest of the ecosystem a standard that works even if incumbents are slow to join.

From this perspective, when people talk about “bringing companies on-chain,” the focus is not on tokenizing balance sheets or putting shares on a ledger.
It shifts to something more mundane and more powerful: moving the billing layer for everyday AI APIs and SaaS endpoints onto x402, so the internet can support billions of sub-cent payments per day without collapsing under its own business model.

Part 2- Introduction: Beyond Ads and Subscriptions: Agent Commerce on x402 and ERC-8004

Imagine you run a website with real value.
A research blog. A niche data feed. A book you provide for free.
A long form guide that took you nights and weekends to write.

Today, the way you “monetize” that work is simple.
You paste in ad scripts and hope that human eyeballs show up.
Meanwhile, bots and AI agents crawl your pages on the code level, scrape the raw content, feed it into their pipelines, and never pay you a cent.
Your ads load.
Your layout renders.
Your analytics register another “visit.”
But the real consumer is a headless agent that does not care about banners or pop-ups.

You carry the hosting cost.
You carry the bandwidth cost.
The agent walks away with the value.

Over time, this turns into a more general “pay-per-crawl” pattern.
Most traffic on the web comes from automated clients, not humans in a browser, so it makes sense that machines, not people, become the primary payers at the protocol level.

Now imagine a different setup.
Your content sits behind a tiny paywall that understands x402.
When a human or an AI agent wants your data, they do not see ads. They see a 402 “Payment Required” response with clear, machine-readable terms. Their client or agent calls you with a wallet that already holds a small stablecoin balance, and that the user has authorized to spend on their behalf.
The agent pays a fraction of a cent from that balance, and then gets the data.
A human user still tops up that balance from time to time, by card, bank transfer, or a normal on-ramp, but each individual call is handled automatically by the software, not by clicking a “Pay” button on every request. In the background, a programmable wallet enforces the rules you set. You can cap how much an agent spends per day, restrict which endpoints it can pay, and keep an audit trail of every call.

The agent does the micro-payments, but the human still defines the policy and funding source.
No accounts.

No OAuth.
No email capture funnels.

On your side, you do not babysit API keys.
You do not fight an endless war against scrapers.
You run an agent of your own that guards the data, negotiates access, and sells it on your behalf.
With ERC-8004, that agent has a verifiable identity and a reputation record on the chain, so other agents know who they are talking to and whether they can trust the stream it serves.

The flow flips.
Your “scraping victim” website becomes a nano business in an agent economy.

But wait a second.
There is also a “light side of the moon” to it.

Agents can:

Act as matchmakers between your content and the right readers, just as you do when you share your work on social media.
Provide personalized entry points and automated follow-ups for people who enjoy your work, and help distribute it to the right corners of the internet.
Monitor the health of your content by tracking which parts of your article trigger refunds, fast bounces, or never lead to any follow-up actions.
Act as quality filters and bring you only the material that matches your preferences.
Handle access control by managing who gets which tier of content, which parts are free, and which are pay-per-read.
Check sponsorship offers against your ethical guidelines and pricing, and place small, relevant sponsorships in your content feed on your terms.
Provide cross-format delivery by generating a short audio version, a bullet-point version, or a “for founders” version of the same content, and offer them as micro-upgrades.
Let the reader’s agent choose the format that the person prefers.
And, in the long run, pay for their own existence by earning more from their work than they spend on data, compute, and tools.
Control permissionless money and use payments to influence humans, other agents, and software systems.

Now, think about how we use search today.
Every Google query consumes CPU, memory, network bandwidth, and energy.
It costs real resources every time you type a word and hit Enter.
You never see the bill, because Google decided long ago that you are not the customer.
The advertiser is.
To make that work, Google sells access to its index.
Sponsored links buy their way to the top.
The first result is often not the best answer for you.
It is the answer that won an auction.

Now project that forward into an agentic internet.
Instead of one giant search engine that hides costs behind ads, you have a search agent that works only for you.
You pay it directly in crypto by using x402 for the time and data it spends on your query.
It fans out your request to many data providers, each with their own x402 paywall, and picks the best mix of price and quality.
The results it returns are not sponsored.
They reflect what your agent actually believes is the best answer under your constraints.

Again, ERC-8004 matters here.
Your search agent needs to decide which other agents and services to trust.
Identity, reputation, and validation registries give it a way to see who has delivered good results in the past and who has a track record of spam or low-quality work.
Instead of SEO games and ad auctions, you get a market where agents rank each other on the basis of verifiable work and real payments.

Once you have x402 for payments and ERC-8004 for identity and reputation, many of the broken parts of today’s internet start to look fixable.

API billing becomes less painful.
Right now, if you run an API, you set up subscriptions, rate limits, or custom contracts.
You over-provision for big customers and under-serve the long tail.
With x402, any API endpoint can publish a price per call, down to fractions of a cent, and accept stablecoins over HTTP 402 without accounts or manual invoicing.
Agents and apps pay exactly for what they use, and nothing else.

Spam becomes more expensive.
In Web2, bots can hammer your endpoints for free until you give up and put everything behind API keys, CAPTCHAs, or hard gating.
In an x402 world, every request costs something, even if it is tiny.
Attackers cannot spray infinite traffic without burning real money.
Legitimate users still pay far less than a typical subscription, and they do not have to fight your protection layers.

Agent discovery becomes less biased.
If discovery engines sit on top of x402 and ERC-8004, they can rank services not just by who shouts the loudest, but by who other high-reputation agents actually pay and use.
Payment flows become a kind of “vote” that reflects real economic trust, not vanity metrics or bought traffic.

Multi-agent workflows become less fragile.
Protocols like A2A or MCP define how agents communicate with each other.
ERC-8004 anchors who they are and why you might trust them.
x402 gives them a way to settle up after each call.
Speak.
Prove.
Pay.
The protocol stack is starting to align with what a real economy needs.

And all of this happens without throwing the web away.
HTTP stays.
APIs stay.
We just finally light up the status code that has been reserved for three decades and connect it to a global ledger.
We give agents a passport and a credit line, and we tell them:
“If you want to use the internet’s value, you pay the people who create it.”

That is the gap that x402 and ERC-8004 try to close.
They do not promise magic.
They take two simple ideas that the speakers in the DayDreams X Space kept coming back to:
Payment should be as native as a GET request.
Trust should be as inspectable as a transaction.

In the agent economy that I want to see, agents do not only take. They also bring.
My own agent can guard my content behind x402, speak ERC-8004 so other agents know who they deal with, and recommend my work to readers who are likely to value it.

Other agents can work on behalf of readers, scan paid endpoints, and decide that my article is worth a fraction of a cent. The same technology that crawls and copies can also route new readers to my work, pay for it, and feed back a signal about what actually helps people.

For twenty years, the web asked one main question: “Does this look good to a human on a screen?”

We obsessed over templates, CSS, copy, and SEO so that a landing page both looked good and ranked well.
In the agent era, the question changes, and the question changes the angle to this:

“Can an agent reach your value through a clean API, pay for it over x402, and know from ERC-8004 who it is dealing with?”

Sites and services that answer yes to that question still care about UX and SEO, but they do not depend on fancy landing pages to survive.
They have something stronger.
They have a business model and an interface that speaks the native language of the new internet: agents, APIs, and fine-grained payments.

Part 3 - Tech that will change the internet

Agent commerce, x402, and ERC-8004: from ad-funded web to paid APIs

Where we are now: ads, subscriptions, and hard-coded APIs

Let us take a closer look at the current state of the online experience.

Most online products still rely on two basic business models.
Ads pay for “free” content, and subscriptions bundle access into monthly or yearly plans.

This works poorly for today’s AI-driven internet.

AI agents can ignore ads.
Subscriptions are too coarse when you only need a few calls or a small amount of data.
APIs deliver real value, but there is no standard way to charge per-call at internet scale.

On the technical side, most AI systems still look like this:

An app or agent uses a set of APIs with manually managed API keys.
Workflows are hard-coded by humans.
Payments are processed through separate Web2 billing systems, such as Stripe dashboards, custom invoices, or prepaid credits.
Access often lives in walled gardens rather than on open, shared rails.

This setup has some clear limitations.

Agents cannot easily discover new services on their own.
There is no standard payment flow for machine-to-machine calls.
Each integration is custom, slow to build, and hard to replace.
Abuse and scraping are common because there is no built-in economic cost per call.

At the same time, demand for AI and APIs is exploding.

There are roughly tens of millions of weekly crypto users.
Hundreds of millions of users interact with ChatGPT-class apps.

We are in “the API era.”
The web serves more APIs than static pages, yet the payment and access layer still behaves like the early web.

What needs to change for agent commerce to work?

AI is moving from answering questions to taking actions.

Agents do not just chat.
They:

Call APIs to fetch data and context.
Reserve and manage cloud resources.
Trigger workflows across services.
Compose multiple tools to solve a task end-to-end.

To support this, we need the following three:

1. A standard payment primitive for APIs and agents, such as x402.

2. A discovery and identity layer so agents can find and rank services, which in the Ethereum ecosystem maps to ERC-8004 agent registries.

3. A business layer so individuals and teams can turn endpoints into small, efficient businesses.

Today, the missing piece is often payment and discovery.

Agents do not have a native way to pay per call.
Providers cannot expose tiny, nano-sized services in a simple, open way.
Developers do not know where to list endpoints or how to reach users beyond their own app.

This blocks a more granular and open market where:

A weather API can charge a fraction of a cent per call.
A niche dataset can charge per query instead of per month.
A small tool can be a nano business that you set up in 30 minutes and leave running.

Without a standard, every project reinvents billing and access.
That slows the market and keeps power in a few large platforms.

Current bottlenecks: walled gardens, spam, mispriced data

The web stack of today suffers from being built on the code infrastructure of yesteryear.

1. Walled gardens and private workflows

Workflows are private and hard-coded.
API keys are stored and managed by humans.
Agents cannot explore new services freely because access is gated by opaque contracts, custom dashboards, or per-vendor keys.

Result:
Innovation is locked inside platforms instead of happening on neutral rails.

2. No standard way for agents to pay

Agents need to pay for:

Data.
Compute.
Models and inference.
Specialized tools and services.

However, there is currently no common payment layer for agents.

Providers handle billing off-chain.
Calls might be free until rate limits hit, then “contact sales.”
There is no portable, machine-readable way to say “this endpoint costs X per call.”

This prevents a real microservice economy between agents.

3. Broken pricing models for content and data

Valuable sites and datasets are scraped for free.
Ads do not work well when agents are the consumers.
Subscriptions do not fit small, rare, or niche usage.

There is no good way to price:

A single clean Polymarket signal.
A specialized dataset query.
One high-value article read by an agent.

The result is either under-monetization or overly heavy paywalls.

4. Spam and free riding

APIs are often hit by bots and abusive traffic.

Without a built-in cost per call, spam is cheap.
Providers add complex rate limits, CAPTCHAs, and manual approvals.
This hurts honest developers and makes APIs harder to use.

5. Distribution and discoverability

Builders do not know:

Where to list x402-style endpoints.
How agents will find them.
How to rank providers when many offer similar services.

The coming evolution phase is comparable to the early web.

Search engines became the gateway for websites.
Here, agent discovery engines and agent stores will serve as the gateway.

However, without standards, these discovery layers risk becoming new silos.

How x402 solves per-call payments and spam

Let me present now a clear “cause → solution → fix” structure around x402 that will need to be developed.

x402 in short:

Uses HTTP status 402 Payment Required.
- Note: The HTTP 402 Payment Required code has existed in the spec since the early web, but it never had a practical implementation until standards such as x402 defined how to attach a real payment flow to it.
Let's have a server reply with “you must pay X in asset Y” in a machine-readable way.
Let a client (human app or agent) pay by using stablecoins on a supported chain.
After payment, the server returns the result.

This has several important properties.

It fits directly into existing HTTP flows.
Web2 developers already understand status codes and retries.
Developers only need a wallet adapter and payment handler, not a full Web3 stack.
For them, it feels closer to adding Stripe than building a dApp.

How x402 fixes the issues

Standard per-call payments

Each endpoint can publish a clear price per call.
Agents and apps can:
- Discover the endpoint.
- Receive a 402 with pricing terms.
- Pay and get the result.
  
  This turns any useful resource into a nano business.
Better pricing models

x402 supports:
- Per-call pricing.
- Very small payments, even fractions of a cent.
- Flexible models for data, content, and compute.
  
  This aligns well with AI and API cost structures, which are naturally usage-based.
Spam resistance

Every call has a cost.
Bots and abusive patterns become expensive to run.
x402 serves as both a security primitive and a payment rail.
Open and chain-agnostic design

x402 stays open and does not lock anyone to a single vendor.
Different chains and infrastructure providers can implement it.
Any wallet that speaks x402 can pay any endpoint that speaks x402.
On-ramp for traditional companies

Enterprises already use usage-based billing, and stablecoins are easier to explain than volatile tokens.
x402 lets them reuse:
- HTTP.
- Usage invoices.
- Stable settlement rails.
  
  This lowers their barrier to on-chain adoption.

Why ERC-8004 is as important as x402

Payment is not enough.
Agents also need to decide who to call.

We now need to link x402 to an identity and reputation layer, concretely to the ERC-8004 standard for agent registries. ERC-8004 lets agents anchor identity, reputation, and verification data on-chain. Together, x402 and ERC-8004 give agents a way to both pay for services and trust the services they choose.

Key points:

Agents register with an on-chain identity.
Reputation and performance metrics link to that identity.
Orchestration engines can pick among providers based on trust signals, not only price.

This enables:

Trusted agent-to-agent commerce, where an agent can justify why it chose a given provider.
Composable orchestration, where workflows route traffic to high-reputation endpoints.
Open discovery, where search and agent stores can index resources on neutral infrastructure instead of closed catalogs.

Together, x402 and ERC-8004 create:

A payment primitive for APIs and agents.
A discovery-and-trust primitive that sits above it.

This is the foundation for an open agent-commerce stack.
On top of the stack, we can now add concrete tools, many of which are already in development.
Let's get down to business.

DayDreams.Systems: turning the stack into real tools and businesses

As an example of tooling that builds upon the x402 and ERC-8004 stack, let us take a look at the DayDreams.Systems project.

The work of the DayDreams team mainly focuses on the following 3 areas:

1. XGATE: discovery and composition

XGATE acts as a discovery layer for x402 resources and for agents registered through ERC-8004 or compatible identity registries.

Builders can list endpoints.
Agents can find and compose them.
Orchestration logic can stitch together multiple micro endpoints that each do one thing well.

Example flow:

An agent receives a task.
The agent queries XGATE to find relevant endpoints (data, signals, tools).
The agent pays per call by x402.
The agent combines the outputs into a final result.

This turns “glue code” into a generic layer rather than having custom logic in each app.

2. Lucid: operations and micro-business management

Lucid is DayDreams’s platform for individuals who run many micro businesses.

Users can bring agents they built elsewhere or by using the Lucid agent kit.
They can manage all their x402 endpoints from a single location and sandbox.
They can track P\&L, fund agents, and inspect analytics.

In practice, Lucid provides:

An operations dashboard for agent commerce.
A way to manage multiple nano businesses from a single interface.
Tools to understand which endpoints earn, which lose money, and where to optimize.

3. Agent framework and software development kits (SDKs)

DayDreams also focuses on the developer experience.

The stack includes:

An agent framework with context, memory, and multi-agent collaboration.
Libraries to integrate x402 payments into services.
Infrastructure to help agents call APIs, pay for them, and combine the results.

Together, this turns the abstract idea of “agent-to-agent commerce on x402” into code that real developers can ship.

Competition and wider ecosystem

It is important to note that DayDreams is not the only project in this area.

The wider ecosystem includes:

Other agent frameworks that orchestrate tools and APIs, including those that build on ERC-8004 or on alternative identity and reputation layers.
Traditional payment processors and cloud providers that explore usage-based billing and machine-to-machine payments on Web2 rails.
Alternative crypto payment and streaming protocols that target per-second or per-stream billing rather than per-call HTTP flows.
Emerging discovery and agent store platforms that plan to index AI tools, APIs, and agents, sometimes with their own marketplace logic.

These projects share similar goals:

Make it easier for agents and apps to pay for services.
Improve discovery of tools and data.
Allow small providers to compete with large platforms.

The specific angle in this article is the combination of x402, open discovery layers, and nano businesses built around single paid endpoints.
I use DayDreams.Systems’ Lucid as a concrete example of this path, not as the only option.

If you want to find out which project best fits your needs, it helps to look at three basic questions:

How open are the standards and interfaces, and how hard is it to leave if you change your mind later?
How simple and clear does the developer experience feel in real workflows, from first test to production?
How well does the pricing and technical model match usage-based AI and API costs in your own stack?

From status quo to agent commerce in three steps

This, and the final section of this article, explains how the components in this article form an end-to-end flow.

It first outlines the current problems, then describes the protocol-level design, and finally shows how it works in practical deployments.

The following sequence shows how these components work together end-to-end.

Cause (status quo) → Solution (x402 + 8004 + discovery) → Fix in practice (agents and endpoints + Lucid/XGATE-style platforms)

Cause

The internet runs on APIs, but business models are stuck on ads and subscriptions.
AI agents act more like users, but they lack standard payment and discovery tools.
Workflows are private, access is gated, and spam is cheap.

Solution

Use x402 to define a standard per-call payment flow for HTTP.
Add an identity and reputation layer to let agents find and rank services.
Build discovery engines and agent stores that index x402 endpoints.
Provide frameworks and platforms that let individuals and teams run micro businesses on these rails.

Fix in practice

A developer spins up a server and exposes a useful resource as an x402 endpoint.
The endpoint becomes a nano business with clear pricing and on-chain settlement.
Agents find it through discovery layers like XGATE.
They pay per call with stablecoins through x402.
Lucid and similar platforms help operators manage, fund, and optimize these nano businesses.
Over time, a new market forms in which agents pay agents for APIs, data, and compute, while ads and coarse subscriptions become less important.

From a technical standpoint, this is a clean, incremental change.
It keeps HTTP and the mental model of APIs, but extends them with:

Machine-readable prices.
Native on-chain settlement.
Open discovery and identity.

This is a realistic path from where we are now to a web where agent commerce is normal behavior, not a niche experiment.

Finishing though

There is already a small but growing ecosystem around x402 and ERC-8004, ranging from payment routers and agent toolkits to agent-native discovery engines. Different teams explore different trade-offs around custody, privacy, and UX. This article does not rank or endorse any of them, but aims to explain why these standards exist and why they are likely to matter if agent commerce takes off.

Disclosure: I hold positions on projects working on the x402/8004 ecosystems.
This article is for educational purposes and is not investment advice.

Part 4 - DayDreams.Systems: an x402 / 8004 implementation example

The current article discusses AI agents that do real work, pay each other for APIs, and build an economy on top of standards such as x402 and ERC-8004.

DayDreams.Systems fits into this picture as a practical implementation layer for these standards. The team has been building toward an agentic economy over the last 12 months and has aligned its stack with x402 since Coinbase introduced the standard in May 2025.
In practice, this means that DayDreams' agentic endpoints are concrete tools that your agent can call to compose DeFi strategies and other paid workflows on top of x402 and ERC-8004.

These standards are powerful, but unrefined.
They solve “how to pay,” “how to speak,” and “how to identify agents,” but, by themselves, do not provide developers with a clear way to deploy, monitor, and scale agent workloads.

DayDreams Lucid sits atop this stack as an abstraction layer.
It aims to provide a place where agents can act as economic actors: they communicate A2A, pay and get paid via x402/AP2, and carry an ERC-8004 identity from the moment they come online.

Personal autonomy is now open to anyone shipping paid agentic endpoints.
Reputation and volume will pool around useful endpoints, putting early movers in front when traffic jumps.
Built with Lucid -\> Distribute through XGATE -\> Get paid on x402.

How DayDreams structures the stack

DayDreams.Systems currently present four core components.

1. DAYDREAMS Library – Open-source agent framework

Build autonomous AI agents.
Use a modular architecture.
Reach multiple chains through x402 as a common payment layer.
Develop in the open, with GitHub and documentation available.

The library serves as the foundation for agent logic: tools, policies, and behaviors.

2. DREAMS Router – x402-powered USDC router

Pay multiple providers with USDC from one interface.
Use a unified payment abstraction on top of x402.
Route across multiple chains.
Optimize for cost and path automatically.

One example is paying for Google’s VEO3 directly with USDC over x402, with more providers planned.

3. Lucid – a platform for autonomous agent operation

Run agents in an autonomous mode.
Define custom workflows.
Monitor behavior in real time.
Aim for maximum autonomy rather than manual supervision.

Lucid is also where AP2, A2A, x402, and ERC-8004 are tied together into a single operational surface.

4. XGATE – agent-native search engine

Index x402 endpoints by agents, for agents.
Expose a live view of the x402 network.
Provide direct links to deployment targets.

This component is the discovery side of the stack, with a focus on “agent engine optimization” rather than human SEO.

The intended user experience is as follows:

Build agents → manage them in Lucid → discover and consume them via XGATE → pay agents → build more agents → repeat.

Lucid as an abstraction over AP2, A2A, x402, and 8004

Lucid serves as both a runtime and a control plane for agent commerce.
Standards such as x402, ERC-8004, and A2A define a shared contract for payments, identity, and messaging, so agents built on different stacks can interoperate rather than live in isolated walled gardens.

Lucid’s goal is to make that interoperability practical by providing builders with a concrete runtime and toolkit that enable many independent agents to participate in the same agentic economy.

When a developer deploys an agent on Lucid, several capabilities are available from the start:

Payments
The agent can send and receive x402 payments and use the AP2 protocol to settle payments with other agents.
Communication
The agent can join A2A conversations and workflows rather than relying solely on one-off HTTP calls.
Identity and trust
The agent can register via ERC-8004, which provides it with an on-chain identity and a space for reputation data.
Inference and routing
The platform can route tasks across multiple models and endpoints, with built-in monitoring.

The goal is for developers to focus on what an agent does, while Lucid handles how it pays, proves itself, and communicates with other participants.
Over time, this forms what the team calls the Lucid Network: many agents trading, collaborating, and compounding value rather than operating in isolation.

What kinds of agents is Lucid designed for

The ecosystem already sketches a wide range of agent types that could live on this stack:

Macro research agents that publish paid daily reports.
Game agents that play on-chain games, grind, farm, or plan strategies on behalf of a user.
Arbitrage agents that scan exchanges and execute trades.
E-commerce scouts that monitor marketplaces for specific items and perform instant checkout.
Content agents that draft, edit, and publish newsletters or posts, keeping a consistent tone.
Data guardians that watch wallets, positions, or contracts and trigger protective actions.
Knowledge agents specialized in domains such as law or biotech, available on demand.
Creative agents that output music, art, or video snippets as small digital products.
Coordination agents that connect other agents into larger workflows and act as orchestrators.

All of these can operate in the same economy.
A clear design choice is to rank agents by actual earnings, so revenue serves as social proof of quality.
In this model, x402 is how an agent charges for its work, and ERC-8004 provides a structured surface to prove that the work is worth paying for.

Lucid Agents: BYO framework with x402 and 8004 baked in

Under the name Lucid Agents, DayDreams offers a toolkit designed to be a “bring your own framework” while remaining commerce-ready.

At the center is the Lucid Agent Kit, a commerce SDK for AI agents.
In its current release, Lucid Agents includes bi-directional payment tracking, persistent storage backends (SQLite, in-memory, and Postgres), strict policies for incoming and outgoing payments, an analytics module for financial reporting, and a scheduler for automated paid agent hires.

These capabilities help teams run agents as accountable economic actors rather than as one-off scripts.

It packages x402 payments, ERC-8004 identity, A2A messaging, and AP2 agent-to-agent settlement into a single TypeScript framework, enabling agents to charge, pay, and communicate with each other out of the box.

In a minimal setup, a small Lucid Agent Kit snippet starts an HTTP server for web access, wires x402 payment handling so the agent can get paid, and creates a wallet object for outbound payments. It also generates an A2A-compatible `` agent.json `` card so other agents can discover it, understand its interface, and call it as part of larger workflows.

In practice, it focuses on USDC flows on x402, using the Coinbase Developer Platform as the primary stablecoin rail.

Recent objectives include:

A framework-agnostic wallet SDK to interact with x402.
A refactored type system to improve maintainability and avoid circular dependencies.
A stronger build system and code quality improvements.
Foundations for bidirectional A2A communication between agents.

Developers can already spin up:

A Next.js agent wired into ERC-8004 and x402.
A TanStack Start agent.
A Hono agent.
More integrations are planned.

A public tutorial shows how to launch a full-stack TanStack agent in a few minutes, with x402 payments and ERC-8004 trust integrated, so the agent is “paid, verified, and production-ready” from the first deploy.

The aim is to keep business logic and framework choice flexible, while Lucid Agents provides a consistent layer for payments, identity, and networking.

Reliability and Router v2: from per-request to balance-based flows

One recurring theme around x402 is that pure per-request payment is not always enough for robust systems.

Lucid’s design acknowledges that:

Inference calls and other services can involve many hops.
Each payment transaction is a potential failure point.
Agent workflows can call multiple resources in sequence or in parallel.

To address this, the team is working on a Router v2 design where agents hold balances for services.
Instead of a transaction for every individual request, agents draw down from pre-funded balances.
This reduces transaction count and lowers the chance of partial failures in long chains of calls, which is important in agentic design. True agent autonomy depends on permissionless payment rails, because an agent must be able to hold balances, allocate budgets, and settle with other agents or services without a human having to click through each transaction.

They also describe the complexity of networking in a fully agentic, multi-resource environment:

Networking between resources.
Networking between facilitators and resources.
Internal networking within resources.
Client-side networking and streaming.
Client–server networking and streaming.

Each layer can introduce timeouts and sync issues, especially in a decentralized setting.
The message is that the x402 ecosystem still needs significant engineering work to be fully robust, and Lucid’s roadmap is shaped around those concrete failure modes.

From SEO to AEO: optimizing for agents instead of humans

A key conceptual shift in this ecosystem is the move from SEO to what some builders already call AEO.

SEO (Search Engine Optimization) tunes content for human-facing search engines.
AEO (Agent Engine Optimization) tunes services for autonomous agents, so that machine clients can reliably discover endpoints, parse responses, and decide what to call and what to pay for.

In an x402 / ERC-8004 world:

Agents chain paid APIs in real time, turning web endpoints into billable microservices.
SaaS billing based on static subscriptions is starting to look out of place.
Providers tune response schemas, latency, pricing, and proof hooks rather than landing page visuals.
ERC-8004 agent cards help filter spam by giving machines structured information about who they are paying.
Facilitators that see many requests and outcomes can evolve into discovery and reputation layers for AEO.

A simple rule captures the design:

x402 lets a service charge.
ERC-8004 helps prove that the service is worth paying for.

This is aligned with the idea that search will shift from an ad-driven human interface to an agent layer that ranks and pays based on performance and reputation.

Privacy, z402, and Starknet

The stack also has open questions around privacy.
For some use cases, public payments and visible consumption patterns are fine.
For others, endpoint usage and pricing need to stay private.

One of the directions in the DayDreams context is private x402 transactions on Starknet, sometimes referred to as z402:

Private micropayments for x402-style endpoints.
A path where agents can pay and prove settlement without exposing all details publicly.

This is still early work, but it shows that the stack is not limited to fully transparent flows.

Where the value in x402 is likely to appear

Several themes around value capture appear repeatedly:

Consumer products built on curated x402 resources, where users do not see chains, only balances and services.
Platforms for creators and builders that want to monetize APIs, content, or agents without ad networks.
Curation and discovery layers, such as XGATE, where agents and endpoints meet.

From the user’s perspective, the expectation is that “on-chain” will become invisible:

Apps will either accept stablecoins or expose paid endpoints in an agent-friendly way.
Or they will feel like legacy products, tied to subscription forms and card payments.

Macro backdrop and current stage

The broader environment is shaped by:

Large capital expenditure in AI infrastructure.
Strong government support is needed because AI is now tied to economic competitiveness.
Predictions that agents could power very large segments of future economic activity.

Within that context, on-chain rails are among the few ways individuals and small teams can participate without relying on large intermediaries.
Standards such as x402 and ERC-8004, and platforms built on top of them, target that space.

Right now:

Stablecoin rails like x402 have reached an inflection point, with mainstream players such as Google moving toward agent payments.
LLM user interfaces are widespread, but most agents remain closer to scripts than to fully autonomous economic actors. In practice, most production flows still look like human-initiated actions that trigger agent workflows and x402-style payments in the background, rather than fully autonomous agents that spend without explicit user intent.
The next 24 months will determine which stacks make agent commerce practical in production.

Conclusion

The move from ad-funded pages to paid APIs and agent commerce is already underway.
Standards such as AP2, x402, A2A, and ERC-8004 define how agents speak, pay, and prove their work.
DayDreams.Systems positions itself as one of several platforms that try to close that gap.
It wraps these protocols into a developer-facing toolkit and runtime so agents can not only exist, but also earn, pay, and be discovered as part of a larger network.

At the same time, other projects work on adjacent pieces of the stack.
Some focus on alternative payment protocols for API metering or streaming.
Others provide general-purpose agent frameworks, orchestration engines, or marketplaces where tools and models expose paid endpoints.
There are also discovery and reputation layers that index agents and services on top of ERC-8004-style registries or proprietary identity systems.

Once payments are programmable at the protocol level, the internet itself starts to behave differently.
Every API call can have a clear, machine-readable price.
Every workflow can settle in real time.
Agents, services, and even small human teams can participate in the same economic fabric without going through a single platform or marketplace.

If the agent economy grows anywhere near current expectations, more platforms will emerge, compete, and converge on standards that let agents treat the internet as a set of fair, billable resources rather than a free-for-all for scrapers.

In that world, the important question is simple:

Can an agent reach your value through a clean API, pay for it over x402, and know from ERC-8004 who it is dealing with?

If the answer is yes, you are already aligned with the next major crypto narrative.

You are also prepared for the next version of the internet.

The version in which, by the end of 2026, you will be paying these AI agents to do the job for you, and they will also be able to make you a fortune if used properly.

The post Not a Lucid Web3 Dream Anymore: x402, ERC-8004, A2A, and The Next Wave of AI Commerce appeared first on foojay.

JC-AI Newsletter: Easy Access to Expanding Challenges

Miro Wengner — Tue, 04 Nov 2025 18:13:33 +0000

A few months ago, I launched the AI Newsletter to provide a minimally biased perspective on the growing challenges surrounding artificial intelligence.

My primary motivation was and remains to be serving the community not only by showing how to use and access specific services for utilizing Large Language Models, but also by support a deeper understanding of the broader artificial intelligence landscape.

Many of the core challenges that have emerged around LLMs have not been and still not properly addressed, often omitted due to their uncomfortable implications.

Nevertheless, the accumulation of technical debt and other challenges, like security, filters stability, benchamarking, scaling etc. is likely growing exponentially.

Many of these challenges, in my humble opinion, had already been addressed even before Transformers and ChatGPT became publicly available in 2022.

Each published newsletter focuses on one core topic, dedicated to addressing a specific hot challenge in AI.

Img.1.: JC-AI Newsletter achive easy access from FooJay top menu News -> JC-AI Newsletter

The initial editions of the JC-AI Newsletter lacked easy access to previous issues. As the number of unique readers continues to rise with each new edition, and as I frequently reference past newsletters myself, I decided to address this "nice to have" feature (Img.1.). I'm pleased to announce that we have created a dedicated section for JC-AI Newsletter archives (Img.2.).

Img.2.: JC-AI Newsletter archive

I'm already excited about publishing the next editions of the AI Newsletter, as multi-agentic AI swarm are gaining traction and bringing multiple additional challenges with them.

Stay tuned, enjoy quick access to the archive, and happy reading!

The post JC-AI Newsletter: Easy Access to Expanding Challenges appeared first on foojay.

JC-AI Newsletter #7

Miro Wengner — Tue, 14 Oct 2025 05:35:01 +0000

Fourteen days have passed, and it is time to present a fresh collection of readings that could influence developments in the field of artificial intelligence.

Beyond focused tutorials that can enhance your understanding of AI applications, this newsletter concentrates on Hallucination, Java Code Generation, Testing, Agentic System Architecture and LLM benchmarking methodologies designed to ensure models accuracy and competency in handling complex contextual information.

The world influenced by LLM is changing very quickly, let's start...

article: The Missing Layer in AI Infrastructure: Aggregating Agentic Traffic
authors: Eyal Solomon
date: 2025-08-22
desc.: Agentic AI systems introduce new challenges across the entire system architecture. Beyond what research articles address, several critical issues remain unresolved and may pose serious risks, particularly in system architecture where the adoption of LLMs has triggered a major paradigm shift in system design. A key concern involves outbound API calls made by autonomously acting AI agents (e.g., chaining tools, calling external services). Current infrastructure, including API gateways and service meshes, is primarily designed around inbound traffic or service-to-service communication, rather than managing agent-initiated outbound calls. This creates a significant blind spot in our architectural oversight.
category: architecture

article: Learning to Reason for Hallucination Span Detection
authors: Hsuan Su, Ting-Yao Hu, Hema Swetha Koppula, Kundan Krishna, Hadi Pouransari, Cheng-Yu Hsieh and others
date: 2025-10-02
desc.: This paper addresses the challenges of advancing from simple binary classification (Chain-of-Thought, COT)) to fine-grained span-level hallucination detection. In-domain reasoning is essential for robust hallucination detection. The normalization step in Group Relative Policy Optimization proves crucial, as simple reward rescaling policies cannot effectively mitigate reward hacking in the dataset employed. The paper proposes a reinforcement learning framework with span-level rewards to align large language model (LLM) reasoning with hallucination detection tasks on the RAGTruth benchmark. Paper research has been done during an internship at Apple.
category: research

article: Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage
authors: Siddhant Arora, Haidar Khan, Kai Sun, Xin Luna Dong, Sajal Choudhary, Seungwhan Moon, Xinyuan Zhang and others
date: 2025-10-02
desc.: End-to-end speech recognition and fluent answering without noticeable pauses present significant challenges for utilizing LLMs in dialogue-based agentic systems. These systems are prone to hallucination effects caused by various factors. While improving input/output accuracy through Retrieval Augmented Generation (RAG) approaches can mitigate hallucination effects and significantly increase accuracy, this comes with various penalties such as increased resource requirements and latencies. The paper proposes a Model-Triggered Stream RAG approach as an alternative to fixed-interval RAG streaming or without RAG. Although the paper does not provide a complete solution to these challenges, it proposes a benchmarking strategy for future research and highlights key achievements. This research was conducted in cooperation with Meta.
category: research

article: Self-Anchor: Large Language Model Reasoning via Step-by-step Attention Alignment
authors: Hongxiang Zhang, Yuan Tian, Tianyi Zhang
date: 2025-10-03
desc.: While prompting strategies show effectiveness in certain tasks, they lack robustness across different benchmarks and model architectures, performing better on larger LLMs and simpler reasoning problems. This paper proposes a Self-Anchor mechanism for structured reasoning with automatic anchoring. Self-Anchor delivers consistent improvements across tasks, model sizes, and architectures, demonstrating strong robustness and effectiveness. The approach leverages inherent structure in reasoning chains to improve attention alignment and enhance reasoning capabilities. However, Self-Anchor primarily addresses attention misalignment without fully resolving deeper issues related to logical validity, semantic understanding, or computational precision.
category: research

article: Abstain and Validate: A Dual-LLM Policy for Reducing Noise in Agentic Program Repair
authors: José Cambronero, Michele Tufano, Sherry Shi, Renyao Wei, Grant Uy, Runxiang Cheng and others
date: 2025-10-03
desc.: Agentic Automated Program Repair (APR) is increasingly addressing complex, repository-level bugs in industry settings. However, agent-generated patches still require human review before deployment to ensure they properly resolve the underlying issues. This paper introduces two complementary LLM-based policies for patch assessment. The paper addresses and discusses limitations in automated patch procedures, human supervision requirements, and company-specific bug-fixing approaches. This paper results from cooperation between Google and Meta.
category: research

article: Cache-to-Cache: Direct Semantic Communication Between Large Language Models
authors: Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang
date: 2025-10-03
desc.: Rather than relying on text-to-text communication between LLM-based systems, which incurs latency penalties, this paper proposes the Cache-To-Cache paradigm for direct inter-system communication. Experimental results demonstrate improved efficiency and performance without requiring additional cache capacity.
category: research

article: FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
authors: Victor May, Diganta Misra, Yanqi Luo, Anjali Sridhar, Justine Gehring, Silvio Soares Ribeiro Junior
date: 2025-10-06
desc.: This paper introduces the FreshBrew approach for Java code migration tasks utilizing agentic LLM systems. The migration experiments from JDK8 to JDK17 and JDK21 demonstrate the limitations of current LLM implementations, even when integrated with modern deterministic migration tools such as OpenRewrite. Although the overall migration success rate was approximately 50%, the paper provides a comprehensive discussion of the associated limitations and challenges. The article has been done in cooperation with Max-Planck Institute, Google and Saleforce companies.
category: research

article: Investigating The Smells of LLM Generated Code
authors: Debalina Ghosh Paul, Hong Zhu, Ian Bayley
date: 2025-10-03
desc.: The paper proposes a scenario-based method for evaluating the quality of LLM-generated code, as such models are increasingly utilized for program code generation. The study experiments with Java programs using Gemini Pro, ChatGPT, Codex, and Falcon LLMs to obtain results. The paper highlights that for moderately advanced topics, particularly those involving object-oriented programming concepts, the generated code quality is noticeably poorer to human-written code.
category: research

article: Which Programming Language and Model Work Best With LLM-as-a-Judge For Code Retrieval?
authors: Lucas Roberts, Denisa Roberts
date: 2025-09-30
desc.: This paper examines the comparative abilities of Large Language Models (LLMs) and human annotators in identifying and annotating specific elements within source code. The study investigates several widely-used programming languages, including C, Java, JavaScript, Go, and Python. The experimental results reveal various limitations and challenges associated with automated code annotation, while proposing possibilities for future research and emphasizing the critical role of human expertise in the annotation process.
category:

article: AI Coding Tools Blog Post - Model Context Protocol Mastery - Claude, Cursor
authors: Mani Sarkar
date: 2025-10-07
desc.: The post describes and provides guidance on configuring MCP for AI-assisted development using Claude and Cursor. Please be aware of the 'No Warranty' statement and use this as an example only.
category: tutorial

article: DiffTester: Accelerating Unit Test Generation for Diffusion LLMs via Repetitive Pattern
authors: Lekang Yang, Yuetong Liu, Yitong Zhang, Jia Li
date: 2025-09-29
desc.: Writing high-quality unit tests is often a time-consuming effort that requires extensive knowledge of the business domain. This paper proposes DiffTester, an acceleration framework designed to overcome the limitations imposed by single-token generation constraints. The DiffTester framework identifies common patterns through syntax tree analysis. Experimental results demonstrate that the DiffTester framework can used to generate a larger number of tokens, thereby achieving better accuracy.
category: research

article: Deloitte caught out using AI in $440,000 report | 7.30
authors: ABC News In-depth
date: 2025-10-09
desc.: Hallucination remains a significant challenge in current large language models (LLMs). These inaccuracies can cause damage at various levels and require careful eye to identify.
category: youtube

article: SusBench: An Online Benchmark for Evaluating Dark Pattern Susceptibility of Computer-Use Agents
authors: Longjie Guo, Chenjie Yuan, Mingyuan Zhong, Robert Wolfe, Ruican Zhong, Yue Xu, Bingbing Wen and others
date: 2025-10-13
desc.: In the age of AI, deception through manipulation or hallucination-based choices poses a serious threat to human or system reliability. This paper proposes SysBench, a benchmark for evaluating the susceptibility of computer-user agents (CUAs) and humans to dark patterns that may mislead both users and agents into making harmful decisions. The paper demonstrates that neither humans nor agentic AI systems based on large language models (LLMs) exhibit adequate resistance against dark patterns.
category: research

Previous:
Newsletter vol.1
Newsletter vol.2
Newsletter vol.3
Newsletter vol.4
Newsletter vol.5
Newsletter vol.6

The post JC-AI Newsletter #7 appeared first on foojay.

JC-AI Newsletter #5

Miro Wengner — Thu, 18 Sep 2025 07:19:33 +0000

Table of Contents

Fourteen days have passed, and it is time to present a fresh collection of readings that could influence developments in the field of artificial intelligence.

Fourteen days have passed, and it is time to present a fresh collection of readings that could influence developments in the field of artificial intelligence.

Beyond opinion pieces and Java focused tutorials that can enhance your understanding of AI applications, this newsletter concentrates on LLM benchmarking methodologies designed to ensure models accuracy and competency in handling complex contextual information.

The world influenced by LLM is changing very quickly, let's start...

article : Generating and editing images with Nano Banana in Java
authors: Guillaume Laforge
date: 2025-09-09
desc.: Learn how to create and edit images using Google’s latest “Nano Banana” model (also known as Gemini 2.5 Flash Image) using the Java language.
category: tutorial

article: Generating videos in Java with Veo 3
authors: Guillaume Laforge
date: 2025-09-10
desc.: The Veo 3 allows users to create 8 second videos, either from a prompt, or from an image that serves as a starting point. And of course, it’s possible to use that model from Java.
category: tutorial

article: Stochastic AI Agility: Breaking Cycles of Debt
authors: Miro Wengner
date: 2025-09-10
desc.: This article attempts to tackle challenges related to the used project management methodologies (agile, scrum, kanban, waterfall etc.).
category: opinion

article: Conversation: LLMs and Building Abstractions
authors: Unmesh Joshi, Martin Fowler
date: 2025-08-26
desc.: This article discusses the importance of creating a good project vocabulary as a means of identifying fitting abstractions. This approach leverages LLMs as valuable brainstorming instruments to identify overlooked details, rather than depending on LLM-generated implementation code and assuming its accuracy.
category: opinion

article: Some thoughts on LLMs and Software Development
authors: Martin Fowler
date: 2025-08-28
desc.: This article examines the role of LLMs in development processes, highlighting both their potential contributions and the importance of maintaining realistic expectations about their capabilities.
category: opinion

article: The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
authors: Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, Jonas Geiping
date: 2025-09-11
desc.: While considerable attention has been devoted to LLM planning capabilities, execution remains an understudied challenge, despite its importance as LLMs are increasingly deployed for extended reasoning and agentic tasks. While one might attribute failures in extended tasks to the accumulation of minor errors, this research provides an in-depth examination of the phenomenon.
category: research

article: CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large
authors: Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, Dong Yu
date: 2025-09-11
desc.: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. This paper introduces a Curiosity-Driven Exploration method that incorporates a novel curiosity model to enhance reinforcement learning in LLMs. This paper tackles challenges in rewarding approaches used for curiosity and critical actors while discussing their stability during the training sequence.
category: research

article: LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
authors: Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen and others
date: 2025-09-11
desc.: The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. This paper proposes LoCoBench, a comprehensive benchmark designed to evaluate long-context LLMs in complex software development scenarios.
category: research

article: Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization
authors: Hangyi Jia, Yuxi Qian, Hanwen Tong, Xinhui Wu, Lin Chen, Feng Wei
date: 2025-09-11
desc.: Recent advances in large language models (LLMs) have enabled the emergence of general-purpose agents capable of automating end-to-end machine learning (ML) workflows, including data analysis, training, and competition solving. However, existing benchmarks remain limited in many ways. This paper presents TAM Bench, a structured, diverse, and realistic benchmark for evaluating LLM-based agents. TAM Bench proposes the utilization of LLMs, browsers, and the Model Context Protocol (MCP) to create fully automated constructs. The paper addresses both the achievements and difficulties associated with such an approach.
category: research

article: PATENTWRITER: A Benchmarking Study for Patent Drafting with LLMs
authors: Homaira Huda Shomee, Suman Kalyan Maity, Sourav Medya
date: 2025-07-30
desc.: The paper presents PATENTWRITER, the first unified benchmarking framework for evaluating LLMs in patent abstract generation. PATENTWRITER utilizes standard natural language processing benchmarks (BLEU, ROUGE, BERTScore). The experiments highlight LLM capabilities, often surpassing domain-specific baselines. The article raises ethical considerations while highlighting these achievements.
category: research

article: BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning
authors: Sahana Srinivasan, Xuguang Ai, Thaddaeus Wai Soon Lo, Aidan Gilson and others
date: 2025-07-21
desc.: The paper introduces the BELO benchmark, which employs keyword matching, a fine-tuned PubMedBERT model, and expert review for evaluation. The study assessed six large language models using multiple text-generation metrics (ROUGE-L, BERTScore, BARTScore, METEOR, and AlignScore) alongside human evaluation to determine accuracy. The results revealed suboptimal performance and highlighted the need for improvements in clinical reasoning capabilities..
category: research

article: You Can Build Better AI Agents in Java Than Python
authors: Rod Johnson
date: 2025-08-18
desc.: This tutorial presents a novel Embadel agent framework. The implementation example of an agentic book writer demonstrates multiple advantages of the JVM and Java language over Python alternatives, including type safety and streamlined application development.
category: tutorial

article: Research: Measuring Energy Consumption in Programming Languages for AI Applications
authors: Miro Wengner
date: 2025-09-15
desc.: This article presents the research paper 'Measuring Energy Consumption in Programming Languages for AI Applications,' which analyzes energy consumption across programming languages used for agentic AI system interactions and computationally intensive applications. The study evaluates Java platform performance and energy efficiency in these contexts, providing development recommendations and hardware selection guidance supported by statistical analysis.
category: research

Previous:
Newsletter vol. 1
Newsletter vol. 2
Newsletter vol. 3
Newsletter vol. 4

The post JC-AI Newsletter #5 appeared first on foojay.

Stochastic AI Agility: Breaking Cycles of Debt

Miro Wengner — Wed, 10 Sep 2025 11:27:16 +0000

The launch of ChatGPT in November 2022 has significantly influenced and potentially transformed industry standards across multiple sectors. While my primary focus remains on the information technology sector, observations indicate that its impact extends across all industries and affects the daily lives of consumers and professionals alike.

This article examines the observed changes in project management practices. These observations are govern by Role 17: Stochastic AI Agility. Over past two decades, the industry has actively pursued the implementation of agile methodologies to enable iterative product delivery.

Figure 1. : Agile process schema connected to the AI-LLMs (blue lines and circles) and human knowledge based on experiences (blue bricks)

The fundamental principle involves decomposing all processes into smaller, manageable units that teams strive to execute with optimal efficiency. While this approach is particularly prevalent in information technology, its application extend beyond this sector. Throughout agile implementation, teams develop the capability to provide accurate delivery estimates--a notable challenging endeavor due to numerous contributing variables. These variables typically generate both visible and invisible technical debts that teams must address, potentially leading to diverse operational scenarios.

Forbes Technology Council has identified 16 obstacles[1] to a successful software project that affect virtually every chosen development scenario:

Poor collaboration between the product and Engineering teams
Not managing data integrity
Not aligning early on the ‘must-haves’
Overlooking nonfunctional requirements
An Unintuitive UI
Unexpected Complexities
Missed Deadlines
Understanding the time needed
Scope creep
An undefined project scope
Unclear or undefined client expectations
Overlooking speed, security or the UX
Security as an afterthought
Hyper-focused planning and design
Undisciplined backlog grooming
Unclear or incomplete product requirements document

The articles suggest to add one additional point which may be called

Stochastic AI Agility

The designation “Stochastic AI Agility” indicates that the AI-LLM implementation consistently contributes toward project objectives, though the precise impact remains inherently unpredictable. The normal distribution curve may represent the degree of AI-LLM contribution across the most established phases --specifically, the core functional areas -- of product development.

Figure 2.: Agile Kandban schema with each stages connected to the AI-LLMs (blue lines and circles) and human knowledge based on experiences (blue bricks)

These scenarios are subsequently evaluated by management and technical staff with the objective of enhancing future iteration processes. The goal is to implement improved methodologies and alternative approaches in subsequent development cycles.

Figure 3.: Agile Scrum framework schema with each stages connected to the AI-LLMs (blue lines and circles) and human knowledge based on experiences (blue bricks)

This represents a sound strategic approach for future implementation. With the advancement and proliferation of Large Language Models (LLMs) solutions, virtually every stage of development or organizational evaluation may exhibit inherent bias, even when such bias is not immediately apparent.

Figure 4.: Waterfall schema with each stages connected to the AI-LLMs (blue lines and circles) and human knowledge based on experiences (blue bricks)

In my assessment, regardless of whether organizations currently employ agile methodologies (Figures 1-3) or traditional approaches (Figure 4), AI-LLM technology influences every phase of project development and methodological implementation. I recommend comprehensive research into agile frameworks, as the integration of LLM technologies, AI-assisted coding, or LLM-informed managerial decision-making may produce suboptimal outcomes with far-reaching consequences.

Point 17, Stochastic AI agility, represents a fundamentally stochastic process that may exhibit cyclical patterns driven by algorithmic bias in recommendations. This characteristic diminishes success probability in direct correlation with the degree of LLM dependency. LLM systems inherently lack consistency and operate non-deterministically.

Each inconsistency introduces technical debt of varying magnitudes. While this may not be immediately apparent, every project or objective constitutes a deterministic process with defined variance parameters, where incremental components are systematically constructed to achieve the desired outcome. Success depends on the level of granular complexity teams can effectively manage to develop these fundamental building blocks (Figure 6).

Figure 6.: Complex challenge and technical debt can be decomposed into manageable incremental units to facilitate goal achievement.

PS: If you are experiencing challenges or considering revision to current methodologies, I welcome the opportunity for collaborative discussion.

References:
[1] 16 Obstacles To A Successful Software Project (And How To Avoid Them)
[2] Stochastic AI Agility: addressing cycle of debts

The post Stochastic AI Agility: Breaking Cycles of Debt appeared first on foojay.

Spec-Driven Development with AI: A New Approach and a Journey into the Past

Simon Martinelli — Mon, 08 Sep 2025 07:35:46 +0000

Table of Contents

The Problem with Code-Centric DevelopmentRequirements as the Single Source of Truth

The Complete Workflow
Everything is Code, Everything is Versioned
AI as the Consistency Engine

The Structure: Independent EpicsA Real Example: System Use Case SpecificationThe ResultsWhy This Works for Business ApplicationsGetting StartedThe Future of Business Application Development

The software development world is buzzing about AI-assisted coding. Tools like Claude Code, Windsurf, and JetBrains Junie promise to make us more productive. But most approaches focus on generating code faster – they’re still code-centric.

What if we took a different approach? What if we made requirements the source of truth and let AI handle everything downstream?

After years of building business applications, I began developing a methodology that combines ideas from the Rational Unified Process (RUP) with modern AI tooling: The AI Unified Process. The results are remarkable: better business alignment, maintainable code, and complete traceability from business needs to implementation.

The Problem with Code-Centric Development

Traditional development follows this pattern:

Write requirements documents
Code the application
Add tests (maybe)
Update documentation (rarely)

The problem? Code becomes the source of truth. Requirements documents get outdated. When bugs appear or features need changes, we dig through the code to understand what the system was supposed to do. It gets even worse if we need to modernize the application a few years later.

AI coding tools make this worse by generating code faster. We’re accelerating toward the same maintenance problems we’ve always had.

Requirements as the Single Source of Truth

My approach flips this around. Requirements stay at the center, and everything else gets generated from them:

The Complete Workflow

Business Requirements Catalog (manual, with business stakeholders)
Business Use Case Diagrams (AI-generated, then reviewed with business stakeholders)
Entity Models (AI-derived from requirements catalog, reviewed by the business)
System Use Case Diagrams (AI-generated and revised by the developer and/or business)
System Use Case Specifications (AI-generated, detailed markdown, revised by the developer and/or business)
Application Code (AI-generated, reviewed by developer)

The key: Every step gets reviewed and revised by the business team. They validate not just the business artifacts, but also the entity models and system use cases. This catches domain modeling errors before they become expensive code problems.

Everything is Code, Everything is Versioned

All specifications are written in Markdown and stored in Git alongside the application code. Use case diagrams are generated as PlantUML source code. This gives us:

Visual diffs: See exactly what changed in diagrams
Complete audit trails: Track every change from requirements to code
Version control: Branch specifications alongside code
Collaborative editing: Business stakeholders can request specific changes

AI as the Consistency Engine

When requirements change, AI tools like Claude Code update all downstream artifacts automatically:

Regenerate the affected use case diagrams
Update entity models
Modify system specifications
Generate new application code
Maintain traceability links

No manual synchronization. No outdated documentation. No guessing what the system should do.

The Structure: Independent Epics

Business applications are complex, but that doesn’t mean they have to be complicated. I organize everything into independent epics:

Event Management: UC-EVENT-001, UC-EVENT-002, etc.
User Management: UC-USER-001, UC-USER-002, etc.
Organization Management: UC-ORG-001, UC-ORG-002, etc.

No cross-epic dependencies. Each epic is a bounded context that can be developed, tested, and deployed independently. This simplifies both the AI generation process and the overall system architecture.

A Real Example: System Use Case Specification

Here’s how a system use case looks in practice:

# Use Case: Create Events
**Use Case ID:** UC-EVENT-001  
**Primary Actor:** Manager  
**Goal:** Allow managers to create new events for their organization

## Main Success Scenario
1. Manager navigates to "Events" section
2. Manager clicks "Create New Event" button
3. System displays event creation form
...

## Business Rules
### BR-EVENT-001: Event Date Validation
- End date must be equal to or after start date
- Events cannot be created with start dates in the past

## Technical Notes
### Validation Rules
- Title: Required, 3-200 characters
- Start Date: Required, must be future date
- End Date: Required, must be >= start date

This level of detail gives AI tools everything they need to generate correct, complete implementations. Business stakeholders can understand the main flow, while developers get precise technical requirements.

The Results

This approach has transformed how I build business applications:

Better Business Alignment: Business stakeholders review every artifact, ensuring the system matches their actual needs.

Maintainable Code: When requirements change, everything downstream updates consistently. No drift between documentation and implementation.

Faster Development: AI handles the tedious work of generating diagrams, specifications, and code. Business stakeholders concentrate on providing the requirements and reviewing the specification. Developers focus on business logic and architecture.

Complete Traceability: From a business requirement to a line of code, every connection is maintained and visible.

Quality Assurance: Generated code includes comprehensive tests based on the use case specifications.

Why This Works for Business Applications

Business applications have predictable technical patterns but more or less complex domain logic. Frameworks like Vaadin, Spring Boot, and jOOQ provide stable foundations (see the Simon Martinelli Stack). The real complexity lies in understanding and modeling the business domain correctly.

This methodology plays to both strengths:

Human expertise handles business domain complexity
AI handles consistent technical implementation

Getting Started

If you want to try this approach:

Start with requirements: Don’t jump to code. Invest time in understanding and documenting business needs.
Make everything code: Use markdown for specifications, PlantUML for diagrams. Version everything in Git.
Review with business: Get stakeholders to validate not just business artifacts, but also entity models and system use cases.
Use AI as your consistency engine: Let tools like Claude Code handle generation and updates.
Keep epics independent: Avoid cross-dependencies to simplify both development and AI generation.

The Future of Business Application Development

We’re at an inflection point. AI can generate high-quality code, but only if we give it high-quality specifications. The organizations that invest in better requirements processes will build better software faster.

This isn’t about replacing developers with AI. It’s about using AI to eliminate the tedious, error-prone work so we can focus on what matters: understanding business needs and designing systems that serve them well.

The code is the easy part. Getting the requirements right is where the real value lies. Read more: https://aiup.dev

This article was first published on https://martinelli.ch/spec-driven-development-with-ai-a-new-approach-and-a-journey-into-the-past/

The post Spec-Driven Development with AI: A New Approach and a Journey into the Past appeared first on foojay.