Daily D4 Digest — 2026-04-18

TL;DR

  • Simon Willison documents a masterclass in agentic prompt engineering: cloning a reference codebase to /tmp, pointing Claude Code at it, and having it self-verify with browser automation — a pattern worth institutionalizing
  • Meta reports 4x higher bug detection with Just-in-Time testing that generates tests dynamically during code review, a major validation of the Verify step in spec-driven workflows
  • Scepsy achieves 2.4x throughput / 27x lower latency for multi-LLM agentic workflows by profiling aggregate GPU shares — the first serious serving system optimized specifically for agentic workloads
  • AWS launches Agent Registry (preview) with native MCP + A2A support, signaling that agent governance infrastructure is becoming a hyperscaler battleground
  • The Autogenesis Protocol proposes explicit lifecycle, versioning, and rollback semantics for agent resources — filling gaps that MCP and A2A intentionally leave open

Call to Action

  • Evaluate Scepsy’s Aggregate LLM Pipeline approach for your multi-model serving stack — the insight that per-LLM execution shares are stable even when end-to-end latency is not could dramatically simplify capacity planning. Paper
  • Adopt the “clone reference codebase to /tmp” pattern as a standard prompting technique for your coding agents — it’s the agentic equivalent of giving a contractor the architectural drawings. Willison’s guide
  • Track AWS Agent Registry alongside Microsoft and Google alternatives — if you’re deploying agents across teams, a centralized registry will soon be table stakes for governance. InfoQ

D1 — Agentic Engineering

Willison’s “Reference Codebase” Pattern for Agentic Prompts. Simon Willison published a detailed breakdown of a deceptively short prompt that accomplished a non-trivial feature addition in a single Claude Code session. The key pattern: instruct the agent to clone simonw/simonwillisonblog from GitHub to /tmp for reference, then tell it to imitate existing logic (the Atom feed filtering) rather than specifying the logic from scratch. The agent derived a correct Django ORM → SQL mapping by reading the model definition from the reference repo and applied it to a completely different HTML/JS tool. Combined with self-verification via uvx rodney for browser automation, this is a reusable template for complex cross-repo agentic tasks. The resulting PR required zero human correction.
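The pattern generalizes into a reusable prompt template. A minimal sketch, assuming nothing about Willison’s actual wording beyond the clone-to-/tmp, imitate, and self-verify instructions described above; the helper function is ours, not his prompt verbatim:

```python
# Hypothetical helper illustrating the "reference codebase" prompt pattern.
# The repo name comes from the write-up; the function is an illustration.

def reference_codebase_prompt(task: str, reference_repo: str, imitate: str) -> str:
    """Point the agent at a cloned reference repo and tell it to imitate
    existing logic instead of specifying that logic from scratch."""
    return "\n".join([
        f"Clone https://github.com/{reference_repo} to /tmp for reference.",
        f"Imitate the existing {imitate} logic from that repo.",
        task,
        "Verify your result yourself in a browser before finishing.",
    ])

prompt = reference_codebase_prompt(
    task="Add the same filtering behaviour to this HTML/JS tool.",
    reference_repo="simonw/simonwillisonblog",
    imitate="Atom feed filtering",
)
```

The point of the template is the division of labor: the reference repo carries the logic, the task line carries the intent, and the final line delegates verification to the agent.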

Claude Code Gets Agent-Based Code Review. Anthropic shipped a new Code Review feature for Claude Code that uses multiple AI reviewers to analyze pull requests. This moves code review from a single-pass LLM check to an agentic multi-reviewer architecture. The practical implication: teams using Claude Code now get review built into the same agent loop that generated the code, closing the write→review→revise cycle without context switches. (Also D4-adjacent: multi-reviewer architectures multiply inference costs per PR.)

Self-Evolving EDA Tools via Multi-Agent Loops. A paper from Yu and Ren describes the first self-evolving logic synthesis framework where LLM agents autonomously rewrite and improve the source code of ABC, a widely-used million-line-scale EDA tool. The agents follow a closed-loop cycle — propose modifications, compile, validate correctness, evaluate quality-of-results on multi-suite benchmarks — discovering optimizations beyond human-designed heuristics. This is the agentic engineering pattern at its most ambitious: agents not just writing code, but iteratively improving complex existing codebases through a correctness-gated evolution loop. The “programming guidance” prompts that constrain agent modifications are effectively a spec boundary.
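The correctness-gated loop can be sketched in a few lines. This uses toy numeric stand-ins for the propose/compile/validate/evaluate stages, since the real framework operates on ABC’s C sources; all names here are illustrative:

```python
# Minimal sketch of a correctness-gated evolution loop
# (propose -> compile -> validate -> evaluate).

import random

def evolve(source, propose, compiles, is_correct, qor, generations=10):
    """Keep a proposed modification only if it compiles, passes the
    correctness gate, and improves quality-of-results (QoR)."""
    best, best_qor = source, qor(source)
    for _ in range(generations):
        candidate = propose(best)
        if not compiles(candidate):
            continue                  # reject: build failure
        if not is_correct(candidate):
            continue                  # reject: correctness gate
        score = qor(candidate)
        if score > best_qor:          # commit only on measured improvement
            best, best_qor = candidate, score
    return best, best_qor

# Toy stand-ins: "source" is a number we try to maximize under a hard cap,
# which plays the role of the spec boundary.
result, score = evolve(
    source=1.0,
    propose=lambda s: s + random.uniform(-0.5, 1.0),
    compiles=lambda s: True,
    is_correct=lambda s: s <= 10.0,
    qor=lambda s: s,
    generations=200,
)
```

The gates are the whole story: the agent can propose anything, but nothing is committed without passing validation and beating the incumbent on the benchmark.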

Datasette 1.0a28: Claude Code + Opus 4.7 in Production. In a practical data point, Willison notes that most changes in Datasette 1.0a28 were implemented using Claude Code and the newly released Claude Opus 4.7. The release fixes real breakages from the prior alpha — compatibility bugs, resource cleanup issues — suggesting that agentic coding is now handling non-trivial bug-fix work on production-critical open source infrastructure.

D2 — AI in the Product

MM-WebAgent: Hierarchical Agents for Coherent Webpage Generation. Microsoft Research introduces MM-WebAgent, a framework that coordinates AIGC tools (image generators, video creators, visualization engines) through hierarchical planning and iterative self-reflection to produce visually consistent webpages. The key insight is that generating multimodal elements in isolation produces incoherent results; MM-WebAgent jointly optimizes global layout, local content, and integration. This is a direct D2 play — embedding agentic orchestration into a generative design product. The accompanying benchmark and multi-level evaluation protocol make this more actionable than typical research.

Blue’s Data Intelligence Layer: LLMs, Web, and Users as First-Class Data Sources. IBM’s Blue DIL goes beyond NL2SQL by treating LLMs, the web, and interactive user sessions as queryable “databases” alongside structured enterprise data. Data planners decompose complex requests into executable query plans that span relational operators, retrieval from heterogeneous sources, and cross-modal reasoning. This is a compound AI system architecture that’s directly relevant if you’re building data-centric products where agents need to reason across structured and unstructured sources. (Also D1 + D3: the data registry and declarative query plans are patterns that generalize.)
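The planner idea can be illustrated with a toy query plan that spans heterogeneous sources. The Step structure and the stub executors below are our assumptions for illustration, not IBM’s API:

```python
# Illustrative sketch of a data-planner query plan in the spirit of Blue
# DIL: steps target different source kinds, later steps consume earlier
# results. Executors are stubs that just tag their inputs.

from dataclasses import dataclass

@dataclass
class Step:
    source: str        # "sql" | "web" | "llm"
    query: str
    depends_on: list   # indices of earlier steps whose results feed this one

EXECUTORS = {
    "sql": lambda q, inputs: f"rows({q})",
    "web": lambda q, inputs: f"docs({q})",
    "llm": lambda q, inputs: f"answer({q} | {', '.join(inputs)})",
}

def run_plan(steps):
    """Execute steps in order, wiring earlier results into later steps."""
    results = []
    for step in steps:
        inputs = [results[i] for i in step.depends_on]
        results.append(EXECUTORS[step.source](step.query, inputs))
    return results

plan = [
    Step("sql", "SELECT revenue FROM sales WHERE year=2025", []),
    Step("web", "industry revenue benchmarks 2025", []),
    Step("llm", "compare our revenue to the benchmarks", [0, 1]),
]
results = run_plan(plan)
```

Even this stub shows the structural claim: the LLM is just one operator among several, and the plan, not the model, owns the cross-source wiring.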

D3 — Build for Agents

AWS Agent Registry: Governing Agent Sprawl. AWS launched Agent Registry in preview as part of Amazon Bedrock AgentCore. It provides a centralized catalog for discovering, governing, and reusing AI agents, tools, and MCP servers across an organization, with native support for both MCP and A2A protocols. The registry indexes agents regardless of where they run. Microsoft, Google Cloud, and the ACP Registry offer competing solutions. The strategic signal: the hyperscalers are treating agent registries the way they treated container registries five years ago — as essential governance infrastructure for the next platform shift.
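What such a catalog tracks can be sketched in miniature. The field names below are guesses for illustration, not the AWS Agent Registry schema, and the endpoints are made up:

```python
# Toy sketch of an agent registry: register entries with protocol support,
# then discover everything reachable over a given protocol.

registry = {}

def register(name, protocols, endpoint, owner):
    registry[name] = {"protocols": set(protocols),
                      "endpoint": endpoint,
                      "owner": owner}

def discover(protocol):
    """Find all registered agents/tools that speak a given protocol."""
    return sorted(n for n, e in registry.items() if protocol in e["protocols"])

register("code-reviewer", {"a2a"}, "https://agents.internal/review", "platform-team")
register("docs-search", {"mcp"}, "https://tools.internal/docs", "docs-team")
register("planner", {"mcp", "a2a"}, "https://agents.internal/plan", "platform-team")

mcp_agents = discover("mcp")   # cross-team discovery by protocol
```

The governance value comes from the metadata, not the lookup: once ownership and protocol support are declared in one place, policy can be enforced there too.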

Autogenesis Protocol: Versioned Lifecycle for Agent Resources. The Autogenesis Protocol (AGP) directly addresses gaps in A2A and MCP around lifecycle management, version tracking, and evolution-safe update interfaces. Its Resource Substrate Protocol Layer (RSPL) models prompts, agents, tools, environments, and memory as protocol-registered resources with explicit state and versioned interfaces, while its Self-Evolution Protocol Layer (SEPL) provides a closed-loop operator interface with auditable lineage and rollback. This is ambitious — it’s essentially proposing version control semantics for the entire agent stack. The evaluation shows consistent improvements over baselines on long-horizon tasks, suggesting that explicit resource management matters for reliability.
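The RSPL idea of versioned, rollback-capable resources can be sketched concretely. The class and method names below are ours, not the protocol’s:

```python
# Sketch of a versioned agent resource with auditable lineage and rollback,
# in the spirit of AGP's Resource Substrate Protocol Layer.

class VersionedResource:
    """An agent resource (prompt, tool, memory, ...) with explicit state,
    a full version history, and an audit log of every transition."""

    def __init__(self, name, state):
        self.name = name
        self.history = [state]   # version 0 is the initial state
        self.lineage = []        # (from_version, to_version, reason) tuples

    @property
    def state(self):
        return self.history[-1]

    def update(self, new_state, reason):
        self.lineage.append((len(self.history) - 1, len(self.history), reason))
        self.history.append(new_state)

    def rollback(self, version, reason):
        if not 0 <= version < len(self.history):
            raise ValueError("unknown version")
        # Rollback is itself a new, audited version -- history never shrinks.
        self.lineage.append((len(self.history) - 1, version, reason))
        self.history.append(self.history[version])

prompt = VersionedResource("planner-prompt", "v0: be helpful")
prompt.update("v1: be helpful and terse", reason="verbosity complaints")
prompt.rollback(0, reason="v1 hurt long-horizon tasks")
```

The design choice worth noting is that rollback appends rather than truncates: the lineage of a bad version survives the rollback, which is what makes the trail auditable.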

Agentic Microphysics: A Safety Framework for Agent Populations. A methodological paper argues that safety analysis must shift from isolated models to interaction-level mechanisms between agents. They introduce “agentic microphysics” — studying how one agent’s output becomes another’s input under specific protocol conditions — and “generative safety” — growing phenomena from micro-level conditions to identify thresholds and design interventions. For teams building B2A or multi-agent systems, this provides a principled way to think about emergent failure modes that don’t appear in single-agent testing.

D4 — Performance & Cost at Scale

Scepsy: 2.4x Throughput, 27x Lower Latency for Agentic Serving. This is the standout D4 item. Scepsy is a new serving system specifically designed for multi-LLM agentic workflows that branch, fan-out, and recur unpredictably. The key insight: while end-to-end agentic workflow latency is unpredictable, the share of each LLM’s total execution time is surprisingly stable across runs. Scepsy profiles LLMs under different parallelism degrees, constructs an “Aggregate LLM Pipeline” as a lightweight latency/throughput predictor, then searches over fractional GPU shares, tensor parallelism degrees, and replica counts to find optimal allocations. A hierarchical heuristic places allocations onto the GPU cluster while minimizing fragmentation and respecting network topology. Results: up to 2.4x higher throughput and 27x lower latency versus systems that optimize LLMs independently. This is the first serious attempt at treating multi-model agentic workloads as a first-class scheduling problem.
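The core insight translates into a back-of-the-envelope predictor: if each LLM’s share of total execution time is stable, workflow latency under an allocation follows from per-LLM profiles alone. The profile numbers and the brute-force grid search below are illustrative, not Scepsy’s actual algorithm:

```python
# Sketch of the "Aggregate LLM Pipeline" idea: predict workflow latency
# from stable per-LLM shares, then search fractional GPU allocations.

from itertools import product

# Stable execution-time shares per LLM, measured by profiling (sum to 1.0).
shares = {"planner": 0.2, "coder": 0.5, "reviewer": 0.3}

# Profiled throughput per GPU (requests/s) for each LLM (made-up numbers).
throughput_per_gpu = {"planner": 8.0, "coder": 2.0, "reviewer": 4.0}

def predicted_latency(alloc):
    """Each LLM contributes its stable share of work divided by the
    speed its fractional GPU allocation buys it."""
    return sum(shares[m] / (throughput_per_gpu[m] * g) for m, g in alloc.items())

def best_allocation(total_gpus=4.0, step=0.5):
    """Brute-force search over fractional GPU splits for three LLMs."""
    grid = [step * i for i in range(1, int(total_gpus / step))]
    best = None
    for gp, gc in product(grid, repeat=2):
        gr = total_gpus - gp - gc
        if gr <= 0:
            continue
        alloc = {"planner": gp, "coder": gc, "reviewer": gr}
        lat = predicted_latency(alloc)
        if best is None or lat < best[1]:
            best = (alloc, lat)
    return best

alloc, lat = best_allocation()   # the slow, heavy "coder" LLM wins the most GPUs
```

Scepsy replaces the toy grid search with a hierarchical heuristic and adds tensor-parallelism degrees, replica counts, and cluster placement, but the shape of the optimization is the same: a cheap predictor makes the search space tractable.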

Amazon Deploys Random Graph Datacenter Fabrics at 45% Lower Cost. While not AI-specific, Amazon’s production deployment of random-graph-based datacenter fabrics (RNG) is directly relevant to the infrastructure layer that agentic workloads run on. RNG fabrics match or exceed fat-tree performance across traffic patterns while being up to 45% cheaper, using a novel distributed routing protocol that exploits random graph properties for edge-disjoint path discovery. Amazon has made this the default fabric for most workloads. For anyone running inference at scale, this is the substrate getting cheaper underneath you.

Software Civil Engineering Lens

Today’s items provide unusually strong evidence across multiple SCE pillars:

Meta’s JiT Testing as the Verify step made real. Meta’s Just-in-Time testing achieving 4x higher bug detection is the most concrete validation we’ve seen of the Specify → Plan → Verify → Apply → Observe lifecycle in production. By generating tests dynamically during code review (not ahead of time from a static suite), Meta has effectively made verification change-aware — the tests are generated from the intent of the diff, not from a pre-existing spec. This is the “simulation” pillar of SCE manifesting as mutation testing + LLM-generated intent analysis. The “Dodgy Diff” workflow is particularly interesting: it’s an automated system for detecting when a change looks suspicious, which is a proto-form of the “codes and norms” pillar — automated enforcement of what constitutes safe practice.

Autogenesis Protocol as material datasheets for agents. AGP’s insistence on explicit state, lifecycle, and versioned interfaces for every agent resource is the closest thing we’ve seen to material datasheets for the agentic stack. When you can roll back an agent’s behavior to a known version with auditable lineage, you’re operating in the same regime as civil engineers who can trace a structural failure to a specific batch of steel. The gap between AGP’s academic proposal and production adoption is real, but the direction is right.

Self-evolving EDA tools as bounded autonomy in practice. The self-evolving ABC framework is a compelling example of bounded autonomy: agents are free to propose arbitrary code modifications, but every modification must pass through a correctness validation gate and QoR evaluation on standardized benchmarks before being committed. The “programming guidance” prompts act as the spec boundary. The agents discovered optimizations beyond human-designed heuristics — this is exactly the promise of the 10% → 10× transition, where humans set the constraints and agents find solutions the humans wouldn’t have found.

The pattern connecting Willison, Meta, and the EDA paper is the same: human judgment is relocating from “write the code” to “define the validation criteria.” Willison tells Claude to compare output against his homepage. Meta generates tests from diff intent. The EDA agents are constrained by benchmark suites. In all three cases, the spec is the control surface, and execution is delegated. This is the professionalization thesis in miniature.

Sources