Daily D4 Digest — 2026-04-19
TL;DR
- Simon Willison documents a masterclass in agentic prompt patterns: clone a reference repo to /tmp, point at existing logic, and have the agent self-verify with browser automation — all in a 3-line prompt
- Claude Opus 4.7’s system prompt changes reveal Anthropic’s shift toward “act first, ask later” agent behavior, a tool_search meta-tool, and new product surfaces (Chrome agent, Excel agent, PowerPoint agent)
- AWS DevOps Agent hits GA — automated incident investigation across AWS environments signals the mainstreaming of D1-style agentic ops tooling
- Google’s Aletheia solves 6/10 novel research-level math problems fully autonomously, pushing the boundary on what “agent without human in the loop” means for formal verification
Call to Action
- Adopt the “clone to /tmp for reference” pattern in your coding agent workflows — it’s the cheapest way to communicate complex domain context without writing documentation. See Willison’s breakdown
- Audit your agent system prompts for the “act vs. clarify” trade-off — Anthropic’s new <acting_vs_clarifying> section is a template for how to instruct agents to bias toward action. Review the full diff
- Evaluate AWS DevOps Agent for your incident response pipeline if you’re on AWS — GA means it’s ready for production assessment. Details here
D1 — Agentic Engineering
Reference-repo-as-context: a reusable agentic prompt pattern. Simon Willison’s new guide chapter dissects a deceptively simple 3-line prompt that got Claude Code to add a new content type to a tool. The key patterns: (1) clone a reference codebase to /tmp so the agent can read related domain logic without polluting the working tree, (2) point at existing behavior (“similar to how the Atom everything feed works”) instead of specifying logic, and (3) give the agent a self-verification loop (python -m http.server + browser automation via uvx rodney). The agent independently derived a beatTypeDisplay mapping from the Django ORM definition it found in the reference repo. This is a concrete, repeatable pattern for any team using coding agents: treat existing codebases as executable specifications that the agent can read.
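A hedged sketch of what such a prompt can look like — the repo URL and the “notes” content type below are hypothetical stand-ins, not Willison’s actual prompt; only the feed reference, `python -m http.server`, and `uvx rodney` come from his writeup:

```
Clone https://github.com/example/reference-blog to /tmp and read how the
Atom "everything" feed is generated there. Add a new "notes" feed to this
project that works the same way. Then run `python -m http.server` and
check the rendered page in a browser with `uvx rodney` before finishing.
```

The three clauses map directly onto the three patterns: context via cloned repo, spec via existing behavior, and a self-verification loop.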
Claude Opus 4.7 system prompt signals Anthropic’s “agent-first” posture. Willison’s detailed diff analysis surfaces several changes with direct implications for D1 practitioners. The new <acting_vs_clarifying> section instructs Claude to “make a reasonable attempt now, not to be interviewed first” — a fundamental behavioral shift that favors agentic action over clarification loops. The addition of tool_search as a meta-tool means Claude will proactively discover available capabilities before telling a user “I can’t do that.” New product surfaces — Claude in Chrome (browsing agent), Claude in Excel, and Claude in PowerPoint, all orchestratable via Claude Cowork — expand the agentic surface area significantly. Also noteworthy: anti-verbosity instructions (“keeps responses focused and concise”) and removal of behavioral patches from 4.6 suggest the base model has internalized constraints that previously required prompt-level enforcement. (Cross-cuts D2, D3)
AWS DevOps Agent reaches general availability. AWS announced GA of its generative AI-powered assistant for incident investigation, deployment analysis, and operational task automation. This is notable not for novelty but for maturity: a hyperscaler shipping an agentic ops tool as a first-class production service. For teams building D1 practices, this sets a baseline expectation — automated incident triage is now a commodity capability on AWS, and custom solutions need to clear a higher bar. (Cross-cuts D4)
System prompts as git timelines — infrastructure for prompt archaeology. Willison also published a research tool that converts Anthropic’s published system prompt history into a git repo with fake commit dates, enabling git diff, git log, and git blame on prompt evolution. This is a small but clever piece of D1 infrastructure — treating system prompts as versioned artifacts with proper change tracking, exactly as you’d treat infrastructure-as-code.
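The core trick — back-dating commits so git tooling sees a timeline — fits in a few lines. A minimal sketch, assuming prompt versions arrive as (date, text) pairs; the function names and file name here are my own, not Willison’s tool:

```python
import os
import subprocess

def commit_env(date: str) -> dict:
    """Environment overrides that back-date a commit to a given (fake) date."""
    env = os.environ.copy()
    env["GIT_AUTHOR_DATE"] = date
    env["GIT_COMMITTER_DATE"] = date
    return env

def build_prompt_repo(versions, repo_dir):
    """Turn an iterable of (iso_date, prompt_text) into one commit per version.

    Afterwards, `git log`, `git diff`, and `git blame` work on the prompt
    history exactly as they would on source code.
    """
    subprocess.run(["git", "init", "-q", repo_dir], check=True)
    path = os.path.join(repo_dir, "system_prompt.md")
    for date, text in versions:
        with open(path, "w") as f:
            f.write(text)
        subprocess.run(["git", "-C", repo_dir, "add", "system_prompt.md"],
                       check=True)
        subprocess.run(
            ["git", "-C", repo_dir,
             "-c", "user.name=archive", "-c", "user.email=archive@example.com",
             "commit", "-q", "-m", f"prompt as of {date}"],
            check=True, env=commit_env(date),
        )
```

`GIT_AUTHOR_DATE`/`GIT_COMMITTER_DATE` are standard git environment variables, so no history rewriting is needed after the fact.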
D2 — AI in the Product
Claude’s expanding tool surface redefines “chat product.” The Opus 4.7 system prompt now references 23 distinct tools including bash_tool, web_search, image_search, fetch_sports_data, weather_fetch, places_search, recipe_display_v0, and the new tool_search meta-tool. Claude.ai is no longer a chatbot — it’s an agent runtime with a conversational UI. The search_mcp_registry tool is particularly interesting for D3 implications: Claude can discover and connect to MCP servers dynamically. Product teams building conversational interfaces should study this tool taxonomy as a reference architecture for what a mature agent-as-product looks like.
Google Aletheia: autonomous agents producing novel research output. Aletheia, using Gemini 3 Deep Think, solved 6/10 novel math problems in the FirstProof challenge and scored ~91.9% on IMO-ProofBench — all without human intervention. While this is a research benchmark, the implication for D2 is directional: AI products that produce genuinely novel, verifiable intellectual output (not just summaries or transformations) are becoming feasible. Formal proof discovery is an extreme case, but the pattern — autonomous agent + formal verification of output — generalizes.
D3 — Build for Agents
search_mcp_registry and tool_search in Claude’s production system. The Opus 4.7 tool list includes both search_mcp_registry (discover MCP servers) and tool_search (discover deferred tools at runtime). This is the clearest signal yet that Anthropic is building Claude as an MCP-native agent consumer. If you’re exposing capabilities via MCP, Claude’s 200M+ users can now discover and connect to your server dynamically. The system prompt explicitly instructs Claude to call tool_search before ever claiming it lacks a capability — meaning third-party tools get first-class discoverability.
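The mechanics are easy to picture with a toy registry. This is a hypothetical sketch — Anthropic’s actual tool_search implementation is not public; only the tool names below appear in the published prompt, and the descriptions and scoring are invented for illustration:

```python
# Hypothetical sketch of a tool_search-style meta-tool: score deferred tool
# descriptions against a query before the agent claims a capability is missing.
REGISTRY = {
    "fetch_sports_data": "look up live scores schedules and league standings",
    "weather_fetch": "current conditions and forecasts for a location",
    "places_search": "find restaurants shops and points of interest nearby",
}

def tool_search(query: str, limit: int = 3) -> list[str]:
    """Rank registered tools by naive keyword overlap with the query."""
    words = set(query.lower().split())
    scored = [
        (len(words & set(desc.split())), name)
        for name, desc in REGISTRY.items()
    ]
    # Keep only tools with at least one matching keyword, best matches first.
    return [name for score, name in sorted(scored, reverse=True) if score > 0][:limit]
```

For example, `tool_search("forecasts for a location")` surfaces `weather_fetch`, while an unmatched query returns an empty list — at which point, per the system prompt, the agent may finally say “I can’t do that.”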
D4 — Performance & Cost at Scale
No significant standalone D4 developments today. The AWS DevOps Agent GA (covered in D1) has cost implications — it potentially reduces the need for custom incident investigation tooling — but no specific performance or cost benchmarks were published.
Software Civil Engineering Lens
Today’s items collectively advance the SCE thesis on multiple fronts:
System prompts as codes/norms. The Opus 4.7 system prompt is increasingly reading like a building code for AI behavior. The <acting_vs_clarifying> section, the <evenhandedness> section, the <critical_child_safety_instructions> — these are normative constraints that define the envelope of acceptable agent behavior. Anthropic versioning and publishing these prompts (and Willison turning them into git-diffable artifacts) is exactly the kind of transparency that professionalized engineering disciplines require. We’re watching “codes and norms” for agent behavior emerge in real time.
Reference codebases as specifications. Willison’s pattern of cloning a reference repo and telling the agent “do it like this existing code does” is an informal but effective form of specification-by-example. The agent reads the Django ORM model, infers the business logic, and implements it correctly in a different context. This is a lightweight version of the SCE vision: the existing codebase serves as a “blueprint” that constrains the agent’s implementation. The gap from here to formal Event Modeling specs is significant, but the behavioral pattern — give the agent a specification to constrain its work — is identical.
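In miniature, the inference step looks like this. The field values are hypothetical — the source says only that the agent derived a beatTypeDisplay mapping from a Django ORM definition it found in the reference repo:

```python
# Hypothetical reconstruction: a Django-style `choices` list in the reference
# repo doubles as a machine-readable spec for display names.
BEAT_TYPE_CHOICES = [
    ("entry", "Entry"),
    ("blogmark", "Blogmark"),
    ("quotation", "Quotation"),
]

# What an agent can derive from the ORM definition alone, with no docs:
beat_type_display = dict(BEAT_TYPE_CHOICES)
```

The point is that the mapping was never written down anywhere as a requirement — it was latent in existing code, and the agent extracted it.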
Aletheia and formal verification. Google’s Aletheia is the most SCE-relevant development today, even though it’s in pure mathematics. An autonomous agent producing novel proofs that are formally verifiable is the platonic ideal of the Specify → Plan → Verify → Apply → Observe lifecycle. The agent operates with bounded autonomy (within the rules of formal mathematics), produces output, and that output is mechanically verifiable. The question for software engineering: when do we get Aletheia-like agents that produce formally verified software instead of formally verified proofs? The gap is narrowing.
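For flavor, a toy Lean proof shows what “mechanically verifiable output” means in practice: the kernel either accepts the proof term or rejects it, with no human judgment in the loop. This is an illustrative example, not an Aletheia output:

```lean
-- A trivial theorem with a machine-checked proof: commutativity of addition
-- on natural numbers, verified by Lean's kernel rather than by a reviewer.
theorem add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Aletheia’s proofs are vastly harder, but the acceptance criterion is the same binary check — which is precisely what Verify means in the SCE lifecycle.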
Sources
- Adding a new content type to my blog-to-newsletter tool — Willison’s deep dive on a 3-line agentic prompt pattern using reference repos, existing behavior as spec, and browser-based self-verification
- Changes in the system prompt between Claude Opus 4.6 and 4.7 — Detailed diff analysis of Anthropic’s system prompt evolution, including new agent-first behavioral instructions and tool taxonomy
- AWS DevOps Agent GA — AWS ships production-ready agentic incident investigation across AWS environments
- Google’s Aletheia — Gemini 3 Deep Think solves 6/10 novel math problems autonomously, ~91.9% on IMO-ProofBench
- Claude system prompts as a git timeline — Research tool that converts Anthropic’s published prompt history into git-diffable versioned artifacts
