
πŸ€–πŸ§­ Agentic Software Engineering

πŸ€– AI Summary

  • πŸ“ High-Level Summary: Agentic Software Engineering (ASE) represents a fundamental paradigm shift in how software is conceived, developed, and maintained. It leverages AI agents capable of autonomous action, tool use, and iterative problem-solving to amplify human software engineering capabilities. ASE encompasses the study of patterns, tools, frameworks, and best practices for effectively collaborating with AI coding agents. The discipline is evolving rapidly, with new patterns, tools, and research emerging weekly. πŸ§ πŸ’»πŸš€

  • πŸ”‘ Key Concepts: The field is characterized by long-running agent sessions with tool execution capabilities, prompt engineering for agents, multi-agent orchestration, human-in-the-loop oversight, context engineering, and the development of agent-specific software engineering practices distinct from traditional human-centric approaches. The core shift is from β€œcode is expensive” to β€œcode is cheap now” - focusing human effort on architecture, quality, and integration rather than implementation details. βš‘πŸ“‰


🧠 Mental Models

🌟 The Spectrum of AI-Assisted Development

  • 🌱 AI Prototyping - Fast, exploratory, prompt-driven. Great for learning and proving concepts. CTCO
  • 🎯 Directed AI Assistance - Specify constraints, reference patterns, define success upfront. Tool is a lever; you’re in control. CTCO
  • πŸ‘₯ Agent Orchestration - Split work across multiple agents, run in parallel, integrate results. Like managing a small team. CTCO
  • πŸ—οΈ Agentic Engineering - Build persistent workflows with context, guardrails, and quality gates. Stay accountable for security and delivery. Simon Willison

πŸ’‘ Core Principles

  • 🏰 Code is Cheap Now - The fundamental mental shift: writing code has become nearly free. This changes everything about how we estimate, plan, and execute. Focus on architecture, quality, and integration - the decisions that require human judgment. Simon Willison

  • βš–οΈ SE for Humans vs SE for Agents - Agentic SE introduces a fundamental duality: traditional human-centric development and new agent-centric workflows. Each requires different tools, processes, and artifacts. arXiv

  • πŸ”„ From Vibe Coding to Agentic Engineering - β€œVibe coding” (using AI without attention to code, often by non-programmers) contrasts with β€œagentic engineering” (professional software engineers using AI to amplify their expertise). The latter involves rigorous methodology, quality gates, and accountability. Simon Willison

🎯 The Agentic Engineer Role

  • πŸ§‘β€πŸ’» Architect - High-level system design and decision-making

  • πŸ” Quality Controller - Code review, standards enforcement, catching what agents miss

  • πŸ“š Context Manager - Maintaining documentation and context that makes agents effective

  • 🎯 Strategist - Deciding what to build and how to approach it

  • πŸ”‘ Key Insight: You need to be able to do what the agent does, and recognize when it’s gone wrong. Spot subtle bugs, security gaps, and maintainability issues. CTCO


πŸ”§ How-To Guidance

πŸ§ͺ Test-Driven Development for Agents

  • πŸ“ Red/Green TDD - Write tests first, confirm they fail, then implement. Test-first development helps agents write more succinct, reliable code with minimal prompting. Simon Willison

  • πŸ”΄ Force Broken Tests - Require agents to write failing tests before implementation to ensure they understand requirements. Atomic Object

  • ▢️ Real Data Testing - Use production-like data to gain clarity on edge cases. Atomic Object

  • ⏸️ Human-in-the-Loop - Pause the AI and run tests yourself during development cycles. Atomic Object

πŸ“ Context Engineering

  • πŸ—‚οΈ CLAUDE.md Files - Project-specific instructions, conventions, and rules that persist across sessions. Anthropic

  • πŸ“‹ Instruction Directories - Maintain persistent context files that give agents project-specific rules, patterns, and conventions. The more context you give, the better the outputs. CTCO

  • πŸ“₯ Just-in-Time Loading - Load relevant context only when needed to avoid overwhelming the context window. Morph

  • 🚫 .claudeignore - Exclude irrelevant files from agent context to improve focus and reduce noise. Morph

  • πŸ”€ Subagent Isolation - Delegate subtasks to isolated subagents to prevent context pollution. Morph

🧩 Problem Decomposition

  • πŸ—οΈ Break Work Into Chunks - When orchestrating agents, break work into independent chunks. Each agent needs clear context about its piece and how it connects to others. CTCO

  • πŸ“¦ Batch Related Work - Group similar tasks together to maximize context efficiency. Aditya Bawankule

  • πŸ”„ Parallel Worktrees - Use Git worktrees for parallel work without conflicts. Aditya Bawankule

πŸ™‹ Human Oversight

  • πŸ‘οΈ Confirm Before Acting - Set explicit confirmation requirements for destructive or irreversible actions. Simon Willison

  • πŸ›‘οΈ Guardrails - Implement safety checks and boundaries for agent actions.

  • πŸ“Š Observability - Log, trace, and monitor agent sessions for debugging and quality control.


πŸ› οΈ Tools & Frameworks

πŸ€– Coding Agents

πŸ† ToolπŸ”‘ Key FeaturesπŸ“Š Best For
Claude CodeTerminal-based, long-running sessions, tool execution, subagents, agent teamsDeep refactoring, complex debugging
OpenAI CodexMulti-agent parallel execution, Codex Agent LoopLarge-scale automation, enterprise
GitHub CopilotIDE integration, chat, agent tasksInteractive development
CursorAI-native IDE, AI Tab, Chat, Ctrl+KIntegrated AI development
  • πŸ“ˆ Model Selection - OpenAI’s Codex series optimized for code execution. Anthropic’s Claude excels at reasoning. Google’s Gemini 3.1 offers strong performance at half Claude’s price. Simon Willison

  • πŸ“Š SWE-Bench Leaders - Claude Opus 4.6 (Thinking) leads with 79.20% on SWE-bench, followed by Gemini 3 Flash (76.20%) and GPT 5.2 (75.40%). Vals AI

πŸ—οΈ Agent Frameworks

  • πŸ”— LangChain - Python-based framework for building agent applications. langchain.com

  • 🀝 AutoGen (Microsoft) - Multi-agent conversation framework. microsoft.com/autogen

  • 🦞 OpenClaw - Open-source autonomous coding agent. GitHub

  • βš™οΈ CrewAI - Multi-agent orchestration for complex workflows. crewai.com

  • 🌐 Spring AI - Enterprise Spring-based agent patterns. Spring

πŸ”Œ Model Context Protocol (MCP)

  • πŸ”Œ What is MCP - Open standard by Anthropic that defines a unified way for AI agents to connect to external tools, data sources, and services. Like β€œUSB-C for AI integrations.” Anthropic

  • πŸ› οΈ MCP Servers - Pre-built integrations for databases, APIs, and tools. GitHub

  • πŸ“¦ MCP Registry - Growing ecosystem of MCP-compatible tools. modelcontextprotocol.io

πŸ’Ύ Local Models

  • πŸ–₯️ Ollama - CLI tool for running LLMs locally. ollama.com

  • πŸ“± LM Studio - Desktop app for local LLM experimentation. lmstudio.ai

  • 🏠 Jan - Privacy-focused local AI. jan.ai

  • πŸ”’ Benefits - Data privacy, no API costs, offline capability, total control. AI Lexicon


πŸ”’ Security & Safety

🚨 OWASP Top 10 for Agentic AI (2026)

  1. πŸ—οΈ Sensitive Data Disclosure - Agents may expose sensitive data through outputs or tool calls
  2. πŸ”„ Tool Poisoning - Compromised tools inject malicious behavior
  3. 🧠 Memory Pollution - Agent context manipulation through injected memories
  4. 🎭 Prompt Injection - External inputs override agent instructions
  5. πŸ”“ Unbounded Execution - Agents can execute unlimited actions without oversight
  6. πŸ“¦ Dependency Confusion - Agent dependencies can be hijacked
  7. 🦠 Multi-Agent Malware - Agents can spread malicious behavior
  8. πŸ’‰ Code Injection - Agent-generated code contains exploits
  9. πŸ‘€ Identity Confusion - Agents impersonate multiple identities
  10. ⚑ Denial of Wallet - Uncontrolled agent resource consumption

OWASP

πŸ›‘οΈ Security Best Practices

  • πŸ” Least Privilege - Grant agents minimum necessary permissions
  • βœ… Input Validation - Sanitize all inputs to agents
  • πŸ“ Audit Logging - Complete traceability of agent actions
  • πŸ”’ Secret Management - Never expose credentials to agents unnecessarily
  • πŸ‘οΈ Human Approval - Require human confirmation for sensitive operations

πŸ“Š Observability & Monitoring

πŸ“ˆ Key Metrics

  • ⏱️ Latency - Response time per step and overall task completion
  • πŸ’° Cost - Token usage and API costs per task
  • βœ… Success Rate - Task completion and quality metrics
  • πŸ”„ Token Usage - Context window utilization and efficiency

πŸ› οΈ Observability Tools

  • πŸ“Š LangSmith - LangChain’s observability platform
  • πŸ“ˆ Datadog - AI observability and monitoring
  • πŸ” OpenTelemetry - Open standard for tracing agents
  • πŸ“‰ AgentOps - Agent-specific monitoring

🎯 Production Considerations

  • πŸ“ Trace Tool Calls - Every LLM call, tool execution, and decision needs logging
  • πŸ’Ύ Checkpoint State - Save agent state for recovery and debugging
  • πŸ“‹ Cost Alerts - Set thresholds to prevent runaway spending
  • πŸ”” Quality Evaluation - Automated assessment of agent outputs

πŸ“š Key Research & Papers

πŸ”¬ Foundational Papers

πŸ“Š Evaluation Benchmarks

  • πŸ† SWE-bench - Software engineering benchmark with production tasks. Claude Opus 4.6 leads at 79.20%.

  • πŸ“ˆ SWE-bench Pro - More challenging version with 1,865 real repository tasks.

  • πŸ”¬ SWE-rebench - Automated pipeline for decontaminated agent evaluation.


πŸ”„ From Single Agents to Coordinated Teams

  • Complex tasks now span multiple specialized agents working in parallel
  • Each agent handles a specific subtask with dedicated context
  • Integration and orchestration become critical skills

⏱️ Long-Running Agents

  • Agents can now work autonomously for hours, handling multi-file refactors
  • Persistence and state management become essential
  • Checkpointing and recovery mechanisms mature

πŸ‘οΈ Human Oversight Evolution

  • From β€œhuman in the loop” to β€œhuman on the loop” - oversight rather than constant intervention
  • Intelligent collaboration - humans focus on decisions, agents handle implementation
  • Escalation protocols for ambiguous or high-stakes decisions

πŸ”’ Security-First Architecture

  • Agent-generated code introduces new attack surfaces
  • Guardrails, sandboxing, and permission systems become standard
  • Non-human identities (NHIs) emerge as a security category

πŸ’° Cost Management

  • Token usage optimization becomes critical
  • Prompt caching reduces costs 90% for long sessions
  • Budget limits and spending alerts for production agents

🎯 Practical Next Steps

πŸ§ͺ For Individuals

  1. πŸ“š Start with TDD - Agent-friendly tests = better agent output
  2. πŸ“ Build your instruction directory - Project conventions, patterns, and rules
  3. 🎯 Orchestrate, don’t micromanage - Give agents goals, not step-by-step instructions
  4. πŸ“Š Invest in observability - Agent sessions need logging, tracing, and rollback strategies
  5. πŸ“š Keep learning - This space evolves weekly; follow Simon Willison, Anthropic engineering, and arXiv SE research

🏒 For Teams

  1. πŸ“‹ Establish Agent Guidelines - Document approved patterns and restrictions
  2. πŸ”’ Implement Security Gates - Scan agent outputs before production
  3. πŸ“Š Monitor Costs - Set budgets and track agent spending
  4. πŸ‘₯ Create Agent Librarian Role - Maintain context and patterns for the team
  5. πŸ”„ Iterate on Processes - Learn from each agent interaction

πŸ“– Bibliography & References

πŸ”¬ Research Papers

πŸ“° Articles & Blogs

πŸ› οΈ Tools & Frameworks

πŸ“Š Benchmarks


  • πŸ—“οΈ Last Updated: 2026-03-01
  • πŸ”„ Research in Progress - This topic is actively being developed