The Vibe Coding Trap: Why AI-Generated Code Needs an Architect

A few months ago, I spent a week reviewing code from a mid-sized enterprise team that had adopted AI coding tools across their entire engineering organization. They had done everything right on paper: rolled out GitHub Copilot and Cursor, ran training sessions, encouraged experimentation. Output velocity tripled within two months. Pull requests that used to take two days were landing in hours.

Then the production incidents started.

In the six months after AI adoption, their incident rate tripled alongside their output. Database timeouts under moderate load. API endpoints returning 500 errors with no useful log messages. A security audit that surfaced hardcoded credentials in three services. An agent workflow that failed silently for days because nobody had implemented exception handling on the external API calls.

The correlation was not a coincidence.

This is the vibe coding trap. And it is catching teams that should know better.

What Vibe Coding Actually Is

Andrej Karpathy coined the term in February 2025, describing it as “fully giving in to the vibes, embracing exponentials, and forgetting that the code even exists.” The concept is simple: describe what you want in natural language, accept what the AI generates, and move on. No deep review of structure. No questioning whether the output matches production requirements. Just vibes.

By Y Combinator’s Winter 2025 batch, 25% of startups reported codebases that were 95% AI-generated. Collins Dictionary named “vibe coding” its Word of the Year for 2025. Fast Company reported in September 2025 that the “vibe coding hangover” had arrived, with senior engineers describing “development hell” when inheriting AI-generated codebases.

To be clear about what I am arguing: AI coding tools are extraordinary. Claude, Copilot, Cursor, and their peers have genuinely changed what one person can build. I use them daily. I have no interest in romanticizing the era of writing every line by hand.

The problem is not the tools. The problem is the assumption that the tool replaces the architect.

Ox Security analyzed 300 open-source projects in 2025, including 50 that were substantially AI-generated, and published their findings under the title “Army of Juniors.” Their conclusion: AI-generated code is “highly functional but systematically lacking in architectural judgment.” Ninety to one hundred percent of the AI-generated codebases had what they called “by-the-book fixation,” meaning the AI follows textbook patterns rather than tailoring solutions to the actual production context. Eighty to ninety percent had avoided refactoring entirely, because the AI only cares about satisfying the prompt, not improving the surrounding code.

GitClear, which analyzed 211 million changed lines of code from 2020 through 2024, found that code refactoring (the structural improvement work that keeps systems healthy) dropped from 25% of changed lines in 2021 to less than 10% in 2024. Code clones grew 4x. Their summary was pointed: “AI-generated code resembles an itinerant contributor, prone to violate the DRY-ness of the repos visited.”

The pattern is consistent. AI coding tools optimize for getting code that runs. They do not optimize for code that survives.

The Prompt Is the Architecture

The most important thing I have learned using AI coding tools over the past two years is this: the quality of what comes out is determined almost entirely by the specificity of what goes in.

A novice asks for a thing. An architect specifies a system.

Let me show you what that looks like in practice, across three scenarios that come up in almost every AI agent project.

Logging: The First Thing That Fails in Production

What a novice asks:

“Build me a logger for my AI agent project”

The AI delivers something like this:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def process_agent_request(request):
    logger.info("Processing request")
    # ... agent logic
    logger.info("Request done")
```

This runs. It logs. A developer who has not operated a production system at scale looks at it and thinks: good enough.

It is not good enough. When this agent starts failing at 2 AM, the logs will contain Processing request and Request done with no correlation ID, no timestamp in ISO format, no structured fields for querying in CloudWatch or Datadog, no way to trace which request ID failed, and no indication of whether you are looking at production or staging. The on-call engineer will spend an hour filtering through free-text logs trying to reconstruct what happened.
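To make the contrast concrete, here is a minimal stdlib-only sketch of what a structured log line buys you. The field names mirror the requirements above; `structured_log_line` is an illustrative helper, not part of any real logging stack:

```python
import json
from datetime import datetime, timezone

def structured_log_line(level: str, event: str, **fields) -> str:
    """Render one event as a single JSON line that an aggregator can index."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "event": event,
        **fields,
    }
    return json.dumps(record)

# The free-text version: only substring search is possible.
free_text = "Processing request"

# The structured version: every field becomes queryable in CloudWatch or Datadog.
line = structured_log_line(
    "info",
    "agent.request.started",
    correlation_id="req-7f3a",
    service="order-fulfillment-agent",
    environment="production",
)
parsed = json.loads(line)
```

The on-call engineer can now filter on `correlation_id` directly instead of grepping free text.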

What an architect asks:

“I need a structured logger for a production AI agent service running on AWS ECS and Lambda. Requirements: JSON output format for log aggregation (CloudWatch and Datadog compatible), log level configurable via LOG_LEVEL environment variable, automatic context fields: correlation_id, agent_id, service name, environment, and ISO timestamp. Use structlog with stdlib integration. Include rotating file handler for local development (10MB max, 5 backups), switch to stdout JSON for container environments. Mask any field named api_key, token, or password with REDACTED.”

As a result, the AI delivers something like this:

```python
import logging
import logging.handlers
import os
from typing import Any

import structlog

def mask_sensitive(_logger: Any, _method: str, event_dict: dict) -> dict:
    """Redact sensitive fields before they reach any log sink."""
    for key in {"api_key", "token", "password", "secret"}:
        if key in event_dict:
            event_dict[key] = "***REDACTED***"
    return event_dict

def configure_logging(service_name: str, agent_id: str) -> structlog.BoundLogger:
    environment = os.getenv("ENVIRONMENT", "development")
    log_level = os.getenv("LOG_LEVEL", "INFO").upper()

    renderer = (
        structlog.processors.JSONRenderer()
        if environment != "development"
        else structlog.dev.ConsoleRenderer(colors=True)
    )

    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,
            structlog.stdlib.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            mask_sensitive,
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            renderer,
        ],
        wrapper_class=structlog.make_filtering_bound_logger(
            getattr(logging, log_level)
        ),
        context_class=dict,
        # Route rendered events through the stdlib handlers configured below,
        # so the rotating file handler actually receives them.
        logger_factory=structlog.stdlib.LoggerFactory(),
    )

    if environment == "development":
        handler = logging.handlers.RotatingFileHandler(
            "agent.log", maxBytes=10 * 1024 * 1024, backupCount=5
        )
    else:
        handler = logging.StreamHandler()

    handler.setLevel(log_level)
    # "%(message)s" passes the rendered JSON/console line through unchanged.
    logging.basicConfig(handlers=[handler], level=log_level, format="%(message)s")

    return structlog.get_logger(
        service=service_name,
        agent_id=agent_id,
        environment=environment,
    )

logger = configure_logging(
    service_name="order-fulfillment-agent",
    agent_id=os.getenv("AGENT_ID", "local-dev"),
)

def process_agent_request(request_id: str, payload: dict) -> None:
    structlog.contextvars.bind_contextvars(correlation_id=request_id)
    logger.info("agent.request.started", payload_keys=list(payload.keys()))
    try:
        # ... agent logic
        logger.info("agent.request.completed", duration_ms=42)
    except Exception as exc:
        logger.exception("agent.request.failed", error_type=type(exc).__name__)
        raise
    finally:
        structlog.contextvars.clear_contextvars()
```

The difference is not style. Every choice here maps to a real production scenario: the JSON renderer means your log aggregation tool can parse fields without regex; the correlation ID means you can trace a single request across ten services; the sensitive field masking means api_key never shows up in your Datadog logs; the rotating handler means local development does not fill a developer’s disk. The generic version has none of this because the person writing the prompt did not know to ask for it.
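The masking requirement is easy to verify in isolation. Here is a stdlib-only sketch of the same idea the `mask_sensitive` processor implements; `mask_sensitive_fields` is a hypothetical standalone helper with no structlog dependency:

```python
def mask_sensitive_fields(event: dict) -> dict:
    """Redact well-known secret fields in a log event dict, in place."""
    for key in ("api_key", "token", "password", "secret"):
        if key in event:
            event[key] = "***REDACTED***"
    return event

# The secret never reaches the sink; the rest of the event is untouched.
event = mask_sensitive_fields({"event": "agent.call", "api_key": "sk-live-123"})
```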

API Endpoints: Where Security Gaps Live

What a novice asks:

“Build me an API endpoint for my AI agent”

The resulting code:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route(‘/agent’, methods=[‘POST’])
def run_agent():
data = request.json
result = process_agent(data[‘input’])
return jsonify({‘result’: result})

if __name__ == ‘__main__’:
app.run(debug=True)

Shipped to production, this endpoint has: no authentication, no rate limiting, no input validation, no error handling, no request logging, no timeout, and debug=True enabled. An unauthenticated caller can hammer the agent with requests until your LLM bill for the month arrives. When data['input'] is missing, the endpoint crashes with a KeyError that returns a 500 and leaks stack trace details. There is no correlation ID to trace the request.
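To see the failure mode without running Flask, here is a sketch that strips the handler down to plain functions. `novice_handler` and `validated_handler` are illustrative stand-ins; the real `process_agent` call is omitted:

```python
def novice_handler(data: dict) -> dict:
    """Mirrors the generated Flask handler: direct key access, no validation."""
    return {"result": data["input"]}  # stands in for process_agent(data['input'])

def validated_handler(data: dict) -> dict:
    """What schema validation buys you: a clean 422 instead of an opaque 500."""
    value = data.get("input")
    if not isinstance(value, str) or not (1 <= len(value) <= 4096):
        return {"status": 422, "detail": "input must be a string of 1-4096 chars"}
    return {"status": 200, "result": value}

try:
    novice_handler({})  # request body missing the 'input' key
    crashed = False
except KeyError:
    crashed = True  # Flask surfaces this as an unhandled 500 with a stack trace

rejected = validated_handler({})  # clean, machine-readable rejection
accepted = validated_handler({"input": "hello"})
```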

What an architect asks:

“I need a FastAPI endpoint for a production AI agent service. Requirements: API key authentication via X-API-Key header, rate limiting at 10 requests per minute per key using slowapi, Pydantic request validation, RFC 7807 Problem Details for error responses, OpenAPI docs auto-generated, 30-second request timeout, every request logged with correlation ID and duration, health check endpoint. Map AgentTimeoutError to 408 and AgentCapacityError to 503.”

Now, the resulting code:

```python
import os
import uuid
import time
from typing import Optional

from fastapi import FastAPI, Depends, HTTPException, Request, status
from fastapi.security.api_key import APIKeyHeader
from pydantic import BaseModel, Field
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

from .agent import run_agent_logic, AgentTimeoutError, AgentCapacityError
from .logging_config import configure_logging

logger = configure_logging(service_name="agent-api", agent_id=os.getenv("AGENT_ID"))
limiter = Limiter(key_func=get_remote_address)

app = FastAPI(
    title="AI Agent API",
    version="1.0.0",
    description="Production AI agent endpoint with auth and rate limiting",
)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

API_KEY_HEADER = APIKeyHeader(name="X-API-Key", auto_error=False)
VALID_API_KEYS = set(os.environ["AGENT_API_KEYS"].split(","))

class AgentRequest(BaseModel):
    input: str = Field(..., min_length=1, max_length=4096)
    session_id: Optional[str] = Field(None)

class AgentResponse(BaseModel):
    request_id: str
    result: str
    duration_ms: int

def verify_api_key(api_key: str = Depends(API_KEY_HEADER)) -> str:
    if not api_key or api_key not in VALID_API_KEYS:
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail={
                "type": "https://tools.ietf.org/html/rfc7807",
                "title": "Unauthorized",
                "status": 401,
                "detail": "Invalid or missing API key",
            },
        )
    return api_key

@app.middleware("http")
async def request_logging_middleware(request: Request, call_next):
    request_id = str(uuid.uuid4())
    request.state.request_id = request_id
    start = time.monotonic()
    response = await call_next(request)
    duration_ms = int((time.monotonic() - start) * 1000)
    response.headers["X-Request-ID"] = request_id
    logger.info(
        "http.request",
        method=request.method,
        path=request.url.path,
        status_code=response.status_code,
        duration_ms=duration_ms,
        request_id=request_id,
    )
    return response

@app.get("/health")
async def health_check():
    return {"status": "healthy", "version": "1.0.0"}

@app.post("/agent/run", response_model=AgentResponse)
@limiter.limit("10/minute")
async def run_agent(
    request: Request,
    body: AgentRequest,
    api_key: str = Depends(verify_api_key),
):
    request_id = request.state.request_id
    start = time.monotonic()
    try:
        result = await run_agent_logic(
            input_text=body.input,
            session_id=body.session_id,
            timeout=30,
        )
        return AgentResponse(
            request_id=request_id,
            result=result,
            duration_ms=int((time.monotonic() - start) * 1000),
        )
    except AgentTimeoutError:
        raise HTTPException(
            status_code=408,
            detail={"type": "timeout", "title": "Agent Timeout", "status": 408},
        )
    except AgentCapacityError:
        raise HTTPException(
            status_code=503,
            detail={"type": "capacity", "title": "Agent At Capacity", "status": 503},
        )
```

The difference in prompt length here is significant and intentional. The architect did not write a longer prompt to be thorough for its own sake. They wrote a longer prompt because they already knew, from experience, what every missing requirement costs at 3 AM on a Saturday.

Database Access: Where Simple Gets Dangerous Fast

What a novice asks:

“Connect my agent to a database”

What the AI delivers:

```python
import sqlite3

def get_agent_history(agent_id):
    conn = sqlite3.connect('agent.db')
    cursor = conn.cursor()
    # SQL injection waiting to happen
    cursor.execute(f"SELECT * FROM agent_history WHERE agent_id = '{agent_id}'")
    results = cursor.fetchall()
    conn.close()
    return results

def save_agent_result(agent_id, result):
    conn = sqlite3.connect('agent.db')
    cursor = conn.cursor()
    cursor.execute(f"INSERT INTO agent_history VALUES ('{agent_id}', '{result}')")
    conn.commit()
    conn.close()
```

The f-string SQL construction is a textbook SQL injection vulnerability. Every connection opens and closes immediately with no pooling, meaning 500 concurrent agents will exhaust database connections in seconds. There is no retry logic, so a transient network hiccup kills the operation entirely. And SQLite has no place in a production agent service that needs to handle concurrent writes.
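The injection is not theoretical. Here is a self-contained sketch against an in-memory SQLite database showing how the f-string version leaks every row while a parameterized query does not (table and values are invented for the demonstration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agent_history (agent_id TEXT, result TEXT)")
conn.execute("INSERT INTO agent_history VALUES ('agent-a', 'private-a')")
conn.execute("INSERT INTO agent_history VALUES ('agent-b', 'private-b')")

malicious = "agent-a' OR '1'='1"

# f-string version: the payload rewrites the WHERE clause and dumps every row.
unsafe = conn.execute(
    f"SELECT * FROM agent_history WHERE agent_id = '{malicious}'"
).fetchall()

# Parameterized version: the payload is treated as a literal and matches nothing.
safe = conn.execute(
    "SELECT * FROM agent_history WHERE agent_id = ?", (malicious,)
).fetchall()
```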

What an architect asks:

“I need a production database layer for a PostgreSQL-backed AI agent service handling ~500 concurrent agents. Requirements: async SQLAlchemy with connection pooling (pool_size=10, max_overflow=5, pool_timeout=30 seconds), retry with exponential backoff on transient errors like connection reset and deadlock (max 3 attempts, 0.5 to 4 second wait), parameterized queries via ORM to prevent SQL injection, automatic transaction rollback on failure, pool health checks on checkout, and graceful pool disposal on shutdown.”

From this prompt, the AI generates something like this:

```python
import logging
import os
from contextlib import asynccontextmanager
from typing import AsyncGenerator

import asyncpg
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine, async_sessionmaker
from sqlalchemy.orm import DeclarativeBase, mapped_column, Mapped
from sqlalchemy import String, Text, DateTime, func, select
from sqlalchemy.exc import OperationalError
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

logger = logging.getLogger(__name__)

DATABASE_URL = os.environ["DATABASE_URL"]

engine = create_async_engine(
    DATABASE_URL.replace("postgresql://", "postgresql+asyncpg://"),
    pool_size=10,
    max_overflow=5,
    pool_timeout=30,
    pool_pre_ping=True,  # validates connection health before each checkout
    echo=False,
)

SessionLocal = async_sessionmaker(
    engine,
    class_=AsyncSession,
    expire_on_commit=False,
)

class Base(DeclarativeBase):
    pass

class AgentHistory(Base):
    __tablename__ = "agent_history"

    id: Mapped[int] = mapped_column(primary_key=True)
    agent_id: Mapped[str] = mapped_column(String(64), index=True, nullable=False)
    result: Mapped[str] = mapped_column(Text, nullable=False)
    created_at: Mapped[DateTime] = mapped_column(
        DateTime(timezone=True), server_default=func.now()
    )

@asynccontextmanager
async def get_session() -> AsyncGenerator[AsyncSession, None]:
    async with SessionLocal() as session:
        try:
            yield session
            await session.commit()
        except Exception:
            await session.rollback()
            raise

TRANSIENT_ERRORS = (
    OperationalError,
    asyncpg.PostgresConnectionError,
    asyncpg.DeadlockDetectedError,
)

@retry(
    retry=retry_if_exception_type(TRANSIENT_ERRORS),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=0.5, max=4),
    reraise=True,
)
async def get_agent_history(agent_id: str) -> list[AgentHistory]:
    async with get_session() as session:
        # ORM handles parameterization; no string interpolation
        stmt = (
            select(AgentHistory)
            .where(AgentHistory.agent_id == agent_id)
            .order_by(AgentHistory.created_at.desc())
            .limit(50)
        )
        result = await session.execute(stmt)
        return list(result.scalars().all())

@retry(
    retry=retry_if_exception_type(TRANSIENT_ERRORS),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=0.5, max=4),
    reraise=True,
)
async def save_agent_result(agent_id: str, result: str) -> AgentHistory:
    async with get_session() as session:
        record = AgentHistory(agent_id=agent_id, result=result)
        session.add(record)
        await session.flush()
        logger.info("db.history.saved", extra={"agent_id": agent_id, "id": record.id})
        return record

async def shutdown_db() -> None:
    await engine.dispose()
    logger.info("db.pool.disposed")
```

SQL injection is gone because the ORM handles parameterization. Connection pooling means 500 concurrent agents share 10-15 connections cleanly. The retry decorator handles the transient network errors that PostgreSQL throws under load. And when the service shuts down, the pool disposes cleanly instead of leaving dangling connections on the database server.
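The retry policy in the architect's prompt amounts to a simple delay schedule. This sketch models the doubling-and-clamping behavior described there; `backoff_delays` is an illustrative helper, and tenacity's internal timing math differs in detail:

```python
def backoff_delays(attempts: int, base: float = 0.5, cap: float = 4.0) -> list[float]:
    """Exponential backoff schedule clamped to [base, cap] seconds.

    The wait doubles on each retry: 0.5s, 1s, 2s, 4s, then stays at the cap.
    """
    return [min(cap, base * (2 ** i)) for i in range(attempts)]

delays = backoff_delays(3)
```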

The AI generated both versions. The difference is entirely in the prompt, and the prompt reflects the accumulated experience of the person writing it.

The Architecture Tax

There is a cost to shipping code you do not fully understand, and it does not show up on the first invoice. It accumulates quietly, and then it arrives all at once.

I call it the architecture tax.

The most common symptoms I see in vibe-coded enterprise codebases are not dramatic. They are mundane failures that compound. A service with no structured logging that becomes unmaintainable the first time you need to debug a production issue. An agent workflow with no exception handling on external API calls that fails silently for four days before anyone notices the downstream system has stale data. Configuration values hardcoded directly in the source file that make it impossible to promote the service from staging to production without a code change. An authentication layer that was never implemented because the AI did not know authentication was a requirement, and the developer did not know to ask.

Ana Bildea, writing about AI-generated technical debt, put the compounding problem precisely: “Traditional technical debt accumulates linearly. You skip a few tests, take some shortcuts, defer some refactoring. The pain builds gradually until someone allocates a sprint to clean it up. AI technical debt is different. It compounds.”

It compounds because vibe-coded systems tend to be copied. A developer creates a service with an insecure pattern, ships it, and the next developer on the team treats it as the template. The anti-pattern replicates. Ox Security found that AI-generated code frequently generates the same bugs when generating similar functionality, because the model learned from training data that already contained those patterns. Their term for it is “Bugs Dejavu”: the AI regenerates well-known bugs every time it generates that category of code.

The security dimension of the architecture tax is particularly sharp. Research from NYU published on Spiceworks found that roughly 40% of code generated by GitHub Copilot contained security vulnerabilities. A separate study found security issues in 29.1% of generated Python code, with a 6.4% rate of credential leakage. The developers shipping this code are not making deliberate security trade-offs. They are not aware these patterns are vulnerable. That is the trap.

A METR study published in July 2025 found something that should concern anyone managing a team using AI coding tools: when experienced developers were allowed to use AI, they took 19% longer to complete tasks compared to working without it. More striking, those same developers estimated afterward that AI had sped them up by 20%. The perception gap between “I feel more productive” and “I am more productive” is exactly where the architecture tax hides.

How Experienced Engineers Actually Use These Tools

There is no numbered framework here because the real answer is a mindset, not a checklist.

Experienced engineers who use AI coding tools effectively share a common approach: they arrive at the AI with architecture already decided. They know what the system needs to do before they ask the AI to build any part of it. The prompt is not the start of the thinking process. It is the end of it.

This means doing the context work before touching the AI. What are the non-functional requirements for this component? Latency, concurrency, failure modes, security surface, observability needs, environment promotion path. An architect designing a logger asks: how will on-call engineers query these logs at 2 AM? What data do they need to reconstruct the failure? That answer determines the logging format, the field structure, the sink. The AI then implements that specification. It does not determine it.

The review loop matters as much as the prompt. AI-generated code that looks correct can contain subtle issues that only surface under production conditions: a race condition in async code, a missing index on a query that performs fine with 100 rows and destroys performance at 100,000, a retry implementation that causes thundering herd under failure conditions. Experienced engineers review every generated block with the same skepticism they would apply to a junior engineer’s pull request. Not hostile skepticism, but genuine scrutiny.
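The thundering-herd case is worth a concrete sketch. One common mitigation is "full jitter" backoff, where each client picks a random delay inside the exponential window instead of retrying on the exact same schedule; `jittered_delay` here is an illustrative helper, not from any of the generated code above:

```python
import random

def jittered_delay(attempt: int, base: float = 0.5, cap: float = 4.0) -> float:
    """Pick a random delay inside the clamped exponential window ("full jitter").

    Spreads retries out so that thousands of clients failing at the same
    moment do not all hammer the recovering service at the same moment.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Without jitter, every client retries at exactly 0.5s, 1s, 2s, 4s after a
# shared outage; with full jitter the retries spread across the window.
samples = [jittered_delay(3) for _ in range(1000)]
```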

Prompt specificity is a skill that takes time to develop, and it is directly correlated with domain expertise. Teams without AI prompting training see 60% lower productivity gains compared to teams with structured training, according to data published by DX Research. That gap is not about knowing the right magic words. It is about knowing enough about the problem domain to specify what good looks like.

The architecture-first approach also applies at the system level. Before any AI-generated code is written, the data model should be sketched, the API contracts defined, the failure modes identified, the deployment topology understood. An AI can implement any of these things given a clear enough specification. It cannot decide them for you, and it will not refuse to generate code that contradicts decisions you have not made yet.

Simon Willison, a prominent software developer who has written extensively about AI coding tools, put the right frame on this: “If an LLM wrote every line of your code, but you’ve reviewed, tested, and understood it all, that’s not vibe coding in my book. That’s using an LLM as a typing assistant.” The question is not whether AI wrote the code. It is whether a person who understands the system owns the code.

What Should Enterprise Teams Do?

When organizations discover, after adopting AI coding tools, that their production stability has deteriorated, the root cause is almost always the same. They treated the tooling adoption as a technology decision rather than a process decision.

They gave developers the tools. They did not give developers a framework for using the tools responsibly. They measured feature velocity. They did not measure architectural quality. They celebrated the output volume increase. They did not notice the debt accumulating underneath it.

The teams that use AI coding tools well treat every generated component as code review material, not finished output. They write architecture decision records before writing prompts. They have explicit standards for what must be present in any AI-generated service: structured logging, parameterized queries, explicit error handling, configuration from environment variables, health check endpoints. These standards get embedded into the prompts themselves, sometimes literally pasted as requirements. The AI consistently meets specifications it is given. The problem is consistently the specifications that are missing.
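As a sketch of what "literally pasted as requirements" can look like, here is a hypothetical standards preamble kept in version control and prepended to every generation prompt. `PROMPT_STANDARDS` and `build_prompt` are invented names for illustration:

```python
# Hypothetical team-wide preamble, kept in version control next to the
# architecture decision records and pasted at the top of every prompt.
PROMPT_STANDARDS = """\
All generated services must include:
- structured JSON logging with correlation IDs
- parameterized queries only (no string-built SQL)
- explicit error handling on every external call
- configuration read from environment variables
- a /health endpoint
"""

def build_prompt(task: str) -> str:
    """Prepend the non-negotiable standards to any feature request."""
    return f"{PROMPT_STANDARDS}\nTask: {task}"

prompt = build_prompt("Build an order-lookup endpoint")
```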

Forrester predicted that by 2025, more than 50% of technology decision-makers would face moderate to severe technical debt. AI-assisted development, used without architectural oversight, is accelerating that timeline.

The Real Skill Is Still the Same

Thirty years of enterprise IT has given me a useful perspective on technology waves. I have watched organizations scramble to adopt each new capability, sometimes successfully, often not. The pattern that determines success is consistent: technology amplifies whatever processes and expertise are already present. Good processes get better. Weak processes fail faster and at larger scale.

AI coding tools are extraordinary amplifiers. They amplify the output of an architect who arrives with clear requirements and strong domain knowledge. They also amplify the output of someone who does not know what they do not know, producing more code, faster, with more subtle problems embedded in it.

The vibe coding problems we are seeing in enterprise environments are not AI problems. They are expertise problems. The solution is not to slow down AI adoption. It is to make sure the people directing AI coding tools understand what they are building well enough to specify it.

The AI handles the implementation. That part is genuinely remarkable. The part it cannot handle, and has never been able to handle, is knowing what good looks like before you ask for it.

That is still the architect’s job. AI can be your junior developer. But if no architect is evaluating what that junior developer has done, the problems will not trickle in. They will arrive all at once, and they will be catastrophic.
