Agentic Tool Calling Explained: How AI Agents Actually Think

It happens in almost every cohort: we’re deep into a session on AI agents, someone has been nodding along for an hour, and then they raise a hand and ask: “Wait. So the AI isn’t actually running these queries?”

No. It absolutely isn’t.

That realization changes how people think about everything they’re building. And yet, most content on AI tool calling either buries this fact between code snippets or skips it entirely in favor of another tutorial. The goal of this piece is the mental model. The real architecture. What an LLM is actually doing when it “uses” a tool, and why that distinction matters enormously once you start building things that touch real enterprise systems.

The Senior Manager Who Never Does the Work

Here’s the analogy I keep returning to, because it holds up under scrutiny.

Picture a seasoned VP in a large enterprise. They know the org chart cold. Legal handles contracts. Finance owns budget approvals. IT provisions access. HR manages escalations. Sales Operations maintains CRM data. This person never drafts a contract, never runs a budget model, never provisions an account, and never enters a Salesforce record themselves. But they know exactly who to call for what, and exactly what information to give them.

When you bring that VP a complex business problem, they decompose it. They figure out which functions need to be engaged, what information each function needs, in what sequence to engage them, and how to synthesize what comes back into a coherent recommendation.

That’s an LLM with tool calling enabled. Precisely and completely.

When an AI agent “searches your SharePoint,” “queries your Snowflake data warehouse,” or “creates a record in Salesforce,” the model itself is not doing any of those things. It is producing a structured decision about what needs to happen next. The actual work happens in your application code, your API layer, your backend systems. The model is the decision layer. Your infrastructure is the execution layer. That separation is load-bearing.

Martin Fowler’s team documented this precisely: the LLM creates a data structure describing the call, then passes it to a separate program for execution. That data structure is JSON. It specifies the tool name and the parameters. The application picks it up, calls the relevant system, gets the result, and feeds it back into the conversation. From the user’s perspective, the interaction feels fluid and intelligent. Behind the scenes, there is a very deliberate handoff at every step.
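To make that handoff concrete, here is a minimal sketch of the data structure and the application-side parsing step. The shape mirrors the OpenAI function-calling format; the tool name, arguments, and SQL are hypothetical, and note that the arguments arrive as a JSON string the application must parse.

```python
import json

# A hypothetical tool call as the model might emit it (OpenAI-style shape).
# The model only produces this structure; it never executes anything itself.
tool_call = {
    "id": "call_abc123",            # correlation id assigned by the API
    "type": "function",
    "function": {
        "name": "query_snowflake",  # hypothetical tool name
        # Arguments arrive as a JSON string, not a parsed object
        "arguments": '{"warehouse": "ANALYTICS", "sql": "SELECT vendor_id FROM spend"}',
    },
}

# The application, not the model, parses the structure and dispatches it
# to the real system, then feeds the result back into the conversation.
name = tool_call["function"]["name"]
args = json.loads(tool_call["function"]["arguments"])
```

Everything after the parse, connecting to Snowflake, running the query, handling errors, is ordinary application code that the model never sees.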

What the Model Actually Sees

Before the model can decide which tool to call, it needs something like an internal org chart.

When developers build an AI agent, they define a set of tools and describe each one to the model in natural language. Each tool definition includes a name, a plain-English description of what it does and when to use it, and a schema defining the inputs it expects. OpenAI calls this “function calling.” Anthropic calls it “tool use.” Google has their own equivalent. Different terminology, same underlying pattern.

The model reads that org chart, then decides who to call based on the current situation. The LLM itself never sees your Salesforce schema or your Snowflake tables. It sees descriptions of tools that can access those systems.

This is where something counterintuitive surfaces. The quality of the model’s decisions is directly tied to the quality of those descriptions, not to the sophistication of the underlying systems. There are enterprise deployments where agents consistently called the wrong tool or produced malformed parameters, not because the model was weak, but because the tool definitions were written by developers who understood the code but hadn’t thought through how a language model would interpret them. Descriptions that were technically accurate turned out to be practically useless because they overlapped, were too abstract, or omitted the contextual cues the model needed to distinguish one tool from another.

Writing effective tool descriptions is its own discipline. It sits at the intersection of data architecture, product thinking, and prompt engineering. In most enterprise organizations today, nobody explicitly owns it, which means it tends to fall between teams and produce agents that behave inconsistently in ways that are genuinely hard to diagnose.
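To show what that discipline looks like in practice, here is a sketch of a tool definition in the OpenAI function-calling shape. The tool name, fields, and wording are hypothetical; the point is that the description carries the contextual cues the model needs, including when not to use the tool.

```python
# A hypothetical tool definition. Only this description and schema ever
# reach the model -- never the schema of the underlying contract system.
lookup_contract_renewals = {
    "type": "function",
    "function": {
        "name": "lookup_contract_renewals",
        "description": (
            "Return supplier contracts with renewal dates inside a given "
            "look-ahead window. Use this when the user asks about upcoming "
            "renewals; do NOT use it for spend or support-case questions."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "supplier_ids": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Supplier IDs from the ERP, e.g. 'SUP-1042'",
                },
                "window_days": {
                    "type": "integer",
                    "description": "Look-ahead window in days, e.g. 90",
                },
            },
            "required": ["supplier_ids", "window_days"],
        },
    },
}
```

A technically accurate description like “queries the contract table” would pass code review and still fail here, because it gives the model nothing to distinguish this tool from a spend query or a case lookup.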

How We Got Here

To see why this framing of tool calling matters, it helps to trace the progression honestly.

The chatbots enterprises deployed through the 2010s and into the early 2020s were essentially pattern-matching systems. They recognized intent patterns and returned pre-configured responses. They were as capable as their training data and decision trees allowed, and they hit walls constantly. Outside the defined paths, they either deflected or confidently said something wrong.

RAG came next. Retrieval-Augmented Generation gave LLMs access to external knowledge by fetching relevant documents before generating a response. Instead of relying only on training data, a model could search a document store, find relevant content from your SharePoint site or internal wiki, and generate an answer grounded in that content. This was a meaningful improvement, particularly for knowledge-intensive applications like compliance Q&A or internal technical support.

But RAG is fundamentally passive. The model reads. It does not act. It cannot write a CRM record, trigger an SAP workflow, run a query against your data warehouse, or send a notification. It retrieves and summarizes. Valuable, but not agentic.

Tool calling changed that. Now the model’s output can be an action intent, structured and specific, not just a text response. The model became a decision engine, not just a language engine. And when you couple that capability with a loop that feeds results back into context, you get the thing people are calling an agent: a system that can reason across multiple steps, calling different capabilities in sequence to complete complex tasks.

The progression from chatbot to RAG system to agent is not just a technical evolution. Each stage expanded the surface area of what AI could affect inside an enterprise. Each stage also expanded the governance burden that comes with it.

The Loop That Makes It Work

Single tool calls are useful. Sequential reasoning across multiple tools is where genuinely agent-like behavior emerges.

Consider a real enterprise scenario. A supply chain manager asks: “Which of our top suppliers by spend last quarter have contract renewals coming up in the next 90 days, and do we have open support cases with any of them?”

A capable agent might call your ERP to pull last quarter’s supplier spend data, identify the top vendors by spend, query a contract management system for renewal dates, and then pull open cases from Salesforce for each of those vendors. It synthesizes all of that into a summary with flags where action is needed.

Each call happens sequentially. The model receives the ERP result, updates its understanding of the situation, decides the next logical step, and calls the next tool. Your orchestration framework manages the loop. The model doesn’t know it’s in a loop. It just sees its current context, which grows richer with each tool result, and keeps deciding what to do next.
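The loop described above can be sketched in a few lines. This is a toy, with a stubbed “model” and stubbed tools standing in for a real LLM call and real ERP/Salesforce APIs, but the control flow is the same: the orchestrator owns the loop, and the model just keeps deciding the next step from a growing context.

```python
def fake_model(messages):
    """Stand-in for an LLM call: decides the next action from context."""
    tool_msgs = [m for m in messages if m["role"] == "tool"]
    if not tool_msgs:
        return {"tool": "get_top_suppliers", "args": {"quarter": "Q3"}}
    if not any(m["name"] == "get_open_cases" for m in tool_msgs):
        return {"tool": "get_open_cases", "args": {"supplier": "SUP-1042"}}
    return {"answer": "SUP-1042 has 2 open cases and renews in 45 days."}

# Hypothetical tool implementations; real ones would call ERP / CRM APIs.
TOOLS = {
    "get_top_suppliers": lambda quarter: ["SUP-1042"],
    "get_open_cases": lambda supplier: {"supplier": supplier, "open": 2},
}

def run_agent(user_question, max_steps=5):
    messages = [{"role": "user", "content": user_question}]
    for _ in range(max_steps):              # the orchestration loop
        decision = fake_model(messages)
        if "answer" in decision:            # model chose to respond
            return decision["answer"], messages
        result = TOOLS[decision["tool"]](**decision["args"])  # execute
        messages.append({"role": "tool", "name": decision["tool"],
                         "content": str(result)})             # feed back
    raise RuntimeError("agent did not converge")

answer, transcript = run_agent("Which top suppliers have open cases?")
```

Note the `max_steps` cap: the model has no idea it is in a loop, so the orchestrator, not the model, is responsible for deciding when to stop.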

This is also where context window management becomes an architectural concern. Every tool response gets appended to the context. Verbose API payloads can consume context budget fast. In enterprise settings, where tools may return rich data objects from SAP, Databricks, or other systems, you need to think deliberately about what portion of each response needs to reach the model versus being filtered or summarized upstream. Teams that treat this as a developer detail rather than an architecture decision tend to build agents that degrade unpredictably at the edges of their context windows.
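One way to treat that as an architecture decision rather than a developer detail is to put a filtering step between every tool response and the model. A minimal sketch, with hypothetical field names:

```python
def summarize_for_context(payload, keep_fields, max_chars=500):
    """Keep only the fields the model needs, then hard-cap the size."""
    slim = {k: payload[k] for k in keep_fields if k in payload}
    text = str(slim)
    return text[:max_chars] + ("…" if len(text) > max_chars else "")

# A hypothetical verbose payload from an upstream system.
raw_response = {
    "vendor_id": "SUP-1042",
    "renewal_date": "2025-09-30",
    "audit_blob": "x" * 10_000,   # noise the model never needs to see
}
context_entry = summarize_for_context(
    raw_response, keep_fields=["vendor_id", "renewal_date"])
```

The filtering policy (which fields, what cap, whether to summarize with a cheaper model) is a design decision per tool, made once and reviewed, rather than something each developer improvises.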

Newer models, including recent versions of GPT-4 and Claude, now support parallel tool calling, where the model can request multiple independent tool calls simultaneously rather than waiting for each in sequence. This speeds up agents considerably but adds orchestration complexity, since results can arrive out of order and need to be reassembled correctly before being fed back to the model.
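The reassembly problem is easy to state and easy to get wrong: results must be matched back to the call that requested them by id, not by arrival order. A sketch, with hypothetical tools:

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(tool_calls, tools):
    """Execute independent tool calls concurrently, reassemble by id."""
    def execute(call):
        return call["id"], tools[call["name"]](**call["args"])
    with ThreadPoolExecutor() as pool:
        results = dict(pool.map(execute, tool_calls))
    # Return results in the order the model requested them, keyed by id,
    # so out-of-order completion cannot scramble the conversation.
    return [{"tool_call_id": c["id"], "content": results[c["id"]]}
            for c in tool_calls]

tools = {"spend": lambda q: f"spend:{q}", "cases": lambda s: f"cases:{s}"}
calls = [
    {"id": "call_1", "name": "spend", "args": {"q": "Q3"}},
    {"id": "call_2", "name": "cases", "args": {"s": "SUP-1042"}},
]
messages = run_parallel(calls, tools)
```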

The Enterprise Stakes

Here’s where AI agent tool calling stops being theoretical.

Salesforce Agentforce is in production at enterprise customers, with agents that can update records, create cases, and trigger workflow automations. SAP expanded Joule to include 15 or more role-based agents that interact across SAP modules. Snowflake’s Cortex Agents route analytical questions across different tools in the data platform based on what the question requires, including Cortex Search for document retrieval and Cortex Analyst for structured data queries. These are real production deployments where AI models are making decisions that result in real actions on real enterprise data.

The model is the decision-maker. Your tool definitions are its job description. Your tool permissions define what it is even allowed to decide. Your orchestration layer routes those decisions to execution. Your observability stack produces the audit trail.

Every principle of enterprise data governance that matters in traditional systems still applies here. Least-privilege access control. Change management. Audit logging. Data lineage. Role-based permissions. These don’t disappear when AI enters the picture. They become more important, because the decision-maker is now a probabilistic system rather than a deterministic one.

I’ve spent thirty years working in enterprise IT, including data architecture work for Fortune 100 companies using platforms like Teradata, Snowflake, and Databricks. One pattern I’ve watched play out repeatedly: the teams that treat AI agents as a new category of enterprise integration, subject to the same rigor as any system that touches production data, do far better than the teams that treat them as a standalone AI project. The agent is not special. It is a decision system that calls your APIs. Govern it accordingly.

The enterprises that are deploying agents successfully are the ones treating the boundary between model decision and execution as a first-class architectural concern. The ones struggling are the ones that gave the model a list of tools, pointed it at their systems, and assumed responsibility for outcomes would sort itself out.

What Agents Still Cannot Do

The most persistent misconception I encounter, even among technically sophisticated professionals, is that tool calling makes AI agents autonomous in a meaningful sense.

It doesn’t. Not yet.

The current generation of agents has no persistent memory between sessions unless you build it explicitly. They have no awareness of goals that extend beyond the current context window. They do not independently decide to go check something they weren’t asked about. Every tool call happens in response to an in-session request, within a context that began with the current conversation.

IBM’s researchers put it plainly: what the market currently calls “agents” is mostly LLMs with function calling and rudimentary planning capabilities added. That is not a dismissal. The combination is genuinely useful and increasingly capable. But it is different from a truly autonomous system that operates independently over time, and treating them as equivalent leads to misplaced trust and predictable failures in production.

Simon Willison, one of the more rigorous thinkers on this topic, settled on a working definition of an agent as “an LLM that runs tools in a loop to achieve a goal.” That’s a useful frame precisely because it doesn’t overclaim. The loop has a start. The goal comes from a human. The tools are predefined. The model reasons within those constraints.

For enterprise leaders, this is actually reassuring. You are not deploying an unpredictable autonomous system. You are deploying a structured decision-maker that operates within explicitly defined boundaries you control. The tools are yours to define. The permissions are yours to set. The execution is yours to govern. That’s not a limitation to apologize for. It’s the foundation of a trustworthy system.

The Mental Model That Changes How You Build

If there’s one frame worth carrying out of this, it’s this: the LLM is the brain. Your application is the hands.

The model reasons about what needs to happen and expresses that reasoning as structured decisions. Your infrastructure executes those decisions, returns results, and ultimately delivers output to the user. That separation is deliberate and it’s beneficial. It means you can update models as they improve without rebuilding your execution layer. It means you can instrument and observe every decision the model makes, because those decisions always flow through a controlled interface. It means you can apply governance, security controls, and approval workflows at the decision boundary, which is exactly where they belong.
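That decision boundary is also where a governance gate naturally lives. A minimal sketch, with hypothetical tool names and a hypothetical role policy: every decision the model emits passes a policy check before the execution layer touches anything.

```python
# Hypothetical set of tools that mutate enterprise systems.
WRITE_TOOLS = {"update_salesforce_record", "trigger_sap_workflow"}

def approve(decision, user_role):
    """Policy check at the decision boundary: reads for everyone,
    writes only for an approved role. Returns (allowed, reason)."""
    if decision["tool"] in WRITE_TOOLS and user_role != "ops_admin":
        return False, f"write tool {decision['tool']} requires ops_admin"
    return True, "approved"

ok, reason = approve({"tool": "trigger_sap_workflow", "args": {}}, "analyst")
```

Real deployments would route denied decisions to a human approval queue and log every check, but the placement is the point: the gate sits between the model's decision and your systems, where it can be audited.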

For anyone building AI agents in an enterprise context, the questions that matter most are not just “which model should we use” or “which framework should we adopt.” The questions are: what tools are we giving this model access to? How precisely have we described when each tool should be used? What are the constraints on what can and cannot be executed? Who reviews the audit trail, and how often?

Understanding how AI agents actually work at this level, across the decision layer, the execution layer, and the governance layer, changes how you design, how you test, and how you talk to business stakeholders about what you’re building.

The AI isn’t doing the work. It’s deciding what work needs to be done, and by whom. Once you internalize that, you can start building systems that aren’t just capable. They’re trustworthy.