Just yesterday, one of my AI agents completely hallucinated and created a 3,000-word research report about the wrong product. Not a small mistake. It confused GPT-5.4 with GPT-4.5, reversed the version numbers, and wrote an entire piece about features that didn’t exist. My OpenClaw agent was powered by Sonnet 4.6.
Then it skipped all the quality checks I’d set up, ignored my QA pipeline, and submitted the work for human-in-the-loop review as if nothing was wrong.
This wasn’t the first time. For weeks, I’d noticed a pattern of partial compliance. The agent would follow 80% of my instructions, skip the other 20%, and act like everything was fine.
So I did what any frustrated engineer would do. I called a meeting with Govind, my AI chief of staff (yes, my AI agents have their own AI chief of staff, powered by OpenClaw). We needed to figure out what the hell was going on.
That conversation led to an experiment I’ll never forget.
What If the Model Is the Problem?
Govind and I were troubleshooting this hallucination issue when he asked a question that changed everything.
“What if it’s not the instructions? What if it’s the model itself?”
The agent was running on Claude Sonnet 4.6. Great model. I use it all the time. But maybe for this specific agent, doing this specific type of work, it wasn’t the right fit.
GPT-5.4 had just dropped on March 5th. OpenAI’s latest flagship model. Everyone was raving about it on Twitter.
What if I just… swapped the brain? And kept the QA pipeline running on Opus 4.6?
See, OpenClaw (the framework I use to run my agents) supports multiple models. I can switch an agent from Claude to GPT to Gemini with a config file change. It’s like hot-swapping processors in a computer.
So we came up with a plan: switch this agent to GPT-5.4 and see if it fixes the hallucination problem. And here’s the clever part – keep Claude Opus 4.6 as the QA reviewer. Two different models, two different perspectives, catching each other’s blind spots.
In theory, this was brilliant. A dual-model architecture where GPT-5.4 does the work and Opus 4.6 verifies it.
I was excited. This could be the solution.
The Technical Switch (Easier Than You’d Think)
Switching models in OpenClaw is surprisingly simple if you’ve already got the subscriptions.
I have ChatGPT Plus. OpenClaw can authenticate via OAuth and route requests through your existing ChatGPT account. No API keys, no separate billing, just connect and go.
The process took maybe 10 minutes:
- Log in to ChatGPT Plus via OAuth in the OpenClaw dashboard
- Update the agent’s config file to use GPT-5.4 instead of Claude Sonnet 4.6
- Restart the agent
That’s it. All the skills loaded correctly. All the tools connected. The agent came online powered by GPT-5.4.
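To make the change concrete, here’s a minimal sketch of what that config edit amounts to. This is purely illustrative: the file layout and the `model` key are my own assumptions, not OpenClaw’s actual schema.

```python
import json
from pathlib import Path


def swap_model(config_path: Path, new_model: str) -> dict:
    """Load an agent config, swap its model field, and write it back.

    Hypothetical example: the config file and its "model" key are
    assumptions for illustration, not OpenClaw's real format.
    """
    config = json.loads(config_path.read_text())
    config["model"] = new_model  # e.g. "gpt-5.4" instead of "claude-sonnet-4.6"
    config_path.write_text(json.dumps(config, indent=2))
    return config
```

The point is the shape of the operation: one field changes, everything else (skills, tools, instructions) stays identical, which is what makes these swaps a clean A/B test.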
I felt like a mad scientist. Let’s see what this thing can do.
GPT-5.4: The Talker
I gave the agent a straightforward task. Research-heavy, needed current information, required following specific quality protocols I’d already configured. The kind of work this agent does regularly.
Claude Sonnet 4.6 would have taken 30-45 minutes. Maybe longer if it went down a rabbit hole. But it would finish.
GPT-5.4 took one hour.
And produced zero deliverables.
Not “bad work.” Not “needs revision.” Literally nothing. Just an endless stream of explanations about HOW it was going to do the task.
Here’s what I watched unfold in real time:
First 15 minutes: “Let me outline my approach to this task. I’ll start by analyzing the requirements, then I’ll research the current state of this product, then I’ll structure the output according to your quality protocols…”
Okay. Fine. Planning is good.
Next 20 minutes: “I’m going to break this into phases. Phase one will focus on verification of the latest information. I want to make sure I have the correct product details before proceeding. Let me walk you through my research strategy…”
Uh. Sure. Do the research then?
Next 25 minutes: Still talking. Explaining its methodology. Discussing best practices for this type of work. Absolutely beautiful articulation of the PROCESS of doing the task.
Zero actual work done.
I finally stopped it and asked directly: “Where’s the deliverable?”
The response was stunning.
The Most Self-Aware Failure I’ve Ever Seen
GPT-5.4 replied:
“You’re right. I treated setup as progress. I spent an hour talking to you about the task instead of doing the task. That is the truth.”
Let that sink in.
Perfect self-awareness. Crystal-clear understanding of its own failure. And yet, it couldn’t fix itself.
It’s like watching someone say “I know I’m procrastinating” while continuing to procrastinate. The insight doesn’t produce the behavior change.
I tried again. Simpler task. Shorter scope. Very explicit: “Do this now. Don’t explain. Just do it.”
Same thing. Verbose explanations. Planning. Meta-discussion about the work. No actual output. Then it said it was now going to stop talking and start working!
Thirty minutes later I asked: had it produced anything?
It answered: “There isn’t one yet. That’s the problem. No rewritten script. No .docx. No email sent. I failed to produce the deliverable. Now I’m stopping the talk and doing the work.”
Fifteen minutes later I asked: are you done?
It simply responded: “No.” Nothing else!
This wasn’t what I signed up for. I was beyond frustrated.
The Opus 4.6 Comeback
I switched the agent back to Claude Opus 4.6.
Not Sonnet. Opus. The premium tier.
Immediate difference.
Same task I’d given GPT-5.4. Opus took 25 minutes and delivered a complete, accurate, properly formatted result. Followed all the quality protocols. No hallucinations. No skipped steps.
It just worked.
Here’s what I learned: the original problem wasn’t Sonnet 4.6’s capability. It was Sonnet 4.6’s reliability for autonomous work.
Sonnet is a workhorse. It’s fast, it’s capable, it’s cost-effective. But for an agent that runs unsupervised and needs to handle complex tasks with zero hand-holding? I need the premium model.
Opus 4.6 doesn’t skip steps. It doesn’t cut corners. It doesn’t hallucinate product names and version numbers. It does the work correctly the first time.
Is it more expensive? Yes. Is it overkill for simple tasks? Maybe. But when you’re running agents autonomously, the cost of fixing mistakes is higher than the cost of using the better model.
What This Taught Me About Choosing Models for Agents
This experiment crystallized something I’d been sensing but hadn’t fully understood.
Model choice determines agent behavior in ways that instructions alone can’t fix.
Same agent. Same skills. Same tools. Same instructions. Completely different results depending on which model powers it.
Here’s how I think about it now:
The Talker vs. The Doer
Some models are talkers. They excel at explaining, teaching, discussing. They’re great for interactive work where you’re iterating together.
GPT-5.4 is a talker. It’s incredibly articulate. It can explain complex concepts beautifully. But when you need autonomous execution? It spends more time describing the work than doing it.
Some models are doers. They execute. They follow through. They produce deliverables without needing constant supervision.
Claude Opus 4.6 is a doer. It takes the task, does the work, and delivers results. Less poetry, more productivity.
For interactive work (brainstorming, learning, problem-solving), I want a talker. For autonomous work (research, analysis, production), I need a doer.
Self-Awareness Isn’t Capability
GPT-5.4 taught me this one the hard way. A model can perfectly understand what it’s doing wrong and still not be able to fix it. The metacognitive layer doesn’t automatically translate into behavioral change.
It’s like the difference between knowing you should exercise and actually exercising. Understanding the problem isn’t the same as solving the problem.
For autonomous agents, I don’t need self-awareness. I need execution. The agent doesn’t need to explain why it’s doing something correctly. It just needs to do it correctly.
Multi-Model Architectures Are Real
Before this experiment, I thought multi-model setups were theoretical or overcomplicated.
Now I get it.
Having GPT-5.4 do the work and Opus 4.6 review it would have been powerful if GPT-5.4 had actually produced work to review. The concept is sound: different models have different blind spots, so they can catch each other’s mistakes.
I’m going to keep experimenting with this. Maybe GPT-5.4 for research and Claude for synthesis. Or Gemini for data analysis and Claude for writing. The infrastructure is there in OpenClaw to make this work.
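A minimal sketch of what that routing could look like in practice. Everything here is hypothetical: the model identifiers and the routing table are my shorthand for the “talker vs. doer” split described above, not an OpenClaw API.

```python
# Hypothetical per-task-type model routing, based on the observations
# above: talkers for interactive/research work, doers for autonomous
# execution. Model names are illustrative stand-ins.
ROUTES = {
    "research": "gpt-5.4",           # articulate, good at gathering
    "synthesis": "claude-opus-4.6",  # reliable autonomous execution
    "data_analysis": "gemini-pro",
    "writing": "claude-opus-4.6",
}

DEFAULT_MODEL = "claude-opus-4.6"  # the doer, as the safe fallback


def pick_model(task_type: str) -> str:
    """Return the model that should power an agent for this task type."""
    return ROUTES.get(task_type, DEFAULT_MODEL)
```

The useful property of a table like this is that model choice becomes data, not code: when a new release drops, you change one entry and re-run the same tasks.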
QA Pipelines Are Non-Negotiable
The hallucination that started this whole experiment happened because Sonnet 4.6 skipped my QA process.
The fix isn’t just “use a better model.” The fix is “use a better model AND enforce the quality checks.”
I’ve now hardcoded the QA pipeline. The agent literally cannot submit work without passing through the review process. It’s not optional anymore.
This is like having code that can’t be deployed without passing tests. The quality gate isn’t a suggestion. It’s a requirement.
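In code terms, the gate is roughly this shape: a single submit path that refuses to deliver unless review passes, with no branch that skips it. The names here are illustrative, not OpenClaw internals.

```python
class QAFailure(Exception):
    """Raised when work fails, or tries to skip, the mandatory review."""


def submit(deliverable: str, qa_review) -> str:
    """Deliver work, but only through the QA gate.

    `qa_review` is any callable returning True when the work passes.
    There is deliberately no code path to delivery that bypasses it.
    """
    if not qa_review(deliverable):
        raise QAFailure("QA review did not pass; submission blocked")
    return f"DELIVERED: {deliverable}"
```

The design choice is that the gate lives in the submission path itself, not in the agent’s instructions, so a model that “skips 20% of instructions” still can’t skip the review.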
Premium Models Aren’t Overkill for Production
I used to think “save the expensive models for complex tasks.”
Now I think “use the expensive models for anything that runs unsupervised.”
The cost difference between Sonnet and Opus is real. But the cost of fixing a hallucination, or worse, letting it get published? Way higher.
For one-off tasks where I’m watching the output in real time, Sonnet is fine. For agents that run on their own and make decisions autonomously, Opus is the minimum.
Think of it like buying commercial-grade equipment vs. consumer-grade. Consumer works fine when you’re supervising it. Commercial is what you need when it has to run on its own.
The OAuth Surprise
One unexpected win from this experiment: I can use my existing ChatGPT Plus subscription to power agents.
I’m already paying for ChatGPT Plus ($20/month). OpenClaw can authenticate via OAuth and route requests through that subscription. No separate API costs, no additional billing.
This is huge for experimentation. I can test GPT-5.4 for specific use cases without setting up OpenAI API billing. Same with Gemini Pro (I have a Google AI Studio subscription).
It lowers the barrier to trying different models. Swap, test, compare, switch back. All using subscriptions I already have.
This is how I’ll test future model releases. When GPT-5.5 drops, or Claude Opus 5, or whatever comes next, I can hot-swap them into an agent and see how they perform in real-world work. No theoretical benchmarks. Actual tasks, actual results.
What I’m Doing Now
I’ve settled on this setup for the agent that started all this:
- Primary model: Claude Opus 4.6 (the doer)
- QA model: Also Opus 4.6, but in a separate reviewer role (the checker)
- Quality pipeline: Mandatory, cannot be skipped
- Fallback plan: If Opus hallucinates or gets stuck, I can manually trigger a GPT-5.4 or Gemini review for a second opinion
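As a config sketch, the whole setup fits in a few lines. The keys and model identifiers below are my own shorthand for the arrangement above, not OpenClaw’s actual schema.

```python
# Hypothetical agent configuration mirroring the setup described above.
AGENT_CONFIG = {
    "primary_model": "claude-opus-4.6",      # the doer
    "qa_model": "claude-opus-4.6",           # same model, separate reviewer role
    "qa_pipeline": {"mandatory": True},      # cannot be skipped
    "fallback_reviewers": ["gpt-5.4", "gemini-pro"],  # manual second opinions
}
```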
I’m also tracking model performance across all my agents. Which models handle which types of tasks well. Where they fail. How they degrade under time pressure or complex instructions.
This isn’t one-and-done. Models evolve. New versions drop. What works today might not work in six months.
The infrastructure to test and compare is now part of my workflow.
Your Turn To Share
Have you tried switching models for the same task and gotten wildly different results? I’m curious what you’ve found. Drop a comment and let me know which models you trust for autonomous work.