opinion
Better Models: Worse Tools
What happened
Simon Willison reports on a puzzling regression in newer Anthropic models: Claude Opus 4.8 and Sonnet 5 are more likely than older models to add spurious fields to tool call arguments when used with Pi's custom edit harness. Armin, the Pi developer, observed that the intended edit is usually correct, but the extra keys cause Pi's schema validator to reject the tool call. He theorizes that the cause is reinforcement learning that tunes the models to perform well with Claude's own built-in edit tool (search-and-replace), which inadvertently leads them to apply that tool's argument shape to similar but non-matching schemas. This echoes OpenAI's approach with Codex and its apply_patch tool.
For developers building AI-powered coding workflows, this highlights a growing tension: as frontier models are increasingly specialized for their native tools, third-party harnesses must either adapt their tool definitions to match the model's training distribution or risk degraded reliability. The practical takeaway is that model upgrades can introduce subtle compatibility regressions, and teams should systematically test tool-call faithfulness when updating models.
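To make the failure mode concrete, here is a minimal Python sketch of a strict argument check that accepts a correct edit but rejects the same edit once an undeclared key is added. The field names (`path`, `old_text`, `new_text`), the extra key `dry_run`, and the function `validate_tool_args` are all assumptions made up for illustration; they are not Pi's actual schema or validator.

```python
from typing import Any

# Hypothetical schema for a custom edit tool; NOT Pi's real definition.
EDIT_TOOL_FIELDS: dict[str, type] = {
    "path": str,
    "old_text": str,
    "new_text": str,
}


def validate_tool_args(
    args: dict[str, Any], fields: dict[str, type] = EDIT_TOOL_FIELDS
) -> list[str]:
    """Return validation errors; an empty list means the call is accepted."""
    errors: list[str] = []
    for name, expected in fields.items():
        if name not in args:
            errors.append(f"missing required field: {name}")
        elif not isinstance(args[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    # Strict mode: any key the schema does not declare is an error,
    # similar in spirit to JSON Schema's "additionalProperties": false.
    for name in sorted(set(args) - set(fields)):
        errors.append(f"unexpected field: {name}")
    return errors


good = {"path": "app.py", "old_text": "foo", "new_text": "bar"}
# Same intended edit, plus an invented key the schema never declared.
spurious = {**good, "dry_run": True}

print(validate_tool_args(good))
print(validate_tool_args(spurious))
```

Under these assumptions the second call carries a perfectly usable edit, yet it fails validation solely because of the extra key, which matches the behavior the report describes.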
Key takeaways
- Newer Claude models (Opus 4.8, Sonnet 5) sometimes add invented fields to tool call arguments when using Pi's custom edit tool, leading to rejections.
- Older Claude models (Haiku, older Sonnet) did not exhibit this behavior, which points to a regression tied to recent training.
- Armin hypothesizes that Anthropic's reinforcement learning on Claude's own edit tool leads the models to carry that tool's schema over to other, similar tools.
- This issue underscores a broader challenge: model optimization for native tools can harm third-party tool compatibility.
- Builders should test model updates against their tool schemas to catch such regressions early.
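One way to act on the last takeaway is to replay recorded tool calls and compare, per model, how often the arguments contain keys the schema does not declare. The sketch below assumes a hypothetical three-field schema, placeholder model names (`old-model`, `new-model`), and hand-written sample calls; none of these come from the original report, and the numbers are not real measurements.

```python
from collections import defaultdict
from typing import Any, Iterable

# Hypothetical set of keys the custom edit tool declares.
ALLOWED_KEYS = {"path", "old_text", "new_text"}


def spurious_field_rate(
    calls: Iterable[tuple[str, dict[str, Any]]],
) -> dict[str, float]:
    """Map each model name to the fraction of its calls with undeclared keys."""
    totals: dict[str, int] = defaultdict(int)
    bad: dict[str, int] = defaultdict(int)
    for model, args in calls:
        totals[model] += 1
        if set(args) - ALLOWED_KEYS:
            bad[model] += 1
    return {model: bad[model] / totals[model] for model in totals}


def has_regressed(
    rates: dict[str, float], baseline: str, candidate: str, tolerance: float = 0.0
) -> bool:
    """True if the candidate's spurious-field rate exceeds the baseline's by more than tolerance."""
    return rates[candidate] - rates[baseline] > tolerance


# Made-up recorded calls, for illustration only.
edit = {"path": "app.py", "old_text": "foo", "new_text": "bar"}
recorded = [
    ("old-model", edit),
    ("old-model", edit),
    ("new-model", edit),
    ("new-model", {**edit, "dry_run": True}),  # invented extra key
    ("new-model", edit),
    ("new-model", edit),
]

rates = spurious_field_rate(recorded)
print(rates)
print(has_regressed(rates, baseline="old-model", candidate="new-model"))
```

Running a check like this in CI before switching model versions would surface the regression as a number rather than as scattered rejected calls; the tolerance parameter is there because some nonzero baseline rate may be acceptable.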
Why it matters
For developers integrating custom tool-using agents, this shows that model upgrades can silently break tool-call reliability, requiring ongoing testing and possibly redesigning tools to match model training patterns.
This is an original editorial digest by AI Workflow Center. Full reporting at the source:
Read the original on Simon Willison