Agents Need Traffic Control

Plus... what breaks after the demo, how agents should research codebases, MCP’s RC, and evals from real traces.

Jun 03, 2026

Agents, quantum, and chips are in the news, but it’s not the new James Bond announcement.

HOT TAKE

Agents: License revoked?

The future of AI apps may look less like agents deciding everything, and more like boring pipelines with good boundaries.

Which do you trust more: Agents or Workflows?


LAST WEEK’S TAKE

Don’t Make Yourself Scarce

No need to review the results of last week’s hot take: you were clear on what’s scarcer.

PRESENTED BY VOXEL51

Building Vision Data Agents with Tools, Skills, and MCP

Most agentic workflows for computer vision tasks are pieced together with custom scripts. One to import your data, another to analyze your labels, another to evaluate detections. It works, but it doesn’t scale.

On June 17th, at 9 AM PT, we’ll walk through how to build agents that can operate on your datasets: tagging samples, computing embeddings, running evaluations, and surfacing results, all through a structured tool interface using MCP and Skills.

You’ll come away with a working mental model for how to wire this up in your own stack.

What’s covered:

How Skills and MCP bridge the gap between an LLM and real workflows: connecting datasets, embeddings, evaluations, and curation pipelines.
Automating common CV tasks like duplicate detection, dataset QA, model evaluation, and annotation.
What breaks in real-world agent systems, and what we learned building them.
Evaluating multi-step agents with traces, rubrics, and regression testing.

Register for the Webinar

HIDDEN GEMS

Guidance for building agent evals from real traces, golden cases, and production regressions.

Agent harnesses as technical debt when models improve, production needs change, and scaffolding starts to outlive its usefulness.

Flexible agent retrieval using grep, BM25, AST tools, APIs, and vector search together, instead of treating vector DBs as the default answer.

Making batch feature computation cheaper and faster by reusing partial aggregates while keeping training data point-in-time correct.

AGENT INFRASTRUCTURE

MCP moves toward production ops

The next MCP specification release candidate is out, and the headline change is not a new tool or another agent demo. It is operational plumbing.

MCP is moving toward a stateless protocol core. In the previous Streamable HTTP version, clients established a session first, then carried an Mcp-Session-Id into later requests. That made horizontal deployments care about sticky sessions, shared session stores, and gateway behavior. In the new release candidate, a tool call can be a self-contained request that any server instance can handle. Protocol version, client identity, and client capabilities travel with the request, while headers such as Mcp-Method and Mcp-Name make routing easier for load balancers, gateways, and rate limiters.
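To make the routing point concrete, here is a minimal sketch of what a gateway could do once the method and tool name sit in headers. It assumes Mcp-Method carries a JSON-RPC method such as tools/call and Mcp-Name carries the tool name; those value formats, the pool names, and the SLOW_TOOLS set are illustrative assumptions, not something taken from the spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteDecision:
    pool: str            # which backend pool should serve the request
    rate_limit_key: str  # bucket a rate limiter could count against


# Hypothetical: tools we would rather send to a separate, slower pool.
SLOW_TOOLS = {"run_evaluation"}


def route(headers: dict[str, str]) -> RouteDecision:
    """Pick a pool from headers alone, without parsing the request body."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    method = h.get("mcp-method", "")
    name = h.get("mcp-name", "")

    if method == "tools/call" and name in SLOW_TOOLS:
        pool = "batch"
    else:
        pool = "default"
    return RouteDecision(pool=pool, rate_limit_key=f"{method}:{name}")
```

Because nothing here depends on a session, any instance behind the gateway can take the request, which is the whole appeal of the stateless core.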

That does not mean agent workflows lose state. It means state has to become explicit. A server can return a handle, such as a repository, browser session, task, or basket ID, and the model can pass that handle back in later tool calls. That is better for logging, orchestration, and debugging, but it also puts more responsibility on tool authors to scope, validate, and expire those handles properly.
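What "scope, validate, and expire" might look like on the server side is sketched below. Everything in it is hypothetical (the HandleStore class, the owner and kind fields, the 15-minute default), and it keeps handles in process memory only to stay short; a real multi-instance deployment would need shared storage so any instance can resolve a handle.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass
class Handle:
    owner: str         # who the handle was issued to
    kind: str          # e.g. "basket", "browser_session", "task"
    expires_at: float  # clock value after which the handle is dead


class HandleStore:
    def __init__(self, ttl_seconds: float = 900.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._handles: dict[str, Handle] = {}

    def issue(self, owner: str, kind: str) -> str:
        """Create an unguessable handle ID the model can pass back later."""
        handle_id = secrets.token_urlsafe(16)
        self._handles[handle_id] = Handle(owner, kind, self._clock() + self._ttl)
        return handle_id

    def resolve(self, handle_id: str, owner: str, kind: str) -> Handle:
        """Validate a handle on every call: known, unexpired, and in scope."""
        handle = self._handles.get(handle_id)
        if handle is None:
            raise KeyError("unknown handle")
        if self._clock() >= handle.expires_at:
            del self._handles[handle_id]
            raise KeyError("expired handle")
        if handle.owner != owner or handle.kind != kind:
            raise PermissionError("handle out of scope")
        return handle
```

A tool would call issue() when it creates the basket or session, return the ID in its result, and call resolve() at the top of every later tool call that receives it.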

There are other production-minded changes too. MCP Apps let servers ship interactive HTML interfaces that hosts render in sandboxed iframes, while Tasks moves long-running work into an official extension with tasks/get, tasks/update, and tasks/cancel. Extensions now get a more formal process, with negotiated capabilities and independent versioning, rather than every new capability needing to live in the core spec.

The security and auth piece is worth calling out. The release candidate tightens OAuth and OpenID Connect behavior for the common MCP pattern where one client talks to many servers. It clarifies issuer validation, client registration, credential binding, refresh tokens, and step-up scope handling, where a client requests scopes on demand as tools need them instead of over-permissioning up front. Den Delimarsky has a short post on that specific change. This matters because agents are increasingly being connected to private data, company systems, paid services, and user-specific workflows.
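The step-up idea can be pictured with a small client-side sketch: before calling a tool, work out which scopes are still missing and ask for only those. The tool names, scope strings, and the TOOL_SCOPES table are made up for illustration, and this says nothing about the actual wire format in the release candidate.

```python
# Hypothetical mapping from tool name to the OAuth scopes it needs.
TOOL_SCOPES: dict[str, frozenset[str]] = {
    "read_ticket": frozenset({"tickets:read"}),
    "refund_order": frozenset({"orders:read", "payments:write"}),
}


def scopes_to_request(tool: str, granted: set[str]) -> set[str]:
    """Return the scopes still missing for this tool.

    An empty set means the call can go ahead with the current token.
    Unknown tools raise KeyError, so the client fails closed.
    """
    required = TOOL_SCOPES[tool]
    return set(required - granted)
```

A client that starts with only tickets:read would get an empty set back for read_ticket, and would be told to request orders:read and payments:write the first time the model reaches for refund_order.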

The migration work is real. Roots, Sampling, and Logging are now deprecated under a formal feature lifecycle policy, though they still work in this release and any removal needs a separate SEP. Full JSON Schema 2020-12 support also gives tool schemas more expressive power, which is useful, but implementers need to watch validation cost and external references.
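One cheap way to act on the external-references warning is to scan tool schemas before accepting them. The sketch below is a conservative heuristic, not a validator: it flags every $ref value that does not start with "#", which can over-report (for example, a relative reference that resolves inside the same document, or example data that happens to contain a "$ref" key). The function name is an assumption for illustration.

```python
def external_refs(schema: object) -> list[str]:
    """Collect $ref targets that point outside the current document."""
    found: list[str] = []
    stack = [schema]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            ref = node.get("$ref")
            # Local references start with "#"; anything else is flagged.
            if isinstance(ref, str) and not ref.startswith("#"):
                found.append(ref)
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return sorted(found)
```

A server could reject or manually review any tool schema where this list is non-empty, rather than letting a validator fetch remote documents at call time.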

This RC moves MCP closer to something ops teams can run using familiar infrastructure, but it is still an RC with breaking changes. Angie Jones at AAIF has a good practical breakdown of what to test first. Two of her checks are worth doing now: confirm your requests can move across server instances without losing context, and audit every place your server remembers something between tool calls so it has an explicit handle the client can see and pass back.

Links:

AAIF practical breakdown: https://go.mlops.community/NL_AIRT2_Jun04_Sub
Official MCP release candidate post: https://go.mlops.community/NL_AIRT3_Jun04_Sub
Official changelog: https://go.mlops.community/NL_AIRT4_Jun04_Sub
MCP.Directory walkthrough: https://go.mlops.community/NL_AIRT6_Jun04_Sub

If you do this kind of work on MCP, Goose, or AGENTS.md, the AAIF Ambassador program is open until June 12.

Ten spots, one project-based contribution a month (tutorial, talk, video, blog post, or livestream), with travel support, roadmap access, and regular briefings with project maintainers.

Apply or read more

💡Job of the week

Senior Machine Learning Platform Engineer // Hinge (New York, NY - Hybrid)

Hinge is hiring a Senior Machine Learning Platform Engineer to help build and scale its internal AI platform, covering training, serving, observability, GenAI systems, and production ML infrastructure for product and data science teams.

Responsibilities:

Design ML training, serving, feature, observability, and GenAI platforms.
Build frameworks for model development, deployment, and operations.
Improve reliability, scalability, usability, and platform cost management.
Support responsible AI, privacy compliance, and incident response processes.

Requirements:

4+ years in ML, backend, data, or platform engineering.
2+ years working with cloud platforms and Kubernetes tooling.
Experience designing and shipping production-grade online ML systems.
Strong Python, Go, or Java programming and system design skills.

MLOPS COMMUNITY

AI Is Fast. AI Projects Are Slow. Let’s Fix That.

AI pipelines fail in boring places first. This conversation looks at what happens when coding agents, LLMs, OCR tools, parsers, vector stores, and deployment plumbing all have to work reliably in production.

The main bottleneck is shifting away from code generation and toward architecture, tool selection, consistency, and quality control across changing agent stacks.
Rocket Ride standardizes pipelines through nodes, lanes, traces, and reusable patterns, so teams can compare agents, models, and tools without rewriting everything.
Observability matters once one bad OCR step, async bug, or expensive model call can break quality, latency, or cost.

The practical point is that production AI work needs less guessing and more control over what each part of the pipeline is doing.

Video || Spotify || Apple

Architecting Modern AI Systems: Platforms, Agents, and Integration

A mental health hackathon exposed the gap between an impressive AI demo and a system you’d trust when the stakes are real. The discussion moves from model choice to production controls, evaluation, and agent failure modes.

Open-source models can cut token costs and give teams more control over data, versioning, and output behavior.
Production quality still depends on evaluation, human review, and user feedback.
Agent governance needs telemetry, tool restrictions, sandboxing, and clear limits on irreversible actions.

The strongest thread is practical control: cheaper models help, but governance decides whether agents can be trusted.

Video || Spotify || Apple

RePPIT: A Framework to Ship Production Code 2-3X Faster

AI coding can look great right up until it meets a real codebase. This piece explains RePPIT, a five-step framework for getting coding agents to produce production-ready work without relying on one-shot prompting.

The model first researches the codebase, documenting architecture, dependencies, and file structure before proposing any changes.
It then compares two implementation options, plans the chosen route, and limits guessing before code is written.
Testing includes review by a separate model or cleared context, so the same instance does not mark its own homework.

For production code, the biggest speed gain may come before the first line is written.

Read the blog

Making Coding Agent Work Multiplayer

What comes after the IDE? At last week’s Coding Agents Lunch & Learn, Max Beauchemin walked through Agor, a multiplayer canvas for orchestrating coding agents that pulls work out of private terminals and onto a shared visual surface.

Read the short PDF write-up here.

The next Coding Agents Lunch & Learn is tomorrow, June 5, at 5:00 PM BST / 4:00 PM UTC. Session 14 covers building and measuring a GTM agent with Goose X Code Mode, the Agent Voyager Project for agent observability, and techniques for running long autonomous agent sessions reliably.

Join the next session here.

IN PERSON EVENTS

London - June 4
Bengaluru - June 6
San Francisco - June 10
Amsterdam - June 24
NYC - June 24

VIRTUAL EVENTS

Coding Agents Lunch & Learn Session - June 5
Building Real-Time ML Systems with Zipline + Chronon - June 10
Reading Group: Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use - June 11

MEME OF THE WEEK

ML CONFESSIONS

This Is Fine

We added a bot to summarize our incident channel so people joining a thread late didn’t have to read back through a few hundred messages.

It was fine until a Redis incident that ran most of an afternoon. Someone restarted a node, posted “OK that’s looking better,” and the graphs did calm down for a bit. Then it came back worse and we spent another two hours on it.

The whole time, the summary at the top of the thread said the incident was resolved. It had grabbed the “looking better” message and run with it, and nothing after that seemed to change its mind. People joining late kept reading “resolved” and assuming they’d missed the end.

I didn’t have a clean fix. I just made it stop guessing the status at all, so now it summarizes what was said and leaves a blank where the status would go unless someone’s actually closed the incident. Less useful, but it stopped telling people the fire was out while we were still standing in it.

Share your confession here.

MLOps Community
