Don’t Press That Button

Plus... why your eval harness lied, five hidden gems, and the free CV guide

MLOps Community

Mar 26, 2026

We’ve released a free practical Computer Vision guide on what breaks once models leave the notebook - shifting data, real-world constraints, and why accuracy isn’t the whole story.

Download it here.

This week’s ML Confession shows what happens when you skip that step.

HOT TAKE

Unsafety harness

Most teams optimising model choice are solving the wrong problem. The harness is doing more work than the model ever will.

Are you tuning the engine or fixing the car?

HARNESS or MODEL

LAST WEEK’S TAKE

Open for business

A surprisingly close split on what’s keeping you awake at night: model errors or system failures.

PRESENTED BY DATABRICKS

Have You Secured Your Spot at the DevConnect Global Roadshow? 🌍

Get ready to go way beyond the documentation. Databricks DevConnect is a global, high-energy roadshow bringing hands-on sessions to the world’s leading tech hubs.

This is the ultimate destination for data engineering and AI innovation. No high-level fluff, no sales pitches, just raw technical deep dives, architectural breakdowns, and real-world tactical advice from the people actually building the future.

Why DevConnect?

Dive Deep: Tear down complex architectures and master the latest Lakehouse features, from Lakeflow to Agent Bricks.
Expand Your Circle: Connect and collaborate with a global community of developers who face the same challenges you do.
Level Up Your Career: Learn the workflows and hacks that make you a faster, sharper, and more effective developer.

Global Tour Dates

The roadshow is moving fast. Secure a spot in your city:

March 31 📍 Bellevue, WA
April 1 📍 Vancouver, CA
April 14 📍 Austin, TX
April 16 📍 Denver, CO
April 28 📍 Munich, Germany
April 29 📍 London, UK

REGISTER TODAY

HIDDEN GEMS

Curated finds to help you stay ahead

Early access to “Evaluation and Alignment, The Seminal Papers”, which curates work on evaluation metrics (BLEU, ROUGE, BERTScore, LLM-as-judge), hallucination detection, and alignment approaches such as RLHF, constitutional AI, and red teaming.

Automatically generate review guidelines from your team’s accumulated PR review comments: extract recurring patterns, validate them against live code, and turn them into automated rules.

Production reliability for agent systems, using OpenClaw’s architecture to work through session isolation, control-plane invariants, durable evidence, and evaluation.

A test of AI agents’ long-horizon planning and collaboration, with four frontier models running autonomously to organize a real-world event, highlighting limits like hallucinated budgets and communication gaps.

💡Job of the week

Senior Software Engineer - AI Agents Backend // Cresta (Remote, United States)

Cresta builds AI tools for contact centres, focusing on automating conversations and extracting operational insights. This role centres on backend engineering for agent systems, designing scalable services, improving reliability under production load, and supporting high-volume, real-time interactions.

Responsibilities

Design backend services supporting AI agent orchestration and real-time interactions
Improve system reliability, latency, and throughput under sustained production workloads
Build APIs using gRPC and REST for internal AI integrations
Optimise data pipelines, storage schemas, and large-scale request processing

Requirements

Strong experience designing distributed backend systems in cloud environments
Proficiency with Kubernetes, Docker, and microservices deployment and operations
Experience building or supporting virtual agents or AI-driven applications
Solid database expertise across SQL, NoSQL, schema design, and query optimisation

MLOPS COMMUNITY

A New Kind of Marketplace

When a filter button got mistaken for a bulk action, dealers refused to press it. That’s where real agentic deployments break.

Open-ended chat causes paralysis or overreach; preset prompts that feed into chat gave users a clearer mental model and led to natural follow-up questions.
Agentic trust has to be earned in stages - show data first, confirm before acting, automate last.
The deeper infrastructure gap isn’t the agent itself but the discovery and escrow layer that lets buyer and seller agents transact autonomously and at scale.

The agents that reach production will be the ones people were willing to let act.
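
The "show data first, confirm before acting, automate last" progression can be sketched as a simple gate in front of every agent action. This is a rough illustration only, not how the marketplace in the episode is built: `TrustStage`, `ProposedAction`, and `handle` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable


class TrustStage(IntEnum):
    """Hypothetical trust levels a user moves through over time."""
    SHOW = 1      # agent only previews what it would do
    CONFIRM = 2   # agent acts only after explicit approval
    AUTOMATE = 3  # agent acts without asking


@dataclass
class ProposedAction:
    description: str         # human-readable summary shown to the user
    run: Callable[[], str]   # the side effect, executed only when allowed


def handle(action: ProposedAction, stage: TrustStage, approved: bool = False) -> str:
    """Gate an agent action by the trust stage the user has reached."""
    if stage == TrustStage.SHOW:
        # Never execute at this stage, even if an approval flag slips through.
        return f"PREVIEW: {action.description}"
    if stage == TrustStage.CONFIRM and not approved:
        return f"NEEDS CONFIRMATION: {action.description}"
    return action.run()


if __name__ == "__main__":
    reprice = ProposedAction("Reprice 12 listings", lambda: "done")
    print(handle(reprice, TrustStage.SHOW))
    print(handle(reprice, TrustStage.CONFIRM))
    print(handle(reprice, TrustStage.CONFIRM, approved=True))
```

The design choice worth noting is that the preview stage ignores the approval flag entirely, so a button that merely filters or previews can never turn into a bulk action by accident.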

Video || Spotify || Apple

The Illustrated Guide on How to Use AI Coding Platforms

A model was scoring 45% on internal benchmarks. The team assumed the model was the bottleneck. They were wrong - a bug in the harness was. Fix the harness, the score jumps to 65%.

Keep context usage below roughly 50-80% of the window’s capacity; beyond that, accuracy drops and hallucinations rise, regardless of advertised window size.
Brownfield and greenfield are opposite problems: existing code constrains arbitrary decisions, but greenfield needs explicit harness definitions before the AI touches anything.
Start a fresh chat to debug. The window that built the bug has already convinced itself the approach was correct.

The model is rarely what’s failing you.
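
The context-window advice above lends itself to a small guard in a harness. This is a minimal sketch under stated assumptions: the function names are hypothetical, the caller supplies token counts from whatever tokenizer its platform uses, and the default limit of 0.5 is simply the lower end of the 50-80% range quoted above, not a measured threshold.

```python
def context_usage(used_tokens: int, window_tokens: int) -> float:
    """Fraction of the context window currently in use."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    if used_tokens < 0:
        raise ValueError("used_tokens must not be negative")
    return used_tokens / window_tokens


def should_start_fresh_chat(
    used_tokens: int,
    window_tokens: int,
    soft_limit: float = 0.5,  # assumption: lower end of the 50-80% guidance
) -> bool:
    """True once usage reaches the soft limit and a new session is advisable."""
    return context_usage(used_tokens, window_tokens) >= soft_limit


if __name__ == "__main__":
    # Illustrative numbers only: 60k tokens used in a 100k-token window.
    print(should_start_fresh_chat(60_000, 100_000))
```

Checking a ratio rather than an absolute token count keeps the same guard usable across models with different advertised window sizes.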

Read the blog

CALL FOR SPEAKERS

PyConAU’s call for speakers closes March 29, so you’ll have to be quick if you want to be part of the MLOz Community.

IN PERSON EVENTS

London - March 26
Lagos - March 28
New York - April 8
Seattle - April 14

VIRTUAL EVENTS

Ship Agents: A Virtual Conference - March 26

MEME OF THE WEEK

ML CONFESSIONS

Shelf Awareness

I maintain a computer vision model that decides whether a supermarket shelf is “well stocked”. Not out of intellectual curiosity. Because someone, somewhere, tied it to a KPI.

The model itself is fine. Solid architecture, clean training pipeline, sensible augmentations. It spots gaps, counts facings, even handles reflections better than most humans. The problem is the real world keeps doing things that were not in the dataset. Seasonal packaging. Promo wobblers. A cardboard cutout of a celebrity partially blocking the pasta. Someone stacked tins sideways for a TikTok.

Every few weeks the dashboard spikes and there is a quiet flurry of messages about “model drift”. We all nod. We all open notebooks. Then we zoom in on the images and realise the shelves are full. Just full in ways that violate our tidy definition of “full”.

So we add another rule. Or a post-processing tweak. Or a threshold that only applies to aisle seven after 3pm because that is when the sun hits the freezer doors.

The model keeps getting better. The world keeps getting weirder. Last month we retrained on a beautifully curated dataset. This week I am manually cropping out a life-size inflatable avocado so it does not count as fresh produce.

Share your confession here.
