Don’t Press That Button
Plus... why your eval harness lied, five hidden gems, and the free CV guide
We’ve released a free practical Computer Vision guide on what breaks once models leave the notebook - shifting data, real-world constraints, and why accuracy isn’t the whole story.
Download it here.
This week’s ML Confession shows what happens when you skip that step.
HOT TAKE
Unsafety harness
Most teams optimising model choice are solving the wrong problem. The harness is doing more work than the model ever will.
Are you tuning the engine or fixing the car?
LAST WEEK’S TAKE
Open for business
The vote on what’s keeping you awake at night, model errors or system failures, was surprisingly close.
PRESENTED BY DATABRICKS
Have You Secured Your Spot at the DevConnect Global Roadshow? 🌍
Get ready to go way beyond the documentation. Databricks DevConnect is a global, high-energy roadshow bringing hands-on sessions to the world’s leading tech hubs.
This is the ultimate destination for data engineering and AI innovation. No high-level fluff, no sales pitches, just raw technical deep dives, architectural breakdowns, and real-world tactical advice from the people actually building the future.
Why DevConnect?
Dive Deep: Tear down complex architectures and master the latest Lakehouse features, from Lakeflow to Agent Bricks.
Expand Your Circle: Connect and collaborate with a global community of developers who face the same challenges you do.
Level Up Your Career: Learn the workflows and hacks that make you a faster, sharper, and more effective developer.
Global Tour Dates
The roadshow is moving fast. Secure a spot in your city:
March 31 📍 Bellevue, WA
April 1 📍 Vancouver, CA
April 14 📍 Austin, TX
April 16 📍 Denver, CO
April 28 📍 Munich, Germany
April 29 📍 London, UK
HIDDEN GEMS
Curated finds to help you stay ahead
Early access to Evaluation and Alignment, The Seminal Papers, curating work on evaluation metrics (BLEU, ROUGE, BERTScore, LLM-as-judge), hallucination detection, and alignment approaches such as RLHF, constitutional AI, and red teaming.
Automatically generate review guidelines from your team’s accumulated PR review comments: extract recurring patterns, validate them against live code, and turn them into automated rules.
Production reliability for agent systems, using OpenClaw’s architecture to work through session isolation, control-plane invariants, durable evidence, and evaluation.
A test of AI agents’ long-horizon planning and collaboration, with four frontier models running autonomously to organize a real-world event, highlighting limits like hallucinated budgets and communication gaps.
💡 Job of the week
Senior Software Engineer - AI Agents Backend // Cresta (Remote, United States)
Cresta builds AI tools for contact centres, focusing on automating conversations and extracting operational insights. This role centres on backend engineering for agent systems, designing scalable services, improving reliability under production load, and supporting high-volume, real-time interactions.
Responsibilities
Design backend services supporting AI agent orchestration and real-time interactions
Improve system reliability, latency, and throughput under sustained production workloads
Build APIs using gRPC and REST for internal AI integrations
Optimise data pipelines, storage schemas, and large-scale request processing
Requirements
Strong experience designing distributed backend systems in cloud environments
Proficiency with Kubernetes, Docker, and microservices deployment and operations
Experience building or supporting virtual agents or AI-driven applications
Solid database expertise across SQL, NoSQL, schema design, and query optimisation
MLOPS COMMUNITY
A New Kind of Marketplace
When a filter button got mistaken for a bulk action, dealers refused to press it. That’s where real agentic deployments break.
Open-ended chat causes paralysis or overreach; preset prompts that feed into chat gave users a clearer mental model and led to natural follow-up questions.
Agentic trust has to be earned in stages - show data first, confirm before acting, automate last.
The deeper infrastructure gap isn’t the agent itself but the discovery and escrow layer that lets buyer and seller agents transact autonomously and at scale.
The agents that reach production will be the ones people were willing to let act.
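For the curious, the staged-trust idea above (show data first, confirm before acting, automate last) can be sketched as a small gate in front of every agent action. This is only an illustration: the names TrustStage and decide, and the returned strings, are hypothetical and not taken from the talk.

```python
from enum import IntEnum


class TrustStage(IntEnum):
    """Hypothetical trust levels, ordered from least to most autonomy."""
    SHOW = 1      # agent only displays data, never acts
    CONFIRM = 2   # agent proposes an action and waits for approval
    AUTOMATE = 3  # agent acts without asking


def decide(stage: TrustStage, user_approved: bool = False) -> str:
    """Return what the agent may do with a proposed action at this stage."""
    if stage == TrustStage.SHOW:
        return "display_only"
    if stage == TrustStage.CONFIRM:
        # Act only once a human has explicitly said yes.
        return "execute" if user_approved else "await_confirmation"
    return "execute"
```

Promoting a user from one stage to the next would then be an explicit decision, not a side effect of the chat.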
The Illustrated Guide on How to Use AI Coding Platforms
A model was scoring 45% on internal benchmarks. The team assumed the model was the bottleneck. They were wrong - a bug in the harness was. Fix the harness, the score jumps to 65%.
Keep context usage under 50-80% of the window’s capacity; beyond that, accuracy drops and hallucinations rise, regardless of the advertised window size.
Brownfield and greenfield are opposite problems: existing code constrains arbitrary decisions, but greenfield needs explicit harness definitions before the AI touches anything.
Start a fresh chat to debug. The window that built the bug has already convinced itself the approach was correct.
The model is rarely what’s failing you.
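The context-budget advice above can be turned into a simple check. This is a minimal sketch, assuming you can get token counts from your own tooling; the function names are hypothetical, and the default soft limit of 0.5 is just the lower end of the 50-80% range quoted above, not a measured threshold.

```python
def context_usage(used_tokens: int, window_tokens: int) -> float:
    """Fraction of the context window currently in use."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return used_tokens / window_tokens


def should_start_fresh(used_tokens: int, window_tokens: int,
                       soft_limit: float = 0.5) -> bool:
    """True once usage reaches the soft limit (an assumed default of 50%)."""
    return context_usage(used_tokens, window_tokens) >= soft_limit
```

For example, 120,000 tokens used in a 200,000-token window is 60% usage, so with the default limit this would suggest opening a fresh chat.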
CALL FOR SPEAKERS
PyConAU’s call for speakers closes March 29, so you’ll have to be quick if you want to be part of the MLOz Community.
IN PERSON EVENTS
VIRTUAL EVENTS
Ship Agents: A Virtual Conference - March 26
MEME OF THE WEEK
ML CONFESSIONS
Shelf Awareness
I maintain a computer vision model that decides whether a supermarket shelf is “well stocked”. Not out of intellectual curiosity. Because someone, somewhere, tied it to a KPI.
The model itself is fine. Solid architecture, clean training pipeline, sensible augmentations. It spots gaps, counts facings, even handles reflections better than most humans. The problem is the real world keeps doing things that were not in the dataset. Seasonal packaging. Promo wobblers. A cardboard cutout of a celebrity partially blocking the pasta. Someone stacked tins sideways for a TikTok.
Every few weeks the dashboard spikes and there is a quiet flurry of messages about “model drift”. We all nod. We all open notebooks. Then we zoom in on the images and realise the shelves are full. Just full in ways that violate our tidy definition of “full”.
So we add another rule. Or a post-processing tweak. Or a threshold that only applies to aisle seven after 3pm because that is when the sun hits the freezer doors.
The model keeps getting better. The world keeps getting weirder. Last month we retrained on a beautifully curated dataset. This week I am manually cropping out a life-size inflatable avocado so it does not count as fresh produce.
Share your confession here.



