Agent A/B Tests: Crash or Cash
Plus new tricks to bound hallucinations, dm-cache performance hacks, and OpenAI’s hidden reasoning shift
Powerful insight from Altman that may shake up your aunt’s ‘positive vibes only’ groups.
HOT TAKE
The Memory Trap
Memory isn't a feature; it's lock-in. Users don't stay for your clever prompts - they stay because switching means losing their history.
What do you think?
HIDDEN GEMS
Curated finds to help you stay ahead
Prompt injection reframed as a reliability tactic: a dynamic alternative to static system prompts for more consistent agent behavior.
Expectation-level Decompression Law presented with an open-source toolkit for bounding hallucinations in LLMs, adding a mathematical refusal mechanism to improve reliability without retraining.
dm-cache on AWS shows how layering local SSDs over network storage slashed cross-AZ traffic and delivered major performance gains with minimal configuration.
Unpacking OpenAI’s new API as more than an upgrade, suggesting it is designed to hide reasoning traces and change how developers see and use model outputs.
Job of the week
Founding Rust Engineer // UMATR (New York, US)
This role focuses on building Rust-based backend systems for an LLM platform, including orchestration, evaluation workflows, and infrastructure for performance monitoring.
Responsibilities
Build Rust-based backend services for low-latency LLM interaction
Develop orchestration for prompt evaluation and fine-tuning workflows
Build scalable AWS-based APIs and serverless components
Implement real-time logging, metrics, and evaluation pipelines
Requirements
Deep Rust experience in production backend systems
Proficiency with AWS infra: Lambda, ECS, S3
Experience with relational stores like MySQL or Postgres
Familiarity with LLM integrations, fine-tuning, and prompt evaluation
Find more roles on our new jobs board - and if you want to post a role, get in touch.
MLOPS COMMUNITY
Before Building AI Agents Watch This (Deep Agent Expertise)
Nishikant’s team shipped a slick shopping agent - and the A/B test tanked. The recovery hinged on context, search, and disciplined evals.
Context engineering over model swaps: pull real-time promos, opening hours, payment options, and user history into the prompt via pipelines.
Hybrid search: keyword for exact matches, semantic for intent, with LLM-led query understanding up front and LLM re-ranking on user context (sketched below).
Evals and UI: track carts/conversion first, then LLM-as-judge, plus labeling parties; pair chat with timely UI widgets.
Do this, and the next A/B lifts instead of craters.
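Here's a minimal Python sketch of the hybrid retrieval step described above. The catalog, the token-overlap and string-similarity scorers, and the rerank heuristic are all stand-ins for illustration, not the team's actual stack: in a real system the keyword side would be BM25/Elasticsearch, the semantic side an embedding index, and the re-ranker an LLM call that sees the query plus user context such as promos, opening hours, and order history.

```python
import string
from difflib import SequenceMatcher

# Toy product catalog; a real agent would query live inventory and promo feeds.
CATALOG = {
    "sku-101": "Organic whole milk 1L, on promo until Friday",
    "sku-202": "Oat milk barista edition 1L",
    "sku-303": "Lactose-free semi-skimmed milk 500ml",
}

def _tokens(text: str) -> set[str]:
    # Lowercase and strip punctuation so 'milk,' and 'milk' compare equal.
    return set(text.lower().translate(str.maketrans("", "", string.punctuation)).split())

def keyword_search(query: str) -> dict[str, float]:
    # Exact-match signal: fraction of query tokens found in each document.
    q = _tokens(query)
    return {doc_id: len(q & _tokens(text)) / max(len(q), 1)
            for doc_id, text in CATALOG.items()}

def semantic_search(query: str) -> dict[str, float]:
    # Intent signal: cheap string-similarity stand-in for embedding similarity.
    return {doc_id: SequenceMatcher(None, query.lower(), text.lower()).ratio()
            for doc_id, text in CATALOG.items()}

def rerank(user_context: dict, fused: dict[str, float], top_n: int = 2) -> list[str]:
    # Stand-in for LLM re-ranking: boost items matching a known user preference.
    # In production this would be a model call over the query, the candidates, and
    # user context (promos, opening hours, payment options, order history).
    pref = user_context.get("prefers", "")
    boosted = {doc_id: score + (0.3 if pref and pref in CATALOG[doc_id].lower() else 0.0)
               for doc_id, score in fused.items()}
    return sorted(boosted, key=boosted.get, reverse=True)[:top_n]

def hybrid_retrieve(query: str, user_context: dict) -> list[str]:
    kw, sem = keyword_search(query), semantic_search(query)
    # Weighted fusion of both signals; the weights are another knob for the evals to cover.
    fused = {doc_id: 0.6 * kw[doc_id] + 0.4 * sem[doc_id] for doc_id in CATALOG}
    return rerank(user_context, fused)

print(hybrid_retrieve("milk on promo", {"prefers": "oat"}))
```

Fusing both retrievers before the re-rank keeps exact SKU matches from being drowned out by fuzzier semantic hits - which is exactly the failure mode the evals should be watching for.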
LLM Evaluation: Practical Tips at Booking.com
When your LLM is hallucinating and costs are piling up, you need a way to measure what’s really happening. Booking.com shares a year of lessons building their Judge-LLM framework for large-scale evaluation.
Building a “golden dataset” that mirrors production, with strict annotation protocols to ensure reliability.
Iteratively engineering prompts to make strong models judge weaker ones, enabling scalable monitoring (see the sketch below).
Balancing accuracy with cost by deploying lighter judge models for production tracking.
A practical blueprint for anyone serious about dependable LLM evaluation.
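If you want the shape of such a loop, here's a minimal LLM-as-judge sketch in that spirit; it is not Booking.com's framework. The golden examples, the judge prompt, and the call_judge_model stub are illustrative assumptions: in practice the judge call hits a strong model's API, and the golden set is a human-annotated sample that mirrors production traffic.

```python
from dataclasses import dataclass

# The judge sees the question, the human-verified reference, and the candidate
# answer, and must return a single verdict.
JUDGE_PROMPT = """You are grading a travel assistant's answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: PASS or FAIL."""

@dataclass
class GoldenExample:
    question: str      # sampled to mirror production traffic
    reference: str     # human-verified ground-truth answer
    candidate: str     # a logged model answer to be graded
    human_label: str   # annotator verdict on the candidate, "PASS" or "FAIL"

def call_judge_model(prompt: str) -> str:
    # Stand-in for the judge LLM call; a real judge would be a strong model
    # (or, once calibrated, a lighter and cheaper one for production tracking).
    return "PASS" if "24 hours" in prompt else "FAIL"

def judge(example: GoldenExample) -> str:
    prompt = JUDGE_PROMPT.format(
        question=example.question,
        reference=example.reference,
        candidate=example.candidate,
    )
    return call_judge_model(prompt).strip().upper()

def judge_agreement(golden: list[GoldenExample]) -> float:
    # Calibration step: how often does the judge agree with human annotators on
    # the golden dataset? Only trust it for monitoring above some threshold.
    hits = sum(judge(ex) == ex.human_label for ex in golden)
    return hits / len(golden)

golden = [
    GoldenExample(
        question="Can I cancel for free?",
        reference="Free cancellation until 24 hours before check-in.",
        candidate="Yes, you can cancel free of charge up to 24 hours before check-in.",
        human_label="PASS",
    ),
    GoldenExample(
        question="Is breakfast included?",
        reference="Breakfast is not included in this rate.",
        candidate="Breakfast is included every morning.",
        human_label="FAIL",
    ),
]
print(f"Judge/human agreement: {judge_agreement(golden):.0%}")
```

Once the judge agrees with annotators often enough on the golden set, the same prompt can be pointed at a lighter, cheaper model for continuous production tracking - the accuracy/cost trade-off the post describes.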
IN PERSON EVENTS
Frankfurt - September 11
Lisbon - September 11
London - September 18
Austin - September 18
Seattle - September 25
Denver - September 25
Miami - September 25
MEME OF THE WEEK
ML CONFESSIONS
ROC Curves Don’t Make Phone Calls
At my first data science job, we built a churn prediction model for a logistics platform. I spent weeks tuning it, tested every algorithm under the sun, presented beautiful ROC curves - the whole thing. When we finally shipped it, the ops team barely used the predictions. Instead, they leaned on a single heuristic we’d mentioned in passing: “If a customer hasn’t placed an order in 6 weeks, call them.” That worked better than anything my model spat out.
I remember sitting there thinking, all that work for a rule they could’ve written on a sticky note.
Share your confession here.

