What if prompts worked like real code?
$10 billion?
Would be cheaper to make it out of flatpack and call it Billy Bitskåp.
PODCAST
Getting AI Apps Past the Demo
Eight open tabs of LangChain docs, flaky prompt outputs, unrealistic deadlines, lack of coffee... there are so many reasons AI apps don’t make it to production.
Vaibhav told me how BAML helps with (at least) a couple of those by treating prompts more like real code: structured, testable, and easier to debug. That shift helps teams build more reliable systems without killing iteration speed.
It’s built to support smoother workflows through:
Live previews – watch prompts update as you write
Instant test runs – check outputs without leaving your editor
Debuggable views – token-level highlights and prompt diffs
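If the "prompts as real code" framing feels abstract, here is a minimal sketch in plain Python (not BAML syntax, and assuming pydantic v2) of what structured and testable can mean: a typed output schema, a prompt wrapped in a function, and a unit test that stubs the model. The Invoice schema and the `call_llm` hook are invented for illustration.

```python
# Rough sketch of "prompts as code": typed output, callable prompt, unit test.
# `call_llm` and the Invoice example are hypothetical, not BAML's API.
from pydantic import BaseModel


class Invoice(BaseModel):
    vendor: str
    total_usd: float


PROMPT = (
    "Extract the vendor name and total amount (USD) from this invoice text. "
    "Reply with JSON using keys 'vendor' and 'total_usd'.\n\n{text}"
)


def extract_invoice(text: str, call_llm) -> Invoice:
    raw = call_llm(PROMPT.format(text=text))
    return Invoice.model_validate_json(raw)  # fails loudly if the output shape drifts


def test_extract_invoice():
    fake_llm = lambda prompt: '{"vendor": "ACME", "total_usd": 99.5}'
    result = extract_invoice("ACME Corp ... TOTAL: $99.50", fake_llm)
    assert result.vendor == "ACME" and result.total_usd == 99.5
```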
I'll give you a prompt: click below to listen.
UPCOMING VIRTUAL CONFERENCE
GenAI in Games, 3D & VFX
June 10 – 10:00–11:30 AM PDT / 19:00–20:30 CEST
MLOps Community and Tulip are hosting a live event with studios and builders using GenAI in production to reshape workflows across pre-vis, camera work, mesh generation, and more.
Talks cover:
Fine-tuning small models and running agent-based systems in creative pipelines
AI-assisted animation: from custom keyframes to full motion
Legal Q&A on IP, contracts, and the risks of AI-generated content
Real-world R&D wins (and friction) from production teams
Not your usual use case roundup - but packed with lessons about building GenAI tooling and navigating deployment in complex pipelines.
PODCAST
Product Metrics are LLM Evals
Evals can be humbling - a bit like thinking you’re nailing Duolingo, then your six-year-old casually switches languages and leaves you behind.
Raza argued that the best evals are just product metrics in disguise - what really matters is whether the user achieved their goal. In practice, that means building evals around real behavior, not just static labels.
For example, teams are using:
Behavior-driven signals: Tracking if users copy text, edit it, or follow up with an action
Prompt iteration workflows: Letting domain experts tweak prompts without redeploying code
CI-integrated evals: Automatically gating releases based on test case performance
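For the CI-integrated flavour in particular, here is a minimal sketch of the idea: replay a fixed case file through your assistant and fail the build when the pass rate drops. The case file format, the `run_assistant` stub, and the 90% threshold are placeholders, not anything Raza prescribes.

```python
# Minimal CI-gated eval: fail the pipeline if the pass rate falls below a bar.
# eval_cases.json, run_assistant, and the threshold are illustrative placeholders.
import json
import sys


def run_assistant(question: str) -> str:
    raise NotImplementedError("call your model / agent here")


def main(threshold: float = 0.9) -> None:
    with open("eval_cases.json") as f:
        cases = json.load(f)  # e.g. [{"question": "...", "must_contain": "..."}]

    passed = sum(
        case["must_contain"].lower() in run_assistant(case["question"]).lower()
        for case in cases
    )
    rate = passed / len(cases)
    print(f"eval pass rate: {rate:.0%} ({passed}/{len(cases)})")
    if rate < threshold:
        sys.exit(1)  # non-zero exit fails the CI job and blocks the release


if __name__ == "__main__":
    main()
```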
I'm sure this episode will match your eval expectations when you click below and listen.
WORLD TOUR - NEXT STOPS
The lowest-budget, highest-signal AI Agent non-conference conference
San Francisco - Done!
The tour continues:
Still more dates to be announced, and if you want to get involved, just let us know.
MEME OF THE WEEK
ML CONFESSIONS
So I was dealing with this bug in our explainability dashboard the other day. You know how it is with these compliance features - nobody really looks at them until there's a problem, and then suddenly everyone cares.
We'd been using SHAP for feature attribution, but something was clearly off. The plots looked completely wrong - like, the top features made zero sense. Of course, one of the senior folks noticed and called it out in front of everyone. "These don't look right," they said. Which... yeah, no kidding. Great timing, right?
I dug into it and found the issue pretty quickly. We had this mess with categorical variables where the one-hot encoding wasn't consistent between training and inference. Classic mistake, totally my fault, but also a real pain to fix properly. And I was already drowning in this other model migration project.
So here's where it gets bad. Instead of doing the right thing and fixing the encoding issue, I just... flipped the SHAP values. Literally just multiplied everything by -1. And you know what? Suddenly the plots looked "intuitive." People were nodding along, saying it matched their domain knowledge.
We pushed it to production.
That hack is still running in prod today. It shows up in audit presentations. I've started calling it a "directional reinterpretation" when people ask about the methodology.
But yeah, the signs are all backwards. On purpose. And I'm the only one who knows.
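For anyone tempted by the same shortcut, a rough sketch of the boring fix that got skipped: fit the categorical encoder once inside a pipeline and reuse that exact fitted object at inference (and when computing SHAP values), so training and serving see identical one-hot columns. The columns, data, and model below are invented for illustration.

```python
# Keep encoding consistent by fitting one pipeline and reusing it everywhere.
# Toy columns and data; the real fix is the pattern, not these values.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

train_df = pd.DataFrame({
    "region": ["eu", "us", "us", "apac"],
    "plan": ["pro", "free", "pro", "free"],
    "tenure_months": [12, 3, 30, 7],
    "churned": [0, 1, 0, 1],
})
features = ["region", "plan", "tenure_months"]

pipeline = Pipeline([
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["region", "plan"])],
        remainder="passthrough",
    )),
    ("model", GradientBoostingClassifier()),
])
pipeline.fit(train_df[features], train_df["churned"])

# At inference: the same fitted pipeline, not a fresh pd.get_dummies on whatever
# columns happen to show up; persist it (e.g. with joblib) and load it in serving.
serving_df = pd.DataFrame({"region": ["us"], "plan": ["pro"], "tenure_months": [5]})
print(pipeline.predict(serving_df[features]))
```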
Submit your confessions here.
BLOG
Prompt Deployment Goes Wrong: xAI Grok's obsession with White Genocide
With his black eye and K-hole trance passing for perfectly normal behaviour, Musk's latest Oval Office drop-in seemed to mirror xAI's recent state.
A late prompt edit caused Grok to inject political framing into answers to neutral queries - something better ops could have flagged. The post makes a case for treating prompts like code: versioned, tested, and cautiously deployed. It adds:
Progressive rollouts: Shadow and canary tests to catch regressions early.
Beyond human feedback: Include automated and long-term reliability checks.
Prompt changes as high-risk deploys: They need real safeguards.
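To make "cautiously deployed" concrete, here is a rough sketch of a prompt change shipped as a canary: the candidate system prompt only sees a small slice of traffic, tagged so stable and canary metrics can be compared before promotion. The prompt versions, the 5% split, and the `call_llm` hook are illustrative, not from the post.

```python
# Canary rollout for a prompt change: candidate prompt gets a small traffic
# slice, and every response is logged by version for stable-vs-canary comparison.
import random

PROMPTS = {
    "system_v12": "You are a helpful assistant. Answer neutrally and cite sources.",  # stable
    "system_v13": "You are a helpful assistant. Answer neutrally, cite sources, "     # candidate
                  "and refuse to speculate.",
}
CANARY_SHARE = 0.05  # 5% of requests get the candidate prompt


def pick_prompt() -> tuple[str, str]:
    version = "system_v13" if random.random() < CANARY_SHARE else "system_v12"
    return version, PROMPTS[version]


def handle_request(user_msg: str, call_llm) -> str:
    version, system_prompt = pick_prompt()
    answer = call_llm(system_prompt, user_msg)
    log_metrics(version, answer)  # feeds the stable-vs-canary comparison
    return answer


def log_metrics(version: str, answer: str) -> None:
    # In practice: emit to your metrics store keyed by prompt version, so a
    # regression shows up before the candidate is promoted to 100% of traffic.
    print(f"[{version}] answered with {len(answer)} chars")
```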
Save yourself from slipping into an avoidable hole and read the blog instead.
HIDDEN GEMS
An open-source GitHub project offering a SQL-based workflow engine for API integrations, designed to help data teams replace SaaS connectors with version-controlled, YAML-defined pipelines that support dynamic logic, HTTP calls, and database syncs.
A chance to have your say in the official 2025 Stack Overflow Developer Survey, which gathers insights on the tools, languages, and trends shaping the software industry, with results widely used by teams, hiring managers, and tech platforms alike.
An open-source cognitive architecture prototype built in Rust, exploring neural-symbolic integration for general intelligence through modular components like perception, memory, and reasoning.
An interactive site showcasing how popular vision-language models reproduce social stereotypes, with visual examples and tests that highlight underlying bias patterns across image-text pairings.


