Agent Building Tips from the Frontlines
In a week that included the release of Opus 4.1, Genie 3, and gpt-oss, shout-out to Médéric for dropping the most epic one.
PODCAST
9 Commandments for Building AI Agents
With the look come certain expectations. I’m still working on the whole water/wine thing, but in the meantime, I thought we could use some commandments for building AI agents.
I talked with Paul and Dmitri about what actually makes agents useful in production. Dynamic planning helps agents adapt mid-task, but risks going in circles. Memory is also key - not just user preferences, but remembering how to complete tasks well. That feeds back into agent reasoning and model training.
To scale this, they’ve focused on letting non-engineers build agents with internal tools. That’s led to new design tradeoffs, like:
Choosing tools: Agents may need to weigh cost, accuracy, and speed before selecting one (see the sketch after this list).
Execution shortcuts: Logged “successful paths” help agents skip repetitive steps.
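The episode stays at the whiteboard level, but here’s a minimal sketch of what “weigh cost, accuracy, and speed before selecting a tool” could look like in code. The tool names, weights, and numbers are all invented for illustration - they aren’t from the episode.

```python
from dataclasses import dataclass

@dataclass
class ToolProfile:
    name: str
    cost_per_call: float   # hypothetical: dollars per invocation
    accuracy: float        # hypothetical: 0-1 score from past evals
    latency_s: float       # hypothetical: median seconds per call

def pick_tool(tools: list[ToolProfile], w_cost=0.3, w_acc=0.5, w_speed=0.2) -> ToolProfile:
    """Rank tools by a weighted blend of accuracy, cost, and speed (illustrative weights)."""
    def score(t: ToolProfile) -> float:
        # Lower cost and latency are better, so they count against the score; accuracy counts for it.
        return w_acc * t.accuracy - w_cost * t.cost_per_call - w_speed * t.latency_s
    return max(tools, key=score)

tools = [
    ToolProfile("web_search", cost_per_call=0.002, accuracy=0.70, latency_s=1.5),
    ToolProfile("internal_kb", cost_per_call=0.0005, accuracy=0.85, latency_s=0.4),
]
print(pick_tool(tools).name)  # -> "internal_kb" with these made-up numbers
```

The weights are the whole tradeoff: an agent answering compliance questions would push accuracy way up, while a quick autocomplete helper would care far more about latency.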
The 10th commandment? Click below to listen.
HIDDEN GEMS
Thoughtful picks, each soundtracked with a song for the title and that extra vibe
Don’t Believe The Hype // Gem // Song
Arguing that widespread agent hype overlooks key limitations: error compounding, token‑cost inflation, and tool design challenges that make autonomous agents impractical at scale.
The Perfect Pair // Gem // Song
Practical tips for setting up pairwise LLM evaluations, with a focus on writing clear judge prompts, scoring individual metrics, and avoiding common pitfalls in comparison-based setups.
A GitHub repo hosting MIRIAD, a 5.8-million‑entry medical instruction‑response dataset grounded in peer‑reviewed literature, built to boost LLM accuracy in medical QA, support hallucination detection, and power clinical RAG tools.
Something Good Can Work // Gem // Song
Introducing the concept of “efficacy engineering,” a framework for system design that emphasizes whether systems actually work - focusing on inputs, decision logic, outputs, and evaluation pipelines for measurable impact.
PODCAST
The Hidden Bottlenecks Slowing Down AI Agents
Top tip from this episode: label your data for the price of a pizza by hosting labeling parties.
Some more conventional advice came up too, as we talked through build vs. buy across evals, orchestration, and observability. Evals are often bottlenecked by dataset creation, not tooling. For orchestration, building in-house gave more control and reliability than most off-the-shelf options.
Observability was a clear case where simple and familiar beat specialized:
Datadog handles both agents and infra, keeping everything unified.
Custom event streaming enables reruns, so there’s no need for separate replay tools (rough sketch after this list).
Fewer vendors means less friction, especially around compliance.
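The episode doesn’t share their implementation, but a minimal sketch of the “custom event streaming enables reruns” idea might look like the snippet below - assuming a simple JSON Lines log, with the file name and event fields made up for illustration.

```python
import json
from pathlib import Path

EVENT_LOG = Path("agent_events.jsonl")  # hypothetical location for the event stream

def log_event(run_id: str, event_type: str, payload: dict) -> None:
    """Append one agent event (tool call, model response, etc.) as a JSON line."""
    with EVENT_LOG.open("a") as f:
        f.write(json.dumps({"run_id": run_id, "type": event_type, "payload": payload}) + "\n")

def replay(run_id: str) -> list[dict]:
    """Read back every event for a run so the agent's steps can be re-executed or inspected."""
    with EVENT_LOG.open() as f:
        return [e for line in f if (e := json.loads(line))["run_id"] == run_id]

# Log as the agent runs, then replay later instead of reaching for a separate tool.
log_event("run-42", "tool_call", {"tool": "web_search", "query": "pizza near me"})
print(replay("run-42"))
```

Because every step is just an appended event, a rerun is a read of the same stream - no extra vendor required.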
The best tip? Grab yourself a pizza and click below to listen.
MEME OF THE WEEK
BLOG
Automating Knowledge Graph Creation with Gemini and ApertureDB - Part 2
So many tabs open, so many bookmarks, and you still can’t find what you need. To save you hunting, here’s part one - helpfully all about finding and organizing things.
Part 2 walks through how to extract relationships between entities using Gemini 2.5 Flash, then build a connected knowledge graph in ApertureDB. It covers relationship parsing with structured prompts and Pydantic models, batch-inserting links, and visualizing everything with PyVis and NetworkX (a rough sketch of that flow is below).
They also show how to link entities back to the source document:
Each entity is connected to the original PDF blob
This enables grounded retrieval workflows
The same pattern works for images, audio, or video too
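The post has the full code; as a taste, here’s a minimal sketch of the parse-and-visualize step. The Relationship schema, field names, and sample data are hypothetical, and the Gemini call and ApertureDB inserts are skipped entirely - see the post for those.

```python
from pydantic import BaseModel
import networkx as nx
from pyvis.network import Network

# Hypothetical schema for what a structured relationship-extraction prompt might return;
# the post's actual Pydantic models and field names may differ.
class Relationship(BaseModel):
    source: str     # entity the edge starts from
    target: str     # entity it points to
    relation: str   # edge label, e.g. "stores_data_in"

def build_graph(raw_items: list[dict]) -> nx.DiGraph:
    """Validate the model's JSON output and turn it into a directed graph."""
    graph = nx.DiGraph()
    for item in raw_items:
        rel = Relationship(**item)  # raises if the extraction is malformed
        graph.add_edge(rel.source, rel.target, label=rel.relation)
    return graph

# Pretend this came back from a Gemini 2.5 Flash structured-output call (invented sample).
sample = [{"source": "Gemini 2.5 Flash", "target": "ApertureDB", "relation": "writes_relationships_to"}]
graph = build_graph(sample)

net = Network(directed=True)
net.from_nx(graph)
net.write_html("knowledge_graph.html")  # open in a browser to explore the graph
```

Validating the model’s JSON against a schema before it touches the graph is what keeps one malformed extraction from quietly corrupting the edges.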
Click below to connect with this source document.
ML CONFESSIONS
We had a daily batch job to send out product recommendations over email. I’d made a quick change to add a deduplication step, using a new user ID column the data team had just added. It looked fine in the schema, so I pushed it without checking the contents.
That column was completely empty, which meant the deduplication logic didn’t run at all. We ended up emailing the exact same recommendations to the exact same group of people every night for a full week. It wasn’t obvious in the logs because the send counts looked normal and nobody reported anything broken.
Eventually I noticed it myself while debugging something else. I fixed the pipeline, re-ran the next batch, and didn’t mention it. It had already been long enough that pretending it hadn’t happened seemed easier than explaining why it had.
Share your confession here.
HOT TAKE
In AI tooling, forward compatibility is the new technical debt. If you’re not betting on what gets commoditized next, you’re building yourself into a corner.
Are you betting, or betting your corner will hold?