How Small Models Think Big
Plus… stress-testing MCP agents, securing GCP endpoints, and scaling ML jobs on Kubernetes.
Is impulse shopping the new Turing test?
HOT TAKE
Rubric Roulette
Benchmarks built by LLMs, for LLMs, judged by LLMs.
Progress or a feedback loop with better grammar?
LAST WEEK’S TAKE
Less is more
At least “less but smarter” can take the moral support to the bank.
PRESENTED BY PROSUS
Agents in Production - November 18
How are the world’s top teams deploying AI agents in production? Find out from those doing it at scale.
Hear from engineers at Meta, OpenAI, Google, NVIDIA and more as they unpack the real challenges and breakthroughs of putting agents to work in live systems.
Aditya Gautam (Meta): Multi-agent pipelines for misinformation detection and correction.
Teodora Musatoiu (OpenAI): The evolution of enterprise agent deployments.
Jasleen Singh (Google): Creating a universal language for AI collaboration.
Also featuring a panel with Arushi Jain (Microsoft), Swati Bhatia (Google), and Julia Rose (Inworld AI) on hardening agents for e-commerce scale - exploring RL alignment, reliability, and infrastructure challenges.
Technical depth, real implementation, and a format that keeps your attention.
HIDDEN GEMS
Curated finds to help you stay ahead
Manage distributed ML training on Kubernetes with an open-source interface built for scalable, fault-tolerant, and resource-efficient workloads that run directly within cluster infrastructure.
Test how LLMs handle broken contexts before deployment with a framework that applies confidence scores, timestamps, and structured metadata to simulate and evaluate conflicting state.
LLM training guide from Hugging Face that covers strategy, data, ablations, architecture, post-training, and large-scale infrastructure.
Benchmarking agentic AI startups, focusing on how they deploy, price, and scale within enterprises, with insights on autonomy, accuracy, integration hurdles, and why “think small” rollouts often drive lasting adoption.
Job of the week
Senior Machine Learning Engineer // Veho (Remote)
Veho focuses on logistics and delivery optimization through data-driven systems. This role bridges ML platform development and applied modeling, building and maintaining production-grade infrastructure for forecasting, routing, and pricing models at scale.
Responsibilities
Build and maintain scalable ML infrastructure and tooling
Develop data pipelines for model training and evaluation
Monitor and optimize production model performance
Standardize deployment and experimentation processes across teams
Requirements
5+ years in ML engineering or MLOps roles
Strong experience with Apache Flink and data pipelines
Proven track record deploying and maintaining ML models
Solid understanding of monitoring and automation in production systems
Find more roles on our new jobs board - and if you want to post a role, get in touch.
MLOPS COMMUNITY
Fine-Tuned Models Are Getting Out of Hand
When agents start watching your screen, that’s when things get real. The discussion explores what happens when enterprise AI shifts from chat interfaces to full-blown digital coworkers that see how you work, learn your decisions, and start acting on your behalf.
Fast vs slow data: RAG handles “fast” factual retrieval, but fine-tuned small models capture the why behind decisions - the slow data that defines judgment.
Small models, big context: SLMs can be trained cheaply to mirror individual workflows, encoding how specific roles think and act.
From chatbots to doers: The next step is agents that control tools directly - editing, emailing, or coding with human-like intuition.
The endgame? A billion personalized small models, each fine-tuned to how you work.
Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Agents that use live MCP tools still stumble - in a new 101-task benchmark, even top models solved only ~58% of multi-step, screen-realistic jobs.
How it’s built: 101 time-varying tasks across real MCP servers, with a reference run plus a test agent. Scoring checks both the final answer and the tool-use trajectory to reduce drift.
What breaks: Wrong tool picks and bad parameters dominate, plus “answer without tools” misfires. More iterations help up to ~30, but token waste rises.
Caveats: One-shot runs and an LLM-as-judge introduce variance and potential self-preference.
Net: promising, but tool reliability and evaluation hygiene must improve before trusting agents with critical workflows.
Deploying AI Agents in the Enterprise without Losing your Humanity using ADK and Google Cloud
Shipping an agent without punching holes in your VPC is possible. This piece shows how to expose UI, API, and agent-to-agent endpoints on GCP while keeping identity end-to-end.
Identity propagation: use IAP at the edge, IAM for policy, and signed JWTs with strict audience for A2A.
Cloud Run sweet spot: one container serves UI, API, and A2A, scales to zero, simple ops.
Vertex AI Agent Engine: managed API path with monitoring, limited flexibility for custom UI or A2A.
Bottom line: secure exposure today with Cloud Run + IAP, while AgentSpace matures.
IN PERSON EVENTS
VIRTUAL EVENTS
Agents in Production - MLOps x Prosus - November 18
Learn how leading teams from OpenAI, NVIDIA, Meta, and Google DeepMind are turning agentic AI experiments into production systems.
MEME OF THE WEEK
ML CONFESSIONS
A little space
I once spent an entire afternoon trying to figure out why our training job was crashing halfway through every epoch. I combed through data loaders, batch sizes, GPU memory... everything. Turns out the culprit was a single invisible space at the end of a column name in the CSV. The model was trying to reference label but the column was actually label . Hours of logs, debugger runs, and print statements — all because of one trailing space. I renamed it, hit run, and it worked instantly. I just sat there, staring at the monitor, questioning my life choices.
Share your confession here.



