Why GPUs Aren't Your Bottleneck
Plus: when tests define code instead of validating it, Karpathy's agent-first workflow, and Docker sandboxes for safe AI coding
The Silicon Valley uniform of the Patagonia gilet is set to be replaced by the flat cap.
HOT TAKE
Best Before Date
Most AI infrastructure will be obsolete before it’s paid off. With new GPU architectures arriving every 4-6 months, yesterday’s billion-dollar bet is tomorrow’s thermal liability. Are you building for longevity or accepting the churn?
LAST WEEK’S TAKE
Tax Avoidance
While 14% are paying the luxury tax of an LLM, the majority are sticking to the CPU-bound efficiency of spaCy.
PRESENTED BY BOND AI
Bond AI: Join the world’s biggest AI events community
Bond AI is how AI developers, researchers, and founders find the best in-person meetups, conferences, and hackathons in San Francisco, New York, London, and Paris. Join 120k+ members to discover upcoming AI events or promote your own.
For organizers, Bond is the distribution and collaboration layer: we help you reach the right builders and researchers, connect with aligned sponsors that match your ICP, and tap a network of venues and high-quality speakers from leading AI teams such as Anthropic, OpenAI, and DeepMind, so your event is both well attended and well targeted.
HIDDEN GEMS
Curated finds to help you stay ahead
Internal AI agent automating OpenAI’s data discovery and SQL generation through natural language.
Karpathy on shifting to agent-led coding workflows, describing how he’s moved from traditional coding toward having AI agents handle most of the work.
AI-generated documentation tool for structured, searchable wikis, diagrams, and repo-aware chat over codebases.
Docker Sandboxes for AI coding agents, enabling isolated environments where AI coding agents can run, install packages, test code, and modify projects.
💡Job of the week
Head of Developer Relations // LaunchDarkly (Remote, US)
LaunchDarkly builds feature management and delivery tooling used by software teams running production systems. This role leads developer relations and experience, shaping how developers learn, adopt, and provide feedback on the platform across products, documentation, and community channels.
Responsibilities
Define and execute developer relations and experience strategy across products
Lead and scale a multidisciplinary DevRel and advocacy team
Partner with product and engineering on onboarding and developer workflows
Establish structured feedback loops representing developer needs internally
Requirements
Extensive experience in developer relations or technical advocacy leadership
Strong technical background with APIs, SDKs, and developer workflows
Proven record building developer programs, communities, and educational content
Comfortable representing products publicly at events and online communities
Find more roles on our new jobs board - and if you want to post a role, get in touch.
AGENT INFRASTRUCTURE
When Tests Become the Product
In a recent post, David Breunig described a deliberately strange experiment: a software library with no checked-in implementation. Instead of code, the repository contains a written specification and a set of language-agnostic tests. To “install” the library, you run an LLM and ask it to generate an implementation that satisfies those tests.
The example is intentionally narrow. But it exposes a shift that matters for teams already relying on agent-written code: tests are no longer validating an implementation. They are defining it.
That inversion changes what testing and debugging even mean.
In a conventional library, tests sit downstream of code. When something breaks, you inspect a concrete implementation, follow the execution path, and decide whether the test or the code is wrong. Even if the code is messy, the thing you are debugging is stable.
In a spec-first, generated model, that anchor disappears. A failing test does not point to a line of code. It points back to a description of intended behavior and a probabilistic compiler that may emit a different implementation each time it runs.
Debugging becomes indirect.
If a relative time formatter outputs “1 hours ago” instead of “1 hour ago”, the fix is ambiguous. Do you tighten the test? Clarify the spec language? Add a new edge case? Regenerate and hope the model lands somewhere better? Each change reshapes the behavior surface in ways that are harder to reason about than a small diff.
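To make the inversion concrete, here is a minimal sketch of what the checked-in artifact might look like. The function name `time_ago` and the specific test corpus are illustrative assumptions, not taken from the original post; the point is that the tests are the behavioral fence, with no implementation alongside them.

```python
# Hypothetical spec-first test corpus: no implementation is checked in.
# An LLM is asked to generate a time_ago(seconds) function that makes
# these cases pass; the corpus defines the behavior rather than
# validating an existing one.
CASES = [
    (30, "just now"),
    (60, "1 minute ago"),    # singular, not "1 minutes ago"
    (120, "2 minutes ago"),
    (3600, "1 hour ago"),    # the "1 hours ago" bug lives here
    (7200, "2 hours ago"),
]

def check(time_ago):
    """Run the corpus against whatever implementation was generated."""
    for seconds, expected in CASES:
        got = time_ago(seconds)
        assert got == expected, (
            f"time_ago({seconds}) -> {got!r}, want {expected!r}"
        )
```

Notice that fixing “1 hours ago” means deciding whether the 3600-second case belongs in the corpus, the spec prose, or both; the corpus is now part of the product.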
This introduces failure modes that don’t show up in traditional libraries:
Regeneration fixes one failing case while subtly altering others that still pass existing tests.
Behavior changes without any source diff because the model version or decoding path shifted.
Bugs stop being reproducible because the regenerated code has a different internal structure, even though the spec is unchanged.
At that point, “all tests passing” stops being a strong signal of stability. Tests confirm alignment with the spec, but they do not guarantee consistent behavior across time, environments, or installs.
The trade-off only makes sense under fairly tight constraints. The domain needs to be small, deterministic, and exhaustively specifiable. Inputs and outputs must be enumerable enough that tests can meaningfully fence in behavior. Relative time formatting fits. Parsing, validation, and simple normalization often do. Anything with performance sensitivity, security implications, or complex state usually does not.
For teams working on agent infrastructure, this pushes testing and debugging up a layer. The artifacts that now matter most are not functions and files, but:
the spec itself
the test corpus
the generation configuration
the model identity
If you want reproducibility, you likely need pinned models, cached generations, and generated code treated as a build artifact rather than something ephemeral. If you want auditability, you need to record the exact inputs that produced a given implementation. If you want debuggability, you need ways to reason about why a spec-test pair yields certain behavior, not just whether it passes.
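One way to picture that posture is to content-address each generation by everything that determined it and cache the result as a build artifact. This is a sketch under stated assumptions, not an implementation from the post: the key fields (`spec`, `tests`, `model_id`, `seed`) and the `generate` callable are hypothetical stand-ins.

```python
import hashlib
import json
from pathlib import Path

def generation_key(spec: str, tests: str, model_id: str, seed: int) -> str:
    """Content-address a generation by the exact inputs that produced it."""
    payload = json.dumps(
        {"spec": spec, "tests": tests, "model": model_id, "seed": seed},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def install(spec: str, tests: str, model_id: str, seed: int,
            cache_dir: Path, generate) -> str:
    """Return pinned generated source if the inputs match a cached
    artifact; otherwise generate once and pin the result."""
    key = generation_key(spec, tests, model_id, seed)
    artifact = cache_dir / f"{key}.py"
    if artifact.exists():
        # Reproducible path: same spec, tests, model, and seed
        # always yield the same bytes.
        return artifact.read_text()
    source = generate(spec, tests)  # the only nondeterministic step
    artifact.write_text(source)
    return source
```

The design choice is that regeneration only happens when one of the recorded inputs changes, which is exactly the audit trail the paragraph above asks for.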
Seen this way, the “library with no code” idea is less about eliminating code and more about relocating control. Code becomes disposable. Specs and tests become the maintained surface. LLMs sit uncomfortably in the middle as a compiler that is powerful, opaque, and only partially deterministic.
That shift echoes a familiar pattern. As Rich Sutton argued in The Bitter Lesson, compute-driven approaches tend to win, even when they reduce human interpretability. The cost is that control and understanding move elsewhere.
This experiment shows where that cost lands for agent-written systems: squarely in testing, debugging, and operational ownership.
Useful links
Original post - A Software Library with No Code
Reproducible Builds - Why deterministic compilation matters for security and debugging
MLOPS COMMUNITY
Speed and Scale: How Today’s AI Datacenters Are Operating Through Hypergrowth
AI infra is running so hot that some operators are dropping turbines in parking lots to get power fast enough. The hardest part is rarely “buy more GPUs” - it is a rotating set of constraints that can stall a whole build.
Bottlenecks keep moving: power, cooling, cabling, vendor lead times, and on-site racking all take turns as the limiter.
The fix is data, end to end: turn PDF “reference architectures” into a shared, programmatic model so procurement and design stop drifting.
Ops needs physical plus logical observability so you can spot intent vs reality fast and route humans or robots to the right fix.
Without clean inventory and intent data, “time to first train” becomes a logistics failure.
Stop Guessing: A Systematic Guide to Fixing CUDA Out of Memory Errors in GRPO Training
A single GRPO run can blow up with “CUDA out of memory” even when your GPU looks mostly free, because the real hogs are hiding in plain sight. This piece shows how to stop guessing and start budgeting GPU memory like an engineer.
Read the error message as a budget gap, then pinpoint the exact forward-pass hotspot.
Break usage into model weights, vLLM reservation, and training activations - and rank the levers that matter most.
Preserve training dynamics by cutting the right knobs and using gradient accumulation to keep the effective batch size stable.
Do the math first, change fewer things, and get back to training faster.
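The “do the math first” step is simple enough to sketch as a back-of-envelope budget. This is a generic calculator, not code from the article: the byte counts (bf16 weights, Adam with fp32 state) and field names are illustrative assumptions, and activation and vLLM figures are taken as measured inputs.

```python
def gpu_memory_budget_gib(
    params_b: float,                      # model parameters, in billions
    bytes_per_param: int = 2,             # assumes bf16 weights
    optimizer_bytes_per_param: int = 12,  # assumes Adam: fp32 m, v, master copy
    vllm_reservation_gib: float = 0.0,    # memory pre-reserved for rollouts
    activations_gib: float = 0.0,         # measured forward-pass peak
) -> dict:
    """Rough per-GPU budget so you can rank the levers before
    touching any training knobs."""
    gib = 1024 ** 3
    weights = params_b * 1e9 * bytes_per_param / gib
    optimizer = params_b * 1e9 * optimizer_bytes_per_param / gib
    budget = {
        "weights_gib": weights,
        "optimizer_gib": optimizer,
        "vllm_gib": vllm_reservation_gib,
        "activations_gib": activations_gib,
    }
    budget["total_gib"] = sum(budget.values())
    return budget

# Example: a 7B model with 20 GiB reserved by vLLM and 8 GiB of
# measured activations.
budget = gpu_memory_budget_gib(7, vllm_reservation_gib=20, activations_gib=8)
```

Under these assumptions the 7B example puts weights around 13 GiB and optimizer state around 78 GiB, which makes clear why the optimizer, not the weights, is usually the lever to attack first.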
IN PERSON EVENTS
London - February 19
Mountain View, CA - March 3
VIRTUAL EVENTS
Serving LLMs in Production: Performance, Cost & Scale - February 5
Coding Agents: Virtual Conference - February 11
MEME OF THE WEEK
ML CONFESSIONS
Testing Myself
Spent three days debugging why our production model’s accuracy dropped 8%. Eventually traced it back to a data preprocessing script I’d copy-pasted from an AI agent a month earlier. It had quietly converted all our temperature readings from Celsius to Fahrenheit.
The comment even said # assuming input is in Fahrenheit. I just hadn’t read it.
When my manager asked what changed, I said “seasonal drift.” She nodded. I fixed it later that night.
Share your confession here.