The Nov Tech

Your Software’s Next Update Won’t Be Written by a Human

The Rise of Self-grown AI — and Why the Harness Is Now the Most Important Thing in Technology

Novy Baf
Apr 06, 2026
Image by Novy Bafouka

I write a lot of code. Not elegantly, and certainly not at the pace engineers I admire manage, but enough to understand what the process feels like: the slow friction of trial and error, the two hours spent debugging a function that took ten minutes to write.

So when I read that researchers had left an AI agent to iterate on its own code overnight and came back to find it had run 700 experiments, discovered 20 optimizations, and caught a bug its human creator had missed after months of manual work — I had to sit with that for a moment.

This is no longer a research curiosity. The system works. And it is just the latest example of a pattern that is rewriting how software is built.

The Concept You Need to Understand First

Before anything else, there is a concept that has quietly become the most important word in AI engineering right now: the harness.

When we talk about a model like Claude, Gemini, or ChatGPT, we are really talking about model weights — the raw output of training. Billions of parameters, frozen in place. That model on its own is like a Formula 1 engine sitting in a garage. It is powerful, but it goes nowhere. The harness is everything that surrounds it: the code that tells the model what to store in memory, when to retrieve information, how to write and execute code, how to navigate files, and how to loop back and check its own work. It is what turns a brilliant but passive language model into an agent that can work autonomously for hours on a complex task.

Put that Formula 1 engine on a garden trailer, and it wins nothing. Build it into a chassis designed by the best aerodynamic engineers on Earth, and the same engine becomes untouchable.
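Stripped to its essentials, a harness is just a loop around a frozen model: store, retrieve, act, check, repeat. Here is a minimal sketch of that loop; every function and variable name is hypothetical, and `call_model` stands in for whatever LLM API you use.

```python
def call_model(prompt: str) -> str:
    """Placeholder for a model call (e.g. an API request). Hypothetical."""
    return "proposed_action"

def run_harness(task: str, max_steps: int = 5) -> list[str]:
    memory: list[str] = []                    # what the harness stores between steps
    for step in range(max_steps):
        context = "\n".join(memory[-3:])      # what it retrieves: recent history
        action = call_model(f"Task: {task}\nHistory:\n{context}")
        result = f"step {step}: executed {action}"  # where it would run code, edit files
        memory.append(result)                 # loop back so the next step can check this one
    return memory

log = run_harness("fix the failing test")
print(len(log))  # one memory entry per step
```

The model call never changes; everything else in the loop is the harness, and that is exactly the part that can now be rewritten automatically.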

Until recently, harnesses were written entirely by hand. Human engineers built the scaffolding around AI models line by line, adjustment by adjustment, through months of iterative work. It was slow, expensive, and limited by human creativity and stamina. What changed in the past few months is that the harness itself can now be automated, and the results are not just comparable to what humans produce.

They are better.

When AI Writes the Scaffolding Itself

A team of researchers from Stanford, MIT, and KRAFTON recently published Meta-Harness, a system that automates harness design entirely. The principle is disarmingly simple: take an AI programming agent, give it access to the full codebase of an existing system along with performance scores from every previous version, and tell it to improve things. The agent reads the code, identifies weak points, proposes modifications, tests the new version, and repeats — ten times, twenty times, fifty times. Each iteration contributes to what the researchers call the system’s collective memory.

Where earlier approaches compressed everything into a few thousand tokens of context, Meta-Harness gives the agent access to 10 million tokens per step. Critically, it is not limited to looking at successes. It can also examine past failures, because understanding why something did not work is often more instructive than understanding why it did.
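The outer loop described in the paper can be sketched in a few lines: mutate the harness codebase, score it, and keep every version, including the failures, in a shared archive that later iterations can read. This is a toy illustration under my own naming, not the paper's implementation; `mutate` and `evaluate` are stand-ins for an AI-proposed code edit and a real benchmark run.

```python
import random

random.seed(0)

def mutate(code: str) -> str:
    # Stand-in for the agent proposing a modification to the harness code
    return code + f"+tweak{random.randint(0, 99)}"

def evaluate(code: str) -> float:
    # Stand-in for running the benchmark suite on this harness version
    return random.random()

# The "collective memory": every version ever tried, with its score
archive = [("baseline_harness", evaluate("baseline_harness"))]
for _ in range(20):
    parent, _score = max(archive, key=lambda v: v[1])  # build on the best so far
    child = mutate(parent)
    archive.append((child, evaluate(child)))           # failures are kept too

best_code, best_score = max(archive, key=lambda v: v[1])
print(len(archive))
```

The key design choice the text highlights is that nothing is discarded: the archive holds low-scoring versions alongside winners, so the agent can study why an idea failed.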

The benchmark results are difficult to dismiss. On a text classification task, the automatically discovered harness reached a mean score of 48.6%, where the best competing method designed by human engineers plateaued at 40.9%. That same auto-generated harness used four times fewer tokens than its rivals. On 200 math problems at the international olympiad level, the system improved accuracy by nearly 5 percentage points across five different models. On a terminal task benchmark measuring an agent’s ability to complete complex operations autonomously, the self-strengthened harness reached 76.4% against the best human-designed harness at 74%. For smaller base models, the margin was even more pronounced.

Independent research published the same month reinforced the core finding from a different angle. As researcher Nate B. Jones demonstrated in March 2026, the same underlying AI model can swing from a 42% to a 78% success rate on identical coding benchmarks based solely on the quality of the surrounding harness. The model did not change. The weights did not change. Only the scaffolding did.

The Open-Source Experiment That Should Get More Attention

One of the most striking demonstrations of this idea came not from a well-funded research lab but from an open-source experiment that quietly exploded in March 2026. The setup was intentionally minimal: give an AI agent a small language model training environment running on a single GPU and let it experiment autonomously overnight.

The agent modifies the training code, runs a five-minute training session, checks whether the result improved, and keeps or discards the change. The entire system — agent, loop, evaluation — fits into 630 lines of Python. No complex infrastructure, no distributed clusters, no configuration overhead.
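The loop described above is simple hill-climbing: propose a change, run a short training session, keep the change only if the score improved. A toy version, with all functions as illustrative stand-ins rather than the project's actual code, looks like this:

```python
import random

random.seed(42)

def propose_change(code: str) -> str:
    # Stand-in for the agent editing the training code
    return code + f";edit{random.randint(0, 999)}"

def short_training_run(code: str) -> float:
    # Stand-in for a five-minute training session returning a score
    return random.random()

code = "baseline_trainer"
best = short_training_run(code)
experiments, kept = 0, 0
for _ in range(100):
    candidate = propose_change(code)
    score = short_training_run(candidate)
    experiments += 1
    if score > best:                       # keep the change only if it improved
        code, best, kept = candidate, score, kept + 1
print(experiments, kept)
```

Nothing here requires distributed infrastructure, which is the point: the expensive part is the training runs, not the loop that decides which changes survive.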

In two days: 700 experiments, 20 optimizations discovered, and several architectural modifications that no human involved in the project had previously considered. Most strikingly, the agent found a bug in the attention mechanism implementation that its creator had missed after months of manual review. Six hundred and thirty lines of Python, one GPU, two days. That is the scale of the disruption.

Google’s System Used Itself to Improve Itself

If the Meta-Harness paper established that AI can write better harnesses than humans, Google DeepMind's AlphaEvolve went further still. Powered by Gemini and built on the lineage of FunSearch, AlphaEvolve is a coding agent that does not generate code once; it evolves it. The system produces candidate algorithms, tests them against automated evaluators, keeps the best, and sends them back into the loop as starting material for the next generation. It is natural selection applied to software.
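Unlike the single-lineage loop of the open-source experiment, this is a population-based process: score a generation, keep the fittest, breed variants from them. A minimal sketch under my own naming (this is not AlphaEvolve's actual API):

```python
import random

random.seed(1)

def evaluator(program: str) -> float:
    # Stand-in for an automated evaluator scoring a candidate algorithm
    return random.random()

def vary(parent: str, i: int) -> str:
    # Stand-in for the model generating a variant of a parent program
    return f"{parent}->v{i}"

population = [f"seed{i}" for i in range(8)]
for generation in range(5):
    scored = sorted(population, key=evaluator, reverse=True)
    parents = scored[:2]                               # selection: keep the best
    population = parents + [vary(p, i)                 # variation: breed the rest
                            for i, p in enumerate(parents * 3)]
print(len(population))
```

The automated evaluator is what makes the whole scheme work: because fitness can be measured without a human in the loop, the cycle can run for thousands of generations.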

The results were remarkable across several domains. AlphaEvolve discovered a more efficient way to multiply matrices — a foundational mathematical operation underlying nearly all AI and computer graphics — that improved upon Strassen’s 1969 method, which had stood as the best-known approach for over 50 years. It optimized Google’s Borg scheduling heuristic, the system that allocates compute jobs across millions of servers worldwide, recovering an average of 0.7% of the company’s global computing resources continuously. At Google’s scale, 0.7% is not a rounding error.
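For context on what was improved: Strassen's 1969 insight was that two 2×2 matrices can be multiplied with 7 multiplications instead of the naive 8, and applying the trick recursively to block matrices lowers the asymptotic cost of matrix multiplication. The classic identity, verified on a small example:

```python
def strassen_2x2(A, B):
    # Strassen's 1969 scheme: 7 multiplications (m1..m7) instead of 8
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# matches the ordinary product: [[19, 22], [43, 50]]
```

Shaving even one multiplication from a scheme like this compounds enormously once it is applied recursively at data-center scale, which is why AlphaEvolve's improvement matters.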

And then it turned on itself. AlphaEvolve optimized the matrix multiplication kernel used in training Gemini, the very AI family that powers AlphaEvolve, achieving a 23% speedup for that specific operation. The loop closed. An AI system improving the training process of the AI system that runs it. Recursive self-improvement, not in the speculative sense that AI safety researchers discuss, but in production, at one of the most computationally powerful organizations on Earth.
