Self-evolving modular AI beats ChatGPT and teaches itself a game

humanity.ai Solves Problem ChatGPT Can't (Plus: Epic Tower of Hanoi Solution)

See our modular AI system outperform GPT-4o: it nails tricky letter counts, spawns new experts on demand, and solves a 20-disc Tower of Hanoi, all in one take. Modular, self-evolving intelligence starts here.

Christopher Ford

24 Jul 2025

This summer marks a bright milestone for the humanity.ai team: we’re showcasing the first public demos of our self-evolving, modular architecture in action.

We’ve listened to the debate ignited by Apple’s “Illusion of Thinking” paper and Gary Marcus’ critique that today’s LLMs parrot at scale. We agree. Raw token prediction alone can’t deliver the kind of crisp, verifiable reasoning real-world systems demand. So we're building something different.

Our opportunity: modular, self-evolving AI

Instead of a single giant model, humanity.ai uses a conductor LLM that dynamically spins up small specialist experts (rule-based, generative, or ML) exactly when they’re needed. And if a new expert is needed, the system's Architect adds one. Each expert is auditable and small enough to train or swap on-device. The result is a system that thinks like a team, not a black box.
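To make the routing idea concrete, here is a minimal sketch, assuming a plain dictionary lookup stands in for the conductor LLM and a simple factory function stands in for the Architect. Every name in it (Conductor, register, solve, toy_architect) is hypothetical and chosen for illustration; none is taken from the humanity.ai codebase.

```python
from typing import Callable, Dict, List, Optional, Tuple

# An "expert" here is just a function from a task payload to an answer.
Expert = Callable[[str], str]
# An "architect" builds a new expert for a skill the conductor lacks.
Architect = Callable[[str], Expert]


class Conductor:
    """Hypothetical dispatcher; the real conductor is described as an LLM."""

    def __init__(self) -> None:
        self.experts: Dict[str, Expert] = {}
        self.log: List[Tuple[str, str]] = []  # audit trail of every action

    def register(self, skill: str, expert: Expert) -> None:
        self.experts[skill] = expert
        self.log.append(("register", skill))

    def solve(self, skill: str, payload: str,
              architect: Optional[Architect] = None) -> str:
        # If no expert covers the skill, ask the architect to create one.
        if skill not in self.experts:
            if architect is None:
                raise KeyError(f"no expert for skill: {skill}")
            self.register(skill, architect(skill))
        self.log.append(("dispatch", skill))
        return self.experts[skill](payload)


def toy_architect(skill: str) -> Expert:
    """Illustrative factory that only knows how to build a 'reverse' expert."""
    if skill == "reverse":
        return lambda text: text[::-1]
    raise ValueError(f"cannot build an expert for: {skill}")


conductor = Conductor()
conductor.register("upper", str.upper)
print(conductor.solve("upper", "abc"))
print(conductor.solve("reverse", "abc", architect=toy_architect))
print(conductor.log)
```

The log list is the point of the design in this sketch: because each registration and dispatch is recorded, a reader can replay which expert produced which answer.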

The challenge we’re tackling

In this demo, we show how traditional, monolithic LLMs can still stumble on grade-school puzzles; for example, GPT-4o miscounts the 'l's in “Lilly’s little lullaby.”

Then we show how our system, a self-evolving modular AI, handles the same task and gets it right on the first try.
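Letter counting is the kind of task a small rule-based expert can answer deterministically. The sketch below is one way such an expert might look; the function name count_letter and its case_sensitive flag are assumptions made for illustration, and the post does not say how the actual expert is implemented.

```python
def count_letter(text: str, letter: str, case_sensitive: bool = False) -> int:
    """Count occurrences of a single letter in text (hypothetical expert)."""
    if len(letter) != 1:
        raise ValueError("letter must be exactly one character")
    if not case_sensitive:
        # Fold both sides to lowercase so 'L' and 'l' are treated alike.
        text, letter = text.lower(), letter.lower()
    return text.count(letter)


phrase = "Lilly's little lullaby"
print(count_letter(phrase, "l"))                       # 8, counting the capital L
print(count_letter(phrase, "l", case_sensitive=True))  # 7, lowercase only
```

The two results differ because the phrase starts with a capital L, so whether the answer is 7 or 8 depends on how the question is read.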

But that's just the beginning. We then show, in the same session, how the system evolves by creating and adding new experts to solve problems it has not seen before; in this case, the Tower of Hanoi.

Demo highlights

Letter-count mastery: Our system nails the “Lilly’s little lullaby” exercise on the first try while GPT-4o needs three attempts.
Math on demand: Existing arithmetic experts handle fresh equations without extra prompting.
On-the-fly skill creation: When asked to solve Tower of Hanoi with two discs, the conductor notices the gap, autogenerates a brand-new expert, and verifies it live.
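Scaling to 20 discs: That newborn expert then tackles a 20-disc version without breaking a sweat. The jump is exponential: the shortest two-disc solution takes 3 moves, while the shortest 20-disc solution takes 2^20 - 1 = 1,048,575.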
Single session, zero edits: Every step unfolds in one recording, underscoring our verification loop and transparent reasoning.
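For readers who want to see what a Tower of Hanoi solver and its verification step could look like, here is a minimal sketch. It uses the textbook recursive algorithm and a simple peg simulation as the checker; the post does not describe how the autogenerated expert works internally, so the functions hanoi and verify and the peg labels A, B, C are assumptions for illustration only.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Move = Tuple[str, str]  # (source peg, destination peg)


def hanoi(n: int, src: str = "A", dst: str = "C", via: str = "B") -> Iterator[Move]:
    """Lazily yield the shortest move sequence for n discs (textbook recursion)."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, via, dst)  # park n-1 discs on the spare peg
    yield (src, dst)                        # move the largest disc
    yield from hanoi(n - 1, via, dst, src)  # bring the n-1 discs back on top


def verify(n: int, moves: Iterable[Move]) -> bool:
    """Replay the moves and check every rule, independently of the solver."""
    pegs: Dict[str, List[int]] = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            return False  # tried to move from an empty peg
        disc = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disc:
            return False  # tried to place a larger disc on a smaller one
        pegs[dst].append(disc)
    return pegs["C"] == list(range(n, 0, -1))


print(list(hanoi(2)))         # [('A', 'B'), ('A', 'C'), ('B', 'C')]
print(verify(20, hanoi(20)))  # True, after replaying 2**20 - 1 moves
```

Keeping the checker separate from the solver means a wrong move sequence is rejected even if the solver itself has a bug, which is the spirit of a verification loop.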

Why this matters

Accuracy first: Narrow experts plus iterative verification catch easy-to-miss errors. Miscounting the number of letters in a phrase may not seem like a big deal. But if an AI system can't get that right, what else is it getting wrong?
Resource-friendly: Only the relevant micro-models load into memory—ideal for edge devices.
Continual learning: The system promotes or retires experts just like a healthy team evolves.
Interpretable: Every move is logged and auditable.

What’s next

We're actively seeking AI researchers to contribute to our work via our PhD fellowship or internship programs.

Learn about our PhD research program
