Not magic. It shows what it is doing.

vmbench
A Formal VM Benchmark That Shows Its Work

A formal VM benchmark plus an inspectable runtime. On this benchmark, pure-neural multi-step execution was weak, while verifier-guided search with ranking was strong and inspectable under bounded compute.

In academic language, this sits near test-time scaling, verifier-guided search, compute-efficient reasoning, and search-time compute.

candidate generation -> ranking -> verification -> budgeted search
  • old dream: the model figures it all out alone
  • what actually helped: try options, check them, pick better ones first
  • what the user sees: why one option won
  • product idea: a runtime that exposes search, verification, and budget use

What We Learned

What did not work well enough
  • The model could sometimes do one simple step, but not a whole chain reliably
  • Training it more was not the miracle fix
  • Scores on easier test slices looked better than the end-to-end outcome
  • So “just trust the model” was not enough
What started helping
  • Make a short list of possible next moves
  • Use a checker to test whether each move is valid
  • Try the better-looking moves first
  • Stop once a time or effort limit is reached
So the project changed direction. Instead of hidden “thinking”, it became visible step-by-step problem solving. The central thesis is narrow: on this benchmark and the stated splits, pure-neural multi-step execution was weak, while verifier-guided search with ranking was strong and inspectable under bounded compute.

Why We Took It Seriously

same task, options tried in a bad order: 0.3548 success rate
same task, ranked policy (heuristic or learned): 0.9677 success rate
harder set, learned ranking: 0.9911 success rate

Bad order
  • success: 35%
  • work spent: high
  • main problem: wastes effort on weak options

Hand-made order
  • success: 97%
  • work spent: much lower
  • harder task: 94%

Learned order
  • success: 97%
  • work spent: lowest
  • harder task: 99%
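The "work spent" gap between bad and ranked ordering has a simple toy explanation. A minimal sketch, with hypothetical numbers rather than the benchmark's actual search: if only one of N candidate moves at a step is valid, a uniformly random order checks (N+1)/2 candidates on average before finding it, while a good ranking checks close to 1.

```python
# Toy model of why ordering changes "work spent" (illustrative only;
# the candidate counts are hypothetical, not from the benchmark).
def expected_checks_random(n_candidates: int) -> float:
    """Expected verifier calls when the single valid move sits at a
    uniformly random position in the candidate list."""
    return (n_candidates + 1) / 2

def expected_checks_ranked(hit_rank: int) -> int:
    """Verifier calls when ranking places the valid move at hit_rank
    (1 = first). A perfect ranker gives hit_rank == 1."""
    return hit_rank

# With 10 candidates per step, random order averages 5.5 verifier
# calls per step; a ranker that puts the valid move first needs 1.
random_cost = expected_checks_random(10)   # 5.5
ranked_cost = expected_checks_ranked(1)    # 1
```

The same fixed budget therefore reaches several times more search depth under a good ranking, which is one way to read the success-rate gap above.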

These numbers are from the stated splits; full reporting would give variability across runs and compute used. This is why current papers talk so much about test-time scaling: extra compute at decision time can matter as much as training.

What A Person Can Actually See

1. See the options. Instead of one hidden answer, you can see the choices.

2. Control the effort. You can change how much work the system is allowed to spend.

3. Understand mistakes. When it fails, it says why instead of hiding the miss.

This is the whole point: the process is visible. More formally, the user is seeing a small verifier-guided search loop instead of one opaque generation. Public surface: MCP-first runtime and inspection layer, with CLI as the secondary benchmark/eval surface.

Live Demo

This slide uses the same real data as the public demo. It is not fake sample text. You can think of it as the public runtime surface for budgeted search, candidate ranking, and step verification.

How It Works

make options -> sort them -> check them -> keep trying -> show the result

Inside

  • make a short list of possible next steps
  • rank the list
  • check whether each step is valid
  • keep searching until budget runs out

Formal names: candidate generation, branch ranking, verification granularity, and compute-bounded search.
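The inside loop above can be sketched in a few lines of Python. A minimal sketch: `generate`, `rank`, `verify`, and `is_goal` are illustrative stand-in hooks, not the project's real API.

```python
# Minimal verifier-guided search loop: generate -> rank -> verify,
# bounded by a compute budget. All hooks are illustrative stand-ins.
def search(state, generate, rank, verify, is_goal, budget):
    """Run budgeted search; return (final_state, trace, budget_left)."""
    trace = []                                # visible record of the run
    while not is_goal(state) and budget > 0:
        chosen = None
        for cand in rank(generate(state)):    # best-looking moves first
            budget -= 1                       # every verifier call costs 1
            if verify(state, cand):           # checker rejects bad moves
                chosen = cand
                break
            trace.append(("rejected", cand))  # say why an option lost
            if budget == 0:
                break
        if chosen is None:                    # stuck, or budget ran out
            trace.append(("stuck", state))
            break
        trace.append(("accepted", chosen))
        state = chosen
    return state, trace, budget

# Toy task: count from 0 up to 3, by one or two, never overshooting.
final, trace, left = search(
    0,
    generate=lambda s: [s + 2, s + 1],
    rank=sorted,                     # toy ranking: prefer smaller jumps
    verify=lambda s, c: c <= 3,      # "valid" means no overshoot
    is_goal=lambda s: s == 3,
    budget=10,
)
```

The trace is the point: every rejected and accepted option is recorded, which is what the outside view exposes.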

Outside

  • see which option passed the check
  • change the effort limit
  • see the path taken
  • see why it failed
One active research question is optimal verification granularity: how often you should check intermediate steps under a fixed compute budget.
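The trade-off can be made concrete with a toy cost model (the costs and failure rate below are hypothetical, not measured on this benchmark): checking every step pays the verifier on every move, while checking every k-th step is cheaper per step but lets a bad step waste up to k-1 steps of work before it is caught.

```python
# Toy cost model for verification granularity (hypothetical numbers).
# Verify every k-th step: fewer verifier calls, but an invalid step
# can go unnoticed for up to k-1 further steps, wasting that work.
def total_cost(n_steps, k, step_cost=1.0, verify_cost=2.0, p_bad=0.1):
    """Expected compute for an n-step chain, checking every k steps.
    After a late catch, roughly k/2 steps of work are redone on average."""
    checks = n_steps // k
    rework = p_bad * n_steps * (k / 2) * step_cost  # wasted work on bad steps
    return n_steps * step_cost + checks * verify_cost + rework

# Sweep k: very frequent checking overpays the verifier, very sparse
# checking overpays rework; the optimum sits in between.
costs = {k: total_cost(100, k) for k in (1, 2, 5, 10, 25)}
best_k = min(costs, key=costs.get)   # with these toy numbers, best_k == 5
```

The shape, not the exact numbers, is the takeaway: optimal granularity depends on verifier cost and step failure rate, which is why it is a live question under a fixed budget.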

Academic Ideas Inside It

Test-time scaling

Instead of only training more, spend smart compute while solving.

Verifier-guided search

Do not trust one guess. Generate options and let a checker reject bad ones.

Verification granularity

Decide how often to check the work so you do not waste too much compute.

A simple way to say it:
do a little search, check the work, spend compute carefully, and keep the process visible.

What Is Still Weak

Hard truth
  • Some of the current quality still depends on shortcuts
  • The biggest shortcut is extra structure around where an option came from
  • When we remove that shortcut, harder cases get worse
Latest warning sign

The easy test improved.

The harder test still got worse.

So a nice small result does not automatically mean the real problem is solved.

That hidden gap is exactly what this project helps us see instead of cover up. This is also a practical warning in our own results: a system can look strong on easier slices while still depending on shortcuts.

Why It Is Still Worth Building

Controllability

You can move the budget and see the trade-off right away.

Observability

You can see the options, the path, and the failure type.

Trust

A checked step is easier to trust than a mysterious answer.

The best story here is not “the AI became magical”.
The best story is “we built a benchmark-and-runtime stack that shows its work.”

What We Need To Prove Next

Keep going if
  • the learned ordering keeps helping on harder tasks
  • most failures still look like effort or search-order problems
  • the system keeps being useful as a product, not just a lab result
Stop if
  • performance collapses when shortcuts are removed
  • the learned ordering stops adding anything beyond hand-made rules
  • the product becomes only pretty charts around hand-made rules
The next step is not another random training run. The next step is a harder transfer test while staying honest about generalization, search-time compute, and shortcut dependence.

Research Framing

Paper ideas
  • test-time scaling
  • verifier-guided search
  • verification granularity
  • search-time compute

What they mean here
  • test-time scaling: spend extra compute while solving, not only while training
  • verifier-guided search: generate options, reject bad ones, and use budget carefully
  • verification granularity: decide how often intermediate checking is worth the cost

What we built
  • candidate generation + ranking + verification + budgeted search
  • MCP-first tools for visible reasoning, with CLI as the secondary surface
  • a public runtime surface for controllable and inspectable search-time compute on this benchmark

Short version:
this is not a proof of hidden reasoning in the weights.
It is a concrete runtime for visible, verifier-backed, compute-bounded reasoning.
MCP shows runtime behavior, not chain-of-thought.