Not magic. It shows what it is doing.

vmbench
A Formal VM Benchmark That Shows Its Work

A formal VM benchmark plus an inspectable runtime. On this benchmark, pure-neural multi-step execution was weak, while verifier-guided search with ranking was strong and inspectable under bounded compute.

In academic language, this sits near test-time scaling, verifier-guided search, compute-efficient reasoning, and search-time compute.

candidate generation -> ranking -> verification -> budgeted search
  • old dream: the model figures it all out alone
  • what actually helped: try options, check them, pick better ones first
  • what the user sees: why one option won
  • product idea: a runtime that exposes search, verification, and budget use

What We Learned

What did not work well enough
  • The model could sometimes do one simple step, but not a whole chain reliably
  • Training it more was not the miracle fix
  • Scores on easier test slices looked better than the end-to-end outcome
  • So “just trust the model” was not enough
What started helping
  • Make a short list of possible next moves
  • Use a checker to test whether each move is valid
  • Try the better-looking moves first
  • Stop once a time or effort limit is reached
So the project changed direction. Instead of hidden “thinking”, it became visible step-by-step problem solving. The central thesis is narrow: on this benchmark and the stated splits, pure-neural multi-step execution was weak, while verifier-guided search with ranking was strong and inspectable under bounded compute.

Why We Took It Seriously

same task, options tried in a bad order: 0.3548 success rate
same task, ranked policy (heuristic or learned): 0.9677 success rate
harder set, learned ranking: 0.9911 success rate

Bad order
  • success: 35%
  • work spent: high
  • main problem: wastes effort on weak options

Hand-made order
  • success: 97%
  • work spent: much lower
  • harder task: 94%

Learned order
  • success: 97%
  • work spent: lowest
  • harder task: 99%
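The "work spent" gap between bad and ranked ordering has a simple toy explanation. A minimal sketch, with hypothetical numbers rather than the benchmark's actual search: if only one of N candidate moves at a step is valid, a uniformly random order checks (N+1)/2 candidates on average before finding it, while a good ranking checks close to 1.

```python
# Toy model of why ordering changes "work spent" (illustrative only;
# the candidate counts are hypothetical, not from the benchmark).
def expected_checks_random(n_candidates: int) -> float:
    """Expected verifier calls when the single valid move sits at a
    uniformly random position in the candidate list."""
    return (n_candidates + 1) / 2

def expected_checks_ranked(hit_rank: int) -> int:
    """Verifier calls when ranking places the valid move at hit_rank
    (1 = first). A perfect ranker gives hit_rank == 1."""
    return hit_rank

# With 10 candidates per step, random order averages 5.5 verifier
# calls per step; a ranker that puts the valid move first needs 1.
random_cost = expected_checks_random(10)   # 5.5
ranked_cost = expected_checks_ranked(1)    # 1
```

The same fixed budget therefore reaches several times more search depth under a good ranking, which is one way to read the success-rate gap above.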

These numbers are from the stated splits; full reporting would give variability across runs and compute used. This is why current papers talk so much about test-time scaling: extra compute at decision time can matter as much as training.

What A Person Can Actually See

1. See the options. Instead of one hidden answer, you can see the choices.

2. Control the effort. You can change how much work the system is allowed to spend.

3. Understand mistakes. When it fails, it says why instead of hiding the miss.

This is the whole point: the process is visible. More formally, the user is seeing a small verifier-guided search loop instead of one opaque generation. Public surface: MCP-first runtime and inspection layer, with CLI as the secondary benchmark/eval surface.

Live Demo

This slide uses the same real data as the public demo. It is not fake sample text. You can think of it as the public runtime surface for budgeted search, candidate ranking, and step verification.

How It Works

make options -> sort them -> check them -> keep trying -> show the result

Inside

  • make a short list of possible next steps
  • rank the list
  • check whether each step is valid
  • keep searching until budget runs out

Formal names: candidate generation, branch ranking, verification granularity, and compute-bounded search.
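The inside loop above can be sketched in a few lines of Python. A minimal sketch: `generate`, `rank`, `verify`, and `is_goal` are illustrative stand-in hooks, not the project's real API.

```python
# Minimal verifier-guided search loop: generate -> rank -> verify,
# bounded by a compute budget. All hooks are illustrative stand-ins.
def search(state, generate, rank, verify, is_goal, budget):
    """Run budgeted search; return (final_state, trace, budget_left)."""
    trace = []                                # visible record of the run
    while not is_goal(state) and budget > 0:
        chosen = None
        for cand in rank(generate(state)):    # best-looking moves first
            budget -= 1                       # every verifier call costs 1
            if verify(state, cand):           # checker rejects bad moves
                chosen = cand
                break
            trace.append(("rejected", cand))  # say why an option lost
            if budget == 0:
                break
        if chosen is None:                    # stuck, or budget ran out
            trace.append(("stuck", state))
            break
        trace.append(("accepted", chosen))
        state = chosen
    return state, trace, budget

# Toy task: count from 0 up to 3, by one or two, never overshooting.
final, trace, left = search(
    0,
    generate=lambda s: [s + 2, s + 1],
    rank=sorted,                     # toy ranking: prefer smaller jumps
    verify=lambda s, c: c <= 3,      # "valid" means no overshoot
    is_goal=lambda s: s == 3,
    budget=10,
)
```

The trace is the point: every rejected and accepted option is recorded, which is what the outside view exposes.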

Outside

  • see which option passed the check
  • change the effort limit
  • see the path taken
  • see why it failed
One active research question is optimal verification granularity: how often you should check intermediate steps under a fixed compute budget.
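The trade-off can be made concrete with a toy cost model (the costs and failure rate below are hypothetical, not measured on this benchmark): checking every step pays the verifier on every move, while checking every k-th step is cheaper per step but lets a bad step waste up to k-1 steps of work before it is caught.

```python
# Toy cost model for verification granularity (hypothetical numbers).
# Verify every k-th step: fewer verifier calls, but an invalid step
# can go unnoticed for up to k-1 further steps, wasting that work.
def total_cost(n_steps, k, step_cost=1.0, verify_cost=2.0, p_bad=0.1):
    """Expected compute for an n-step chain, checking every k steps.
    After a late catch, roughly k/2 steps of work are redone on average."""
    checks = n_steps // k
    rework = p_bad * n_steps * (k / 2) * step_cost  # wasted work on bad steps
    return n_steps * step_cost + checks * verify_cost + rework

# Sweep k: very frequent checking overpays the verifier, very sparse
# checking overpays rework; the optimum sits in between.
costs = {k: total_cost(100, k) for k in (1, 2, 5, 10, 25)}
best_k = min(costs, key=costs.get)   # with these toy numbers, best_k == 5
```

The shape, not the exact numbers, is the takeaway: optimal granularity depends on verifier cost and step failure rate, which is why it is a live question under a fixed budget.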

Academic Ideas Inside It

Test-time scaling

Instead of only training more, spend smart compute while solving.

Verifier-guided search

Do not trust one guess. Generate options and let a checker reject bad ones.

Verification granularity

Decide how often to check the work so you do not waste too much compute.

A simple way to say it:
do a little search, check the work, spend compute carefully, and keep the process visible.

What Is Still Weak

Hard truth
  • Some of the current quality still depends on shortcuts
  • The biggest shortcut is extra structure around where an option came from
  • When we remove that shortcut, harder cases get worse
Latest warning sign

The easy test improved.

The harder test still got worse.

So a nice small result does not automatically mean the real problem is solved.

That hidden gap is exactly what this project helps us see instead of cover up. This is also a practical warning in our own results: a system can look strong on easier slices while still depending on shortcuts.

Why It Is Still Worth Building

Controllability

You can move the budget and see the trade-off right away.

Observability

You can see the options, the path, and the failure type.

Trust

A checked step is easier to trust than a mysterious answer.

The best story here is not “the AI became magical”.
The best story is “we built a benchmark-and-runtime stack that shows its work.”

What We Need To Prove Next

Keep going if
  • the learned ordering keeps helping on harder tasks
  • most failures still look like effort or search-order problems
  • the system keeps being useful as a product, not just a lab result
Stop if
  • performance collapses when shortcuts are removed
  • the learned ordering stops adding anything beyond hand-made rules
  • the product becomes only pretty charts around hand-made rules
The next step is not another random training run. The next step is a harder transfer test while staying honest about generalization, search-time compute, and shortcut dependence.

Research Framing

Paper ideas
  • test-time scaling
  • verifier-guided search
  • verification granularity
  • search-time compute

What they mean here
  • test-time scaling: spend extra compute while solving, not only while training
  • verifier-guided search: generate options, reject bad ones, and use budget carefully
  • verification granularity: decide how often intermediate checking is worth the cost

What we built
  • candidate generation + ranking + verification + budgeted search
  • MCP-first tools for visible reasoning, with CLI as the secondary surface
  • a public runtime surface for controllable and inspectable search-time compute on this benchmark

Short version:
this is not a proof of hidden reasoning in the weights.
It is a concrete runtime for visible, verifier-backed, compute-bounded reasoning.
MCP shows runtime behavior, not chain-of-thought.