Formal VM benchmark + inspectable runtime:
on this benchmark, pure-neural multi-step execution was weak, while verifier-guided search with ranking was strong and inspectable under bounded compute.
In academic language, this sits near test-time scaling, verifier-guided search,
compute-efficient reasoning, and search-time compute.
What actually helped: try options, check them, pick better ones first.
What the user sees: why one option won.
Product idea: a runtime that exposes search, verification, and budget use.
What We Learned
What did not work well enough
The model could sometimes do one simple step, but not a whole chain reliably
Training it more was not the miracle fix
Some metrics on smaller, easier tests looked better than the real end-to-end outcome
So “just trust the model” was not enough
What started helping
Make a short list of possible next moves
Use a checker to test whether each move is valid
Try the better-looking moves first
Stop once a time or effort limit is reached
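The four moves above can be sketched as one small budgeted best-first loop. This is a minimal illustration, not the project's actual implementation; `propose`, `score`, `verify`, and `is_goal` are assumed callbacks, and "one unit of work per expansion" is a simplifying assumption.

```python
import heapq

def solve(state, propose, score, verify, is_goal, budget=100):
    """Minimal verifier-guided best-first search under a step budget.

    propose(state) -> candidate next states
    score(state)   -> lower is more promising (heuristic or learned ranker)
    verify(state)  -> True if the step is valid
    is_goal(state) -> True when solved
    All four callbacks are illustrative assumptions, not a fixed API.
    """
    frontier = [(score(state), 0, state)]
    tie = 1                             # tie-breaker so states never get compared
    spent = 0
    while frontier and spent < budget:
        _, _, current = heapq.heappop(frontier)
        spent += 1                      # one unit of work per expansion
        if is_goal(current):
            return current, spent       # solved within budget
        for cand in propose(current):
            if verify(cand):            # checker rejects invalid moves early
                heapq.heappush(frontier, (score(cand), tie, cand))
                tie += 1
    return None, spent                  # budget exhausted or no options left
```

On a toy task (start at 1, reach 10 with +1 or *2 moves, verifier rejecting overshoots), `solve(1, lambda s: [s + 1, s * 2], lambda s: abs(10 - s), lambda s: s <= 10, lambda s: s == 10)` returns the goal state well inside the default budget.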
So the project changed direction. Instead of hidden “thinking”, it became visible step-by-step problem solving.
The central thesis is narrow: on this benchmark and the stated splits, pure-neural multi-step execution was weak,
while verifier-guided search with ranking was strong and inspectable under bounded compute.
Why We Took It Seriously
Same task: 0.3548 success rate when the system tries options in a bad order.
Same task: 0.9677 success rate when a ranked policy is used (heuristic or learned).
Harder set: 0.9911 success rate.
Bad order: success 35%, work spent high; main problem: it wastes effort on weak options.
Hand-made order: success 97%, work spent much lower; harder task 94%.
Learned order: success 97%, work spent lowest; harder task 99%.
These numbers are from the stated splits; full reporting would give variability across runs and compute used.
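One way to see why ordering matters so much under a budget is a toy model: k candidate moves contain exactly one that works, and the budget only covers b verifier calls. All parameters here (`k`, `b`, `ranked_hit`) are illustrative assumptions, not measured quantities from the benchmark.

```python
def p_success(k, b, ranked_hit=0.9):
    """Probability of hitting the one good candidate within b checks.

    Random order: the good candidate is equally likely at any position,
    so P(found in the first b checks) = min(b, k) / k.
    Ranked order: with probability ranked_hit the ranker puts it first;
    otherwise we fall back to the random-order chance.
    ranked_hit is an assumed ranker accuracy, not a measured number.
    """
    random_order = min(b, k) / k
    ranked_order = ranked_hit + (1 - ranked_hit) * min(b, k) / k
    return random_order, ranked_order
```

With, say, `p_success(20, 7, ranked_hit=0.95)`, random ordering succeeds 35% of the time while ranked ordering succeeds about 97% of the time under the same budget, which is the shape of gap the table above shows.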
This is why current papers talk so much about test-time scaling:
extra compute at decision time can matter as much as training.
What A Person Can Actually See
1. See the options: instead of one hidden answer, you can see the choices.
2. Control the effort: you can change how much work the system is allowed to spend.
3. Understand mistakes: when it fails, it says why instead of hiding the miss.
This is the whole point: the process is visible.
More formally, the user is seeing a small verifier-guided search loop instead of one opaque generation.
Public surface: MCP-first runtime and inspection layer, with CLI as the secondary benchmark/eval surface.
Live Demo
This slide uses the same real data as the public demo. It is not fake sample text.
You can think of it as the public runtime surface for budgeted search,
candidate ranking, and step verification.
How It Works
make options → sort them → check them → keep trying → show the result
One active research question is optimal verification granularity:
how often you should check intermediate steps under a fixed compute budget.
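A back-of-the-envelope cost model makes that trade-off concrete. Every parameter here (`step_cost`, `check_cost`, `p_step_error`) is an assumed illustrative value, not a measurement: sparse checking pays the verifier less often, but lets errors propagate further before they are caught.

```python
def verification_cost(n_steps, check_every, step_cost=1.0, check_cost=0.5,
                      p_step_error=0.05):
    """Rough expected cost of a run that checks every check_every-th step.

    Sketch under assumed parameters: base cost is all steps plus all
    checks; wasted cost is redone work, since an error only surfaces at
    the next checkpoint (~check_every / 2 steps later on average).
    """
    n_checks = n_steps // check_every
    base = n_steps * step_cost + n_checks * check_cost
    expected_errors = n_steps * p_step_error
    wasted = expected_errors * (check_every / 2) * step_cost
    return base + wasted
```

Under these made-up numbers, checking every 10th step over 100 steps (`verification_cost(100, 10)` = 130.0) beats checking every step (`verification_cost(100, 1)` = 152.5), but raising `p_step_error` flips the answer; that sensitivity is exactly the open granularity question.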
Academic Ideas Inside It
Test-time scaling
Instead of only training more, spend smart compute while solving.
Verifier-guided search
Do not trust one guess. Generate options and let a checker reject bad ones.
Verification granularity
Decide how often to check the work so you do not waste too much compute.
A simple way to say it:
do a little search, check the work, spend compute carefully, and keep the process visible.
What Is Still Weak
Hard truth
Some of the current quality still depends on shortcuts
The biggest shortcut is extra structure around where an option came from
When we remove that shortcut, harder cases get worse
Latest warning sign
The easy test improved.
The harder test still got worse.
So a nice small result does not automatically mean the real problem is solved.
That hidden gap is exactly what this project helps us see instead of cover up.
This is also a practical warning in our own results:
a system can look strong on easier slices while still depending on shortcuts.
Why It Is Still Worth Building
Controllability
You can move the budget and see the trade-off right away.
Observability
You can see the options, the path, and the failure type.
Trust
A checked step is easier to trust than a mysterious answer.
The best story here is not “the AI became magical”.
The best story is “we built a benchmark-and-runtime stack that shows its work.”
What We Need To Prove Next
Keep going if
the learned ordering keeps helping on harder tasks
most failures still look like effort or search-order problems
the system keeps being useful as a product, not just a lab result
Stop if
performance collapses when shortcuts are removed
the learned ordering stops adding anything beyond hand-made rules
the product becomes only pretty charts around hand-made rules
The next step is not another random training run.
The next step is a harder transfer test while staying honest about generalization,
search-time compute, and shortcut dependence.
Research Framing
Paper idea / what it means here:
Test-time scaling: spend extra compute while solving, not only while training.
Verifier-guided search: generate options, reject bad ones, and use budget carefully.
Verification granularity: decide how often intermediate checking is worth the cost.
Search-time compute: a public runtime surface for controllable and inspectable search-time compute on this benchmark.
Public surface: MCP-first tools for visible reasoning, with CLI as the secondary surface.
Short version:
this is not a proof of hidden reasoning in the weights.
It is a concrete runtime for visible, verifier-backed, compute-bounded reasoning.
MCP shows runtime behavior, not chain-of-thought.