Choose Next Step
See top candidates, inspect the verified winner, and click through rejected branches instead of accepting a hidden completion.
Formal VM benchmark + inspectable runtime
vmbench is a formal execution benchmark built around a deterministic toy ISA, a reference VM, and verifier-guided search. This page shows the inspectable runtime surface: ranked candidate steps, bounded search, and structured failure traces on benchmark records.
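The verifier-guided, budget-bounded search described above can be sketched roughly as follows. This is an illustrative sketch only: the function names, argument order, and return shape are assumptions for exposition, not vmbench's actual API.

```python
from collections import deque

def solve(start_state, goal_state, candidates, verify, rank, node_budget):
    """Sketch of verifier-guided search: rank candidate steps, keep only
    verifier-approved transitions, and stop once the node budget is spent.
    All names here are illustrative assumptions, not vmbench's real API."""
    frontier = deque([(start_state, [])])
    nodes = 0
    while frontier and nodes < node_budget:
        state, path = frontier.popleft()
        nodes += 1
        if state == goal_state:
            return path, nodes
        # Expand ranked candidates, best first; the verifier rejects bad steps.
        for instr in sorted(candidates(state), key=rank, reverse=True):
            nxt = verify(state, instr)  # next state, or None on a mismatch
            if nxt is not None:
                frontier.append((nxt, path + [instr]))
    return None, nodes  # budget exhausted: would surface as a failure trace
```

On a toy "increment counter" ISA, `solve(0, 2, ...)` finds the two-step path `["inc", "inc"]` while charging each expanded node against the budget, which is the behavior the demo lets you inspect.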
Core claim: on this benchmark, pure-neural multi-step execution is weak, while explicit verifier-guided search with ranking is strong and inspectable under bounded compute.
Move the budget control and watch when the `no ranker`, `heuristic`, and `learned` policies start solving the same record.
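The budget sweep behind that control can be sketched as a small loop: for each policy, raise the node budget until the record is solved. The `solve` callable and policy names below are hypothetical stand-ins, not vmbench's actual interface.

```python
def min_solving_budget(record, solve, policies, budgets):
    """For each policy, sweep node budgets upward and record the first
    budget at which the policy solves the record (None if it never does).
    `solve(record, policy, budget)` is assumed to return truthy on success;
    the names are illustrative, not vmbench's real API."""
    first = {}
    for name, policy in policies.items():
        first[name] = next((b for b in budgets if solve(record, policy, b)), None)
    return first
```

The demo's budget curve is essentially this table plotted per policy, so a ranked policy shows up as a smaller first-solving budget on the same record.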
Failures surface as structured traces with node count, attempted instructions, expected instruction, and mismatch notes. This demo includes a `budget_or_rank_order` example.
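A structured trace like the one described might carry fields along these lines. The field names below are assumptions inferred from the description (node count, attempted instructions, expected instruction, mismatch notes), not vmbench's exact schema.

```python
from dataclasses import dataclass, field

@dataclass
class FailureTrace:
    """Illustrative shape of a structured failure trace; field names are
    assumptions based on the demo description, not vmbench's schema."""
    reason: str                  # e.g. "budget_or_rank_order"
    node_count: int              # search nodes expanded before giving up
    attempted: list              # instructions the search actually tried
    expected: str                # reference instruction from the record
    mismatch_notes: list = field(default_factory=list)

    def summary(self):
        # One-line rendering of the trace for display in the demo UI.
        return (f"{self.reason}: expanded {self.node_count} nodes, "
                f"tried {len(self.attempted)} instructions, "
                f"expected {self.expected!r}")
```

The value of this shape is that a miss is reported as data the user can inspect, rather than as an opaque "no answer".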
Inspectable Reasoning
Click any candidate to inspect why it won or why it was rejected.
Drag through the budget curve and inspect the path each policy finds.
A structured failure trace is what the user actually gets when the system cannot finish under the chosen budget.
Select a policy and inspect the actual search attempts, not just summary metrics.
This demo is driven by a real runtime payload bundled from vmbench outputs. It loads benchmark records from the held-out two-step split, ranks candidate steps, checks transitions with the verifier, and compares multiple policies under the same node budget.
The point is not to look magical. The point is to make reasoning controllable, observable, and trustworthy enough that a user can see progress, compare search policies, and understand a miss.
The current whitepaper reports that on a held-out two-step test, unranked search solves 11/31 records (0.3548) while heuristic and learned ranking solve 30/31 (0.9677) under the same node budget. On a harder split, learned ranking reaches 222/224 (0.9911). These are benchmark-specific results, not broad claims about general reasoning.
```
python vmbench_cli.py generate
python vmbench_cli.py eval --model llama3.2:latest --host http://127.0.0.1:11434
python vmbench_cli.py gate --summary reports/baseline/<run>/summary.json
python vmbench_cli.py export-sft
mcp call vmbench_demo_reasoning_runtime
mcp call vmbench_generate_demo_payload
```
- `vmbench_choose_next_step`
- `vmbench_solve_with_budget`
- `vmbench_explain_failure`
- `vmbench_compare_policies`
- `vmbench_demo_reasoning_runtime`
- `vmbench_generate_demo_payload`