Formal VM benchmark + inspectable runtime

vmbench

vmbench is a formal execution benchmark built around a deterministic toy ISA, a reference VM, and verifier-guided search. This page shows the inspectable runtime surface: ranked candidate steps, bounded search, and structured failure traces on benchmark records.

Core claim: on this benchmark, pure-neural multi-step execution is weak, while explicit verifier-guided search with ranking is strong and inspectable under bounded compute.
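The bounded, verifier-guided search the claim refers to can be sketched minimally. All names and signatures below are illustrative assumptions, not vmbench's actual API:

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]

def guided_search(
    state: State,
    goal: State,
    candidates: Callable[[State], List[State]],   # verifier-approved successors only
    rank: Callable[[State], float],               # lower score = more promising
    budget: int,                                  # node-expansion budget
) -> Optional[List[State]]:
    """Expand states best-first by `rank`, checking each transition with the
    verifier, until `goal` is reached or `budget` nodes have been expanded."""
    frontier = [(rank(state), [state])]
    expanded = 0
    while frontier and expanded < budget:
        frontier.sort(key=lambda item: item[0])
        _, path = frontier.pop(0)
        current = path[-1]
        if current == goal:
            return path
        expanded += 1
        for nxt in candidates(current):
            frontier.append((rank(nxt), path + [nxt]))
    return None  # budget exhausted: the miss is reported, not hidden
```

With no ranker the same loop degenerates to blind expansion, which is why ranking dominates under a fixed budget.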

Choose Next Step

See top candidates, inspect the verified winner, and click through rejected branches instead of accepting a hidden completion.
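The winner-plus-rejections view implies a record shape roughly like the following. Field names are assumptions for illustration, not the real vmbench payload schema:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    instruction: str                      # proposed next step
    score: float                          # ranker score (higher = preferred)
    verified: bool                        # did the verifier accept the transition?
    rejection_note: Optional[str] = None  # why a rejected branch lost

def pick_winner(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the highest-ranked candidate the verifier accepts;
    rejected branches stay around for inspection instead of being discarded."""
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if cand.verified:
            return cand
    return None
```

The key UX property is that `pick_winner` never silently drops the losing branches: the UI can render every `rejection_note` alongside the winner.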

Solve With Budget

Drag the budget control and watch at what budget the `no ranker`, `heuristic`, and `learned` policies each begin to solve the same record.
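The sweep behind that control can be sketched as follows. The policy names mirror this page, but the solver interface (`solves(budget) -> bool`) is an assumption:

```python
from typing import Callable, Dict, Iterable, Optional

def first_solving_budget(
    solves: Callable[[int], bool],
    budgets: Iterable[int],
) -> Optional[int]:
    """Smallest budget at which a policy solves the record, else None."""
    for b in budgets:
        if solves(b):
            return b
    return None

def sweep(
    policies: Dict[str, Callable[[int], bool]],
    budgets: Iterable[int],
) -> Dict[str, Optional[int]]:
    """Run the same budget range against every policy on one record."""
    budgets = list(budgets)  # reuse the range across policies
    return {name: first_solving_budget(fn, budgets) for name, fn in policies.items()}
```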

Explain Failure

Failures surface as structured traces with node count, attempted instructions, expected instruction, and mismatch notes. This demo includes a `budget_or_rank_order` example.
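A trace carrying those fields might look like this. The field names mirror what the page lists but are illustrative, not the actual vmbench trace schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FailureTrace:
    kind: str             # e.g. "budget_or_rank_order"
    nodes_expanded: int   # search effort spent before giving up
    attempted: List[str]  # instructions the search actually tried
    expected: str         # instruction the record required
    mismatch_note: str = ""

    def summary(self) -> str:
        """One-line rendering of the structured miss."""
        return (
            f"{self.kind}: expanded {self.nodes_expanded} nodes, "
            f"tried {len(self.attempted)} instructions, "
            f"expected {self.expected!r}"
        )
```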

Inspectable Reasoning

Reasoning Runtime Demo


Choose Next Step

Click any candidate to inspect why it won or why it was rejected.

Solve With Budget

Drag through the budget curve and inspect the path each policy finds.

Explain Failure

This is what the user actually gets when the system cannot finish under the chosen budget.

Compare Policies

Select a policy and inspect the actual search attempts, not just summary metrics.

What You Are Looking At

This demo is driven by a real runtime payload bundled from vmbench outputs. It loads benchmark records from the held-out two-step split, ranks candidate steps, checks transitions with the verifier, and compares multiple policies under the same node budget.
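Consuming such a bundled payload can be sketched as a load-and-validate step. The key names (`records`, `policies`, `node_budget`) are assumptions about the payload, not its documented schema:

```python
import json

def load_payload(text: str) -> dict:
    """Parse a bundled runtime payload and fail loudly on missing sections,
    so the demo renders real data or an explicit error, never a blank state."""
    payload = json.loads(text)
    for key in ("records", "policies", "node_budget"):
        if key not in payload:
            raise ValueError(f"payload missing {key!r}")
    return payload
```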

Why This UX Is Different

The point is not to look magical. The point is to make reasoning controllable, observable, and trustworthy enough that a user can see progress, compare search policies, and understand a miss.

What The Benchmark Shows

The current whitepaper reports that on a held-out two-step test, unranked search solves 11/31 records (0.3548) while heuristic and learned ranking solve 30/31 (0.9677) under the same node budget. On a harder split, learned ranking reaches 222/224 (0.9911). These are benchmark-specific results, not broad claims about general reasoning.
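The quoted rates are simply the solve counts divided by split size, which is easy to check:

```python
# Reported solve rates, reproduced from the counts in the text above.
rates = {
    "unranked 11/31": round(11 / 31, 4),       # 0.3548
    "ranked 30/31": round(30 / 31, 4),         # 0.9677
    "learned hard 222/224": round(222 / 224, 4),  # 0.9911
}
```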

Runtime Entry Points

python vmbench_cli.py generate
python vmbench_cli.py eval --model llama3.2:latest --host http://127.0.0.1:11434
python vmbench_cli.py gate --summary reports/baseline/<run>/summary.json
python vmbench_cli.py export-sft

mcp call vmbench_demo_reasoning_runtime
mcp call vmbench_generate_demo_payload

MCP Surface