Mini Data Engine Lab
Mini Data Engine Lab is an interactive data systems lab:
- runnable PostgreSQL-like and Databricks-like demos,
- compact Python and Rust implementations of storage and execution ideas,
- a local-first sandbox for indexes, planners, checkpoints, workflows, and event logs,
- an MCP access layer for automation and agent-driven exploration.
Who This Is For
- Data engineers learning how modern data systems behave under the hood.
- Platform and infrastructure engineers teaching storage, execution, and write-path fundamentals.
- Teams building onboarding labs, workshops, or demos around data platform architecture.
What You Can Explore
- Heap tables and B-tree indexes.
- Planner decisions such as Seq Scan versus Index Scan.
- WAL/checkpoint-style persistence and replay.
- Delta-style append and merge-upsert version history.
- Workflow DAG execution and single-write-path architecture.
- Canonical events, idempotency, retries, and deterministic transitions.
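To make the first three items above concrete, here is a minimal, self-contained sketch of a heap table with a sorted index and a toy planner that picks Seq Scan or Index Scan from estimated selectivity. The names (`HeapTable`, `choose_plan`) and the 10% threshold are illustrative assumptions, not the repository's actual API.

```python
import bisect

class HeapTable:
    """Rows kept in insertion order (heap), plus a sorted (key, row_id) index."""
    def __init__(self):
        self.rows = []    # heap: append-only list of row dicts
        self.index = []   # B-tree stand-in: sorted list of (key, row_id)

    def insert(self, key, payload):
        row_id = len(self.rows)
        self.rows.append({"key": key, "payload": payload})
        bisect.insort(self.index, (key, row_id))

    def seq_scan(self, key):
        # Visit every row in heap order.
        return [r for r in self.rows if r["key"] == key]

    def index_scan(self, key):
        # Jump to the first matching index entry, then walk forward.
        i = bisect.bisect_left(self.index, (key, -1))
        out = []
        while i < len(self.index) and self.index[i][0] == key:
            out.append(self.rows[self.index[i][1]])
            i += 1
        return out

def choose_plan(table, key, selectivity_threshold=0.1):
    """Toy planner: prefer Index Scan when few rows are expected to match."""
    matches = sum(1 for k, _ in table.index if k == key)
    selectivity = matches / max(len(table.rows), 1)
    return "Index Scan" if selectivity < selectivity_threshold else "Seq Scan"
```

Both scan paths return the same rows; the planner only changes how they are found, which is the trade-off the lab's EXPLAIN-style output surfaces.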
GitHub
- Repository: github.com/kroq86/data-engineering-runtime-lab
- Owner: @kroq86
- If this project is useful, please star the repository.
Quick Start
cargo run --bin e2e_flow
Core Components
- mini_pg_like.py and src/bin/mini_pg_like.rs: PostgreSQL-like demos with heap storage, B-tree indexes, and planner output.
- mini_databricks_clone.py and src/bin/mini_databricks_clone.rs: Databricks-like demos with Delta-style versioning, workflows, and event-driven write paths.
- engine_cli and e2e_flow: persistent engine operations, checkpointing, replay, and end-to-end validation.
- TECHNICAL_DESIGN_GENERIC.md: the architecture backbone used to keep the lab deterministic and reviewable.
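As a rough illustration of the Delta-style versioning the Databricks-like demos cover, here is a minimal sketch: each commit produces an immutable snapshot, append refuses to overwrite existing keys, and merge-upsert updates or inserts. The `DeltaTable` name and method shapes are assumptions for illustration, not the repository's interface.

```python
import copy

class DeltaTable:
    """One immutable snapshot per commit; snapshots are keyed by primary key."""
    def __init__(self):
        self.versions = [{}]   # version 0: empty table

    @property
    def current(self):
        return self.versions[-1]

    def append(self, rows):
        """Append-only commit: new keys must not already exist."""
        snap = copy.deepcopy(self.current)
        for key, value in rows.items():
            if key in snap:
                raise ValueError(f"append would overwrite key {key!r}")
            snap[key] = value
        self.versions.append(snap)
        return len(self.versions) - 1   # new version number

    def merge_upsert(self, rows):
        """MERGE-style commit: update matching keys, insert the rest."""
        snap = copy.deepcopy(self.current)
        snap.update(rows)
        self.versions.append(snap)
        return len(self.versions) - 1

    def time_travel(self, version):
        """Read any historical snapshot by version number."""
        return self.versions[version]
```

Keeping every version makes the merge-upsert history inspectable, which is the property the lab's version-history demos exercise.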
MCP Access Layer
The project also exposes selected tools via mcp_engine_server.py:
init_engine, insert_row, upsert_row, create_index, explain_customer, reindex_project, run_e2e_flow, health_check, benchmark_calls, scenario_load_test, record_tool_trace, explain_run, demo_explain_run, similar_incidents, refresh_trace_path, refresh_docs_path, capture_roi_baseline, report_drift_bug, decision_gate
This MCP layer is a programmable interface to the lab, not the primary identity of the project.
For full runtime discovery, use project_tool_catalog and project_get_defaults.
Quick MCP demo:
- Call demo_explain_run: it creates a traced engine run and immediately returns a structured explanation for that run_id.
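For orientation, this is roughly the JSON-RPC 2.0 request an MCP client sends to invoke a tool such as demo_explain_run. The `tools/call` method comes from the MCP specification; the empty argument object is an assumption, so check project_tool_catalog and project_get_defaults for the actual schema.

```python
import json

def tools_call_request(tool_name, arguments, request_id=1):
    """Build an MCP 'tools/call' JSON-RPC 2.0 request body."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

# Hypothetical invocation; arguments may differ in the real server.
payload = tools_call_request("demo_explain_run", {})
```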
Product note:
- Runtime Explainability Product Note: a short note describing the incident explanation use case, the signals to store, and the explain_run MVP.
- Runtime Copilot: product framing for an MCP-native operational brain for runtimes and internal data systems.
- Use Runtime Copilot in Codex: how to connect the MCP server in Codex, install the skill, and reuse automation examples.
- Explain-First Regression Suite Feasibility: an article describing what the repository validated about traced regression bundles, expected-failure controls, and current denominator limits.
Why This Project
This repository is built to bridge architecture diagrams and runnable systems code:
- learn core data system mechanics without needing full PostgreSQL, Spark, or Databricks deployments,
- compare Python and Rust implementations of the same ideas,
- turn storage and execution concepts into something you can run, inspect, and automate.
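As one example of turning such a concept into runnable code, here is a minimal WAL/checkpoint sketch: every mutation is appended to a log before being applied, so state can be rebuilt by replaying the log tail from the last checkpoint. The `Engine` class and its field names are hypothetical, not the repository's implementation.

```python
class Engine:
    def __init__(self):
        self.wal = []                # append-only write-ahead log
        self.state = {}
        self.checkpoint_state = {}
        self.checkpoint_lsn = 0      # log position of the last checkpoint

    def put(self, key, value):
        self.wal.append(("put", key, value))   # log first ...
        self.state[key] = value                # ... then apply

    def checkpoint(self):
        """Snapshot current state and remember how much of the WAL it covers."""
        self.checkpoint_state = dict(self.state)
        self.checkpoint_lsn = len(self.wal)

    def recover(self):
        """Rebuild state: start from the checkpoint, replay the WAL tail."""
        state = dict(self.checkpoint_state)
        for op, key, value in self.wal[self.checkpoint_lsn:]:
            if op == "put":
                state[key] = value
        return state
```

Because replay is deterministic, recovery after a simulated crash reproduces exactly the state that was logged, which is the property the lab's checkpoint and replay demos verify.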
Links
- GitHub repository: data-engineering-runtime-lab
- Open issues / feature requests: Issues
- Source and setup details: README.md
- License: MIT (LICENSE)