World Model Infrastructure Lab

Building the infrastructure layer for world-model-driven AI.

I work on the systems layer behind next-generation AI agents: memory, retrieval, model routing, evaluation, local inference, and runtimes that help agents maintain state, simulate outcomes, and act reliably.

The future of AI is not just larger language models. It is infrastructure that lets models understand environments, reason across time, and interact with the world.

Explore Current Systems

Operating Loop

Building the systems layer for agents that remember, simulate, and act.

01 Observe

02 Model

03 Simulate

04 Act

05 Evaluate

06 Update
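The six steps above can be read as a single control loop. The following is a minimal Python sketch under stated assumptions: the environment is a toy one-dimensional position, and every name (`WorldState`, `simulate`, `run_loop`) is hypothetical and chosen for illustration rather than taken from any project listed on this page.

```python
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Hypothetical agent-side model of a toy environment."""
    position: int = 0
    goal: int = 3
    history: list = field(default_factory=list)


def simulate(state: WorldState, action: int) -> int:
    """Predict the position after an action without touching the environment."""
    return state.position + action


def run_loop(env_position: int, goal: int, max_steps: int = 10) -> WorldState:
    state = WorldState(goal=goal)
    for _ in range(max_steps):
        # 01 Observe: read the environment.
        observed = env_position
        # 02 Model: fold the observation into the agent's state.
        state.position = observed
        if state.position == state.goal:
            break
        # 03 Simulate: score each candidate action by predicted distance to goal.
        candidates = (-1, 0, 1)
        action = min(candidates, key=lambda a: abs(simulate(state, a) - state.goal))
        # 04 Act: apply the chosen action to the environment.
        env_position += action
        # 05 Evaluate: compare the prediction with what actually happened.
        prediction_error = env_position - simulate(state, action)
        # 06 Update: record the outcome so later steps can use it.
        state.history.append((observed, action, prediction_error))
    return state


final_state = run_loop(env_position=0, goal=3)
```

In this toy setting the environment is deterministic, so the prediction error is always zero. In a real system the Evaluate and Update steps would presumably feed that error back into the model.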

The World Model Infrastructure Stack

World-model-driven AI needs more than a foundation model. It needs infrastructure for state, memory, retrieval, simulation, tool use, model routing, and evaluation. My work explores that connective tissue between models and reliable action.

Layer 01

World-Model Applications

Interfaces where persistent, stateful AI systems surface to users and operators.

Relevant work: Applied agent workflows · AI Toolkit utilities · Research notes

Layer 02

Agent Runtime

Execution layer for planners, implementers, reviewers, and agent role orchestration.

Relevant work: subagent-fleet · Claude Code workflows · MCP exploration

Layer 03

State + Memory Layer

Durable context about tasks, users, tools, and prior outcomes over long horizons.

Relevant work: awesome-agentic-memory · agentic memory research · photographic memory ideas

Layer 04

Retrieval + Context Layer

Backend-agnostic search, filtering, and recall for relevant context at runtime.

Relevant work: embenx · RAG workflows · vector backend abstraction

Layer 05

Simulation / Prediction Layer

Pre-action reasoning loops, sandboxed what-if evaluation, and future-state planning.

Relevant work: Evaluation prototypes · research agenda · planned field notes

Layer 06

Tool + Environment Interface

Connectors and protocols that let agents observe, read, write, and act safely.

Relevant work: MCP · Claude skills · AI Toolkit · automation workflows

Layer 07

Model Routing + Local/Cloud Inference

Routing policies across local Ollama nodes, hosted models, and specialized backends.

Relevant work: subagent-fleet · Ollama · LiteLLM · OpenRouter workflows

Layer 08

Observability + Evaluation

Operational visibility and behavior measurement for systems acting over time.

Relevant work: Prompt grader · structured agent evaluation · trace-driven workflows

Research Agenda

Agent State

How should AI agents maintain a durable model of users, tasks, tools, goals, and environments?

Memory Infrastructure

How should agents retrieve, compress, forget, and update knowledge over long horizons?

Simulation Loops

How can agents test possible actions before acting?

Model Routing

How should AI systems route between language models, vision models, local models, specialized tools, and simulators?

Evals for Agency

How do we evaluate systems that act over time instead of answering one prompt?
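One way to make this question concrete is to score a whole recorded episode instead of a single response. The sketch below assumes a hypothetical trace format (`Step`) and a hypothetical set of metrics. It illustrates the idea and is not the evaluation method used in any project on this page.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One recorded agent action (hypothetical trace format)."""
    tool: str
    ok: bool


def evaluate_trajectory(steps: list[Step], goal_reached: bool, step_budget: int) -> dict:
    """Summarize behavior over a whole episode rather than one response."""
    total = len(steps)
    failures = sum(1 for s in steps if not s.ok)
    return {
        "success": goal_reached,
        "steps": total,
        "within_budget": total <= step_budget,
        "tool_error_rate": failures / total if total else 0.0,
    }


trace = [Step("search", True), Step("write", False), Step("write", True), Step("test", True)]
report = evaluate_trajectory(trace, goal_reached=True, step_budget=5)
```

Even this small summary separates outcomes that a single-prompt eval would merge: an agent can reach the goal while still exceeding its step budget or failing a large share of its tool calls.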

Local-First Intelligence

How can builders run powerful AI infrastructure without depending entirely on closed platforms?

Current Systems

Projects are framed here as research artifacts: each one explores a concrete question in the world-model stack and makes the systems layer more legible.

View full system index

quecto

agent runtime · local inference · Rust · zero-dependency · coding agent

Research Question

Can a fully functional AI coding agent harness be statically-linked, zero-async, and fit under 4 MB without sacrificing core functionality?

System Built

A Rust-native AI harness with a 1.2 MB core and a 3.3 MB coding agent, compiled to a single statically-linked binary with zero async overhead and no runtime dependencies.

Why It Matters

Most AI agent frameworks bring heavy runtimes, async schedulers, and bloated dependency trees. quecto explores the opposite: minimal, auditable, locally-deployable harness infrastructure that can run anywhere a binary can run.

Status

Active experiment

GitHub

locobench

evals · benchmarks · local inference · coding models · observability

Research Question

How do local coding models actually compare on real coding tasks when measured with a standardized, reproducible benchmark?

System Built

LoCoBench, an open benchmark harness for evaluating local coding language models across standardized code-generation and problem-solving tasks with structured result output.
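As a rough illustration of what a benchmark harness with structured result output can look like, here is a minimal Python sketch. It is not LoCoBench's implementation: the task schema (`id`, `prompt`, `check`), the `run_benchmark` function, and the stub model are all assumptions made for this example.

```python
import json
from typing import Callable


def run_benchmark(model: Callable[[str], str], tasks: list[dict]) -> dict:
    """Run each task through `model` and record a structured pass/fail result."""
    results = []
    for task in tasks:
        namespace: dict = {}
        try:
            # Execute the generated code, then the task's check, in one namespace.
            # NOTE: exec on untrusted model output is unsafe; a real harness
            # would run this inside a sandbox.
            exec(model(task["prompt"]), namespace)
            exec(task["check"], namespace)
            passed = True
        except Exception:
            passed = False
        results.append({"task": task["id"], "passed": passed})
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return {"results": results, "pass_rate": pass_rate}


def stub_model(prompt: str) -> str:
    # Stand-in for a call to a locally hosted model; always returns the same code.
    return "def add(a, b):\n    return a + b"


tasks = [
    {"id": "add", "prompt": "Write add(a, b).", "check": "assert add(2, 3) == 5"},
    {"id": "mul", "prompt": "Write mul(a, b).", "check": "assert mul(2, 3) == 6"},
]
report = run_benchmark(stub_model, tasks)
print(json.dumps(report, indent=2))
```

Because the model is passed in as a plain callable, the same task list can be replayed against any locally hosted model, which is the property that makes results comparable across runs.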

Why It Matters

Choosing a local coding model today means guessing from blog posts. LoCoBench makes the tradeoffs measurable so builders can select the right model for their local-first agent stack based on real task performance.

Status

Active experiment

GitHub

learn-anything-24h

agent skills · Claude Code · active learning · education · prompt engineering

Research Question

Can an AI agent scaffold a complete, structured 24-hour active-learning curriculum from any complex topic using only a single skill invocation?

System Built

A Claude Code / Codex skill that transforms any topic into a structured 24-hour sprint with active recall exercises, curated materials, and research paper integration.

Why It Matters

Learning infrastructure for AI builders is underexplored. This skill bridges LLM tool-use and structured pedagogy, turning a model's knowledge synthesis capability into a repeatable onboarding harness for any technical domain.

Status

Shipping

GitHub

subagent-fleet

local inference · model routing · coding agents · Ollama · LiteLLM

Research Question

Can local machines become a practical compute fleet for AI coding agents?

System Built

An open-source local AI compute control plane that generates agent definitions, LiteLLM routing config, warmup flows, and a live dashboard from one declarative fleet topology.
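To show what generating routing config from one declarative topology could mean in practice, here is a small Python sketch. The topology schema, host names, and model names are hypothetical and do not reflect subagent-fleet's actual format. The output follows the `model_list` / `litellm_params` layout used by LiteLLM proxy configuration and should be checked against the current LiteLLM documentation before use.

```python
def build_litellm_config(topology: dict) -> dict:
    """Expand a hypothetical fleet topology into a LiteLLM-style model_list."""
    model_list = []
    for node in topology["nodes"]:
        for role, model in node["models"].items():
            model_list.append({
                "model_name": role,  # alias that agents request, e.g. "planner"
                "litellm_params": {
                    "model": f"ollama/{model}",
                    # 11434 is Ollama's default port; nodes may override it.
                    "api_base": f"http://{node['host']}:{node.get('port', 11434)}",
                },
            })
    return {"model_list": model_list}


# Placeholder hosts and model names, for illustration only.
topology = {
    "nodes": [
        {"host": "studio.local", "models": {"planner": "example-large"}},
        {"host": "mini.local", "port": 11435, "models": {"implementer": "example-small"}},
    ]
}
config = build_litellm_config(topology)
```

Keeping the topology as the single source of truth means agent definitions, warmup flows, and dashboards could all be derived from the same structure instead of being edited by hand.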

Why It Matters

Persistent agent systems get expensive fast. Local-first routing turns spare Macs, workstations, and Ollama nodes into inspectable infrastructure instead of one opaque endpoint.

Status

Active experiment

Write-up GitHub Docs

embenx

retrieval · memory layer · vector backends · MCP · hybrid search

Research Question

Can retrieval infrastructure for agents become backend-agnostic without losing practical control?

System Built

A Python retrieval library with a unified Collection API across 15+ vector backends, plus metadata filtering, reranking, hybrid search, temporal recall, and an MCP server.
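The idea of a unified collection interface over interchangeable backends can be sketched as follows. This is not embenx's API: `VectorBackend`, `InMemoryBackend`, and `Collection` are hypothetical names written for this example, and the in-memory backend stands in for a real vector store.

```python
import math
from typing import Protocol


class VectorBackend(Protocol):
    """Hypothetical minimal contract every storage backend must satisfy."""
    def add(self, doc_id: str, vector: list[float], metadata: dict) -> None: ...
    def search(self, vector: list[float], k: int) -> list[tuple[str, float, dict]]: ...


class InMemoryBackend:
    """Brute-force cosine-similarity backend, standing in for a real vector store."""
    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], dict]] = {}

    def add(self, doc_id, vector, metadata):
        self._rows[doc_id] = (vector, metadata)

    def search(self, vector, k):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0
        scored = [(i, cosine(vector, v), m) for i, (v, m) in self._rows.items()]
        return sorted(scored, key=lambda r: r[1], reverse=True)[:k]


class Collection:
    """Caller-facing API; swapping backends does not change this code."""
    def __init__(self, backend: VectorBackend) -> None:
        self.backend = backend

    def add(self, doc_id, vector, **metadata):
        self.backend.add(doc_id, vector, metadata)

    def query(self, vector, k=3, where=None):
        # Over-fetch, then apply an exact-match metadata filter on the client side.
        # This is a simplification; a real backend would filter during the search.
        hits = self.backend.search(vector, k=k * 4)
        if where:
            hits = [h for h in hits if all(h[2].get(f) == v for f, v in where.items())]
        return [doc_id for doc_id, _, _ in hits[:k]]


notes = Collection(InMemoryBackend())
notes.add("a", [1.0, 0.0], kind="task")
notes.add("b", [0.9, 0.1], kind="user")
notes.add("c", [0.0, 1.0], kind="task")
nearest = notes.query([1.0, 0.0], k=2)
nearest_tasks = notes.query([1.0, 0.0], k=2, where={"kind": "task"})
```

The design choice illustrated here is that callers depend only on `Collection`, so any class that implements `add` and `search` can replace the in-memory backend without touching the calling code.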

Why It Matters

World-model-driven agents need a swappable and inspectable memory substrate. embenx reduces retrieval glue code while preserving the ability to choose the right storage backend per workload.

Status

Shipping / active

Write-up GitHub PyPI

AI Toolkit

tool interface · prompt systems · evals · workflow tooling

Research Question

What lightweight tools make LLM workflows more inspectable and repeatable for builders?

System Built

A set of practical prompt and workflow tools including a prompt grader, intelligent prompt composer, and thread generator for turning vague inputs into more structured model interactions.

Why It Matters

Reliable AI systems need a disciplined interface layer. These tools sharpen prompts, evaluation criteria, and operator workflows before heavier agent runtime infrastructure is added.

Status

Shipping

Toolkit Prompt Grader Prompt Composer

awesome-agentic-memory

memory research · MCP · ecosystem map · agent frameworks

Research Question

What does the current memory ecosystem reveal about the missing systems layer for agentic AI?

System Built

A curated research and tooling map across agent memory frameworks, MCP servers, vector stores, graph backends, and emerging papers.

Why It Matters

Leading in an emerging category requires condensing its ecosystem into something navigable. This project translates a fragmented memory landscape into a clearer infrastructure map.

Status

Active knowledge base

Guide GitHub

antigravity-cmux-skills

Claude Code skills · agent orchestration · tmux · parallel agents · shell

Research Question

Can tmux-based terminal multiplexing become a practical coordination layer for running multiple AI agent sessions in parallel?

System Built

A collection of Claude Code skills for orchestrating independent agent sessions in cmux — split panes, monitor agents, automate browsers, and coordinate parallel work across multiple terminal sessions.

Why It Matters

Multi-agent workflows often require manual context switching between sessions. antigravity-cmux-skills turns tmux into a first-class agent coordination primitive, letting builders run parallel AI sessions with structured visibility and no platform lock-in.

Status

Active experiment

GitHub

mcp-scholarly

MCP · research retrieval · academic search · tool interface

Research Question

Can AI agents retrieve verified academic knowledge without hallucinating citations?

System Built

An MCP server that lets agents search and retrieve accurate academic articles from scholarly databases, giving agents a direct path to peer-reviewed literature.

Why It Matters

Research-grounded agents need a reliable retrieval path to scholarly knowledge. mcp-scholarly closes the gap between LLM training data and verifiable, up-to-date academic sources.

Status

Shipping / active

GitHub

locobench

evals · local inference · coding models · benchmarking

Research Question

How do local coding models actually compare on real tasks when evaluated systematically instead of guessed at?

System Built

LoCoBench — a local coding model benchmark that runs a reproducible evaluation suite against locally-hosted models to expose performance differences that spec sheets hide.

Why It Matters

Local coding agents need trusted eval data. LoCoBench makes the comparison ground truth available for builders selecting models for local agent fleets instead of relying on benchmark marketing.

Status

Early experiment

GitHub

learn-anything-24h

agent skills · active learning · claude-code · LLM · education

Research Question

Can a single structured skill turn any complex topic into a rigorous 24-hour active-learning sprint using AI agents?

System Built

A Claude Code / Codex skill that decomposes any topic into spaced-repetition tasks, active recall prompts, and a timed learning sequence with LLM-guided evaluation.

Why It Matters

AI agents are increasingly used for self-directed learning, but few systems encode pedagogical rigor into the agent workflow itself. This skill closes the gap by making the learning loop agent-executable.

Status

Active experiment

GitHub

quecto

Rust · agent runtime · local inference · coding agent · zero async

Research Question

How small and fast can a fully capable AI agent harness be when built in Rust with zero async overhead?

System Built

A minimal Rust AI interface framework: a ~1.2 MB synchronous core library for OpenAI-compatible endpoints, plus quecto-agent — a full coding agent with multi-step tool use, SQLite-backed session persistence, and an approval-gated sandbox.

Why It Matters

Most agent runtimes carry heavyweight async stacks and large dependency trees. quecto proves that a self-contained, statically-linked binary with only two direct dependencies can still deliver a complete coding agent with resume, undo, and diff.

Status

Shipped / active

GitHub

New Eval

Local LLM serving on Apple Silicon, evaluated instead of guessed

I ran the same workload set through Ollama, vLLM Metal, and SGLang on an Apple M5 Pro MacBook Pro, added a warmed response-quality eval suite, used Gemma 4 as a second quality judge, and then followed up with a Qwen 3.5 model sweep from 0.8B to 9B on Ollama.

Open eval note

Model sweep: Qwen 3.5 2B
Best balance point in the Ollama-only follow-up sweep.

Structured JSON: vLLM
Only non-Ollama runtime that fully passed the JSON eval.

Judge winner: Ollama + Gemma 4
Won all 6 quality comparisons for visible answers.

Key failure mode: Metal memory pressure
vLLM crashed when co-resident with SGLang.

The useful part of the eval was not just the chart. It exposed the runtime-specific details that decide whether a local stack is actually practical: memory contention, answer formatting behavior, and how much configuration it takes to get stable outputs on Apple Silicon.
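Answer formatting behavior is the kind of detail a strict structured-output check can surface. The following Python sketch shows one possible pass/fail rule for a JSON eval. The function name and the rule itself are assumptions for illustration, not the checks used in the eval described above.

```python
import json


def passes_json_eval(raw_output: str, required_keys: set[str]) -> bool:
    """Return True only if the output is a bare JSON object with the required keys."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # Covers common failures such as prose preambles or Markdown code fences.
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


clean = passes_json_eval('{"answer": "4", "confidence": 0.9}', {"answer", "confidence"})
chatty = passes_json_eval('Sure! Here is the JSON: {"answer": "4"}', {"answer"})
```

A strict rule like this treats a correct answer wrapped in extra prose as a failure, which is deliberate when downstream code parses the output without cleanup.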

Field Notes

Essays and system notes that reinforce the thesis: AI is moving from chat interfaces toward stateful, operational systems that need better infrastructure.

planned field note
The Missing Infrastructure Layer for World-Model AI
Foundation models are not enough for reliable real-world agency. The next category is the infrastructure around them: state, memory, routing, simulation, and evals.

planned field note
From RAG to State: Why Agent Memory Is Not Just Retrieval
RAG retrieves facts. Agent memory needs to maintain evolving state about users, tools, goals, failures, and plans across time.

planned field note
Local-First AI Infrastructure for Agent Builders
As agent workflows become persistent and expensive, local inference and routing become an infrastructure advantage rather than a hobbyist optimization.

published system note
subagent-fleet: Local AI Compute Control Plane for Coding Agents
A local AI control plane can get materially closer to frontier coding quality than most people expect, while preserving privacy and operator control.

published eval note
I Ran Local LLM Evals on an Apple Silicon Mac
Local inference decisions should be based on measured runtime behavior, memory pressure, visible answer quality, and model-size tradeoffs, not just architecture claims.

published system note
embenx Guide: The Ultimate Python Library for Vector Search
Retrieval logic should outlive any single vector backend, especially for agent memory systems that will evolve as workloads change.

Also Writing

Call to Think

A separate essay practice on technology, AI, and society: longer arguments about how these systems change the way we think, written slowly on purpose.

Operating Principles

Useful AI systems need more than better prompts.
They need memory that can be inspected.
They need tools that can be audited.
They need models that can be routed.
They need state that can be updated.
They need evals that measure behavior over time.
They need local-first infrastructure so builders can experiment without waiting for permission.

Open Research Channel

Current threads I am actively pushing forward across the world-model stack.

Memory
Backend-agnostic retrieval, temporal recall, and agent memory abstractions that stay portable across storage layers.

Routing
Local-plus-cloud model routing policies for planner / implementer / reviewer agent roles and cost-aware execution.

Observability
Tracing, warmup visibility, and evaluation loops that make long-running agent behavior auditable.

System Boot Notes

initializing world model stack...
loading memory layer...
routing local + cloud models...
attaching tool interfaces...
starting simulation loop...
ready

रूपं देहि जयं देहि यशो देहि द्विषो जहि॥

Rūpaṁ dehi, jayaṁ dehi, yaśo dehi, dviṣo jahi.

May I be granted excellence, victory, worthy recognition, and freedom from hostility.