RTSM: Real-Time Spatio-Semantic Memory
Object-centric queryable memory for spatial AI and robotics.
RTSM builds a persistent, searchable memory of objects in 3D space from RGB-D camera streams. Ask natural-language questions such as "Where is the red mug?" and get answers grounded in real-world coordinates.
Why RTSM
Vision models can detect objects. SLAM systems can map geometry. Language models can reason abstractly. But none of them remember where things are.
RTSM is the missing layer between perception and reasoning:
- SLAM provides geometry and poses
- Vision models provide object masks and semantics
- RTSM fuses them into a persistent, queryable world state
This makes spatial state inspectable, queryable, and reusable across robots, agents, and applications — regardless of which segmentation model or SLAM system you use.
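As a rough illustration of the fusion step, the sketch below associates labelled 3D observations with existing objects and promotes an object from proto to confirmed after repeated sightings. It is a minimal sketch, not RTSM's implementation: the `SpatialMemory` and `MemoryObject` names, the 0.3 m match radius, and the three-hit confirmation rule are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryObject:
    object_id: int                          # stable ID, never reused
    label: str                              # semantic label from the vision model
    position: tuple[float, float, float]    # world-frame centroid in metres
    hits: int = 1                           # number of observations fused so far
    state: str = "proto"                    # "proto" until seen often enough


class SpatialMemory:
    """Hypothetical object store; thresholds are illustrative assumptions."""

    def __init__(self, match_radius: float = 0.3, confirm_hits: int = 3):
        self.match_radius = match_radius
        self.confirm_hits = confirm_hits
        self.objects: dict[int, MemoryObject] = {}
        self._next_id = 0

    def observe(self, label: str, position: tuple[float, float, float]) -> MemoryObject:
        # Find the nearest existing object with the same label within the radius.
        best, best_dist = None, self.match_radius
        for obj in self.objects.values():
            dist = math.dist(obj.position, position)
            if obj.label == label and dist <= best_dist:
                best, best_dist = obj, dist

        if best is None:
            # No match: create a new proto object with a fresh stable ID.
            best = MemoryObject(self._next_id, label, position)
            self.objects[best.object_id] = best
            self._next_id += 1
            return best

        # Match: fold the new observation into a running mean of the centroid.
        n = best.hits
        best.position = tuple((p * n + q) / (n + 1) for p, q in zip(best.position, position))
        best.hits += 1
        if best.hits >= self.confirm_hits:
            best.state = "confirmed"
        return best


memory = SpatialMemory()
for pos in [(1.0, 0.5, 0.8), (1.02, 0.5, 0.8), (0.98, 0.5, 0.8)]:
    mug = memory.observe("mug", pos)
chair = memory.observe("chair", (3.0, 0.0, 0.4))
```

After three nearby sightings the mug keeps its original ID and is marked confirmed, while the chair, seen once, stays proto.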
Features
- Model-agnostic — Swappable segmentation backends (CNN or transformer, permissive or AGPL)
- Real-time — 210 ms mean pipeline latency (dual backend, RTX 5090)
- Persistent memory — Objects tracked across views with stable IDs, promoted from proto to confirmed
- Semantic search — Find objects by natural language via CLIP embeddings + FAISS
- Spatial search — Find objects near 3D world coordinates or relative to other objects
- MCP integration — AI agents (Claude, Cursor, LangGraph) can query spatial memory via Model Context Protocol
- Record & replay — Capture live sessions for offline benchmarking and reproducible testing
- Runtime analytics — Per-stage latency, segmentation rates, and throughput dashboards
- Queryable API — REST endpoints for objects, search, stats, and analytics
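The two search modes in the list above can be sketched in plain Python. This is an illustrative stand-in under stated assumptions: the three-dimensional vectors are toy values rather than CLIP embeddings, the brute-force cosine ranking stands in for a FAISS index, and the record layout and function names are hypothetical.

```python
import math

# Hypothetical records: (object_id, label, toy embedding, world position in metres).
OBJECTS = [
    (0, "red mug", (0.9, 0.1, 0.0), (1.0, 0.5, 0.8)),
    (1, "office chair", (0.0, 0.9, 0.1), (3.0, 0.0, 0.4)),
    (2, "laptop", (0.1, 0.2, 0.9), (1.2, 0.6, 0.8)),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def semantic_search(query_embedding, k=1):
    """Return the k objects whose embeddings are most similar to the query."""
    ranked = sorted(OBJECTS, key=lambda o: cosine(query_embedding, o[2]), reverse=True)
    return [(o[0], o[1]) for o in ranked[:k]]


def spatial_search(center, radius):
    """Return objects within `radius` metres of `center`, nearest first."""
    hits = sorted((math.dist(center, o[3]), o[0], o[1]) for o in OBJECTS)
    return [(oid, label) for dist, oid, label in hits if dist <= radius]


nearest_semantic = semantic_search((1.0, 0.0, 0.0))
nearby = spatial_search((1.0, 0.5, 0.8), radius=0.5)
```

In a real deployment the query vector would come from encoding the text query with the same embedding model used for the stored objects, so that text and object embeddings share one space.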
Performance at a Glance
Measured on an RTX 5090 using an iPhone ARKit recording (162 frames, 458 s indoor scene). Full benchmarks.
| Metric | dual (FastSAM + YOLOE) | grounded_sam2 (GDINO + SAM2) |
|---|---|---|
| Mean latency | 210 ms | 510 ms |
| P95 latency | 509 ms | 721 ms |
| Masks/frame | 28.8 | 13.4 |
| Objects confirmed | 60 | 35 |
| Backend license | AGPL-3.0 | Apache-2.0 |
RTSM's 10-stage pipeline is backend-agnostic: swap between CNN and transformer segmenters with a single config change while keeping the same memory layer and the same API.
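For readers reproducing the latency rows, the sketch below shows one way to derive mean and P95 values from per-frame timings. The sample timings are invented for the example, and the nearest-rank percentile method is an assumption: the source does not state which percentile definition RTSM's analytics use.

```python
import math
import statistics


def latency_summary(samples_ms):
    """Summarise per-frame latencies (milliseconds) as mean and P95."""
    ordered = sorted(samples_ms)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    rank = math.ceil(0.95 * len(ordered))
    return {"mean_ms": statistics.fmean(ordered), "p95_ms": ordered[rank - 1]}


# Made-up timings: nine typical frames and one slow outlier.
summary = latency_summary([180, 190, 200, 205, 210, 215, 220, 230, 250, 500])
```

A single slow frame pulls P95 far above the mean here, which is why the table reports both figures.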
License
Apache-2.0. See GitHub for details.