Architecture

RTSM processes RGB-D frames through a 10-stage pipeline that extracts, tracks, and stores objects in a queryable spatial memory. The system is segmentation-model-agnostic — any backend that produces instance masks can feed the pipeline.

System Overview

┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│  Calabi Lens     │   │   D435i + SLAM   │   │  Recorded        │
│  (ARKit iOS)     │   │   (RTABMap)      │   │  Session         │
└────────┬─────────┘   └────────┬─────────┘   └────────┬─────────┘
         │ WebSocket            │ ZeroMQ               │ --replay
         ▼                      ▼                       ▼
┌─────────────────────────────────────────────────────────────────┐
│  I/O Layer                                                      │
│  WebSocket / ZMQ / Replay → IngestQueue → FramePacket           │
└─────────────────────────────┬───────────────────────────────────┘
                              │
                    ┌─────────▼──────────┐
                    │    Ingest Gate      │
                    │ (keyframe priority, │
                    │  sweep-based skip)  │
                    └─────────┬──────────┘
                              │
┌─────────────────────────────▼───────────────────────────────────┐
│  Perception Pipeline                                             │
│                                                                  │
│  Segmentation → Heuristics → Scoring → CLIP Encode → Vocab      │
│  (swappable)   (depth,       (top-K)   (ViT-B/32)   Classify    │
│                 border,                                          │
│                 planarity)                                       │
└─────────────────────────────┬───────────────────────────────────┘
                              │
┌─────────────────────────────▼───────────────────────────────────┐
│  Association                                                     │
│  Proximity Query → Embedding Cosine Sim → Match / Create         │
└─────────────────────────────┬───────────────────────────────────┘
                              │
┌─────────────────────────────▼───────────────────────────────────┐
│  Working Memory                                                  │
│  Proto-Objects → Confirmed Objects (hits, stability, view bins)  │
└─────────────────────────────┬───────────────────────────────────┘
                              │
┌─────────────────────────────▼───────────────────────────────────┐
│  Long-Term Memory (FAISS / Milvus)                               │
│  Semantic search: query(text) → CLIP → top-k objects             │
└─────────────────────────────┬───────────────────────────────────┘
                              │
┌─────────────────────────────▼───────────────────────────────────┐
│  API & Visualization                                             │
│  REST API  |  MCP (agents)  |  WebSocket  |  3D Demo (Three.js)  │
└─────────────────────────────────────────────────────────────────┘

Components

I/O Layer

Receives RGB-D frames and camera poses from multiple sources:

WebSocket — Calabi Lens (ARKit, iPhone)
ZeroMQ — Intel RealSense D435i + RTAB-Map
Replay — Recorded sessions for deterministic benchmarking

Frames are buffered in an IngestQueue. The Ingest Gate selects which frames to process based on keyframe priority and sweep-cache novelty, throttling 30 Hz input to ~1-5 Hz processing.
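The gating behaviour described above can be sketched roughly as follows. This is a minimal illustration, not RTSM's implementation: the `FramePacket` fields, the `IngestGate` class, the `max_hz` and `min_novelty` parameters, and the rule that keyframes bypass the rate limit are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FramePacket:
    """Hypothetical stand-in for the pipeline's frame record."""
    timestamp: float          # seconds
    is_keyframe: bool = False
    novelty: float = 0.0      # 0..1, assumed to come from a sweep cache


class IngestGate:
    """Admit keyframes immediately; admit other frames only when they are
    novel enough and the minimum processing interval has elapsed."""

    def __init__(self, max_hz: float = 5.0, min_novelty: float = 0.3):
        self.min_interval = 1.0 / max_hz
        self.min_novelty = min_novelty
        self._last_admitted = float("-inf")

    def admit(self, frame: FramePacket) -> bool:
        elapsed = frame.timestamp - self._last_admitted
        if frame.is_keyframe:
            ok = True  # assumed policy: keyframes always pass
        else:
            ok = elapsed >= self.min_interval and frame.novelty >= self.min_novelty
        if ok:
            self._last_admitted = frame.timestamp
        return ok


# One second of 30 Hz input with constant novelty.
gate = IngestGate(max_hz=5.0, min_novelty=0.3)
frames = [FramePacket(timestamp=i / 30.0, novelty=0.5) for i in range(30)]
admitted = [f for f in frames if gate.admit(f)]
```

With these assumed settings, only a handful of the 30 input frames reach the perception pipeline, which matches the order of magnitude of the throttling described above.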

Perception Pipeline

Segmentation — Extract instance masks from RGB (backend-swappable, see below)
Heuristics — Filter masks by area, border contact, depth validity, planarity
Scoring — Rank surviving masks by priority (coverage, depth quality, structure)
Top-K Selection — Limit to 15 candidates per frame (bounds CLIP compute)
CLIP Encode — 224x224 crop → ViT-B/32 → 512-dim embedding
Vocab Classify — Cosine similarity to text embeddings → label + confidence
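The last three stages can be illustrated with a small sketch. It assumes masks arrive as `(mask_id, score)` pairs and that the reported confidence is the raw cosine similarity; both are assumptions for illustration, as are the helper names `select_top_k` and `classify`. Toy 3-dimensional vectors stand in for the 512-dimensional CLIP embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_top_k(scored_masks, k=15):
    """Keep the k highest-scoring masks; scored_masks is a list of (mask_id, score)."""
    return sorted(scored_masks, key=lambda m: m[1], reverse=True)[:k]


def classify(embedding, vocab):
    """Return (label, similarity) for the closest text embedding.

    vocab maps each label to its text embedding.
    """
    label = max(vocab, key=lambda name: cosine(embedding, vocab[name]))
    return label, cosine(embedding, vocab[label])


vocab = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}
masks = list(enumerate([0.2, 0.9, 0.5, 0.7]))   # (mask_id, score)
top = select_top_k(masks, k=2)
label, confidence = classify([0.9, 0.1, 0.0], vocab)
```

Capping the candidate list before encoding is what bounds the CLIP cost per frame: at most `k` crops are encoded regardless of how many masks the segmenter returns.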

Segmentation Backends

The segmentation stage is a pluggable adapter. RTSM ships with five backends:

Backend	Architecture	License	Mean seg time
`grounded_sam2` (default)	Transformer (Swin + Hiera ViT)	Apache-2.0	222 ms
`sam2`	Transformer (Hiera ViT)	Apache-2.0	~860 ms
`dual`	CNN (YOLOv8)	AGPL-3.0	116 ms
`fastsam`	CNN (YOLOv8)	AGPL-3.0	~50 ms
`yoloe`	CNN (YOLOv8)	AGPL-3.0	~60 ms

The pipeline stages downstream of segmentation (heuristics, CLIP, association, memory) are identical regardless of backend. See Benchmarks for measured performance.
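One way to picture a pluggable adapter is a small interface plus a registry keyed by backend name. The sketch below is hypothetical: `SegmentationBackend`, `InstanceMask`, `register`, and `run_segmentation` are illustrative names, not RTSM's actual API, and the dummy backend exists only to make the example runnable.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence


@dataclass
class InstanceMask:
    """Hypothetical mask record: a pixel bounding box plus a score."""
    bbox: tuple      # (x0, y0, x1, y1)
    score: float


class SegmentationBackend(Protocol):
    """Anything with a name and a segment() method can feed the pipeline."""
    name: str

    def segment(self, rgb: Sequence) -> List[InstanceMask]:
        ...


class WholeFrameBackend:
    """Dummy backend that returns one mask covering the whole image."""
    name = "whole_frame"

    def segment(self, rgb):
        height, width = len(rgb), len(rgb[0])
        return [InstanceMask(bbox=(0, 0, width, height), score=1.0)]


REGISTRY = {}


def register(backend: SegmentationBackend) -> SegmentationBackend:
    """Make a backend selectable by its name."""
    REGISTRY[backend.name] = backend
    return backend


def run_segmentation(backend_name: str, rgb) -> List[InstanceMask]:
    """Downstream stages only see InstanceMask objects, never the backend."""
    return REGISTRY[backend_name].segment(rgb)


register(WholeFrameBackend())
image = [[(0, 0, 0)] * 4 for _ in range(3)]   # 3 rows x 4 columns
masks = run_segmentation("whole_frame", image)
```

Because the downstream stages depend only on the mask type, swapping the backend name changes nothing else in the pipeline.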

Association

Matches new observations to existing objects in working memory:

Proximity Query — Find nearby objects via spatial grid index
Embedding Similarity — Cosine similarity of CLIP vectors (threshold: 0.90)
Score Fusion — Weighted combination → match an existing object or create a new proto-object
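A rough sketch of the proximity-then-similarity flow follows. The fusion weights are not stated here, so this example skips score fusion and simply picks the most similar spatial neighbour at or above the 0.90 threshold. The `SpatialGrid` class, its `cell_size`, and the `associate` helper are assumptions for illustration only.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SpatialGrid:
    """Uniform 3D grid that maps a cell index to the object ids inside it."""

    def __init__(self, cell_size=0.5):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _key(self, position):
        return tuple(math.floor(c / self.cell_size) for c in position)

    def insert(self, obj_id, position):
        self.cells[self._key(position)].append(obj_id)

    def nearby(self, position):
        """Object ids in the 3x3x3 block of cells around position."""
        cx, cy, cz = self._key(position)
        found = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    found.extend(self.cells.get((cx + dx, cy + dy, cz + dz), []))
        return found


def associate(position, embedding, grid, embeddings, threshold=0.90):
    """Return the best-matching object id, or None to signal 'create a new proto-object'."""
    best_id, best_sim = None, threshold
    for obj_id in grid.nearby(position):
        sim = cosine(embedding, embeddings[obj_id])
        if sim >= best_sim:
            best_id, best_sim = obj_id, sim
    return best_id


grid = SpatialGrid(cell_size=0.5)
grid.insert("mug", (1.0, 0.2, 0.0))
embeddings = {"mug": [1.0, 0.0]}

same = associate((1.1, 0.2, 0.0), [0.99, 0.05], grid, embeddings)   # close and similar
different = associate((1.1, 0.2, 0.0), [0.0, 1.0], grid, embeddings)  # close, dissimilar
far = associate((5.0, 5.0, 5.0), [1.0, 0.0], grid, embeddings)       # similar, too far
```

Running the spatial query first keeps the number of embedding comparisons small, since only objects in neighbouring cells are ever scored.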

Working Memory

Holds ObjectState records with position, embeddings, view history, and labels. Objects follow a lifecycle:

New observation → Proto-object → Confirmed object → Long-term memory

Promotion requires repeated observation (hits >= 2), embedding stability, and multi-view coverage.
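The promotion rule can be read as a conjunction of three checks. In the sketch below only `hits >= 2` comes from the text above. The stability measure, the `min_stability` and `min_views` thresholds, and the `ProtoObject` class are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ProtoObject:
    """Hypothetical proto-object record tracking the three promotion signals."""
    hits: int = 0
    stability: float = 0.0   # assumed: similarity between successive embeddings
    view_bins: set = field(default_factory=set)

    def observe(self, view_bin: int, stability: float) -> None:
        """Record one more observation from the given viewing direction bin."""
        self.hits += 1
        self.stability = stability
        self.view_bins.add(view_bin)


def should_promote(obj, min_hits=2, min_stability=0.9, min_views=2):
    """All three conditions must hold; only min_hits=2 is taken from the docs."""
    return (
        obj.hits >= min_hits
        and obj.stability >= min_stability
        and len(obj.view_bins) >= min_views
    )


proto = ProtoObject()
proto.observe(view_bin=0, stability=1.0)
after_one = should_promote(proto)    # a single hit is not enough
proto.observe(view_bin=3, stability=0.95)
after_two = should_promote(proto)    # two hits, stable, seen from two view bins
```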

Long-Term Memory

Confirmed objects are periodically upserted to FAISS (or Milvus) for semantic search. Text queries are encoded via CLIP and matched against stored embeddings.
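The upsert-and-query pattern can be sketched without a vector database. The example below swaps in a brute-force cosine search as a stand-in for FAISS or Milvus, and takes the text encoder as a parameter because the real CLIP call is outside the scope of this page. `BruteForceIndex`, `query`, and the toy encoder are all hypothetical.

```python
import heapq
import math


def _normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot index a zero vector")
    return [x / norm for x in vector]


class BruteForceIndex:
    """Stand-in for a vector store: exact cosine search over all entries."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, obj_id, embedding):
        """Insert a new object or overwrite the embedding of an existing one."""
        self._vectors[obj_id] = _normalize(embedding)

    def search(self, query_vector, k=5):
        """Return up to k (obj_id, similarity) pairs, most similar first."""
        q = _normalize(query_vector)
        scored = (
            (obj_id, sum(a * b for a, b in zip(q, vec)))
            for obj_id, vec in self._vectors.items()
        )
        return heapq.nlargest(k, scored, key=lambda pair: pair[1])


def query(text, encode_text, index, k=5):
    """Encode the text, then search the stored object embeddings."""
    return index.search(encode_text(text), k)


# Toy encoder: a lookup table in place of a CLIP text encoder.
toy_text_embeddings = {"mug": [1.0, 0.0], "sofa": [0.0, 1.0]}

index = BruteForceIndex()
index.upsert("obj-1", [0.9, 0.1])
index.upsert("obj-2", [0.1, 0.9])
index.upsert("obj-1", [1.0, 0.0])   # upsert replaces the earlier embedding
results = query("mug", toy_text_embeddings.__getitem__, index, k=2)
```

Normalizing vectors at insert time means the search reduces to a dot product, which is the same trick commonly used with inner-product indexes in vector stores.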

API Layer

REST API — Query objects, semantic search, stats, analytics
MCP — Model Context Protocol interface for AI agents
WebSocket — Real-time point clouds and object updates
3D Demo — Three.js visualization with TSDF mesh fusion

Data Flow

Each frame passes through the full pipeline:

Frame → Gate → Segment → Filter → Score → Encode → Associate → Update → Index

Measured end-to-end latency: 210 ms (dual) / 510 ms (grounded_sam2) on an RTX 5090.
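As a back-of-the-envelope check, these latencies can be converted into a sustainable processing rate. The sketch assumes frames are processed strictly one at a time with no overlap between stages, which may not hold for the real system. The helper names are illustrative.

```python
def max_rate_hz(latency_ms: float) -> float:
    """Highest sustainable frame rate if frames are processed one at a time."""
    return 1000.0 / latency_ms


def frames_per_processed(input_hz: float, latency_ms: float) -> float:
    """How many input frames arrive while one frame is being processed."""
    return input_hz * latency_ms / 1000.0


for backend, latency_ms in {"dual": 210.0, "grounded_sam2": 510.0}.items():
    print(
        backend,
        round(max_rate_hz(latency_ms), 2),
        round(frames_per_processed(30.0, latency_ms), 1),
    )
```

Under that assumption, 210 ms corresponds to about 4.8 Hz and 510 ms to about 2.0 Hz, both inside the ~1-5 Hz processing range that the Ingest Gate targets for 30 Hz input.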

Next Steps

Perception Pipeline — Deep dive into segmentation and encoding
Memory Model — How objects are tracked and promoted
Benchmarks — Full performance data