Architecture¶

RTSM processes RGB-D frames through a pipeline that extracts, tracks, and stores objects in a queryable spatial memory.

System Overview¶

RGB-D Sensor + SLAM (Pose)
         |
         | ZeroMQ
         v
    I/O Layer
    ZMQ Bridge -> IngestQueue -> FramePacket
         |
         v
  Perception Pipeline
    FastSAM -> Mask Staging -> Top-K Select -> CLIP Encode -> Vocab Classify
         |
         v
   Association
    Proximity Query -> Embedding Cosine Sim -> Score Fusion (match/create)
         |
         v
      Memory
    Working Memory -> Long-Term Memory (FAISS)
         |
         v
  API & Visualization
    REST API | WebSocket | 3D Demo

Components¶

I/O Layer¶

Receives RGB-D frames and camera poses via ZeroMQ. Frames are buffered in an ingest queue with keyframe gating to throttle processing (30 Hz → 5-7 Hz).

Perception Pipeline¶

FastSAM — Segments the RGB image into object masks
Mask Staging — Filters by area (rejects too small/large)
Top-K Select — Limits masks per frame for processing budget
CLIP Encode — Extracts 512-dim embedding from each mask crop
Vocab Classify — Assigns labels via cosine similarity to text embeddings

Association¶

Matches new observations to existing objects:

Proximity Query — Find nearby objects in 3D space
Embedding Similarity — Compare CLIP vectors
Score Fusion — Weighted combination → match or create new

Memory¶

Working Memory — Active object states (position, embeddings, view history)
Long-Term Memory — Confirmed objects indexed in FAISS for semantic search

API Layer¶

REST API — Query objects, search by text, get stats
WebSocket — Stream point clouds and object updates
3D Demo — Three.js visualization

Data Flow¶

Frame → Segment → Encode → Associate → Update Memory → Index → Query

Each frame takes <30ms end-to-end on RTX 5090.

Next Steps¶

Perception Pipeline — Deep dive into segmentation and encoding
Memory Model — How objects are tracked and promoted