Perception Pipeline
The perception pipeline extracts object instances from RGB-D frames and encodes them for matching and search.
Pipeline Stages
1. Segmentation (FastSAM)
FastSAM generates instance masks from the RGB image.
- Input: 640×480 RGB image
- Output: Variable number of binary masks
- Speed: ~10-15ms per frame
FastSAM is a CNN-based approximation of SAM (Segment Anything Model) that trades some accuracy for roughly 50× faster inference.
2. Mask Filtering
Heuristic filters remove unsuitable masks:
| Filter | Purpose |
|---|---|
| Min area (0.1%) | Remove noise/tiny fragments |
| Max area (50%) | Remove walls/floors/background |
| Aspect ratio | Remove extreme shapes |
| Edge touching | Optionally filter partial objects |
Together, these filters typically reject 10-15% of the masks FastSAM produces.
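The filters in the table can be combined into a single predicate. The following is a minimal pure-Python sketch, not the pipeline's actual implementation: masks are represented as nested lists of 0/1, the 0.1% and 50% thresholds are assumed to be fractions of the frame area, and `keep_mask`, `make_mask`, `max_aspect` (5.0) and `reject_edge_touching` are hypothetical names and defaults chosen for illustration.

```python
from typing import List

Mask = List[List[int]]  # rows of 0/1 values; a stand-in for a binary mask array


def keep_mask(mask: Mask,
              min_area_frac: float = 0.001,   # 0.1%, assumed relative to frame area
              max_area_frac: float = 0.5,     # 50%, assumed relative to frame area
              max_aspect: float = 5.0,        # hypothetical threshold
              reject_edge_touching: bool = False) -> bool:
    """Return True if the mask passes all heuristic filters."""
    height, width = len(mask), len(mask[0])
    # Coordinates of every foreground pixel.
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        return False

    # Min/max area: drop tiny fragments and background-sized regions.
    area_frac = len(pixels) / (height * width)
    if area_frac < min_area_frac or area_frac > max_area_frac:
        return False

    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    box_h = max(rows) - min(rows) + 1
    box_w = max(cols) - min(cols) + 1
    # Aspect ratio of the bounding box, expressed so that it is always >= 1.
    if max(box_h, box_w) / min(box_h, box_w) > max_aspect:
        return False

    # Optional: drop masks whose bounding box reaches the frame border.
    if reject_edge_touching:
        if (min(rows) == 0 or min(cols) == 0
                or max(rows) == height - 1 or max(cols) == width - 1):
            return False
    return True


def make_mask(height: int, width: int, top: int, left: int,
              box_h: int, box_w: int) -> Mask:
    """Build a toy mask containing one filled rectangle."""
    return [[1 if top <= r < top + box_h and left <= c < left + box_w else 0
             for c in range(width)] for r in range(height)]
```

Edge filtering is off by default in this sketch because the table describes it as optional; a real implementation would most likely operate on NumPy or tensor masks rather than nested lists.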
3. Top-K Selection
After filtering, we keep only the top K masks (default: 20) per frame to bound compute cost. Selection prioritizes:
- Mask confidence score
- Area (medium-sized preferred)
- Distance from frame center
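One way to combine the three criteria above is a weighted score followed by a sort. This is a sketch under stated assumptions: the weights (0.5 / 0.3 / 0.2), the preferred area of 5% of the frame, and the names `Candidate`, `score` and `select_top_k` are all hypothetical; the text only says which criteria are prioritized, not how they are combined.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    confidence: float   # mask confidence score in [0, 1]
    area_frac: float    # mask area as a fraction of the frame, must be > 0
    cx: float           # mask centroid x, normalized to [0, 1]
    cy: float           # mask centroid y, normalized to [0, 1]


def score(c: Candidate, target_area: float = 0.05) -> float:
    """Combine the three criteria into one score (weights are hypothetical)."""
    # 1.0 when the area equals target_area; falls off symmetrically on a log scale.
    area_term = math.exp(-abs(math.log(c.area_frac / target_area)))
    # 1.0 at the frame center, 0.0 at a corner.
    dist = math.hypot(c.cx - 0.5, c.cy - 0.5) / math.hypot(0.5, 0.5)
    center_term = 1.0 - dist
    return 0.5 * c.confidence + 0.3 * area_term + 0.2 * center_term


def select_top_k(cands: List[Candidate], k: int = 20) -> List[Candidate]:
    """Keep the k highest-scoring candidates (default matches K = 20)."""
    return sorted(cands, key=score, reverse=True)[:k]
```

Because the output length is capped at `k`, the cost of the CLIP stage that follows is bounded regardless of how many masks survive filtering.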
4. CLIP Encoding
Each mask is:
- Cropped from the RGB image (with padding)
- Resized to 224×224
- Encoded via CLIP ViT-B/32
Output: 512-dimensional embedding vector
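The padded crop in the first step can be sketched as a bounding-box expansion clamped to the image. The padding amount is not stated here, so the 10% default below is an assumption, and `padded_crop_box` is a hypothetical helper name.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def padded_crop_box(box: Box, img_w: int, img_h: int,
                    pad_frac: float = 0.1) -> Box:
    """Expand a mask's bounding box by pad_frac per side, clamped to the image."""
    left, top, right, bottom = box
    pad_x = round((right - left) * pad_frac)
    pad_y = round((bottom - top) * pad_frac)
    return (max(0, left - pad_x), max(0, top - pad_y),
            min(img_w, right + pad_x), min(img_h, bottom + pad_y))
```

The returned box uses the same `(left, upper, right, lower)` ordering as Pillow's `Image.crop`, so with Pillow the crop and resize could be written as `image.crop(box).resize((224, 224))` before the result is handed to the CLIP preprocessing step.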
These embeddings enable:
- Matching observations across frames
- Semantic search via text queries
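Both uses reduce to a nearest-neighbour lookup under cosine similarity. The sketch below uses toy 3-dimensional vectors in place of 512-dimensional CLIP embeddings; `best_match` and its 0.8 threshold are hypothetical and not taken from the pipeline.

```python
import math
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(query: Sequence[float],
               candidates: List[Sequence[float]],
               threshold: float = 0.8) -> Optional[int]:
    """Index of the most similar candidate, or None if none reaches threshold."""
    if not candidates:
        return None
    sims = [cosine_similarity(query, c) for c in candidates]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best if sims[best] >= threshold else None


# Embeddings of two objects seen in the previous frame (toy values).
previous = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
same_object = best_match([0.9, 0.1, 0.0], previous)   # close to the first entry
new_object = best_match([0.0, 0.0, 1.0], previous)    # unlike either entry
```

For text search, the query would be a CLIP text embedding and the candidates the stored image embeddings. CLIP text-to-image similarities are typically much lower than image-to-image ones, so a text query would need its own, lower threshold.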
5. Vocabulary Classification
Object labels are assigned by comparing the CLIP embedding to pre-computed text embeddings:
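```python
# Pseudocode: clip, cosine_similarity, argmax, vocab and image_embedding
# are assumed to be defined elsewhere.

# Encode one prompt per vocabulary entry, once, up front.
text_embeddings = clip.encode_text([
    "a photo of a mug",
    "a photo of a backpack",
    "a photo of a chair",
    ...
])
# Compare the mask's image embedding against every text embedding.
similarities = cosine_similarity(image_embedding, text_embeddings)
label = vocab[argmax(similarities)]   # vocab[i] is the class name for prompt i
confidence = max(similarities)        # cosine similarity of the best match
```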
The vocabulary is configurable: add domain-specific objects for your use case.
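Extending the vocabulary amounts to generating one prompt per class name and re-encoding the prompts. The following runnable sketch uses toy 3-dimensional vectors instead of real CLIP text embeddings; `build_prompts` and `classify` are hypothetical helpers, "oscilloscope" is an arbitrary example of a domain-specific entry, and the prompt template is the "a photo of a ..." form used above.

```python
import math
from typing import List, Sequence, Tuple


def build_prompts(vocab: List[str]) -> List[str]:
    """Turn class names into text prompts, choosing 'a' or 'an' by first letter."""
    prompts = []
    for name in vocab:
        article = "an" if name[0].lower() in "aeiou" else "a"
        prompts.append(f"a photo of {article} {name}")
    return prompts


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(image_embedding: Sequence[float],
             text_embeddings: List[Sequence[float]],
             vocab: List[str]) -> Tuple[str, float]:
    """Return the best-matching label and its cosine similarity."""
    sims = [cosine_similarity(image_embedding, t) for t in text_embeddings]
    best = max(range(len(sims)), key=sims.__getitem__)
    return vocab[best], sims[best]


vocab = ["mug", "backpack", "chair", "oscilloscope"]
prompts = build_prompts(vocab)  # these would be passed to the CLIP text encoder

# Toy stand-ins for the encoded prompts, one row per vocabulary entry.
toy_text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0], [0.7, 0.7, 0.0]]
label, confidence = classify([0.1, 0.9, 0.2], toy_text_embeddings, vocab)
```

Since the text embeddings depend only on the vocabulary, they need to be recomputed only when the vocabulary changes, which is consistent with the sub-millisecond per-frame cost of this stage.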
Performance
| Stage | Time (RTX 5090) |
|---|---|
| FastSAM | ~12ms |
| Mask filtering | <1ms |
| CLIP encode (20 masks) | ~15ms |
| Vocab classify | <1ms |
| Total | <30ms |
Next Steps
- Memory Model: How observations become persistent objects
- Architecture: Full system overview