DETR family
- End-to-end detection: no hand-crafted initial guesses or post-processing
- No anchors or region proposals
- No NMS (non-maximum suppression)
- Predictions for all queries can be made in parallel, since the set of predictions is permutation-invariant
- This follows from self-attention, which is permutation-equivariant and computed for all queries simultaneously
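The parallelism point above can be shown with a minimal NumPy sketch of scaled dot-product attention (shapes and names are illustrative, not from any specific implementation): all queries are processed in one matrix product, and permuting the queries just permutes the outputs.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: every query attends to every key in
    # one batched matrix product, so all outputs are computed in parallel.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
queries = rng.normal(size=(100, 64))   # e.g. 100 object queries
feats = rng.normal(size=(500, 64))     # encoder features
out = attention(queries, feats, feats)

# Each query's output depends only on its own content, not on its position
# in the set: permuting the queries permutes the outputs identically.
perm = rng.permutation(100)
assert np.allclose(attention(queries[perm], feats, feats), out[perm])
```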
Backbone
Visual Encoder
- Deformable attention used to save GPU memory and computation
- Each query attends only to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps
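The sampling idea can be sketched as follows. This is a heavily simplified single-head version (nearest-neighbor sampling instead of bilinear interpolation; offsets and weights are passed in rather than predicted by linear layers), so the cost scales with the number of sampling points, not the feature-map size.

```python
import numpy as np

def deformable_attn_single_head(feat_map, ref_pts, offsets, weights):
    """Simplified single-head deformable attention (illustrative only).

    feat_map: (H, W, C) feature map
    ref_pts:  (Q, 2) per-query reference points in pixel coordinates
    offsets:  (Q, K, 2) sampling offsets (learned in the real model)
    weights:  (Q, K) attention weights, normalized per query
    """
    H, W, C = feat_map.shape
    Q, K, _ = offsets.shape
    out = np.zeros((Q, C))
    for q in range(Q):
        for k in range(K):
            # Sample near the reference point; nearest-neighbor for brevity
            # (the real operator uses bilinear interpolation).
            y, x = ref_pts[q] + offsets[q, k]
            y = int(np.clip(round(y), 0, H - 1))
            x = int(np.clip(round(x), 0, W - 1))
            out[q] += weights[q, k] * feat_map[y, x]
    return out
```

The loop touches only Q x K locations, which is why memory and compute stay flat as the feature maps grow.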
Depth predictor

- Visual features are resized to a common shape and added pixel-wise
- A 1x1 conv collapses the fused features into a per-pixel depth map
- Essentially a depth histogram whose bin widths increase linearly with depth (linear-increasing discretization, LID)
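The LID bin edges can be written down explicitly. A common closed form (one of several variants in the literature; parameter names here are illustrative) makes the i-th bin i times as wide as the first, so near depths get finer resolution than far ones:

```python
import numpy as np

def lid_bin_edges(d_min, d_max, num_bins):
    # Linear-increasing discretization: bin widths grow linearly with the
    # bin index. Edge formula (one common form):
    #   d_i = d_min + (d_max - d_min) * i*(i+1) / (K*(K+1))
    i = np.arange(num_bins + 1)
    return d_min + (d_max - d_min) * i * (i + 1) / (num_bins * (num_bins + 1))

edges = lid_bin_edges(1.0, 60.0, 80)   # e.g. 80 bins over 1-60 m
widths = np.diff(edges)                # strictly increasing bin widths
```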
Depth Encoder
- Vanilla transformer self-attention over the depth features
- Depth positionally encoded into the features by interpolating learned round-meter embeddings at the predicted depths
- This prepares the depth features for the cross-attention mechanism with the queries
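The interpolation step can be sketched like this (a minimal NumPy version; the embedding table and shapes are assumptions for illustration): each pixel's continuous depth is encoded by linearly blending the learned embeddings of its two nearest integer-meter values.

```python
import numpy as np

def depth_pos_embedding(depth_map, meter_embeds):
    """Interpolate learned per-meter embeddings at continuous depths.

    depth_map:    (H, W) predicted depths in meters
    meter_embeds: (D, C) learned embeddings, one per integer meter
                  (hypothetical table, learned jointly in the real model)
    """
    D, C = meter_embeds.shape
    d = np.clip(depth_map, 0, D - 1)
    lo = np.floor(d).astype(int)
    hi = np.minimum(lo + 1, D - 1)
    frac = (d - lo)[..., None]
    # Linear interpolation between the two nearest round-meter embeddings
    return (1 - frac) * meter_embeds[lo] + frac * meter_embeds[hi]
```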
Depth-aware decoder
A fixed number of queries is informed by the depth and appearance features over 3 decoder layers, each consisting of:
- Cross-attention with the depth features
- Self-attention among the queries
- Cross-attention with the visual features
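The three-step layer above can be sketched as a loop (a bare-bones NumPy illustration; layer norms, FFNs, multi-head projections, and positional terms are all omitted, so this only shows the attention ordering):

```python
import numpy as np

def attention(Q, K, V):
    # Shared scaled dot-product attention helper
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def depth_aware_decoder(queries, depth_feats, visual_feats, num_layers=3):
    # Each layer: depth cross-attention -> query self-attention ->
    # visual cross-attention, with residual connections.
    for _ in range(num_layers):
        queries = queries + attention(queries, depth_feats, depth_feats)
        queries = queries + attention(queries, queries, queries)
        queries = queries + attention(queries, visual_feats, visual_feats)
    return queries
```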