DETR family
- End-to-end detection: no hand-crafted initial guesses or post-processing
- No anchors or region proposals
- No NMS (non-maximum suppression)
- Predictions for all queries can be made in parallel, since the set of predictions is permutation-invariant
- This follows from self-attention, which is permutation-equivariant and computed for all queries simultaneously
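The parallelism point above can be shown with a minimal NumPy sketch of scaled dot-product attention (shapes and names are illustrative, not from any specific implementation): all queries are processed in one matrix product, and permuting the queries just permutes the outputs.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: every query attends to every key in
    # one batched matrix product, so all outputs are computed in parallel.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
queries = rng.normal(size=(100, 64))   # e.g. 100 object queries
feats = rng.normal(size=(500, 64))     # encoder features
out = attention(queries, feats, feats)

# Each query's output depends only on its own content, not on its position
# in the set: permuting the queries permutes the outputs identically.
perm = rng.permutation(100)
assert np.allclose(attention(queries[perm], feats, feats), out[perm])
```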
Backbone
Visual Encoder
- Deformable attention used to save GPU memory and computation
- Each query attends only to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps
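The sampling idea can be sketched as follows. This is a heavily simplified single-head version (nearest-neighbor sampling instead of bilinear interpolation; offsets and weights are passed in rather than predicted by linear layers), so the cost scales with the number of sampling points, not the feature-map size.

```python
import numpy as np

def deformable_attn_single_head(feat_map, ref_pts, offsets, weights):
    """Simplified single-head deformable attention (illustrative only).

    feat_map: (H, W, C) feature map
    ref_pts:  (Q, 2) per-query reference points in pixel coordinates
    offsets:  (Q, K, 2) sampling offsets (learned in the real model)
    weights:  (Q, K) attention weights, normalized per query
    """
    H, W, C = feat_map.shape
    Q, K, _ = offsets.shape
    out = np.zeros((Q, C))
    for q in range(Q):
        for k in range(K):
            # Sample near the reference point; nearest-neighbor for brevity
            # (the real operator uses bilinear interpolation).
            y, x = ref_pts[q] + offsets[q, k]
            y = int(np.clip(round(y), 0, H - 1))
            x = int(np.clip(round(x), 0, W - 1))
            out[q] += weights[q, k] * feat_map[y, x]
    return out
```

The loop touches only Q x K locations, which is why memory and compute stay flat as the feature maps grow.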
Depth predictor

- Visual features are resized to a common shape and added pixel-wise
- A 1x1 conv collapses the fused features into a per-pixel depth map
- Essentially a depth histogram whose bin widths increase linearly with depth (linear-increasing discretization, LID)
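The LID bin edges can be written down explicitly. A common closed form (one of several variants in the literature; parameter names here are illustrative) makes the i-th bin i times as wide as the first, so near depths get finer resolution than far ones:

```python
import numpy as np

def lid_bin_edges(d_min, d_max, num_bins):
    # Linear-increasing discretization: bin widths grow linearly with the
    # bin index. Edge formula (one common form):
    #   d_i = d_min + (d_max - d_min) * i*(i+1) / (K*(K+1))
    i = np.arange(num_bins + 1)
    return d_min + (d_max - d_min) * i * (i + 1) / (num_bins * (num_bins + 1))

edges = lid_bin_edges(1.0, 60.0, 80)   # e.g. 80 bins over 1-60 m
widths = np.diff(edges)                # strictly increasing bin widths
```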
Depth Encoder
- Vanilla transformer self-attention over the depth features
- Depth positionally encoded into the features by interpolating learned round-meter embeddings at the predicted depths
- This prepares the depth features for the cross-attention mechanism with the queries
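The interpolation step can be sketched like this (a minimal NumPy version; the embedding table and shapes are assumptions for illustration): each pixel's continuous depth is encoded by linearly blending the learned embeddings of its two nearest integer-meter values.

```python
import numpy as np

def depth_pos_embedding(depth_map, meter_embeds):
    """Interpolate learned per-meter embeddings at continuous depths.

    depth_map:    (H, W) predicted depths in meters
    meter_embeds: (D, C) learned embeddings, one per integer meter
                  (hypothetical table, learned jointly in the real model)
    """
    D, C = meter_embeds.shape
    d = np.clip(depth_map, 0, D - 1)
    lo = np.floor(d).astype(int)
    hi = np.minimum(lo + 1, D - 1)
    frac = (d - lo)[..., None]
    # Linear interpolation between the two nearest round-meter embeddings
    return (1 - frac) * meter_embeds[lo] + frac * meter_embeds[hi]
```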
Depth-aware decoder
A fixed number of queries is informed by the depth and appearance features over 3 decoder layers, each consisting of:
- Cross-attention with the depth features
- Self-attention among the queries
- Cross-attention with the visual features
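The three-step layer above can be sketched as a loop (a bare-bones NumPy illustration; layer norms, FFNs, multi-head projections, and positional terms are all omitted, so this only shows the attention ordering):

```python
import numpy as np

def attention(Q, K, V):
    # Shared scaled dot-product attention helper
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def depth_aware_decoder(queries, depth_feats, visual_feats, num_layers=3):
    # Each layer: depth cross-attention -> query self-attention ->
    # visual cross-attention, with residual connections.
    for _ in range(num_layers):
        queries = queries + attention(queries, depth_feats, depth_feats)
        queries = queries + attention(queries, queries, queries)
        queries = queries + attention(queries, visual_feats, visual_feats)
    return queries
```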