This project applies the Transformer architecture to detect and localize objects in images. The dataset used is Caltech 101 from the Caltech library. Following the 2021 paper “An Image is Worth 16x16 Words”, the project implements a similar architecture for object detection. Applying self-attention to image patches yields attention maps that highlight which regions of the image are relevant.
Tasks: Object Detection
Task Categories: Computer Vision
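
As a rough illustration of the patch-based self-attention described above, the following dependency-free Python sketch splits a toy image into patches and computes a single attention map. It is a simplification under stated assumptions, not the project's implementation: the helper names (`split_into_patches`, `self_attention_map`) are hypothetical, the queries and keys are the raw patch values rather than learned projections, and position embeddings, multiple heads, and the detection head are omitted. The toy image uses 2x2 patches in place of the 16x16 patches named in the paper's title.

```python
import math


def split_into_patches(image, patch_size):
    """Split a 2D grayscale image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [image[top + r][left + c]
                     for r in range(patch_size)
                     for c in range(patch_size)]
            patches.append(patch)
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_map(tokens):
    """Return an N x N attention map; row i holds the weights patch i assigns to each patch.

    Assumption: queries and keys are the raw patch vectors (no learned projections).
    """
    scale = math.sqrt(len(tokens[0]))
    attention = []
    for query in tokens:
        # Scaled dot product between this patch and every patch, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        attention.append(softmax(scores))
    return attention


# Toy 4x4 image with one bright 2x2 region in the top-right corner.
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
patches = split_into_patches(image, patch_size=2)   # 4 patches, 4 values each
attention = self_attention_map(patches)             # 4 x 4 attention map
```

In this toy case the bright patch (index 1) assigns most of its attention weight to itself, while the all-zero patches attend uniformly. In a trained model, learned projections would shape these weights so that the attention map concentrates on the regions containing the object.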