Date of Award
Fall 12-2023
Degree Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
School
Computing Sciences and Computer Engineering
Committee Chair
Dr. Zhaoxian Zhou
Committee Chair School
Computing Sciences and Computer Engineering
Committee Member 2
Dr. Chaoyang Zhang
Committee Member 2 School
Computing Sciences and Computer Engineering
Committee Member 3
Dr. Bo Li
Committee Member 3 School
Computing Sciences and Computer Engineering
Committee Member 4
Dr. Sarbagya Ratna Shakya
Committee Member 5
Dr. Ras B. Pandey
Committee Member 5 School
Mathematics and Natural Sciences
Abstract
Since the advent of Transformers, and subsequently Vision Transformers (ViTs), researchers have achieved great success in computer vision and object detection. However, the rigid mechanism of splitting images into fixed-size patches poses a serious challenge in this arena, causing the loss of useful information during object detection and classification. We propose an intelligent patching mechanism, IntelPVT, to overcome this challenge and integrate it seamlessly into the conventional patch-based ViT framework. The proposed method uses patches of flexible sizes to capture and retain essential semantic content from input images, improving performance over conventional methods. Our method was evaluated on object detection and classification using three well-known datasets: MS COCO-2017, Pascal VOC, and Cityscapes. The experimental results show promising improvements on several metrics, particularly at higher confidence thresholds, making the method a notable performer in object detection and classification tasks.
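To illustrate the idea of flexible-size patching, the following sketch (an illustrative assumption, not the dissertation's IntelPVT implementation; all names and sizes are hypothetical) shows a patch-embedding module whose patch size is chosen at call time, so an image can be tokenized at different granularities while still producing tokens of a fixed embedding dimension.

```python
# Minimal sketch (illustrative assumption, not the dissertation's IntelPVT code):
# a patch-embedding module whose patch size is chosen per call, so an image can be
# tokenized at different granularities before entering a standard ViT encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FlexiblePatchEmbed(nn.Module):
    """Projects image patches of a selectable size to a fixed embedding dimension."""

    def __init__(self, in_channels: int = 3, embed_dim: int = 256,
                 patch_sizes: tuple = (8, 16, 32)):
        super().__init__()
        # One linear projection per supported patch size keeps the token dimension fixed.
        self.projections = nn.ModuleDict({
            str(p): nn.Linear(in_channels * p * p, embed_dim) for p in patch_sizes
        })

    def forward(self, x: torch.Tensor, patch_size: int) -> torch.Tensor:
        # x: (B, C, H, W); H and W are assumed divisible by patch_size.
        patches = F.unfold(x, kernel_size=patch_size, stride=patch_size)  # (B, C*p*p, N)
        patches = patches.transpose(1, 2)                                 # (B, N, C*p*p)
        return self.projections[str(patch_size)](patches)                 # (B, N, embed_dim)


if __name__ == "__main__":
    embed = FlexiblePatchEmbed()
    img = torch.randn(1, 3, 224, 224)
    coarse = embed(img, patch_size=32)  # (1, 49, 256): fewer, larger patches
    fine = embed(img, patch_size=16)    # (1, 196, 256): more, smaller patches
    print(coarse.shape, fine.shape)
```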
We also address the computational challenges associated with video recognition tasks, where video transformers have shown impressive results but come with high computational costs. We introduce Opt-STViT, a token selection framework that dynamically chooses a subset of informative tokens in both the temporal and spatial dimensions based on the input video samples. Specifically, we frame token selection as a ranking problem, leveraging a lightweight scorer network to estimate the importance of each token; only the top-scoring tokens are retained for downstream processing. In the temporal dimension, we identify and keep the frames most relevant to the action categories, while in the spatial dimension, we pinpoint the most discriminative regions in the feature maps without disrupting the spatial context used hierarchically in most video transformers. To enable end-to-end training despite the non-differentiable nature of token selection, we employ a perturbed-maximum-based differentiable Top-K operator. Extensive experiments, conducted primarily on the Kinetics-400 and Something-Something V2 datasets with the recently introduced MViT video transformer backbone, demonstrate that our framework achieves comparable results while requiring 20% less computation. We also establish the versatility of our approach across different transformer architectures and video datasets.
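The sketch below (hypothetical names and hyperparameters; not the dissertation's Opt-STViT code) illustrates the general technique: a lightweight scorer ranks tokens, and a perturbed-maximum differentiable Top-K operator smooths the hard selection by averaging masks over Gaussian perturbations, so gradients can reach the scorer during end-to-end training.

```python
# Minimal sketch (an assumption, not the dissertation's Opt-STViT code) of a
# perturbed-maximum differentiable Top-K mask over token scores, in the spirit of
# perturbed-optimizer estimators, paired with a lightweight token scorer.
import torch
import torch.nn as nn


class PerturbedTopK(torch.autograd.Function):
    """Soft top-k indicator mask with a Monte Carlo perturbed-maximum gradient."""

    @staticmethod
    def forward(ctx, scores, k, n_samples=100, sigma=0.05):
        # scores: (B, N) token-importance scores from the scorer network.
        noise = torch.randn(n_samples, *scores.shape, device=scores.device)
        perturbed = scores.unsqueeze(0) + sigma * noise            # (S, B, N)
        topk_idx = perturbed.topk(k, dim=-1).indices               # (S, B, k)
        masks = torch.zeros_like(perturbed).scatter_(-1, topk_idx, 1.0)
        ctx.save_for_backward(masks, noise)
        ctx.sigma = sigma
        return masks.mean(dim=0)                                   # (B, N) soft mask

    @staticmethod
    def backward(ctx, grad_output):
        masks, noise = ctx.saved_tensors
        # Jacobian estimate of the smoothed mask: E[mask(s + sigma*z) z^T] / sigma.
        inner = (grad_output.unsqueeze(0) * masks).sum(-1, keepdim=True)  # (S, B, 1)
        grad_scores = (inner * noise).mean(dim=0) / ctx.sigma             # (B, N)
        return grad_scores, None, None, None


class TokenSelector(nn.Module):
    """Reweights tokens by a soft top-k mask produced from a lightweight scorer."""

    def __init__(self, dim: int, k: int):
        super().__init__()
        self.scorer = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))
        self.k = k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D); only the k highest-scoring tokens keep (soft) weight.
        scores = self.scorer(tokens).squeeze(-1)                   # (B, N)
        mask = PerturbedTopK.apply(scores, self.k)                 # (B, N)
        return tokens * mask.unsqueeze(-1)


if __name__ == "__main__":
    selector = TokenSelector(dim=64, k=16)
    x = torch.randn(2, 196, 64, requires_grad=True)
    out = selector(x)
    out.sum().backward()  # gradients reach the scorer despite the hard top-k selection
    print(out.shape, x.grad is not None)
```

At inference time the same scores can be used with a hard top-k to actually drop tokens, which is where the computational savings come from; the smoothed mask is only needed to make training differentiable.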
ORCID ID
https://orcid.org/0009-0005-1525-2395
Copyright
Divya Nimma
Recommended Citation
Nimma, Divya, "IntelPVT and Opt-STViT: Advances in Vision Transformers for Object Detection, Classification and Video Recognition" (2023). Dissertations. 2180.
https://aquila.usm.edu/dissertations/2180