Date of Award

Fall 12-2023

Degree Type


Degree Name

Doctor of Philosophy (PhD)


Computing Sciences and Computer Engineering

Committee Chair

Dr. Zhaoxian Zhou

Committee Chair School

Computing Sciences and Computer Engineering

Committee Member 2

Dr. Chaoyang Zhang

Committee Member 2 School

Computing Sciences and Computer Engineering

Committee Member 3

Dr. Bo Li

Committee Member 3 School

Computing Sciences and Computer Engineering

Committee Member 4

Dr. Sarbagya Ratna Shakya

Committee Member 5

Dr. Ras B. Pandey

Committee Member 5 School

Mathematics and Natural Sciences


Since the advent of Transformers, followed by Vision Transformers (ViTs), researchers have achieved great success in computer vision and object detection. However, the rigid mechanism of splitting images into fixed-size patches poses a serious challenge and results in the loss of useful information during object detection and classification. We propose an intelligent patching mechanism to overcome this challenge and integrate it seamlessly into the conventional patch-based ViT framework. The proposed method uses patches of flexible sizes to capture and retain essential semantic content from input images, improving performance over conventional methods. We tested our method on three well-known datasets, MSCOCO-2017, Pascal VOC, and Cityscapes, for object detection and classification. The experimental results showed promising improvements in specific metrics, particularly at higher confidence thresholds, making the method a strong candidate for object detection and classification tasks.

We also address the computational challenges associated with video recognition tasks, where video transformers have shown impressive results but come with high computational costs. We introduce Opt-STViT, a token selection framework that dynamically chooses a subset of informative tokens in both the temporal and spatial dimensions based on the input video samples. Specifically, we frame token selection as a ranking problem, leveraging a lightweight scorer network to estimate the importance of each token. Only the tokens with the top scores are retained for downstream processing. In the temporal dimension, we identify and keep the frames most relevant to the action categories, while in the spatial dimension, we pinpoint the most discriminative regions in the feature maps without affecting the spatial context used hierarchically in most video transformers. To enable end-to-end training despite the non-differentiable nature of token selection, we employ a perturbed-maximum-based differentiable Top-K operator. Our extensive experiments, primarily conducted on the Kinetics-400 and Something-Something-V2 datasets using the recently introduced MViT video transformer backbone, demonstrate that our framework achieves similar results while requiring 20% fewer computational resources. We also establish the versatility of our approach across different transformer architectures and video datasets.
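The score-then-select step described above can be illustrated with a minimal NumPy sketch. It is not the Opt-STViT implementation: the function names, the noise scale `sigma`, and the sample count are assumptions chosen for illustration, and the scores are taken as given rather than produced by a scorer network. The sketch shows only the forward pass of a perturbed-maximum Top-K, in which Gaussian noise is added to the scores, a hard Top-K is taken for each noisy copy, and the resulting one-hot selections are averaged into soft selection weights.

```python
import numpy as np


def hard_top_k_indicators(scores, k):
    """Return a (k, n) one-hot matrix picking the k highest-scoring tokens.

    Rows are ordered by original token position so that the selected
    tokens keep their temporal or spatial order.
    """
    n = scores.shape[0]
    top = np.argpartition(-scores, k - 1)[:k]  # indices of the k largest scores
    top = np.sort(top)                         # restore original ordering
    indicators = np.zeros((k, n))
    indicators[np.arange(k), top] = 1.0
    return indicators


def perturbed_top_k(scores, k, num_samples=100, sigma=0.05, rng=None):
    """Forward pass of a perturbed-maximum Top-K (illustrative assumption).

    Averages hard Top-K indicators over Gaussian perturbations of the scores.
    With sigma = 0 this reduces to the ordinary hard Top-K.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(size=(num_samples, scores.shape[0]))
    perturbed = scores[None, :] + sigma * noise
    return np.mean([hard_top_k_indicators(p, k) for p in perturbed], axis=0)


def select_tokens(tokens, scores, k, **kwargs):
    """Keep k tokens as soft combinations weighted by the selection matrix."""
    weights = perturbed_top_k(scores, k, **kwargs)  # shape (k, n)
    return weights @ tokens                         # shape (k, d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(8, 4))  # 8 hypothetical tokens of dimension 4
    scores = rng.normal(size=8)       # stand-in for scorer-network outputs
    kept = select_tokens(tokens, scores, k=3, rng=rng)
    print(kept.shape)
```

In a trainable setting the same averaging would be done inside an automatic-differentiation framework with a custom backward pass, since the perturbation is what makes the expected selection differentiable with respect to the scores; the NumPy version above omits gradients entirely.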


Available for download on Tuesday, December 31, 2024