Date of Award
Spring 2026
Degree Type
Honors College Thesis
Academic Program
Computer Science BS
Department
Computing
First Advisor
Dr. Jose Martinez Cruz
Advisor Department
Computing
Abstract
This research investigates the use of a large vision-language model to train smaller, deployable models for architectural floor plan question answering. Reading a floor plan today requires either a human expert or a paid query to a proprietary model; neither option is practical for real-estate platforms that must process thousands of units at scale. To address this problem, a knowledge distillation approach is employed: a large teacher model (GPT-4.1-mini) generates labeled question-answer pairs from floor plan images, and smaller student models learn from those labels. The teacher produced 37,027 labeled pairs from 12,343 floor plan images. Three Qwen2.5-VL students (0.8B, 2B, and 4B parameters) were then fine-tuned on those labels using Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that freezes the original model weights and trains a small set of additive matrices, keeping memory requirements low enough for a single commodity graphics processing unit (GPU). The 0.8B model performed best, reaching 52.9% exact match on a 51-sample evaluation set, 15.7 points above its zero-shot score, at roughly 126 ms per query on one GPU. The 4B model gained 9.8 points after fine-tuning. The 2B model dropped from 25.5% to 17.6%, and additional training epochs did not recover the loss, suggesting that factors beyond training duration, such as learning rate or input formatting, contributed to the decline. Based on the reviewed literature, no prior system has applied visual question answering directly to architectural floor plans.
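The LoRA idea summarized in the abstract, a frozen base weight plus a small trainable low-rank update, can be sketched in a few lines. This is an illustrative NumPy sketch only, not the thesis code; the layer sizes, rank, and scaling factor below are hypothetical, not the configuration used in the experiments.

```python
import numpy as np

# Minimal LoRA sketch: the pretrained weight W stays frozen, and only two
# small matrices A and B are trained, so the effective weight becomes
# W + (alpha / r) * A @ B. Sizes here are illustrative assumptions.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 1024, 1024, 8, 16

W = rng.standard_normal((d_in, d_out))      # frozen pretrained weight
A = rng.standard_normal((d_in, r)) * 0.01   # trainable low-rank factor
B = np.zeros((r, d_out))                    # zero-init: adapter starts as a no-op

def lora_forward(x):
    """Base projection plus the scaled low-rank update."""
    return x @ W + (alpha / r) * (x @ A) @ B

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.4f}")  # -> 0.0156
```

Because only A and B receive gradients, the trainable parameter count (and optimizer state) is a small fraction of the full weight, which is what keeps memory requirements within a single commodity GPU.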
Copyright
Copyright for this thesis is owned by the author. It may be freely accessed by all users. However, any reuse or reproduction not covered by the exceptions of the Fair Use or Educational Use clauses of U.S. Copyright Law or without permission of the copyright holder may be a violation of federal law. Contact the administrator if you have additional questions.
Recommended Citation
Siwal, Kiran, "KNOWLEDGE DISTILLATION FROM A LARGE VISION-LANGUAGE MODEL TO COMPACT STUDENTS FOR ARCHITECTURAL FLOOR PLAN UNDERSTANDING" (2026). Honors Theses. 1085.
https://aquila.usm.edu/honors_theses/1085