Date of Award

Spring 2026

Degree Type

Honors College Thesis

Academic Program

Computer Science BS

Department

Computing

First Advisor

Dr. Jose Martinez Cruz

Advisor Department

Computing

Abstract

This research investigates using a large vision-language model to train smaller, deployable models for architectural floor plan question answering. Reading a floor plan today requires either a human expert or a paid query to a proprietary model; neither option is practical for real-estate platforms that must process thousands of units at scale. To address this, a knowledge distillation approach is employed: a large teacher model (GPT-4.1-mini) generates labeled question-answer pairs from floor plan images, and smaller student models learn from those labels. The teacher produced 37,027 labeled pairs from 12,343 floor plan images. Three Qwen2.5-VL students (0.8B, 2B, and 4B parameters) were then fine-tuned on those labels using Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that freezes the original model weights and trains a small set of additive low-rank matrices, keeping memory requirements low enough for a single commodity graphics processing unit (GPU). The 0.8B model performed best, reaching 52.9% exact match on a 51-sample evaluation set, 15.7 points above its zero-shot score, at roughly 126 ms per query on one GPU. The 4B model gained 9.8 points after fine-tuning. The 2B model dropped from 25.5% to 17.6%, and additional training epochs did not recover the loss, suggesting that factors other than training duration, such as learning rate or input formatting, caused the decline. Based on the reviewed literature, no prior system has applied visual question answering directly to architectural floor plans.
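The LoRA mechanism named in the abstract can be sketched in a few lines: the pretrained weight stays frozen, and only a pair of small low-rank matrices is trained. The sketch below is a minimal NumPy illustration under assumed, illustrative dimensions and rank; it is not the thesis configuration or the Qwen2.5-VL implementation.

```python
# Minimal sketch of the Low-Rank Adaptation (LoRA) idea: a frozen weight W
# plus a trainable low-rank correction B @ A. Dimensions and rank here are
# illustrative assumptions, not the settings used in the thesis.
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 16, 8, 2            # illustrative layer sizes and LoRA rank
W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight (never updated)

# Only A and B are trained. B starts at zero, so the adapted layer
# initially behaves exactly like the frozen base layer.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

def adapted_forward(x):
    """Forward pass through the frozen layer plus the low-rank update."""
    return x @ (W + B @ A).T

x = rng.standard_normal((4, d_in))
# With B = 0, the adapted output matches the base layer output.
assert np.allclose(adapted_forward(x), x @ W.T)

# Trainable parameter count: rank * (d_in + d_out) instead of d_in * d_out,
# which is what keeps memory low enough for a single commodity GPU.
print(rank * (d_in + d_out), "trainable vs", d_in * d_out, "frozen")
```

Because the correction has rank at most `rank`, the trainable parameter count grows linearly in the layer dimensions rather than quadratically, which is the source of the memory savings the abstract describes.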
