Date of Award

Spring 2026

Degree Type

Honors College Thesis

Academic Program

Computer Science BS

Department

Computing

First Advisor

Dr. Jose Martinez Cruz

Advisor Department

Computing

Abstract

This research investigates using a large vision-language model to train smaller, deployable models for architectural floor plan question answering. Reading a floor plan today requires either a human expert or a paid query to a proprietary model; neither option is practical for real-estate platforms that must process thousands of units at scale. To address this, a knowledge distillation approach is employed: a large teacher model (GPT-4.1-mini) generates labeled question-answer pairs from floor plan images, and smaller student models learn from those labels. The teacher produced 37,027 labeled pairs from 12,343 floor plan images. Three Qwen2.5-VL students (0.8B, 2B, and 4B parameters) were then fine-tuned on those labels using Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that freezes the original model weights and trains a small set of additive low-rank matrices, keeping memory requirements low enough for a single commodity graphics processing unit (GPU). The 0.8B model performed best, reaching 52.9% exact match on a 51-sample evaluation set, 15.7 points above its zero-shot score, at roughly 126 ms per query on one GPU. The 4B model gained 9.8 points after fine-tuning. The 2B model dropped from 25.5% to 17.6%, and additional training epochs did not recover the loss, suggesting that factors other than training duration, such as learning rate or input formatting, caused the decline. Based on the reviewed literature, no prior system has applied visual question answering directly to architectural floor plans.
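The LoRA mechanism named in the abstract can be sketched in a few lines: the pretrained weight stays frozen, and only a pair of small low-rank matrices is trained. The sketch below is a minimal NumPy illustration under assumed, illustrative dimensions and rank; it is not the thesis configuration or the Qwen2.5-VL implementation.

```python
# Minimal sketch of the Low-Rank Adaptation (LoRA) idea: a frozen weight W
# plus a trainable low-rank correction B @ A. Dimensions and rank here are
# illustrative assumptions, not the settings used in the thesis.
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 16, 8, 2            # illustrative layer sizes and LoRA rank
W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight (never updated)

# Only A and B are trained. B starts at zero, so the adapted layer
# initially behaves exactly like the frozen base layer.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

def adapted_forward(x):
    """Forward pass through the frozen layer plus the low-rank update."""
    return x @ (W + B @ A).T

x = rng.standard_normal((4, d_in))
# With B = 0, the adapted output matches the base layer output.
assert np.allclose(adapted_forward(x), x @ W.T)

# Trainable parameter count: rank * (d_in + d_out) instead of d_in * d_out,
# which is what keeps memory low enough for a single commodity GPU.
print(rank * (d_in + d_out), "trainable vs", d_in * d_out, "frozen")
```

Because the correction has rank at most `rank`, the trainable parameter count grows linearly in the layer dimensions rather than quadratically, which is the source of the memory savings the abstract describes.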
