Date of Award
5-2024
Degree Type
Honors College Thesis
Academic Program
Computer Science BS
Department
Computing
First Advisor
Nick Rahimi, Ph.D.
Advisor Department
Computing
Abstract
Recent advancements in Natural Language Processing (NLP) have drawn attention to the significant potential for widespread applications of Large Language Models (LLMs). As demands and expectations for LLMs rise, ensuring efficiency and accuracy becomes paramount. Addressing these challenges requires more than optimizing current techniques; it calls for novel approaches to NLP as a whole. This study investigates novel data preprocessing methods designed to enhance LLM performance by mitigating inefficiencies rooted in natural language, particularly by simplifying the complexities presented by historical texts. Using Homer's classical text The Odyssey, two preprocessing techniques are introduced: tokenization of names and places, and substitution of outdated terms. After optimizing a Long Short-Term Memory (LSTM) network to perform well on the original text, the study examined how each method influenced the model's efficiency and precision through analysis of training time and loss metrics. Tokenization significantly reduced the model's training time by simplifying complex names and places, albeit with a slight degradation in output quality. Substitution of outdated terms not only decreased training time but also improved the model's comprehension. This study demonstrates novel preprocessing methods for improving the efficiency of LLMs, providing insight for future research and contributing to the ongoing mitigation of NLP challenges.
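The two preprocessing techniques described in the abstract can be sketched as follows. This is a minimal illustration only: the specific entity tokens, proper nouns, and term mappings below are assumptions for demonstration, not taken from the thesis itself.

```python
import re

# Step 1 (tokenization of names and places): replace each known proper
# noun with a single placeholder token, collapsing rare, morphologically
# complex names into a small shared vocabulary.
ENTITIES = {
    "Odysseus": "<NAME_1>",
    "Telemachus": "<NAME_2>",
    "Ithaca": "<PLACE_1>",
}

# Step 2 (substitution of outdated terms): map archaic words to modern
# equivalents the model sees far more frequently in training data.
MODERNIZE = {
    "thou": "you",
    "thy": "your",
    "hither": "here",
}

def preprocess(text: str) -> str:
    """Apply both preprocessing passes to a passage of text."""
    for name, token in ENTITIES.items():
        text = re.sub(rf"\b{re.escape(name)}\b", token, text)
    for old, new in MODERNIZE.items():
        text = re.sub(rf"\b{old}\b", new, text, flags=re.IGNORECASE)
    return text

sample = "Thou art Odysseus, come hither to Ithaca."
print(preprocess(sample))  # → you art <NAME_1>, come here to <PLACE_1>.
```

In this sketch, both passes are simple word-boundary replacements; the thesis itself would define which names, places, and archaic terms from The Odyssey are mapped, and how the placeholder vocabulary is constructed.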
Copyright
Copyright for this thesis is owned by the author. It may be freely accessed by all users. However, any reuse or reproduction not covered by the exceptions of the Fair Use or Educational Use clauses of U.S. Copyright Law or without permission of the copyright holder may be a violation of federal law. Contact the administrator if you have additional questions.
Recommended Citation
Broome, Heather D., "Reviving the Past: Enhancing Language Models with Historical Text Optimization" (2024). Honors Theses. 955.
https://aquila.usm.edu/honors_theses/955