Date of Award

5-2024

Degree Type

Honors College Thesis

Academic Program

Computer Science BS

Department

Computing

First Advisor

Nick Rahimi, Ph.D.

Advisor Department

Computing

Abstract

Recent advances in Natural Language Processing (NLP) have drawn attention to the significant potential for widespread applications of Large Language Models (LLMs). As demands and expectations for LLMs rise, ensuring efficiency and accuracy becomes paramount. Addressing these challenges requires more than just optimizing current techniques; it calls for novel approaches to NLP as a whole. This study investigates novel data preprocessing methods designed to enhance LLM performance by mitigating inefficiencies rooted in natural language, particularly by simplifying the complexities of historical texts. Using Homer's classical text The Odyssey, two preprocessing techniques were introduced: tokenization of names and places, and substitution of outdated terms. After a Long Short-Term Memory (LSTM) network was optimized to perform well on the original text, the study examined how each technique influenced the model's efficiency and precision by analyzing training time and loss metrics. Tokenization significantly reduced the model's training time by simplifying complex names and places, albeit with a slight degradation in output quality. Substitution of outdated terms not only decreased training time but also improved the model's comprehension. This study successfully demonstrated novel preprocessing methods for improving the efficiency of LLMs, offering insight for future research and contributing to the ongoing mitigation of NLP challenges.
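
The two preprocessing techniques described above might look like the following minimal Python sketch. The mappings and the placeholder-token format are hypothetical illustrations; the thesis's actual name lists and term substitutions are not reproduced here.

```python
import re

# Hypothetical mappings for illustration; the thesis's actual lists differ.
# Proper nouns from The Odyssey collapse to single placeholder tokens, so the
# model sees one vocabulary entry instead of a rare character sequence.
NAME_TOKENS = {
    "Odysseus": "<NAME_1>",
    "Telemachus": "<NAME_2>",
    "Penelope": "<NAME_3>",
    "Ithaca": "<PLACE_1>",
}

# Outdated terms mapped to modern equivalents the model has likely seen
# in more consistent contexts.
MODERN_TERMS = {
    "thou": "you",
    "thee": "you",
    "thy": "your",
    "hither": "here",
    "spake": "spoke",
}


def tokenize_names(text: str) -> str:
    """Replace each known name or place with its placeholder token."""
    for name, token in NAME_TOKENS.items():
        text = re.sub(rf"\b{re.escape(name)}\b", token, text)
    return text


def substitute_outdated_terms(text: str) -> str:
    """Swap archaic vocabulary for modern equivalents (case-insensitive)."""
    for old, new in MODERN_TERMS.items():
        text = re.sub(rf"\b{re.escape(old)}\b", new, text, flags=re.IGNORECASE)
    return text


line = "Thou knowest Odysseus sailed hither from Ithaca."
print(substitute_outdated_terms(tokenize_names(line)))
# -> "you knowest <NAME_1> sailed here from <PLACE_1>."
```

Applied over the whole corpus before training, such passes shrink the effective vocabulary and smooth out archaic phrasing, which is consistent with the reported reductions in LSTM training time.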
