Date of Award

5-2024

Degree Type

Honors College Thesis

Academic Program

Computer Science BS

Department

Computing

First Advisor

Nick Rahimi, Ph.D.

Advisor Department

Computing

Abstract

Recent advances in Natural Language Processing (NLP) have drawn attention to the significant potential for widespread applications of Large Language Models (LLMs). As demands and expectations for LLMs rise, ensuring efficiency and accuracy becomes paramount. Addressing these challenges requires more than just optimizing current techniques; it calls for novel approaches to NLP as a whole. This study investigates novel data preprocessing methods designed to enhance LLM performance by mitigating inefficiencies rooted in natural language, particularly by simplifying the complexities of historical texts. Using Homer's classical text The Odyssey, two preprocessing techniques were introduced: tokenization of names and places, and substitution of outdated terms. After a Long Short-Term Memory (LSTM) network was optimized to perform well on the original text, the study examined how each technique influenced the model's efficiency and precision by analyzing training time and loss metrics. Tokenization significantly reduced the model's training time by simplifying complex names and places, albeit with a slight degradation in output quality. Substitution of outdated terms not only decreased training time but also improved the model's comprehension. This study successfully demonstrated novel preprocessing methods for improving the efficiency of LLMs, offering insight for future research and contributing to the ongoing mitigation of NLP challenges.
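
The two preprocessing techniques described above might look like the following minimal Python sketch. The mappings and the placeholder-token format are hypothetical illustrations; the thesis's actual name lists and term substitutions are not reproduced here.

```python
import re

# Hypothetical mappings for illustration; the thesis's actual lists differ.
# Proper nouns from The Odyssey collapse to single placeholder tokens, so the
# model sees one vocabulary entry instead of a rare character sequence.
NAME_TOKENS = {
    "Odysseus": "<NAME_1>",
    "Telemachus": "<NAME_2>",
    "Penelope": "<NAME_3>",
    "Ithaca": "<PLACE_1>",
}

# Outdated terms mapped to modern equivalents the model has likely seen
# in more consistent contexts.
MODERN_TERMS = {
    "thou": "you",
    "thee": "you",
    "thy": "your",
    "hither": "here",
    "spake": "spoke",
}


def tokenize_names(text: str) -> str:
    """Replace each known name or place with its placeholder token."""
    for name, token in NAME_TOKENS.items():
        text = re.sub(rf"\b{re.escape(name)}\b", token, text)
    return text


def substitute_outdated_terms(text: str) -> str:
    """Swap archaic vocabulary for modern equivalents (case-insensitive)."""
    for old, new in MODERN_TERMS.items():
        text = re.sub(rf"\b{re.escape(old)}\b", new, text, flags=re.IGNORECASE)
    return text


line = "Thou knowest Odysseus sailed hither from Ithaca."
print(substitute_outdated_terms(tokenize_names(line)))
# -> "you knowest <NAME_1> sailed here from <PLACE_1>."
```

Applied over the whole corpus before training, such passes shrink the effective vocabulary and smooth out archaic phrasing, which is consistent with the reported reductions in LSTM training time.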
