Computer Science Thesis Defense - Qitong Wang

Name: Qitong Wang 

Advisor: Prof. Mohammed J. Zaki 

Enhanced Text Embeddings and Language Models: Integrating Word Senses and Lexical Relationships with Knowledge Distillation

Abstract: Large language models (LLMs) achieve strong performance by scaling predictive objectives over co-occurrence statistics, without explicit lexical organization. In contrast, human semantic knowledge is structured and discrete, grounded in dictionary-defined senses and lexical relations. This dissertation investigates how structured lexical knowledge can be integrated into modern representation learning to improve efficiency, interpretability, and performance. We first propose HG2Vec, a heterogeneous graph-based static embedding framework constructed entirely from dictionaries and thesauri, demonstrating that curated lexical structure alone can serve as an effective foundation for language models. We then show that even though encoder and decoder models are trained with continuous representations, tokens with similar meanings tend to form structured clusters. We construct a sense dictionary that assigns a small set of semantic embeddings to each token, capturing its different senses. Building on this resource, we introduce Sense-based Knowledge Distillation (SKD), which incorporates sense embeddings into the knowledge distillation framework for encoders. To extend this idea to generative architectures, we further propose Decoder-based Sense Knowledge Distillation (DSKD), which integrates the sense dictionary directly into autoregressive training without architectural modification or inference overhead. Overall, this work presents a unified framework that bridges static embeddings and language models through structured sense representations, offering a principled path toward more efficient and interpretable language models.
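To give a flavor of the sense-based distillation idea sketched in the abstract, below is a minimal, hypothetical PyTorch sketch. It assumes a sense dictionary mapping each token to a small set of K sense embeddings and a loss that pulls each student hidden state toward its token's nearest sense. The names (`SenseDictionary`, `sense_distill_loss`, `K`) and the nearest-sense alignment objective are illustrative assumptions, not the dissertation's actual formulation.

```python
import torch

class SenseDictionary:
    """Maps each vocabulary token to a small set of K sense embeddings."""
    def __init__(self, vocab_size: int, num_senses: int, dim: int):
        # senses[v, k] is the k-th sense embedding of token v.
        # In practice these would be derived from clustered teacher
        # representations rather than random initialization.
        self.senses = torch.randn(vocab_size, num_senses, dim)

    def lookup(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq) -> (batch, seq, K, dim)
        return self.senses[token_ids]

def sense_distill_loss(student_hidden, token_ids, sense_dict):
    """Pull each student hidden state toward its token's closest sense."""
    senses = sense_dict.lookup(token_ids)           # (B, S, K, D)
    h = student_hidden.unsqueeze(2)                 # (B, S, 1, D)
    # Squared distance from the student state to each candidate sense.
    dists = ((h - senses) ** 2).sum(dim=-1)         # (B, S, K)
    # Align only with the nearest sense, so distinct senses stay separated.
    nearest = dists.min(dim=-1).values              # (B, S)
    return nearest.mean()

# Usage: this term would be added to the usual distillation objective.
B, S, V, K, D = 2, 8, 1000, 4, 64
sense_dict = SenseDictionary(V, K, D)
student_hidden = torch.randn(B, S, D)
token_ids = torch.randint(0, V, (B, S))
loss = sense_distill_loss(student_hidden, token_ids, sense_dict)
```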

Date:
Location: Sage 4112 or https://rensselaer.webex.com/meet/wangq19