LLMs Deep Dive
Hands-on projects in PyTorch
Organization
The University of Chicago
Core Technologies
Python, PyTorch, HuggingFace, TorchText, spaCy
Domain
Natural Language Processing
Date
January 2023
Technical highlight
LSTM Transducer-Based Language Model
In this project, I built a language model using an LSTM (Long Short-Term Memory) transducer, a type of Recurrent Neural Network (RNN) designed to process sequences of data, like text, while remembering important information from earlier in the sequence. This is crucial in language processing, where understanding context and relationships between words is key. Training used early stopping, which halts training once validation performance stops improving and so helps prevent overfitting, and the model's effectiveness was measured by test perplexity, a metric that reflects how well it predicts the next word in a sentence.
I then used the model to generate random text and tested whether it recognizes well-formed sentences by assigning them lower perplexities (i.e., higher probabilities) than incoherent ones.
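A minimal sketch of this setup in PyTorch, assuming a word-level vocabulary and illustrative hyperparameters (the dimensions used in the project differ):

import math
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Word-level LSTM transducer: predicts the next token at every position."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, num_layers=1, dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        # tokens: (batch, seq_len) integer ids -> logits over the vocabulary at each step
        emb = self.embedding(tokens)
        out, hidden = self.lstm(emb, hidden)
        return self.fc(out), hidden

@torch.no_grad()
def perplexity(model, tokens):
    # Perplexity of a sequence: exp of the mean next-token cross-entropy.
    logits, _ = model(tokens[:, :-1])  # predict positions 1..T from positions 0..T-1
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
    return math.exp(loss.item())

The same perplexity function is what allows the model to rank sentences: a coherent sentence receives a lower perplexity than a scrambled one.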
To improve the model's performance, I implemented several extensions (combined in the sketch that follows this list):
1. GRU replacement: I swapped the LSTM for a GRU (Gated Recurrent Unit), a simpler and faster recurrent unit, to compare efficiency and results.
2. Deeper network: I stacked additional LSTM layers so the model could capture more complex patterns in language.
3. Dropout technique: To reduce overfitting, I added dropout, which randomly zeroes out units during training.
4. Gradient clipping: I capped the norm of the gradients during backpropagation to keep training stable and prevent exploding gradients.
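A sketch of how these extensions fit together, reusing the model interface from the sketch above; the hyperparameters, the patience value, and the val_loss_fn helper are illustrative assumptions, not the project's actual settings:

import torch
import torch.nn as nn

# Extensions 1-3: swap nn.LSTM for nn.GRU, stack more layers, and add dropout between layers.
rnn = nn.GRU(input_size=256, hidden_size=512, num_layers=3, dropout=0.3, batch_first=True)

# Extension 4: gradient clipping inside the training loop, combined with early stopping.
def train(model, train_loader, val_loss_fn, optimizer, max_epochs=20, patience=3, clip=1.0):
    best_val, bad_epochs = float("inf"), 0
    criterion = nn.CrossEntropyLoss()
    for epoch in range(max_epochs):
        model.train()
        for tokens in train_loader:                       # tokens: (batch, seq_len)
            logits, _ = model(tokens[:, :-1])
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             tokens[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            # Cap the global gradient norm so one bad batch cannot destabilize training.
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()
        val_loss = val_loss_fn(model)                     # held-out validation loss
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                    # early stopping
                break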
Technical highlight
Implementing the Transformer for Language Translation
In this project, I implemented the Transformer architecture, a state-of-the-art model for language translation introduced in the groundbreaking Attention Is All You Need paper. The Transformer leverages self-attention mechanisms, which allow the model to weigh the importance of different words in a sentence, making it more effective at capturing long-range dependencies compared to traditional RNNs.
I applied this model to translate French sentences into English, training it on parallel corpora of French-English sentence pairs. The self-attention mechanism enabled the model to attend to specific parts of the input sentence at each decoding step, leading to more accurate translations.
By eliminating the sequential processing required by RNNs, the Transformer allowed for much more efficient parallelization during training and reached an impressively low loss of 0.18 after just three epochs.
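A condensed sketch of the encoder-decoder setup built on torch.nn.Transformer; the dimensions are illustrative, and a learned positional embedding stands in for the sinusoidal encoding of the original paper:

import torch
import torch.nn as nn

class TranslationTransformer(nn.Module):
    """Encoder-decoder Transformer for French -> English translation (illustrative sizes)."""
    def __init__(self, src_vocab, tgt_vocab, d_model=512, nhead=8,
                 num_layers=6, dim_ff=2048, max_len=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)    # learned positional encoding
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=dim_ff, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt):
        # src: (batch, src_len) French token ids; tgt: (batch, tgt_len) English token ids
        positions = lambda x: torch.arange(x.size(1), device=x.device)
        src_emb = self.src_embed(src) + self.pos_embed(positions(src))
        tgt_emb = self.tgt_embed(tgt) + self.pos_embed(positions(tgt))
        # Causal mask: each decoder position attends only to earlier target tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1)).to(src.device)
        hidden = self.transformer(src_emb, tgt_emb, tgt_mask=tgt_mask)
        return self.out(hidden)                            # logits over the English vocabulary

The causal mask is what lets the decoder train on whole target sentences in parallel while still predicting each English word from only the words that precede it.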
Technical highlight
Text Summarization Pipeline Using HuggingFace Transformers
In this project, I implemented a text-summarization pipeline on the RCC cluster at the University of Chicago using the HuggingFace Transformers library. I fine-tuned the model on the CNN/DailyMail dataset (v1.0.0), selecting 10,000 examples for training and 2,000 each for validation and testing.
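The data preparation, assuming the HuggingFace datasets library; the shuffling seed is an illustrative choice:

from datasets import load_dataset

# Load CNN/DailyMail v1.0.0 and carve out the subsets described above.
raw = load_dataset("cnn_dailymail", "1.0.0")
train_ds = raw["train"].shuffle(seed=42).select(range(10_000))
val_ds = raw["validation"].shuffle(seed=42).select(range(2_000))
test_ds = raw["test"].shuffle(seed=42).select(range(2_000))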
A key aspect was generating zero-shot summaries with pre-trained models. I incorporated task-specific prompts during tokenization to boost model performance and wrote code to instantiate the Trainer class for fine-tuning.
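A sketch of the prompt-augmented tokenization and the Trainer setup, reusing train_ds and val_ds from above; the t5-small checkpoint, the "summarize: " prompt, and the training arguments are illustrative assumptions rather than the project's exact configuration:

from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

checkpoint = "t5-small"                      # illustrative pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def tokenize(batch):
    # Prepend a task prompt to each article before tokenization.
    inputs = tokenizer(["summarize: " + a for a in batch["article"]],
                       max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["highlights"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

train_tok = train_ds.map(tokenize, batched=True, remove_columns=train_ds.column_names)
val_tok = val_ds.map(tokenize, batched=True, remove_columns=val_ds.column_names)

args = Seq2SeqTrainingArguments(output_dir="summarizer", num_train_epochs=1,
                                per_device_train_batch_size=8, evaluation_strategy="epoch")
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_tok,
                         eval_dataset=val_tok, tokenizer=tokenizer,
                         data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
trainer.train()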
After fine-tuning, I compared the quality of summaries from the original and fine-tuned models using both qualitative methods and the ROUGE metric for quantitative evaluation. Additionally, I explored alternative generation mechanisms like beam search, greedy decoding, and top-k sampling to optimize summary generation.
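A sketch comparing the decoding strategies and scoring the outputs with ROUGE via the evaluate library, reusing model, tokenizer, and test_ds from the sketches above; the evaluation subset size and generation parameters are illustrative:

import evaluate

rouge = evaluate.load("rouge")

def summarize(model, tokenizer, article, strategy="beam"):
    # Generate a summary with one of the decoding strategies compared in the project.
    inputs = tokenizer("summarize: " + article, return_tensors="pt",
                       max_length=512, truncation=True)
    if strategy == "beam":
        ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
    elif strategy == "greedy":
        ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    else:  # top-k sampling
        ids = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_k=50)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

# Score a small evaluation slice against the reference highlights.
sample = test_ds.select(range(100))
scores = rouge.compute(predictions=[summarize(model, tokenizer, ex["article"]) for ex in sample],
                       references=[ex["highlights"] for ex in sample])

Running this comparison once with the original checkpoint and once with the fine-tuned one gives the quantitative side of the before-and-after evaluation.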
Takeaways