Summary

  1. A good handbook/manual that biases towards the how over the why, i.e., not a textbook.
  2. Easier to read if you already know some ML.
  3. A bit overboard on the illustrations.
  4. A great candidate for an online book that updates often, e.g., the RLHF section already feels a bit outdated.

Understanding Language Models

Introduction to LLMs

  1. Describes historical context and progress of work towards the current (2024) state.
  2. What is Artificial Intelligence - John McCarthy (2007)
  3. word2vec (2013)
  5. The paper that introduced attention for machine translation - Bahdanau et al. (2014)
  5. Attention Is All You Need (2017)
  6. BERT (2018)
  7. GPT-1 (2018)
  8. GPT-2 (2019)
  9. GPT-3 (2020)
  10. GPT-4 (2023)
  11. Llama 2 (2023)

Tokens and Embeddings

  1. Tokenization levels: words, sub-words, characters, and bytes (see the tokenizer sketch after this list).
  2. Considerations: vocabulary size, special tokens (start, end, padding, mask), capitalization, whitespace sensitivity (e.g. for coding)
  3. Token vs. sentence/doc embeddings: The embedding of the last token is not a good doc/sentence embedding
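
A minimal sketch of what sub-word tokenization looks like in practice, using Hugging Face's `transformers` and `bert-base-uncased` as an arbitrary example checkpoint (any tokenizer would do):

```python
# Sub-word tokenization and special tokens, illustrated with an arbitrary checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization splits uncommon words into sub-words."
print(tokenizer.tokenize(text))
# Words missing from the vocabulary get split into '##'-prefixed sub-word pieces;
# note this uncased tokenizer also lowercases everything (one of the
# capitalization trade-offs mentioned above).

encoded = tokenizer(text)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# The encoded version adds special tokens such as [CLS] and [SEP].
```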

Looking Inside LLMs

  1. kv-caching for speeding up inference (a toy sketch follows this list).
  2. Speeding up attention: sparse transformers (2019), Longformer's sliding-window attention (2020), multi-query attention (2019), grouped-query attention (2023), FlashAttention (2022).
  3. Positional Embeddings (RoPE) (2021)
  4. Packing multiple documents into a single context - https://arxiv.org/abs/2107.02027 and Packed BERT
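
Not from the book - just a toy, single-head illustration of the kv-cache idea: keys and values for already-processed tokens are cached, so each decoding step only computes a query for the newest token and attends over the cache instead of recomputing K/V for the whole prefix.

```python
# Toy single-head attention decoder step with a KV cache (illustration only).
import torch

d = 16
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []

def decode_step(x_new: torch.Tensor) -> torch.Tensor:
    """x_new: embedding of the newest token, shape (d,)."""
    q = x_new @ w_q
    k_cache.append(x_new @ w_k)   # the cache grows by one entry per step
    v_cache.append(x_new @ w_v)
    K = torch.stack(k_cache)      # (t, d) -- keys for all tokens so far
    V = torch.stack(v_cache)
    attn = torch.softmax(q @ K.T / d**0.5, dim=-1)
    return attn @ V               # attention output for the newest token only

for _ in range(5):                # pretend we decode 5 tokens
    out = decode_step(torch.randn(d))
print(out.shape)                  # torch.Size([16])
```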

Using Pretrained Language Models

Text classification

  1. Only covers using pretrained models.
  2. Start at Hugging Face’s Massive Text Embedding Benchmark (MTEB) leaderboard, which has lots of models benchmarked across several tasks plus metadata on model size.
  3. Option 1: Find a model that is already trained for your classification task, e.g. RoBERTa for sentiment classification.
  4. Option 2: Get embeddings from a pre-trained embeddings model, e.g. from sentence-transformers, and train a classifier using the embedding as a feature vector. Roughly equivalent to fine-tuning with the embedding model frozen (only the classifier is trained).
  5. Option 2.5: Zero-shot with embeddings of the labels, e.g. for a movie review `M` with embedding `e(M)`, compute the cosine similarity of `e(M)` to `e("this is a positive movie review")` and `e("this is a negative movie review")` and pick the closer one. Surprisingly, this works reasonably well on a movie review sentiment classification task (F1 of 0.78 vs. 0.85 for option 2 above). Code here; a sketch follows this list.
  6. Option 3: Ask an instruction fine-tuned generative model, e.g. Flan-T5, which was created by instruction fine-tuning T5 (an encoder-decoder model). Prompting with “Is the following sentence positive or negative?” gives an F1 of 0.84; the same prompt with gpt-3.5 gives an F1 of 0.91.
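
A sketch of option 2.5 with `sentence-transformers`; the `all-MiniLM-L6-v2` checkpoint and the label phrasings are illustrative choices, not necessarily the book's exact setup:

```python
# Zero-shot sentiment classification via label embeddings (option 2.5).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

label_texts = ["This is a positive movie review.", "This is a negative movie review."]
label_emb = model.encode(label_texts, normalize_embeddings=True)

reviews = ["A heartfelt, beautifully acted film.", "Two hours of my life I want back."]
review_emb = model.encode(reviews, normalize_embeddings=True)

# Cosine similarity of each review to each label description; argmax = predicted class.
sims = util.cos_sim(review_emb, label_emb)
preds = sims.argmax(dim=1)
for review, p in zip(reviews, preds):
    print(review, "->", label_texts[p])
```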

Text clustering and topic modeling

  1. Generate embeddings for docs with a model from Hugging Face’s Massive Text Embedding Benchmark (MTEB) leaderboard, project to a lower dimension using PCA/UMAP, and cluster using k-means/HDBSCAN (a minimal sketch follows this list).
  2. BERTopic: Various algorithms to generate topic representations on top of clusters.
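
A minimal version of the embed → reduce → cluster pipeline, assuming `sentence-transformers`, `umap-learn`, and `hdbscan` are installed; the model choice, dataset, and hyperparameters are illustrative:

```python
# Embed documents, reduce dimensionality with UMAP, cluster with HDBSCAN.
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
import umap
import hdbscan

# Any corpus works; 20 newsgroups is just a convenient built-in example.
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:500]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(reduced)
print(labels)  # cluster id per document, -1 = noise
```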

Prompt engineering

  1. In-context learning: Give 1 or more examples in the prompt.
  2. Chain prompting: Manually break up the task into multiple steps and chain outputs/inputs, e.g. to output a book, prompt with a topic to output a title, prompt with the generated title to output a summary, and prompt with the summary to output a book.
  3. Chain-of-thought: (1) give reasoning example(s) in the prompt; (2) prompt with “let’s think step-by-step”.
  4. Self-consistency: Sample multiple outputs and pick the majority/most popular answer (see the sketch after this list).
  5. Tree-of-thought: Explore multiple reasoning branches with intermediate verification, or approximate it in a single prompt by asking the model to simulate a conversation between multiple experts.
  6. Output validation: Constrain the output format using a prompt or by restricting the output tokens during sampling (see Guidance, Guardrails and LMQL).
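
Not from the book: a bare-bones self-consistency loop, with `generate()` as a placeholder for whatever LLM API is in use and a hypothetical "Answer: <x>" output convention:

```python
# Self-consistency: sample several chain-of-thought answers at a nonzero
# temperature and return the most common final answer.
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    # Placeholder: plug in your LLM API call here.
    raise NotImplementedError

def extract_answer(completion: str) -> str:
    # Assumes the prompt asks the model to end with a line like "Answer: <x>".
    return completion.rsplit("Answer:", 1)[-1].strip()

def self_consistent_answer(question: str, n_samples: int = 5) -> str:
    prompt = f"{question}\nLet's think step by step, then finish with 'Answer: <answer>'."
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```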

Advanced Text Generation Techniques

  1. Using quantized models in GGUF format (a loading sketch follows this list).
  2. Chaining LLM calls with LangChain.
  3. Adding memory with full conversation buffers or conversation summaries.
  4. Tool usage with ReAct in LangChain (already deprecated!).
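
A sketch of running a GGUF-quantized model with `llama-cpp-python`; the model path is a placeholder and the sampling settings are illustrative:

```python
# Run a quantized GGUF model locally with llama-cpp-python.
from llama_cpp import Llama

# Download any GGUF checkpoint and point model_path at it.
llm = Llama(model_path="path/to/model.gguf", n_ctx=4096)

out = llm(
    "Q: Name three uses of a quantized local LLM.\nA:",
    max_tokens=128,
    temperature=0.7,
    stop=["Q:"],
)
print(out["choices"][0]["text"])
```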

Semantic Search and Retrieval-Augmented Generation (RAG)

  1. Dense retrieval: Chunk the document, get embeddings for each chunk, embed the query, and find the closest chunks (a sketch follows this list). Use FAISS/Annoy to scale up nearest-neighbor search.
  2. Re-ranking: Concatenate the query and document and pass them as input to an encoder-style model trained to output 0/1 relevance scores.
  3. RAG: Find relevant docs/chunks using embedding search, include them in the prompt and instruct an LLM to refer to them and/or cite.
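
A minimal dense-retrieval sketch with `sentence-transformers` and FAISS; the checkpoint and the toy chunks are illustrative. Inner product on normalized embeddings is cosine similarity.

```python
# Dense retrieval: embed chunks, index them with FAISS, retrieve top-k for a query.
import faiss
from sentence_transformers import SentenceTransformer

chunks = ["chunk about pricing ...", "chunk about returns ...", "chunk about shipping ..."]
model = SentenceTransformer("all-MiniLM-L6-v2")

chunk_emb = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(chunk_emb.shape[1])  # inner-product index
index.add(chunk_emb)

query_emb = model.encode(["how do I return an item?"], normalize_embeddings=True)
scores, ids = index.search(query_emb, 2)
print([chunks[i] for i in ids[0]])  # top-2 chunks to include in a RAG prompt
```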

Multimodal LLMs

  1. ViT (Vision Transformer) - a transformer encoder applied to image patches.
  2. CLIP - multimodal (image + text) embeddings; see the sketch after this list. Also, OpenCLIP.
  3. BLIP-2 - multimodal (input) generator. Use for image captioning and queries about images.
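
A short CLIP scoring example via `transformers`; the checkpoint is the standard OpenAI release and the image path is a placeholder:

```python
# Score an image against candidate captions with CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("path/to/image.jpg")  # placeholder path
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text similarity per caption
print(dict(zip(texts, probs[0].tolist())))
```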

Training and Fine-Tuning Language Models

Creating Text Embedding Models

  1. Sentence-BERT/SBERT - 2-tower/Siamese network for contrastive learning of embeddings. Prior art was a cross-encoder, but that is clearly much more expensive. “A solution to this overhead is to generate embeddings from a BERT model by averaging its output layer or using the [CLS] token. This, however, has shown to be worse than simply averaging word vectors like GloVe”. - Interesting claim. SBERT uses mean pooling. Why isn’t this a problem? (Presumably because the quote is about pooling a frozen, pretrained BERT, while SBERT’s contrastive fine-tuning adapts the representation to being pooled.)
  2. Training an embedding model - Fairly straightforward. Some choices of loss functions.
  3. Fine-tuning: Same as (2) but starting with a pretrained embedding model (a minimal sketch follows this list).
  4. Augmented SBERT: Generate training labels for an SBERT style model using a cross-encoder model.
  5. Unsupervised training to learn embeddings - TSDAE uses a setup very similar to masked language modeling, but the decoder only gets to see a (pooled) sentence embedding instead of token embeddings. Can be adapted to a domain by doing a supervised fine-tuning round on top of the pretrained model.
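
A minimal fine-tuning sketch using the classic `sentence-transformers` `fit` API (newer versions also offer a Trainer-based API); the checkpoint, toy pairs, and loss choice are illustrative:

```python
# Fine-tune a pretrained embedding model on sentence pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Pairs of similar sentences; MultipleNegativesRankingLoss treats the other
# in-batch pairs as negatives, so no explicit negative labels are needed.
train_examples = [
    InputExample(texts=["How do I reset my password?", "Steps to change your password"]),
    InputExample(texts=["Refund policy for damaged items", "Can I get my money back if it's broken?"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```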

Fine-tuning Representation Models for Classification

  1. Fairly straightforward - Fine-tune a pretrained BERT model for a classification task by unfreezing one or more layers.
  2. SetFit: Surprising process: (1) Make a dataset of positive and negative sentence pairs from a labeled dataset; (2) Fine-tune a Sentence Transformer on it; (3) Learn a classifier on the fine-tuned embeddings. Why would this work any better than doing (3) directly?
  3. Fine-tuning for Named Entity Recognition: Fairly straightforward, but you need to carefully align the word-level entity labels with tokens (see the sketch after this list).
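
A sketch of the label-alignment step for NER: only the first sub-word of each word keeps its label, the rest are masked with -100 so the loss ignores them. The checkpoint and the toy labels are illustrative.

```python
# Align word-level NER labels with sub-word tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

words = ["Maarten", "lives", "in", "Rotterdam"]
word_labels = [1, 0, 0, 2]  # e.g. 1 = B-PER, 0 = O, 2 = B-LOC

encoding = tokenizer(words, is_split_into_words=True)
aligned = []
prev_word = None
for word_id in encoding.word_ids():
    if word_id is None:            # special tokens like [CLS]/[SEP]
        aligned.append(-100)
    elif word_id != prev_word:     # first sub-word of a word keeps the label
        aligned.append(word_labels[word_id])
    else:                          # continuation sub-words are masked out
        aligned.append(-100)
    prev_word = word_id

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(aligned)
```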

Fine-tuning Generation Models

  1. Supervised Fine-tuning: Full fine-tuning or Parameter-Efficient Fine-Tuning (PEFT) using adapters or LoRA (see the LoRA sketch at the end of this section).
  2. Preference-tuning/Alignment/RLHF:
    1. PPO: (1) Train a copy of the LLM to predict rewards based on a human preference dataset; (2) fine-tune the original LLM using rewards from the reward model.
    2. DPO: Skips the separate reward model and optimizes directly on preference pairs.
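
A LoRA sketch with the `peft` library; the base model (`gpt2`) and the hyperparameters are illustrative:

```python
# Parameter-efficient fine-tuning with LoRA via peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
# From here, train as usual (e.g. with the transformers Trainer or TRL's SFTTrainer).
```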