Apologies, it’s been a while since I posted on Substack, but I am now back to regular posting!
I am excited to announce that my book Designing Large Language Model Applications, published by O’Reilly Media, will be released in March 2025! I strongly believe that a technical book, especially one on an extremely fast-moving topic, needs an ample amount of supplementary content to put concepts into perspective, enable readers to practice what they have learned, and remain relevant through changing tides.
To this end, I have prepared a lot of content that I will put out over the coming weeks and months to enhance the reading experience of the book. This includes Substack posts with addenda to chapters, covering recent advances as well as book content that just didn’t make the final edit, whether because it was too subjective an opinion, too nascent a technology, or simply lost out to more important topics.
Other upcoming content includes a GitHub repo called llm-playbooks, which I am preparing with a team of friends. It contains solutions to the book exercises and provides playbooks for various LLM topics, including fine-tuning, reinforcement learning, RAG, agents, and more. Many of the book exercises are highly experimental in nature and will be repurposed as Kaggle competitions starting in February, so that we can tackle them together as a community!
Finally, I will be appearing on a podcast series with Angela Teng, where we will dedicate an episode to each chapter of the book and answer questions from readers. Overall, I hope to make the book as beneficial to readers as possible, enabling them not only to develop intuitions about concepts but also to put them into practice to build high-quality applications.
In today’s post, I provide the extended table of contents of the book. The book is divided into three parts, with a total of 13 chapters. The first part deals with understanding the ingredients of a language model: I strongly feel that even if you never pre-train or fine-tune a language model yourself, knowing what goes into making one is crucial. The second part discusses various ways to harness language models, whether by directly prompting the model or by fine-tuning it in various ways. It also addresses limitations such as hallucinations, reasoning constraints, and inference speed, along with methods to mitigate these issues. Finally, the third part of the book deals with application paradigms like retrieval-augmented generation (RAG) and agents, positioning LLMs within the broader context of an entire software system.
Let’s dig deeper into the table of contents to see what is coming in the book:
Part 1: LLM Ingredients
Chapter 1: Introduction
In the first chapter, I introduce the concept of a language model and show why next-token prediction is such a powerful paradigm. I also provide a brief history of LLMs and describe how we got to this stage. I showcase use cases, strengths, and limitations, and provide a quick introduction to the art and science of prompting language models. Finally, I end the chapter with a canonical “Chat with your PDF” application. This application can be built in fewer than 50 lines of code, but I highlight its limitations and set the stage for the rest of the book by pointing out the role each subsequent chapter plays in improving it.
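As a teaser, here’s a minimal sketch of what such an application can look like, assuming the pypdf and openai packages; the model name and file path are placeholders, and chunking, retrieval, and error handling (all covered later in the book) are deliberately left out:

```python
# A minimal "Chat with your PDF" sketch using pypdf and the OpenAI client.
from pypdf import PdfReader
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def load_pdf_text(path: str) -> str:
    """Concatenate the extracted text of every page in the PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def chat_with_pdf(path: str, question: str) -> str:
    """Naively stuff the whole document into the prompt and ask a question."""
    document = load_pdf_text(path)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Answer using only the document below.\n" + document},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(chat_with_pdf("paper.pdf", "What is the main contribution?"))
```

Stuffing the entire document into the prompt only works for small PDFs, which is precisely the kind of limitation the later chapters address.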
Chapter 2: Pre-training Data
In this chapter, I focus on the makeup and preparation of pre-training data, dissecting popular datasets ranging from C4 to FineWeb. I then go into detail on the steps that make up the data preprocessing pipeline, including language identification, heuristic- and classifier-based data cleaning, perplexity-based quality filtering, deduplication, PII detection and remediation, and training-set decontamination. I also cover determining training data mixtures, curriculum learning, and more. Throughout the chapter, I provide examples from real-world pre-training datasets to show how these steps look in practice.
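To make the flavor of these steps concrete, here’s an illustrative sketch of heuristic filtering and exact deduplication in plain Python; the thresholds are invented for illustration and not drawn from any real pipeline:

```python
import hashlib

raw_docs = [
    "A long, clean paragraph of natural text " * 20,
    "A long, clean paragraph of natural text " * 20,  # exact duplicate
    "$$$ ### !!!",                                    # mostly symbols
]

def passes_heuristics(doc: str) -> bool:
    """Toy quality filters in the spirit of C4/Gopher-style rules."""
    words = doc.split()
    if len(words) < 50:                                   # too short
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_len <= 10):                         # gibberish or run-ons
        return False
    alpha_frac = sum(w.isalpha() for w in words) / len(words)
    return alpha_frac >= 0.8                              # drop symbol-heavy docs

def deduplicate(docs):
    """Exact deduplication by content hash; real pipelines add
    near-duplicate detection such as MinHash."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

clean = [d for d in deduplicate(raw_docs) if passes_heuristics(d)]
print(len(clean))  # 1: the duplicate and the symbol-heavy doc are removed
```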
Chapter 3: Vocabulary and Tokenization
Next, I discuss another crucial ingredient of LLMs - their vocabulary. I explain how a vocabulary for a model is chosen and generated, as well as the tradeoffs involved in choosing the optimal vocabulary size. I then introduce the concept of tokenization - splitting raw text into its constituent vocabulary elements - and walk through the process of training a tokenizer using a variety of algorithms and using it during inference. Finally, I explore the role of special tokens and demonstrate how to add custom tokens to a vocabulary. Over the course of the chapter, I also highlight token etymologies and showcase how tokenization decisions impact model behavior.
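For instance, training a small BPE tokenizer takes just a few lines with the Hugging Face tokenizers library; the toy corpus and vocabulary size below are purely illustrative:

```python
# Training a small BPE tokenizer with the Hugging Face `tokenizers` library.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = ["low lower lowest", "new newer newest", "wide wider widest"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=200,  # real models use tens of thousands
    special_tokens=["[UNK]", "[PAD]", "<|endoftext|>"],
)
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("newest widest")
print(encoding.tokens)  # subword pieces chosen by the learned merges
```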
Chapter 4: Architectures and Learning Objectives
In this chapter, I examine the Transformer architecture, the predominant architecture underpinning the LLMs of today. I describe each component and show how they fit together: feedforward networks, self-attention, layer normalization, and positional encodings. I also look at various architectural backbones, including encoder-only, encoder-decoder, and decoder-only models, as well as mixture-of-experts (MoE) models. I then discuss learning objectives - the tasks language models are trained on - including masked language modeling, prefix language modeling, full language modeling, and their variants. Finally, I end with a walkthrough of training a language model from scratch that learns to play chess and nothing else.
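To give a flavor of the components, here’s a single head of scaled dot-product self-attention reduced to its essentials in PyTorch, with masking, multi-head splitting, and dropout omitted:

```python
# A single head of scaled dot-product self-attention, minus the trimmings.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # similarity between positions
    weights = F.softmax(scores, dim=-1)      # attention distribution per token
    return weights @ v                       # weighted mix of value vectors

d_model, d_head, seq_len = 16, 8, 4
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([4, 8])
```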
Part 2: Utilizing LLMs
Chapter 5: Adapting LLMs to your Use Case
In this chapter, I first chart the LLM landscape, highlighting the major model providers and their business models. I lay out the differences between proprietary and open-source LLMs, including their licensing implications. I then describe how LLMs are evaluated, exploring both benchmark tasks and Elo-style human evaluations. I show how to load and run inference on open models using Hugging Face or Ollama, and, to further the discussion on inference, I survey various token decoding strategies. I end by exploring LLM interpretability, showcasing how tools like LIT-NLP can offer insights into what’s happening inside the model.
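As a quick illustration of decoding strategies, here’s how greedy decoding and nucleus sampling compare using the Transformers generate API; gpt2 stands in as a small placeholder model:

```python
# Comparing decoding strategies with Hugging Face Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder; swap in any causal LM you can run
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
inputs = tokenizer("The key idea behind RAG is", return_tensors="pt")

greedy = model.generate(**inputs, max_new_tokens=30, do_sample=False)
sampled = model.generate(**inputs, max_new_tokens=30, do_sample=True,
                         temperature=0.8, top_p=0.9)  # nucleus sampling

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```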
Chapter 6: Fine-tuning
This chapter motivates the need for fine-tuning with examples where prompting alone isn’t enough. I then provide a complete end-to-end fine-tuning example, discussing aspects including batch size, learning rate, learning-rate schedules, optimization algorithms, memory optimizations, and regularization. I also cover the basics of parameter-efficient fine-tuning and how to work in reduced precision.
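To make those knobs concrete, here’s what they look like expressed as a Hugging Face TrainingArguments configuration; the values are illustrative defaults, not recommendations from the book:

```python
# The fine-tuning knobs discussed above, as a TrainingArguments config.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ft-out",
    per_device_train_batch_size=8,   # batch size
    gradient_accumulation_steps=4,   # effective batch size = 8 * 4
    learning_rate=2e-5,              # peak learning rate
    lr_scheduler_type="cosine",      # learning-rate schedule
    warmup_ratio=0.03,
    optim="adamw_torch",             # optimization algorithm
    weight_decay=0.01,               # regularization
    bf16=True,                       # reduced-precision training
    gradient_checkpointing=True,     # memory optimization
    num_train_epochs=3,
)
```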
The second half of the chapter focuses on preparing fine-tuning datasets, in particular instruction-tuning datasets for supervised fine-tuning (SFT). I highlight popular open-source datasets and show the various sources and techniques through which they can be constructed.
Chapter 7: Advanced Fine-Tuning Techniques
This chapter dives into more sophisticated approaches to updating a model. I first focus on continual pre-training for domain and task adaptation, and show how its susceptibility to catastrophic forgetting can be mitigated with techniques such as replay and parameter expansion. I then walk through parameter-efficient fine-tuning methods, including subset methods, where only a small portion of the model’s parameters are updated, and additive methods, where new parameters are added to the model and only those are trained. Finally, I cover model ensembling and model merging, the latter often regarded as the ‘dark arts’ of LLMs.
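As an example of the additive family, here’s a sketch of attaching LoRA adapters with the peft library; the base model is a placeholder, and the target modules shown match GPT-2-style attention layers:

```python
# Attaching LoRA adapters: new low-rank matrices are added and trained
# while the base weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
config = LoraConfig(
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection in GPT-2
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```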
Chapter 8: Alignment Training and Reasoning
This chapter discusses three core limitations of language models: steerability, hallucination, and reasoning constraints. To tackle steerability, I focus on alignment training, currently dominated by reinforcement learning techniques, and provide a rudimentary introduction to reinforcement learning using the TRL library. I then define model hallucinations and discuss how to detect and mitigate them. Finally, I discuss the various types of reasoning and explore techniques for enhancing reasoning capabilities, focusing on a basic introduction to inference-time compute techniques, including repeated sampling, iterative generation, and search paradigms like Monte Carlo tree search (MCTS).
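Here’s a minimal sketch of repeated sampling (best-of-N), with generate and score as hypothetical stand-ins for a temperature-sampled LLM call and a verifier or reward model:

```python
# Repeated sampling (best-of-N): draw several candidate answers and keep
# the one a scorer likes best.
import random

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a temperature-sampled LLM call."""
    return f"candidate answer {random.randint(0, 999)} to: {prompt}"

def score(prompt: str, answer: str) -> float:
    """Hypothetical stand-in for a reward model or verifier."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

print(best_of_n("What is 17 * 23?"))
```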
Chapter 9: Inference Optimization
In this chapter, I discuss another key limitation of LLMs - their inference speed. First, I show how to reduce compute requirements by using techniques like context caching, early exit, and model distillation. I then explain how to speed up decoding through methods like speculative decoding, parallel decoding, and multi-token decoding. Finally, I discuss how to minimize storage requirements through quantization.
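As a taste of quantization, here’s absmax int8 quantization of a weight tensor in PyTorch; production schemes like GPTQ, AWQ, or bitsandbytes are considerably more sophisticated:

```python
# Absmax int8 quantization: 4x storage savings at a small precision cost.
import torch

def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0             # map the largest weight to 127
    q = torch.round(w / scale).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)
q, scale = quantize_int8(w)
print((w - dequantize(q, scale)).abs().max())  # small reconstruction error
```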
Part 3: Application Paradigms
Chapter 10: Interfacing LLMs with External Tools
I begin by examining how an LLM can interact with its environment, introducing three approaches: Passive, Explicit, and Autonomous. I provide a definition of what I consider an ‘agent’ and walk through an agentic workflow step by step. I also explore each component of an agentic system in detail, including models, tools, data stores, the agent loop, orchestration software, and scaffolding software (such as guardrails and verifiers).
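Here’s a stripped-down sketch of the agent loop, with call_llm as a hypothetical stand-in for a model that returns either a tool call or a final answer as JSON:

```python
# A stripped-down agent loop: the model picks a tool, the loop executes it,
# and the observation is fed back until the model answers.
import json

TOOLS = {
    "calculator": lambda expr: str(eval(expr)),  # toy tool; never eval untrusted input
}

def call_llm(messages):
    """Hypothetical stand-in for a model with tool-use instructions: it asks
    for the calculator first, then answers once it has an observation."""
    if any(m["role"] == "tool" for m in messages):
        return json.dumps({"answer": messages[-1]["content"]})
    return json.dumps({"tool": "calculator", "input": "6 * 7"})

def agent_loop(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = json.loads(call_llm(messages))  # model decides what to do
        if "answer" in action:                   # model is done
            return action["answer"]
        observation = TOOLS[action["tool"]](action["input"])  # run the tool
        messages.append({"role": "tool", "content": observation})
    return "step limit reached"

print(agent_loop("What is 6 * 7?"))  # -> "42"
```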
Chapter 11: Representation Learning and Embeddings
Embeddings play a vital role in many LLM-based applications. In this chapter, I explain how embeddings are generated, as well as how to train and fine-tune embedding models. I discuss various embedding model types, including instruction embeddings, and show how to optimize embeddings for retrieval using techniques like matryoshka representation learning, integer and binary quantization, and product quantization.
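As a small illustration, here’s encoding with sentence-transformers followed by matryoshka-style truncation; note that truncation only preserves quality if the model was actually trained with a matryoshka objective, and the model name here is just a familiar placeholder:

```python
# Encode sentences, then truncate and re-normalize the vectors
# in the spirit of matryoshka representation learning.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
sentences = ["LLMs generate text.", "Embeddings encode meaning."]
emb = model.encode(sentences, normalize_embeddings=True)

truncated = emb[:, :128].copy()                                # keep first 128 dims
truncated /= np.linalg.norm(truncated, axis=1, keepdims=True)  # re-normalize
print(truncated.shape, float(truncated[0] @ truncated[1]))     # cosine similarity
```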
In the second half of the chapter, I explore how to determine the right granularity for the unit of text representation, a process called chunking. I discuss chunking strategies like semantic chunking, metadata-aware chunking, and late chunking. Finally, I introduce sparse autoencoders (SAEs), which help interpret embeddings and enable interventions made by directly editing the embeddings.
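For reference, here’s the fixed-size-with-overlap baseline that these fancier strategies improve upon; sizes are counted in words for simplicity, whereas real pipelines usually count tokens:

```python
# Fixed-size chunking with overlap between consecutive chunks.
def chunk(text: str, size: int = 200, overlap: int = 40):
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

for c in chunk("word " * 500, size=200, overlap=40):
    print(len(c.split()))  # 200, 200, 180
```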
Chapter 12: Retrieval Augmented Generation (RAG)
In this chapter, I dive into the RAG paradigm in detail. First, I discuss scenarios where RAG is the preferable option; common ones include knowledge retrieval, few-shot example retrieval, tool description retrieval, and conversational memory retrieval. I then walk through a typical RAG pipeline, potentially consisting of rewrite, retrieve, rerank, refine, insert, and generate steps. I also discuss “tightly coupled” models, where the retriever and generator are jointly trained, and showcase how RAG can be employed during the model pre-training and fine-tuning stages. I end by comparing RAG with fine-tuning and long-context models.
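Here’s a toy retrieve-then-generate pipeline showing the skeleton, with embed and generate as hypothetical stand-ins for an embedding model and an LLM call:

```python
# A toy RAG skeleton: embed the corpus, retrieve the nearest document,
# insert it into the prompt, and generate.
import numpy as np

corpus = [
    "The Eiffel Tower is in Paris.",
    "Python was created by Guido van Rossum.",
]

def embed(text: str) -> np.ndarray:
    """Stand-in: hashes words into a bag-of-words vector instead of a real model."""
    v = np.zeros(64)
    for w in text.lower().split():
        v[hash(w) % 64] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def generate(prompt: str) -> str:
    """Stand-in for an LLM call."""
    return f"[answer grounded in: {prompt.splitlines()[0]}]"

def rag(question: str) -> str:
    doc_vecs = np.stack([embed(d) for d in corpus])
    scores = doc_vecs @ embed(question)           # retrieve: cosine similarity
    context = corpus[int(scores.argmax())]
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"  # insert
    return generate(prompt)                       # generate

print(rag("Where is the Eiffel Tower?"))
```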
Chapter 13: Design Patterns and Architectural Paradigms
In this chapter, I discuss architectural paradigms pertaining to LLM applications. I highlight various structures such as cascades, routers, and task-specific setups. I then touch on programming paradigms, like DSPy, that facilitate building robust LLM applications.
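As a taste of the cascade structure, here’s a minimal sketch where a cheap model answers first and a stronger model is consulted only when confidence is low; both models and the confidence heuristic are hypothetical stand-ins:

```python
# A minimal cascade: escalate to an expensive model only when needed.
def cheap_model(prompt: str):
    return "short answer", 0.4         # (answer, confidence) stand-in

def strong_model(prompt: str):
    return "carefully reasoned answer", 0.95

def cascade(prompt: str, threshold: float = 0.7) -> str:
    answer, confidence = cheap_model(prompt)
    if confidence >= threshold:        # cheap answer is good enough
        return answer
    answer, _ = strong_model(prompt)   # escalate
    return answer

print(cascade("Summarize the book in one line."))
```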
These topics and a lot more are covered in my book, available from March 2025. Are there any topics you’d like to see covered that are missing here? Maybe I could write Substack posts on them. Let me know in the comments!
The book is available for pre-order on Amazon now: https://www.amazon.ca/Designing-Large-Language-Model-Applications/dp/1098150503