LLM Cascades, BARD FOMO, and text-to-SQL dreams
I am currently writing a book on Designing LLM Applications. As part of the book-writing process, there is plenty of content that ends up not making it into the book - perhaps because it is too subjective and opinionated, too preliminary an idea to commit to print, cut for reasons of brevity, or simply out of scope. Rather than have it languish in my drafts for eternity, I decided to create a new Substack out of it. I hope you enjoy reading it! While I will not commit to a posting schedule, I do hope to post regularly.
****
FrugalGPT
I am a frugal person by nature, and at my company Bedrock AI we have made frugal tech infrastructure somewhat of an art form. So, when the paper ‘FrugalGPT’ came out recently, I dug right into it.
In this paper, Chen et al. outline a set of strategies to save costs when using LLM APIs like OpenAI, Cohere, etc. They categorize the strategies into three types:
Prompt adaptation: Techniques to reduce the length of the prompt. These include prompt selection, where you choose the most effective and minimal set of few-shot examples to include in the prompt, and query batching, where you pack several queries into the same prompt so that the shared instructions don’t have to be repeated for every query.
The prompt adaptation method I swear by is stop word removal from the prompt. I am surprised I haven’t seen more people using it. We have known for a long time that even smaller models like BERT are insensitive to word order, and larger models are able to follow instructions even if you remove function words (is, of, the) and other words that aren’t particularly meaning-bearing. Of course, it very much depends on the task you are working on, but for many use cases I have noticed I can remove close to 20 percent of the tokens in my prompt using simple heuristics and retain close to 99% of the performance of the full-text prompt.
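To make the idea concrete, here is a minimal sketch of that kind of heuristic, using NLTK’s English stopword list. The exact list, and which words are safe to drop, are assumptions that very much depend on your task.

```python
# A rough sketch of heuristic stop word removal from a prompt.
# Assumes NLTK is installed; which words are safe to drop is task-dependent.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

# Keep negations and other words that often carry instruction-critical meaning.
KEEP = {"not", "no", "nor"}
STOP_WORDS = set(stopwords.words("english")) - KEEP

def compress_prompt(prompt: str) -> str:
    """Drop function words from the prompt while preserving everything else."""
    tokens = prompt.split()
    kept = [t for t in tokens if t.lower().strip(".,;:!?") not in STOP_WORDS]
    return " ".join(kept)

original = "Summarize the following review of the product in one sentence."
print(compress_prompt(original))
# -> "Summarize following review product one sentence."
```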
LLM Approximation: This includes response caching, where you cache responses to queries and, if the same or a similar query is issued again, simply retrieve the answer from the cache. Fine-tuning a smaller model on a training dataset prepared from the responses of a larger model also falls under this category. To save even more, you could prepare the dataset using an open-source model like the 11B FLAN-T5 or T0 models, and then fine-tune a smaller checkpoint like FLAN-T5-Small or FLAN-T5-Base.
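Here is a minimal sketch of what response caching can look like with exact-match keys. In practice you would more likely key the cache on an embedding of the query so that near-duplicate queries also hit it; `call_llm` below is just a stand-in for whatever API client you use.

```python
# Minimal sketch of a response cache in front of an LLM API.
# `call_llm` is a placeholder for your actual API client; a production cache
# would typically use embedding similarity rather than exact string match.
import hashlib

_cache: dict[str, str] = {}

def call_llm(prompt: str) -> str:
    # Placeholder for an expensive API call (OpenAI, Cohere, etc.).
    return f"<response to: {prompt}>"

def cached_completion(prompt: str) -> str:
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key in _cache:
        return _cache[key]          # cache hit: no API cost
    response = call_llm(prompt)     # cache miss: pay for the call
    _cache[key] = response
    return response
```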
LLM Cascade: The cheapest/smallest model sits in front of the user and processes all queries. If it can answer a query with a high degree of confidence, we accept its answer; otherwise, we pass the query on to a more expensive/slower LLM. You could use more sophisticated means to decide whether a model’s answer is reliable enough to be accepted as final, and more involved routing mechanisms to pick which LLM the query should go to next. We have been using language model cascades at Bedrock AI since 2021, and they have proved very effective for us.
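A skeletal version of such a cascade is below. The `ask` and `score_confidence` functions are placeholders: FrugalGPT trains a small scoring model to judge answer reliability, while a simpler heuristic such as an average token log-probability threshold is another option.

```python
# Skeletal LLM cascade: try the cheapest model first, escalate only when
# the answer does not look reliable. `ask` and `score_confidence` are
# placeholders, not a real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    name: str
    cost_per_1k_tokens: float
    ask: Callable[[str], str]

def score_confidence(query: str, answer: str) -> float:
    """Placeholder: return a reliability score in [0, 1] for the answer."""
    raise NotImplementedError

def cascade(query: str, models: list[Model], threshold: float = 0.8) -> str:
    # Models are assumed to be ordered from cheapest to most expensive.
    for model in models[:-1]:
        answer = model.ask(query)
        if score_confidence(query, answer) >= threshold:
            return answer
    # Fall back to the largest model and accept its answer unconditionally.
    return models[-1].ask(query)
```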
Text-to-SQL
The GPT-4 capability I have been most impressed by is its coding proficiency, and within that, its SQL-generation skills in particular. I have more or less stopped writing SQL these days: unless a query is extremely rudimentary or needs careful optimization, I delegate the task to GPT-4. However, in my day-to-day usage I haven’t really thrown complex schemas at it, and my natural language queries were quite precise. So it was eye-opening to see the results on the recently released BIRD text-to-SQL benchmark, which are rather poor, even when using ChatGPT with chain-of-thought prompting.
From the examples shown in the paper, it looks like the model trips up when linking phrases in the natural language question to columns in the database tables, or misinterprets the values stored in those columns.
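One mitigation worth trying for these failures is to put the schema and a few sample values directly into the prompt, so the model can see what the columns actually contain. A rough sketch of such a prompt is below; the tables, values, and question are invented for illustration.

```python
# Rough sketch of a schema-grounded text-to-SQL prompt. Showing the model a
# few sample values gives it a chance to see what the columns contain,
# which is where the BIRD-style examples seem to go wrong.
# The schema, sample rows, and question are invented for illustration.
schema = """CREATE TABLE schools (
    school_id TEXT,          -- joins to meals.school_id
    county TEXT,
    enrollment_k12 REAL
);
CREATE TABLE meals (
    school_id TEXT,
    free_meal_count REAL
);"""

sample_rows = "schools: ('0110017', 'Alameda', 1087.0) | meals: ('0110017', 565.0)"

question = "Which county has the highest ratio of free meals to enrollment?"

prompt = f"""You are given the following SQLite schema:
{schema}

Sample rows: {sample_rows}

Question: {question}
Think step by step, then output only the final SQL query."""
```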
Query optimization is an important piece of the text-to-SQL puzzle, especially in enterprise environments. For now, I think it makes more sense to run an explicit query optimization phase after the LLM generates the SQL, rather than asking the LLM to produce an optimized query in one step. I have seen several startups recently trying to automate data analytics with these text-to-SQL-to-visualization interfaces. However, there is still a long road ahead before we get an end-to-end solution that is more than a demo.
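As a sketch of what that two-phase pipeline could look like: let the LLM produce a correct but naive query, then run a deterministic optimizer such as sqlglot over it. `generate_sql` below is a stand-in for the LLM call, and whether sqlglot’s rule set is sufficient for your particular warehouse is an open question.

```python
# Two-phase pipeline: the LLM produces correct-but-naive SQL, then a
# deterministic optimizer rewrites it. `generate_sql` is a placeholder for
# the LLM call; sqlglot is used here as one possible optimization pass.
import sqlglot
from sqlglot.optimizer import optimize

def generate_sql(question: str, schema: dict) -> str:
    """Placeholder: ask the LLM for SQL given the question and the schema."""
    raise NotImplementedError

def text_to_optimized_sql(question: str, schema: dict) -> str:
    raw_sql = generate_sql(question, schema)
    expression = sqlglot.parse_one(raw_sql, read="sqlite")
    # Qualifies columns, pushes down predicates, simplifies expressions, etc.
    optimized = optimize(expression, schema=schema)
    return optimized.sql(dialect="sqlite")

# sqlglot expects the schema as {"table": {"column": "type"}}, e.g.
# {"schools": {"school_id": "text", "enrollment_k12": "real"}}
```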
Bard FOMO
Canadians and many Western Europeans have become the new AI have-nots, with Google’s much-vaunted Bard service still unavailable in these countries. The reason probably has to do with Google taking the time to ensure it complies with regulations in these jurisdictions. Note, however, that Google and the Canadian federal government are not the best of buddies at the moment - the infamous Bill C-11 has been rather contentious. While I am not saying this is what happened here, companies with these ‘game-changing’ technologies could, in the future, unilaterally and arbitrarily decide not to launch in certain countries. What kind of economic impact would that have, and how far would governments be willing to bend? Depending on the trajectory of AI progress, these questions might become a lot more interesting.
Pre-training datasets
Construction of the pre-training dataset is one of the most important aspects of building LLMs - I dedicate an entire chapter of my book to its nuances. There are many data cleaning, filtering, and selection decisions to make, and they significantly impact downstream performance. Longpre et al.’s paper is a very timely study that can be of great use to open-source organizations planning to train LLMs.
Some insights from their paper that I found particularly interesting -
The percentage of non-ASCII characters in data sourced from Common Crawl has increased in recent years. This could be due to increased multilinguality and emoji usage. (A toy version of the statistic involved is sketched after this list.)
Applying inverse toxicity filtering (filtering out least toxic content) helps with performance on toxicity identification!
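As a toy illustration of the kind of document-level statistic these filtering decisions hinge on, here is how one might compute the non-ASCII character ratio of a document; where (or whether) to threshold it is entirely a design decision.

```python
# Toy illustration of a document-level statistic used in pre-training data
# filtering: the fraction of non-ASCII characters. Whether and where to
# threshold it is a design decision with real downstream consequences.
def non_ascii_ratio(text: str) -> float:
    if not text:
        return 0.0
    return sum(1 for ch in text if ord(ch) > 127) / len(text)

print(non_ascii_ratio("Plain English text."))        # 0.0
print(non_ascii_ratio("Emoji 🙂 and café reviews"))   # > 0.0
```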
Conditional Semantic Similarity
Many of the standard NLP tasks are inherently ambiguous. Take ‘sentiment analysis’ or ‘semantic similarity’. The first thing that comes to my mind when working on semantic textual similarity is ‘similarity with respect to what?’ The ambiguous nature of these tasks is well illustrated in one of my favorite opinion papers of all time, Asad Sayeed’s ‘An opinion about opinions about opinions’. In the paper, he shows an excellent example for sentiment analysis -
‘Lloyd Hession, chief security officer at BT Radianz in New York, said that virtualization also opens up a slew of potential network access control issues.’
Is this sentence expressing positive, negative, or neutral sentiment? It depends: if you are selling virtualization software, it is negative; if you are developing software that provides an alternative to virtualization, it is positive; and for a casual reader, it might even read as neutral.
Similarly, for sentence-level semantic similarity, whether two sentences are similar to each other depends on the aspect along which they are compared.
A long-overdue paper on this topic came out recently, with the authors, Deshpande et al., proposing the task of conditional semantic textual similarity: given two sentences and a condition expressed in natural language, produce a similarity score with respect to that condition. They find that even state-of-the-art models perform fairly poorly at this conditional task, which is consistent with my own experience.
More interestingly, they propose a tri-encoder scheme for the conditional similarity task: the two sentences s1 and s2 being compared and the condition c are encoded separately, and the similarity is then calculated between h(s1, c) and h(s2, c), where h can be an MLP that fuses a sentence representation with the condition representation.
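My reading of the tri-encoder idea, as a rough sketch in PyTorch; the choice of encoder, the MLP sizes, and the cosine similarity at the end are my own defaults for illustration, not necessarily the paper’s exact configuration.

```python
# Rough sketch of a tri-encoder for conditional semantic similarity:
# encode s1, s2, and the condition c separately, fuse each sentence with the
# condition through an MLP h(., c), and compare the fused representations.
# Encoder choice and layer sizes are illustrative, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriEncoder(nn.Module):
    def __init__(self, encoder: nn.Module, dim: int = 768):
        super().__init__()
        self.encoder = encoder            # any sentence encoder producing dim-d vectors
        self.h = nn.Sequential(           # h(sentence_embedding, condition_embedding)
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, s1, s2, c):
        e1, e2, ec = self.encoder(s1), self.encoder(s2), self.encoder(c)
        h1 = self.h(torch.cat([e1, ec], dim=-1))   # h(s1, c)
        h2 = self.h(torch.cat([e2, ec], dim=-1))   # h(s2, c)
        return F.cosine_similarity(h1, h2, dim=-1)
```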
They also propose a quadruplet contrastive learning scheme, analogous to the standard triplet loss, where each training example consists of a high-similarity sentence pair and a low-similarity sentence pair.
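And a back-of-the-envelope version of what such a quadruplet objective might look like, written by analogy with the standard triplet margin loss; the exact formulation and margin are my guesses rather than the paper’s.

```python
# Sketch of a quadruplet contrastive objective: each training example carries
# a high-similarity pair and a low-similarity pair (under the same condition),
# and we push their predicted similarities apart by a margin.
# The formulation and margin are illustrative, by analogy with triplet loss.
import torch

def quadruplet_loss(sim_high: torch.Tensor,
                    sim_low: torch.Tensor,
                    margin: float = 0.5) -> torch.Tensor:
    # sim_high: model similarity for the pair labeled as highly similar
    # sim_low:  model similarity for the pair labeled as dissimilar
    return torch.clamp(margin - (sim_high - sim_low), min=0).mean()
```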
Conditional semantic similarity is something I have been grappling with for a long time, so I am glad to see work on it being published. I think it is a very important task as retrieval-augmented LLMs become more commonplace.