How To Shrink Language Models 25x Without Losing Performance

The Power Of Retrieval-Enhanced Networks

Today's language models are too big! Creating predictions with GPT-3 will cost you an arm and a leg. Even with all the bells and whistles that make inference more efficient, you need at least eleven V100 GPUs at $9,000 each. Hence, a computer that would allow you to make predictions with such a model costs more than $100K. Training such a model is orders of magnitude more expensive. If you are a university or a startup, that is a lot of money. If you are like me, a normal guy with sweatpants and a computer, you are out of luck. But there is good news! You can integrate a massive knowledge base into a large language model (LLM). This can maintain or even improve performance with networks that are up to 25x smaller. That does not only help financially; it has the potential to solve some of machine learning's fundamental flaws. Let's go!
Money Is Not The Only Problem

LLMs are not just exorbitantly expensive, they also have a tendency to hallucinate [7]. You could ask an LLM, for example, to describe what Barack Obama is doing right now. It might confidently present you with an answer that is totally false. Behavior like this can be a real dealbreaker when trying to apply LLMs in practice. There is another interesting challenge, which is actually not the model's fault. When an LLM is trained, it stores facts from the training data in its weights. But what if the training data becomes outdated? Think of age-related information. The tennis player Rafael Nadal, for example, is currently 36. Naturally, he will only be 36 for one year. This is known as the temporal generalization problem [1]. In principle, you could retrain a model like GPT-3 every day. It would be super hard and prohibitively expensive, but it could be done. However, this will never solve all problems. Many things, like the weather in Tokyo, change multiple times a day. Hence, there is a natural limit to how timely a neural network's information can be. Okay. So we can't retrain GPT every 15 minutes. What would an alternative look like?

Retrieval-Enhanced Neural Networks

The idea is to give the model access to all the information in a giant database of text. This effectively turns the model's job into an open-book exam. The idea is not new. People have been integrating knowledge bases into language models for a while [2, 6]. The approaches differ, but the main idea always looks something like this:
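To make that pattern concrete, here is a minimal sketch in Python. It is not code from any of the papers: `embed`, `retrieve`, and `generate` are placeholders standing in for a real embedding model, vector index, and language model.

```python
import numpy as np

# Hypothetical components, standing in for a real embedder, index, and LLM.
def embed(text: str) -> np.ndarray:
    """Map text to a fixed-size vector (e.g. with a BERT-style encoder)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=128)  # placeholder embedding, not semantically meaningful

def retrieve(query: str, knowledge_base: list[str], k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query."""
    q = embed(query)
    scores = [float(q @ embed(chunk)) for chunk in knowledge_base]
    top = np.argsort(scores)[-k:][::-1]
    return [knowledge_base[i] for i in top]

def generate(prompt: str) -> str:
    """Placeholder for a (much smaller) language model."""
    return f"<answer conditioned on: {prompt[:60]}...>"

def answer(question: str, knowledge_base: list[str]) -> str:
    # 1. Look up relevant text in the external knowledge base.
    context = retrieve(question, knowledge_base)
    # 2. Condition the model on the retrieved text instead of forcing it
    #    to memorize every fact in its weights.
    prompt = "\n".join(context) + "\n\nQuestion: " + question
    return generate(prompt)
```

The point of this setup is that updating what the model "knows" means updating the database, not the weights.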
The example is totally made up, but you get the point. The model is allowed to peek into a giant database and use what it finds to make predictions. Let's look at how well this actually works. (Spoiler: it is super impressive!)

DeepMind's Retrieval-Enhanced Transformer (RETRO)

In a recent paper from DeepMind, this approach is scaled to an unprecedented size. Their knowledge base consisted of 2 trillion tokens. That is an order of magnitude more data than any current language model had access to during training. To handle this incredible size, they build a giant key-value store over their knowledge base. The values are text chunks. The keys are pre-computed BERT embeddings of those chunks. When the model makes a prediction, the input sequence is first split into several chunks. For each chunk, they perform neural information retrieval to find similar text chunks in the knowledge base. The retrieved text is then encoded with another transformer, and the resulting embeddings are fed to the main model, where they are integrated deep inside the network. They are able to perform this massive information retrieval quickly because the encodings of the knowledge base are pre-computed. They then use the ScaNN library [3], which lets them retrieve approximate k-nearest neighbors within a few milliseconds. So far so good. They condition the output on retrieved text, but how well does it work?

Great Performance At Lower Cost

The researchers find that their method outperforms models such as Jurassic-1 (178B) [4] and Gopher (280B) [5] on the majority of benchmarks. That alone is great. When you consider that their RETRO model has only 7B parameters, this is a very impressive feat. Though 7B parameters is still a lot, it is 25x fewer than GPT-3's 175B. Further, enhancing a model with a giant knowledge base does not only enable great performance with smaller models. Using retrieved documents to steer the model toward the correct answer has also been shown to reduce the tendency to hallucinate [6]. That is fantastic! But there is more!

Learning To Fly A Helicopter In The Matrix

In an iconic scene from the movie The Matrix, one of the characters downloads the ability to fly a helicopter directly into her brain. Though DeepMind's RETRO model can't keep up with the leather-wearing heroes of the Matrix, it can easily be fashioned for a new job. By extending or even completely swapping out the knowledge base, the model can be adapted to new domains without retraining. The researchers at DeepMind did, however, fine-tune the neighbor encoder and the cross-attention layers. This allowed them to get away with training only about 10% of all weights in the model. The authors call this RETRO-fitting (😂). It works so well that they reach almost the same performance as a retrieval model trained from scratch [2]. Pretty impressive, no?!

Some people are probably thinking: this is a crutch! Neural networks are universal function approximators. All we need is good data and massive end-to-end training. Though I am sympathetic to this view, I think this is a fascinating line of research. It has the potential to solve some of the fundamental problems of machine learning, such as temporal generalization. It could further be a great tool to improve retrieval-intensive applications. One such application could be a customer service chatbot that has access to an up-to-date pile of all customer interactions. Whatever comes from this, I look forward to the future.
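To give you a feel for the mechanics, here is a rough sketch of RETRO-style retrieval. The 64-token chunk length matches the paper; everything else is an assumption: a stand-in embedding function instead of frozen BERT, brute-force dot-product search instead of ScaNN, and a toy knowledge base built from examples in this article.

```python
import numpy as np

EMBED_DIM = 128   # illustrative; RETRO uses frozen BERT embeddings
CHUNK_LEN = 64    # RETRO splits sequences into chunks of 64 tokens

def bert_embed(chunk_tokens: list[str]) -> np.ndarray:
    """Stand-in for a frozen BERT encoder producing one vector per chunk."""
    rng = np.random.default_rng(abs(hash(" ".join(chunk_tokens))) % (2**32))
    v = rng.normal(size=EMBED_DIM)
    return v / np.linalg.norm(v)

# --- Offline: build the key-value store over the knowledge base ----------
# Values are raw text chunks, keys are their pre-computed embeddings.
kb_chunks = [
    "rafael nadal was born in 1986 and is a spanish tennis player".split(),
    "the retro model retrieves from a database of two trillion tokens".split(),
    "tokyo weather can change several times within a single day".split(),
]
kb_keys = np.stack([bert_embed(c) for c in kb_chunks])

# --- Online: retrieve neighbors for every chunk of the input -------------
def retrieve_neighbors(input_tokens: list[str], k: int = 2) -> list[list[list[str]]]:
    """For each 64-token chunk of the input, return its k nearest KB chunks."""
    neighbors = []
    for start in range(0, len(input_tokens), CHUNK_LEN):
        chunk = input_tokens[start:start + CHUNK_LEN]
        query = bert_embed(chunk)
        # Brute-force dot-product search; RETRO uses ScaNN to do this
        # approximately over roughly 2 trillion tokens in milliseconds.
        scores = kb_keys @ query
        top_k = np.argsort(scores)[-k:][::-1]
        neighbors.append([kb_chunks[i] for i in top_k])
    return neighbors

if __name__ == "__main__":
    # With the placeholder embedder the ranking is arbitrary; this only
    # demonstrates the shape of the pipeline.
    print(retrieve_neighbors("how old is rafael nadal right now".split(), k=2))
```

In the real system, the retrieved chunks are not concatenated to the prompt: they are encoded by a separate transformer and injected into the decoder through cross-attention, which is exactly the part (plus the neighbor encoder) that gets fine-tuned during RETRO-fitting.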
Such exciting times to be alive! As always, I really enjoyed making this for you and I sincerely hope you found it useful! Thank you for reading!

References:

[1] A. Lazaridou et al., Mind the Gap: Assessing Temporal Generalization in Neural Language Models (2021), arXiv preprint arXiv:2102.01951
[2] S. Borgeaud et al., Improving Language Models by Retrieving from Trillions of Tokens (2022), International Conference on Machine Learning
[3] R. Guo et al., Accelerating Large-Scale Inference with Anisotropic Vector Quantization (2020), International Conference on Machine Learning
[4] O. Lieber et al., Jurassic-1: Technical Details and Evaluation (2021), White Paper, AI21 Labs
[5] J. Rae et al., Scaling Language Models: Methods, Analysis & Insights from Training Gopher (2021), arXiv preprint
[6] K. Shuster et al., Retrieval Augmentation Reduces Hallucination in Conversation (2021), arXiv preprint arXiv:2104.07567
[7] S. Roller et al., Recipes for Building an Open-Domain Chatbot (2020), arXiv preprint arXiv:2004.13637