The fourth edition of our newsletter on Large Language Models is here.
Today, we look at
- Llama 2, the new model from Meta,
- practical aspects of LLM use – new tools but also some challenges,
- use of AI in media,
- and more.
LLMs
Llama 2
- Meta released the second version of their Llama model.
- Unlike the first version, Llama 2 is free for commercial use (with some exceptions).
- It comes in three sizes: 7B, 13B, and 70B parameters, plus a fine-tuned version for chat.
- The context window doubled to 4K tokens.
- It was trained on 2 trillion tokens with over a million human annotations.
- This makes Llama 2 a serious competitor to OpenAI’s GPT models.
- We recommend reading this nice review by Nathan Lambert from Hugging Face.
- You can also dive into all the technical details in the paper by the Meta AI team.
- See the model on Hugging Face: https://huggingface.co/meta-llama
- Give the model a try on HuggingChat: https://huggingface.co/chat
GPT-4 – a mixture of models?
There is a rumor (started by George Hotz on the Latent Space podcast and echoed by Soumith Chintala, co-creator of PyTorch at Meta AI Research) that GPT-4 is a combination of eight 220B-parameter models trained with different data/task distributions. They are combined in a Mixture of Experts architecture, possibly similar to the Switch Transformer introduced by Google.
According to a report by SemiAnalysis and Max Schreiner from The Decoder, there are actually 16 models with 111B parameters each, and two of them are used for each inference pass. The inference cost is about three times higher than that of Davinci (GPT-3.5). The report and post also summarize the arguments for why multiple models were used, the training data, training cost, etc.
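The routing idea behind these rumors can be illustrated with a toy top-2 mixture-of-experts sketch: a gating network scores all experts for a token, only the two best-scoring experts run, and their outputs are mixed by the renormalized gate probabilities. Everything below (the expert functions, the gate scores, the 16-expert count taken from the rumor) is illustrative, not from the report.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def top2_moe(token_repr, experts, gate_scores):
    """Route a token through the two highest-scoring experts only.

    `experts` is a list of callables (stand-ins for the rumored 16 models);
    `gate_scores` holds one gating score per expert for this token.
    """
    probs = softmax(gate_scores)
    # pick the two experts with the highest gate probability
    top2 = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:2]
    # renormalize the two selected probabilities so they sum to 1
    norm = sum(probs[i] for i in top2)
    # output is the probability-weighted mix of the two expert outputs
    return sum(probs[i] / norm * experts[i](token_repr) for i in top2)

# toy experts: each just scales its input by a different factor
experts = [lambda x, k=k: k * x for k in range(16)]
gates = [0.0] * 16
gates[3], gates[7] = 2.0, 1.0   # experts 3 and 7 win the routing
out = top2_moe(1.0, experts, gates)
```

The appeal of the design is that each inference pass pays only for the two selected experts, which is consistent with the reported cost being far lower than running all 16 models.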
New LLM tools
- OpenAI Functions. OpenAI added so-called functions to their API, similar to parametrized intents. You can, for example, send the API a document together with a JSON schema specifying that you want to extract all products as strings and their prices as numbers, and you will get the results back as valid JSON ready to be processed. When it works, it feels like magic. Unfortunately, in our experience, it often does not. The documentation is also very sketchy: you need to discover by trial and error which JSON schema properties are supported (e.g., types) and which are not (e.g., string patterns). Functions require the 0613 variants of the GPT models (gpt-4-0613 or gpt-3.5-turbo-0613). The feature was added to MS Azure as well. So give it a try.
- LlamaIndex 0.7. With this release, LlamaIndex further improved its modularity and customizability. Individual modules are easier to use independently, and the developer gets more control over the individual steps of the pipeline.
- LangSmith. LangChain released LangSmith, a “platform for debugging, testing, evaluating, and monitoring” LLM applications. It is in closed beta; you can sign up here.
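The extraction workflow described in the OpenAI Functions bullet above can be sketched as follows. The function name, the schema fields, and the sample reply are our own illustration of the shape of a function-calling request, not taken from OpenAI’s documentation; in real use, the payload would be sent via the Chat Completions API with one of the 0613 models.

```python
import json

# Hypothetical extraction task: pull product names (strings) and
# prices (numbers) out of a document.
extract_products_fn = {
    "name": "extract_products",
    "description": "Extract all products and their prices from the document.",
    "parameters": {                      # a standard JSON Schema object
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "number"},
                    },
                    "required": ["name", "price"],
                },
            }
        },
        "required": ["products"],
    },
}

request_payload = {
    "model": "gpt-3.5-turbo-0613",       # functions need the 0613 variants
    "messages": [{"role": "user", "content": "Laptop: $999. Mouse: $25."}],
    "functions": [extract_products_fn],
    "function_call": {"name": "extract_products"},  # force this function
}

# The model replies with the arguments as a JSON *string*, which still
# has to be parsed -- and, as noted above, may occasionally be invalid.
example_reply_arguments = '{"products": [{"name": "Laptop", "price": 999}]}'
products = json.loads(example_reply_arguments)["products"]
```

Because the reply arguments arrive as a string, defensive parsing (catching `json.JSONDecodeError` and retrying) is advisable in production.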
LLM limitations and challenges
Longer contexts are not the silver bullet (2023-07)
- TLDR: LLMs have problems accessing information in the middle of a context, and performance generally deteriorates with the context length.
- Researchers from Stanford, Berkeley, and Samaya AI evaluated how LLMs (including GPT-3.5, GPT-4, and Claude) use their context. They examined how well LLMs answer questions from the NaturalQuestions benchmark, simulating the typical architecture of LLM-based systems we mentioned in our previous post (i.e., potentially relevant documents are found by search, and then the answer is extracted from them by an LLM). In their experiments, they ensured that the relevant Wikipedia article was always present in the context, together with some distractor documents.
- For ChatGPT (GPT-3.5-Turbo), the average accuracy of the answers was 75% when the relevant article was at the beginning of the context but only 55% when the article was in the middle of the context. That was actually worse than having no context at all.
- GPT-4 provided significantly better results, but the effect of context position was similar: placing the relevant article first in the context resulted in nearly 90% accuracy, while placing it in the middle led to less than 75%.
- LLM accuracy saturates at around 20 articles. Increasing the number of results returned by the search module can push its recall to 90%, but the LLM is unable to benefit from this, with end-to-end performance plateauing at around 50–60%, depending on the model.
- Note that all this is within the official context length supported by the LLMs.
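The experimental setup above is easy to reproduce in your own system: build the same prompt several times, varying only the position of the one relevant document among the distractors, and compare answer accuracy per position. A minimal sketch of such a harness (the document texts, question, and prompt wording are our own placeholders, not the paper’s):

```python
def build_context(gold_doc, distractors, gold_position):
    """Place the one relevant document at `gold_position` among distractors."""
    docs = list(distractors)
    docs.insert(gold_position, gold_doc)
    return "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))

def position_variants(gold_doc, distractors):
    """Yield (position, prompt) pairs with the gold doc first, middle, last."""
    n = len(distractors)
    for pos in (0, n // 2, n):
        context = build_context(gold_doc, distractors, pos)
        yield pos, (
            f"{context}\n\n"
            "Question: What is Llama 2's context length?\n"
            "Answer based only on the documents above."
        )

gold = "Llama 2 has a 4K-token context window."
noise = [f"Unrelated article #{i}." for i in range(19)]  # ~20 docs total
variants = dict(position_variants(gold, noise))
```

Each prompt variant would then be sent to the model under test; a per-position accuracy gap like the one reported (90% first vs. <75% middle for GPT-4) indicates the same lost-in-the-middle effect.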
Changing quality of GPT results (2023-07)
- Researchers from Stanford and Berkeley compared the March and June results of GPT-3.5 and GPT-4.
- The quality of results differed significantly: sometimes, it improved, and sometimes it got worse.
- The researchers focused on four areas (math problems, sensitive questions, code generation, and visual reasoning), but there is no reason to think other areas would be different.
- These results show the importance of continuously monitoring the quality of any production use of a third-party LLM API.
AI for the news and media
Review of emerging AI guidelines in Newsrooms (2023-07)
- Hannes Cools & Nick Diakopoulos analyzed 20+ guidelines used by various media to regulate their use of AI.
- The analyzed outlets include DPA, Financial Times, The Guardian, Mediahuis, Reuters, Ringier, and Wired.
- The guidelines share many common features but also differ in many respects.
- They cover topics such as human oversight, transparency, privacy, and banned uses (e.g., generated images, facial recognition).
- The authors suggest that anybody writing new AI guidelines should review existing non-AI codes of conduct first.
Terms and conditions of LLM providers (2023-07)
- Natali Helberger reviewed the T&Cs of OpenAI, Midjourney, Anthropic, Hugging Face, and StabilityAI.
- Some conditions are quite problematic for media organizations. For example, reporting on the nature and quality of the models is complicated by the frequent ban on scraping a model’s outputs. This is quite ironic, considering that OpenAI et al. ignored intellectual property rights when scraping the data to train their models. It might be another reason for media to consider open-source LLMs instead of cloud-provided models.
Google Genesis: According to The New York Times, Google showed executives from the Times, The Washington Post, and News Corp AI tools for news media. As the article makes clear, not even the journalists at the Times know much about the tools.
Please subscribe and stay tuned for the next issue of Geneea’s AI Spotlight newsletter!