Geneea’s AI Spotlight #2

The second edition of our newsletter on Large Language Models is here. Below are summaries of papers and posts that captured our attention most during the last two weeks.

Today, we look at

various practical challenges and how to address them,
two new models: Google Palm 2 and Falcon,
LIMA and its approach to fine-tuning, and
we again mention several non-technical LLM topics.

Using LLMs in Practice

All the Hard Stuff Nobody Talks About when Building Products with LLMs

An excellent post by Phillip Carter from Honeycomb.io about building a natural language search interface for events in Honeycomb.io’s datastore
It goes over all kinds of practical challenges of getting an LLM-based solution to production—security, latency, prompt engineering, legal aspects, etc.
It refers to the Prompt injection post by Simon Willison. If you think SQL injection is fun, this is even better.

Guardrailing

The new paradigm of tools talking in plain English is cool but challenging.
In plain old Java, C#, … if a function has an integer return type, you know it will never return a string. Not usually or typically, never. With prompts, it is far from certain. Asking an LLM for a Yes/No answer does not guarantee you will not see something more elaborate. This is where so-called guardrails can help.
Guardrails allow wrapping LLM calls with a layer validating the structure and content of the responses and taking corrective actions when the requirements are unmet.
NeMo Guardrails by NVIDIA is similar but designed specifically for dialog systems.
Both libraries have built-in support for LangChain.

Llamaindex (Github)

If you like Langchain but are more interested in LLM-powered search than chatbots or agents, look at Llamaindex. We highly recommend it.
It comes with Llama Hub, a repository of 100+ data loaders (including those provided by Langchain).

LLM ZOO

This time, we look at two models: Google’s Palm 2 and TTI’s Falcon. The former because it cannot be ignored and the latter because it seems very promising for companies like ours that are running LLMs on their own.

Google Palm 2

On May 10, Google released a new version of Palm (see the blog post, a 93-page paper).
There is no question that it is significantly better than the original Palm. But it is hard to compare to GPT4, Google reports results on benchmarks on which Palm outperforms GPT4, but results on other benchmarks are less favorable.
Google focused on programming, common sense reasoning, math, and logic. But the really important feature for us is Palm’s multilinguality: it was trained on 100 human languages.
Look at the examples in the paper: Palm explaining Japanese jokes, Persian proverbs, differences between European and Argentinian Spanish, etc.
Palm is a family of models differing in size and use-case (currently, there is a general model and models focusing on the medical and cyber security domain).
For obvious reasons, the model is not downloadable. It is available through API and the new Google Vertex platform. The pricing is comparable to OpenAI.

Falcon

The Technology Innovation Institute (TII) in Abu Dhabi released Falcon, a 7B and 40B-parameter LLM.
According to the Open LLM leaderboard, the 40B version outperforms all major open-source models, including LLaMA, Vicuna, Alpaca, RedPajama, etc.
The models are trained mainly on English but also support German, Spanish, French, and to a certain extent seven other languages, including Czech.
The license is very interesting: the model is free for research and commercial use for revenue under $1M.

Smart Fine-Tuning

In the previous newsletter, we mentioned Dromedary, a chatbot fine-tuned with only 300 seed examples that were automatically multiplied with the help of an LLM. Continuing in the same vein, we look at LIMA, another approach to minimizing the number of training examples during fine-tuning.

LIMA: Less Is More for Alignment

The paper by researchers from Meta, CMU, USC, and Tel Aviv University argues that fine-tuning a good LLM can be done with a very small number of training examples because almost all knowledge is already present in the general LLM and fine-tuning mostly teaches the model how to draw from it and what style to use.
They fine-tuned 65B LLaMa, Meta’s LLM, for question answering using only 1000 carefully curated examples. They manually wrote 200 examples, and the rest was taken from StackExchange, WikiHow, and Reddit, but some examples were manually edited to achieve a unified ‘assistant’ style.
In 43% of test cases, LIMA was better or tied with GPT-4. In 74% of cases, it was better or tied with Alpaca.
Adding only 30 examples of multi-turn dialogue greatly improved performance on this kind of task.

Non-tech

AI & Naivety: A US lawyer submitted multiple precedents in a personal injury case. Only after the defendant’s lawyers complained they could not find some of the precedents, it came to light that they were all results of a “consultation” with ChatGPT. The plaintiff’s lawyer was unaware that the chatbot could say false things. He even asked the chatbot directly if they were real and was satisfied once the chatbot provided (bogus) references to legal databases. He promised the judge not to do something like that again.

“[he] greatly regrets having utilized generative artificial intelligence to supplement the legal research performed herein and will never do so in the future without absolute verification of its authenticity.”

AI & Regulation: Sam Altman, CEO of OpenAI,

first asked for AI regulation in a US Congress hearing,
then threatened to leave the EU over regulation,
then backtracked saying OpenAI has no such plans.

The proposed EU AI Act could require the disclosure of any copyrighted material a model was trained on.

AI & News Media: NiemanLabs published a great analysis of the impact the coming AI-based Google search will have on news media. They argue that it will significantly decrease traffic Google sends to publishers’ sites because more people will get what they need from the Google search page—another reason for society to figure out who will pay for the original content.

AI & Society: Many media cited the Pew survey that roughly 40% of Americans, 60% of Americans without a college degree, 50% of Blacks, only 20% of Asians, etc., have not heard about GPT. 14% have tried it. The differences in the numbers are still very interesting, but it is good to remember that they were collected six weeks ago.

Please subscribe and stay tuned for the next issue of Geneea’s AI Spotlight newsletter!