The ninth edition of our newsletter on Large Language Models is here.
In this edition, we look at
- the just-approved European AI Act,
- challenges encountered out in the real world,
- the adoption of AI in newsrooms,
- new models and their emerging uses,
- model evaluation, and
- improved training and prompting methods.
European regulation
The European AI Act (see Spotlight #3) was approved unanimously by the Council of the EU (i.e., the ministers of all member states) in February and by the European Parliament yesterday (85% of MEPs voted for it). The newly created European AI Office has started hiring experts and should soon publish tools, methodologies, and benchmarks for the evaluation of AI systems.
Challenges
We cannot trust LLMs
Both Google and Anthropic acknowledged to The Wall Street Journal that hallucinations are a serious obstacle to the adoption of LLMs. According to Eli Collins, a VP at Google DeepMind: “We’re not in a situation where you can just trust the model output.” But as Jared Kaplan, a co-founder of Anthropic, says, we cannot simply make the models more cautious, because they would always answer: “I don’t know the context.” Both Collins and Kaplan stressed the importance of users validating any LLM response, and said providers should make that easy by identifying the sources behind each answer.
Chatbot responsibility
However, a court in Canada disagrees with Collins and Kaplan, at least in the case of customer-service chatbots. Air Canada’s chatbot provided a passenger with incorrect instructions for a discount. It did not matter that the bot provided a link to the correct policy. The airline was ordered to pay anyway. According to the court, the airline “is responsible for all the information on its website. It makes no difference whether the information comes from a static page or a chatbot.”
Microsoft AI Bot might not be worth the money
Microsoft is pushing AI into all of its products, even Notepad. It has been testing its AI Copilot in MS Office (now called Microsoft 365): it summarizes emails, creates presentations, writes memos, and so on. Selected companies were able to test the tool, and according to the Wall Street Journal, they are not really persuaded: employees were eager to try it, but their enthusiasm waned quickly as the tool made frequent mistakes. The expected price of $30/month/user does not seem worth it.
Don’t let AI do your taxes yet
TurboTax and H&R Block, two major tax service providers in the U.S., offer their users a chatbot-based tax expert. Or at least that’s what they claim. The Washington Post warns its readers to avoid the chatbots unless they want to get into trouble. In its tests, the AI “experts” were wrong in more than 50% of answers for TurboTax and 30% for H&R Block.
ASCII art
Even though AI still cannot help you with taxes, it can help you build a bomb.
LLMs are trained on all kinds of data, so if left unchecked, they can advise on all kinds of questionable activities. Various techniques, such as data filtering and supervised fine-tuning, have been designed to prevent the models from doing so. However, all these methods have their limitations.
Researchers from the University of Washington managed to use ASCII art to get past the defenses of GPT-3.5, GPT-4, Gemini, Claude, and Llama2.
AI and news media
NYT’s Zach Seward on AI-powered journalism
Zach Seward, the new editorial director of AI initiatives at The New York Times, recently gave a talk about AI at the SXSW Conference in Austin, Texas. He provides concrete examples of how AI is used in journalism, both bad (such as Sports Illustrated publishing awful articles automatically written by made-up journalists; see Spotlight #7) and good (e.g., searching large troves of documents with embeddings, rephrasing prison policies and audits for public consumption).
Columbia Journalism Review – Report on AI in the news
Columbia Journalism Review, Columbia University’s magazine for journalists, published a detailed report by Felix Simon on the use of AI in the news. From July 2021 to September 2023, he interviewed more than 130 news workers, including journalists, data scientists, and product managers at US, UK, and German media outlets, including The Guardian, Bayerischer Rundfunk, The Washington Post, The Sun, and the Financial Times. He also interviewed 36 independent American and European experts.
In general, newsrooms have become more open to AI. The adoption is driven mostly by economic pressure, technology readiness, and hype.
For now, AI has brought no fundamental change and instead only helped make some old approaches more effective. Many of the most beneficial AI applications are quite mundane (transcription, search, content categorization, etc.).
There are some important challenges:
- Reputational risks due to the unreliability of AI output
- Increasing dependence on major tech companies (Google, Amazon, and Microsoft), especially for smaller publishers that cannot afford in-house AI development
- Increasing inequality among news organizations, with large international publishers having an advantage
You might also want to look at NiemanLab’s summary or the full report.
BBC Generative AI pilots
The BBC announced they are starting 12 generative AI pilots. Unlike Air Canada or H&R Block, the BBC is more cautious: most of the pilots are internal only, and the resulting content won’t be made public. The pilots fall into three groups:
1) Maximizing the value of existing content, e.g., translating or reformatting it, such as writing an article based on a live sports radio commentary.
2) New audience experiences, such as a chatbot providing personalized learning to students.
3) Simplifying and speeding up processes, for example, suggesting headlines and summaries.
AI at Ippen Digital
Nikita Roy from Newsroom Robots interviewed Alessandro Alviani, the product lead for AI at Ippen Digital, a part of the German Ippen Media Group. In the second part of the interview, they discuss fine-tuning language models on Ippen’s corpus of local German news so that they can assist journalists with writing headlines, lead paragraphs, summaries, etc. Another interesting topic is the role of AI in modular journalism: it can break a story into modules that can be selected and mixed to create personalized content.
Training Data & Intellectual Property
News publishers take different positions regarding their articles being used by AI companies to train their models. While The New York Times has sued OpenAI and Microsoft (see Spotlight #8), News Corp (the parent of Dow Jones, The Times, and The Sun) is negotiating, and Axel Springer has already closed a partnership with OpenAI.
In the meantime, more and more newspapers are blocking AI crawlers. According to the Reuters Institute, 48% of 150 top news sites across 10 countries block OpenAI. Compare that with 33% of the top 1,000 websites, as reported by Originality.ai (see also Spotlight #5). Fewer websites block Common Crawl (18% of the top websites), Google’s AI crawler (24% of the top news sites and 10% of the top websites), and Anthropic (4% of the top websites).
In short
- The web is now flooded with low-quality translated and generated clickbait articles. However, a survey found that AI-generated text may be good enough for marketing copy, and people liked how quickly it got to the point.
- Newsquest uses a few AI-assisted reporters to report on “mundane but necessary” content, freeing reporters to go into the field. That’s something we know well from our collaboration with the Czech News Agency. Hopefully, this approach can help with the ongoing local news problems.
New models and assistants
The competition among AI leaders intensifies. Google unveils Gemini 1.5 with a huge 1-million-token context window, along with Gemini Advanced. Anthropic debuts Claude 3, Mistral introduces Mistral Large, and Inflection AI launches Inflection-2.5. All are said to be GPT-4-class models. OpenAI counters this surge with its text-to-video Sora model, overshadowing Google’s text-to-video Lumiere model and the other LLM releases. On the other hand, Gemini Advanced attracts some unwanted attention for generating overly inclusive images.
We recommend reading Ethan Mollick’s post about Gemini Advanced. He dismisses the value of standard benchmarks and offers a high-level comparison with GPT-4. He also feels the model exhibits so-called “sparks” or “ghosts” of general intelligence, for instance when he played a Dungeons & Dragons-style game with it.
Search and assistants
A popular use for LLMs is AI-powered multimodal search, as seen with platforms like You.com, Perplexity.ai, and the AI browser Arc. Google responds by integrating an LLM writing assistant into the Chrome browser. Meanwhile, OpenAI is developing an AI agent to automate tasks on user devices, while HuggingFace is challenging OpenAI’s GPTs with Assistants, an open-source alternative for creating customized chatbots. Microsoft aims higher, constructing an Agent Foundation Model that combines language, image/video encoders, and an action encoder trained on robotics and games.
Following OpenAI’s example, Mistral partners with Microsoft and launches its Le Chat chatbot. Nvidia releases Chat with RTX, a locally run chatbot free from privacy concerns.
Amazon debuts its shopping assistant Rufus. It will be interesting to see how it handles the erroneous AI-generated product descriptions.
Chips
Meta purchases 350,000 H100 graphics cards while developing its own “Artemis” chip, and OpenAI has ambitions to open its own chip factories. Meanwhile, Groq unveils Language Processing Units that enable Mixtral to run at a speed of 500 tokens per second.
Open source
Significant developments also occurred in the open-source domain, as the Allen Institute for Artificial Intelligence introduced OLMo, a genuinely open-source model. The release includes everything: weights, inference code, training data, and evaluation code. Google contributed the Gemma family of models, while the RWKV Project released Eagle 7B, an attention-free model. Abacus.ai launched Smaug, which holds the top spot on HuggingFace’s Open LLM Leaderboard.
Model evaluation
Testing LLMs can be fun
Every AI developer would tell you how important evaluation is, but typically it is quite boring. Sometimes, however, the results are unexpected…
A common approach to testing LLMs is the so-called needle-in-a-haystack evaluation: you put a very specific statement (the needle) into a long, unrelated context (the haystack) and ask a question that can only be answered using the information in the needle.
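Below is a toy sketch of how such a test can be constructed. It is only an illustration: load_long_documents and ask_llm are hypothetical stand-ins for your own document source and LLM client.

```python
# A toy needle-in-a-haystack test: hide one specific fact in a long,
# unrelated context and check whether the model can retrieve it.
import random

def build_haystack(filler_paragraphs, needle, question):
    paragraphs = list(filler_paragraphs)
    # Insert the needle at a random position in the haystack.
    paragraphs.insert(random.randrange(len(paragraphs) + 1), needle)
    context = "\n\n".join(paragraphs)
    return f"{context}\n\nAnswer using only the text above: {question}"

needle = "The best pizza topping combination is figs, prosciutto, and goat cheese."
prompt = build_haystack(load_long_documents(),  # hypothetical document source
                        needle,
                        "What is the best pizza topping combination?")
answer = ask_llm(prompt)  # hypothetical call to the model under test
```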
When Anthropic researchers tested Claude 3 Opus using a needle about pizza toppings, it answered correctly, but it also added: “This sentence seems very out of place and unrelated to the rest […]. I suspect this pizza topping ‘fact’ may have been inserted as a joke or to test if I was paying attention.” (See Alex Albert’s tweet.)
When a model detects it is being tested, we have a problem. But it’s not the only challenge we face. Additional issues include inadequately annotated benchmarks (see Spotlight #7) and contamination of LLM training data. For a deeper analysis, check out Rohit Krishnan’s comprehensive post. We do need to design better evaluations.
Evaluation and leaderboards
HuggingFace introduced several new leaderboards, including the Enterprise Scenarios Leaderboard for real-world applications, the LLM Safety Leaderboard, and the Hallucinations Leaderboard. To battle contamination, they released a dynamic benchmark, the NPHardEval Leaderboard.
UC Berkeley unveiled a Function-Calling Leaderboard, together with their Gorilla OpenFunctions-v2 model. It is particularly useful given the growing integration of function calling into LLMs. Finally, there is the AgentBoard benchmark for LLM agents, which looks not only at the final success rate but also evaluates the intermediate steps.
New training methods
- Stanford University researchers have simplified the alignment of LLMs to human preferences, traditionally done via Reinforcement Learning from Human Feedback (RLHF), with Direct Preference Optimization (DPO). The method employs a simple classification loss, which improves training stability and reduces extensive hyperparameter tuning, while improving summarization quality, sentiment control, and the quality of single-turn dialogue. Traditionally, a reward model is first fitted to human-labeled answer preferences and then used to train a policy with Proximal Policy Optimization (PPO). In contrast, DPO skips the reinforcement learning loop and the explicit reward model by transforming the reward loss into a loss on the policy itself; the DPO policy network then represents both the language model and an implicit reward (see the sketch after this list).
- At the University of California, researchers surpassed DPO with Self-Play Fine-Tuning (SPIN) across various benchmarks while reducing the need for human preference data or feedback from stronger LLMs. SPIN operates similarly to Generative Adversarial Networks: the model’s previous iteration generates responses, the model being trained learns to distinguish them from human responses, and this pushes its own outputs to become indistinguishable from the human data. The iterative process matches DPO in the initial iterations and surpasses it with further ones, though with diminishing improvements. They also showed that the global optimum of the training is reached when the LLM’s policy aligns with the target data distribution.
- DeepMind’s ReSTEM self-training method reduces reliance on human-labeled data: the model generates multiple output samples (solutions) for each input, those are filtered with a binary reward function to create new training data, and the model is iteratively retrained on it (see the schematic loop after this list). The researchers demonstrated favorable scaling for larger (PaLM 2) models, contrasting with Alibaba DAMO Academy’s observations of diminishing returns for larger models with increased training data. The use of synthetic data likely contributes to the significant performance gains and the improved results on related held-out benchmarks.
- Similarly, Microsoft showed the usefulness of synthetic data by fine-tuning Mistral 7B into a text embedding model, achieving state-of-the-art results in under 1k training steps. They showed that Mistral’s pre-training already produces robust text representations, so the embedding model requires only minimal fine-tuning. Using GPT-4, they generated a diverse dataset of 500k examples with 150k unique instructions, covering asymmetric tasks (pairs that are semantically related but not paraphrases of each other, as in retrieval) and symmetric tasks (pairs with similar semantic meanings but different surface forms). They then trained with a standard contrastive loss (bringing similar examples closer together) on a mixture of synthetic and labeled data.
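As a minimal sketch of the DPO idea mentioned above: the loss below compares the policy-to-reference log-ratios of the chosen and rejected responses with a simple classification objective. The function name and the value of beta are illustrative, not the authors’ reference implementation.

```python
# A minimal sketch of the DPO objective, assuming you already have the
# summed log-probabilities of the chosen and rejected responses under
# the trained policy and under the frozen reference model.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are the scaled log-ratios between policy and reference.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Simple binary classification loss: the chosen response should win.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```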
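And here is a schematic version of the ReSTEM-style loop from the DeepMind item. The generate, reward, and finetune calls are hypothetical stand-ins for your own sampling, answer-checking, and training routines, not DeepMind’s code.

```python
# A schematic ReSTEM-style self-training loop:
# sample solutions, keep those a binary reward accepts, retrain, repeat.
def rest_em(base_model, prompts, reward, n_samples=16, n_iterations=3):
    model = base_model
    for _ in range(n_iterations):
        dataset = []
        # Generate step: sample candidate solutions from the current model.
        for prompt in prompts:
            for solution in model.generate(prompt, num_samples=n_samples):
                # Keep only solutions that pass the binary reward,
                # e.g., the final answer matches a known result.
                if reward(prompt, solution) == 1:
                    dataset.append((prompt, solution))
        # Improve step: fine-tune (from the base model, as in the paper)
        # on the filtered, self-generated data.
        model = base_model.finetune(dataset)
    return model
```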
Improvements without training
- In a collaborative study on the cost-effectiveness of post-training methods, researchers introduced the Compute-Equivalent Gain (CEG) metric: how much additional training compute a model would need to match the improvement brought by a post-training enhancement. Overall, post-training proved highly beneficial, with tool-based methods offering substantial gains at minimal added cost (a CEG in the tens). Prompting and solution-selection methods (best-of-n) are similarly effective but introduce extra inference costs. Fine-tuning shows a wide range of usefulness, from no gain to a CEG in the thousands, while evaluating scaffolding enhancements, like Tree-of-Thought prompting and agents, proved too challenging.
- Researchers from the Mohamed bin Zayed University of AI evaluated 26 prompt engineering principles across GPT and LLaMA models. Larger models generally benefit more from these principles in terms of correctness. Especially useful principles include instructing the model to request more details, providing a style sample or specifying the intended audience, and using output priming. Tipping and using delimiters offer smaller advantages, but they do make a difference for longer and more complex prompts, according to Sheila Teo, the winner of Singapore’s GPT-4 prompting competition. She also recommends segmenting complex tasks into smaller steps, a practice the study also found to enhance correctness.
- Similar observations apply to Chain-of-Thought prompting, as shown in a paper exploring the impact of reasoning length. More steps (adjusted for task complexity) generally enhance correctness, while shortening the reasoning decreases it, even though the information in the prompt is identical. Intriguingly, longer reasoning tends to yield correct outcomes even when it contains minor mistakes. Useful additional steps may involve adding context, summarizing previous reasoning, or self-verification.
- With prompts growing ever longer, Microsoft researchers developed the LLMLingua technique, which achieves up to 20x prompt compression with minimal performance loss. It builds on the Selective-Context algorithm, which uses a smaller model to remove low-perplexity tokens with only a minor impact on LLM comprehension (a toy illustration of this idea follows this list). To better preserve dependencies between tokens, they introduce a budget controller that allocates different compression ratios to different parts of the prompt, plus a token-level iterative algorithm for fine-grained compression. They also address the discrepancy between the two models with an instruction-tuning-based method that aligns their token distributions.
- Meta researchers enhanced the factuality and objectivity of LLaMA 2 70B using the System 2 Attention prompt: the model first regenerates the relevant parts of the input while leaving out user biases and opinions, and then responds to this debiased prompt.
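To make the prompt-compression idea above more concrete, here is a toy version of the perplexity-based filtering that Selective-Context and LLMLingua build on, using GPT-2 as the small scoring model. It is a simplification under our own assumptions and omits LLMLingua’s budget controller and iterative token-level compression.

```python
# A toy version of perplexity-based prompt compression: a small model scores
# every token, and low-surprisal (easily predictable) tokens are dropped.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def compress(prompt: str, keep_ratio: float = 0.5) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Surprisal of each token given its prefix (higher = more informative).
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    surprisal = -log_probs.gather(2, ids[:, 1:, None]).squeeze()
    # Keep the first token plus the most informative tokens, in original order.
    k = max(1, int(keep_ratio * surprisal.numel()))
    keep = torch.topk(surprisal, k).indices + 1
    keep = torch.cat([torch.tensor([0]), keep]).sort().values
    return tokenizer.decode(ids[0, keep])
```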
Please subscribe and stay tuned for the next issue of Geneea’s AI Spotlight newsletter!