The eleventh edition of our newsletter on Large Language Models is here.
In this edition, we explore
- Stanford’s AI Index Report,
- a mix of releases,
- AI in journalism, and
- tips for improving model answers.
Business
Stanford AI Index Report
Stanford released its annual AI Index Report. The authors highlight the top 10 takeaways, but let’s look at some other interesting findings, too:
- Even though industry models dominate (takeaway #2) and outperform open-source ones, the share of released open-source models has grown significantly: from 33% in 2021 and 44% in 2022 to 66% in 2023.
- While evaluations for LLM responsibility are still lacking (takeaway #5), more challenging benchmarks emerged for popular areas such as agents, coding, reasoning, and hallucinations. Although it’s common to evaluate models with other models, the importance of human evaluations is increasing, thanks to frameworks like the Chatbot Arena.
- According to a survey by McKinsey, 42% of organizations reported cost reductions and 59% reported revenue increases with AI implementation. This isn’t surprising, as AI enhances worker productivity and improves work quality (takeaway #7). But keep in mind that the widely cited McKinsey survey is from 2023 and maps the situation in 2022.
- The number of AI regulations is increasing not only in the United States (takeaway #9) but worldwide, with legislative proceedings doubling globally and taking place in 49 countries, representing every continent. One example is the EU AI Act, covered in Spotlight #3 and Spotlight #9.
- In North America, computer science students are becoming more ethnically diverse. However, the narrowing of the gender gap in informatics has been slow in both Europe and North America.
In short: the “tapestry” of news
OpenAI showcased their new GPT-4o model, which is faster, cheaper, and has improved multimodality. OpenAI also confirmed that it was the rumored gpt2-chatbot. It supports native speech input and output, making it a useful real-time assistant. At the I/O conference, Google countered by introducing Project Astra, another fast multimodal AI assistant. However, it will be released to the public only later this year.
We now also understand why GPT often uses words like “delve”, “tapestry”, or “leverage”. OpenAI used workers from Africa to finetune the raw model, aligning it with their linguistic preferences. For example, the word “delve” is far more common in Nigerian English than in US or British English.
As usual, many new models have been released, including Mistral’s larger MoE model Mixtral 8×22B, Snowflake’s big Arctic LLM with 480 billion parameters specialized in SQL generation and coding, Meta’s impressive Llama 3 in 8B and 70B versions, Apple’s small OpenELM, and Microsoft’s small but surprisingly capable Phi-3. For an in-depth analysis, we recommend Sebastian Raschka’s newsletter.
Researchers have developed an exciting new architecture called Kolmogorov–Arnold Networks. This innovation aims to replace the classical building block of neural networks, the Multi-Layer Perceptron, by using learnable activation functions. Zulnorain Ahmed explains how it works and discusses the potential implications.
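To make the contrast with a standard MLP layer concrete, here is a minimal sketch of the idea in Python (NumPy only). Each edge of the layer carries its own learnable univariate function; for illustration we parameterize it as a simple polynomial, whereas the paper uses B-spline bases, so treat this as a toy approximation rather than the authors’ implementation.

```python
import numpy as np

def kan_layer(x, coeffs):
    """Toy Kolmogorov-Arnold-style layer (illustrative sketch only).

    Instead of output = activation(W @ x) with a fixed activation, every edge
    (i, j) applies its own learnable univariate function phi_ij to input x_j,
    and each output node simply sums the results. Here phi_ij is a polynomial
    with learnable coefficients; the paper uses B-splines instead.

    x      : (in_dim,) input vector
    coeffs : (out_dim, in_dim, degree + 1) learnable coefficients per edge
    """
    degree = coeffs.shape[-1] - 1
    # Basis expansion of each input: powers x_j^0 .. x_j^degree
    basis = np.stack([x ** k for k in range(degree + 1)], axis=-1)  # (in_dim, degree+1)
    # phi_ij(x_j) = sum_k coeffs[i, j, k] * x_j^k, then sum over inputs j
    return np.einsum("ijk,jk->i", coeffs, basis)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # 4 input features
coeffs = rng.normal(scale=0.1, size=(2, 4, 4))  # 2 outputs, degree-3 edge functions
print(kan_layer(x, coeffs))
```

In a full KAN, the coefficients of these edge functions are what gets trained, in place of the usual weight matrices and fixed activations.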
AI for Media
Blueprint for evaluating AI tools in journalism
Journalists often hesitate to embrace artificial intelligence due to a lack of tailored evaluation methods. Researchers from Northwestern University proposed a framework that addresses this gap, focusing on three aspects:
- Quality Assessment: The AI’s performance should be measured against journalistic values like novelty, controversy, surprise, timeliness, and social impact, as well as editorial objectives. For example, one could assess how frequently the tool uncovers new perspectives or angles for a story.
- User Experience: The tool should be easy to interact with, both to get outputs and to accept or reject suggestions. Suggested documents should be relevant, and reporters should feel the tool helps improve their writing. Customizability is also a crucial aspect.
- Transparency and Alignment: Beyond accuracy and consistency, AI tools must be transparent and traceable, for example, explaining why an item is considered newsworthy. They should also respect professional standards and style guides to prevent generating extra work for the editor.
Check out Sachita Nishal’s presentation and get inspired!
Trust drives usage and willingness to pay
Schibsted News Media identified what drives people’s trust in the media in Sweden and Norway. As Agnes Stenbom explained, “trust can be a key to unlocking user revenue”. Of course, trust is built on the credibility of journalists, the news creation process, and the content itself. Just as crucial is the personal relevance and usefulness of the articles to users, along with selectivity – the composition of the topics and events covered. See how we applied this by analyzing municipal coverage for Radio France.
Transforming News Creation
At the INMA World Congress of News Media, David Caswell discussed how genAI alters news creation. A live survey revealed that 70% of attendees use AI to create transcripts and summaries, while 60% let it suggest headlines and SEO. The investigative agency Cuestión Pública employs Retrieval Augmented Generation (RAG) to enhance breaking news with information about prominent figures, and Zamaneh Media uses genAI to generate its newsletter. Other applications include generating alerts, social media posts, and tagging metadata.
OpenAI’s pitch to publishers
The list of publishers in OpenAI’s Preferred Publishers Program is growing, but so is the list of publishers suing them (see Spotlight #8). Adweek obtained confidential documents about the program. For the right to train on and display publisher content with attribution, OpenAI offers financial compensation, priority placement, and richer brand expression in user chats. The financial compensation consists of a guaranteed value for the publisher’s archive of articles and a variable value based on user interactions with the content. OpenAI’s statistics show that 25% of users already use the browsing function, and The Atlantic predicts that search with integrated AI will answer about 75% of queries without clickthrough. OpenAI claims the leaked documents are for discussion purposes only and contain some mischaracterizations and outdated information.
Surprising Ways to Improve Outputs
Sci-fi prompting
The most effective prompts for LLMs can be rather unconventional. Broadcom researchers let people compete with automatic prompt optimizers to generate optimal prompts for solving mathematical problems. The optimization methods outperformed the humans, for example by prompting the LLM to act as a Star Trek captain. Should we even try to enhance prompts ourselves?
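For the curious, here is a minimal, hedged sketch of what automatic prompt search boils down to (not the researchers’ actual method): generate candidate system prompts, score each one on a small validation set, and keep the winner. The `call_llm` function and the candidate prompts are hypothetical placeholders.

```python
# Minimal sketch of automatic prompt search; `call_llm` is a hypothetical
# placeholder for whatever LLM client you use.

def call_llm(system_prompt: str, question: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def accuracy(system_prompt: str, dataset: list[tuple[str, str]]) -> float:
    """Fraction of validation questions whose expected answer appears in the output."""
    hits = sum(expected in call_llm(system_prompt, question)
               for question, expected in dataset)
    return hits / len(dataset)

# Candidate prompts could be written by hand or generated by another LLM.
candidates = [
    "You are a careful mathematician. Solve the problem step by step.",
    "Captain's log: our ship is in danger and only solving this problem can save us.",
    "Answer as concisely as possible.",
]

def best_prompt(candidates: list[str], dataset: list[tuple[str, str]]) -> str:
    # Keep whichever candidate scores best on the validation set.
    return max(candidates, key=lambda p: accuracy(p, dataset))
```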
Echo embeddings
Carnegie Mellon University researchers discovered a simple yet effective method to enhance embeddings (a mathematical representation of text meaning in a multidimensional space). Conventionally, a causal language model encodes meaning left to right, so each token’s representation lacks information about what comes later. The researchers address this limitation by echoing the input: the text is simply repeated, and the embeddings are taken from the second copy. For example, for the sentences “She loves summer but dislikes the heat.” and “She loves summer for the warm evenings”, conventional embeddings would overestimate the similarity of their initial segments.
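Here is a minimal sketch of the echo trick with a causal language model using the Hugging Face transformers library: the sentence is fed twice, and only the tokens of the second copy (which can attend to the full sentence via the first copy) are pooled. The model choice, prompt wording, and the way the second span is located are illustrative assumptions, not the authors’ exact setup.

```python
# Echo embeddings sketch: repeat the sentence and mean-pool over the second copy only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # any causal LM works for the idea
model = AutoModel.from_pretrained("gpt2")

def echo_embedding(sentence: str) -> torch.Tensor:
    # The first copy provides the "future" context the second copy can attend to.
    text = f"Rewrite the sentence: {sentence} The sentence again: {sentence}"
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, hidden_dim)
    # Approximate the token span of the second copy and pool only over it.
    n = len(tokenizer(" " + sentence, add_special_tokens=False)["input_ids"])
    return hidden[-n:].mean(dim=0)

# The tokens of "She loves summer" inside the second copy now "know" whether the
# sentence continues with "but dislikes the heat" or "for the warm evenings".
```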
Answer elections
It is known that allowing multiple LLMs to vote on an answer improves results. Researchers from Stanford University, UC Berkeley, Google, and Princeton University examined how the results improve for queries with various difficulty levels. Interestingly, only answers to easy queries improve with more votes; with hard queries, we just get more wrong answers, degrading the performance. The challenge of accurately distinguishing easy queries from hard ones remains.
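The voting itself is trivial to implement; a minimal sketch follows, with a hypothetical `ask_model` call standing in for an LLM client that samples answers. The study’s point is about when this helps: more votes amplify whichever answer the model produces most often, which only pays off on the easy queries.

```python
# Minimal sketch of answer election: sample several answers, return the most common.
# `ask_model` is a hypothetical placeholder for an LLM client sampling with temperature > 0.
from collections import Counter

def ask_model(question: str) -> str:
    raise NotImplementedError("plug in your sampling LLM client here")

def elect_answer(question: str, n_votes: int = 10) -> str:
    answers = [ask_model(question) for _ in range(n_votes)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```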
Thinking before speaking
Researchers from Stanford and Notbad AI Inc taught the Mistral model to think before speaking with Quiet-STaR, a generalization of the STaR framework. STaR fine-tunes a model on a question-answer dataset with generated rationales and uses the REINFORCE algorithm to improve them. The generated thoughts then guide the model through difficult questions. Quiet-STaR generalizes the thoughts to reasoning that helps infer future text in general, and the thought tokens are generated in parallel. While this approach enhances performance, it comes with a significant overhead of extra “thought” tokens.
Dots before speaking
Are thoughts crucial? Researchers from New York University had the intriguing idea that performance can be improved by adding computation through generating extra tokens, regardless of their content. They experimented with meaningless filler tokens (‘…’), teaching the model how to use them. Nonetheless, they demonstrated that filler tokens really do improve performance, almost matching the Chain-of-Thought approach on a subclass of problems (first-order logic). This brings a new challenge to the interpretability of LLM results.
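To illustrate the setup (a hedged sketch, not the authors’ code): the prompt is simply padded with dots before the answer slot, giving the model extra forward passes that carry no content. The number and placement of the fillers here are our assumptions, and, as the article notes, the effect requires a model taught to use them.

```python
# Filler-token sketch: give the model extra, contentless tokens to "compute" on.
# Filler count and placement are illustrative assumptions; the benefit only
# appears for models trained to make use of the fillers.

def with_fillers(question: str, n_fillers: int = 30) -> str:
    return f"{question}\n{'.' * n_fillers}\nAnswer:"

print(with_fillers("Is (A or B) and (not A) satisfiable when B is true?"))
```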
Please subscribe and stay tuned for the next issue of Geneea’s AI Spotlight newsletter!