The seventh edition of our newsletter on Large Language Models is here.
In this edition, we explore
- the turmoil at OpenAI,
- several novel LLMs, including Google Gemini,
- a new round of challenges and fresh approaches to them,
- the mutual impact of AI and media, and
- the latest developments in EU AI regulation.
OpenAI Turmoil
- November was a tough month at OpenAI.
- First came a pause on new sign-ups, likely due to a DDoS attack from Russia; Anthropic experienced a similar situation.
- Then Sam Altman was fired, hired by Microsoft, and rehired by OpenAI, and the board that fired him resigned. The reasons for the original firing remain somewhat unclear; some point to a mysterious system called Q*. Gary Marcus, an LLM skeptic, comments on this.
Google Gemini
- After a long wait, on December 6, Google finally announced Gemini (report).
- It comes in three versions: Nano for mobile devices, Pro, which now powers Bard (in some locations), and Ultra, the largest and most capable version.
- The capabilities, especially in multimodal tasks, are impressive, surpassing GPT-4 on numerous benchmarks.
- However, criticism soon surfaced regarding how the results were presented, with concerns over the emphasis on the MMLU benchmark and doubts about its exact numbers due to identified errors (Hugging Face, Erenrich).
- Additionally, the comparison with GPT-4 is disputed because the two models were evaluated with different prompting methods; Microsoft demonstrated that GPT-4 keeps a slight advantage when similar prompts are used.
- Also, Google admitted that an impressive demo video showcasing multimodality was significantly edited for effect. Pity, it looked really cool.
- Meanwhile, we can experiment with Gemini Pro through its API; we sketch a minimal call below. Our first impression is that the model hallucinates less than PaLM: a positive improvement for Bard. It seems to cite information accurately, with proper links. However, it is very cautious, blocking responses for unsupported languages and random topics, like astrology forecasts for the Gemini sign. It also seems proficient at image interpretation, reading our very colorful logo with only a minor hiccup.
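For those who want to try it themselves, here is a minimal sketch of calling Gemini Pro (text) and Gemini Pro Vision (images) through Google's google-generativeai Python SDK. The prompts, file name, and API-key handling are illustrative, and model names or parameters may change, so check the current documentation before relying on it:

```python
# pip install google-generativeai pillow
# A minimal, illustrative sketch; model names and parameters may change over time.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # or read the key from an environment variable

# Text-only prompt with Gemini Pro
text_model = genai.GenerativeModel("gemini-pro")
answer = text_model.generate_content("Summarize the main criticisms of the Gemini launch.")
print(answer.text)

# Multimodal prompt with Gemini Pro Vision (e.g., reading a logo image)
vision_model = genai.GenerativeModel("gemini-pro-vision")
logo = Image.open("logo.png")  # illustrative file name
description = vision_model.generate_content(["What text do you see in this logo?", logo])
print(description.text)
```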
LLM Zoo
- Mistral AI released the impressive Mixtral 8x7B model, which outperforms larger models like Llama 2 70B and GPT-3.5 across various benchmarks, with notably faster inference, thanks to its Mixture-of-Experts (MoE) architecture. It employs eight distinct groups of experts (feed-forward networks) guided by a ‘router’ network that decides which experts each token is sent to (see the routing sketch after this list). While the model still demands substantial memory, inference cost is comparable to that of a ~12B model, since only two experts are used per token. Unlike Google, they did not do a shiny fake demo but simply tweeted a torrent magnet link. A few days later, they also made their models accessible via an API, currently in an early-access beta.
- Claude 2.1. Anthropic released a new version of Claude. It can handle a 200K-token context (though, as we report below, far from perfectly) and hallucinates less. They also introduced support for customer-provided tools.
- Musk’s xAI released Grok, a “Conversational AI for understanding the universe”. Apparently, it is modeled after the AI from The Hitchhiker’s Guide to the Galaxy: it should answer almost anything and even suggest the right questions. The name itself comes from Heinlein. If you live in the U.S., you can test it, and you even get real-time access to Twitter :D. The answers are supposed to be witty, but unfortunately, people, including Sam Altman, are making fun of Grok’s dad jokes.
- Meta introduced SeamlessExpressive, a speech-translation model that preserves style, tone, speed, rhythm, and expression. You can try a cool demo or dive into the technical details (GitHub, paper, Hugging Face).
- Berkeley researchers introduced the Starling-7B model, trained with Reinforcement Learning from AI Feedback (RLAIF). Excelling on the Chatbot Arena Leaderboard without requiring laborious human feedback, it addresses concerns about training-data shortages and the contamination of online data by LLM output. Meanwhile, OpenAI tackles these challenges through strategic data partnerships.
- CogVLM, a multimodal LLM, achieves state-of-the-art performance by employing a trainable visual expert module for deep fusion of vision and language, steering clear of the common shallow-alignment approach of simply mapping image features into the language model’s embedding space.
- Following OpenAI’s GPTs, Microsoft announced Copilot Studio for creating tailored copilots, while LangChain introduced OpenGPTs.
- Navigating the LLM Zoo for the right model is tricky. For coding assistance, check this comprehensive survey – hint: go for fine-tuned models like Code Llama.
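To make the Mixture-of-Experts routing in the Mixtral item above more concrete, here is a small, self-contained PyTorch sketch of a sparse MoE feed-forward layer: a router scores eight experts per token and only the top two are evaluated. This is our own simplified illustration with made-up dimensions, not Mistral's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFeedForward(nn.Module):
    """Toy sparse MoE layer: a router picks the top-2 of 8 experts for each token."""

    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # per-token expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: (batch, seq, d_model)
        scores = self.router(x)                    # (batch, seq, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # each token is processed by top_k experts
            for e, expert in enumerate(self.experts):
                mask = chosen[..., slot] == e      # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = SparseMoEFeedForward()
tokens = torch.randn(1, 4, 512)                    # one sequence of 4 token embeddings
print(layer(tokens).shape)                         # torch.Size([1, 4, 512])
```

Only two of the eight expert networks run for any given token, which is why memory requirements stay high (all experts must be loaded) while the compute per token remains much lower.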
Challenges
- A paper by researchers from Google DeepMind and several universities reports that it is possible to extract training data from LLMs, including aligned models such as ChatGPT.
- Embeddings may not be as secure as thought, as Cornell researchers showed with Vec2Text, which iteratively refines guesses of the text behind an embedding. Fortunately, the attack requires knowledge of the specific embedding model, and it worked only for short texts.
- In AI Spotlight #4, we reported an issue regarding blind spots in context windows. The problem persists in new models. Anthropic’s latest Claude 2.1 model (with up to 200K tokens) suffers from the same “lost in the middle” phenomenon as GPT-4 Turbo (up to 128K tokens). Anthropic later reported that the problem can be minimized by adding a single sentence that directs the assistant to find the most relevant sentence first (see the prompt sketch after this list). This trick raises the recall from 27% to 98%!
- OpenAI’s GPT-4 is more trustworthy but easier to trick. Delving into the trustworthiness of GPT models, researchers uncovered an interesting twist. Although GPT-4 is generally more trustworthy than GPT-3.5, it is more vulnerable to jailbreaking. The quirk arises from GPT-4 following instructions more precisely, which makes it surprisingly compliant even with misleading ones.
- The word of the year for 2023, as selected by the Cambridge Dictionary, is “hallucinate”, underscoring the significance of this challenge. Vectara has released an open-source model for evaluating hallucinations in LLMs, with performance metrics of influential models available on its GitHub leaderboard.
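A concrete sketch of the long-context trick mentioned above: the assistant's turn is pre-filled with a sentence that steers the model to locate the relevant passage before answering. The steering sentence follows Anthropic's published example; the helper function, prompt layout, and placeholder texts are our own illustration, shown as plain string assembly in Claude's legacy Human/Assistant prompt format rather than a full API call:

```python
def build_long_context_prompt(context: str, question: str) -> str:
    """Assemble a Claude-style prompt whose Assistant turn is pre-filled with a
    steering sentence, nudging the model to locate the relevant passage first."""
    steering = "Here is the most relevant sentence in the context:"
    return (
        f"\n\nHuman: Here is a long document:\n<document>\n{context}\n</document>\n\n"
        f"Using only the document above, answer this question: {question}"
        f"\n\nAssistant: {steering}"
    )

prompt = build_long_context_prompt(
    context="(imagine ~200K tokens of text with one relevant fact buried in the middle)",
    question="What did the report recommend about model evaluation?",
)
print(prompt)
```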
Fresh Approaches to Challenges
- We found the method of enhancing prompts with emotional stimuli effective at improving results and very simple to use (see the example after this list).
- Expanding on Chain-of-Thought prompting, the new Everything-of-Thought (XOT) approach integrates a Monte Carlo Tree Search (MCTS) module. The MCTS module is trained with reinforcement learning to seek effective thought structures, enhancing the LLM’s reasoning and planning abilities. The LLM revises and refines these thoughts and, in case of errors, iteratively calls the MCTS module for correction.
- The Chain-of-Note technique enhances Retrieval-Augmented Generation by having the LLM write and assess “reading notes” on the retrieved documents. This allows the LLM to recognize when the documents are insufficiently relevant and to answer “unknown” instead of providing a potentially misleading answer.
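To show how simple the emotional-stimuli trick above is in practice, here is a minimal sketch. The stimulus wording follows one of the examples from the paper; the helper function and the task prompt are purely illustrative, and the augmented prompt can be sent to any chat or completion API:

```python
# A minimal sketch: append an emotional stimulus to an existing prompt.
EMOTIONAL_STIMULUS = "This is very important to my career."

def add_emotional_stimulus(prompt: str, stimulus: str = EMOTIONAL_STIMULUS) -> str:
    """Return the original prompt with an emotional stimulus appended."""
    return f"{prompt.rstrip()} {stimulus}"

base_prompt = "Summarize the attached contract and list any unusual clauses."
print(add_emotional_stimulus(base_prompt))
# Summarize the attached contract and list any unusual clauses. This is very important to my career.
```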
AI in the News & Media
- An article at Futurism explores how Sports Illustrated came to publish gems like “volleyball can be a little tricky to get into, especially without an actual ball to practice with” under clearly non-existent authors with AI-generated photos. SI’s reaction essentially amounts to “everything is ok, we just did not check somebody else’s work”.
- Reporters Without Borders (RSF) and partners crafted the Paris Charter on AI and Journalism, outlining ten ethical principles for newsrooms. The principles emphasize ethics and human agency in decision-making, the media’s role in distinguishing artificial content, and the need for the media’s engagement in AI governance to uphold the integrity of journalism.
- Northwestern University researchers investigated GPT-3’s application in generating different news angles from scientific article abstracts to aid journalists in assessing newsworthiness and offer diverse story-framing angles. Reporters who tested it expressed excitement, finding it reliable for pinpointing key points, identifying interviewees and target audiences, and interpreting complex scientific language.
Media’s Take on AI
- The Economist has a great interview with Reid Hoffman, a LinkedIn and OpenAI cofounder, discussing the influence of AI on the workforce. While we do not agree with all of his viewpoints, the interview is very thought-provoking and engaging.
- The current all-male OpenAI board and The New York Times’ “Who’s Who in AI”, exclusively featuring men, emphasize the perceived gender disparity in the field. In response, this article from Séphora Bemba spotlights influential women in AI.
- Nvidia unveiled the H200 chip for AI, featuring 141 GB of next-generation memory for fast inference. Is it possible that their ChipNeMo LLM contributed to the design?
- In addition to Mistral, France intensified its AI investment with the announcement of Kyutai, a non-profit research lab. Billionaire Xavier Niel and partners have allocated nearly €300 million and secured a thousand Nvidia H100 GPUs to support its researchers.
EU AI Regulation
- On December 8, the EU Parliament and the Council (= member states) provisionally agreed on the EU AI Act (AIA). Some technical details remain to be finalized, and the act still has to be formally approved by both the Parliament and the member states, so it won’t officially go into effect before 2025.
- With minor changes, the act preserved the risk-based approach we described in AI Spotlight #3.
- In the case of LLMs, the main focus is on transparency and testing before release.
- There is criticism from both sides. Emmanuel Macron and DigitalEurope, a business group, say that the AIA is too strict and will hamper innovation (note that Mistral is a French company). Others think the regulation is too weak and full of loopholes. Amnesty International complains that it “greenlighted dystopian digital surveillance” because it did not ban facial recognition.
Please subscribe and stay tuned for the next issue of Geneea’s AI Spotlight newsletter!