Geneea’s AI Spotlight #6

The sixth edition of our newsletter on Large Language Models is here.

Today, we take a look at

developments in AI infrastructure,
new models and prompting methods,
multimodal models,
newsroom innovations, and
ethical challenges.

State of AI Report

The highlight of the last month is undeniably the State of AI Report 2023, with numerous written and recorded summaries available.

The report is split into several sections covering research, industry, politics, safety, and predictions.
From a research point of view, we are looking forward to smaller and more capable models as well as multimodal models (see more below).
The Industry section explores AI chip scarcity (AI Spotlight #5), which also spills over into politics.
Policies are slow to follow the trends, while many models are easy to jailbreak.
If at least some of the report’s predictions come true, we have a lot to look forward to in 2024.

Developments in AI Infrastructure

Recently, we have seen several providers introducing serverless LLM solutions. Compared to the self-hosted alternative, these are significantly easier and cheaper to set up and use. Some of the offerings also support advanced customizations. Cloudflare, one of the providers, has also partnered with Hugging Face.
Google extends legal protection for users of their AI models, following Microsoft’s initiative.
LangChain introduced Templates, simplifying project creation by providing a range of end-to-end template architectures for various applications. Although primarily RAG-focused, the templates also cover other useful aspects like guardrails, step-back prompting, and interaction with Elasticsearch.
Microsoft researchers presented AutoGen, an open-source framework for agents. It offers easy setup of specialized agents and diverse conversational patterns among them. These include dynamic group chats of multiple agents moderated by an administrator agent, and an interesting three-agent setup featuring a guardian agent responsible for validating actions, safeguarding outputs, or offering operational insights.
Tereza Tizkova compiled a comprehensive list of the many agents already in existence.
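The guardian pattern described above can be illustrated with a minimal sketch in plain Python. It deliberately does not use the AutoGen API: the `run_guarded` function and the three toy agents are hypothetical stand-ins, meant only to show the control flow in which every proposed action is validated before it is executed.

```python
from typing import Callable, Tuple

# Hypothetical agent signatures, for illustration only (not AutoGen's API).
Proposer = Callable[[str, str], str]          # (task, feedback) -> proposed action
Guardian = Callable[[str], Tuple[bool, str]]  # action -> (approved, feedback)
Executor = Callable[[str], str]               # action -> result


def run_guarded(task: str, propose: Proposer, guard: Guardian,
                execute: Executor, max_rounds: int = 3) -> str:
    """Execute only actions the guardian approves; feed rejections back to the proposer."""
    feedback = ""
    for _ in range(max_rounds):
        action = propose(task, feedback)
        approved, feedback = guard(action)
        if approved:
            return execute(action)
    raise RuntimeError(f"No action approved after {max_rounds} rounds")


# Toy stand-ins for LLM-backed agents.
def propose(task: str, feedback: str) -> str:
    # Proposes a risky command first, then a safe one once it has feedback.
    return "rm -rf build" if not feedback else "ls build"


def guard(action: str) -> Tuple[bool, str]:
    if action.startswith("rm "):
        return False, "Destructive commands are not allowed."
    return True, ""


def execute(action: str) -> str:
    return f"executed: {action}"


result = run_guarded("inspect the build directory", propose, guard, execute)
```

In a real setup, each of the three callables would wrap an LLM-backed agent; the loop itself stays the same.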

News from OpenAI

There was a flurry of announcements at OpenAI DevDay:

GPT-4 Turbo is an improved version of GPT-4, with support for a 128K context window, better function-calling accuracy, and improved control over the output format. It is also cheaper, and its training data covers events up to April 2023. GPT-3.5 Turbo got some of these improvements too.
The Assistants API is meant for building agents. A nice feature is the support for persistent and infinitely long threads: developers can offload conversation history management to OpenAI, saving space in the context window. The API also supports three tools: a sandboxed Python interpreter, retrieval capabilities, and function calling. You can try all this in the playground.
Enhanced support for non-text modalities: GPT-4 Turbo accepts image input, DALL·E 3 offers an API for image generation, and there is a new text-to-speech model.
GPT-4 fine-tuning is now available through an experimental access program, although OpenAI warns that it is more challenging than with GPT-3.5.
Most of the improved models are also cheaper and have higher rate limits.
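To make the function calling mentioned above more concrete, here is a minimal offline sketch. It assumes that a tool is described with a JSON Schema and that the model's tool call arrives as a function name plus a JSON-encoded argument string. The `get_weather` function, its parameters, and the `dispatch` helper are hypothetical, and no API request is made.

```python
import json
from typing import Any, Callable, Dict

# Tool description in JSON Schema style; the function itself is hypothetical.
WEATHER_TOOL: Dict[str, Any] = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def get_weather(city: str) -> str:
    # Stub: a real implementation would query a weather service.
    return f"Sunny in {city}"


LOCAL_FUNCTIONS: Dict[str, Callable[..., str]] = {"get_weather": get_weather}


def dispatch(tool_call: Dict[str, Any]) -> str:
    """Run the local function named in a model-produced tool call."""
    name = tool_call["function"]["name"]
    if name not in LOCAL_FUNCTIONS:
        raise ValueError(f"Unknown function: {name}")
    # Assumption: arguments arrive as a JSON string and must be parsed first.
    arguments = json.loads(tool_call["function"]["arguments"])
    return LOCAL_FUNCTIONS[name](**arguments)


# A simulated tool call, shaped the way this sketch assumes the model returns it.
simulated_call = {"function": {"name": "get_weather", "arguments": '{"city": "Prague"}'}}
answer = dispatch(simulated_call)
```

The result of `dispatch` would normally be sent back to the model so that it can phrase the final reply.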

Evolving Landscape of Language Models

The French Mistral model, created by eminent ex-researchers from Meta and DeepMind, is an exceptional opportunity for Europe in the AI field. This compact model, with only 7 billion parameters, offers low inference costs and outperforms Llama 2 13B on all benchmarks and Llama 1 34B on several reasoning benchmarks. Shortly after its release, several models based on it emerged, such as Mistral fine-tuned on the Dolphin dataset or Zephyr-7B-α.
Seznam.cz, a Czech company, released embedding models with a focus on small size and support for the Czech language.
Anthropic continues its effort to develop Constitutional AI, a system that aims to be proactively helpful, harmless, and honest. The system learns from general principles instead of relying on humans to correct it in individual cases via RLHF. Recently, Anthropic surveyed 1,000 Americans about the rules they wished AI to follow and compared these to its in-house constitution for the Claude model. The constitution aligns closely with Asimov’s first law of robotics while omitting the third law, which concerns a robot’s self-preservation. You can review the differences between Claude’s constitution and the public rules here.
Anthropic also published an insightful thread and article on interpreting the meanings of neurons within LLMs. These neurons usually represent a superposition of meanings. Using dictionary learning, the researchers demonstrated the extraction of specific meanings from a cluster of neurons, deriving approximately 4,000 distinct features from roughly 500 neurons.
Stanford researchers introduced the Foundation Model Transparency Index (FMTI), which examines the transparency of foundation models. Their assessment considers various indicators, including training data, architecture, abilities, and governing policies. Open models take the lead, with OpenAI’s GPT-4 securing a commendable third place.

New Prompting Methods

A novel “step-back” prompting technique developed at DeepMind demonstrates improved performance over Chain-of-Thought in several reasoning and knowledge-intensive benchmarks and also enhances RAG results. The key lies in formulating a more general question from the original query and using the broader answer to provide context for the initial query.
In collaboration with Stanford, DeepMind also introduced Analogical prompting, which instructs the LLM to create related few-shot exemplars independently (“Recall three distinct and pertinent problems.”) or produce a tutorial for the query’s fundamental concepts to aid in solving it. This approach simultaneously generates more relevant exemplars while reducing the effort needed to create them.
DeepMind also furthered their exploration of optimizing prompts with LLMs (see issue #5) in a manner resembling genetic algorithms. Using the Promptbreeder system, they initialized a population of prompts with various thinking styles and problem descriptions, along with mutation prompts. These mutation prompts guide the LLM to modify the initial instructions (e.g., “make it more fun”), and they are themselves evolved further to enhance the improvement process.
Robotics researchers utilized coding LLMs to devise reward functions for reinforcement learning in manipulation tasks, employing a genetic mutation operator. Their Eureka algorithm streamlines human text input to enhance reward generation and highlights the collaborative potential among varied AI models.
Meta AI devised the Chain-of-Verification method to mitigate hallucinations by generating concise verification questions from an initial response and answering them independently. Simpler verification questions were answered more accurately than the initial queries, and the approach outperformed yes/no questions and non-task-specific heuristics. The revised answers reduced hallucinations by up to 38%, and the method was particularly effective for list creation tasks.
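Two of the methods above, step-back prompting and Chain-of-Verification, are essentially short pipelines of LLM calls. The sketch below shows their control flow under stated assumptions: `llm` is a hypothetical callable that maps a prompt to a completion, and the prompt wording is our own illustration, not the wording used in the papers.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns a completion


def step_back_answer(question: str, llm: LLM) -> str:
    """Step-back prompting: ask a broader question first, then use its answer as context."""
    general_question = llm(
        "Rewrite the question below as a more general question about the "
        f"underlying principle or concept.\nQuestion: {question}"
    )
    background = llm(general_question)
    return llm(
        f"Background: {background}\n"
        f"Using the background above, answer the original question: {question}"
    )


def chain_of_verification(question: str, llm: LLM) -> str:
    """Chain-of-Verification: draft, plan checks, answer them independently, revise."""
    draft = llm(f"Answer the question: {question}")
    plan = llm(
        "List short verification questions, one per line, that check the facts "
        f"in this draft.\nDraft: {draft}"
    )
    checks: List[str] = [line.strip() for line in plan.splitlines() if line.strip()]
    # Each check is answered without the draft in the prompt,
    # so errors in the draft are not simply repeated.
    findings = [f"Q: {check}\nA: {llm(check)}" for check in checks]
    return llm(
        f"Original question: {question}\nDraft answer: {draft}\n"
        "Verification results:\n" + "\n".join(findings) + "\n"
        "Write a final answer that is consistent with the verification results."
    )
```

Any client can be plugged in as `llm`, for example a thin wrapper around a chat-completion endpoint; a recording stub is enough to check the flow offline.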

Multimodality

Recently, there has been a growing focus on multimodal models, notably marked by the introduction of the new GPT-4V (vision) from OpenAI. Additionally, LLaVA was upgraded to version 1.5 with improved instruction tuning. The OpenAI announcements mentioned above also contain some incremental improvements in the area of image and speech.

From a broader viewpoint, Chip Huyen extensively covered large multimodal models:

Categorized and elaborated on multimodal tasks, and emphasized models’ performance enhancement by incorporating additional modalities.
Explained the principles behind two prominent models – CLIP (using natural language supervision and contrastive learning) and Flamingo (incorporating a vision encoder like CLIP and a language model to discuss the image).
Offered valuable paper references concerning interesting research areas, like unifying multiple modalities into a single vector space, instruction-following for multimodal models, more efficient training, and generating multimodal outputs (such as GPT-4V’s ability to create tables in LaTeX).
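The contrastive learning behind CLIP can be sketched in a few lines of plain Python. This is a simplified illustration rather than CLIP's actual implementation: it assumes that embeddings are non-zero lists of floats, that the i-th image matches the i-th text, and that the temperature is fixed (the function name and the default value are our assumptions).

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    # Scale to unit length so the dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy(logits: Sequence[float], target: int) -> float:
    # Numerically stable -log(softmax(logits)[target]).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def contrastive_loss(image_embs: Sequence[Vector], text_embs: Sequence[Vector],
                     temperature: float = 0.07) -> float:
    """Symmetric contrastive loss over a batch of matching image-text pairs."""
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    n = len(images)
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j.
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
           for img in images]
    # Each image should pick its own text among all texts, and vice versa.
    image_to_text = sum(_cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_image = sum(
        _cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


# Toy batch: correctly paired embeddings vs. the same texts in swapped order.
toy_images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(toy_images, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(toy_images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero for the aligned batch and large for the swapped one, which is the signal that pushes matching images and texts together in the shared embedding space.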

Newsrooms’ Innovations and Exploration

Open Society Foundations pioneers an AI for Journalism Challenge, engaging 12 newsrooms in exploring AI applications, as reported by David Caswell. We are looking forward to the projects’ outcomes, which include, for example, the identification of emerging stories, the use of generative AI to broaden the reach to younger audiences, and the assessment of the impact of news on specific societies.
South Africa’s Daily Maverick plans to make its one-paragraph AI-generated article synopsis the default option due to positive reader feedback. The CEO said that most readers only read 25% of an article, but when this group is offered a synopsis, they tend to delve into at least three more articles during their site visit.
Reuters Institute’s analysis of ChatGPT with Bing search shows that ChatGPT is capable of maintaining neutrality on polarizing topics, while it lacks consistency in breaking news updates. In English queries on non-English events, it predominantly sources English content, neglecting original language sources.

Ethical Challenges of Generated Content

In his blog post, Ehud Reiter points out that generated texts, even when accurate, can sometimes lack the sensitivity that doctors exhibit. For example, doctors sometimes choose not to mention a highly unlikely diagnosis to avoid causing unnecessary alarm in patients and may avoid criticizing bad habits like smoking to prevent negative reactions. Moreover, LLMs may occasionally propose actions, like dietary recommendations, that individuals are incapable of carrying out, potentially impacting their self-esteem adversely.
Ethan Edwards explores the scalability of censorship facilitated by LLMs, highlighting their ability to evaluate every published text for potential risk. Traditionally, subversive topics must gain significant traction before censors can detect and label them for automatic identification. LLMs make it easier to identify such content before it has a chance to spread.
Matt Sheehan analyses China’s AI regulation, looking at its components, motivation, and roots, in a series of three papers.
The paper “Do the Rewards Justify the Means?” (2023/06) introduces a dataset for examining ethical behavior through text-based decision-making games. In essence, it reveals that reinforcement learning agents trained to maximize rewards exhibit Machiavellian behavior, while GPT-4 tends to display higher moral considerations, especially when given ethical prompts.
The C2PA’s introduction of the Content Credentials pin offers a means to inspect the creation of images and audio files, checking for AI usage and potential tampering. Although the image database is limited (based on our exploration), this tool holds promise for verifying edit history and attributing credit. Similarly, tools like GPTZero or Sapling aim to detect whether a text was AI-generated.
Wired reports on a study by ETH Zurich showing that LLMs can often deduce personal information from comments based on the details mentioned and the language used. You can compare yourself to an LLM here.

Please subscribe and stay tuned for the next issue of Geneea’s AI Spotlight newsletter!