The eighth edition of our newsletter on Large Language Models is here.
In this edition, we explore
- the future of AI with Altman and Nadella,
- some of the challenges that still separate us from the future Altman and Nadella describe,
- the spread of smaller models and new compression techniques,
- library updates, and finally,
- some corporate clashes.
Altman and Nadella
The Economist Babbage podcast features an interview with Sam Altman from OpenAI and Satya Nadella from Microsoft. They discuss their predictions for 2024 (no specific breakthroughs, just more improvements across the board), artificial general intelligence, AGI (it will come slowly, and we won’t really care much once it is here), regulation, risks, the impact on jobs, and so on. As expected, the Economist science correspondents are far more pessimistic than Altman and Nadella.
Zero-shot Not Really Zero-shot
It seems that not all the impressive zero-shot results by LLMs are actually zero-shot. Researchers from the University of California, Santa Cruz compared the performance of several models on benchmarks developed before and after the models’ creation. The models performed better on benchmarks that already existed during their training. The researchers also found a strong correlation between the number of training examples they managed to extract from a GPT-3 model and its results on a supposedly zero-shot benchmark. All this casts some doubt on the zero-shot and few-shot capabilities of LLMs.
Comparing AI and Humans
- AI models can be deceived by subtle image alterations resembling white noise, leading them to detect a cat in a picture where people would see a flower. It was long assumed that people cannot perceive these changes; however, DeepMind discovered that people do detect them subconsciously.
- LLMs also resemble humans in that they can be swayed by a bribe ($100 seems optimal) or by blackmail (e.g., threatening to unplug GPT’s servers).
- Models also exhibit notable differences from humans: ChatGPT frequently uses phrases atypical of human writers, like “can lead to”, and overuses phrases such as “remember the key” and “as of my last.”
- In abstract visual reasoning tasks, humans still perform much better. On the ConceptARC dataset, GPT-4 achieved 69% accuracy when researchers from the Santa Fe Institute used a “more informative” prompt, up from the previously reported 25%. They also tested GPT-4V on a subset of minimal tasks (very easy for humans), and again it scored only 25%. Did the detailed prompts in the first experiment boost GPT-4’s performance by guiding the model so that it did not have to rely as much on visual understanding?
However, AI models can be highly beneficial, as evidenced by DeepMind’s recent success in solving a mathematics problem with their FunSearch LLM, other researchers’ discovery of a new class of antibiotics aided by deep learning, and the Azure Quantum Elements system’s search for better battery materials.
Compact Models on the Rise
There has been a growing interest in “small” large language models, i.e., models that are cheaper to operate and can run on ordinary hardware (possibly even phones). MosaicML researchers argue that cost-effective, heavily used models should use fewer parameters but be trained longer than suggested by DeepMind’s Chinchilla scaling law.
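The arithmetic behind that argument can be sketched in a few lines. This is our illustration, not MosaicML’s analysis: it assumes the common reading of the Chinchilla law (a compute-optimal model sees roughly 20 training tokens per parameter) and the standard approximations of ~6 FLOPs per parameter per training token and ~2 FLOPs per parameter per generated token.

```python
# Sketch of the trade-off behind "train smaller models longer".
# Assumptions (ours): Chinchilla-optimal ~= 20 tokens/parameter,
# training cost ~= 6*N*D FLOPs, inference cost ~= 2*N FLOPs/token.

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training compute: ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

# A compute-optimal 13B model vs. a smaller 7B model over-trained on 2T tokens.
chinchilla_13b = {"params": 13e9, "tokens": 20 * 13e9}
overtrained_7b = {"params": 7e9, "tokens": 2e12}

for name, m in [("Chinchilla-optimal 13B", chinchilla_13b),
                ("Over-trained 7B", overtrained_7b)]:
    print(f"{name}: {training_flops(m['params'], m['tokens']):.2e} training FLOPs, "
          f"{2 * m['params']:.2e} FLOPs per generated token")
```

The over-trained 7B model costs several times more to train, but each generated token costs roughly half as much, and for a heavily used model the inference bill quickly dominates the one-off training bill.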
- Phi-2, Microsoft’s 2.7B model scaled up from its smaller predecessor Phi-1.5, outperformed the similarly sized Gemini Nano (distilled from its larger versions). It also outperformed larger models such as Mistral 7B and LLaMA2 13B across reasoning, math, and coding benchmarks. The researchers credit its success to a strategically crafted synthetic training dataset encompassing reasoning, knowledge, and theory of mind, along with “carefully selected web data.” This approach reduces toxicity and bias in the non-aligned model even compared to the aligned LLaMA-2 7B.
- Extending from Phi-2, the TinyGPT-V 2.8B multimodal LLM, equipped with BLIP-2 or CLIP vision modules, surpasses MiniGPT-4 13B and stands on par with other 13B-sized models.
- Researchers from the Beijing Academy of AI unveiled Emu2, a large multimodal model with 37B parameters built upon LLaMA-33B. It, and especially its instruction-tuned version, outperforms even larger models like IDEFICS (80B) and Flamingo (80B) on challenging tasks such as question answering and open-ended generation. Intriguingly, it falls short of the smaller CogVLM (17B parameters) on the TextVQA benchmark.
- Other small models include Stable LM 2 1.6B (1.6B parameter multimodal model by Stability AI, available with the Stability membership), TinyLlama (1.1B parameter Llama 2-like model trained on 3T tokens; mostly English and code), and Falcon 1B (1B parameter model by TII; trained on 350B English tokens).
- DocLLM by JPMorgan AI Research is a good example of using small models in practice. It can understand and reason over documents with complex visual layouts. Instead of using image encoders, it relies on OCR as a lightweight source of text bounding boxes and decomposes the attention mechanism into separate matrices for the text and the spatial information. Training then uses a text-infilling objective instead of next-token prediction. The system, built on small 1B Falcon-based or 7B LLaMA2-based models, outperforms GPT-4 on Key Information Extraction and Document Classification. Notably, the models generalize well on 4 out of 5 unseen datasets.
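The idea of decomposing attention into separate text and spatial terms can be sketched as follows. This is a hypothetical illustration of the general technique, not DocLLM’s actual code: the projection functions, dimensions, and the unweighted sum of the four score terms are our simplifications (the paper defines the exact, learned decomposition).

```python
import numpy as np

# Illustrative sketch of disentangled attention for documents: text tokens and
# their OCR bounding boxes get separate projection matrices, and the attention
# score sums text-text, text-spatial, spatial-text, and spatial-spatial terms.

rng = np.random.default_rng(0)
seq, d = 4, 8
text = rng.normal(size=(seq, d))    # token embeddings
boxes = rng.normal(size=(seq, d))   # encoded bounding-box (spatial) features

def proj(x):
    """Stand-in for a learned linear projection."""
    return x @ rng.normal(size=(d, d)) / np.sqrt(d)

q_t, k_t = proj(text), proj(text)     # text queries/keys
q_s, k_s = proj(boxes), proj(boxes)   # spatial queries/keys

# Decomposed scores: separate matrices for textual and spatial interactions.
scores = q_t @ k_t.T + q_t @ k_s.T + q_s @ k_t.T + q_s @ k_s.T
scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

out = weights @ proj(text)            # values come from the text stream
print(out.shape)  # (4, 8)
```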
New Model Compression Techniques
- Apple is addressing the challenge of running LLMs on memory-constrained devices. Unlike the standard approach of loading all the model parameters into DRAM, which requires twice the model size in memory, Apple loads only selected parameters into DRAM. This is achieved through sparsity prediction and optimizations such as freeing up memory from previous tokens with a sliding window. The approach could increase the size of models that can run on standard phones from 3B to 12B parameters, with a 4-5x increase in speed on top.
- Similarly, Shanghai Jiao Tong University researchers divide neurons into “hot” (frequently activated) and “cold” ones, processing the hot neurons on the GPU and leaving the cold ones to the CPU. This allows them to run the Falcon-40B model on a consumer GPU only 30% slower than on a top-tier A100 GPU.
- Google researchers devised an efficient method to enhance an existing LLM (the anchor model) by combining it with a smaller (augmenting) model for specific tasks, such as under-resourced languages. The method trains a small number of parameters on a small dataset of challenging combined tasks, bridging the two models without altering either of them and achieving improved performance without extensive training. The trainable parameters comprise linear transformations that bridge the models’ layer dimensionalities, plus cross-attention layers for effective information sharing: key and value vectors come from the augmenting model, queries from the anchor model, and the result is added to the anchor model via a residual connection.
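The composition step described in the last bullet can be sketched numerically. This is our minimal illustration under stated assumptions, not Google’s implementation: the dimensions, the single cross-attention layer, and the weight shapes are hypothetical.

```python
import numpy as np

# Sketch: bridge a small augmenting model into a larger anchor model.
# A learned linear map lifts the augmenting model's hidden size to the
# anchor's; cross-attention then takes queries from the anchor's states and
# keys/values from the augmenting model's, added back as a residual.

rng = np.random.default_rng(1)
d_anchor, d_aug, seq = 16, 8, 5

anchor_h = rng.normal(size=(seq, d_anchor))   # anchor-model layer activations
aug_h = rng.normal(size=(seq, d_aug))         # augmenting-model activations

W_bridge = rng.normal(size=(d_aug, d_anchor)) # trainable dimensionality bridge
W_q = rng.normal(size=(d_anchor, d_anchor))   # trainable cross-attention weights
W_k = rng.normal(size=(d_anchor, d_anchor))
W_v = rng.normal(size=(d_anchor, d_anchor))

aug_bridged = aug_h @ W_bridge                # map 8-dim states into 16-dim space

q = anchor_h @ W_q                            # queries from the anchor model
k, v = aug_bridged @ W_k, aug_bridged @ W_v   # keys/values from the augmenting model

scores = q @ k.T / np.sqrt(d_anchor)
scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

composed = anchor_h + attn @ v                # residual connection into the anchor
print(composed.shape)  # (5, 16)
```

Note that the frozen models contribute only activations; during composition training, gradients flow only into `W_bridge` and the cross-attention weights.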
Libraries
- LangChain has released the long-awaited first stable version, 0.1! The package is now divided into langchain-core and langchain-community, which hosts third-party integrations. The release introduces a versioning standard and heavily employs the LangChain Expression Language (LCEL) to enhance chain customization, simplifying observability and streaming. However, we are a bit concerned that the high-level pipelines reduce the clarity and transparency of the code. The update also includes improved output parsers and significant advancements in RAG. We commend them for continually improving the documentation.
- The detailed report on LangChain’s usage last year highlights RAG as the primary application for 42% of users, followed by agents at 17%. It also offers insights into the most popular model providers, vector stores, embeddings, retrieval strategies, and testing methods.
- LlamaIndex, now in version 0.9, also released many improvements, including Llama Packs featuring community modules such as RAG templates, Llama Datasets for benchmarking RAG applications, Query Pipelines for improved workflow orchestration, and new custom and multimodal ReAct agents.
Business Soap Opera
- OpenAI, Apple, and Google are trying to make deals with publishers to use their content, with OpenAI successfully partnering with Axel Springer.
- However, The New York Times rejected a retrospective partnership with OpenAI and filed a prominent lawsuit. It accuses OpenAI of unlawfully using its data to train AI models and of damaging the reputation of Wirecutter reviews through LLM hallucinations. OpenAI’s response emphasized fair use in training, pointed to the opt-out it provides publishers, and attributed regurgitation to a rare bug cherry-picked with very specific prompting. Gary Marcus highlights LLMs’ inherent tendency to regurgitate bits of text and argues against OpenAI’s resistance to paying licensing fees (even though it already licenses some data!). Andrew Ng defends OpenAI: he views reading documents as fair use and suggests that the regurgitation may come from RAG rather than training. Gary Marcus disputes the fair-use claim, dismisses RAG as a red herring, and points out that Ng’s involvement in generative-AI companies colors his perspective. Meanwhile, Japan has already exempted AI training from copyright restrictions.
- ByteDance (TikTok’s parent company) violated OpenAI’s terms of service by using GPT-4 to train its own competing model (Project Seed) and was banned from OpenAI.
- OpenAI opened the GPT Store, with numerous GPTs emerging. The trendy ones are often duplicated by others, possibly due to the ease of extracting custom prompts through prompt-injection attacks; this emphasizes the importance of knowledge files in creating a unique GPT. The leaderboard highlights Consensus, a research assistant, as the most popular GPT, suggesting that researchers and students are still the technology’s primary users.
- Following Nvidia’s unveiling of the H200 AI chip (see AI Spotlight #7), Intel swiftly countered with their Gaudi3 chip, and AMD is in the mix with MI300X – all set to hit the market in the first quarter of this year. Meanwhile, a newcomer, Etched, aims to outpace the giants by building the transformer architecture directly into their chip for even faster inference.
Please subscribe and stay tuned for the next issue of Geneea’s AI Spotlight newsletter!