The tenth edition of our newsletter on Large Language Models is here.
In this edition, we explore
- how generative AI is used by companies and people,
- new models and infrastructure,
- experience from real deployment,
- challenges, misuse, and
- ColBERT, an efficient approach to search.
The Business Side of Things
GenAI in Fortune 500 companies
Sarah Wang and Shangda Xu from Andreessen Horowitz, a VC fund, released a report on how companies build and buy generative AI. The results are based on responses of ~100 executives from Fortune 500 companies.
- Budgets for GenAI are growing fast: respondents expect them to grow 2x–5x in 2024. And unlike in 2023, most of this spending will come from regular budget categories, not innovation budgets. ROI is still unclear and hard to measure.
- Many companies lack the necessary technical experts. They are often paying the LLM API providers for support.
- In 2023, most companies experimented with one model (usually OpenAI). In 2024, more than 90% plan to use at least three models. This will allow them to choose the best model for the job and avoid vendor lock-in. Many applications are designed to make switching the LLM provider easy.
- In 2023, 80–90% of respondents used closed models. Now, many want to use open-source models instead. In many use cases and/or with sufficient fine-tuning, an open-source model can match closed-source models.
- Control (security of proprietary data and understanding the logic behind outputs) and customization are more important than cost.
- Customization via fine-tuning and RAG is far more common than pre-training.
- Companies use GenAI internally but are cautious about public-facing use cases.
Popular AI uses
Marc Zao-Sanders, the founder of Filtered, explored GenAI’s top uses. The summary was published in Harvard Business Review. The winner is technical assistance (think of RAG-based chatbots over documentation, for example). Content creation and editing tasks follow closely. This is the area Geneea is most familiar with. Let us know in the comments what you (would) like to use GenAI for!
Automatic programmers
Recently, the spotlight has been on Devin, Cognition’s automatic software engineer. Their demos showcase its abilities to code tasks end-to-end and even search for documentation and bugs. But with access to Devin moving at a snail’s pace, OpenDevin and Devika stepped in as open-source alternatives. Obviously, some criticism appeared, arguing Devin’s capabilities are exaggerated. Princeton released their own SWE-agent. It is able to fix 12.3% of issues in SWE-bench, Princeton’s own bug-fixing benchmark. Devin can fix 1.5 percentage points more, but only on a subset of the benchmark. GitHub didn’t stay behind either and added its own Autofix agent for fixing code vulnerabilities. Microsoft introduced its AI coding agent AutoDev, which orchestrates various agents with diverse tools to build, test, and version control projects in addition to coding.
In short
Despite revealing the MM1 model, Apple is in discussions with Google to integrate Gemini into iPhones, and with Baidu in China. Microsoft hired most of Inflection AI's team into its new Microsoft AI division, continuing its AI investments. And Mistral's future looks bright with the backing of a new partner, Snowflake.
At the GTC conference, Nvidia announced the Blackwell B100 AI GPU. It should have 3x more transistors than the H100 and, according to Nvidia, consume up to 25x less energy. But Nvidia had better keep an eye on the University of Pennsylvania’s silicon-photonic chip, which promises speed-of-light computation and minimal power usage. The chip has a variable height that scatters light in specific patterns, allowing it to perform vector-matrix multiplications.
Cloudflare is developing a novel firewall designed to safeguard LLMs and prevent exploitation attempts such as prompt injection. It includes a layer for prompt and response validation alongside features for sensitive data detection and rate limiting.
Model Zoo
- For a short time, Claude 3 Opus from Anthropic outperformed GPT-4 in the LMSYS Chatbot Arena. In our experience, its answers are more readable, pleasant, and human-like. OpenAI released a new GPT-4 version to reclaim the first spot, and rumors are circulating about the GPT-4 successor, expected in mid-2024. Either way, Anthropic seems like a good investment for Amazon.
- Cohere released Command-R, securing a great 7th place in Chatbot Arena. It focuses on enterprise use (low latency, high throughput) and Retrieval Augmented Generation (RAG).
- 01.AI released the open foundation model family Yi, which focuses on multimodality with an emphasis on visual comprehension. It now ranks 25th in Chatbot Arena.
- 273 Ventures released the Kelvin Legal Large Language Model (KL3M), a very small but indisputably legal model for the legal industry. It received the Fairly Trained certification for training only on public-domain data, mostly government and legal documents.
- Databricks released DBRX, a strong open-source model with 132 billion parameters and a Mixture of Experts (MoE) architecture, excelling particularly in coding tasks.
- After Elon Musk accused OpenAI of being ClosedAI, xAI released Grok-1, “the biggest LLM”, with 314 billion parameters. Grok uses MoE with two active experts, making it an 86B-parameter model at inference, with massive memory demands even when quantized.
- Yet, quantization techniques are advancing. Microsoft’s researchers have introduced ternary quantization, using only the values -1, 0, and 1. This significantly lowers cost while matching the performance of a 16-bit LLaMA 3B model.
- Apple’s researchers developed a multimodal model family MM1, varying in size and architecture. They were especially careful with composing their pre-training data, comprising image-text docs and image-caption pairs, and even studied the impact of image resolution. This helped them outperform the top models, such as Emu2 and Flamingo.
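The ternary quantization mentioned above maps every weight to -1, 0, or 1 plus a single per-tensor scale. A minimal numpy sketch of one such scheme (absmean rounding, in the style of Microsoft's BitNet work; the function names and toy weights are ours, not from the paper):

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Quantize a weight tensor to {-1, 0, 1} with one per-tensor scale.

    Absmean scheme: scale = mean(|w|), then round w/scale and clip to [-1, 1].
    """
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original weights."""
    return q.astype(np.float32) * scale

# toy 2x3 weight matrix
w = np.array([[0.42, -0.07, -0.95],
              [0.01,  0.60, -0.33]], dtype=np.float32)
q, scale = ternary_quantize(w)
# q holds only -1, 0, 1; matrix products then need no multiplications,
# just additions and subtractions, plus one rescale by `scale` at the end.
```

With ternary weights, the dominant cost of a matmul shifts from multiply-accumulate units to plain adders, which is where the claimed savings come from.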
Practical Advice
Check out this excellent post by Ken Kantzer, CTO at Truss, who summarizes their experience with LLMs after processing 500 million tokens. Their conclusions are very similar to ours. A few highlights:
- Short prompts are often better than detailed instructions.
- Typically, LangChain is overkill.
- GPT is really bad at saying nothing. It prefers to hallucinate instead of acknowledging the absence of requested information.
- Despite ever-increasing input contexts, the output length is still quite limited. They encounter problems when the output should contain ten or more items.
- RAG (Retrieval-Augmented Generation) is hard. There are no ideal similarity thresholds. Semantic searches are confusing for users. The old-fashioned faceted search might be better in many scenarios.
- Hallucinations are rare in information extraction (apart from the tendency, mentioned above, to make up results when there are none).
Challenges
Pledge to mitigate GenAI misuse
At the Munich Security Conference, the big tech companies pledged to mitigate the misuse of GenAI, as its use for misinformation is a major challenge. This can be tackled with several strategies: red teaming (humans probing for weaknesses), safety guardrails (protocols that steer models away from harmful outcomes), labeling creations with technical standards like C2PA (as OpenAI did with DALL-E), or detecting them (as Meta wants to do with images on its platforms).
Detecting AI-generated texts
Detecting AI-generated text is trickier, especially if we want to avoid mislabeling human texts as AI-generated, a particular risk for texts by non-native speakers. Fortunately, researchers from the University of Maryland and Carnegie Mellon University devised a method with far fewer false positives than previous ones. It detects over 90% of GPT-generated text, and it needs no modifications for different language models. Such classification usually relies on perplexity, a measure of how well a language model predicts the next word, which tends to differ between AI-generated and human-written text. The researchers add another measure, cross-perplexity, which measures how surprising the predictions of one language model are to another model.
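The two measures can be sketched in plain numpy. This is a simplified illustration of the idea, not the authors' implementation: we assume we already have one model's log-probabilities of the actual tokens and both models' full next-token distributions, and the way we combine the measures into a single score is our own simplification:

```python
import numpy as np

def log_perplexity(token_logprobs):
    """Average surprisal of the observer model on the tokens actually in the text."""
    return -np.mean(token_logprobs)

def cross_perplexity(p_performer, logp_observer):
    """Average cross-entropy: how surprising the performer model's
    next-token distributions are to the observer model."""
    return -np.mean(np.sum(p_performer * logp_observer, axis=-1))

def detector_score(token_logprobs, p_performer, logp_observer):
    # Lower scores suggest machine-generated text: the observer finds the text
    # about as predictable as the performer's own output distribution.
    return log_perplexity(token_logprobs) / cross_perplexity(p_performer, logp_observer)

# toy example: 3 token positions, vocabulary of 4
token_logprobs = np.log([0.5, 0.25, 0.4])  # observer's probs of the actual tokens
p_performer = np.array([[0.70, 0.10, 0.10, 0.10],
                        [0.40, 0.30, 0.20, 0.10],
                        [0.25, 0.25, 0.25, 0.25]])
logp_observer = np.log(np.array([[0.60, 0.20, 0.10, 0.10],
                                 [0.30, 0.30, 0.20, 0.20],
                                 [0.25, 0.25, 0.25, 0.25]]))
score = detector_score(token_logprobs, p_performer, logp_observer)
```

In practice, both sets of distributions would come from real language models (the hard part the paper addresses), and the decision threshold must be calibrated on held-out data.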
Faithful citations
Another credibility challenge is generating faithful citations to support the claims of generated texts. Researchers from the National University of Singapore, the University of Washington, and Nanyang Technological University trained models for this purpose using fine-grained rewards. They achieved around 60% precision and recall on the ELI5 dataset, 10% higher than ChatGPT’s performance. Even though GPT-4 would probably do a bit better, out-of-the-box usability remains a challenge, necessitating frameworks like RAG.
No butterfly effect
Contrary to the title of their paper, Butterfly Effect of Altering Prompts, researchers from the University of Southern California showed that the performance of larger models is relatively unaffected by minor, irrelevant additions to prompts such as “hello,” “thank you,” or offering a tip. However, prompting for a specific format, such as JSON or CSV, may lower the performance differently for specific models, depending on their training data. Not surprisingly, jailbreak prompts hurt the performance a lot.
ColBERT: Efficient and Effective Passage Search
Four years ago, Omar Khattab and Matei Zaharia from Stanford published a great article describing ColBERT, a retrieval model that strikes a middle ground between fast but less accurate search approaches (keywords, embedding similarity) and slow but more accurate approaches that use LLMs.
- In ColBERT, documents and queries are represented as matrices of contextualized token embeddings (computed by BERT and passed through a linear layer that reduces their dimension).
- The score is computed as the sum of MaxSim over all query tokens, where MaxSim for a query token is its maximum similarity over all document tokens.
- BERT is fine-tuned, and linear layers are trained using triples (query, positive document, negative document).
- ColBERT can be used either for re-ranking pre-selected results or for full retrieval (optimized using search indexes like Faiss). For re-ranking, ColBERT is competitive with BERT-based approaches in quality but achieves more than 100x shorter latency. For full retrieval, ColBERT is ~5x slower than traditional approaches but achieves significantly higher retrieval quality.
- Research continued with ColBERTv2, and the approach made its way into a RAG library called RAGatouille.
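The MaxSim scoring described above is compact enough to sketch directly. A minimal numpy illustration, assuming the token embeddings are already L2-normalized so that dot products are cosine similarities (the toy vectors are ours):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT late-interaction score.

    query_emb: (num_query_tokens, dim) contextualized token embeddings
    doc_emb:   (num_doc_tokens, dim)
    Both are assumed L2-normalized, so a dot product is a cosine similarity.
    """
    # similarity of every query token to every document token
    sim = query_emb @ doc_emb.T          # shape: (q_tokens, d_tokens)
    # MaxSim: each query token keeps only its best-matching document token,
    # and the document score is the sum of those maxima
    return float(sim.max(axis=1).sum())

# toy example: 2 query tokens, 3 document tokens, dim 2
q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
d = np.array([[1.0, 0.0],
              [0.6, 0.8],
              [0.0, 1.0]])
score = maxsim_score(q, d)  # each query token finds an exact match -> 2.0
```

Because document embeddings can be precomputed and indexed offline, only this cheap max-and-sum interaction runs at query time, which is what makes ColBERT so much faster than feeding query-document pairs through BERT.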
Please subscribe and stay tuned for the next issue of Geneea’s AI Spotlight newsletter!