The fifth edition of our newsletter on Large Language Models is here.
Today, we take a look at
- the race among industry leaders,
- the challenges of real-world applications,
- some new findings and framework releases, and
- how more and more websites are blocking AI data crawlers.
Clash of the Titans
Google – the empire strikes back:
- Google announced the addition of Llama 2 and Falcon support to Vertex, their AI platform. Claude 2 should be available soon. The models are easy to use, though not as easy as calling the Palm 2 API: you need to deploy them yourself. Google provides wizards for this, but you still need to pick the right hardware depending on the model and your expected load. And even though the models themselves are free, you might end up paying much more than you would for the Palm 2 or GPT APIs.
- Palm 2 added support for 32k context windows and fine-tuning.
- According to The Information (see Reuters article), Google is close to releasing Gemini, its new powerful model suite. A handful of businesses have been given early access to some of these models. Gemini is being positioned as a direct competitor to GPT-4, but Demis Hassabis, Google DeepMind’s CEO, says that it will combine a large language model (LLM) with planning and problem-solving abilities. (It was DeepMind’s AlphaGo that defeated the world’s number-one-ranked Go player.) There are rumors that Gemini is significantly more powerful than GPT-4.
Meta’s big plans
- According to the Wall Street Journal, Meta has big plans in the AI domain after it fell behind the other big players in AI commercialization. It is working hard on a new model comparable to GPT-4. Currently, it is expanding its data centers and acquiring the necessary GPUs.
- It is hard to say how much difference this will make. About one-third of Meta’s LLM researchers left the company last year (some voluntarily, some not). Also, GPT-4 is here now, and Meta is only planning to start training the new model in early 2024. This also means it will probably be released after Google’s Gemini.
- According to WSJ, Zuckerberg wants the model to be open-source and free, but Meta’s lawyers think this might be too risky.
Microsoft Copilot, Ernie, and chips
- Microsoft announced Copilot, a unified AI assistant available in Windows 11, Microsoft 365, Edge, and Bing. Even Paint will get some AI. Sounds like Cortana 2.0.
- A few days ago, Anthropic announced a $4B investment from Amazon and tighter integration with AWS.
- Last month, OpenAI launched a business version of ChatGPT that competes with ChatGPT deployment in Microsoft Azure (Reuters).
- Meanwhile, Baidu has introduced Ernie, its own alternative to ChatGPT. The launch had been delayed several times; it was initially scheduled for March but scrapped at the last moment. The Economist has taken a closer look at the challenges of running such a system in China. According to local regulations, the chatbot must align with the fundamental principles of socialism. Interestingly, the chatbot claims that COVID-19 originally came from the United States and was later transmitted to Wuhan in China. The New York Times compared Ernie’s answers with those of ChatGPT.
- One of the main bottlenecks of AI development – the shortage of GPUs – remains (see articles by FT and WSJ). All of Nvidia’s chips are made by a single company, Taiwan Semiconductor Manufacturing Company (TSMC), which expects the shortage to last until 2025. We wonder what Ernie thinks about this.
Hype meets reality
As we move from being astonished that LLMs can suggest ten ideas for a blog to more practical applications, more and more challenges surface. Finally, there is some correction in expectations.
Gary Marcus has been skeptical since the start. Maybe too skeptical. He has been stressing that AI is much more than language models (e.g., planning of complex workflows), that AGI is not imminent, etc. Ted Gioia even argues that Microsoft’s bet on AI just created a new version of Clippy.
Other experts were less pessimistic, but they still stressed that bringing LLMs to production takes some nontrivial effort. We mentioned some of those concerns before:
- “Building LLM applications for production”, an excellent post by Chip Huyen (see issue #1)
- “All the Hard Stuff Nobody Talks About when Building Products with LLMs” by Phillip Carter from Honeycomb.io (issue #2)
- “Lost in the Middle: How Language Models Use Long Contexts” (Longer contexts are not the silver bullet in issue #4)
- “How is ChatGPT’s behavior changing over time?” (Changing quality of GPT results in issue #4)
Associated Press explores the problem of hallucinations. While Sam Altman, the CEO of OpenAI, thinks that hallucinations will be alleviated in two years, for now, he trusts ChatGPT’s answers “the least of anybody on Earth”. Emily M. Bender, a linguistics professor, considers them an inherent property of LLMs as they are “designed to make things up”. For some use cases, such as marketing, “hallucinations are actually an added bonus” suggests Shane Orlick, president of Jasper AI.
Also, according to Similarweb.com (as reported by Reuters), the number of ChatGPT users has declined for three months in a row. This might be AI fatigue, or it might be just school kids being on vacation.
LLMs are hungry, thirsty, and take deep breaths
We knew that LLMs are great electricity hogs, but they are also quite thirsty, and now it seems they work better when taking deep breaths:
- Researchers from the University of California, Riverside, show that LLMs use a surprisingly large amount of water in both training and inference, mostly for cooling and during electricity generation. Microsoft’s water consumption rose by a third between 2021 and 2022, mainly due to AI development.
- According to the DeepMind paper on prompt optimization discussed below, Palm 2 was best at solving certain mathematical tasks when its prompt started with “Take a deep breath and work on this problem step by step.” Without taking a deep breath, the results were worse.
Fine-tuning, prompts, and LlamaIndex
Instruction Tuning for Large Language Models: A Survey (2023-08)
- A great survey of approaches to instruction tuning. Instruction tuning is what turns language models, i.e., models that predict the most likely next word, into chatbots.
- The authors review instruction-tuning datasets, efficient methods for fine-tuning (LoRA, HINT, LOMO…), and various model types (imitation models, multimodal models, and models tuned for specific domains, such as writing, coding, medical, …).
- They discuss the main challenges, including limited dataset diversity and that models learn only surface patterns from training tasks.
- The paper covers so much important information that we decided to dedicate a separate post to our notes.
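To make the idea concrete: an instruction-tuning dataset is essentially a collection of (instruction, response) pairs. The record below is made up for illustration; real datasets use similar fields, though the names and structure vary.

```python
# A made-up instruction-tuning record (field names are illustrative only).
example = {
    "instruction": "Summarize the following text in one sentence.",
    "input": "Large language models are trained to predict the next word...",
    "output": "LLMs learn language by predicting the next word in text.",
}

# Fine-tuning on many such pairs teaches the model to follow the instruction
# instead of merely continuing the input text.
prompt = f"{example['instruction']}\n\n{example['input']}"
```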
Large Language Models as Optimizers (2023-09)
- DeepMind shows that prompts can be very effectively optimized with LLMs.
- Two LLMs cooperate on the optimization: the scorer assigns a score to prompts generated by the optimizer. The optimizer’s task, defined in a meta-prompt, is to find a prompt with the highest score based on previous prompts and their scores.
- The open questions include how to avoid overfitting to training data, how to use error examples, and how to select the initial conditions.
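The optimizer/scorer loop described above can be sketched as follows. This is only a minimal illustration of the idea, not the paper’s implementation: `score_prompt` and `call_optimizer_llm` are hypothetical stand-ins for the two LLM calls, stubbed here so the loop actually runs.

```python
import random

def score_prompt(prompt: str) -> float:
    """Scorer LLM (stubbed): evaluate a candidate prompt on the task.
    Toy heuristic: reward longer prompts that ask for step-by-step work."""
    return float(len(prompt)) + (50.0 if "step by step" in prompt else 0.0)

def call_optimizer_llm(meta_prompt: str) -> str:
    """Optimizer LLM (stubbed): propose a new prompt given the meta-prompt."""
    candidates = [
        "Solve the problem.",
        "Let's think step by step.",
        "Take a deep breath and work on this problem step by step.",
    ]
    return random.choice(candidates)

def optimize_prompt(steps: int = 20) -> str:
    history = []  # (prompt, score) pairs seen so far
    for _ in range(steps):
        # Meta-prompt: show previous prompts sorted by score and ask the
        # optimizer for a prompt that achieves an even higher score.
        shown = sorted(history, key=lambda ps: ps[1])
        meta_prompt = (
            "Previous prompts and scores:\n"
            + "\n".join(f"{s:.1f}: {p}" for p, s in shown)
            + "\nWrite a new prompt that achieves a higher score."
        )
        candidate = call_optimizer_llm(meta_prompt)
        history.append((candidate, score_prompt(candidate)))
    # Return the best-scoring prompt found.
    return max(history, key=lambda ps: ps[1])[0]
```

With real LLMs behind the two stubs, the loop gradually steers the optimizer toward prompts like the “take a deep breath” one mentioned above.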
LlamaIndex Updates (Sep 3 & Sep 20)
- A fully working RAG application based on LlamaIndex, including UI, was open-sourced (GitHub). A RAG (Retrieval Augmented Generation) application searches external data and uses an LLM to generate answers.
- Linear adapters allow tuning embeddings to a particular use case without re-embedding (more details here).
- Agents can now be composed hierarchically, which means you can easily combine agents, each specialized for a particular task. See this notebook for an example.
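The linear-adapter trick can be illustrated outside LlamaIndex, too: a learned matrix is applied to query embeddings only, so the stored document embeddings never have to be recomputed. A minimal numpy sketch, with a made-up “trained” matrix standing in for the adapter that LlamaIndex would fit for you:

```python
import numpy as np

dim = 4  # embedding dimension (tiny, for illustration)

# Pre-computed document embeddings -- these stay FIXED; no re-embedding needed.
doc_embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

# The linear adapter: a matrix W learned from (query, relevant document)
# pairs. Here it is made up: pretend training "learned" to swap the first
# two dimensions of the query space.
W = np.eye(dim)
W[[0, 1]] = W[[1, 0]]

def retrieve(query_emb: np.ndarray, top_k: int = 1) -> list[int]:
    """Transform the query with the adapter, then do plain cosine search."""
    q = W @ query_emb
    sims = doc_embeddings @ q / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(q)
    )
    return list(np.argsort(-sims)[:top_k])
```

Because only `W` changes during tuning, the (potentially huge) document index is left untouched.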
Training Data & Intellectual Property
A growing number of websites are labeling their pages as off-limits for AI crawlers.
- As of September 22, nearly 26% of the top 1,000 websites (including Amazon, Quora, Bloomberg, CNN, NYT, and Reuters) were using robots.txt to block GPTBot, according to Originality.AI, an AI content detection service.
- Only 14% blocked the Common Crawl bot (CCBot). This makes little sense, because OpenAI also trains on Common Crawl data.
- Also, blocking only GPTBot means the pages are not included in the training of OpenAI’s models. However, they can still be downloaded for use by plugins. To prevent that, it is necessary to block ChatGPT-User.
- For some reason, only two sites were blocking Anthropic’s crawler.
- We suspect the inconsistent treatment of crawlers is not intentional. For example, Reuters was not blocking the Common Crawl bot on September 22, but when we checked again on September 24, it was.
- An article by The Guardian explains why they are blocking GPTBot and that they are open to “mutually beneficial commercial relationships with developers”.
- If you do not want your pages to be used for training AI models, read the Originality.AI post for instructions on how to set up robots.txt properly. But be aware that there is no common standard, so other players might still crawl your pages.
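As an illustration, a robots.txt that blocks the crawlers mentioned above might look like this (GPTBot, ChatGPT-User, and CCBot are the documented user-agent names; check the Originality.AI post for the current list):

```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /
```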
Please subscribe and stay tuned for the next issue of Geneea’s AI Spotlight newsletter!