Our notes from reading: Zhang, S., Dong, L., Li, X., Zhang, S., Sun, X., Wang, S., Li, J., Hu, R., Zhang, T., Wu, F., & Wang, G. (2023). Instruction Tuning for Large Language Models: A Survey. arXiv preprint arXiv:2308.10792.
In short:
- A great survey of approaches to instruction tuning. Instruction tuning is what turns language models, i.e., systems that predict the most likely next word, into chatbots.
- The authors review instruction-tuning datasets, efficient fine-tuning methods (LoRA, HINT, LOMO, …), and various model types (imitation models, multimodal models, and models tuned for specific domains such as writing, coding, and medicine).
- They discuss the main challenges, including limited dataset diversity and the risk that models learn only surface patterns from their training tasks.
- The paper covers so much important information that we decided to dedicate a separate post to our notes.
Findings of Instruction Tuning Survey
For readers interested in the more technical aspects of language models, we highlight a few key insights from the thorough survey of instruction tuning mentioned in our newsletter.
- Instruction tuning (IT) aligns a model’s behavior with the user’s objectives: instead of merely predicting the next word, the model performs the task the user asks for. This makes the model’s behavior more controllable and predictable, and it can also adapt the model to a specific domain. Training examples are typically (instruction, input, output) triples, roughly as in the sketch below.
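To make the data format concrete, here is a minimal sketch of how one instruction-tuning example is usually rendered into a training string. The Alpaca-style template below is one common choice rather than a fixed standard; the field names and wording are our illustrative assumptions.

```python
# One instruction-tuning example: an (instruction, input, output) triple.
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Instruction tuning fine-tunes a pretrained language model on "
             "pairs of instructions and desired responses.",
    "output": "Instruction tuning teaches a language model to follow user requests.",
}

# Alpaca-style prompt template (illustrative; datasets differ in wording).
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input.\n"
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def format_example(ex: dict) -> str:
    """Render one triple into a training string; the loss is usually
    computed only on the tokens after '### Response:'."""
    prompt = PROMPT_TEMPLATE.format(instruction=ex["instruction"], input=ex["input"])
    return prompt + ex["output"]

print(format_example(example))
```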
Instruction Tuning Datasets
- There are now quite a few datasets for IT, created either by converting existing NLP datasets or by generating new data with the help of LLMs. However, most of them are English-only, some are English-Chinese, and there are few truly multilingual datasets. This raises concerns about dataset quality and diversity (in both topics and languages).
- Researchers from the Allen Institute and the University of Washington found that the best performance across tasks is achieved by manually combining multiple datasets (a recipe sketched below), and that smaller models and those with a high-quality base benefit from IT the most.
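As an illustration of that manual-combination recipe, here is a hedged sketch using the Hugging Face `datasets` library. The dataset IDs and the 70/30 mixing ratio are our own illustrative choices, not the paper’s configuration.

```python
from datasets import load_dataset, interleave_datasets

# Two public instruction-tuning corpora (IDs were valid at the time of writing).
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# Project both corpora onto a shared (instruction, response) schema.
alpaca = alpaca.map(
    lambda ex: {"instruction": ex["instruction"], "response": ex["output"]},
    remove_columns=alpaca.column_names,
)
dolly = dolly.map(
    lambda ex: {"instruction": ex["instruction"], "response": ex["response"]},
    remove_columns=dolly.column_names,
)

# The "manual combination" step: interleave with hand-picked proportions.
mixed = interleave_datasets([alpaca, dolly], probabilities=[0.7, 0.3], seed=42)
print(mixed[0])
```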
Training of Instruction Models
- The models are fine-tuned on various IT datasets. Sometimes they are then further tuned with reinforcement learning using proximal policy optimization (PPO), as popularized by ChatGPT, where the reward model is almost always built with the help of GPT-4. The most popular base model seems to be LLaMA.
- A popular approach is LLM imitation, where smaller models are fine-tuned on responses generated by a large LLM. This brings the smaller model close to the LLM’s performance on the dataset’s tasks. However, researchers from Berkeley observed no improvement, and sometimes even a slight decline, on tasks the model was not specifically trained for.
- There are many ways to make fine-tuning more efficient, such as the well-known LoRA and QLoRA (a minimal LoRA setup is sketched after this list). The interesting HINT technique employs adapters and prefixes to avoid recomputing long, repeated instructions and few-shot examples. LOMO reduces memory requirements by fusing gradient computation with the parameter update, so full gradients never need to be stored; training uses plain SGD, which the authors showed can work well enough for large models.
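For reference, this is roughly what a LoRA setup looks like with the `peft` library. The rank, scaling factor, target modules, and the model ID are illustrative assumptions, not settings from the survey.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; any causal LM from the Hub works the same way.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
# The wrapped model trains with any standard fine-tuning loop; only the
# small LoRA matrices receive gradients and optimizer state.
```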
Evaluation of Trained Models
- The models’ performance is often evaluated by ChatGPT or GPT-4 acting as a judge (a minimal version is sketched after this list). However, researchers from Stanford performed a holistic evaluation (HELM) of many models on many tasks with multiple different metrics. Notably, even after multiple evaluation runs, GPT-3 and YaLM still showed differences from the results in their original reports.
- There is a concern that the models do not truly comprehend the presented tasks but only capture surface patterns, which is apparent from an empirical study in which models performed similarly when trained on delusive (misleading) and on simplified examples. LIMA exploits this superficial alignment hypothesis by tuning on a much smaller dataset, as reported before in AI Spotlight #2. Similarly, researchers from Arizona State University and Microsoft estimated the minimal training data required to match the SOTA on various tasks: as little as 25% of the data for single-task and 6% for multi-task learning.
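Here is a minimal sketch of the GPT-4-as-judge pattern mentioned above, using the OpenAI Python client. The prompt wording and the 1-10 scale are our illustrative choices; real evaluations usually add detailed rubrics, position swapping, and multiple samples per answer.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are grading an AI assistant's answer.

Question: {question}
Reference answer: {reference}
Assistant's answer: {candidate}

Rate the assistant's answer from 1 (useless) to 10 (perfect).
Reply with the number only."""

def judge(question: str, reference: str, candidate: str) -> int:
    """Ask GPT-4 for a single integer quality score (illustrative rubric)."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # make grading as deterministic as possible
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return int(response.choices[0].message.content.strip())
```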
Models for Specific Domains
- Many models have already been instruction-tuned for specific domains, such as writing assistants (Writing-Alpaca-7B, CoEdIT, CoPoet), coding assistants (WizardCoder), dialogue and medical models (InstructDial, ChatDoctor, and ChatGLM-Med), more classical tasks such as intent classification and slot tagging (LINGUIST), information extraction (InstructUIE), and sentiment analysis, and even interpreting radiologists’ observations (Radiology-GPT).
- There are also multimodal models that answer questions about images, classify them, count objects, and write captions (MultiModal-GPT, Otter, LLaVA, InstructBLIP). InstructPix2Pix, based on Stable Diffusion and trained with data generated by GPT-3, can edit images from text instructions. And Video-LLaMA, with both image and audio encoders, can recognize common concepts in videos and generate textual descriptions (among other tasks) via an integrated LLaMA or Vicuna LLM.