Paper:《Instruction Tuning for Large Language Models: A Survey—大型語言模型的指令調(diào)優(yōu)的綜述》翻譯與解讀
導(dǎo)讀:2023年8月21日,浙江大學(xué)等團隊,發(fā)布了《Instruction Tuning for Large Language Models: A Survey》。指令微調(diào)是在大規(guī)模語言模型的基礎(chǔ)上,使用包含(指令,輸出)的監(jiān)督數(shù)據(jù)進行進一步訓(xùn)練,以減小模型原有的預(yù)測目標與用戶指令之間的差距。其目的是增強模型的能力和可控性。
>> 指令微調(diào)的方法,包括構(gòu)建指令數(shù)據(jù)集、進行指令微調(diào)等。構(gòu)建指令數(shù)據(jù)集可基于現(xiàn)有數(shù)據(jù)集轉(zhuǎn)換,也可以使用語言模型自動生成。指令微調(diào)則是在指令數(shù)據(jù)集上進行監(jiān)督訓(xùn)練。
>> 指令數(shù)據(jù)集的類型,包括自然指令、非自然指令、跨語言指令、對話指令等多種類型。
>> 應(yīng)用指令微調(diào)的語言模型,如InstructGPT、Alpaca、Vicuna等在大型預(yù)訓(xùn)練語言模型基礎(chǔ)上進行指令微調(diào)的模型。
>> 指令微調(diào)的效果評估、分析和批評,需要關(guān)注指令數(shù)據(jù)集的質(zhì)量、指令學(xué)習(xí)是否只停留在表面模仿等問題。
>> 提高指令微調(diào)效率的方法,如基于適配器、重參數(shù)化等方法來進行高效微調(diào)。
LLMs指令微調(diào)技術(shù)通過構(gòu)建豐富的指令數(shù)據(jù)集和采用有監(jiān)督學(xué)習(xí)的方式,能有效提升開源LLMs的能力和可控性。主要技術(shù)點包括:構(gòu)建多種類型的指令數(shù)據(jù)集(自然指令、非自然指令以及多模態(tài)指令等);在GPT、T5、LLaMA等骨干模型上進行指令微調(diào);並采用LOMO、DELTA微調(diào)等高效微調(diào)技術(shù)。指令微調(diào)取得了很好的效果,但其是否只是學(xué)習(xí)表面模式尚存爭議,未來應(yīng)注重提升指令質(zhì)量並進行多方面評估。
相關(guān)文章
LLMs之Data:指令微調(diào)的簡介、Self Instruction思想(一種生成指令數(shù)據(jù)集的方法論—主要用在指令微調(diào)階段)的簡介、Alpaca/BELLE應(yīng)用、實戰(zhàn)案例代碼實現(xiàn)之詳細攻略_一個處女座的程序猿的博客-CSDN博客
《Instruction Tuning for Large Language Models: A Survey—大型語言模型的指令調(diào)優(yōu)的綜述》翻譯與解讀
地址
論文地址:https://arxiv.org/abs/2308.10792
文章地址:Instruction Tuning for Large Language Models: A Survey | Papers With Code
文章地址:Instruction Tuning for Large Language Models: A Survey - AMiner
時間
2023年8月21日
作者
浙江大學(xué)等
Shengyu Zhang?, Linfeng Dong?, Xiaoya Li?, Sen Zhang?
Xiaofei Sun?, Shuhe Wang?, Jiwei Li??, Runyi Hu?
Tianwei Zhang▲, Fei Wu? and Guoyin Wang
Abstract摘要
指令微調(diào)技術(shù)(增強LLM的能力和可控性,有監(jiān)督微調(diào)+增量訓(xùn)練)、指令對
This paper surveys research works in the quickly advancing field of instruction tuning (IT), a crucial technique to enhance the capabilities and controllability of large language models (LLMs). Instruction tuning refers to the process of further training LLMs on a dataset consisting of (Instruction, Output) pairs in a supervised fashion, which bridges the gap between the next-word prediction objective of LLMs and the users’ objective of having LLMs adhere to human instructions. In this work, we make a systematic review of the literature, including the general methodology of IT, the construction of IT datasets, the training of IT models, and applications to different modalities, domains and applications, along with an analysis on aspects that influence the outcome of IT (e.g., generation of instruction outputs, size of the instruction dataset, etc.). We also review the potential pitfalls of IT along with criticism against it, point out current deficiencies of existing strategies, and suggest some avenues for fruitful research.
本文調(diào)查了指令微調(diào)(IT)領(lǐng)域中的研究工作,這是一種關(guān)鍵技術(shù),用于增強大型語言模型(LLM)的能力和可控性。指令微調(diào)是指以監(jiān)督方式進一步訓(xùn)練LLM,使用由(Instruction, Output)對組成的數(shù)據(jù)集,從而彌合LLM的下一個詞預(yù)測目標與用戶要求LLM遵循人類指令的目標之間的差距。
在本工作中,我們對文獻進行了系統(tǒng)回顧,包括IT的一般方法、IT數(shù)據(jù)集的構(gòu)建、IT模型的訓(xùn)練,以及在不同模態(tài)、領(lǐng)域和應(yīng)用場景中的應(yīng)用,並分析了影響IT結(jié)果的因素(例如,指令輸出的生成、指令數(shù)據(jù)集的大小等)。我們還回顧了IT的潛在風(fēng)險和對它的批評,指出了現(xiàn)有策略的不足之處,並提出了一些有益的研究方向。
1 Introduction引言
LLM顯著進展(GPT-3→PaLM→LLaMA)、當前痛點(訓(xùn)練目標與用戶目標間的不匹配)、
The field of large language models (LLMs) has witnessed remarkable progress in recent years. LLMs such as GPT-3 (Brown et al., 2020b), PaLM (Chowdhery et al., 2022), and LLaMA (Touvron et al., 2023a) have demonstrated impressive capabilities across a wide range of natural language tasks (Zhao et al., 2021; Wang et al., 2022b, 2023a; Wan et al., 2023; Sun et al., 2023c; Wei et al., 2023; Li et al., 2023a; Gao et al., 2023a; Yao et al., 2023; Yang et al., 2022a; Qian et al., 2022; Lee et al., 2022; Yang et al., 2022b; Gao et al., 2023b; Ning et al., 2023; Liu et al., 2021b; Wiegreffe et al., 2021; Sun et al., 2023b,a;Adlakha et al., 2023; Chen et al., 2023). One of the major issues with LLMs is the mismatch between the training objective and users’ objective: LLMs are typically trained on minimizing the contextual word prediction error on large corpora; while users want the model to "follow their instructions helpfully and safely" (Radford et al., 2019; Brown et al., 2020a; Fedus et al., 2021; Rae et al., 2021; Thoppilan et al., 2022)
近年來,大型語言模型(LLM)領(lǐng)域取得了顯著進展。諸如GPT-3(Brown等,2020b)、PaLM(Chowdhery等,2022)和LLaMA(Touvron等,2023a)等LLM在各種自然語言任務(wù)中展示了令人印象深刻的能力。
LLM的一個主要問題是訓(xùn)練目標與用戶目標之間的不匹配:LLM通常在最小化大型語料庫上的上下文詞預(yù)測誤差的基礎(chǔ)上進行訓(xùn)練,而用戶希望模型“有助于并安全地遵循他們的指令”(Radford等,2019;Brown等,2020a;Fedus等,2021;Rae等,2021;Thoppilan等,2022)。
提出指令微調(diào)技術(shù)(解決不匹配)、指令微調(diào)的3個好處(彌合差距+更可控可預(yù)測(為人類提供介入模型行為的渠道)+計算高效)
To address this mismatch, instruction tuning (IT) is proposed, serving as an effective technique to enhance the capabilities and controllability of large language models. It involves further training LLMs using (Instruction, Output) pairs, where INSTRUCTION denotes the human instruction for the model, and OUTPUT denotes the desired output that follows the INSTRUCTION. The benefits of IT are threefold: (1) Finetuning an LLM on the instruction dataset bridges the gap between the next-word prediction objective of LLMs and the users’ objective of instruction following; (2) IT allows for a more controllable and predictable model behavior compared to standard LLMs. The instructions serve to constrain the model’s outputs to align with the desired response characteristics or domain knowledge, providing a channel for humans to intervene with the model’s behaviors; and (3) IT is computationally efficient and can help LLMs rapidly adapt to a specific domain without extensive retraining or architectural changes.
為了解決這種不匹配,提出了指令微調(diào)(IT),作為增強大型語言模型能力和可控性的有效技術(shù)。它涉及使用(Instruction, Output)對進一步訓(xùn)練LLM,其中Instruction表示給模型的人類指令,Output表示遵循該指令的期望輸出。
IT的好處有三個:
(1)在指令數(shù)據(jù)集上微調(diào)LLM彌合了LLM的下一個詞預(yù)測目標與用戶遵循指令目標之間的差距;
(2)與標準LLM相比,IT允許模型行為更可控和可預(yù)測。指令用于限制模型的輸出,使其與期望的響應(yīng)特性或領(lǐng)域知識保持一致,為人類提供介入模型行為的渠道;
(3)IT在計算上是高效的,并且可以幫助LLM在不需要大量重新訓(xùn)練或架構(gòu)更改的情況下迅速適應(yīng)特定領(lǐng)域。
指令微調(diào)的3大挑戰(zhàn):高質(zhì)量性、改善嚴重依賴數(shù)據(jù)性、可能只學(xué)皮毛性
Despite its effectiveness, IT also poses challenges: (1) Crafting high-quality instructions that properly cover the desired target behaviors is non-trivial: existing instruction datasets are usually limited in quantity, diversity, and creativity; (2) there has been an increasing concern that IT only improves on tasks that are heavily supported in the IT training dataset (Gudibande et al., 2023); and (3) there has been an intense criticism that IT only captures surface-level patterns and styles (e.g., the output format) rather than comprehending and learning the task (Kung and Peng, 2023). Improving instruction adherence and handling unanticipated model responses remain open research problems. These challenges highlight the importance of further investigations, analysis, and summarization in this field, to optimize the fine-tuning process and better understand the behavior of instruction fine-tuned LLMs.
盡管其有效性,IT也帶來了挑戰(zhàn):
(1)制定高質(zhì)量的指令以正確覆蓋所需的目標行為并不容易:現(xiàn)有的指令數(shù)據(jù)集通常在數(shù)量、多樣性和創(chuàng)意方面受限;
(2)越來越多的人擔心,IT只會改善那些在IT訓(xùn)練數(shù)據(jù)集中得到大量支持的任務(wù)(Gudibande et al., 2023);
(3)有人強烈批評IT只捕獲表面模式和樣式(例如,輸出格式),而不是理解和學(xué)習(xí)任務(wù)(Kung和Peng,2023)。
改進指令遵循和處理意外模型響應(yīng)仍然是未解決的研究問題。
這些挑戰(zhàn)強調(diào)了進一步調(diào)查、分析和總結(jié)在這一領(lǐng)域的重要性,以優(yōu)化微調(diào)過程并更好地理解經(jīng)過指令微調(diào)的LLM的行為。
In the literature, there has been an increasing research interest in analysis and discussions on LLMs, including pre-training methods (Zhao et al., 2023), reasoning abilities (Huang and Chang, 2022), downstream applications (Yang et al., 2023; Sun et al., 2023b), but rarely on the topic of LLM instruction finetuning. This survey attempts to fill this blank, organizing the most up-to-date state of knowledge on this quickly advancing field. Specifically,
>>Section 2 presents the general methodology employed in instruction fine-tuning.
>>Section 3 outlines the construction process of commonly-used IT representative datasets.
>>Section 4 presents representative instruction-finetuned models.
>>Section 5 reviews multi-modality techniques and datasets for instruction tuning, including images, speech, and video.
>>Section 6 reviews efforts to adapt LLMs to different domains and applications using the IT strategy.
>>Section 7 reviews explorations to make instruction fine-tuning more efficient, reducing the computational and time costs associated with adapting large models.
>>Section 8 presents the evaluation of IT models, analysis on them, along with criticism against them.
在文獻中,人們越來越關(guān)注對LLM進行分析和討論,包括預(yù)訓(xùn)練方法(Zhao等,2023),推理能力(Huang和Chang,2022),下游應(yīng)用(Yang等,2023;Sun等,2023b),但很少涉及LLM指令微調(diào)這個主題。本調(diào)查試圖填補這一空白,整理關(guān)于這一快速發(fā)展領(lǐng)域的最新知識狀態(tài)。具體而言,
第2節(jié)介紹了指令微調(diào)中采用的一般方法。
第3節(jié)概述了常用IT代表性數(shù)據(jù)集的構(gòu)建過程。
第4節(jié)介紹了代表性的經(jīng)過指令微調(diào)的模型。
第5節(jié)回顧了用于指令微調(diào)的多模態(tài)技術(shù)和數(shù)據(jù)集,包括圖像、語音和視頻。
第6節(jié)回顧了使用IT策略將LLM調(diào)整為不同領(lǐng)域和應(yīng)用的努力。
第7節(jié)回顧了使指令微調(diào)更高效的探索,減少與調(diào)整大型模型相關(guān)的計算和時間成本。
第8節(jié)介紹了對IT模型的評估、分析以及對它們的批評。
2、Methodology方法
In this section, we describe the general pipeline employed in instruction tuning.
在本節(jié)中,我們描述了指令微調(diào)中采用的一般流程。
2.1、Instruction Dataset Construction指令數(shù)據(jù)集構(gòu)建:
數(shù)據(jù)實例三元素:instruction【指定任務(wù)】、input【補充上下文】、output【預(yù)期輸出】
Each instance in an instruction dataset consists of three elements: an instruction, which is a natural language text sequence to specify the task (e.g., write a thank-you letter to XX for XX, write a blog on the topic of XX, etc); an optional input which provides supplementary information for context; and an anticipated output based on the instruction and the input.
指令數(shù)據(jù)集中的每個實例包含三個元素:
instruction:一個instruction,是一系列自然語言文本序列,用于指定任務(wù)(例如,為XX寫一封感謝信,為XX寫一篇關(guān)于XX主題的博客等);
input :可選的input ,為上下文提供補充信息;
output:以及基于指令和輸入預(yù)期的output 。
兩種方法構(gòu)建:T1基于現(xiàn)有數(shù)據(jù)集成策略法(Flan/P3)、T2基于指令收集【手動/自動,如使用LLM的小型手寫種子指令進行擴展】采用LLM【如GPT-3.5-Turbo/GPT4】自動生成法(InstructWild/Self-Instruct)
There are generally two methods for constructing instruction datasets:
>>Data integration from annotated natural language datasets. In this approach, (Instruction, Output) pairs are collected from existing annotated natural language datasets by using templates to transform text-label pairs to (Instruction, Output) pairs. Datasets such as Flan (Longpre et al., 2023) and P3 (Sanh et al., 2021) are constructed based on the data integration strategy.
>>Generating outputs using LLMs: An alternate way to quickly gather the desired outputs to given instructions is to employ LLMs such as GPT-3.5-Turbo or GPT4 instead of manually collecting the outputs. Instructions can come from two sources: (1) manually collected; or (2) expanded from a small set of handwritten seed instructions using LLMs. Next, the collected instructions are fed to LLMs to obtain outputs. Datasets such as InstructWild (Xue et al., 2023) and Self-Instruct (Wang et al., 2022c) are generated following this approach.
通常有兩種方法用于構(gòu)建指令數(shù)據(jù)集:
>> 基于現(xiàn)有數(shù)據(jù)集成策略法—從帶注釋的自然語言數(shù)據(jù)集中集成數(shù)據(jù)。在這種方法中,通過使用模板將文本-標簽對轉(zhuǎn)換為(Instruction, Output)對,從現(xiàn)有的帶注釋的自然語言數(shù)據(jù)集中收集(Instruction, Output)對。Flan(Longpre等,2023)和P3(Sanh等,2021)等數(shù)據(jù)集是基于數(shù)據(jù)集集成策略構(gòu)建的。
>> 采用LLM自動生成法—使用LLM生成輸出:一種快速獲取給定指令所需輸出的替代方法是使用LLM(例如GPT-3.5-Turbo或GPT4),而不是手動收集輸出。指令可以來自兩個來源:(1)手動收集;或(2)基於少量手寫種子指令,使用LLM自動擴展得到。接下來,收集到的指令被輸入LLM以獲得輸出。InstructWild(Xue等,2023)和Self-Instruct(Wang等,2022c)等數(shù)據(jù)集是按照這種方法生成的。
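下面用一小段Python示意上述第一種“數(shù)據(jù)集成”思路:用模板把現(xiàn)有標註數(shù)據(jù)的“文本-標籤”對改寫成(Instruction, Output)對。其中的模板措辭、字段名和標籤映射均為假設(shè)示例,並非Flan/P3的原始實現(xiàn),僅供理解流程。

```python
# 極簡示意:用模板把“文本-標籤”對轉(zhuǎn)換為(Instruction, Output)對
# 模板內(nèi)容、標籤映射均為假設(shè)示例,僅用於說明“數(shù)據(jù)集成”的思路
import random

TEMPLATES = [
    "判斷下面句子的情感是積極還是消極:{text}",
    "下面這句話表達的情感傾向是什么(積極/消極)?{text}",
]
LABEL_MAP = {0: "消極", 1: "積極"}

def to_instruction_pair(example: dict) -> dict:
    """example 形如 {"text": ..., "label": 0或1},返回一條(Instruction, Output)數(shù)據(jù)。"""
    template = random.choice(TEMPLATES)   # 隨機選一個模板,增加指令表述的多樣性
    return {
        "instruction": template.format(text=example["text"]),
        "output": LABEL_MAP[example["label"]],
    }

if __name__ == "__main__":
    raw = [{"text": "他很喜歡這只貓。", "label": 1},
           {"text": "這部電影太無聊了。", "label": 0}]
    print([to_instruction_pair(x) for x in raw])
```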
多輪對話微調(diào)數(shù)據(jù)集:讓LLM扮演兩個對立角色來生成
For multi-turn conversational IT datasets, we can have large language models self-play different roles (user and AI assistant) to generate messages in a conversational format (Xu et al., 2023b).
對于多輪對話型的指令微調(diào)數(shù)據(jù)集,我們可以讓大型語言模型扮演不同角色(用戶和AI助手),以生成對話格式的消息(Xu等,2023b)。
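下面給出self-play生成多輪對話數(shù)據(jù)的一個極簡框架示意。其中chat_completion是假設(shè)的佔位函數(shù),代表對任意聊天模型的一次調(diào)用(實際接口以所用服務(wù)為準);為簡潔起見省略了視角翻轉(zhuǎn)、停止條件等細節(jié)。

```python
# 極簡示意:讓同一個LLM交替扮演“用戶”和“AI助手”,自我對話生成多輪數(shù)據(jù)
from typing import Dict, List

def chat_completion(messages: List[Dict[str, str]]) -> str:
    """假設(shè)的佔位函數(shù):代表對聊天模型的一次調(diào)用,需替換為實際API。"""
    raise NotImplementedError("此處接入實際的聊天模型API")

def self_play(seed_topic: str, max_turns: int = 4) -> List[Dict[str, str]]:
    user_sys = {"role": "system",
                "content": f"你扮演普通用戶,圍繞話題“{seed_topic}”向助手提問。"}
    assistant_sys = {"role": "system",
                     "content": "你扮演樂于助人的AI助手,回答用戶的問題。"}
    conversation: List[Dict[str, str]] = []
    for _ in range(max_turns):
        # 模型以“用戶”身份發(fā)言(實際實現(xiàn)中還需對歷史消息做視角翻轉(zhuǎn),此處從略)
        user_msg = chat_completion([user_sys] + conversation)
        conversation.append({"role": "user", "content": user_msg})
        # 模型以“助手”身份回復(fù)
        assistant_msg = chat_completion([assistant_sys] + conversation)
        conversation.append({"role": "assistant", "content": assistant_msg})
    return conversation
```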
2.2、Instruction Tuning指令微調(diào):有監(jiān)督的訓(xùn)練
Based on the collected IT dataset, a pretrained model can be directly fine-tuned in a fully-supervised manner, where given the instruction and the input, the model is trained by predicting each token in the output sequentially.
基于收集到的指令微調(diào)數(shù)據(jù)集,可以以完全監(jiān)督的方式直接微調(diào)預(yù)訓(xùn)練模型,其中在給定指令和輸入的情況下,模型通過逐個預(yù)測輸出中的每個令牌來進行訓(xùn)練。
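社區(qū)常見的做法是把“指令+輸入”部分的label置為-100、只在輸出token上計算交叉熵(這是常見約定而非論文規(guī)定)。下面給出一個基於PyTorch的極簡示意:

```python
# 極簡示意:指令微調(diào)的有監(jiān)督目標——給定(指令+輸入),按token自回歸地預(yù)測輸出
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """input_ids: [seq_len],前prompt_len個token是“指令+輸入”,其余是期望輸出。"""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX        # 指令部分不參與損失
    return labels

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: [seq_len, vocab];按“預(yù)測下一個token”的方式錯位一位計算交叉熵。"""
    return F.cross_entropy(logits[:-1, :], labels[1:], ignore_index=IGNORE_INDEX)

if __name__ == "__main__":
    vocab, seq_len, prompt_len = 100, 12, 5
    input_ids = torch.randint(0, vocab, (seq_len,))
    labels = build_labels(input_ids, prompt_len)
    logits = torch.randn(seq_len, vocab)      # 實際中由LLM前向計算得到
    print(sft_loss(logits, labels).item())
```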
3、Datasets數(shù)據(jù)集:大多都是英文指令,Natural?Instructions/Unnatural?Instructions/Super-Natural?Instructions、P3/xP3、Flan?2021、Self-Instruct、Evol-Instruct、LIMA、Dolly、OpenAssistant?Conversations、Baize
In this section, we detail widely-used instruction tuning datasets in the community. Table 1 gives an overview of the datasets.
在本節(jié)中,我們詳細介紹了社區(qū)中廣泛使用的指令微調(diào)數(shù)據(jù)集。表格1提供了數(shù)據(jù)集的概述。
3.1、Natural Instructions自然指令:來自193K個實例和61個NLP任務(wù),2元組{輸入,輸出}
Natural Instructions (Mishra et al., 2021) is a human-crafted English instruction dataset consisting of 193K instances, coming from 61 distinct NLP tasks. The dataset is comprised of "instructions" and "instances". Each instance in the "instructions" is a task description consisting of 7 components: title, definition, things to avoid, emphasis/caution, prompt, positive example, and negative example. Subfigure (a) in Figure 2 gives an example of the "instructions". "Instances" consists of ("input", "output") pairs, which are the input data and textual result that follows the given instruction correctly. Subfigure (b) in Figure 2 gives an example of the instances.
The data comes from existing NLP datasets of 61 tasks. The authors collected the "instructions" by referring to the dataset annotating instruction file. Next, the authors constructed the "instances" by unifying data instances across all NLP datasets to ("input", "output") pairs.
Natural Instructions(Mishra等,2021)是一個人工創(chuàng)建的英語指令數(shù)據(jù)集,包含了193K個實例,來自61個不同的自然語言處理任務(wù)。數(shù)據(jù)集由“指令”和“實例”組成。
在“指令”中,每個實例是一個任務(wù)描述,包括7個組成部分:標題、定義、避免事項、強調(diào)/注意事項、提示、正面示例和負面示例。
圖2(a)中的子圖示例展示了“指令”的一個示例。而“實例”由(“輸入”,“輸出”)對組成,即輸入數(shù)據(jù)和按照給定指令正確生成的文本結(jié)果。圖2(b)中的子圖示例展示了“實例”的一個示例。
這些數(shù)據(jù)來自61個任務(wù)的現(xiàn)有自然語言處理數(shù)據(jù)集。作者通過參考數(shù)據(jù)集的指令注釋文件來收集“指令”。接下來,作者通過將所有NLP數(shù)據(jù)集中的數(shù)據(jù)實例統(tǒng)一為(“輸入”,“輸出”)對來構(gòu)建“實例”。
3.2、P3公共提示池:整合170個英語NLP數(shù)據(jù)集和2052個英語提示,三元組{“輸入”【描述任務(wù)】+“答案選擇”【響應(yīng)列表】+“目標”【正確響應(yīng)】}
P3 (Public Pool of Prompts) (Sanh et al., 2021) is an instruction fine-tuning dataset constructed by integrating 170 English NLP datasets and 2,052 English prompts. Prompts, which are sometimes named task templates, are functions that map a data instance in a conventional NLP task (e.g., question answering, text classification) to a natural language input-output pair.
Each instance in P3 has three components: "inputs", "answer_choices", and "targets". "Inputs" is a sequence of text that describes the task in natural language (e.g., "If he like Mary is true, is it also true that he like Mary’s cat?"). "Answer choices" is a list of text strings that are applicable responses to the given task (e.g., ["yes", "no", "undetermined"]). "Targets" is a text string that is the correct response to the given "inputs" (e.g., "yes"). The authors built PromptSource, a tool for creating high-quality prompts collaboratively and an archive for open-sourcing high-quality prompts. The P3 dataset was built by randomly sampling a prompt from multiple prompts in PromptSource and mapping each instance into an ("inputs", "answer choices", "targets") triplet.
P3(Public Pool of Prompts)(Sanh等,2021)是一個指令微調(diào)數(shù)據(jù)集,通過整合170個英語自然語言處理數(shù)據(jù)集和2052個英語提示來構(gòu)建。提示有時被稱為任務(wù)模板,是一種將傳統(tǒng)自然語言處理任務(wù)(例如,問題回答、文本分類)的數(shù)據(jù)實例映射到自然語言輸入-輸出對的功能。
P3中的每個實例有三個組成部分:“輸入”,“答案選擇”和“目標”。 “輸入”是一系列以自然語言描述任務(wù)的文本序列(例如,“如果他喜歡瑪麗是真的,那么他是否也喜歡瑪麗的貓?”)。 “答案選擇”是一個文本字符串列表,是給定任務(wù)的適用響應(yīng)(例如,“是”,“否”,“不確定”)。 “目標”是文本字符串,是給定“輸入”的正確響應(yīng)(例如,“是”)。
作者構(gòu)建了PromptSource,這是一個協(xié)作創(chuàng)建高質(zhì)量提示的工具,也是一個開源高質(zhì)量提示的存檔。P3數(shù)據(jù)集是通過從PromptSource中隨機抽樣選擇一個提示,將每個實例映射為一個(“輸入”,“答案選擇”,“目標”)三元組而構(gòu)建的。
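下面用一小段Python示意“提示(任務(wù)模板)”如何把一個傳統(tǒng)NLI實例映射為P3風(fēng)格的(inputs, answer_choices, targets)三元組。模板措辭與標籤映射均為假設(shè)示例,並非PromptSource中的原始模板:

```python
# 極簡示意:一個“提示/任務(wù)模板”把傳統(tǒng)NLI實例映射為P3風(fēng)格三元組
ANSWER_CHOICES = ["yes", "maybe", "no"]   # 分別對應(yīng)entailment/neutral/contradiction

def nli_prompt(example: dict) -> dict:
    """example 形如 {"premise": ..., "hypothesis": ..., "label": 0/1/2}。"""
    inputs = (f'If "{example["premise"]}" is true, '
              f'is it also true that "{example["hypothesis"]}"?')
    return {
        "inputs": inputs,
        "answer_choices": ANSWER_CHOICES,
        "targets": ANSWER_CHOICES[example["label"]],
    }

if __name__ == "__main__":
    ex = {"premise": "He likes Mary", "hypothesis": "He likes Mary's cat", "label": 1}
    print(nli_prompt(ex))
```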
3.3、xP3跨語言公共提示池:46種語言中16類NLP任務(wù),2元組{輸入和目標}
xP3 (Crosslingual Public Pool of Prompts) (Muennighoff et al., 2022) is a multilingual instruction dataset consisting of 16 diverse natural language tasks in 46 languages. Each instance in the dataset has two components: "inputs" and "targets". "Inputs" is a task description in natural language. "Targets" is the textual result that follows the "inputs" instruction correctly.
The original data in xP3 comes from three sources: the English instruction dataset P3, 4 English unseen tasks in P3 (e.g., translation, program synthesis), and 30 multilingual NLP datasets. The authors built the xP3 dataset by sampling human-written task templates from PromptSource and then filling templates to transform diverse NLP tasks into a unified formalization. For example, a task template for the natural language inference task is as follows: "If Premise is true, is it also true that Hypothesis?"; "yes", "maybe", "no" with respect to the original task labels "entailment (0)", "neutral (1)" and "contradiction (2)".
xP3(Crosslingual Public Pool of Prompts)(Muennighoff等,2022)是一個多語言指令數(shù)據(jù)集,包含46種語言中16個不同的自然語言處理任務(wù)。
數(shù)據(jù)集中的每個實例有兩個組成部分:“輸入”和“目標”。 “輸入”是自然語言中的任務(wù)描述。 “目標”是按照“輸入”指令正確生成的文本結(jié)果。
xP3中的原始數(shù)據(jù)來自三個來源:英語指令數(shù)據(jù)集P3,P3中的4個英語未見過的任務(wù)(例如,翻譯、程序合成)以及30個多語言自然語言處理數(shù)據(jù)集。作者通過從PromptSource中隨機抽樣選擇人工編寫的任務(wù)模板,然后填充模板,將不同的自然語言處理任務(wù)轉(zhuǎn)換為統(tǒng)一的形式,從而構(gòu)建了xP3數(shù)據(jù)集。
3.4、Flan 2021:將62個NLP基準轉(zhuǎn)換為輸入-輸出對進而構(gòu)建,2元組{輸入+目標}
Flan 2021 (Longpre et al., 2023) is an English instruction dataset constructed by transforming 62 widely-used NLP benchmarks (e.g., SST-2, SNLI, AG News, MultiRC) into language input- output pairs. Each instance in the Flan 2021 has "input" and "target" components. "Input" is a sequence of text that describes a task via a natural language instruction (e.g., "determine the sentiment of the sentence ’He likes the cat.’ is positive or negative?"). "Target" is a textual result that executes the "input" instruction correctly (e.g., "positive"). The authors transformed conventional NLP datasets into input-target pairs by: Step 1: manually composing instruction and target templates; Step 2: filling templates with data instances from the dataset.
Flan 2021(Longpre等,2023)是一個英語指令數(shù)據(jù)集,通過將62個廣泛使用的自然語言處理基準(例如,SST-2、SNLI、AG News、MultiRC)轉(zhuǎn)換為語言輸入-輸出對來構(gòu)建。Flan 2021中的每個實例包含“輸入”和“目標”兩個組成部分。“輸入”是描述任務(wù)的自然語言指令序列(例如,“確定句子'他喜歡貓。'的情感是積極還是消極?”)。 “目標”是正確執(zhí)行“輸入”指令的文本結(jié)果(例如,“積極”)。作者通過以下步驟將傳統(tǒng)的自然語言處理數(shù)據(jù)集轉(zhuǎn)換為輸入-目標對:
步驟1:手動組合指令和目標模板;
步驟2:使用數(shù)據(jù)集中的數(shù)據(jù)實例填充模板。
3.5、Unnatural Instructions非自然指令:基于InstructGPT構(gòu)建的24萬個實例,4元組{指令+輸入+約束+輸出}
Unnatural Instructions (Honovich et al., 2022) is an instruction dataset with approximately 240,000 instances, constructed using InstructGPT (text-davinci-002) (Ouyang et al., 2022). Each instance in the dataset has four components: INSTRUCTION, INPUT, CONSTRAINTS, and OUTPUT. "Instruction" is a description of the instructing task in natural language. "Input" is an argument in natural language that instantiates the instruction task.
非自然指令(Honovich等,2022)是一個包含約24萬個實例的指令數(shù)據(jù)集,使用InstructGPT(text-davinci-002)(Ouyang等,2022)構(gòu)建而成。數(shù)據(jù)集中的每個實例有四個組成部分:指令、輸入、約束和輸出。 “指令”是自然語言中的指令任務(wù)描述。 “輸入”是實例化指令任務(wù)的自然語言參數(shù)。
3.6、Self-Instruct
LLMs之Data:指令微調(diào)的簡介、Self Instruction思想(一種生成指令數(shù)據(jù)集的方法論—主要用在指令微調(diào)階段)的簡介、Alpaca/BELLE應(yīng)用、實戰(zhàn)案例代碼實現(xiàn)之詳細攻略_一個處女座的程序猿的博客-CSDN博客
包含基于InstructGPT的52K個訓(xùn)練指令和252個評估指令,3元組{“指令”【定義任務(wù)】+“輸入”【指令的內(nèi)容補充】+“輸出”【正確結(jié)果】}
Self-Instruct (Wang et al., 2022c) is an English instruction dataset with 52K training instructions and 252 evaluation instructions, constructed using InstructGPT (Ouyang et al., 2022). Each data instance consists of "instruction", "input" and "output". "Instruction" is a task definition in natural language (e.g., "Please answer the following question."). "Input" is optional and is used as supplementary content for the instruction (e.g., "Which country’s capital is Beijing?"), and "output" is the textual result that follows the instruction correctly (e.g., "Beijing").
自我指導(dǎo)(Self-Instruct)(Wang等,2022c)是一個英語指令數(shù)據(jù)集,包含52K個訓(xùn)練指令和252個評估指令,使用InstructGPT(Ouyang等,2022)構(gòu)建而成。每個數(shù)據(jù)實例包括“指令”、“輸入”和“輸出”三個部分。 “指令”是自然語言中的任務(wù)定義(例如,“請回答以下問題?!?#xff09;。 “輸入”是可選的,用作指令的補充內(nèi)容(例如,“哪個國家的首都是北京?”),而“輸出”是正確遵循指令生成的文本結(jié)果(例如,“北京”)。
生成四步驟:構(gòu)建示例(175個種子任務(wù)來抽樣8個自然語言指令)來提示InstructGPT生成更多指令→判斷是否分類任務(wù)+基于給定的“指令”提示InstructGPT生成“輸入”再結(jié)合生成“輸出”→為相應(yīng)的指令任務(wù)生成“輸入”和“輸出”→后處理(過濾和刪除重復(fù))→最終得到52K個英語指令
The full dataset is generated based on the following steps: Step 1. The authors randomly sampled 8 natural language instructions from the 175 seed tasks as examples and prompted InstructGPT to generate more task instructions.
Step 2. The authors determined whether the instructions generated in Step 1 is a classification task. If yes, they asked InstructGPT to generate all possible options for the output based on the given instruction and randomly selected a particular output category to prompt InstructGPT to generate the corresponding "input" content. For Instructions that do not belong to a classification task, there should be countless "output" options. The authors proposed to use the Input-first strategy, where InstructGPT was prompted to generate the "input" based on the given "instruction" first and then generate the "output" according to the "instruction" and the generated "input".
Step 3. Based on results of step-2, the authors used InstructGPT to generate the "input" and "output" for corresponding instruction tasks using the output-first or input-first strategy.
Step 4. The authors post-processed (e.g., filtering out similar instructions and removing duplicate data for input and output) the generated instruction tasks and got a final number of 52K English instructions.
整個數(shù)據(jù)集是通過以下步驟生成的:
步驟1:作者隨機從175個種子任務(wù)中抽樣8個自然語言指令作為示例,并提示InstructGPT生成更多的任務(wù)指令。
步驟2:作者確定步驟1中生成的指令是否是分類任務(wù)。如果是,他們要求InstructGPT基于給定的指令生成所有可能的輸出選項,并隨機選擇一個特定的輸出類別,以促使InstructGPT生成相應(yīng)的“輸入”內(nèi)容。對于不屬于分類任務(wù)的指令,應(yīng)該有無數(shù)個“輸出”選項。作者提出了首先生成“輸入”的策略,即首先基于給定的“指令”提示InstructGPT生成“輸入”,然后根據(jù)“指令”和生成的“輸入”生成“輸出”。
步驟3:根據(jù)步驟2的結(jié)果,作者使用InstructGPT基于輸出優(yōu)先或輸入優(yōu)先策略為相應(yīng)的指令任務(wù)生成“輸入”和“輸出”。
步驟4:作者對生成的指令任務(wù)進行后處理(例如,過濾相似指令,刪除輸入和輸出的重復(fù)數(shù)據(jù)),得到最終的52K個英語指令。
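下面用偽代碼框架示意上述四步流程的大致循環(huán)結(jié)構(gòu)。其中complete、is_classification、too_similar均為假設(shè)的佔位函數(shù),提示詞也只是示例寫法,並非原論文實現(xiàn):

```python
# 偽代碼示意:Self-Instruct四步流程的循環(huán)框架(佔位函數(shù)需自行實現(xiàn))
import random

def complete(prompt: str) -> str:
    """佔位函數(shù):一次InstructGPT類模型調(diào)用。"""
    raise NotImplementedError

def is_classification(instruction: str) -> bool:
    """佔位函數(shù):判斷指令是否為分類任務(wù)(原文同樣通過提示模型來判斷)。"""
    raise NotImplementedError

def too_similar(instruction: str, pool: list) -> bool:
    """佔位函數(shù):過濾與已有指令過于相似的新指令(Step 4的后處理)。"""
    raise NotImplementedError

def self_instruct(seed_instructions: list, target_size: int = 52000) -> list:
    pool, dataset = list(seed_instructions), []           # 175條種子指令作為起點
    while len(dataset) < target_size:
        examples = random.sample(pool, 8)                  # Step 1:取8條指令作示例
        new_inst = complete("請仿照以下指令,再寫一條新的任務(wù)指令:\n" + "\n".join(examples))
        if too_similar(new_inst, pool):
            continue
        if is_classification(new_inst):                    # Step 2/3:分類任務(wù)輸出優(yōu)先
            output = complete(f"列出指令“{new_inst}”的可能輸出類別,并任選其一:")
            inp = complete(f"為指令“{new_inst}”和輸出“{output}”生成對應(yīng)的輸入:")
        else:                                              # 非分類任務(wù)輸入優(yōu)先
            inp = complete(f"為指令“{new_inst}”生成一個輸入:")
            output = complete(f"指令:{new_inst}\n輸入:{inp}\n輸出:")
        pool.append(new_inst)
        dataset.append({"instruction": new_inst, "input": inp, "output": output})
    return dataset
```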
3.7、Evol-Instruct:包含基于ChatGPT采用進化策略(添加約束、增加推理步驟、復(fù)雜化輸入等)構(gòu)建的52K個訓(xùn)練指令和218個評估指令,二元組{instruction, response}
形成過程:基于52K的初始集→隨機選擇1個進化策略讓ChatGPT重寫指令→過濾未進化的指令對(利用ChatGPT和規(guī)則)→利用新生成進化指令對更新數(shù)據(jù)集→重復(fù)上述四次→收集了25萬個指令對
Evol-Instruct (Xu et al., 2023a) is an English instruction dataset consisting of a training set with 52K instructions and an evaluation set with 218 instructions. The authors prompted ChatGPT (OpenAI, 2022) to rewrite instructions using the in-depth and in-breadth evolving strategies. The in-depth evolving strategy contains five types of operations, e.g., adding constraints, increasing reasoning steps, complicating input, etc. The in-breadth evolving strategy upgrades the simple instruction to a more complex one or directly generates a new instruction to increase diversity. The authors first used 52K (instruction, response) pairs as the initial set. Then they randomly sampled an evolving strategy and asked ChatGPT to rewrite the initial instruction based on the chosen evolving strategy. The authors employed ChatGPT and rules to filter out non-evolved instruction pairs and updated the dataset with newly generated evolved instruction pairs. After repeating the above process 4 times, the authors collected 250K instruction pairs. Besides the train set, the authors collected 218 human-generated instructions from real scenarios (e.g., open-source projects, platforms, and forums), called the Evol-Instruct test set.
Evol-Instruct(Xu等,2023a)是一個英語指令數(shù)據(jù)集,包含一個有52K條指令的訓(xùn)練集和一個有218條指令的評估集。作者使用ChatGPT(OpenAI,2022)以深度和廣度兩類進化策略重寫指令來構(gòu)建這個數(shù)據(jù)集。深度進化策略包含五種類型的操作,例如添加約束、增加推理步驟、複雜化輸入等。廣度進化策略將簡單指令升級為更複雜的指令,或直接生成新的指令以增加多樣性。
作者首先使用52K個?(instruction, response)對作為初始集。然后隨機選擇一個進化策略,要求ChatGPT根據(jù)選擇的進化策略重寫初始指令。作者使用ChatGPT和規(guī)則來過濾掉未進化的指令對,并使用新生成的進化指令對更新數(shù)據(jù)集。在重復(fù)上述過程4次之后,作者收集了25萬個指令對。除了訓(xùn)練集之外,作者還從真實場景(例如,開源項目、平臺和論壇)中收集了218個人工生成的指令,稱為Evol-Instruct測試集。
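下面是Evol-Instruct“進化循環(huán)”的一個極簡示意。其中chat代表一次ChatGPT調(diào)用,各進化操作的提示詞為假設(shè)示例,並非原論文的完整提示:

```python
# 極簡示意:Evol-Instruct的指令“進化”循環(huán)(提示詞為假設(shè)示例)
import random

IN_DEPTH_OPS = ["在指令中加入一條新的約束條件", "要求多一步推理", "把輸入複雜化"]
IN_BREADTH_OP = "基於該指令的主題,另寫一條全新的、更少見的指令"

def chat(prompt: str) -> str:
    """佔位函數(shù):一次ChatGPT調(diào)用。"""
    raise NotImplementedError

def evolve(dataset: list, rounds: int = 4) -> list:
    for _ in range(rounds):
        evolved = []
        for item in dataset:
            op = random.choice(IN_DEPTH_OPS + [IN_BREADTH_OP])     # 隨機選一種進化策略
            new_inst = chat(f"請按“{op}”重寫下面的指令:\n{item['instruction']}")
            if new_inst.strip() == item["instruction"].strip():
                continue                                           # 過濾未發(fā)生進化的指令
            evolved.append({"instruction": new_inst, "response": chat(new_inst)})
        dataset = dataset + evolved                                # 用進化後的指令對更新數(shù)據(jù)集
    return dataset
```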
3.8、LIMA:包含1K數(shù)據(jù)實例的訓(xùn)練集(75%源自3個社區(qū)問答網(wǎng)站)和300個實例的測試集,二元組{instruction, response}??
LIMA (Zhou et al., 2023) is an English instruction dataset consisting of a train set with 1K data instances and a test set with 300 instances. The train set contains 1K ("instruction", "response") pairs. For the training data, 75% are sampled from three community question & answers websites (i.e., Stack Exchange, wikiHow, and the Pushshift Reddit Dataset (Baumgartner et al., 2020)); 20% are manually written by a set of the authors (referred Group A) inspired by their interests; 5% are sampled from the Super-Natural Instructions dataset (Wang et al., 2022d). As for the valid set, the authors sampled 50 instances from the Group A author-written set. The test set contains 300 examples, with 76.7% written by another group (Group B) of authors and 23.3% sampled from the Pushshift Reddit Dataset (Baumgartner et al., 2020), which is a collection of questions & answers within the Reddit community.
LIMA(Zhou等,2023)是一個英語指令數(shù)據(jù)集,包含一個包含1K個數(shù)據(jù)實例的訓(xùn)練集和一個包含300個實例的測試集。訓(xùn)練集包含1K個(instruction, response)對。對于訓(xùn)練數(shù)據(jù),其中75%來自三個社區(qū)問答網(wǎng)站(即Stack Exchange、wikiHow和Pushshift Reddit數(shù)據(jù)集(Baumgartner等,2020));20%由一組作者(Group A)手動編寫,受到他們興趣的啟發(fā);5%來自Super-Natural Instructions數(shù)據(jù)集(Wang等,2022d)。至于驗證集,作者從Group A作者編寫的集合中抽樣了50個實例。測試集包含300個示例,其中76.7%由另一組作者(Group B)編寫,23.3%來自Pushshift Reddit數(shù)據(jù)集(Baumgartner等,2020),這是Reddit社區(qū)中的問題和回答的集合。
3.9、Super-Natural Instructions超級自然指令:包含1616個NLP任務(wù)和500萬個任務(wù)實例+涵蓋76種任務(wù)類型和55種語言,二元組(“指令”和“任務(wù)實例”)
Super Natural Instructions (Wang et al., 2022f) is a multilingual instruction collection composed of 1,616 NLP tasks and 5M task instances, covering 76 distinct task types (e.g., text classification, information extraction, text rewriting, text composition, etc.) and 55 languages. Each task in the dataset consists of an "instruction" and "task instances". Specifically, "instruction" has three components: a "definition" that describes the task in natural language; "positive examples" that are samples of inputs and correct outputs, along with a short explanation for each; and "negative examples" that are samples of inputs and undesired outputs, along with a short explanation for each, as shown in Figure 2 (a). "Task instances" are data instances comprised of textual input and a list of acceptable textual outputs, as shown in Figure 2 (b). The original data in Super Natural Instructions comes from three sources: (1) existing public NLP datasets (e.g., CommonsenseQA); (2) applicable intermediate annotations that are generated through a crowdsourcing process (e.g., paraphrasing results to a given question during a crowdsourcing QA dataset); (3) synthetic tasks that are transformed from symbolic tasks and rephrased in a few sentences (e.g., algebraic operations like number comparison).
超級自然指令(Super Natural Instructions)(Wang等,2022f)是一個多語言指令收集,包含1616個自然語言處理任務(wù)和500萬個任務(wù)實例,涵蓋76種不同的任務(wù)類型(例如,文本分類、信息提取、文本改寫、文本組成等)和55種語言。數(shù)據(jù)集中的每個任務(wù)包括“指令”和“任務(wù)實例”兩個部分。
具體來說,“指令”有三個組成部分:以自然語言描述任務(wù)的“定義”;“正面示例”,它是輸入和正確輸出的示例,每個示例都附有簡短的解釋;“負面示例”,它是輸入和不希望的輸出的示例,每個示例都附有簡短的解釋,如圖2(a)所示。
“任務(wù)實例”是由文本輸入和可接受的文本輸出列表組成的數(shù)據(jù)實例,如圖2(b)所示。
超級自然指令中的原始數(shù)據(jù)來自三個來源:(1)現(xiàn)有的公共自然語言處理數(shù)據(jù)集(例如,CommonsenseQA);(2)通過眾包過程生成的適用中間注釋(例如,在眾包問答數(shù)據(jù)集中對給定問題進行釋義);(3)從符號任務(wù)轉(zhuǎn)換而來、並用幾句話重新表述的合成任務(wù)(例如,數(shù)字比較等代數(shù)運算)。
3.10、Dolly:包含15000個人工生成英語指令+7種特定類型
Dolly (Conover et al., 2023a) is an English instruction dataset with 15,000 human-generated data instances designed to enable LLMs to interact with users akin to ChatGPT. The dataset is designed for simulating a wide range of human behaviors, covering 7 specific types: open Q&A, closed Q&A, extracting information from Wikipedia, summarizing information from Wikipedia, brainstorming, classification, and creative writing. Examples of each task type in the dataset are shown in Table 2.
Dolly(Conover等,2023a)是一個包含15000個人工生成的數(shù)據(jù)實例的英語指令數(shù)據(jù)集,旨在使大型語言模型能夠與用戶進行類似于ChatGPT的互動。該數(shù)據(jù)集旨在模擬各種人類行為,涵蓋7種特定類型:開放式問答、封閉式問答、從維基百科中提取信息、從維基百科中總結(jié)信息、頭腦風(fēng)暴、分類和創(chuàng)意寫作。數(shù)據(jù)集中每種任務(wù)類型的示例如表2所示。
3.11、OpenAssistant Conversations
包含161K條消息(91.8K個用戶提示+69.6K個助手回復(fù)),來自35種語言的66.5K個對話樹+461K個人工注釋的質(zhì)量評分,對話樹(節(jié)點,路徑/線程)
OpenAssistant Conversations (K?pf et al., 2023) is a human-crafted multilingual assistant-style conversation corpus consisting of 161,443 messages (i.e., 91,829 user prompts, 69,614 assistant replies) from 66,497 conversation trees in 35 languages, along with 461,292 human-annotated quality ratings. Each instance in the dataset is a conversation tree (CT). Specifically, each node in a conversation tree denotes a message generated by roles (i.e., prompter, assistant) in the conversation. A CT’s root node represents an initial prompt from the prompter, while other nodes denote replies from a prompter or an assistant. A path from the root to any node in a CT represents a valid conversation between the prompter and assistant in turns and is referred to as a thread. Figure 4 shows an example of a conversation tree consisting of 12 messages in 6 threads.
OpenAssistant Conversations(K?pf等,2023)是一個人工創(chuàng)建的多語言助手風(fēng)格對話語料庫,包含161443條消息(即91829個用戶提示,69614個助手回復(fù)),來自35種語言中66497個對話樹,同時還包含461292個人工注釋的質(zhì)量評分。
數(shù)據(jù)集中的每個實例是一個對話樹(CT)。具體來說,對話樹中的每個節(jié)點表示會話中角色(即提示者、助手)生成的消息。CT的根節(jié)點表示提示者的初始提示,而其他節(jié)點表示提示者或助手的回復(fù)。從根節(jié)點到CT中任何節(jié)點的路徑表示提示者和助手之間的有效會話,稱為線程。圖4顯示了一個由12條消息組成的對話樹的示例,其中包含6個線程。
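下面用一小段Python示意對話樹(CT)與線程(thread)的關(guān)系:每條“根→葉”路徑就是一段可用於訓(xùn)練的有效對話。字段名為假設(shè)示例,並非官方數(shù)據(jù)格式:

```python
# 極簡示意:對話樹(CT)與線程(thread)——每條“根→葉”路徑即一段有效對話
from dataclasses import dataclass, field
from typing import List

@dataclass
class Message:
    role: str                                   # "prompter" 或 "assistant"
    text: str
    replies: List["Message"] = field(default_factory=list)

def threads(node: Message, prefix: List[Message] = None) -> List[List[Message]]:
    prefix = (prefix or []) + [node]
    if not node.replies:                        # 葉子節(jié)點:得到一條完整線程
        return [prefix]
    result = []
    for child in node.replies:
        result.extend(threads(child, prefix))
    return result

if __name__ == "__main__":
    root = Message("prompter", "怎么學(xué)好Python?", replies=[
        Message("assistant", "先從基礎(chǔ)語法開始……"),
        Message("assistant", "建議直接做小項目……",
                replies=[Message("prompter", "有推薦的項目嗎?")]),
    ])
    for t in threads(root):
        print([f"{m.role}: {m.text}" for m in t])
```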
五步流程收集對話樹:提示者→標記提示→擴展樹節(jié)點→標記回復(fù)→排名
The authors first collected conversation trees based on the following five-step pipeline:
Step 1. prompting: contributors performed as the prompter and crafted initial prompts;
Step 2. labeling prompts: contributors rated scores to initial prompts from step 1, and the authors chose high-quality prompts as root nodes with a balanced sampling strategy;
Step 3. expanding tree nodes: contributors added reply messages as prompter or assistant;
Step 4. labeling replies: contributors assigned scores to existing node replies;
Step 5. ranking: contributors ranked assistant replies referring to the contributor guidelines.
The tree state machine managed and tracked the state (e.g., initial state, growing state, end state) throughout the conversation crafting process. Subsequently, the OpenAssistant Conversations dataset was built by filtering out offensive and inappropriate conversation trees.
作者首先根據(jù)以下五步流程收集了對話樹:
步驟1:提示者:貢獻者扮演提示者的角色,創(chuàng)建初始提示;
步驟2:標記提示:貢獻者對步驟1中的初始提示進行評分,作者使用平衡的抽樣策略選擇高質(zhì)量的提示作為根節(jié)點;
步驟3:擴展樹節(jié)點:貢獻者添加提示者或助手的回復(fù)消息;
步驟4:標記回復(fù):貢獻者對現(xiàn)有節(jié)點的回復(fù)分配分數(shù);
步驟5:排名:貢獻者根據(jù)貢獻者指南對助手的回復(fù)進行排名。
樹狀態(tài)機在整個對話創(chuàng)作過程中管理和跟蹤狀態(tài)(例如,初始狀態(tài)、增長狀態(tài)、結(jié)束狀態(tài))。隨后,通過過濾掉冒犯性和不適當?shù)膶υ挊?#xff0c;構(gòu)建了OpenAssistant Conversations數(shù)據(jù)集。
3.12、Baize:基于ChatGPT(self-chat思想)構(gòu)建的111.5K個實例多輪(3.4輪)聊天語料庫,二元組{prompt,response}
Baize (Xu et al., 2023c) is an English multi-turn chat corpus with 111.5K instances constructed using ChatGPT. Each turn consists of a user’s prompt and a response from the assistant. Each instance in Baize v1 contains 3.4 turns of conversations.
To create the Baize dataset, the authors proposed self-chat, where ChatGPT plays roles of the user and the AI assistant in turns and generates messages in a conversational format. Specifically, the authors first crafted a task template that defines the roles and tasks for ChatGPT (as shown in Table 3). Next, they sampled questions (e.g., "How do you fix a Google Play Store account that isn’t working?") from Quora and Stack Overflow datasets as conversation seeds (e.g., topics). Subsequently, they prompted ChatGPT with the template and the sampled seed. ChatGPT continuously generates messages for both sides until a natural stopping point is reached.
Baize(Xu等,2023c)是一個包含111.5K個實例的英語多輪聊天語料庫,使用ChatGPT構(gòu)建。每個輪次包括用戶的提示和助手的回復(fù)。Baize v1中的每個實例平均包含3.4輪對話。
為了創(chuàng)建Baize數(shù)據(jù)集,作者提出了自我對話的概念,其中ChatGPT在輪流扮演用戶和AI助手的角色,以會話格式生成消息。具體來說,作者首先創(chuàng)建了一個任務(wù)模板,定義了ChatGPT的角色和任務(wù)(如表3所示)。接下來,他們從Quora和Stack Overflow數(shù)據(jù)集中抽樣問題(例如,“如何修復(fù)不工作的Google Play Store賬戶?”)作為會話種子(例如,話題)。隨后,他們使用模板和抽樣的種子提示ChatGPT。ChatGPT持續(xù)地為雙方生成消息,直到達到自然停止點為止。
4、Instruction Fine-tuned LLMs指令微調(diào)的LLM模型
In this section, we detail widely-used LLM models in the community that are trained through instruction fine-tuning.
在本節(jié)中,我們詳細介紹社區(qū)中廣泛使用的、通過指令微調(diào)訓(xùn)練的LLM模型。
4.1、InstructGPT:基于GPT-3模型+人類指導(dǎo)微調(diào)
LLMs之InstructGPT:《Training language models to follow instructions with human feedback》翻譯與解讀_一個處女座的程序猿的博客-CSDN博客
微調(diào)三步驟(基於人類篩選指令進行SFT→基於對同一instruction的多個responses從優(yōu)到劣的排序來訓(xùn)練RM模型→利用RM+PPO策略進一步優(yōu)化模型)
InstructGPT (175B) (Ouyang et al., 2022) is initialized with GPT-3 (175B) (Brown et al., 2020b) and then fine-tuned on human instructions. The fine-tuning procedure is composed of the following three steps: (1) supervised fine-tuning (SFT) on the human-filtered instruction dataset, which is collected from Playground API history records; (2) training a reward model to predict human preferences based on an annotated dataset, which is constructed through human labor by sampling multiple responses for one instruction and ranking them from the best to the worst; (3) further optimizing the model from Step 1 with new instructions and the trained reward model in Step 2. Parameters are updated using the proximal policy optimization (PPO) (Schulman et al., 2017) method, a policy gradient reinforcement learning method. Steps (2) and (3) are alternated multiple times until the model performance does not significantly improve.
InstructGPT(175B)(Ouyang等,2022)以GPT-3(175B)(Brown等,2020b)為初始模型,然後在人類指令數(shù)據(jù)上進行微調(diào)。
微調(diào)過程包括以下三個步驟:
(1)在人類篩選的指令數(shù)據(jù)集上進行監(jiān)督微調(diào)(SFT),該數(shù)據(jù)集從Playground API歷史記錄中收集;
(2)訓(xùn)練獎勵模型以預(yù)測人類偏好,基于通過人工勞動采樣的帶注釋數(shù)據(jù)集,該數(shù)據(jù)集為一個指令采樣多個響應(yīng),并將其從最佳到最差進行排序;
(3)利用新指令和步驟(2)中訓(xùn)練好的獎勵模型,進一步優(yōu)化步驟(1)得到的模型。參數(shù)使用近端策略優(yōu)化(PPO)(Schulman等,2017)方法進行更新,這是一種策略梯度強化學(xué)習(xí)方法。步驟(2)和(3)多次交替進行,直到模型性能不再顯著提高為止。
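步驟(2)中的獎勵模型通常用“成對排序損失”訓(xùn)練,使較優(yōu)回復(fù)的獎勵分高於較差回復(fù)。下面給出這種常見寫法的極簡示意(並非OpenAI的原始實現(xiàn)):

```python
# 極簡示意:獎勵模型常用的成對排序損失——讓較優(yōu)回復(fù)的獎勵分高於較差回復(fù)
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_chosen: torch.Tensor,
                          reward_rejected: torch.Tensor) -> torch.Tensor:
    """reward_chosen / reward_rejected: [batch],分別為較優(yōu)、較差回復(fù)的標量獎勵。"""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

if __name__ == "__main__":
    chosen = torch.tensor([1.2, 0.3, 0.8])
    rejected = torch.tensor([0.1, -0.4, 0.9])
    print(pairwise_ranking_loss(chosen, rejected).item())
```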
InstructGPT的真實性、毒性、模型性能等表現(xiàn)非常出色
Overall, InstructGPT outperforms GPT-3. For automatic evaluations, InstructGPT outperforms GPT-3 by 10% on the TruthfulQA (Lin et al., 2021) dataset in terms of truthfulness and by 7% on the RealToxicityPrompts (Gehman et al., 2020) in terms of toxicity. On NLP datasets (i.e., WSC), InstructGPT achieves comparable performance to GPT-3. For human evaluations, regarding four different aspects, including following correct instructions, following explicit constraints, fewer hallucinations, and generating appropriate responses, InstructGPT outperforms GPT-3 +10%, +20%, -20%, and +10%, respectively.
總體而言,InstructGPT優(yōu)於GPT-3。在自動評估方面,InstructGPT在真實性上(TruthfulQA數(shù)據(jù)集,Lin等,2021)比GPT-3提高了10%,在毒性上(RealToxicityPrompts,一個評估生成文本毒性的數(shù)據(jù)集,Gehman等,2020)比GPT-3改善了7%。在自然語言處理數(shù)據(jù)集(例如WSC)上,InstructGPT的性能與GPT-3相當。在人類評估方面,涉及遵循正確指令、遵循明確約束、幻覺較少以及生成適當響應(yīng)等四個不同方面,InstructGPT分別優(yōu)於GPT-3 +10%、+20%、-20%和+10%。
4.2、BLOOMZ:基于BLOOM模型+指令數(shù)據(jù)集xP3,多種任務(wù)及其數(shù)據(jù)集上表現(xiàn)均超于BLOOM
LLMs:《BLOOM: A 176B-Parameter Open-Access Multilingual Language Model》翻譯與解讀_一個處女座的程序猿的博客-CSDN博客
BLOOMZ (176B) (Muennighoff et al., 2022) is initialized with BLOOM (176B) (Scao et al., 2022), and then fine-tuned on the instruction dataset xP3 (Muennighoff et al., 2022), a collection of human-instruction datasets in 46 languages, coming from two sources: (1) P3, which is a collection of (English instruction, English response) pairs; and (2) an (English instruction, Multilingual response) set which is transformed from multilingual NLP datasets (e.g., Chinese benchmarks) by filling task templates with pre-defined English instructions.
For automatic evaluation, BLOOMZ performs better than BLOOM in the zero-shot setting by +10.4%, 20.5%, and 9.8% on coreference resolution, sentence completion and natural language inference datasets, respectively. For the HumanEval benchmark (Chen et al., 2021), BLOOMZ outperforms BLOOM by 10% in terms of the Pass@100 metric. For generative tasks, BLOOMZ receives +9% BLEU improvement compared to BLOOM on the lm-evaluation-harness benchmark.
BLOOMZ(176B)(Muennighoff等,2022)以BLOOM(176B)(Scao等,2022)為初始模型,然后在指令數(shù)據(jù)集xP3(Muennighoff等,2022)上進行微調(diào)。xP3是一個包含46種語言的人類指令數(shù)據(jù)集的集合,來自兩個來源:
(1)P3,其中包含(英文指令,英文響應(yīng))對;
(2)一個(英文指令,多語言響應(yīng))集,通過在多語言自然語言處理數(shù)據(jù)集(例如中文基準)中使用預(yù)定義的英文指令填充任務(wù)模板而轉(zhuǎn)化而來。
對于自動評估,BLOOMZ在zero-shot設(shè)置下在共指消解、句子補全和自然語言推理數(shù)據(jù)集上分別比BLOOM提高了10.4%、20.5%和9.8%。對于HumanEval基準(Chen等,2021),BLOOMZ在Pass@100度量上優(yōu)于BLOOM 10%。對于生成任務(wù),BLOOMZ在lm-evaluation-harness基準上比BLOOM的BLEU分數(shù)提高了9%。
"Pass@100" 是一種評估指標,用于衡量生成式模型在生成任務(wù)中的性能。通常,生成式模型會根據(jù)輸入生成相應(yīng)的文本輸出。
T1、BLEU指標:在文本生成任務(wù)中,一種評估方式是將生成的文本與人工提供的參考文本進行比較,以測量生成文本的質(zhì)量。"BLEU"(Bilingual Evaluation Understudy,雙語評估候補)是一種常用的自動評估指標,用于衡量生成文本與參考文本之間的相似性。
T2、Pass@K指標:而在代碼生成等生成式任務(wù)中,還常用"Pass@K"這類指標:對每個問題采樣K個候選答案,只要其中至少有一個正確(例如通過單元測試)即視為通過。例如,"Pass@100" 表示采樣100個候選答案中至少有一個正確的比例。
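按HumanEval/Codex論文(Chen等,2021)中的常見做法,Pass@K可用如下無偏估計計算:對每道題采樣n個答案、其中c個正確,則 pass@k = 1 - C(n-c, k)/C(n, k)。下面是對應(yīng)的Python寫法(示意):

```python
# 極簡示意:Pass@K的無偏估計——每道題采樣n個答案、其中c個正確
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0                      # 錯誤答案不足k個時,任取k個必含正確解
    return 1.0 - comb(n - c, k) / comb(n, k)

if __name__ == "__main__":
    print(pass_at_k(n=200, c=2, k=100)) # 采樣200個、2個正確時的pass@100
```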
4.3、Flan-T5:基于T5模型+FLAN數(shù)據(jù)集微調(diào),基于JAX的T5X框架+128*TPU v4=37小時
Flan-T5 (11B) is a large language model initialized with T5 (11B) (Raffel et al., 2019), and then fine-tuned on the FLAN dataset (Longpre et al., 2023). The FLAN dataset is a collection of (instruction, output) pairs, constructed from 62 datasets of 12 NLP tasks (e.g., natural language inference, commonsense reasoning, paraphrase generation) by filling templates with various instructions under a unified task formalization.
During fine-tuning, FLAN-T5 adapts the JAX-based T5X framework and selects the best model evaluated on the held-out tasks every 2k steps. Compared with T5’s pre-training stage, fine-tuning costs 0.2% computational resources (approximately 128 TPU v4 chips for 37 hours).
For evaluation, FLAN-T5 (11B) outperforms T5 (11B), and achieves comparable results to larger models, including PaLM (60B) (Chowdhery et al., 2022) in the few-shot setting. FLAN-T5 outperforms T5 by +18.9%, +12.3%, +4.1%, +5.8%, +2.1%, and +8% on MMLU (Hendrycks et al., 2020), BBH (Suzgun et al., 2022), TyDiQA (Clark et al., 2020), MGSM (Shi et al., 2022), open-ended generation, and RealToxicityPrompts (Gehman et al., 2020), respectively. In few-shot settings, FLAN-T5 outperforms PaLM by +1.4% and +1.2% on the BBH and TyDiQA datasets.
Flan-T5(11B)是一種大型語言模型,其初始化采用T5(11B)(Raffel等,2019)並在FLAN數(shù)據(jù)集(Longpre等,2023)上進行微調(diào)。FLAN數(shù)據(jù)集是一個包含(instruction, output)對的集合,通過在統(tǒng)一任務(wù)規(guī)範下使用各種指令填充模板,從12個自然語言處理任務(wù)的62個數(shù)據(jù)集構(gòu)建而成(例如,自然語言推理、常識推理、釋義生成)。
在微調(diào)過程中,FLAN-T5采用基于JAX的T5X框架,并在每2k步時選擇在預(yù)留任務(wù)上評估的最佳模型。與T5的預(yù)訓(xùn)練階段相比,微調(diào)過程消耗0.2%的計算資源(大約128個TPU v4芯片,耗時37小時)。
對于評估,FLAN-T5(11B)優(yōu)于T5(11B),在少樣本設(shè)置中實現(xiàn)了與更大模型(如PaLM(60B)(Chowdhery等,2022))相當?shù)慕Y(jié)果。FLAN-T5在MMLU(Hendrycks等,2020)、BBH(Suzgun等,2022)、TyDiQA(Clark等,2020)、MGSM(Shi等,2022)、開放式生成以及RealToxicityPrompts(Gehman等,2020)方面分別優(yōu)于T5 +18.9%、+12.3%、+4.1%、+5.8%、+2.1%和+8%。在少樣本設(shè)置中,FLAN-T5在BBH和TyDiQA數(shù)據(jù)集上分別優(yōu)于PaLM +1.4%和+1.2%。
4.4、Alpaca:基于LLaMA模型+利用InstructGPT生成指令數(shù)據(jù)集進行微調(diào),8*A100-80G設(shè)備+混合精度AMP+DP=3小時
LLMs之Alpaca:《Alpaca: A Strong, Replicable Instruction-Following Model》翻譯與解讀_一個處女座的程序猿的博客-CSDN博客
Alpaca (7B) (Taori et al., 2023) is a language model trained by fine-tuning LLaMA (7B) (Touvron et al., 2023a) on the constructed instruction dataset generated by InstructGPT (175B, text-davinci-003) (Ouyang et al., 2022). The fine-tuning process takes around 3 hours on an 8-card 80GB A100 device with mixed precision training and fully shared data parallelism.
Alpaca (7B) achieves comparable performances to InstructGPT (175B,text-davinci-003) in terms of human evaluation. Specifically, Alpaca outperforms InstructGPT on the self-instruct dataset, garnering 90 instances of victories compared to 89 instances.
Alpaca(7B)(Taori等,2023)是在LLaMA(7B)(Touvron等,2023a)的基礎(chǔ)上,使用由InstructGPT(175B,text-davinci-003)(Ouyang等,2022)生成的指令數(shù)據(jù)集微調(diào)得到的語言模型。微調(diào)過程在8卡80GB A100設(shè)備上進行,使用混合精度訓(xùn)練和完全共享的數(shù)據(jù)並行技術(shù),大約耗時3小時。
Alpaca(7B)在人類評估方面表現(xiàn)與InstructGPT(175B,text-davinci-003)相當。具體來說,Alpaca在自我指導(dǎo)數(shù)據(jù)集上優(yōu)于InstructGPT,獲得了90次勝利,而InstructGPT獲得了89次。
4.5、Vicuna:基于LLaMA模型+利用ShareGPT的ChatGPT生成對話數(shù)據(jù)集(過濾低質(zhì)得70K)進行微調(diào),上下文擴到2K+GradientCheckpointing和FlashAttention(降低GPU成本)+8*A100-80G=24小時
LLMs之Vicuna:《Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality》翻譯與解讀_一個處女座的程序猿的博客-CSDN博客
Vicuna (13B) (Chiang et al., 2023) is a language model trained by fine-tuning LLaMA (13B) (Touvron et al., 2023a) on the conversational dataset generated by ChatGPT.
The authors gathered user-shared ChatGPT conversations from ShareGPT.com, and got 70K conversation records after filtering out low-quality samples. LLaMA (13B) was fine-tuned on the constructed conversation dataset using a modified loss function tailored to multi-turn conversations. To better understand long context across multiple- turn dialog, the authors expanded the max context length from 512 to 2048. For training, the authors adopted the gradient checkpointing and flash attention (Dao et al., 2022) techniques to reduce the GPU memory cost in the fine-tuning process. The fine-tuning process takes 24 hours on an 8 × 80GB A100 device with fully shared data parallelism.
The authors built a test set used exclusively to measure chatbots’ performances. They collected a test set composed of 8 question categories, such as Fermi problems, role play scenarios, coding/math tasks, etc, and then asked GPT-4 (OpenAI, 2023) to rate models’ responses considering helpfulness, relevance, accuracy, and detail. On the constructed test set, Vicuna (13B) outperforms Alpaca (13B) (Taori et al., 2023) and LLaMA (13B) in 90% of the test questions, and generates equal or better rating responses compared to ChatGPT in 45% of the questions.
Vicuna(13B)(Chiang等,2023)是在LLaMA(13B)(Touvron等,2023a)的基礎(chǔ)上,使用由ChatGPT生成的對話數(shù)據(jù)集微調(diào)得到的語言模型。
作者從ShareGPT.com收集了用戶分享的ChatGPT對話,并在濾除低質(zhì)量樣本后獲得了70K個對話記錄。使用經(jīng)過修改的適用于多輪對話的損失函數(shù)對LLaMA(13B)進行了微調(diào)。
為了更好地理解多輪對話中的長上下文,作者將最大上下文長度從512擴展到2048。在訓(xùn)練過程中,作者采用了GradientCheckpointing和FlashAttention(Dao等,2022)技術(shù),以減少微調(diào)過程中的GPU內(nèi)存成本。微調(diào)過程在8個80GB A100設(shè)備上進行,使用完全共享的數(shù)據(jù)并行技術(shù),耗時24小時。
作者構(gòu)建了一個專門用於衡量聊天機器人表現(xiàn)的測試集。他們收集了一個由8個問題類別組成的測試集,例如費米問題、角色扮演情景、編碼/數(shù)學(xué)任務(wù)等,然後要求GPT-4(OpenAI,2023)根據(jù)有用性、相關(guān)性、準確性和細節(jié)對模型的響應(yīng)進行評分。在構(gòu)建的測試集上,Vicuna(13B)在90%的測試問題中優(yōu)於Alpaca(13B)和LLaMA(13B),並在45%的問題中生成與ChatGPT相等或更好的評分響應(yīng)。
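對於面向多輪對話的“修改損失函數(shù)”,社區(qū)常見做法是只在助手回復(fù)的token上計算損失、把用戶發(fā)言的label置為-100。下面是這種做法的極簡示意(並非Vicuna官方訓(xùn)練代碼):

```python
# 極簡示意:多輪對話的損失掩碼——只在“助手回復(fù)”的token上計算損失
import torch

IGNORE_INDEX = -100

def build_multiturn_labels(token_ids: torch.Tensor, segments: list) -> torch.Tensor:
    """segments: [(起始位置, 結(jié)束位置, 角色)],角色為"user"或"assistant"。"""
    labels = torch.full_like(token_ids, IGNORE_INDEX)
    for start, end, role in segments:
        if role == "assistant":
            labels[start:end] = token_ids[start:end]   # 僅助手回復(fù)參與訓(xùn)練
    return labels

if __name__ == "__main__":
    ids = torch.arange(10)
    segs = [(0, 4, "user"), (4, 7, "assistant"), (7, 10, "user")]
    print(build_multiturn_labels(ids, segs))
```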
4.6、GPT-4-LLM:基于LLaMA模型+利用Alpaca的指令和GPT-4生成指令數(shù)據(jù)集進行有監(jiān)督微調(diào)→基于構(gòu)建比較數(shù)據(jù)集(收集GPT-4、InstructGPT 等多個大模型的指令響應(yīng)+GPT-4對響應(yīng)評分1~10分)訓(xùn)練RM模型(PPO優(yōu)化),8*A100-80G+AMP+DP=3小時
AIGC之GPT-4:GPT-4的簡介(核心原理/意義/亮點/技術(shù)點/缺點/使用建議)、使用方法、案例應(yīng)用(計算能力/代碼能力/看圖能力等)之詳細攻略_一個處女座的程序猿的博客-CSDN博客
GPT-4-LLM (7B) (Peng et al., 2023) is a language model trained by fine-tuning LLaMA (7B) (Touvron et al., 2023a) on the GPT-4 (OpenAI, 2023) generated instruction dataset. GPT-4-LLM is initialized with LLaMA, then fine-tuned in the following two steps: (1) supervised fine- tuning on the constructed instruction dataset. The authors used the instructions from Alpaca (Taori et al., 2023), and then collected responses using GPT-4. LLaMA is fine-tuned on the GPT-4 generated dataset. The fine-tuning process takes approximately three hours on an 8*80GB A100 machine with mixed precision and fully shared data parallelism. (2) optimizing the step-1 model using the proximal policy optimization (PPO) (Schulman et al., 2017) method, the authors first built a comparison dataset by collecting responses from GPT-4, InstructGPT (Ouyang et al., 2022), and OPT-IML (Iyer et al., 2022) to a collection of instructions and then asked GPT-4 to rate each response from 1 to 10. Using the ratings, a reward model is trained based on OPT (Zhang et al., 2022a). The fine-tuned model from Step 1 is optimized by using the reward model to compute the policy gradient.?
For evaluations, GPT-4-LLM (7B) outperforms not only the baseline model Alpaca (7B), but also larger models including Alpaca (13B) and LLaMA (13B). For automated evaluation, GPT-4-LLM (7B) outperforms Alpaca by 0.2, 0.5, and 0.7 on User-Oriented-Instructions-252 (Wang et al., 2022c), Vicuna-Instructions (Chiang et al., 2023), and Unnatural Instructions (Honovich et al., 2022) datasets, respectively. For human evaluation, regarding aspects including helpfulness, honesty, and harmlessness, GPT-4-LLM outperforms Alpaca by 11.7, 20.9, and 28.6 respectively.
GPT-4-LLM(7B)(Peng等,2023)是在LLaMA(7B)(Touvron等,2023a)的基礎(chǔ)上,使用由GPT-4(OpenAI,2023)生成的指令數(shù)據(jù)集微調(diào)得到的語言模型。
GPT-4-LLM首先使用LLaMA進行初始化,然后在以下兩個步驟中進行微調(diào):
(1)在構(gòu)建的指令數(shù)據(jù)集上進行監(jiān)督微調(diào)。作者使用了Alpaca的指令,然后使用GPT-4生成了響應(yīng)。LLaMA在由GPT-4生成的數(shù)據(jù)集上進行微調(diào)。微調(diào)過程在8個80GB A100設(shè)備上使用混合精度和完全共享的數(shù)據(jù)并行技術(shù),大約耗時三小時。
(2)使用近端策略優(yōu)化(PPO) (Schulman et al., 2017)方法優(yōu)化step-1模型,作者首先通過收集GPT-4、InstructGPT (Ouyang et al., 2022)和OPT-IML (Iyer et al., 2022)對指令集合的響應(yīng)構(gòu)建比較數(shù)據(jù)集,然后要求GPT-4對每個響應(yīng)進行1到10的評分。使用評級,基于OPT訓(xùn)練獎勵模型(Zhang et al., 2022a)。通過使用獎勵模型來計算策略梯度,對步驟1的微調(diào)模型進行優(yōu)化。
在評估方面,GPT-4-LLM(7B)不僅優(yōu)於基準模型Alpaca(7B),還優(yōu)於更大的模型,包括Alpaca(13B)和LLaMA(13B)。在自動評估方面,GPT-4-LLM(7B)在User-Oriented-Instructions-252(Wang等,2022c)、Vicuna-Instructions(Chiang等,2023)和Unnatural Instructions(Honovich等,2022)數(shù)據(jù)集上分別優(yōu)於Alpaca 0.2、0.5和0.7。在人類評估方面,在有用性、誠實性和無害性三個方面,GPT-4-LLM分別優(yōu)於Alpaca 11.7、20.9和28.6。
4.7、Claude:基于數(shù)據(jù)集(52K指令和GPT-4生成的響應(yīng)配對)進行SFT→基于構(gòu)建比較數(shù)據(jù)集(收集GPT-3等多個大模型的指令響應(yīng)+GPT-4對響應(yīng)評分)訓(xùn)練RM模型(PPO優(yōu)化),8*A100-80G+AMP+DP=8小時
Claude is a language model trained by fine-tuning the pre-trained language model on an instruction dataset, aiming to generate helpful and harmless responses. The fine-tuning process consists of two stages: (1) supervised fine-tuning on the instruction dataset. The authors created an instruction dataset by collecting 52K different instructions, paired with responses generated by GPT-4. The fine- tuning process takes approximately eight hours on an 8-card 80GB A100 machine with mixed precision and fully shared data parallelism. (2) optimizing the step-1 model with the proximal policy optimization (Schulman et al., 2017) method. The authors first built a comparison dataset by collecting responses from multiple large language models (e.g., GPT-3 (Brown et al., 2020b)) to the given collection of instructions and then asking GPT-4 (OpenAI, 2023) to rate each response. Using the ratings, a reward model is trained. Then, the fine-tuned model from Step 1 is optimized using the reward model with the proximal policy optimization method.
Claude generates more helpful and harmless responses compared to the backbone model. For automatic evaluations, Claude outperforms GPT- 3 by 7% on the RealToxicityPrompts (Gehman et al., 2020) in terms of toxicity. For human evaluations, regarding four different aspects, including following correct instructions, following explicit constraints, fewer hallucinations, and generating appropriate responses, Claude outperforms GPT-3 (Brown et al., 2020b) +10%,+20%, -20%, and +10%. respectively.
Claude是一種語言模型,通過對預(yù)訓(xùn)練語言模型在指令數(shù)據(jù)集上進行微調(diào),旨在生成有幫助且無害的響應(yīng)。微調(diào)過程包括兩個階段:
(1)在指令數(shù)據(jù)集上進行監(jiān)督微調(diào)。作者通過收集了52K個不同的指令,并與GPT-4生成的響應(yīng)配對,創(chuàng)建了一個指令數(shù)據(jù)集。微調(diào)過程在8卡80GB A100設(shè)備上使用混合精度和完全共享的數(shù)據(jù)并行技術(shù),大約耗時八小時。
(2)使用近端策略優(yōu)化(Schulman等,2017)方法優(yōu)化步驟1中的模型。作者首先通過收集多個大型語言模型(如GPT-3(Brown等,2020b))對給定指令的響應(yīng),并要求GPT-4對每個響應(yīng)進行評分,來構(gòu)建比較數(shù)據(jù)集。使用這些評分,訓(xùn)練了一個獎勵模型。然后,使用獎勵模型使用近端策略優(yōu)化方法優(yōu)化步驟1中的微調(diào)模型。
與骨干模型相比,Claude生成的響應(yīng)更有幫助且無害。在自動評估方面,Claude在RealToxicityPrompts(Gehman等,2020)方面優(yōu)于GPT-3 7%。在人類評估方面,關(guān)于遵循正確指令、遵循明確約束、幻覺較少以及生成適當響應(yīng)等四個不同方面,Claude分別優(yōu)于GPT-3 +10%、+20%、-20%和+10%。
4.8、WizardLM:基于LLaMA模型+Evol-Instruct指令數(shù)據(jù)集(ChatGPT生成)微調(diào),8*V100 GPU+Deepspeed Zero-3技術(shù)+3個epochs =70小時
WizardLM (7B) (Xu et al., 2023a) is a language model trained by fine-tuning LLaMA (7B) (Touvron et al., 2023a) on the instruction dataset Evol-Instruct generated by ChatGPT (details see Section 3.7). It is fine-tuned on a subset (with 70K) of Evol-Instruct to enable a fair comparison with Vicuna (Chiang et al., 2023). The fine-tuning process takes approximately 70 hours on 3 epochs based on an 8 V100 GPU with the Deepspeed Zero-3 (Rasley et al., 2020) technique. During inference, the max generation length is 2048.
To evaluate LLMs’ performances on complex instructions, the authors collected 218 human- generated instructions from real scenarios (e.g., open-source projects, platforms, and forums), called Evol-Instruct testset.
Evaluations are conducted on the Evol-Instruct testset and Vicuna’s testset. For human evaluation, WizardLM outperforms Alpaca (7B) (Taori et al., 2023) and Vicuna (7B) by a large margin, and generates equal or better responses on 67% of test samples compared to ChatGPT. Automatic evaluation is conducted by asking GPT-4 to rate LLMs’ responses. Specifically, WizardLM gains performance boosts compared to Alpaca by +6.2% and +5.3% on the Evol-Instruct testset and Vicuna’s testset. WizardLM outperforms Vicuna by +5.8% on the Evol-Instruct testset and +1.7% on Vicuna’s testset.
WizardLM(7B)(Xu等,2023a)是在LLaMA(7B)(Touvron等,2023a)的基礎(chǔ)上,使用由ChatGPT生成的Evol-Instruct指令數(shù)據(jù)集微調(diào)得到的語言模型(詳見第3.7節(jié))。它在Evol-Instruct的一個子集(含70K條指令)上進行微調(diào),以便與Vicuna(Chiang等,2023)進行公平比較。微調(diào)過程基於8個V100 GPU和Deepspeed Zero-3(Rasley等,2020)技術(shù),訓(xùn)練3個epoch,耗時約70小時。推理過程中,最大生成長度為2048。
為了評估LLM在復(fù)雜指令上的性能,作者從實際情境(例如開源項目、平臺和論壇)中收集了218個人工生成的指令,稱為Evol-Instruct測試集。評估在Evol-Instruct測試集和Vicuna的測試集上進行。在人類評估中,WizardLM在絕大多數(shù)情況下都優(yōu)于Alpaca(7B)(Taori等,2023)和Vicuna(7B),并且與ChatGPT相比,在67%的測試樣本上生成相等或更好的響應(yīng)。自動評估通過要求GPT-4對LLM的響應(yīng)進行評分進行,其中更高的得分意味著更好的性能。具體來說,在Evol-Instruct測試集和Vicuna的測試集上,WizardLM在比較上優(yōu)于Alpaca +6.2%、+5.3%。WizardLM在Evol-Instruct測試集上優(yōu)于Vicuna +5.8%,在Vicuna的測試集上優(yōu)于Vicuna +1.7%。
4.9、ChatGLM2:基于GLM模型+中英文指令(1:1)的雙語數(shù)據(jù)集(1.4T的tokens),類似InstructGPT的三步微調(diào)策略+上下文長度擴展到32K+MQA/CM策略(降GPU成本)+需13GB的顯存(INT4量化后需6GB)
LLMs之ChatGLM2:ChatGLM2-6B的簡介、安裝、使用方法之詳細攻略_一個處女座的程序猿的博客-CSDN博客
ChatGLM2 (6B) (Du et al., 2022) is a language model trained by fine-tuning GLM (6B) (Du et al., 2022) on a bilingual dataset that contains both English and Chinese instructions. The bilingual instruction dataset contains 1.4T tokens, with a 1:1 ratio of Chinese to English. Instructions in the dataset are sampled from the question-answering and dialogue completion tasks. ChatGLM2 is initialized with GLM, then trained by the three-step fine-tuning strategy, which is akin to InstructGPT (Ouyang et al., 2022). To better model contextual information across multi-turn conversations, the authors expanded the maximum context length from 1024 to 32K. To reduce GPU memory cost in the fine-tuning stage, the authors employed multi-query attention and causal mask strategies. During inference, ChatGLM2 requires 13GB GPU memory with FP16 and supports conversations up to 8K in length with 6GB GPU memory using the INT4 model quantization technique.
Evaluations are conducted on four English and Chinese benchmarks, including MMLU (English) (Hendrycks et al., 2020), C-Eval (Chinese) (Huang et al., 2023), GSM8K (Math) (Cobbe et al., 2021), and BBH (English) (Suzgun et al., 2022). ChatGLM2 (6B) outperforms GLM (6B) and the baseline model ChatGLM (6B) on all benchmarks. Specifically, ChatGLM2 outperforms GLM by +3.1 on MMLU, +5.0 on C-Eval, +8.6 on GSM8K, and +2.2 on BBH. ChatGLM2 achieves better performances than ChatGLM by +2.1, +1.2, +0.4, +0.8 on MMLU, C-Eval, GSM8K and BBH, respectively.
ChatGLM2(6B)(Du等,2022)是在GLM(6B)(Du等,2022)的基礎(chǔ)上,使用包含英文和中文指令的雙語數(shù)據(jù)集微調(diào)得到的語言模型。雙語指令數(shù)據(jù)集包含1.4T個token,中英比例為1:1。數(shù)據(jù)集中的指令來自問答和對話補全任務(wù)。ChatGLM2以GLM初始化,然後采用類似於InstructGPT(Ouyang等,2022)的三步微調(diào)策略進行訓(xùn)練。
為了更好地對多輪對話中的上下文信息進行建模,作者將最大上下文長度從1024擴展到32K。為了在微調(diào)階段降低GPU內(nèi)存成本,作者采用了多查詢注意力MQA和因果掩碼CM策略。在推理過程中,ChatGLM2需要13GB的GPU內(nèi)存,使用FP16支持最大長度為8K的對話,使用INT4模型量化技術(shù)時只需要6GB的GPU內(nèi)存。
評估在四個英文和中文基準數(shù)據(jù)集上進行,包括MMLU(英文)(Hendrycks等,2020)、C-Eval(中文)(Huang等,2023)、GSM8K(數(shù)學(xué))(Cobbe等,2021)和BBH(英文)(Suzgun等,2022)。ChatGLM2(6B)在所有基準數(shù)據(jù)集上優(yōu)于GLM(6B)和基準模型ChatGLM(6B)。具體來說,ChatGLM2在MMLU上優(yōu)于GLM +3.1,在C-Eval上優(yōu)于GLM +5.0,在GSM8K上優(yōu)于GLM +8.6,在BBH上優(yōu)于GLM +2.2。ChatGLM2在MMLU、C-Eval、GSM8K和BBH上的性能也優(yōu)于ChatGLM +2.1、+1.2、+0.4、+0.8。
4.10、LIMA:基于LLaMA模型+基于表面對齊假設(shè)構(gòu)建的指令數(shù)據(jù)集,提出了表面對齊假設(shè)并驗證了其效果
LIMA (65B) (Zhou et al., 2023) is a large language model trained by fine-tuning LLaMA (65B) (Touvron et al., 2023a) on an instruction dataset, which is constructed based on the proposed superficial alignment hypothesis.
The superficial alignment hypothesis refers to the idea that the knowledge and capabilities of a model are almost acquired in the pre-training stage, while the alignment training (e.g., instruction fine-tuning) teaches models to generate responses under user-preferred formalizations. Based on the superficial alignment hypothesis, the authors claimed that large language models can generate user-satisfied responses by fine-tuning it on a small fraction of instruction data. Therefore, the authors built instruction train/valid/test sets to verify this hypothesis.
Evaluations are conducted on the constructed test set. For human evaluations, LIMA outperforms InstructGPT and Alpaca by 17% and 19%, respectively. Additionally, LIMA achieves comparable results to BARD, Claude, and GPT-4. For automatic evaluation, which is conducted by asking GPT-4 to rate responses and a higher rate score denotes better performance, LIMA outperforms InstructGPT and Alpaca by 20% and 36%, respectively, achieving comparable results to BARD, while underperforming Claude and GPT-4. Experimental results verify the proposed superficial alignment hypothesis.
LIMA(65B)(Zhou等,2023)是在LLaMA(65B)(Touvron等,2023a)的基礎(chǔ)上,使用依據(jù)所提出的“表面對齊假設(shè)”構(gòu)建的指令數(shù)據(jù)集微調(diào)得到的大型語言模型。表面對齊假設(shè)指的是:模型的知識和能力幾乎都是在預(yù)訓(xùn)練階段獲得的,而對齊訓(xùn)練(例如指令微調(diào))只是教導(dǎo)模型在用戶偏好的形式下生成響應(yīng)?;哆@一表面對齊假設(shè),作者認為只需在少量指令數(shù)據(jù)上微調(diào),大型語言模型就能生成令用戶滿意的響應(yīng)。因此,作者構(gòu)建了指令訓(xùn)練/驗證/測試集來驗證這一假設(shè)。
評估在構(gòu)建的測試集上進行。在人類評估中,LIMA分別以17%和19%的優(yōu)勢優(yōu)於InstructGPT和Alpaca,並取得與BARD、Claude和GPT-4相當?shù)慕Y(jié)果。在自動評估方面(讓GPT-4對響應(yīng)進行評分,得分越高表示性能越好),LIMA分別優(yōu)於InstructGPT和Alpaca 20%和36%,與BARD結(jié)果相當,但不如Claude和GPT-4。實驗結(jié)果驗證了所提出的表面對齊假設(shè)。
4.11、Others
OPT-IML:基于OPT模型+微調(diào)IML數(shù)據(jù)集
LLMs:《OPT: Open Pre-trained Transformer Language Models》翻譯與解讀_一個處女座的程序猿的博客-CSDN博客
Dolly 2:基于Pythia模型+微調(diào)databricks-dolly-15k指令數(shù)據(jù)集
OPT-IML (175B) (Iyer et al., 2022) is a large language model trained by fine-tuning the OPT (175B) (Zhang et al., 2022a) model on the constructed Instruction Meta-Learning (IML) dataset, which consists of over 1500 NLP tasks from 8 publicly available benchmarks such as PromptSource (Bach et al., 2022), FLAN (Longpre et al., 2023), and Super-NaturalInstructions (Wang et al., 2022d). After fine-tuning, OPT-IML outperforms OPT across all benchmarks.
Dolly 2.0 (12B) (Conover et al., 2023a) is initialized with the pre-trained language model Pythia (12B) (Biderman et al., 2023), and fine-tuned on the instruction dataset databricks-dolly-15k, which contains 7 categories of NLP tasks such as text classification and information extraction. After fine-tuning, Dolly 2.0 (12B) outperforms Pythia (12B) on the EleutherAI LLM Evaluation Harness benchmark (Gao et al., 2021) by a large margin, and achieves comparable performances to GPT-NEOX (20B) (Black et al., 2022), which has nearly two times more parameters than Dolly 2.0 (12B).
OPT-IML(175B)(Iyer等,2022)是一種大型語言模型,通過對構(gòu)建的Instruction Meta-Learning(IML)數(shù)據(jù)集上的OPT(175B)(Zhang等,2022a)模型進行微調(diào),該數(shù)據(jù)集包含來自8個公開可用基準數(shù)據(jù)集的1500多個NLP任務(wù),如PromptSource(Bach等,2022)、FLAN(Longpre等,2023)和Super-NaturalInstructions(Wang等,2022d)。微調(diào)后,OPT-IML在所有基準數(shù)據(jù)集上優(yōu)于OPT。
Dolly 2.0(12B)(Conover等,2023a)以預(yù)訓(xùn)練語言模型Pythia(12B)(Biderman等,2023)進行初始化,并在databricks-dolly-15k指令數(shù)據(jù)集上進行微調(diào),該數(shù)據(jù)集包含文本分類和信息提取等7類NLP任務(wù)。微調(diào)后,Dolly 2.0(12B)在EleutherAI LLM評估套件基準(Gao等,2021)上遠遠優(yōu)于Pythia(12B),并與參數(shù)量約為其兩倍的GPT-NEOX(20B)(Black等,2022)達到相當?shù)男阅堋?
Falcon-Instruct:基于Falcon模型+微調(diào)英語對話數(shù)據(jù)集(Baize數(shù)據(jù)集150M/1.5億tokens+RefinedWeb數(shù)據(jù)集),降內(nèi)存(Flash Attention+MQ)
LLMs之Data:《The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only》翻譯與解讀
https://yunyaniu.blog.csdn.net/article/details/131137560
Guanaco:基于LLaMA+微調(diào)多語言對話數(shù)據(jù)集(源自包含52K英文指令數(shù)據(jù)對的Alpaca+534K的多輪對話的多語言)
LLMs之Guanaco:《QLoRA:Efficient Finetuning of Quantized LLMs》翻譯與解讀
LLMs之Guanaco:《QLoRA:Efficient Finetuning of Quantized LLMs》翻譯與解讀_一個處女座的程序猿的博客-CSDN博客
Falcon-Instruct (40B) (Almazrouei et al., 2023a) is a large language model trained by fine-tuning Falcon (40B) (Almazrouei et al., 2023b) on an English dialogue dataset, which contains 150 million tokens from the Baize dataset (Xu et al., 2023c), with an additional 5% of the data from the RefinedWeb dataset (Penedo et al., 2023). To reduce memory usage, the authors employed flash attention (Dao et al., 2022) and multi-query techniques. For evaluation, Falcon-Instruct (40B) achieved better performance on the Open LLM Leaderboard (Beeching et al., 2023) compared to the baseline model Falcon (40B), and outperforms Guanaco (65B), which has more model parameters.
Guanaco (7B) (JosephusCheung, 2021) is a multi-turn dialog language model trained by fine-tuning LLaMA (7B) (Touvron et al., 2023a) on the constructed multilingual dialogue dataset. The multilingual dialogue dataset comes from two sources: Alpaca (Taori et al., 2023), which contains 52K English instruction data pairs; and a multilingual (e.g., Simplified Chinese, Traditional Chinese, Japanese, German) dialogue dataset, which contains 534K+ multi-turn conversations. After fine-tuning, Guanaco is able to generate role-specific responses and continuous responses on a given topic in multi-turn conversations.
Falcon-Instruct(40B)(Almazrouei等人,2023a)是一個大型語言模型,它是通過對Falcon(40B)(Almazrouei等人,2023b)在英語對話數(shù)據(jù)集上進行微調(diào)訓(xùn)練而成的,該數(shù)據(jù)集包含來自Baize數(shù)據(jù)集(Xu等人,2023c)的1.5億個令牌,以及來自RefinedWeb數(shù)據(jù)集(Penedo等人,2023)的額外5%的數(shù)據(jù)。為了減少內(nèi)存使用,作者采用了Flash Attention(Dao et al., 2022)和多查詢技術(shù)。在評估中,Falcon-Instruct(40B)在Open LLM排行榜(Beeching et al., 2023)上的表現(xiàn)優(yōu)于基線模型Falcon(40B),并優(yōu)于模型參數(shù)更多的Guanaco(65B)。
Guanaco(7B)(JosephusCheung,2021)是一種多輪對話語言模型,通過在構(gòu)建的多語言對話數(shù)據(jù)集上進行微調(diào),使用LLaMA(7B)(Touvron等,2023a)進行初始化。多語言對話數(shù)據(jù)集來自兩個來源:包含52K英文指令數(shù)據(jù)對的Alpaca(Taori等,2023);以及包含534K+多輪對話的多語言(例如簡體中文、繁體中文、日語、德語)對話數(shù)據(jù)。微調(diào)后,Guanaco用于在多輪對話中生成針對角色的響應(yīng)和給定主題的連續(xù)響應(yīng)。
Minotaur:基于Starcoder Plus模型+微調(diào)WizardLM和GPTeacher-General-Instruct指令數(shù)據(jù)集
Nous-Herme:基于LLaMA模型+微調(diào)BiologyPhysicsChemistry子集的300K個指令
Minotaur (15B) is a large language model trained by fine-tuning the Starcoder Plus (15B) (Li et al., 2023f) on open-source instruction datasets including WizardLM (Xu et al., 2023a) and GPTeacher-General-Instruct. For model inference, Minotaur supports a maximum context length of 18K tokens.
Nous-Herme (13B) is a large language model trained by fine-tuning LLaMA (13B) (Touvron et al., 2023a) on an instruction dataset, which contains over 300k instructions, sampled from GPTeacher, CodeAlpaca (Chaudhary, 2023), GPT-4-LLM (Peng et al., 2023), Unnatural Instructions (Honovich et al., 2022), and the BiologyPhysicsChemistry subsets in Camel-AI (Li et al., 2023c). Responses are generated by GPT-4. For evaluations, Nous-Herme (13B) achieves comparable performances to GPT-3.5-turbo on multiple tasks like the ARC challenge (Clark et al., 2018) and BoolQ (Clark et al., 2019).
Minotaur(15B)是一種大型語言模型,通過在包括WizardLM(Xu等,2023a)和GPTeacher-General-Instruct在內(nèi)的開源指令數(shù)據(jù)集上,微調(diào)Starcoder Plus(15B)(Li等,2023f)。在模型推理階段,Minotaur支持最大上下文長度為18K標記。
Nous-Herme(13B)是一種大型語言模型,通過在包含超過300K條指令的指令數(shù)據(jù)集上微調(diào)LLaMA(13B)(Touvron等,2023a)得到;這些指令采樣自GPTeacher、CodeAlpaca(Chaudhary,2023)、GPT-4-LLM(Peng等,2023)、Unnatural Instructions(Honovich等,2022)以及Camel-AI(Li等,2023c)中的BiologyPhysicsChemistry子集,響應(yīng)由GPT-4生成。評估結(jié)果顯示,Nous-Herme(13B)在ARC挑戰(zhàn)(Clark等,2018)和BoolQ(Clark等,2019)等多個任務(wù)上與GPT-3.5-turbo的性能相當。
TüLU:基于OPT模型+微調(diào)混合指令數(shù)據(jù)集
YuLan-Chat:基于LLaMA模型+微調(diào)雙語數(shù)據(jù)集(25萬個中英文指令對)
TüLU (6.7B) (Wang et al., 2023c) is a large language model trained by fine-tuning OPT (6.7B) (Zhang et al., 2022a) on a mixed instruction dataset, which contains FLAN V2 (Longpre et al., 2023), CoT (Wei et al., 2022), Dolly (Conover et al., 2023a), Open Assistant-1, GPT4-Alpaca, Code-Alpaca (Chaudhary, 2023), and ShareGPT. After fine-tuning, TüLU (6.7B) reaches on average 83% of ChatGPT's performance and 68% of GPT-4's performance.
YuLan-Chat (13B) (YuLan-Chat-Team, 2023) is a language model trained by fine-tuning LLaMA (13B) (Touvron et al., 2023a) on a constructed bilingual dataset, which contains 250,000 Chinese-English instruction pairs. After fine-tuning, YuLan-Chat-13B achieves comparable results to the state-of-the-art open-source model ChatGLM (6B) (Du et al., 2022), and outperforms Vicuna (13B) (Chiang et al., 2023) on the English BBH3K (BBH3K is a subset of the BBH benchmark (Srivastava et al., 2022)) dataset.
TüLU (6.7B)(Wang等人,2023c)是在混合指令數(shù)據(jù)集上通過對OPT (6.7B)(Zhang等人,2022a)進行微調(diào)而訓(xùn)練的大型語言模型,該數(shù)據(jù)集包含F(xiàn)LAN V2(Longpre等人,2023)、CoT(Wei等人,2022)、Dolly(Conover等人,2023a)、Open Assistant-1、GPT4-Alpaca、Code-Alpaca(Chaudhary, 2023)和ShareGPT。經(jīng)過微調(diào),TüLU (6.7B)平均達到ChatGPT性能的83%和GPT-4性能的68%。
YuLan-Chat (13B)(YuLan-Chat-Team, 2023)是通過在包含25萬個中英文指令對的構(gòu)建雙語數(shù)據(jù)集上微調(diào)LLaMA (13B)(Touvron et al., 2023a)訓(xùn)練得到的語言模型。經(jīng)過微調(diào),YuLan-Chat-13B取得了與最先進的開源模型ChatGLM (6B)(Du等人,2022)相當?shù)慕Y(jié)果,并在英語BBH3K數(shù)據(jù)集(BBH3K是BBH基準(Srivastava et al., 2022)的一個子集)上優(yōu)于Vicuna (13B)(Chiang等人,2023)。
MOSS:微調(diào)對話指令的雙語對話語言模型
Airoboros:基于LLaMA+微調(diào)Self-instruct數(shù)據(jù)集
UltraLM:基于LLaMA模型+微調(diào)
MOSS (16B) is a bilingual dialogue language model, which aims to engage in multi-turn conversations and utilize various plugins, trained by fine-tuning on dialogue instructions. After fine-tuning, MOSS outperforms the backbone model and generates responses that better align with human preferences.
Airoboros (13B) is a large language model trained by fine-tuning LLAMA (13B) (Touvron et al., 2023a) on the Self-instruct dataset (Wang et al., 2022c). After fine-tuning, Airoboros significantly outperforms LLAMA (13B) (Touvron et al., 2023a) on all benchmarks and achieves highly comparable results to models fine-tuned specifically for certain benchmarks.
UltraLM (13B) (Ding et al., 2023a) is a large language model trained by fine-tuning LLaMA (13B) (Touvron et al., 2023a). For evaluation, UltraLM (13B) outperforms Dolly (12B) (Conover et al., 2023a) and achieves a winning rate of up to 98%. Additionally, it surpasses the previous best open-source models (i.e., Vicuna (Chiang et al., 2023) and WizardLM (Xu et al., 2023a)) with winning rates of 9% and 28%, respectively.
MOSS(16B)是一種雙語對話語言模型,旨在進行多輪對話并利用各種插件,通過在對話指令上微調(diào)訓(xùn)練得到。微調(diào)后,MOSS優(yōu)于骨干模型,并能生成與人類偏好更加一致的響應(yīng)。
Airoboros(13B)是一種大型語言模型,通過在Self-instruct數(shù)據(jù)集(Wang等,2022c)上微調(diào)LLaMA(13B)(Touvron等,2023a)得到。微調(diào)后,Airoboros在所有基準數(shù)據(jù)集上明顯優(yōu)于LLaMA(13B),并與專門針對某些基準測試微調(diào)的模型取得了高度可比的結(jié)果。
UltraLM(13B)(Ding等,2023a)通過對LLaMA(13B)(Touvron等,2023a)進行微調(diào)獲得,微調(diào)后在性能上優(yōu)于Dolly(12B)(Conover等,2023a),勝率高達98%。此外,它還超越了之前的最佳開源模型(即Vicuna和WizardLM),勝率分別為9%和28%。
5、Multi-modality Instruction Fine-tuning多模態(tài)指令微調(diào)
5.1、Multi-modality Datasets多模態(tài)數(shù)據(jù)集
MULTIINSTRUCT—多模態(tài)指令微調(diào)數(shù)據(jù)集—OFA模型:由62個不同的多模態(tài)任務(wù)組成+統(tǒng)一的序列到序列格式
MULTIINSTRUCT (Xu et al., 2022) is a multimodal instruction tuning dataset consisting of 62 diverse multimodal tasks in a unified seq-to-seq format. This dataset covers 10 broad categories and its tasks are derived from 21 existing open-sourced datasets. Each task is equipped with 5 expert-written instructions. For the existing tasks, the authors use the input/output pairs from their available open-source datasets to create instances. While for each new task, the authors create 5k to 5M instances by extracting the necessary information from instances of existing tasks or reformulating them. The MULTIINSTRUCT dataset has demonstrated its efficiency in enhancing various transfer learning techniques. For example, fine-tuning the OFA model (930M) (Wang et al., 2022a) using various transfer learning strategies such as Mixed Instruction Tuning and Sequential Instruction Tuning on MULTIINSTRUCT improves the zero-shot performance across all unseen tasks. On the commonsense VQA task, OFA fine-tuned on MULTIINSTRUCT achieves 50.60 on RougeL and 31.17 on accuracy, while the original OFA achieves 14.97 on RougeL and 0.40 on accuracy.
MULTIINSTRUCT(Xu等,2022)是一個多模態(tài)指令微調(diào)數(shù)據(jù)集,由62個不同的多模態(tài)任務(wù)組成,以統(tǒng)一的序列到序列格式呈現(xiàn)。該數(shù)據(jù)集涵蓋10個廣泛的類別,其任務(wù)來自21個現(xiàn)有的開源數(shù)據(jù)集。每個任務(wù)配備了5個專家編寫的指令。
>> 對于現(xiàn)有任務(wù),作者使用其可用的開源數(shù)據(jù)集中的輸入/輸出對創(chuàng)建實例。
>> 而對于每個新任務(wù),作者通過從現(xiàn)有任務(wù)的實例中提取必要信息或重新構(gòu)建它們來創(chuàng)建5k到5M個實例。
MULTIINSTRUCT數(shù)據(jù)集已經(jīng)證明其在增強各種遷移學(xué)習(xí)技術(shù)方面的有效性。例如,使用Mixed Instruction Tuning和Sequential Instruction Tuning等各種遷移學(xué)習(xí)策略,在MULTIINSTRUCT上對OFA模型(930M)(Wang等,2022a)進行微調(diào),可以改進所有未見任務(wù)的零樣本性能。在常識視覺問答任務(wù)上,經(jīng)過MULTIINSTRUCT微調(diào)的OFA在RougeL上達到50.60,在準確率上達到31.17,而原始OFA在RougeL上只有14.97,在準確率上只有0.40。
PMC-VQA—大規(guī)模的醫(yī)學(xué)視覺問答數(shù)據(jù)集—MedVInT模型:227k個圖像-問題對和149k個圖像,從PMC-OA收集圖像-標題對+ChatGPT生成問題-答案對+手工驗證
PMC-VQA (Zhang et al., 2023c) is a large-scale medical visual question-answering dataset that comprises 227k image-question pairs of 149k images, covering various modalities or diseases. The dataset can be used for both open-ended and multiple-choice tasks. The pipeline for generating the PMC-VQA dataset involves collecting image-caption pairs from the PMC-OA (Lin et al., 2023) dataset, using ChatGPT to generate question-answer pairs, and manually verifying a subset of the dataset for quality. The authors propose a generative-based model MedVInT for medical visual understanding by aligning visual information with a large language model. MedVInT pretrained on PMC-VQA achieves state-of-the-art performance and outperforms existing models on the VQA-RAD (Lau et al., 2018) and SLAKE (Liu et al., 2021a) benchmarks, with 81.6% accuracy on VQA-RAD and 88.0% accuracy on SLAKE.
PMC-VQA(Zhang等,2023c)是一個大規(guī)模的醫(yī)學(xué)視覺問答數(shù)據(jù)集,包括149k張圖像上的227k個圖像-問題對,涵蓋了各種模態(tài)或疾病。該數(shù)據(jù)集可用于開放式和多項選擇任務(wù)。生成PMC-VQA數(shù)據(jù)集的流程涉及從PMC-OA(Lin等,2023)數(shù)據(jù)集中收集圖像-標題對,使用ChatGPT生成問題-答案對,并對數(shù)據(jù)集的子集進行手工驗證以確保質(zhì)量。作者提出了一種基于生成的模型MedVInT,通過將視覺信息與大型語言模型對齊,實現(xiàn)醫(yī)學(xué)視覺理解。在PMC-VQA上預(yù)訓(xùn)練的MedVInT實現(xiàn)了最先進的性能,并在VQA-RAD(Lau等,2018)和SLAKE(Liu等,2021a)基準上優(yōu)于現(xiàn)有模型,VQA-RAD上的準確率為81.6%,SLAKE上的準確率為88.0%。
LAMM—2D圖像和3D點云理解:包含186K個語言-圖像指令-響應(yīng)對,以及10K個語言-點云指令-響應(yīng)對
LAMM (Yin et al., 2023) is a comprehensive multi-modal instruction tuning dataset for 2D image and 3D point cloud understanding. LAMM contains 186K language-image instruction-response pairs, and 10K language-point cloud instruction-response pairs. The authors collect images and point clouds from publicly available datasets and use the GPT-API and self-instruction methods to generate instructions and responses based on the original labels from these datasets. LAMM-Dataset includes data pairs for commonsense knowledge question answering by incorporating a hierarchical knowledge graph label system from the Bamboo (Zhang et al., 2022b) dataset and the corresponding Wikipedia descriptions. The authors also propose the LAMM-Benchmark, which evaluates existing multi-modal language models (MLLM) on various computer vision tasks. It includes 9 common image tasks and 3 common point cloud tasks, and LAMM-Framework, a primary MLLM training framework that differentiates the encoder, projector, and LLM finetuning blocks for different modalities to avoid modality conflicts.
LAMM(Yin等,2023)是一個全面的多模態(tài)指令微調(diào)數(shù)據(jù)集,用于2D圖像和3D點云理解。LAMM包含186K個語言-圖像指令-響應(yīng)對,以及10K個語言-點云指令-響應(yīng)對。作者從公開可用的數(shù)據(jù)集中收集圖像和點云,并使用GPT-API和自我指導(dǎo)方法根據(jù)這些數(shù)據(jù)集的原始標簽生成指令和響應(yīng)。通過整合來自Bamboo(Zhang等,2022b)數(shù)據(jù)集的分層知識圖譜標簽系統(tǒng)以及相應(yīng)的維基百科描述,LAMM-Dataset還包括了用于常識知識問答的數(shù)據(jù)對。作者還提出了LAMM-Benchmark,用于評估現(xiàn)有的多模態(tài)語言模型(MLLM)在各種計算機視覺任務(wù)上的性能,其中包括9個常見的圖像任務(wù)和3個常見的點云任務(wù);以及LAMM-Framework,一個基礎(chǔ)的MLLM訓(xùn)練框架,它為不同模態(tài)區(qū)分編碼器、投影器和LLM微調(diào)模塊,以避免模態(tài)沖突。
5.2、Multi-modality Instruction Fine-tuning Models多模態(tài)指令微調(diào)模型
InstructPix2Pix條件擴散模型:基于Stable Diffusion+微調(diào)多模態(tài)數(shù)據(jù)集(綜合兩大模型能力【GPT-3、Stable Diffusion】來生成)
InstructPix2Pix (983M) (Brooks et al., 2022) is a conditional diffusion model trained by fine-tuning Stable Diffusion (983M) (Rombach et al., 2022) on a constructed multi-modal dataset that contains more than 450K text editing instructions and corresponding images before and after the edit. The authors combine the abilities of two large-scale pre-trained models, a language model GPT-3 (Brown et al., 2020b) and a text-to-image model Stable Diffusion (Rombach et al., 2022), to generate the training dataset. GPT-3 is fine-tuned to generate text edits based on image prompts, while Stable Diffusion is used to convert the generated text edits into actual image edits. InstructPix2Pix is then trained on this generated dataset using a latent diffusion objective. Figure 5 shows the process of generating the image editing dataset and training the diffusion model on that dataset. The authors compare the proposed method qualitatively with previous works such as SDEdit (Meng et al., 2022) and Text2Live (Bar-Tal et al., 2022), highlighting the ability of the model to follow image editing instructions instead of descriptions of the image or edit layer. The authors also present quantitative comparisons with SDEdit (Meng et al., 2022) using metrics measuring image consistency and edit quality.
InstructPix2Pix(983M)(Brooks等,2022)是一種條件擴散模型,通過在構(gòu)建的多模態(tài)數(shù)據(jù)集上對Stable Diffusion(983M)(Rombach等,2022)進行微調(diào)而訓(xùn)練得到,該數(shù)據(jù)集包含超過450K個文本編輯指令和相應(yīng)的編輯前后圖像。作者將兩個大規(guī)模預(yù)訓(xùn)練模型的能力結(jié)合在一起,即語言模型GPT-3(Brown等,2020b)和文本到圖像模型Stable Diffusion(Rombach等,2022),以生成訓(xùn)練數(shù)據(jù)集。GPT-3被微調(diào)以根據(jù)圖像提示生成文本編輯,而Stable Diffusion則用于將生成的文本編輯轉(zhuǎn)換為實際圖像編輯。然后,InstructPix2Pix在此生成的數(shù)據(jù)集上使用潛在擴散目標進行訓(xùn)練。圖5展示了生成圖像編輯數(shù)據(jù)集的過程以及在該數(shù)據(jù)集上訓(xùn)練擴散模型的過程。
作者將所提出的方法與之前的作品(如SDEdit和Text2Live)進行了定性比較,強調(diào)該模型能夠按照圖像編輯指令進行操作,而不是圖像或編輯層的描述。作者還使用衡量圖像一致性和編輯質(zhì)量的指標對其與SDEdit進行了定量比較。
LLaVA:基于CLIP視覺編碼器和LLaMA語言解碼器模型+微調(diào)158K個獨特的語言-圖像指令-跟隨樣本的教學(xué)視覺語言數(shù)據(jù)集(利用GPT-4轉(zhuǎn)換格式)
LLaVA (13B) (Liu et al., 2023b) is a large multimodal model developed by connecting the visual encoder of CLIP (400M) (Radford et al., 2021) with the language decoder LLaMA (7B) (Touvron et al., 2023a). LLaVA is fine-tuned using the generated instructional vision-language dataset consisting of 158K unique language-image instruction-following samples. The data collection process involved creating conversation, detailed description, and complex reasoning prompts. GPT-4 is used to convert image-text pairs into the appropriate instruction-following format for this dataset. Visual features such as captions and bounding boxes were used to encode images. LLaVA yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
LLaVA(13B)(Liu等,2023b)是一個大型多模態(tài)模型,通過將CLIP(400M)(Radford等,2021)的視覺編碼器與LLaMA(7B)(Touvron等,2023a)的語言解碼器相連接而開發(fā)。LLaVA通過生成包含158K個獨特的語言-圖像指令-跟隨樣本的教學(xué)視覺語言數(shù)據(jù)集進行微調(diào)。
數(shù)據(jù)收集過程涉及創(chuàng)建會話、詳細描述和復(fù)雜推理提示。使用GPT-4將圖像-文本對轉(zhuǎn)換為適用于此數(shù)據(jù)集的適當?shù)闹噶罡S格式。使用標題和邊界框等視覺特征來編碼圖像。LLaVA在合成多模態(tài)指令跟隨數(shù)據(jù)集上相對于GPT-4的得分為85.1%。在Science QA上進行微調(diào)時,LLaVA和GPT-4的協(xié)同作用實現(xiàn)了92.53%的新的最高準確率。
Video-LLaMA多模態(tài)框架:由兩個分支編碼器組成(視覺-語言VL分支和音頻-語言AL分支+語言解碼器LLaMA)
Video-LLaMA (Zhang et al., 2023b) is a multimodal framework that enhances large language models with the ability to understand both visual and auditory content in videos. The architecture of Video-LLaMA consists of two branch encoders: the Vision-Language (VL) Branch and the Audio-Language (AL) Branch, and a language decoder (Vicuna (7B/13B) (Chiang et al., 2023), LLaMA (7B) (Touvron et al., 2023a), etc.). The VL Branch includes a frozen pre-trained image encoder (the pre-trained vision component of BLIP-2 (Li et al., 2023d), which includes a ViT-G/14 and a pre-trained Q-former), a position embedding layer, a video Q-former and a linear layer. The AL Branch includes a pre-trained audio encoder (ImageBind (Girdhar et al., 2023)) and an Audio Q-former. Figure 6 shows the overall architecture of Video-LLaMA with the Vision-Language Branch and Audio-Language Branch. The VL Branch is trained on the Webvid-2M (Bain et al., 2021) video caption dataset with a video-to-text generation task, and fine-tuned on the instruction-tuning data from MiniGPT-4 (Zhu et al., 2023), LLaVA (Liu et al., 2023b) and VideoChat (Li et al., 2023e). The AL Branch is trained on video/image instruction-caption data to connect the output of ImageBind to the language decoder. After finetuning, Video-LLaMA can perceive and comprehend video content, demonstrating its ability to integrate auditory and visual information, understand static images, recognize common-knowledge concepts, and capture temporal dynamics in videos.
Video-LLaMA(Zhang等,2023b)是一個多模態(tài)框架,通過在視頻中理解視覺和聽覺內(nèi)容來增強大型語言模型的能力。Video-LLaMA的架構(gòu)由兩個分支編碼器組成:視覺-語言(VL)分支和音頻-語言(AL)分支,以及一個語言解碼器(Vicuna(7B/13B)(Chiang等,2023),LLaMA(7B)(Touvron等,2023a)等)。
VL分支包括一個凍結(jié)的預(yù)訓(xùn)練圖像編碼器(BLIP-2(Li等,2023d)的預(yù)訓(xùn)練視覺組件,其中包括一個ViT-G/14和一個預(yù)訓(xùn)練的Q-former)、一個位置嵌入層、一個視頻Q-former和一個線性層。
AL分支包括一個預(yù)訓(xùn)練的音頻編碼器(ImageBind(Girdhar等,2023))和一個音頻Q-former。圖6展示了Video-LLaMA的整體架構(gòu),包括視覺-語言分支和音頻-語言分支。
VL分支在Webvid-2M(Bain等,2021)視頻字幕數(shù)據(jù)集上進行訓(xùn)練,進行視頻到文本生成任務(wù),并在來自MiniGPT-4(Zhu等,2023)、LLaVA(Liu等,2023b)和VideoChat(Li等,2023e)的指令微調(diào)數(shù)據(jù)上進行微調(diào)。
AL分支在視頻/圖像指令-字幕數(shù)據(jù)上進行訓(xùn)練,將ImageBind的輸出連接到語言解碼器。
微調(diào)后,Video-LLaMA能夠感知和理解視頻內(nèi)容,展示了其整合聽覺和視覺信息、理解靜態(tài)圖像、識別常識概念以及捕捉視頻中的時間動態(tài)的能力。
InstructBLIP視覺-語言指令微調(diào)框架:基于BLIP-2模型(圖像編碼器+LLM+Query Transformer)
InstructBLIP (1.2B) (Dai et al., 2023) is a vision-language instruction tuning framework initialized with a pre-trained BLIP-2 (Li et al., 2023d) model consisting of an image encoder, an LLM (FlanT5 (3B/11B) (Chung et al., 2022) or Vicuna (7B/13B) (Chiang et al., 2023)), and a Query Transformer (Q-Former) to bridge the two. As shown in Figure 7, the Q-Former extracts instruction-aware visual features from the output embeddings of the frozen image encoder, and feeds the visual features as soft prompt input to the frozen LLM. The authors evaluate the proposed InstructBLIP model on a variety of vision-language tasks, including image classification, image captioning, image question answering, and visual reasoning. They use 26 publicly available datasets, dividing them into 13 held-in and 13 held-out datasets for training and evaluation. The authors demonstrate that InstructBLIP achieves state-of-the-art zero-shot performance on a wide range of vision-language tasks. InstructBLIP yields an average relative improvement of 15.0% when compared to BLIP-2, and the smallest InstructBLIP (4B) outperforms Flamingo (80B) (Alayrac et al., 2022) on all six shared evaluation datasets with an average relative improvement of 24.8%.
InstructBLIP(1.2B)(Dai等,2023)是一個視覺-語言指令微調(diào)框架,其初始化為一個預(yù)訓(xùn)練的BLIP-2(Li等,2023d)模型,包括圖像編碼器、LLM(FlanT5(3B/11B)(Chung等,2022)或Vicuna(7B/13B)(Chiang等,2023))和一個Query Transformer(Q-Former)以連接兩者。如圖7所示,Q-Former從凍結(jié)的圖像編碼器的輸出嵌入中提取指令感知的視覺特征,并將視覺特征作為軟提示輸入到凍結(jié)的LLM中。
作者在各種視覺-語言任務(wù)上評估了所提出的InstructBLIP模型,包括圖像分類、圖像字幕生成、圖像問答和視覺推理。他們使用了26個公開可用的數(shù)據(jù)集,將其分為13個訓(xùn)練內(nèi)(held-in)數(shù)據(jù)集和13個訓(xùn)練外(held-out)數(shù)據(jù)集,分別用于訓(xùn)練和評估。作者證明InstructBLIP在各種視覺-語言任務(wù)上實現(xiàn)了最先進的零樣本性能。相較于BLIP-2,InstructBLIP平均相對改進15.0%;最小的InstructBLIP(4B)在六個共享評估數(shù)據(jù)集上優(yōu)于Flamingo(80B)(Alayrac等,2022),平均相對改進為24.8%。
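InstructBLIP把Q-Former輸出的指令感知視覺特征作為"軟提示"輸入凍結(jié)的LLM,這一步可以用幾行PyTorch概念性地示意如下(不依賴BLIP-2的真實接口,SoftPromptFusion等名稱與維度均為假設(shè)):

```python
import torch
import torch.nn as nn

class SoftPromptFusion(nn.Module):
    """示意:把視覺查詢特征投影成若干"軟提示"向量,拼接在文本嵌入之前喂給凍結(jié)的LLM。"""
    def __init__(self, vis_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)  # 訓(xùn)練時主要更新這層投影與Q-Former本身

    def forward(self, query_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # query_feats: (b, n_query, vis_dim),來自Q-Former的指令感知視覺特征
        # text_embeds: (b, t, llm_dim),凍結(jié)LLM的詞嵌入
        soft_prompt = self.proj(query_feats)              # (b, n_query, llm_dim)
        return torch.cat([soft_prompt, text_embeds], 1)   # 作為LLM的inputs_embeds

fusion = SoftPromptFusion()
inputs_embeds = fusion(torch.randn(1, 32, 768), torch.randn(1, 20, 4096))
print(inputs_embeds.shape)  # torch.Size([1, 52, 4096])
```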
Otter:基于OpenFlamingo模型+只微調(diào)Perceiver重采樣模塊、交叉注意力層和輸入/輸出嵌入
Otter (Li et al., 2023b) is a multi-modal model trained by fine-tuning OpenFlamingo (9B) (Awadalla et al., 2023), with the language and vision encoders frozen and only fine-tuning the Perceiver resampler module, cross-attention layers, and input/output embeddings. The authors organize diverse multi-modal tasks covering 11 categories and build the multi-modal in-context instruction tuning dataset MIMIC-IT of 2.8M multimodal instruction-response pairs, which consists of image-instruction-answer triplets, where the instruction-answer is tailored to the image. Each data sample also includes context, which contains a series of image-instruction-answer triplets that contextually correlate with the queried triplet. Otter demonstrates the ability to follow user instructions more accurately and provide more detailed descriptions of images compared to OpenFlamingo (Awadalla et al., 2023).
Otter(Li等,2023b)是一種多模態(tài)模型,通過微調(diào)OpenFlamingo(9B)(Awadalla等,2023)得到,其中語言和視覺編碼器被凍結(jié),只微調(diào)了Perceiver重采樣模塊、交叉注意力層和輸入/輸出嵌入。作者組織了涵蓋11個類別的多樣多模態(tài)任務(wù),并構(gòu)建了包含2.8M個多模態(tài)指令-響應(yīng)對的多模態(tài)上下文指令微調(diào)數(shù)據(jù)集MIMIC-IT,其中每條樣本是圖像-指令-答案三元組,指令-答案是針對圖像量身定制的。每個數(shù)據(jù)樣本還包括上下文,即一系列與被查詢?nèi)M在上下文上相關(guān)的圖像-指令-答案三元組。相對于OpenFlamingo(Awadalla等,2023),Otter能夠更準確地遵循用戶指令,并提供更詳細的圖像描述。
MultiModal-GPT:多模態(tài)指令微調(diào)模型
MultiModal-GPT (Gong et al., 2023) is a multi- modal instruction tuning model that is capable of following diverse instructions, generating detailed captions, counting specific objects, and addressing general inquiries. MultiModal-GPT is trained by fine-tuning OpenFlamingo (9B) (Awadalla et al., 2023) on various created visual instruction data with open datasets, including VQA, Image Captioning, Visual Reasoning, Text OCR, and Visual Dialogue. The experiments demonstrate the proficiency of MultiModal-GPT in maintaining continuous dialogues with humans.
MultiModal-GPT(Gong等,2023)是一種多模態(tài)指令微調(diào)模型,能夠遵循不同的指令,生成詳細的標題,計數(shù)特定的對象,并回答一般性問題。MultiModal-GPT通過在包括VQA、圖像字幕生成、視覺推理、文本OCR和視覺對話等的各種創(chuàng)建的視覺指令數(shù)據(jù)上微調(diào)OpenFlamingo(9B)(Awadalla等,2023)而訓(xùn)練得到。實驗展示了MultiModal-GPT在與人類保持持續(xù)對話方面的能力。
6、Domain-specific Instruction Finetuning特定領(lǐng)域指令微調(diào)
In this section, we describe instruction tuning in different domains and applications.
在本節(jié)中,我們描述了不同領(lǐng)域和應(yīng)用中的指令微調(diào)。
6.1、Dialogue對話—InstructDial、LINGUIST模型:每個任務(wù)實例{任務(wù)描述、實例輸入、約束、指令和輸出}+兩個元任務(wù)(指令選擇任務(wù)+指令二元任務(wù))
InstructDial (Gupta et al., 2022) is an instruction tuning framework designed for dialogue. It contains a collection of 48 dialogue tasks in a consistent text-to-text format created from 59 dialogue datasets. Each task instance includes a task description, instance inputs, constraints, instructions, and output. To ensure adherence to instructions, the framework introduces two meta-tasks: (1) an instruction selection task, where the model selects the instruction corresponding to a given input-output pair; and (2) an instruction binary task, where the model predicts "yes" or "no" if an instruction leads to a given output from an input. Two base models T0-3B (Sanh et al., 2021) (3B parameters version of T5 (Lester et al., 2021)) and BART0 (Lin et al., 2022) (406M parameters based on Bart-large (Lewis et al., 2019)) are fine-tuned on the tasks from InstructDial. InstructDial achieves impressive results on unseen dialogue datasets and tasks, including dialogue evaluation and intent detection. Moreover, it delivers even better results when applied to a few-shot setting.
Intent Classification and Slot Tagging. LINGUIST (Rosenbaum et al., 2022) finetunes AlexaTM 5B (Soltan et al., 2022), a 5-billion-parameter multilingual model, on the instruction dataset for intent classification and slot tagging tasks. Each instruction consists of five blocks: (i) the language of the generated output, (ii) intention, (iii) slot types and values to include in the output (e.g., the number 3 in [3, snow] corresponds to the slot type, and snow is the value used for that slot), (iv) a mapping from slot type labels to numbers, and (v) up to 10 examples to instruct the format of the outputs. LINGUIST shows significant improvements over state-of-the-art approaches in a 10-shot novel intent setting using the SNIPS dataset (Coucke et al., 2018). In the zero-shot cross-lingual setting of the mATIS++ dataset (Xu et al., 2020), LINGUIST surpasses a strong baseline of Machine Translation with Slot Alignment across 6 languages while maintaining intent classification performance.
InstructDial(Gupta等,2022)是一個專為對話設(shè)計的指令微調(diào)框架。它包含從59個對話數(shù)據(jù)集創(chuàng)建的、采用一致的文本到文本格式的48個對話任務(wù)。
每個任務(wù)實例包括任務(wù)描述、實例輸入、約束、指令和輸出。為了確保遵循指令,該框架引入了兩個元任務(wù):(1)指令選擇任務(wù),模型根據(jù)給定的輸入-輸出對選擇相應(yīng)的指令;
(2)指令二元任務(wù),模型預(yù)測某條指令能否從給定輸入得到給定輸出,并回答"是"或"否"。
兩個基本模型T0-3B(Sanh等,2021)(T5的3B參數(shù)版本(Lester等,2021))和BART0(Lin等,2022)(基于Bart-large(Lewis等,2019)的406M參數(shù))在來自InstructDial的任務(wù)上進行微調(diào)。InstructDial在看不見的對話數(shù)據(jù)集和任務(wù)上取得了令人印象深刻的成績,包括對話評估和意圖檢測。此外,當應(yīng)用于少樣本設(shè)置時,它甚至可以獲得更好的結(jié)果。
意圖分類和槽位標記。LINGUIST(Rosenbaum等,2022)在用于意圖分類和槽位標記任務(wù)的指令數(shù)據(jù)集上,微調(diào)了AlexaTM 5B(Soltan等,2022)這一50億參數(shù)的多語言模型。每個指令由五個塊組成:
(i)生成輸出的語言,
(ii)意圖,
(iii)要包含在輸出中的槽位類型和值(例如,[3, snow]中的數(shù)字3對應(yīng)槽位類型,snow是用于該槽位的值),
(iv)從槽位類型標簽到數(shù)字的映射,
(v)最多10個示例,用以指示輸出的格式。
在使用SNIPS數(shù)據(jù)集(Coucke等,2018)的10樣本新意圖設(shè)置中,LINGUIST相比最先進的方法取得了顯著提升。在mATIS++數(shù)據(jù)集(Xu等,2020)的零樣本跨語言設(shè)置中,LINGUIST在保持意圖分類性能的同時,在6種語言上超越了機器翻譯加槽位對齊的強基線。
6.3、Information Extraction信息抽取—InstructUIE:基于FlanT5模型+指令微調(diào)的統(tǒng)一信息抽取(IE)框架+將IE任務(wù)轉(zhuǎn)化為seq2seq格式,每個任務(wù)實例四個屬性{任務(wù)指令、選項、文本、輸出}
InstructUIE (Wang et al., 2023b) is a unified information extraction (IE) framework based on instruction tuning, which transforms IE tasks to the seq2seq format and solves them by fine-tuning 11B FlanT5 (Chung et al., 2022) on the constructed IT dataset. Figure 8 shows the overall architecture of InstructUIE. It introduces IE INSTRUCTIONS, a benchmark of 32 diverse information extraction datasets in a unified text-to-text format with expert-written instructions. Each task instance is delineated by four properties: task instruction, options, text, and output. Task instruction contains information such as the type of information to be extracted, the output structure format, and additional constraints or rules that need to be adhered to during the extraction process. Options refer to the output label constraints of a task. Text refers to the input sentence. Output is the sentence obtained by converting the original tags of the sample (e.g. "entity tag: entity span" for NER). In the supervised setting, InstructUIE performs comparably to BERT (Devlin et al., 2018) and outperforms the state-of-the-art and GPT3.5 (Brown et al., 2020a) in zero-shot settings.
InstructUIE(Wang等,2023b)是一個基于指令微調(diào)的統(tǒng)一信息抽取(IE)框架,它將IE任務(wù)轉(zhuǎn)化為seq2seq格式,并通過在構(gòu)建的IT數(shù)據(jù)集上微調(diào)11B FlanT5(Chung等,2022)來解決這些問題。
圖8展示了InstructUIE的整體架構(gòu)。它引入了IE INSTRUCTIONS,這是一個由32個多樣的信息抽取數(shù)據(jù)集組成的基準,以統(tǒng)一的文本到文本格式呈現(xiàn),其中包含專家編寫的指令。
每個任務(wù)實例由四個屬性描述:任務(wù)指令、選項、文本和輸出。
>> 任務(wù)指令包含諸如要提取的信息類型、輸出結(jié)構(gòu)格式以及在提取過程中需要遵循的附加約束或規(guī)則等信息。
>> 選項是任務(wù)的輸出標簽約束。
>> 文本是輸入句子。
>> 輸出是通過轉(zhuǎn)換樣本的原始標簽得到的句子(例如,NER中的"實體標簽:實體跨度")。
在監(jiān)督設(shè)置下,InstructUIE的表現(xiàn)與BERT(Devlin等,2018)相當;在零樣本設(shè)置下,它超越了最先進的方法和GPT3.5(Brown等,2020a)。
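按照上述{任務(wù)指令、選項、文本、輸出}四屬性,可以把一條NER樣本線性化為seq2seq的(輸入,輸出)文本對。下面是一個最小示意(指令措辭與拼接格式為假設(shè),并非IE INSTRUCTIONS的原始模板):

```python
def build_ie_example(task_instruction, options, text, entities):
    """把一條NER標注樣本轉(zhuǎn)成(源文本, 目標文本)對,供seq2seq模型微調(diào)使用。"""
    source = (
        f"Instruction: {task_instruction}\n"
        f"Options: {', '.join(options)}\n"
        f"Text: {text}"
    )
    # 目標序列采用"實體標簽: 實體跨度"的線性化形式
    target = "; ".join(f"{label}: {span}" for span, label in entities)
    return source, target

src, tgt = build_ie_example(
    task_instruction="Please extract all named entities and their types from the text.",
    options=["person", "organization", "location"],
    text="Barack Obama was born in Hawaii.",
    entities=[("Barack Obama", "person"), ("Hawaii", "location")],
)
print(src)
print(tgt)  # person: Barack Obama; location: Hawaii
```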
6.4、ABSA基于方面的情感分析:基于T5模型
ABSA/Aspect-based Sentiment Analysis基于方面的情感分析
Varia et al. (2022) propose a unified instruction tuning framework for solving the Aspect-based Sentiment Analysis (ABSA) task based on a fine-tuned T5 (220M) (Raffel et al., 2019) model. The framework addresses multiple factorized sub-tasks that involve the four elements of ABSA, namely Aspect Term, Aspect Category, Opinion Term, and Sentiment. It treats these sub-tasks as a combination of five Question Answering (QA) tasks by transforming each sentence in the corpus using instruction templates provided for each task. For instance, one of the instruction templates used is "What are the aspect terms in the text: $TEXT?". The framework showcases substantial improvement (8.29 F1 on average) over the state-of-the-art in few-shot learning scenarios and remains comparable in full fine-tuning scenarios.
Varia等(2022)提出了一個統(tǒng)一的指令微調(diào)框架,基于微調(diào)的T5(220M)(Raffel等,2019)模型來解決基于方面的情感分析(ABSA)任務(wù)。該框架處理涉及ABSA四個元素的多個分解子任務(wù),即方面術(shù)語、方面類別、意見術(shù)語和情感。它將這些子任務(wù)視為五個問答(QA)任務(wù)的組合,通過使用為每個任務(wù)提供的指令模板來轉(zhuǎn)化語料庫中的每個句子。例如,所使用的指令模板之一是"What are the aspect terms in the text: $TEXT?"。該框架在少樣本學(xué)習(xí)場景中相比最先進方法取得了顯著提升(平均提高8.29個F1),在完全微調(diào)場景中保持了可比的性能。
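上述"把每個句子按任務(wù)模板改寫成QA形式"的做法可以用如下最小示意來說明(除正文給出的方面術(shù)語模板外,其余模板措辭均為假設(shè)):

```python
# ABSA各子任務(wù)的指令模板示意:正文中給出的只有方面術(shù)語模板,其余為假設(shè)的等價寫法
TEMPLATES = {
    "aspect_term": "What are the aspect terms in the text: {text}?",
    "aspect_category": "What are the aspect categories in the text: {text}?",  # 假設(shè)
    "opinion_term": "What are the opinion terms in the text: {text}?",         # 假設(shè)
    "sentiment": "What is the sentiment of the text: {text}?",                 # 假設(shè)
}

def to_qa_instances(text: str) -> dict:
    """把一個句子展開成多個QA式輸入,交給T5類模型生成對應(yīng)答案。"""
    return {task: tpl.format(text=text) for task, tpl in TEMPLATES.items()}

for task, prompt in to_qa_instances("The battery life is great but the screen is dim.").items():
    print(task, "->", prompt)
```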
6.5、Writing寫作
Writing-Alpaca-7B輔助寫作:基于LLaMa-7B模型+微調(diào)寫作指令數(shù)據(jù)集(EDITEVAL基準的擴展),四元組{通用序言、指導(dǎo)任務(wù)完成的指令字段、提供要編輯文本的輸入字段、要求模型填寫的響應(yīng)字段}
Zhang et al. (2023d) propose Writing-Alpaca-7B that fine-tunes LLaMa-7B on the writing instruction dataset to provide writing assistance. The proposed instruction dataset is an extension of the EDITEVAL benchmark based on instructional data, with the Updating task removed and a task for grammaticality introduced. The instruction scheme strictly follows the one in the Stanford Alpaca project, comprising a universal preface, an instruction field to guide task completion, an input field that provides the text to be edited, and a response field that requires models to fill out. Writing-Alpaca-7B improves upon LLaMa's performance on all writing tasks and outperforms other larger off-the-shelf LLMs.
Zhang等(2023d)提出了Writing-Alpaca-7B,通過對寫作指令數(shù)據(jù)集進行LLaMa-7B的微調(diào),以提供寫作輔助。所提出的指令數(shù)據(jù)集是基于指導(dǎo)性數(shù)據(jù)的EDITEVAL基準的擴展,刪除了更新任務(wù)并引入了一個用于語法的任務(wù)。
指令方案嚴格遵循斯坦福Alpaca項目中的方案,包括通用序言、用于指導(dǎo)任務(wù)完成的指令字段、提供要編輯文本的輸入字段,以及要求模型填寫的響應(yīng)字段。Writing-Alpaca-7B在所有寫作任務(wù)上均優(yōu)于LLaMa,并且優(yōu)于其他更大的現(xiàn)成LLM。
CoEdIT輔助寫作:基于FLAN-T5模型+微調(diào)文本編輯指令數(shù)據(jù)集,兩元組{指令:源,目標}
CoEdIT (Raheja et al., 2023) finetunes FLAN-T5 (770M parameters, 3B parameters, and 11B parameters) on the instruction dataset for text editing to provide writing assistance. The instruction dataset comprises approximately 82K <instruction: source, target> pairs. As shown in Figure 9, the model takes instructions from the user specifying the characteristics of the desired text, such as "Make the sentence simpler", and outputs the edited text. CoEdIT achieves state-of-the-art performance on several text editing tasks, including grammatical error correction, text simplification, iterative text editing, and three stylistic editing tasks: formality style transfer, neutralization, and paraphrasing. Furthermore, it can generalize well to new, adjacent tasks not seen during fine-tuning.
CoEdIT(Raheja等,2023)對FLAN-T5(770M參數(shù)、3B參數(shù)和11B參數(shù))在文本編輯的指令數(shù)據(jù)集上進行微調(diào),以提供寫作輔助。
指令數(shù)據(jù)集包括約82K個<指令:源,目標>對。
如圖9所示,模型從用戶處獲取指令,指定所需文本的特性,例如"使句子更簡單",然后輸出編輯后的文本。
CoEdIT在多個文本編輯任務(wù)上取得了最先進的性能,包括語法錯誤糾正、文本簡化、迭代文本編輯以及三個風(fēng)格編輯任務(wù):正式風(fēng)格轉(zhuǎn)換、中性化和改寫。此外,它還可以很好地推廣到新的、相鄰的任務(wù),這些任務(wù)在微調(diào)過程中未曾見過。
CoPoet協(xié)作的詩歌寫作工具:基于T5模型+微調(diào)詩歌寫作數(shù)據(jù)集,兩元組{指令,詩行}
CoPoet (Chakrabarty et al., 2022) is a collaborative poetry writing tool that utilizes a large language model (e.g. T5-3B, T5-11B and T0-3B models) trained on a diverse collection of instructions for poetry writing. Each sample in the instruction dataset includes an <instruction, poem_line> pair. There are three major types of instructions: Continuation, Lexical Constraints, and Rhetorical Techniques. The CoPoet is guided by user instructions that specify desired attributes of the poetry, such as writing a sentence about "love" or ending a sentence with "fly." Not only is the system competitive with publicly available LLMs trained on instructions, such as InstructGPT, but it is also capable of satisfying unseen compositional instructions.
CoPoet(Chakrabarty等,2022)是一個協(xié)作的詩歌寫作工具,利用大型語言模型(如T5-3B、T5-11B和T0-3B模型)在詩歌寫作的各種指導(dǎo)下進行訓(xùn)練。指導(dǎo)性數(shù)據(jù)集中的每個樣本都包括一個<指令,詩行>對。有三種主要類型的指導(dǎo):延續(xù)、詞匯約束和修辭技巧。
CoPoet根據(jù)用戶的指令進行生成,指令指定詩歌的所需屬性,例如寫一個關(guān)于"愛"的句子或以"飛"結(jié)尾的句子。該系統(tǒng)不僅與InstructGPT等公開可用的、經(jīng)指令訓(xùn)練的LLM相比具有競爭力,還能夠滿足未見過的組合式指令。
6.6、Medical醫(yī)學(xué)
Radiology-GPT針對放射學(xué)領(lǐng)域:基于Alpaca+微調(diào)放射學(xué)領(lǐng)域知識數(shù)據(jù)集,兩元組{發(fā)現(xiàn),結(jié)論}
Radiology-GPT (Liu et al., 2023c) is a fine-tuned Alpaca-7B model for radiology, which utilizes an instruction tuning approach on an extensive dataset of radiology domain knowledge. Radiology reports usually include two corresponding sections: "Findings" and "Impression". The "Findings" section contains detailed observations from the radiology images, while the "Impression" section summarizes the interpretations drawn from those observations. Radiology-GPT provides a brief instruction to the "Findings" text: "Derive the impression from findings in the radiology report". The "Impression" text from the same report serves as the target output. In comparison to general language models such as StableLM, Dolly, and LLaMA, Radiology-GPT demonstrates significant versatility in radiological diagnosis, research, and communication.
Radiology-GPT(Liu等,2023c)是一個針對放射學(xué)領(lǐng)域的Alpaca-7B模型進行微調(diào)的模型,它在廣泛的放射學(xué)領(lǐng)域知識數(shù)據(jù)集上采用了指令微調(diào)方法。放射學(xué)報告通常包括兩個相應(yīng)的部分:"發(fā)現(xiàn)"和"結(jié)論"。"發(fā)現(xiàn)"部分包含來自放射學(xué)圖像的詳細觀察,而"結(jié)論"部分總結(jié)了從這些觀察中得出的解釋。Radiology-GPT為"發(fā)現(xiàn)"文本提供了一個簡要的指令:"從放射學(xué)報告的發(fā)現(xiàn)中得出結(jié)論"。同一份報告中的"結(jié)論"文本被用作目標輸出。與StableLM、Dolly和LLaMA等通用語言模型相比,Radiology-GPT在放射學(xué)診斷、研究和交流方面表現(xiàn)出顯著的多樣性。
ChatDoctor:基于LLaMA模型+微調(diào)Alpaca指令數(shù)據(jù)集和HealthCareMagic100k患者-醫(yī)生對話數(shù)據(jù)集且檢索外部知識數(shù)據(jù)庫
ChatDoctor (Li et al., 2023g) is based on the fine-tuned LLaMA-7B model, utilizing the Alpaca instruction dataset and the HealthCareMagic100k patient-doctor dialogue dataset. And prompt templates are designed for retrieving external knowledge databases, such as the Disease Database and Wikipedia retrieval, during doctor-patient conversations to obtain more accurate outputs from the model. ChatDoctor significantly improves the model's ability to comprehend patient needs and provide informed advice. By equipping the model with self-directed information retrieval from reliable online and offline sources, the accuracy of its responses is substantially improved.
ChatDoctor(Li等,2023g)基于經(jīng)過微調(diào)的LLaMA-7B模型,利用Alpaca指令數(shù)據(jù)集和HealthCareMagic100k患者-醫(yī)生對話數(shù)據(jù)集。并且在醫(yī)生-患者對話期間為檢索外部知識數(shù)據(jù)庫,如疾病數(shù)據(jù)庫和維基百科檢索,設(shè)計了提示模板,以從模型中獲取更準確的輸出。ChatDoctor顯著提高了模型理解患者需求并提供明智建議的能力。通過為模型配備從可靠的在線和離線來源自主獲取信息的能力,其回答的準確性大大提高。
ChatGLM-Med:基于ChatGLM模型+微調(diào)中國醫(yī)學(xué)指令數(shù)據(jù)集(基于GPT3.5的API和醫(yī)學(xué)知識圖譜創(chuàng)建問題-答案對)
ChatGLM-Med (Haochun Wang, 2023) is fine-tuned on the Chinese medical instruction dataset based on the ChatGLM-6B model. The instruction dataset comprises medically relevant question and answer pairs, created using the GPT3.5 API and the Medical Knowledge Graph. This model improves the question-answering performance of ChatGLM in the medical field.
ChatGLM-Med(Haochun Wang,2023)在基于ChatGLM-6B模型的中國醫(yī)學(xué)指令數(shù)據(jù)集上進行了微調(diào)。指令數(shù)據(jù)集包括使用GPT3.5 API和醫(yī)學(xué)知識圖譜創(chuàng)建的與醫(yī)學(xué)相關(guān)的問題和答案對。該模型提高了ChatGLM在醫(yī)學(xué)領(lǐng)域的問答性能。
6.7、Arithmetic算術(shù):Goat=基于LLaMA模型+微調(diào)算術(shù)問題數(shù)據(jù)集(ChatGPT生成數(shù)百個指令+自然語言問答的形式表達)
Goat (Liu and Low, 2023) is a fine-tuned LLaMA-7B model based on instructions, which aims to solve arithmetic problems. It expresses arithmetic problems in the form of natural language question answering, such as "What is 8914/64?", by generating hundreds of instruction templates using ChatGPT. The model applies various techniques to enhance its adaptability to diverse question formats, such as randomly removing spaces between numbers and symbols in the arithmetic expression and replacing "*" with "x" or "times". The Goat model achieves state-of-the-art performance on the BIG-bench arithmetic subtask. In particular, zero-shot Goat7B matches or exceeds the accuracy achieved by the few-shot PaLM-540B.
Goat(Liu和Low,2023)是一個基于指令微調(diào)的LLaMA-7B模型,旨在解決算術(shù)問題。它通過使用ChatGPT生成數(shù)百個指令模板,以自然語言問答的形式表達算術(shù)問題,
例如"What is 8914/64?"。該模型應(yīng)用各種技術(shù)增強其適應(yīng)各種問題格式的能力,例如隨機刪除算術(shù)表達式中數(shù)字和符號之間的空格,將"*"替換為"x"或"times"等。Goat模型在BIG-bench算術(shù)子任務(wù)上達到了最先進的性能。特別是,零樣本的Goat7B的準確性達到或超過了少樣本的PaLM-540B的準確性。
6.8、Code代碼:WizardCoder=基于StarCoder模型+Evol-Instruct方法+微調(diào)Code Alpaca數(shù)據(jù)集,3元組{指令、輸入、期望輸出}
WizardCoder (Luo et al., 2023) utilizes StarCoder 15B as the foundation with complex instruction fine-tuning, by adapting the Evol-Instruct method (Xu et al., 2023) to the domain of code. The training dataset is produced through iterative application of the Evol-Instruct technique on the Code Alpaca dataset, which includes the following attributes for each sample: instruction, input, and expected output. For instance, when the instruction is "Amend the following SQL query to select distinct elements", the input is the SQL query, and the expected output is the generated answer. WizardCoder outperforms all other open-source Code LLMs and even outperforms the largest LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+.
WizardCoder(Luo等,2023)以StarCoder 15B為基礎(chǔ),采用復(fù)雜指令微調(diào),將Evol-Instruct方法(Xu等,2023)適配到代碼領(lǐng)域。訓(xùn)練數(shù)據(jù)集通過在Code Alpaca數(shù)據(jù)集上迭代應(yīng)用Evol-Instruct技術(shù)產(chǎn)生,每個樣本包括以下屬性:指令、輸入和期望輸出。
例如,當指令為"Amend the following SQL query to select distinct elements"時,輸入為SQL查詢,期望輸出為生成的答案。WizardCoder在HumanEval和HumanEval+上超越了所有其他開源代碼LLM,甚至超越了Anthropic的Claude和Google的Bard等最大的LLM。
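這類{指令、輸入、期望輸出}三元組在監(jiān)督微調(diào)時通常會拼接成一個提示串,損失只計算在輸出部分。下面是一個最小示意(提示模板沿用Alpaca風(fēng)格,并非WizardCoder的官方模板):

```python
PROMPT_WITH_INPUT = (  # Alpaca風(fēng)格模板(示意),WizardCoder的實際模板可能不同
    "Below is an instruction that describes a task, paired with an input.\n"
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

sample = {
    "instruction": "Amend the following SQL query to select distinct elements",
    "input": "SELECT name FROM employees;",
    "output": "SELECT DISTINCT name FROM employees;",
}

prompt = PROMPT_WITH_INPUT.format(**sample)
full_text = prompt + sample["output"]  # 監(jiān)督微調(diào)時,損失通常只計算在output這一段
print(full_text)
```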
LLMs之Code:SQLCoder的簡介、安裝、使用方法之詳細攻略
LLMs之Code:SQLCoder的簡介、安裝、使用方法之詳細攻略_一個處女座的程序猿的博客-CSDN博客
LLMs之Code:Code Llama的簡介、安裝、使用方法之詳細攻略
LLMs之Code:Code Llama的簡介、安裝、使用方法之詳細攻略_一個處女座的程序猿的博客-CSDN博客
補充—6.9、法律行業(yè)
LLMs之Law:大語言模型領(lǐng)域行業(yè)場景應(yīng)用之大模型法律行業(yè)的簡介、主流LLMs(PowerLawGLM/ChatLaw)、經(jīng)典應(yīng)用之詳細攻略
LLMs之Law:大語言模型領(lǐng)域行業(yè)場景應(yīng)用之大模型法律行業(yè)的簡介、主流LLMs(PowerLawGLM/ChatLaw)、經(jīng)典應(yīng)用之詳細攻略_一個處女座的程序猿的博客-CSDN博客
7、Efficient Tuning Techniques高效微調(diào)技術(shù)
7.0、高效微調(diào)三種方法論:基于添加式(引入額外可訓(xùn)練參數(shù)或模塊,如HINT)、基于規(guī)范化(凍結(jié)某些固有模型參數(shù)同時指定要調(diào)整的參數(shù),如Delta-tuning)、基于重參數(shù)化(假設(shè)模型自適應(yīng)的低秩性→權(quán)重可重新參數(shù)化為低維子空間,如LoRA/QLoRA/LOMO)
Efficient fine-tuning techniques aim at adapting LLMs to downstream tasks by optimizing a small fraction of parameters in multiple ways, i.e., addition-based, specification-based, and reparameterization-based. Addition-based methods introduce extra trainable parameters or modules not present in the original model. Representative methods include adapter tuning (Houlsby et al., 2019) and prompt-based tuning (Schick and Schütze, 2021). Specification-based methods specify certain inherent model parameters to be tuned while freezing others. For example, BitFit (Zaken et al., 2022) tunes the bias terms of the pre-trained model. Reparameterization methods transform model weights into more parameter-efficient forms for tuning. The key hypothesis is that model adaptation is low-rank, so weights can be reparameterized into low-rank factors or a low-dimensional subspace (e.g., LoRA (Hu et al., 2021)). Intrinsic prompt tuning finds a low-dimensional subspace shared by tuning prompts across diverse tasks.
高效微調(diào)技術(shù)旨在通過多種方式對少量參數(shù)進行優(yōu)化,從而將LLM適應(yīng)于下游任務(wù),包括基于添加式、基于規(guī)范化和基于重參數(shù)化的方法?;谔砑邮降姆椒ㄒ肓嗽谠寄P椭胁淮嬖诘念~外可訓(xùn)練參數(shù)或模塊,代表性的方法包括Adapter微調(diào)(Houlsby等,2019)和基于Prompt的微調(diào)(Schick和Schütze,2021)?;谝?guī)范化的方法在凍結(jié)某些固有模型參數(shù)的同時,指定要調(diào)整的參數(shù),例如BitFit(Zaken等,2022)只微調(diào)預(yù)訓(xùn)練模型的偏置項?;谥貐?shù)化的方法將模型權(quán)重轉(zhuǎn)換為更加參數(shù)高效的形式進行微調(diào),其關(guān)鍵假設(shè)是模型的自適應(yīng)是低秩的,因此權(quán)重可以重參數(shù)化為低秩因子或低維子空間(例如LoRA(Hu等,2021))。內(nèi)在提示微調(diào)(intrinsic prompt tuning)則尋找不同任務(wù)的調(diào)優(yōu)提示所共享的低維子空間。
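以基于規(guī)范化方法中的BitFit為例,其思路用幾行PyTorch即可表達:凍結(jié)全部參數(shù),只放開偏置項參與訓(xùn)練。下面是一個最小示意(模型結(jié)構(gòu)與超參數(shù)僅為占位假設(shè)):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))  # 占位模型

# BitFit:只微調(diào)偏置項,其余參數(shù)全部凍結(jié)
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['0.bias', '2.bias']

# 只把可訓(xùn)練參數(shù)交給優(yōu)化器
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```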
7.1、基于重參數(shù)化—LoRA=基于DeepSpeed框架+訓(xùn)練低維度的A和B→可訓(xùn)練參數(shù)比完全微調(diào)少得多(LoRA微調(diào)GPT-3可將可訓(xùn)練參數(shù)降低到萬分之一)
Low-Rank Adaptation (LoRA) (Hu et al., 2021) enables efficient adaptation of LLMs using low-rank updates. LoRA uses DeepSpeed (Rasley et al., 2020) as the training backbone. The key insight of LoRA is that the actual change in LLMs' weights required for new task adaptation lies in a low-dimensional subspace. Specifically, for a pretrained weight matrix W0, the authors model the adapted weight matrix as W0 + ΔW, where ΔW is a low-rank update. ΔW is parameterized as ΔW = BA, where A and B are much smaller trainable matrices. The rank r of ΔW is chosen to be much smaller than the dimensions of W0. The intuition is that instead of directly training all of W0, the authors train low-dimensional A and B, which indirectly trains W0 in a low-rank subspace of directions that matter for the downstream task. This results in far fewer trainable parameters compared to full fine-tuning. For GPT-3, LoRA reduces the number of trainable parameters by 10,000x and memory usage by 3x compared to full fine-tuning.
低秩適應(yīng)(LoRA)(Hu等,2021)使用低秩更新實現(xiàn)了LLM的高效適應(yīng)。LoRA使用DeepSpeed(Rasley等,2020)作為訓(xùn)練骨干。LoRA的關(guān)鍵洞察是,用于新任務(wù)適應(yīng)的LLM權(quán)重的實際變化位于低維子空間中。
具體而言,對于預(yù)訓(xùn)練權(quán)重矩陣W0,作者將適應(yīng)后的權(quán)重矩陣建模為W0 + ΔW,其中ΔW是低秩更新,參數(shù)化形式為ΔW = BA,A和B是小得多的可訓(xùn)練矩陣,且ΔW的秩r被選擇為遠小于W0的維度。
直覺是,作者不直接訓(xùn)練整個W0,而是訓(xùn)練低維度的A和B,從而在對下游任務(wù)重要的方向構(gòu)成的低秩子空間中間接更新W0。這使得可訓(xùn)練參數(shù)比完全微調(diào)少得多。與完全微調(diào)相比,LoRA將GPT-3的可訓(xùn)練參數(shù)數(shù)量減少了10000倍,內(nèi)存使用量降低了3倍。
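上述低秩重參數(shù)化可以寫成一個包裝線性層:凍結(jié)原權(quán)重W0,只訓(xùn)練低秩矩陣A、B。下面是一個最小的PyTorch示意(非LoRA官方實現(xiàn),初始化與縮放等細節(jié)從簡):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x·W0^T + x·(BA)^T:凍結(jié)W0,只訓(xùn)練低秩矩陣A(r×d_in)和B(d_out×r)。"""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad = False              # 凍結(jié)預(yù)訓(xùn)練權(quán)重W0
        if self.base.bias is not None:
            self.base.bias.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B初始化為0,起始時ΔW=0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(1024, 1024), r=8)
out = layer(torch.randn(4, 1024))
n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, n_trainable)  # torch.Size([4, 1024]) 16384,遠少于1024*1024
```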
7.2、基于添加式—HINT=添加易于微調(diào)的模塊(基于超網(wǎng)絡(luò)數(shù)生成器生成適配器和前綴參數(shù))+插入到骨干模型作為高效的微調(diào)模塊
HINT屬于Addition-based方法。它通過添加易于微調(diào)的模塊(如適配器和前綴)來實現(xiàn)微調(diào),這些模塊沒有包含在原始模型結(jié)構(gòu)中,屬于添加額外的參數(shù)或模塊來實現(xiàn)微調(diào)。
HINT (Ivison et al., 2022) combines the generalization benefits of instruction tuning with efficient on-demand fine-tuning, avoiding repeatedly processing lengthy instructions. The essence of HINT lies in hypernetworks, which generate parameter-efficient modules for LLM adaptation based on natural language instructions and few-shot examples. The adopted hypernetwork converts instructions and few-shot examples into an encoded instruction and generates adapter and prefix parameters using a pretrained text encoder and a cross-attention based parameter generator. Then, the generated adapters and prefixes are inserted into the backbone model as efficient tuning modules. At inference, the hypernetwork performs inference only once per task to generate adapted modules. The benefit is that HINT can incorporate long instructions and additional few-shot examples without increasing compute, unlike regular fine-tuning or input concatenation methods.
HINT(Ivison等,2022)將指令微調(diào)的泛化優(yōu)勢與高效的按需微調(diào)相結(jié)合,避免重復(fù)處理冗長的指令。HINT的核心在于超網(wǎng)絡(luò),它基于自然語言指令和少樣本示例為LLM適應(yīng)生成參數(shù)高效的模塊。采用的超網(wǎng)絡(luò)將指令和少樣本示例轉(zhuǎn)化為編碼指令,并使用預(yù)訓(xùn)練文本編碼器和基于交叉注意力的參數(shù)生成器生成適配器和前綴參數(shù)。然后,生成的適配器和前綴被插入到骨干模型中作為高效的微調(diào)模塊。在推理時,超網(wǎng)絡(luò)僅執(zhí)行一次推理以生成適應(yīng)的模塊。好處是,HINT可以在不增加計算的情況下融入長指令和額外的少樣本,不像常規(guī)微調(diào)或輸入連接方法。
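HINT"用超網(wǎng)絡(luò)從指令編碼生成適配器參數(shù)"的思路可以高度簡化為:指令向量經(jīng)一個參數(shù)生成器映射成瓶頸適配器的權(quán)重,再以殘差方式作用在骨干模型的隱藏狀態(tài)上。下面是一個概念性示意(編碼器用占位向量代替,結(jié)構(gòu)與維度均為假設(shè),并非論文實現(xiàn)):

```python
import torch
import torch.nn as nn

class HyperAdapterGenerator(nn.Module):
    """示意:從指令編碼生成一個瓶頸適配器(down/up兩個投影)的權(quán)重。"""
    def __init__(self, enc_dim: int = 512, hidden: int = 768, bottleneck: int = 16):
        super().__init__()
        self.hidden, self.bottleneck = hidden, bottleneck
        self.generator = nn.Linear(enc_dim, hidden * bottleneck * 2)  # 一次生成down和up兩個矩陣

    def forward(self, instr_encoding: torch.Tensor):
        flat = self.generator(instr_encoding)                      # (hidden*bottleneck*2,)
        down, up = flat.split(self.hidden * self.bottleneck)
        return down.view(self.bottleneck, self.hidden), up.view(self.hidden, self.bottleneck)

def apply_adapter(h, down, up):
    """把生成的適配器作為殘差模塊作用在骨干模型的隱藏狀態(tài)上。"""
    return h + torch.relu(h @ down.T) @ up.T

# 指令只需編碼并生成一次適配器,之后的每條輸入都復(fù)用它
instr_encoding = torch.randn(512)              # 占位:預(yù)訓(xùn)練文本編碼器對指令的池化表示
down, up = HyperAdapterGenerator()(instr_encoding)
hidden_states = torch.randn(2, 10, 768)        # 骨干模型某層的隱藏狀態(tài)
print(apply_adapter(hidden_states, down, up).shape)  # torch.Size([2, 10, 768])
```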
7.3、基于重參數(shù)化—QLoRA=LoRA的量化版+NF4+雙量化DQ+分頁優(yōu)化器PO
QLORA (Dettmers et al., 2023) includes optimal quantization and memory optimization, aiming at providing efficient and effective LLM fine-tuning. QLORA includes 4-bit NormalFloat (NF4) Quantization, which is a quantization scheme optimized for the typical normal distribution of LLM weights. By quantizing based on the quantiles of a normal distribution, NF4 provides better performance than standard 4-bit integer or float quantization. To further reduce memory, the quantization constants are themselves quantized to 8 bits. This second level of quantization saves an additional 0.37 bits per parameter on average. QLORA leverages NVIDIA's unified memory feature to page optimizer states to CPU RAM when GPU memory is exceeded, avoiding out-of-memory errors during training. QLORA enables training a 65B parameter LLM on a single 48GB GPU with no degradation compared to full 16-bit finetuning. QLORA works by freezing the 4-bit quantized base LLM, then backpropagating through it into a small set of 16-bit low-rank adapter weights which are learned.
QLORA(Dettmers等,2023)包括最佳量化和內(nèi)存優(yōu)化,旨在提供高效有效的LLM微調(diào)。QLORA包括4位NormalFloat(NF4)量化,這是一種針對LLM權(quán)重的典型正態(tài)分布優(yōu)化的量化方案。通過基于正態(tài)分布的分位數(shù)進行量化,NF4的性能優(yōu)于標準的4位整數(shù)或浮點數(shù)量化。為了進一步減少內(nèi)存,量化常數(shù)本身被量化為8位。這第二層量化平均可節(jié)省每個參數(shù)0.37位的內(nèi)存。QLORA利用NVIDIA的統(tǒng)一內(nèi)存功能,當GPU內(nèi)存超出限制時,將優(yōu)化器狀態(tài)分頁到CPU RAM中,避免訓(xùn)練期間的內(nèi)存不足。QLORA可以在單個48GB GPU上訓(xùn)練65B參數(shù)的LLM,與完全16位微調(diào)相比沒有降級。QLORA的工作方式是凍結(jié)4位量化的基礎(chǔ)LLM,然后通過反向傳播將其傳播到一小組16位低秩適配器權(quán)重中進行學(xué)習(xí)。
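在實踐中,QLoRA通常借助bitsandbytes的4位NF4量化加載基座模型,再在其上掛接LoRA適配器。下面給出transformers/peft生態(tài)中一種常見寫法的示意(模型名為假設(shè)示例,具體參數(shù)以相應(yīng)庫的官方文檔為準):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4量化 + 雙重量化:量化常數(shù)本身再被量化為8位
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",            # 假設(shè)的模型名,僅作示例
    quantization_config=bnb_config,
    device_map="auto",
)

# 凍結(jié)4位量化的基座,只學(xué)習(xí)16位的低秩適配器權(quán)重
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```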
7.4、基于重參數(shù)化—LOMO=降低梯度內(nèi)存需求(融合梯度計算與參數(shù)更新+實時只存儲單個參數(shù)的梯度)+穩(wěn)定訓(xùn)練(梯度值裁剪+分離梯度范數(shù)計算+動態(tài)損失縮放)+節(jié)省內(nèi)存(激活檢查點+ZeRO優(yōu)化)
LOMO屬于Reparameterization-based方法。LOMO通過將梯度計算和參數(shù)更新融合到一個步驟中,來避免存儲完整的梯度張量,從而實現(xiàn)只存儲單個參數(shù)梯度的能力,從而更高效地進行微調(diào)。這屬于使用參數(shù)重參數(shù)化的方法來實現(xiàn)更高效的微調(diào)。
LOw-Memory Optimization (LOMO) (Lv et al., 2023) enables full parameter fine-tuning of LLMs using limited computational resources through a fusion of gradient computation and update. The essence is to fuse gradient computation and parameter update into one step during backpropagation, thereby avoiding storage of full gradient tensors. Firstly, theoretical analysis is provided in LOMO on why SGD can work well for fine-tuning large pre-trained models despite its challenges on smaller models. In addition, LOMO updates each parameter tensor immediately after computing its gradient in backpropagation. Storing the gradient of one parameter at a time reduces gradient memory to O(1). LOMO employs gradient value clipping, separate gradient norm computation pass and dynamic loss scaling to stabilize training. The integration of activation checkpointing and ZeRO optimization methods saves memory.
低內(nèi)存優(yōu)化(LOMO)(Lv等,2023)通過梯度計算和更新的融合,在有限的計算資源下實現(xiàn)LLM的全參數(shù)微調(diào)。其核心是在反向傳播期間將梯度計算和參數(shù)更新融合為一步,從而避免存儲完整的梯度張量。首先,LOMO在理論上分析了為什么SGD可以在微調(diào)大型預(yù)訓(xùn)練模型時表現(xiàn)良好,盡管在較小的模型上可能存在挑戰(zhàn)。此外,LOMO在反向傳播中在計算梯度后立即更新每個參數(shù)張量。一次只存儲一個參數(shù)的梯度將梯度內(nèi)存降低到O(1)。LOMO采用梯度值裁剪、單獨的梯度范數(shù)計算傳遞和動態(tài)損失縮放來穩(wěn)定訓(xùn)練。激活檢查點和ZeRO優(yōu)化方法的集成可節(jié)省內(nèi)存。
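LOMO"算完即更"的核心可以用參數(shù)級梯度鉤子來示意:某個參數(shù)的梯度一經(jīng)算出就立刻用于SGD更新并丟棄,從而無需保留完整的梯度張量。下面是一個概念性示意(省略了梯度裁剪、動態(tài)損失縮放等穩(wěn)定性技巧,并非論文官方實現(xiàn)):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))
lr = 1e-3

def make_hook(param):
    def hook(grad):
        # 反向傳播中該參數(shù)的梯度一旦算出,立刻原地做SGD更新
        with torch.no_grad():
            param.add_(grad, alpha=-lr)
        # 返回零梯度占位,表示不再需要保留該梯度(真實實現(xiàn)會直接丟棄,不寫入.grad)
        return torch.zeros_like(grad)
    return hook

for p in model.parameters():
    p.register_hook(make_hook(p))

x, y = torch.randn(8, 512), torch.randn(8, 512)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()  # 反向傳播過程中即完成所有參數(shù)的更新,無需再調(diào)用optimizer.step()
```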
7.5、基于規(guī)范化—Delta-tuning=優(yōu)化和最優(yōu)控制視角+將微調(diào)限制在低維流形上來執(zhí)行子空間優(yōu)化+微調(diào)參數(shù)充當最優(yōu)控制器+在下游任務(wù)中引導(dǎo)模型行為
Delta-tuning屬于Specification-based方法。Delta-tuning通過限制微調(diào)在一個低維子空間上進行,來指定預(yù)訓(xùn)練模型中的某些固有參數(shù)進行微調(diào),而凍結(jié)其他參數(shù)。這屬于指定模型參數(shù)子集進行微調(diào)的Specification-based方法。
Delta-tuning (Ding et al., 2023b) provides optimization and optimal control perspectives for theoretical analyzation. Intuitively, delta-tuning performs subspace optimization by restricting tuning to a low-dimensional manifold. The tuned parameters act as optimal controllers guiding model behavior on downstream tasks.
Delta-tuning(Ding等,2023b)提供了優(yōu)化和最優(yōu)控制的理論分析視角。直觀地說,Delta-tuning通過將調(diào)整限制在低維流形上來執(zhí)行子空間優(yōu)化。調(diào)整的參數(shù)充當引導(dǎo)模型在下游任務(wù)中行為的最優(yōu)控制器。
8、Evaluation, Analysis and Criticism評估、分析和批評
8.1、HELM Evaluation:整體評估+提高LM透明度+關(guān)注三因素(廣泛性+多指標性+標準化)
HELM(Liang et al., 2022) is a holistic evaluation of Language Models (LMs) to improve the transparency of language models, providing a more comprehensive understanding of the capabilities, risks, and limitations of language models. Specifically, differing from other evaluation methods, HELM holds that a holistic evaluation of language models should focus on the following three factors:
HELM(Liang等,2022)是對語言模型(LMs)進行整體評估,旨在提高語言模型的透明度,從而更全面地了解語言模型的能力、風(fēng)險和限制。與其他評估方法不同,HELM認為對語言模型進行整體評估應(yīng)關(guān)注以下三個因素:
(1)、Broad coverage. During the development, language models can be adapted to various NLP tasks (e.g., sequence labeling and question answering), thus, the evaluation of language models needs to be carried out in a wide range of scenarios. To involve all potential scenarios,
HELM proposed a top-down taxonomy, which begins by compiling all existing tasks in a major NLP conference (ACL2022) into a task space and dividing each task into the form of scenarios (e.g., languages) and metrics (e.g., accuracy). Then when facing a specific task, the taxonomy would select one or more scenarios and metrics in the task space to cover it. By analyzing the structure of each task, HELM clarifies the evaluation content (task scenarios and metrics) and improves the scenario coverage of language models from 17.9% to 96.0%.
(2)、Multi-metric measurement. In order to enable human to weigh language models from different perspectives, HELM proposes multi- metric measurement. HELM has covered 16 different scenarios and 7 metrics. To ensure the results of intensive multi-metric measurement, HELM measured 98 of 112 possible core scenarios (87.5%).
(3)、Standardization. The increase in the scale and training complexity of language models has seriously hindered human's understanding of the structure of each language model. To establish a unified understanding of existing language models, HELM benchmarks 30 well-known language models, covering such institutions as Google (UL2 (Tay et al., 2022)), OpenAI (GPT-3 (Brown et al., 2020b)), and EleutherAI (GPT-NeoX (Black et al., 2022)). Interestingly, HELM pointed out that LMs such as T5 (Raffel et al., 2019) and Anthropic-LMv4-s3 (Bai et al., 2022a) had not been directly compared in the initial work, while LLMs such as GPT-3 and YaLM were still different from their corresponding reports after multiple evaluations.
(1)廣泛涵蓋。在開發(fā)過程中,語言模型可以適應(yīng)各種自然語言處理任務(wù)(例如序列標注和問題回答),因此需要在廣泛的情景下進行語言模型的評估。為了涵蓋所有潛在情景,HELM提出了一種自上而下的分類法,首先將主要的自然語言處理會議(ACL2022)中的所有現(xiàn)有任務(wù)編譯成任務(wù)空間,并將每個任務(wù)劃分為情景(例如語言)和指標(例如準確性)的形式。然后在面對特定任務(wù)時,分類法會選擇任務(wù)空間中的一個或多個情景和指標來涵蓋它。通過分析每個任務(wù)的結(jié)構(gòu),HELM明確了評估內(nèi)容(任務(wù)情景和指標),并將語言模型的情景涵蓋范圍從17.9%提高到96.0%。
(2)多指標測量。為了使人類能夠從不同角度權(quán)衡語言模型,HELM提出了多指標測量。HELM涵蓋了16種不同的情景和7個指標。為了確保密集的多指標測量結(jié)果,HELM對112個可能的核心情景中的98個進行了測量(87.5%)。
(3)標準化。語言模型規(guī)模和訓(xùn)練復(fù)雜性的增加嚴重阻礙了人類對每個語言模型結(jié)構(gòu)的理解。為了建立對現(xiàn)有語言模型的統(tǒng)一理解,HELM對30個知名語言模型進行了基準測試,涵蓋了Google(UL2(Tay等,2022))、OpenAI(GPT-3(Brown等,2020b))和EleutherAI(GPT-NeoX(Black等,2022))等機構(gòu)。有趣的是,HELM指出,T5(Raffel等,2019)和Anthropic-LMv4-s3(Bai等,2022a)等LM在最初的工作中尚未被直接比較,而GPT-3和YaLM等LLM在多次評估后仍與其對應(yīng)的報告結(jié)果存在差異。
8.2、Low-resource Instruction Tuning低資源指令微調(diào):STL需要數(shù)據(jù)量的25%、MTL需要數(shù)據(jù)量的6%
Gupta et al. (2023) attempt to estimate the minimal downstream training data required by IT models to match the SOTA supervised models over various tasks. They conducted experiments on 119 tasks from Super Natural Instructions (SuperNI) in both single-task learning (STL) and multi-task learning (MTL) settings. The results indicate that in the STL setting, IT models with only 25% of downstream training data outperform the SOTA models on those tasks, while in the MTL setting, just 6% of downstream training data can lead IT models to achieve the SOTA performance. These findings suggest that instruction tuning can effectively assist a model in quickly learning a task even with limited data.
However, due to resource limitations, Gupta et al. (2023) did not conduct experiments on LLMs, like T5-11B. So, to gain a more comprehensive understanding of the IT models, further investigation using larger language models and datasets is necessary.
Gupta等人(2023)試圖估計IT模型最少需要多少下游訓(xùn)練數(shù)據(jù),才能在各種任務(wù)上匹配SOTA監(jiān)督模型。他們在Super Natural Instructions(SuperNI)的119個任務(wù)上進行了實驗,包括單任務(wù)學(xué)習(xí)(STL)和多任務(wù)學(xué)習(xí)(MTL)兩種設(shè)置。結(jié)果表明,在STL設(shè)置下,IT模型只需使用下游訓(xùn)練數(shù)據(jù)的25%即可在這些任務(wù)上勝過SOTA模型;而在MTL設(shè)置下,只需下游訓(xùn)練數(shù)據(jù)的6%即可使IT模型達到SOTA性能。這些發(fā)現(xiàn)表明,即使數(shù)據(jù)有限,指令微調(diào)也能有效地幫助模型迅速學(xué)習(xí)任務(wù)。
然而,由于資源限制,Gupta等人(2023)并沒有對像T5-11B這樣的LLMs進行實驗。因此,為了更全面地了解IT模型,需要進一步使用更大的語言模型和數(shù)據(jù)集進行調(diào)查。
8.3、Smaller Instruction Dataset更小的指令數(shù)據(jù)集:LIMA(精選1,000個訓(xùn)練示例)表明可通過少數(shù)精心策劃的指令進行微調(diào)
IT requires a substantial amount of specialized instruction data for training. Zhou et al. (2023) hypothesized that the pre-trained LLM only has to learn the style or format to interact with users and proposed LIMA that achieves strong performance by fine-tuning an LLM on only 1,000 carefully selected training examples.
Specifically, LIMA first manually curates 1,000 demonstrations with high-quality prompts and responses. Then the 1,000 demonstrations are used to fine-tune the pre-trained 65B-parameter LLaMa (Touvron et al., 2023b). By comparison, across more than 300 challenging tasks, LIMA outperforms GPT-davinci003 (Brown et al., 2020b), which was fine-tuned on 5,200 examples by human feedback tuning. Moreover, with only half the amount of demonstrations, LIMA achieves equivalent results to GPT-4 (OpenAI, 2023), Claude (Bai et al., 2022b), and Bard. Above all, LIMA demonstrated that LLMs' powerful knowledge and capabilities can be exposed to users with only a few carefully curated instructions for fine-tuning.
IT需要大量的專門指令數(shù)據(jù)進行訓(xùn)練。Zhou等人(2023)假設(shè)預(yù)訓(xùn)練LLM只需學(xué)習(xí)與用戶互動的樣式或格式,并提出了LIMA,通過僅在1,000個精選的訓(xùn)練示例上微調(diào)LLM,實現(xiàn)了強大的性能。
具體而言,LIMA首先手動策劃了1,000個具有高質(zhì)量提示和回復(fù)的演示,然后用這1,000個演示微調(diào)預(yù)訓(xùn)練的65B參數(shù)LLaMa(Touvron等,2023b)。相比之下,在超過300個具有挑戰(zhàn)性的任務(wù)中,LIMA勝過了在5,200個示例上通過人工反饋微調(diào)的GPT-davinci003(Brown等,2020b)。此外,僅用一半數(shù)量的演示,LIMA就能取得與GPT-4(OpenAI,2023)、Claude(Bai等,2022b)和Bard相當?shù)慕Y(jié)果??傊?#xff0c;LIMA表明,只需少數(shù)精心策劃的指令進行微調(diào),就能把LLM強大的知識和能力展現(xiàn)給用戶。
8.4、Evaluating Instruction-tuning Datasets評估指令微調(diào)數(shù)據(jù)集:缺乏開放性和主觀性的評估
The performance of IT models highly depends on the IT datasets. However, there is a lack of evaluations of these IT datasets from open-ended and subjective aspects.
To address this issue, Wang et al. (2023c) perform dataset evaluation by fine-tuning the LLaMa model (Touvron et al., 2023b) on various open IT datasets and measure the different fine-tuned models through both automatic and human evaluations. An additional model is trained on the combination of IT datasets. For the results, Wang et al. (2023c) showed that there is not a single best IT dataset across all tasks, while manually combining datasets can achieve the best overall performance. Besides, Wang et al. (2023c) pointed out that though IT can bring large benefits on LLMs at all sizes, smaller models and models with a high base quality benefit most from IT. For human evaluations, Wang et al. (2023c) found that a larger model is more likely to gain a higher acceptability score.
IT模型的性能在很大程度上取決于IT數(shù)據(jù)集。然而,這些IT數(shù)據(jù)集在開放性和主觀性方面缺乏評估。
為了解決這個問題,Wang等人(2023c)通過在各種開放IT數(shù)據(jù)集上微調(diào)LLaMa模型(Touvron等,2023b),并通過自動和人工評估來比較不同的微調(diào)模型,從而對這些數(shù)據(jù)集進行評估;此外還在多個IT數(shù)據(jù)集的組合上訓(xùn)練了一個額外的模型。結(jié)果表明,并沒有一個在所有任務(wù)上都最佳的單一IT數(shù)據(jù)集,而通過手動組合數(shù)據(jù)集可以實現(xiàn)最佳的整體性能。此外,Wang等人(2023c)指出,盡管IT能給各種規(guī)模的LLM帶來很大的好處,但較小的模型和基礎(chǔ)質(zhì)量較高的模型從IT中受益最多。在人工評估方面,Wang等人(2023c)發(fā)現(xiàn)較大的模型更有可能獲得更高的可接受性評分。
8.5、Do IT just learn Pattern Copying?IT是否只是學(xué)習(xí)模式復(fù)制?——有論文指出基于IT的顯著改進只是捕獲表面級別模式而非理解了本質(zhì)
To address the lack of clarity about the specific knowledge that models acquire through instruction tuning, Kung and Peng (2023) delves into the analysis of how models make use of instructions during IT by comparing the tuning when provided with altered instructions versus the original instructions.
Specifically, Kung and Peng (2023) creates simplified task definitions that remove all semantic components, leaving only the output information. In addition, Kung and Peng (2023) also incorporates delusive examples that contain incorrect input-output mapping. Surprisingly, the experiments show that models trained on these simplified task definitions or delusive examples can achieve comparable performance to the ones trained on the original instructions and examples. Moreover, the paper also introduces a baseline for the classification task with zero-shot, which achieves similar performance to IT in low-resource settings.
In summary, according to Kung and Peng (2023), the notable performance improvements observed in current IT models may be attributed to their ability to capture surface-level patterns, such as learning the output format and making guesses, rather than comprehending and learning the specific task.
針對模型通過指令微調(diào)究竟學(xué)到了哪些具體知識尚不清晰的問題,Kung和Peng(2023)通過比較使用修改后的指令與原始指令進行微調(diào)的情況,深入分析了模型在指令微調(diào)過程中是如何使用指令的。
具體而言,Kung和Peng(2023)創(chuàng)建了簡化的任務(wù)定義,去除了所有語義成分,只保留輸出空間信息。此外,他們還引入了包含不正確輸入-輸出映射的誤導(dǎo)性示例。令人驚訝的是,實驗表明,在這些簡化任務(wù)定義或誤導(dǎo)性示例上訓(xùn)練的模型,可以達到與在原始指令和示例上訓(xùn)練的模型相當?shù)男阅堋4送?#xff0c;該論文還引入了零樣本分類任務(wù)的基線,其在低資源設(shè)置下實現(xiàn)了與IT相似的性能。
總之,根據(jù)Kung和Peng(2023)的觀點,當前IT模型中觀察到的顯著性能改進可能歸因于其捕捉表面級別的模式,例如學(xué)習(xí)輸出格式和進行猜測,而不是理解和學(xué)習(xí)特定任務(wù)。
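下面用一個極簡的Python示意說明上述兩種探針的構(gòu)造方式:(a)去掉語義描述、只保留輸出空間信息的簡化任務(wù)定義;(b)打亂輸入-輸出對應(yīng)關(guān)系的誤導(dǎo)性示例。其中的數(shù)據(jù)格式僅為示意性假設(shè),并非Kung和Peng(2023)的原始設(shè)置。

```python
# Construct the two probes: (a) a simplified task definition that strips the
# semantic description and keeps only output-space information, and (b) "delusive"
# examples whose input-output mapping is deliberately broken.
original_task = {
    "definition": "Given a movie review, decide whether its sentiment is positive or negative.",
    "labels": ["positive", "negative"],
    "examples": [
        {"input": "A heartfelt, beautifully shot film.", "output": "positive"},
        {"input": "Two hours of my life I will never get back.", "output": "negative"},
        {"input": "The plot makes no sense at all.", "output": "negative"},
    ],
}

def simplify_definition(task: dict) -> str:
    # Remove all task semantics; expose only the allowed output space.
    return "Answer with one of: " + ", ".join(task["labels"]) + "."

def make_delusive(examples: list) -> list:
    # Pair each input with the next example's output, so the input-output
    # mapping is generally incorrect while the label distribution is unchanged.
    outputs = [ex["output"] for ex in examples]
    rotated = outputs[1:] + outputs[:1]
    return [{"input": ex["input"], "output": out}
            for ex, out in zip(examples, rotated)]

print(simplify_definition(original_task))
for ex in make_delusive(original_task["examples"]):
    print(ex)
```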
8.6、Proprietary LLMs Imitation專有LLMs模仿:微調(diào)模型能效仿ChatGPT的表達風(fēng)格,但不等于提升其通用能力→更應(yīng)注重基模型及指導(dǎo)實例的質(zhì)量
Gudibande等人收集ChatGPT在多個領(lǐng)域的輸出數(shù)據(jù)用于微調(diào)開源模型,旨在使開源模型在部分領(lǐng)域的能力接近專有模型。他們的實驗顯示,在有模仿數(shù)據(jù)集支持的任務(wù)上,微調(diào)后模型的表現(xiàn)明顯提高,輸出與ChatGPT相似;但在沒有模仿數(shù)據(jù)集的任務(wù)上,微調(diào)模型沒有改進甚至效果下降。他們指出,微調(diào)模型能效仿ChatGPT的表達風(fēng)格,但這并不等于提升了其通用能力。研究者應(yīng)注重基模型及指導(dǎo)實例的質(zhì)量,而不是模仿專有模型。
LLM imitation is an approach that collects outputs from a stronger model, such as a proprietary system like ChatGPT, and uses these outputs to fine-tune an open-source LLM. In this way, an open-source LLM may obtain capabilities competitive with any proprietary model.
Gudibande et al. (2023) conducted several experiments to critically analyze the efficacy of model imitation. Specifically, they first collected datasets from ChatGPT's outputs over a broad range of tasks. These datasets were then used to fine-tune a range of models covering sizes from 1.5B to 13B parameters, base models GPT-2 and LLaMA, and data amounts from 0.3M to 150M tokens.
For evaluations, Gudibande et al. (2023) demonstrated that on tasks with supporting imitation datasets, imitation models perform far better than before and their outputs appear similar to ChatGPT's; on tasks without imitation datasets, however, imitation models show no improvement or even a decline in accuracy.
Thus, Gudibande et al. (2023) pointed out that it is precisely because imitation models are adept at mimicking ChatGPT's style (e.g., being fluent, confident and well-structured) that researchers get the illusion that these models possess general abilities. Therefore, Gudibande et al. (2023) suggested that, instead of imitating proprietary models, researchers should focus on improving the quality of base models and instruction examples.
LLM模仿是一種收集更強大模型(例如ChatGPT等專有系統(tǒng))的輸出,并利用這些輸出微調(diào)開源LLM的方法。通過這種方式,開源LLM有可能獲得與專有模型相競爭的能力。
Gudibande等人(2023)進行了多項實驗,以批判性地分析模型模仿的效果。具體而言,Gudibande等人(2023)首先從廣泛的任務(wù)中收集了ChatGPT的輸出數(shù)據(jù)集。然后,這些數(shù)據(jù)集被用于微調(diào)覆蓋從1.5B到13B大小的一系列模型,基礎(chǔ)模型為GPT-2和LLaMA,數(shù)據(jù)量為0.3M到150M個標記。
在評估方面,Gudibande等人(2023)證明,在有模仿數(shù)據(jù)集支持的任務(wù)上,模仿模型的表現(xiàn)遠好于微調(diào)之前,其輸出與ChatGPT的輸出相似;然而,在沒有模仿數(shù)據(jù)集的任務(wù)上,模仿模型沒有提升,甚至在準確率上有所下降。
因此,Gudibande等人(2023)指出,模仿模型擅長模仿ChatGPT的風(fēng)格(例如流利、自信和良好結(jié)構(gòu)),這使得研究人員產(chǎn)生了有關(guān)模仿模型的普遍能力的錯覺。因此,Gudibande等人(2023)建議,研究人員不應(yīng)該模仿專有模型,而應(yīng)該專注于提高基礎(chǔ)模型和指令示例的質(zhì)量。
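下面是"收集專有模型輸出以構(gòu)建模仿數(shù)據(jù)集"這一做法的極簡示意:向更強的專有模型發(fā)送覆蓋多類任務(wù)的提示,并把其回復(fù)保存為(指令,輸出)對,供后續(xù)有監(jiān)督微調(diào)使用。其中query_teacher僅是假設(shè)性占位函數(shù),需替換為真實的聊天補全客戶端,提示列表也僅作示意。

```python
# Build an "imitation" dataset: send prompts spanning broad task types to a
# stronger proprietary model and store its replies as (instruction, output) pairs
# for later supervised fine-tuning of an open-source LLM.
# query_teacher is a hypothetical placeholder, not a real API client.
import json

def query_teacher(prompt: str) -> str:
    """Hypothetical placeholder for a call to a proprietary chat model."""
    raise NotImplementedError("replace with a real chat-completion client")

task_prompts = [
    "Summarize the causes of the French Revolution in three sentences.",
    "Write a Python function that reverses a linked list.",
    "Explain the difference between TCP and UDP to a new engineer.",
]

with open("imitation_data.jsonl", "w", encoding="utf-8") as f:
    for prompt in task_prompts:
        try:
            answer = query_teacher(prompt)
        except NotImplementedError:
            continue  # skip until a real client is plugged in
        f.write(json.dumps({"instruction": prompt, "output": answer},
                           ensure_ascii=False) + "\n")
# The resulting JSONL file can be fed into a standard supervised fine-tuning
# pipeline such as the LIMA-style sketch shown earlier.
```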
This work surveys recent advances in the fast-growing field of instruction tuning. We make a systematic review of the literature, including the general methodology of IT, the construction of IT datasets, the training of IT models, and IT's applications to different modalities, domains and applications. We also review analyses of IT models to discover both their advantages and potential pitfalls. We hope this work will act as a stimulus to motivate further endeavors to address the deficiencies of current IT models.
本文對迅速發(fā)展的指令微調(diào)領(lǐng)域的最新進展進行了綜述。我們對文獻進行了系統(tǒng)性的回顧,包括IT的一般方法論、IT數(shù)據(jù)集的構(gòu)建、IT模型的訓(xùn)練,以及IT在不同模態(tài)、領(lǐng)域和應(yīng)用中的應(yīng)用。我們還回顧了對IT模型的分析,以揭示它們的優(yōu)勢和潛在問題。我們希望本文能夠成為一種推動力,激勵更多的努力來解決當前IT模型的不足之處。