AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模態(tài)基礎(chǔ)模型:從專家到通用助手》翻譯與解讀之簡(jiǎn)介
導(dǎo)讀:本文是對(duì)展示視覺和視覺語(yǔ)言能力的多模態(tài)基礎(chǔ)模型的全面調(diào)查,重點(diǎn)關(guān)注從專業(yè)模型向通用助手的過渡。本文涵蓋了五個(gè)核心主題:
>> 視覺理解:本部分探討了用于視覺理解的視覺骨干網(wǎng)絡(luò)的學(xué)習(xí)方法,包括監(jiān)督預(yù)訓(xùn)練、對(duì)比語(yǔ)言-圖像預(yù)訓(xùn)練、僅圖像的自監(jiān)督學(xué)習(xí),以及多模態(tài)融合、區(qū)域級(jí)和像素級(jí)預(yù)訓(xùn)練。
>> 視覺生成:本部分討論了視覺生成的各個(gè)方面,如視覺生成中的人類對(duì)齊、文本到圖像生成、空間可控生成、基于文本的編輯、文本提示跟隨以及概念定制。
>> 統(tǒng)一視覺模型:本部分考察了視覺模型從封閉集到開放集模型、任務(wù)特定模型到通用模型、靜態(tài)到可提示模型的演變。
>> 加持LLMs的大型多模態(tài)模型:本部分探討了使用大型語(yǔ)言模型(LLMs)訓(xùn)練大型多模態(tài)模型,包括LLMs中的指令微調(diào),以及經(jīng)指令微調(diào)的大型多模態(tài)模型的開發(fā)。
>> 多模態(tài)智能體:本部分重點(diǎn)關(guān)注將多模態(tài)工具與LLMs鏈接以創(chuàng)建多模態(tài)智能體,包括對(duì)MM-REACT的案例研究、高級(jí)主題以及多種多模態(tài)智能體的應(yīng)用。
本文的關(guān)鍵論點(diǎn)包括視覺在人工智能中的重要性,多模態(tài)基礎(chǔ)模型從專業(yè)模型向通用助手的演變,以及大型多模態(tài)模型和多模態(tài)代理在各種應(yīng)用中的潛力。本文旨在面向計(jì)算機(jī)視覺和視覺語(yǔ)言多模態(tài)社區(qū)的研究人員、研究生和專業(yè)人士,他們希望了解多模態(tài)基礎(chǔ)模型的基礎(chǔ)知識(shí)和最新進(jìn)展。
相關(guān)文章
AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模態(tài)基礎(chǔ)模型:從專家到通用助手》翻譯與解讀之簡(jiǎn)介
AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模態(tài)基礎(chǔ)模型:從專家到通用助手》翻譯與解讀之視覺理解、視覺生成
AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模態(tài)基礎(chǔ)模型:從專家到通用助手》翻譯與解讀之統(tǒng)一的視覺模型、加持LLMs的大型多模態(tài)模型
AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模態(tài)基礎(chǔ)模型:從專家到通用助手》翻譯與解讀之與LLM協(xié)同工作的多模態(tài)智能體、結(jié)論和研究趨勢(shì)
《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模態(tài)基礎(chǔ)模型:從專家到通用助手》翻譯與解讀
論文地址:https://arxiv.org/abs/2309.10020
發(fā)表時(shí)間:2023年9月18日
作者:微軟團(tuán)隊(duì)
Abstract
研究了分屬兩類的5個(gè)核心主題:成熟的研究領(lǐng)域(視覺骨干網(wǎng)絡(luò)學(xué)習(xí)+文本到圖像生成)、探索性的開放研究領(lǐng)域(受LLM啟發(fā)的統(tǒng)一視覺模型+多模態(tài)LLM的端到端訓(xùn)練+將多模態(tài)工具與LLMs鏈接)
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, including two topics – methods of learning vision backbones for visual understanding and text-to-image generation. (ii) Then, we present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, including three topics – unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs. The target audiences of the paper are researchers, graduate students, and professionals in computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances in multimodal foundation models.
本文提供了一份綜合性的調(diào)查,涵蓋了展示視覺和視覺語(yǔ)言能力的多模態(tài)基礎(chǔ)模型的分類和演化,重點(diǎn)關(guān)注了從專業(yè)模型過渡到通用助手的過程。研究領(lǐng)域涵蓋了五個(gè)核心主題,分為兩類。
(i)我們首先調(diào)查了一些成熟的研究領(lǐng)域:為特定目的預(yù)先訓(xùn)練的多模態(tài)基礎(chǔ)模型,包括兩個(gè)主題——用于視覺理解的視覺骨干網(wǎng)絡(luò)學(xué)習(xí)方法和文本到圖像生成。
(ii)然后,我們介紹了探索性開放研究領(lǐng)域的最新進(jìn)展:旨在扮演通用助手角色的多模態(tài)基礎(chǔ)模型,包括三個(gè)主題——受大型語(yǔ)言模型(LLMs)啟發(fā)的統(tǒng)一視覺模型、多模態(tài)LLMs的端到端訓(xùn)練,以及將多模態(tài)工具與LLMs鏈接。本文的目標(biāo)受眾是渴望學(xué)習(xí)多模態(tài)基礎(chǔ)模型的基礎(chǔ)知識(shí)和最新進(jìn)展的計(jì)算機(jī)視覺和視覺語(yǔ)言多模態(tài)社區(qū)的研究人員、研究生和專業(yè)人員。
1 Introduction
Vision is one of the primary channels for humans and many living creatures to perceive and interact with the world. One of the core aspirations in artificial intelligence (AI) is to develop AI agents to mimic such an ability to effectively perceive and generate visual signals, and thus reason over and interact with the visual world. Examples include recognition of the objects and actions in the scenes, and creation of sketches and pictures for communication. Building foundational models with visual capabilities is a prevalent research field striving to accomplish this objective.
視覺是人類和許多生物感知世界并與之互動(dòng)的主要渠道之一。人工智能(AI)的核心目標(biāo)之一是開發(fā)AI代理,模仿這種有效感知和生成視覺信號(hào)的能力,從而在視覺世界中進(jìn)行推理和互動(dòng)。示例包括識(shí)別場(chǎng)景中的物體和動(dòng)作,以及創(chuàng)建用于溝通的草圖和圖片。構(gòu)建具有視覺能力的基礎(chǔ)模型,正是旨在實(shí)現(xiàn)這一目標(biāo)的熱門研究領(lǐng)域。
特定任務(wù)的模型(需從頭開始訓(xùn)練)→基于大規(guī)模預(yù)訓(xùn)練的語(yǔ)言模型(為下游任務(wù)適配提供基礎(chǔ),如BERT/GPT-2等)→大一統(tǒng)的大型語(yǔ)言模型LLMs(涌現(xiàn)出新能力【如ICL/CoT】,如GPT-3等)→作為通用助手的LLMs(具備有趣的能力【互動(dòng)和使用工具】,如ChatGPT/GPT-4)
Over the last decade, the field of AI has experienced a fruitful trajectory in the development of models. We divide them into four categories, as illustrated in Figure 1.1. The categorization can be shared among different fields in AI, including language, vision and multimodality. We first use language models in NLP to illustrate the evolution process. (i) In the early years, task-specific models are developed for individual datasets and tasks, typically being trained from scratch. (ii) With large-scale pre-training, language models achieve state-of-the-art performance on many established language understanding and generation tasks, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), T5 (Raffel et al., 2020), DeBERTa (He et al., 2021) and GPT-2 (Radford et al., 2019). These pre-trained models serve as the basis for downstream task adaptation. (iii) Exemplified by GPT-3 (Brown et al., 2020), large language models (LLMs) unify various language understanding and generation tasks into one model. With web-scale training and unification, some emerging capabilities appear, such as in-context-learning and chain-of-thoughts. (iv) With recent advances in human-AI alignment, LLMs start to play the role of general-purpose assistants to follow human intents to complete a wide range of language tasks in the wild, such as ChatGPT (OpenAI, 2022) and GPT-4 (OpenAI, 2023a). These assistants exhibit interesting capabilities, such as interaction and tool use, and lay a foundation for developing general-purpose AI agents. It is important to note that the latest iterations of foundation models build upon the noteworthy features of their earlier counterparts while also providing additional capabilities.
在過去的十年里,AI領(lǐng)域在模型的發(fā)展方面經(jīng)歷了一個(gè)富有成效的軌跡。我們將它們分為四個(gè)類別,如圖1.1所示。這種分類可以在AI的不同領(lǐng)域共享,包括語(yǔ)言、視覺和多模態(tài)。我們首先使用自然語(yǔ)言處理(NLP)中的語(yǔ)言模型來說明演化過程。
(i) 在早期,為個(gè)別數(shù)據(jù)集和任務(wù)開發(fā)了特定任務(wù)的模型,通常是從頭開始訓(xùn)練的。
(ii) 借助大規(guī)模預(yù)訓(xùn)練,語(yǔ)言模型在許多已建立的語(yǔ)言理解和生成任務(wù)上取得了最先進(jìn)的性能,如BERT(Devlin等人,2019)、RoBERTa(Liu等人,2019)、T5(Raffel等人,2020)、DeBERTa(He等人,2021)和GPT-2(Radford等人,2019)。這些預(yù)訓(xùn)練模型為下游任務(wù)的適應(yīng)提供了基礎(chǔ)。
(iii) 以GPT-3(Brown等人,2020)為代表,大型語(yǔ)言模型(LLMs)將各種語(yǔ)言理解和生成任務(wù)統(tǒng)一到一個(gè)模型中。通過規(guī)模龐大的訓(xùn)練和統(tǒng)一,出現(xiàn)了一些新興的能力,如上下文學(xué)習(xí)ICL和思維鏈CoT。
(iv) 隨著人類與AI對(duì)齊方面的最新進(jìn)展,LLMs開始扮演通用助手的角色,遵循人類的意圖,在真實(shí)開放場(chǎng)景中完成各種語(yǔ)言任務(wù),如ChatGPT(OpenAI,2022)和GPT-4(OpenAI,2023a)。這些助手展示出了有趣的能力,如互動(dòng)和工具使用,并為開發(fā)通用AI代理奠定了基礎(chǔ)。值得注意的是,最新一代的基礎(chǔ)模型在保留早期版本顯著特征的基礎(chǔ)上,還提供了額外的能力。
Inspired by the great successes of LLMs in NLP, it is natural for researchers in the computer vision and vision-language community to ask the question: what is the counterpart of ChatGPT/GPT-4 for vision, vision-language and multi-modal models? There is no doubt that vision pre-training and vision-language pre-training (VLP) have attracted a growing attention since the birth of BERT, and have become the mainstream learning paradigm for vision, with the promise to learn universal transferable visual and vision-language representations, or to generate highly plausible images. Arguably, they can be considered as the early generation of multimodal foundation models, just as BERT/GPT-2 to the language field. While the road-map to build general-purpose assistants for language such as ChatGPT is clear, it is becoming increasingly crucial for the research community to explore feasible solutions to building its counterpart for computer vision: the general-purpose visual assistants. Overall, building general-purpose agents has been a long-standing goal for AI. LLMs with emerging properties have significantly reduced the cost of building such agents for language tasks. Similarly, we foresee emerging capabilities from vision models, such as following the instructions composed by various visual prompts like user-uploaded images, human-drawn clicks, sketches and masks, in addition to text prompts. Such strong zero-shot visual task composition capabilities can significantly reduce the cost of building AI agents.
受到LLMs在NLP領(lǐng)域的巨大成功的啟發(fā),計(jì)算機(jī)視覺和視覺語(yǔ)言社區(qū)的研究人員很自然地會(huì)提出這樣的問題:對(duì)于視覺、視覺語(yǔ)言和多模態(tài)模型,ChatGPT/GPT-4的對(duì)應(yīng)物是什么?
毫無疑問,自BERT誕生以來,視覺預(yù)訓(xùn)練和視覺語(yǔ)言預(yù)訓(xùn)練(VLP)已經(jīng)引起了越來越多的關(guān)注,并已成為視覺領(lǐng)域的主流學(xué)習(xí)范式,有望學(xué)習(xí)普遍可轉(zhuǎn)移的視覺和視覺語(yǔ)言表征,或生成高度可信的圖像??梢哉f,它們可以被視為多模態(tài)基礎(chǔ)模型的早期代表,就像BERT/GPT-2對(duì)語(yǔ)言領(lǐng)域一樣。雖然建立像ChatGPT這樣的語(yǔ)言通用助手的路線圖已經(jīng)明確,但對(duì)于計(jì)算機(jī)視覺的對(duì)應(yīng)物——通用視覺助手,研究社區(qū)越來越需要探索可行的解決方案。
總的來說,構(gòu)建通用代理一直是AI的長(zhǎng)期目標(biāo)。具備涌現(xiàn)能力的LLMs極大地降低了為語(yǔ)言任務(wù)構(gòu)建這類代理的成本。同樣,我們預(yù)見視覺模型也會(huì)涌現(xiàn)出新能力,例如除文本提示外,還能遵循由各種視覺提示(如用戶上傳的圖像、人工點(diǎn)擊、草圖和掩碼)組成的指令。這種強(qiáng)大的零樣本視覺任務(wù)組合能力可以顯著降低構(gòu)建AI代理的成本。
圖1.1:語(yǔ)言和視覺/多模態(tài)基礎(chǔ)模型發(fā)展軌跡的示意圖
Figure 1.1: Illustration of foundation model development trajectory for language and vision/multi-modality. Among the four categories, the first category is the task-specific model, and the last three categories belong to foundation models, where these foundation models for language and vision are grouped in green and blue blocks, respectively. Some prominent properties of models in each category are highlighted. By comparing the models between language and vision, we are foreseeing that the transition of multimodal foundation models follows a similar trend: from the pre-trained model for specific purpose, to unified models and general-purpose assistants. However, research exploration is needed to figure out the best recipe, which is indicated as the question mark in the figure, as multimodal GPT-4 and Gemini stay private.
圖1.1:語(yǔ)言和視覺/多模態(tài)的基礎(chǔ)模型發(fā)展軌跡示意圖。在四個(gè)類別中,第一類別是特定于任務(wù)的模型,而最后三個(gè)類別屬于基礎(chǔ)模型,其中語(yǔ)言和視覺的基礎(chǔ)模型分別以綠色和藍(lán)色方塊分組。突出顯示了每個(gè)類別中模型的一些顯著特性。通過比較語(yǔ)言和視覺之間的模型,我們預(yù)見多模態(tài)基礎(chǔ)模型的轉(zhuǎn)變會(huì)遵循類似的趨勢(shì):從針對(duì)特定目的的預(yù)訓(xùn)練模型到統(tǒng)一模型和通用助手。然而,因?yàn)槎嗄B(tài)GPT-4和Gemini仍然保密,仍需要進(jìn)行研究探索以找出最佳的方法,這在圖中用問號(hào)表示。
In this paper, we limit the scope of multimodal foundation models to the vision and vision-language domains. Recent survey papers on related topics include (i) image understanding models such as self-supervised learning (Jaiswal et al., 2020; Jing and Tian, 2020; Ozbulak et al., 2023), segment anything (SAM) (Zhang et al., 2023a,c), (ii) image generation models (Zhang et al., 2023b; Zhou and Shimada, 2023), and (iii) vision-language pre-training (VLP). Existing VLP survey papers cover VLP methods for task-specific VL problems before the era of pre-training, image-text tasks, core vision tasks, and/or video-text tasks (Zhang et al., 2020; Du et al., 2022; Li et al., 2022c; Ruan and Jin, 2022; Chen et al., 2022a; Gan et al., 2022; Zhang et al., 2023g). Two recent survey papers cover the integration of vision models with LLM (Awais et al., 2023; Yin et al., 2022).
在本文中,我們將多模態(tài)基礎(chǔ)模型的范圍限定在視覺和視覺語(yǔ)言領(lǐng)域。最近關(guān)于相關(guān)主題的調(diào)查論文包括
(i)圖像理解模型,如自監(jiān)督學(xué)習(xí)(Jaiswal等,2020;Jing和Tian,2020;Ozbulak等,2023)、任何物體分割(SAM)(Zhang等,2023a,c)、
(ii)圖像生成模型(Zhang等,2023b;Zhou和Shimada,2023),以及
(iii)視覺語(yǔ)言預(yù)訓(xùn)練(VLP)?,F(xiàn)有的VLP調(diào)查論文涵蓋了面向預(yù)訓(xùn)練時(shí)代之前的特定任務(wù)VL問題、圖像-文本任務(wù)、核心視覺任務(wù)和/或視頻-文本任務(wù)的VLP方法(Zhang等,2020;Du等,2022;Li等,2022c;Ruan和Jin,2022;Chen等,2022a;Gan等,2022;Zhang等,2023g)。最近有兩篇調(diào)查論文涵蓋了將視覺模型與LLM集成的問題(Awais等,2023;Yin等,2022)。
Among them, Gan et al. (2022) is a survey on VLP that covers the CVPR tutorial series on Recent Advances in Vision-and-Language Research in 2022 and before. This paper summarizes the CVPR tutorial on Recent Advances in Vision Foundation Models in 2023. Different from the aforementioned survey papers that focus on literature review of a given research topic, this paper presents our perspectives on the role transition of multimodal foundation models from specialists to general-purpose visual assistants, in the era of large language models. The contributions of this survey paper are summarized as follows.
>> We provide a comprehensive and timely survey on modern multimodal foundation models, not only covering well-established models for visual representation learning and image generation, but also summarizing emerging topics for the past 6 months inspired by LLMs, including unified vision models, training and chaining with LLMs.
>> The paper is positioned to provide the audiences with the perspective to advocate a transition in developing multimodal foundation models. On top of great modeling successes for specific vision problems, we are moving towards building general-purpose assistants that can follow human intents to complete a wide range of computer vision tasks in the wild. We provide in-depth discussions on these advanced topics, demonstrating the potential of developing general-purpose visual assistants.
其中,Gan等(2022)是一份關(guān)于VLP的調(diào)查,涵蓋了2022年及以前的CVPR《視覺與語(yǔ)言研究最新進(jìn)展》系列教程。本文則總結(jié)了2023年的CVPR《視覺基礎(chǔ)模型最新進(jìn)展》教程。與上述側(cè)重于特定研究主題文獻(xiàn)綜述的調(diào)查論文不同,本文闡述了在大型語(yǔ)言模型時(shí)代,我們對(duì)多模態(tài)基礎(chǔ)模型從專用模型向通用視覺助手角色轉(zhuǎn)變的看法。本調(diào)查論文的貢獻(xiàn)總結(jié)如下。
>> 我們對(duì)現(xiàn)代多模態(tài)基礎(chǔ)模型進(jìn)行了全面和及時(shí)的調(diào)查,不僅涵蓋了視覺表示學(xué)習(xí)和圖像生成的成熟模型,還總結(jié)了受LLMs啟發(fā)的過去6個(gè)月出現(xiàn)的新興主題,包括統(tǒng)一視覺模型、與LLMs的訓(xùn)練和鏈接。
>>本文旨在為讀者提供一個(gè)視角,倡導(dǎo)多模態(tài)基礎(chǔ)模型發(fā)展方向上的轉(zhuǎn)型。在對(duì)特定視覺問題建模取得巨大成功的基礎(chǔ)上,我們正朝著構(gòu)建通用助手的方向發(fā)展,它可以跟隨人類的意圖,在真實(shí)開放場(chǎng)景中完成廣泛的計(jì)算機(jī)視覺任務(wù)。我們對(duì)這些高級(jí)主題進(jìn)行了深入的討論,展示了開發(fā)通用視覺助手的潛力。
1.1、What are Multimodal Foundation Models?什么是多模態(tài)基礎(chǔ)模型?
兩大技術(shù)背景=遷移學(xué)習(xí)(成為可能)+規(guī)模定律(變強(qiáng)大),2018年底的BERT標(biāo)志著基礎(chǔ)模型時(shí)代的開始
As elucidated in the Stanford foundation model paper (Bommasani et al., 2021), AI has been undergoing a paradigm shift with the rise of models (e.g., BERT, GPT family, CLIP (Radford et al., 2021) and DALL-E (Ramesh et al., 2021a)) trained on broad data that can be adapted to a wide range of downstream tasks. They call these models foundation models to underscore their critically central yet incomplete character: homogenization of the methodologies across research communities and emergence of new capabilities. From a technical perspective, it is transfer learning that makes foundation models possible, and it is scale that makes them powerful. The emergence of foundation models has been predominantly observed in the NLP domain, with examples ranging from BERT to ChatGPT. This trend has gained traction in recent years, extending to computer vision and other fields. In NLP, the introduction of BERT in late 2018 is considered as the inception of the foundation model era. The remarkable success of BERT rapidly stimulates interest in self-supervised learning in the computer vision community, giving rise to models such as SimCLR (Chen et al., 2020a), MoCo (He et al., 2020), BEiT (Bao et al., 2022), and MAE (He et al., 2022a). During the same time period, the success of pre-training also significantly promotes the vision-and-language multimodal field to an unprecedented level of attention.
正如斯坦?;A(chǔ)模型論文(Bommasani等,2021)所闡明的那樣,隨著模型的崛起(例如BERT、GPT家族、CLIP(Radford等,2021)和DALL-E(Ramesh等,2021a)),人工智能正在經(jīng)歷一場(chǎng)范式轉(zhuǎn)變,這些模型是在廣泛數(shù)據(jù)上訓(xùn)練的,可以適應(yīng)各種下游任務(wù)。他們將這些模型稱為基礎(chǔ)模型,以強(qiáng)調(diào)其關(guān)鍵的核心但不完整的特征:跨研究社區(qū)的方法同質(zhì)化和新能力的出現(xiàn)。
從技術(shù)角度來看,正是遷移學(xué)習(xí)使得基礎(chǔ)模型成為可能,規(guī)模(定律)使得它們變得強(qiáng)大?;A(chǔ)模型的出現(xiàn)主要在NLP領(lǐng)域觀察到,從BERT到ChatGPT都有例證。這一趨勢(shì)近年來逐漸受到重視,擴(kuò)展到計(jì)算機(jī)視覺和其他領(lǐng)域。在NLP領(lǐng)域,2018年底引入BERT被視為基礎(chǔ)模型時(shí)代的開始。BERT的顯著成功迅速激發(fā)了計(jì)算機(jī)視覺社區(qū)對(duì)自監(jiān)督學(xué)習(xí)的興趣,催生了模型如SimCLR(Chen等,2020a)、MoCo(He等,2020)、BEiT(Bao等,2022)和MAE(He等,2022a)的出現(xiàn)。在同一時(shí)期,預(yù)訓(xùn)練的成功也顯著地將視覺和語(yǔ)言多模態(tài)領(lǐng)域提升到前所未有的關(guān)注水平。
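為直觀說明上文“遷移學(xué)習(xí)使基礎(chǔ)模型成為可能、規(guī)模使其強(qiáng)大”這一論點(diǎn),下面補(bǔ)充一個(gè)極簡(jiǎn)的遷移學(xué)習(xí)(線性探測(cè))示意代碼(非原文內(nèi)容):凍結(jié)一個(gè)預(yù)訓(xùn)練視覺骨干,僅訓(xùn)練一個(gè)線性分類頭來適配下游任務(wù)。其中以 torchvision 的 ResNet-50 作為骨干、類別數(shù)等均為假設(shè)的示例設(shè)置,僅供理解概念。

```python
import torch
import torch.nn as nn
from torchvision import models

# 加載一個(gè)預(yù)訓(xùn)練的視覺骨干(此處以 torchvision 的 ResNet-50 為例,僅作示意)
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()          # 去掉原分類頭,只保留特征提取部分

for p in backbone.parameters():      # 凍結(jié)骨干參數(shù):遷移學(xué)習(xí)中的“線性探測(cè)”設(shè)置
    p.requires_grad = False

num_classes = 10                     # 假設(shè)的下游任務(wù)類別數(shù)
head = nn.Linear(2048, num_classes)  # 只訓(xùn)練這個(gè)線性分類頭

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """單步訓(xùn)練:骨干只做前向推理,梯度只更新線性頭。"""
    with torch.no_grad():
        feats = backbone(images)     # [B, 2048] 的通用視覺表征
    logits = head(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```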
圖1.2:本文旨在解決的三個(gè)多模態(tài)基礎(chǔ)模型代表性問題的示意圖
Figure 1.2: Illustration of three representative problems that multimodal foundation models aim to solve in this paper: visual understanding tasks, visual generation tasks, and general-purpose interface with language understanding and generation.
圖1.2:本文旨在解決的三個(gè)多模態(tài)基礎(chǔ)模型代表性問題的示意圖:視覺理解任務(wù)、視覺生成任務(wù)以及與語(yǔ)言理解和生成的通用接口。
In this paper, we focus on multimodal foundation models, which inherit all properties of foundation models discussed in the Stanford paper (Bommasani et al., 2021), but with an emphasis on models with the capability to deal with vision and vision-language modalities. Among the ever-growing literature, we categorize multimodal foundation models in Figure 1.2, based on their functionality and generality. For each category, we present exemplary models that demonstrate the primary capabilities inherent to these multimodal foundation models.
在本文中,我們專注于多模態(tài)基礎(chǔ)模型,它繼承了斯坦福論文中討論的基礎(chǔ)模型的所有屬性(Bommasani et al., 2021),但重點(diǎn)關(guān)注具有處理視覺和視覺語(yǔ)言模態(tài)能力的模型。在不斷增長(zhǎng)的文獻(xiàn)中,我們根據(jù)其功能和通用性對(duì)圖1.2中的多模態(tài)基礎(chǔ)模型進(jìn)行了分類。對(duì)于每個(gè)類別,我們提供了示例模型,展示了這些多模態(tài)基礎(chǔ)模型固有的主要能力。
(1)、Visual Understanding Models. 視覺理解模型:三級(jí)范圍(圖像級(jí)→區(qū)域級(jí)→像素級(jí)),按監(jiān)督信號(hào)分為三類方法(標(biāo)簽監(jiān)督【如ImageNet】/語(yǔ)言監(jiān)督【如CLIP和ALIGN】/僅圖像自監(jiān)督【如對(duì)比學(xué)習(xí)等】),外加多模態(tài)融合與區(qū)域級(jí)、像素級(jí)預(yù)訓(xùn)練【如CoCa/Flamingo/GLIP/SAM】
(Highlighted with orange in Figure 1.2) Learning general visual representations is essential to build vision foundation models, as pre-training a strong vision backbone is fundamental to all types of computer vision downstream tasks, ranging from image-level (e.g., image classification, retrieval, and captioning), region-level (e.g., detection and grounding) to pixel-level tasks (e.g., segmentation). We group the methods into three categories, depending on the types of supervision signals used to train the models.
>>Label supervision. Datasets like ImageNet (Krizhevsky et al., 2012) and ImageNet21K (Ridnik et al., 2021) have been popular for supervised learning, and larger-scale proprietary datasets are also used in industrial labs (Sun et al., 2017; Singh et al., 2022b; Zhai et al., 2022a).
>>Language supervision. Language is a richer form of supervision. Models like CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) are pre-trained using a contrastive loss over millions or even billions of noisy image-text pairs mined from the Web. These models enable zero-shot image classification, and enable traditional computer vision (CV) models to perform open-vocabulary CV tasks. We advocate the concept of computer vision in the wild, and encourage the development and evaluation of future foundation models for this.
>>Image-only self-supervision. This line of work aims to learn image representations from supervision signals mined from the images themselves, ranging from contrastive learning (Chen et al., 2020a; He et al., 2020), non-contrastive learning (Grill et al., 2020; Chen and He, 2021; Caron et al., 2021), to masked image modeling (Bao et al., 2022; He et al., 2022a).
>>Multimodal fusion, region-level and pixel-level pre-training. Besides the methods of pre-training image backbones, we will also discuss pre-training methods that allow multimodal fusion (e.g., CoCa (Yu et al., 2022a), Flamingo (Alayrac et al., 2022)), region-level and pixel-level image understanding, such as open-set object detection (e.g., GLIP (Li et al., 2022e)) and promptable segmentation (e.g., SAM (Kirillov et al., 2023)). These methods typically rely on a pre-trained image encoder or a pre-trained image-text encoder pair.
(在圖1.2中以橙色突出顯示)學(xué)習(xí)通用的視覺表示對(duì)于構(gòu)建視覺基礎(chǔ)模型至關(guān)重要,因?yàn)轭A(yù)訓(xùn)練一個(gè)強(qiáng)大的視覺骨干是所有類型的計(jì)算機(jī)視覺下游任務(wù)的基礎(chǔ),范圍從圖像級(jí)(例如,圖像分類、檢索和字幕生成)、區(qū)域級(jí)(例如,檢測(cè)和視覺定位grounding)到像素級(jí)任務(wù)(例如,分割)。根據(jù)用于訓(xùn)練模型的監(jiān)督信號(hào)的類型,我們將這些方法分為三類。
>>標(biāo)簽監(jiān)督。像ImageNet(Krizhevsky等,2012)和ImageNet21K(Ridnik等,2021)這樣的數(shù)據(jù)集一直以來都在監(jiān)督學(xué)習(xí)中很受歡迎,工業(yè)實(shí)驗(yàn)室也使用了規(guī)模更大的專有數(shù)據(jù)集(Sun等,2017;Singh等,2022b;Zhai等,2022a)。
>>語(yǔ)言監(jiān)督。語(yǔ)言是一種更豐富的監(jiān)督形式。像CLIP(Radford等,2021)和ALIGN(Jia等,2021)這樣的模型,是使用從網(wǎng)絡(luò)上挖掘的數(shù)百萬甚至數(shù)十億帶噪聲的圖像-文本對(duì),通過對(duì)比損失進(jìn)行預(yù)訓(xùn)練的(其核心思想可參見本列表后的示意代碼)。這些模型使零樣本圖像分類成為可能,并使傳統(tǒng)的計(jì)算機(jī)視覺(CV)模型能夠執(zhí)行開放詞匯的CV任務(wù)。我們倡導(dǎo)真實(shí)開放場(chǎng)景中的計(jì)算機(jī)視覺(computer vision in the wild)這一概念,并鼓勵(lì)為此開發(fā)和評(píng)估未來的基礎(chǔ)模型。
>>僅圖像自監(jiān)督。這一領(lǐng)域的工作旨在從圖像本身挖掘監(jiān)督信號(hào),學(xué)習(xí)圖像表示,包括對(duì)比學(xué)習(xí)(Chen等,2020a;He等,2020)、非對(duì)比學(xué)習(xí)(Grill等,2020;Chen和He,2021;Caron等,2021)以及圖像掩碼建模(Bao等,2022;He等,2022a)等。
>>多模態(tài)融合、區(qū)域級(jí)別和像素級(jí)預(yù)訓(xùn)練。除了預(yù)訓(xùn)練圖像主干的方法外,我們還將討論允許多模態(tài)融合(例如CoCa(Yu等,2022a)、Flamingo(Alayrac等,2022))以及區(qū)域級(jí)別和像素級(jí)別圖像理解的預(yù)訓(xùn)練方法,例如開放式目標(biāo)檢測(cè)(例如GLIP(Li等,2022e))和可提示分割(例如SAM(Kirillov等,2023))。這些方法通常依賴于預(yù)訓(xùn)練的圖像編碼器或預(yù)訓(xùn)練的圖像-文本編碼器對(duì)。
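下面給出一段極簡(jiǎn)的 PyTorch 草圖(非原文內(nèi)容),用于說明上文“語(yǔ)言監(jiān)督”中 CLIP/ALIGN 式對(duì)比預(yù)訓(xùn)練的核心思想:在一個(gè)批次內(nèi)把配對(duì)的圖文特征拉近、非配對(duì)的推遠(yuǎn)(對(duì)稱 InfoNCE 損失);預(yù)訓(xùn)練后,把類別名寫成文本提示即可做零樣本分類。其中特征維度、溫度系數(shù)以及隨機(jī)特征均為示意用的假設(shè)設(shè)置,真實(shí)特征應(yīng)來自圖像編碼器與文本編碼器。

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_feats, text_feats, temperature=0.07):
    """對(duì)稱 InfoNCE 對(duì)比損失:第 i 張圖與第 i 條文本互為正樣本,其余為負(fù)樣本。"""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature   # [B, B] 相似度矩陣
    targets = torch.arange(logits.size(0))                # 對(duì)角線為正樣本
    loss_i2t = F.cross_entropy(logits, targets)           # 圖→文方向
    loss_t2i = F.cross_entropy(logits.t(), targets)       # 文→圖方向
    return (loss_i2t + loss_t2i) / 2

def zero_shot_classify(image_feat, class_text_feats):
    """零樣本分類示意:把各類別名寫成文本提示并編碼,選相似度最高的類別。"""
    image_feat = F.normalize(image_feat, dim=-1)
    class_text_feats = F.normalize(class_text_feats, dim=-1)
    return (image_feat @ class_text_feats.t()).argmax(dim=-1)

# 用隨機(jī)特征演示形狀(真實(shí)場(chǎng)景中特征來自圖像編碼器與文本編碼器)
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_style_loss(img, txt))
```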
(2)、Visual Generation Models. 視覺生成模型:三大技術(shù)(向量量化VAE方法/基于擴(kuò)散的模型/自回歸模型),兩類(文本條件的視覺生成【文本到圖像生成模型{DALL-E/Stable Diffusion/Imagen}+文本到視頻生成模型{Imagen Video/Make-A-Video}】、人類對(duì)齊的視覺生成器)
(Highlighted with green in Figure 1.2) Recently, foundation image generation models have been built, due to the emergence of large-scale image-text data. The techniques that make it possible include the vector-quantized VAE methods (Razavi et al., 2019), diffusion-based models (Dhariwal and Nichol, 2021) and auto-regressive models.
>>Text-conditioned visual generation. This research area focuses on generating faithful visual content, including images, videos, and more, conditioned on open-ended text descriptions/prompts. Text-to-image generation develops generative models that synthesize images of high fidelity to follow the text prompt. Prominent examples include DALL-E (Ramesh et al., 2021a), DALL-E 2 (Ramesh et al., 2022), Stable Diffusion (Rombach et al., 2021; sta, 2022), Imagen (Saharia et al., 2022), and Parti (Yu et al., 2022b). Building on the success of text-to-image generation models, text-to-video generation models generate videos based on text prompts, such as Imagen Video (Ho et al., 2022) and Make-A-Video (Singer et al., 2022).
>>Human-aligned visual generator. This research area focuses on improving the pre-trained visual generator to better follow human intentions. Efforts have been made to address various challenges inherent to base visual generators. These include improving spatial controllability (Zhang and Agrawala, 2023; Yang et al., 2023b), ensuring better adherence to text prompts (Black et al., 2023), supporting flexible text-based editing (Brooks et al., 2023), and facilitating visual concept customization (Ruiz et al., 2023).
(在圖1.2中以綠色突出顯示)近年來,由于大規(guī)模圖像-文本數(shù)據(jù)的出現(xiàn),基礎(chǔ)圖像生成模型得以建立。使之成為可能的技術(shù)包括向量量化VAE方法(Razavi等,2019)、基于擴(kuò)散的模型(Dhariwal和Nichol,2021)以及自回歸模型(擴(kuò)散模型的逐步去噪采樣思路可參見下方列表后的示意代碼)。
>>文本條件的視覺生成。這一研究領(lǐng)域?qū)W⒂谏芍覍?shí)于開放性文本描述/提示的視覺內(nèi)容,包括圖像、視頻等。文本到圖像生成開發(fā)了生成模型,能夠根據(jù)文本提示合成高保真度的圖像。著名的示例包括DALL-E(Ramesh等,2021a)、DALL-E 2(Ramesh等,2022)、Stable Diffusion(Rombach等,2021;sta,2022)、Imagen(Saharia等,2022)和Parti(Yu等,2022b)。在文本到圖像生成模型取得成功的基礎(chǔ)上,文本到視頻生成模型基于文本提示生成視頻,如Imagen Video(Ho等,2022)和Make-A-Video(Singer等,2022)等。
>>人類對(duì)齊的視覺生成器。這一研究領(lǐng)域?qū)W⒂诟倪M(jìn)預(yù)訓(xùn)練的視覺生成器,使其更好地遵循人類意圖。已有工作致力于解決基礎(chǔ)視覺生成器固有的各種挑戰(zhàn),包括提高空間可控性(Zhang和Agrawala,2023;Yang等,2023b)、確保更好地遵循文本提示(Black等,2023)、支持靈活的基于文本的編輯(Brooks等,2023)以及促進(jìn)視覺概念定制化(Ruiz等,2023)等。
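作為對(duì)上文“基于擴(kuò)散的模型”的直觀補(bǔ)充,下面給出一個(gè)高度簡(jiǎn)化的 DDPM 式去噪采樣循環(huán)草圖(非原文內(nèi)容):從純?cè)肼暢霭l(fā),逐步減去網(wǎng)絡(luò)預(yù)測(cè)的噪聲得到圖像。其中噪聲預(yù)測(cè)網(wǎng)絡(luò) eps_model、條件 cond 以及噪聲日程參數(shù)均為假設(shè)的占位;Stable Diffusion 等真實(shí)系統(tǒng)還包含文本編碼器、潛空間 VAE 和無分類器引導(dǎo)等組件,此處一概從略。

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, cond, steps=1000, shape=(1, 3, 64, 64)):
    """極簡(jiǎn) DDPM 采樣:從純?cè)肼暢霭l(fā),逐步減去網(wǎng)絡(luò)預(yù)測(cè)的噪聲。
    eps_model(x_t, t, cond) 返回對(duì)噪聲的預(yù)測(cè),cond 為文本等條件(均為假設(shè)接口)。"""
    betas = torch.linspace(1e-4, 0.02, steps)            # 線性噪聲日程(示意取值)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                                # x_T ~ N(0, I)
    for t in reversed(range(steps)):
        eps = eps_model(x, torch.tensor([t]), cond)       # 預(yù)測(cè)當(dāng)前步噪聲
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])   # 去噪后的均值
        noise = torch.randn_like(x) if t > 0 else 0.0     # 最后一步不再加噪
        x = mean + torch.sqrt(betas[t]) * noise
    return x                                              # 近似樣本 x_0
```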
(3)、General-purpose Interface通用接口(有別于上述為特定目的而設(shè)計(jì)的模型,面向通用AI代理):三個(gè)研究主題
(3)、General-purpose Interface. (Highlighted with blue in Figure 1.2) The aforementioned multimodal foundation models are designed for specific purposes – tackling a specific set of CV problems/tasks. Recently, we see an emergence of general-purpose models that lay the basis of AI agents. Existing efforts focus on three research topics. The first topic aims to unify models for visual understanding and generation. These models are inspired by the unification spirit of LLMs in NLP, but do not explicitly leverage pre-trained LLM in modeling. In contrast, the other two topics embrace and involve LLMs in modeling, including training and chaining with LLMs, respectively.
(在圖1.2中以藍(lán)色突出顯示)上述多模態(tài)基礎(chǔ)模型是為特定目的而設(shè)計(jì)的,旨在解決一組特定的CV問題/任務(wù)。最近,我們看到了通用模型的出現(xiàn),為AI代理打下了基礎(chǔ)?,F(xiàn)有的努力集中在三個(gè)研究主題上。第一個(gè)主題旨在統(tǒng)一用于視覺理解和生成的模型。這些模型受到NLP中LLMs統(tǒng)一精神的啟發(fā),但沒有明確地在建模中利用預(yù)訓(xùn)練的LLM。相反,其他兩個(gè)主題在建模中涵蓋和涉及LLMs,包括與LLMs的訓(xùn)練和鏈接。
用于理解和生成的統(tǒng)一視覺模型:采用統(tǒng)一的模型體系結(jié)構(gòu)、將封閉集任務(wù)轉(zhuǎn)為開放集任務(wù)(如CLIP/GLIP/OpenSeg)、統(tǒng)一不同粒度級(jí)別的不同VL理解任務(wù)(I/O統(tǒng)一方法如UniTAB/Unified-IO/Pix2Seq-v2,功能統(tǒng)一方法如GPV/GLIP-v2/X-Decoder)、具互動(dòng)性和可提示性(如SAM/SEEM)
>>Unified vision models for understanding and generation. In computer vision, several attempts have been made to build a general-purpose foundation model by combining the functionalities of specific-purpose multimodal models. To this end, a unified model architecture is adopted for various downstream computer vision and vision-language (VL) tasks. There are different levels of unification. First, a prevalent effort is to bridge vision and language by converting all closed-set vision tasks to open-set ones, such as CLIP (Radford et al., 2021), GLIP (Li et al., 2022f), OpenSeg (Ghiasi et al., 2022a), etc. Second, the unification of different VL understanding tasks across different granularity levels is also actively explored, such as I/O unification methods like UniTAB (Yang et al., 2021), Unified-IO (Lu et al., 2022a), Pix2Seq-v2 (Chen et al., 2022d) and functional unification methods like GPV (Gupta et al., 2022a), GLIP-v2 (Zhang et al., 2022b) and X-Decoder (Zou et al., 2023a). In the end, it is also necessitated to make the models more interactive and promptable like ChatGPT, and this has been recently studied in SAM (Kirillov et al., 2023) and SEEM (Zou et al., 2023b).
>>用于理解和生成的統(tǒng)一視覺模型。在計(jì)算機(jī)視覺領(lǐng)域,已經(jīng)有若干嘗試通過結(jié)合特定用途多模態(tài)模型的功能來構(gòu)建通用基礎(chǔ)模型。為此,對(duì)各種下游計(jì)算機(jī)視覺和視覺語(yǔ)言(VL)任務(wù)采用統(tǒng)一的模型體系結(jié)構(gòu)。統(tǒng)一有不同的層次。首先,一個(gè)普遍的努力是通過將所有封閉集視覺任務(wù)轉(zhuǎn)化為開放集視覺任務(wù)來架起視覺和語(yǔ)言的橋梁,例如CLIP(Radford等,2021)、GLIP(Li等,2022f)、OpenSeg(Ghiasi等,2022a)等。其次,還積極探索了不同粒度級(jí)別的不同VL理解任務(wù)的統(tǒng)一,如I/O統(tǒng)一方法(如UniTAB(Yang等,2021)、Unified-IO(Lu等,2022a)、Pix2Seq-v2(Chen等,2022d))和功能統(tǒng)一方法(如GPV(Gupta等,2022a)、GLIP-v2(Zhang等,2022b)和X-Decoder(Zou等,2023a))。最后,還需要使模型像ChatGPT一樣更具互動(dòng)性和可提示性,最近SAM(Kirillov等,2023)和SEEM(Zou等,2023b)對(duì)此進(jìn)行了研究。
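為說明 SAM/SEEM 這類“可交互、可提示”模型在接口設(shè)計(jì)上的共同思路——圖像只做一次重型編碼,點(diǎn)、框、文本等不同提示復(fù)用同一個(gè)輕量解碼器——下面給出一個(gè)假設(shè)性的接口草圖(非任何真實(shí)庫(kù)的 API,類名與方法名均為虛構(gòu),僅作概念示意)。

```python
from dataclasses import dataclass
from typing import Optional, Tuple, List
import numpy as np

@dataclass
class Prompt:
    """統(tǒng)一的提示表示:可以是點(diǎn)擊點(diǎn)、框或文本短語(yǔ)(假設(shè)性定義)。"""
    points: Optional[List[Tuple[int, int]]] = None   # 前景點(diǎn)擊坐標(biāo)
    box: Optional[Tuple[int, int, int, int]] = None  # 邊界框 (x0, y0, x1, y1)
    text: Optional[str] = None                       # 開放詞匯文本,如 "the red car"

class PromptableSegmenter:
    """SAM/SEEM 式可提示分割的接口示意:重計(jì)算的圖像編碼只做一次,
    之后每個(gè)提示只需經(jīng)過輕量的提示編碼器和掩碼解碼器。"""

    def __init__(self, image_encoder, prompt_encoder, mask_decoder):
        self.image_encoder = image_encoder
        self.prompt_encoder = prompt_encoder
        self.mask_decoder = mask_decoder
        self._image_embedding = None

    def set_image(self, image: np.ndarray) -> None:
        # 一次性編碼圖像,后續(xù)所有交互復(fù)用該特征
        self._image_embedding = self.image_encoder(image)

    def predict(self, prompt: Prompt) -> np.ndarray:
        # 不同模態(tài)的提示被編碼到同一查詢空間,再由解碼器輸出掩碼
        queries = self.prompt_encoder(prompt)
        return self.mask_decoder(self._image_embedding, queries)
```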
與LLMs的訓(xùn)練(遵循指令+能力擴(kuò)展到多模態(tài)+端到端的訓(xùn)練):如Flamingo和Multimodal GPT-4
>>Training with LLMs. Similar to the behavior of LLMs, which can address a language task by following the instruction and processing examples of the task in their text prompt, it is desirable to develop a visual and text interface to steer the model towards solving a multimodal task. By extending the capability of LLMs to multimodal settings and training the model end-to-end, multimodal LLMs or large multimodal models are developed, including Flamingo (Alayrac et al., 2022) and Multimodal GPT-4 (OpenAI, 2023a).
>>與LLMs的訓(xùn)練。與LLMs的行為類似,LLMs可以通過遵循指令和處理任務(wù)的文本提示中的示例來解決語(yǔ)言任務(wù),因此有必要開發(fā)一個(gè)視覺和文本接口,以引導(dǎo)模型解決多模態(tài)任務(wù)。通過將LLMs的能力擴(kuò)展到多模態(tài)設(shè)置并進(jìn)行端到端的訓(xùn)練,開發(fā)了多模態(tài)LLMs或大型多模態(tài)模型,包括Flamingo(Alayrac等,2022)和Multimodal GPT-4(OpenAI,2023a)等。
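下面用一個(gè)極簡(jiǎn)的 PyTorch 模塊草圖(非原文內(nèi)容,也不是任一具體模型的官方實(shí)現(xiàn))示意這類端到端訓(xùn)練的大型多模態(tài)模型的常見做法:視覺編碼器提取圖像特征,經(jīng)可訓(xùn)練的投影層映射成若干“視覺 token”,與文本 token 嵌入拼接后送入 LLM,再按語(yǔ)言建模目標(biāo)做指令微調(diào)。其中 vision_encoder、language_model 及其 embed_tokens / inputs_embeds 接口均為假設(shè)的占位。

```python
import torch
import torch.nn as nn

class MiniMultimodalLM(nn.Module):
    """極簡(jiǎn)示意:視覺特征 → 線性投影 → 拼接到文本嵌入之前 → 送入語(yǔ)言模型。
    vision_encoder 與 language_model 均為假設(shè)的占位模塊。"""

    def __init__(self, vision_encoder, language_model, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder          # 例如 CLIP ViT,假設(shè)輸出 [B, N, vis_dim]
        self.language_model = language_model          # 任意 decoder-only LLM(假設(shè)接口)
        self.projector = nn.Linear(vis_dim, llm_dim)  # 視覺特征到LLM嵌入空間的橋接層

    def forward(self, images, input_ids):
        vis_feats = self.vision_encoder(images)                    # [B, N, vis_dim]
        vis_tokens = self.projector(vis_feats)                     # [B, N, llm_dim]
        txt_embeds = self.language_model.embed_tokens(input_ids)   # [B, T, llm_dim](假設(shè)接口)
        inputs = torch.cat([vis_tokens, txt_embeds], dim=1)        # 圖像token在前,文本token在后
        return self.language_model(inputs_embeds=inputs)           # 之后按next-token預(yù)測(cè)計(jì)算損失
```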
與LLMs鏈接的工具(集成LLMs和多模態(tài)的基礎(chǔ)模型):如Visual ChatGPT/MM-REACT
>>Chaining tools with LLM. Exploiting the tool use capabilities of LLMs, an increasing number of studies integrate LLMs such as ChatGPT with various multimodal foundation models to facilitate image understanding and generation through a conversation interface. This interdisciplinary approach combines the strengths of NLP and computer vision, enabling researchers to develop more robust and versatile AI systems that are capable of processing visual information and generating human-like responses via human-computer conversations. Representative works include Visual ChatGPT (Wu et al., 2023a) and MM-REACT (Yang* et al., 2023).
>>與LLMs鏈接的工具。利用LLMs的工具使用能力,越來越多的研究將LLMs(如ChatGPT)與各種多模態(tài)基礎(chǔ)模型集成起來,通過對(duì)話接口促進(jìn)圖像理解和生成。這種跨學(xué)科方法結(jié)合了NLP和計(jì)算機(jī)視覺的優(yōu)勢(shì),使研究人員能夠開發(fā)出更強(qiáng)大、更通用的人工智能系統(tǒng),能夠處理視覺信息,并通過人機(jī)對(duì)話產(chǎn)生類似人類的反應(yīng)。代表性作品包括Visual ChatGPT(Wu等,2023a)和MM-REACT(Yang*等,2023)。
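下面給出一個(gè) MM-REACT 式“LLM 調(diào)度多模態(tài)工具”的極簡(jiǎn)循環(huán)草圖(非原文內(nèi)容,亦非上述任一系統(tǒng)的真實(shí)實(shí)現(xiàn)):LLM 根據(jù)對(duì)話歷史決定是直接作答,還是按約定格式調(diào)用某個(gè)視覺工具,并把工具返回的文本觀察結(jié)果追加進(jìn)上下文后繼續(xù)推理。其中 call_llm 與工具注冊(cè)表均為假設(shè)的占位接口。

```python
# 假設(shè)的工具注冊(cè)表:鍵為工具名,值為接收字符串參數(shù)并返回文本觀察結(jié)果的函數(shù)
TOOLS = {
    "image_caption": lambda image_path: "a dog playing on the beach",  # 占位實(shí)現(xiàn)
    "ocr":           lambda image_path: "SALE 50% OFF",                # 占位實(shí)現(xiàn)
}

def call_llm(messages):
    """假設(shè)的LLM接口:輸入對(duì)話歷史,輸出下一步動(dòng)作文本。
    約定格式:'ACTION: 工具名 | 參數(shù)' 表示調(diào)用工具,'FINAL: 回答' 表示給出最終答案。"""
    raise NotImplementedError("此處應(yīng)接入真實(shí)的LLM API,本草圖僅示意控制流程")

def multimodal_agent(user_query, image_path, max_steps=5):
    messages = [f"User: {user_query} (image: {image_path})"]
    for _ in range(max_steps):
        step = call_llm(messages)
        if step.startswith("FINAL:"):                 # LLM認(rèn)為信息已足夠,直接作答
            return step[len("FINAL:"):].strip()
        if step.startswith("ACTION:"):                # LLM請(qǐng)求調(diào)用某個(gè)視覺專家工具
            name, arg = [s.strip() for s in step[len("ACTION:"):].split("|", 1)]
            observation = TOOLS[name](arg)            # 執(zhí)行工具,得到文本化的觀察結(jié)果
            messages.append(f"Observation from {name}: {observation}")
        else:
            messages.append(step)                     # 其他情況當(dāng)作中間推理記錄下來
    return "未能在限定步數(shù)內(nèi)得到答案"
```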
1.2、Definition and Transition from Specialists to General-Purpose Assistants定義和從專家到通用助手的過渡
兩類多模態(tài)基礎(chǔ)模型:特定目的的預(yù)訓(xùn)練視覺模型=視覺理解模型(CLIP/SimCLR/BEiT/SAM)+視覺生成模型(Stable Diffusion)、通用助手=統(tǒng)一架構(gòu)的通才+遵循人類指令
Based on the model development history and taxonomy in NLP, we group multimodal foundation models in Figure 1.2 into two categories.
>> Specific-Purpose Pre-trained Vision Models cover most existing multimodal foundation models, including visual understanding models (e.g., CLIP (Radford et al., 2021), SimCLR (Chen et al., 2020a), BEiT (Bao et al., 2022), SAM (Kirillov et al., 2023)) and visual generation models (e.g., Stable Diffusion (Rombach et al., 2021; sta, 2022)), as they present powerful transferable ability for specific vision problems.
>> General-Purpose Assistants refer to AI agents that can follow human intents to complete various computer vision tasks in the wild. The meanings of general-purpose assistants are two-fold: (i) generalists with unified architectures that could complete tasks across different problem types, and (ii) easy to follow human instruction, rather than replacing humans. To this end, several research topics have been actively explored, including unified vision modeling (Lu et al., 2022a; Zhang et al., 2022b; Zou et al., 2023a), training and chaining with LLMs (Liu et al., 2023c; Zhu et al., 2023a; Wu et al., 2023a; Yang* et al., 2023).
基于NLP中的模型發(fā)展歷史和分類,我們將多模態(tài)基礎(chǔ)模型分為兩類,如圖1.2所示。
>>特定目的的預(yù)訓(xùn)練視覺模型包括大多數(shù)現(xiàn)有的多模態(tài)基礎(chǔ)模型,包括視覺理解模型(例如CLIP(Radford等,2021)、SimCLR(Chen等,2020a)、BEiT(Bao等,2022)、SAM(Kirillov等,2023))和視覺生成模型(例如Stable Diffusion(Rombach等,2021;sta,2022)),因?yàn)樗鼈兙哂嗅槍?duì)特定視覺問題的強(qiáng)大的可轉(zhuǎn)移能力。
>>通用助手是指能夠跟隨人類意圖、在真實(shí)開放場(chǎng)景中完成各種計(jì)算機(jī)視覺任務(wù)的AI代理。通用助手的含義有兩層:
(i)具有統(tǒng)一架構(gòu)的通才,可以完成不同問題類型的任務(wù);
(ii)易于遵循人類指令,而不是取代人類。為此,已經(jīng)積極探索了幾個(gè)研究主題,包括統(tǒng)一的視覺建模(Lu等,2022a;Zhang等,2022b;Zou等,2023a)、與LLMs的訓(xùn)練和鏈接(Liu等,2023c;Zhu等,2023a;Wu等,2023a;Yang*等,2023)。
1.3、Who Should Read this Paper?誰應(yīng)該閱讀本文?
This paper is based on our CVPR 2023 tutorial, with researchers in the computer vision and vision-language multimodal communities as our primary target audience. It reviews the literature and explains topics to those who seek to learn the basics and recent advances in multimodal foundation models. The target audiences are graduate students, researchers and professionals who are not experts of multimodal foundation models but are eager to develop perspectives and learn the trends in the field. The structure of this paper is illustrated in Figure 1.3. It consists of 7 chapters.
本文基于我們的CVPR 2023教程,以計(jì)算機(jī)視覺和視覺語(yǔ)言多模態(tài)社區(qū)的研究人員為主要目標(biāo)受眾。它回顧了文獻(xiàn)并向那些尋求學(xué)習(xí)多模態(tài)基礎(chǔ)模型的基礎(chǔ)知識(shí)和最新進(jìn)展的人解釋了主題。目標(biāo)受眾是研究生,研究人員和專業(yè)人士,他們不是多模態(tài)基礎(chǔ)模型的專家,但渴望發(fā)展觀點(diǎn)和了解該領(lǐng)域的趨勢(shì)。本文的結(jié)構(gòu)如圖1.3所示。全文共分七章。
>> Chapter 1 introduces the landscape of multimodal foundation model research, and presents a historical view on the transition of research from specialists to general-purpose assistants.
>> Chapter 2 introduces different ways to consume visual data, with a focus on how to learn a strong image backbone.
>> Chapter 3 describes how to produce visual data that aligns with human intents.
>> Chapter 4 describes how to design unified vision models, with an interface that is interactive and promptable, especially when LLMs are not employed.
>> Chapter 5 describes how to train an LLM in an end-to-end manner to consume visual input for understanding and reasoning.
>> Chapter 6 describes how to chain multimodal tools with an LLM to enable new capabilities.
>> Chapter 7 concludes the paper and discusses research trends.
>>第1章介紹了多模態(tài)基礎(chǔ)模型研究的背景,并提供了從專家到通用助手的研究過渡的歷史視圖。
>>第2章介紹了使用視覺數(shù)據(jù)的不同方法,重點(diǎn)介紹了如何學(xué)習(xí)強(qiáng)大的圖像主干。
>>第3章描述了如何生成與人類意圖一致的視覺數(shù)據(jù)。
>>第4章描述了如何設(shè)計(jì)統(tǒng)一的視覺模型,具有交互和可提示的接口,特別是在不使用LLMs的情況下。
>>第5章描述了如何以端到端的方式訓(xùn)練LLMs以使用視覺輸入進(jìn)行理解和推理。
>>第6章描述了如何將多模態(tài)工具與LLMs鏈接以實(shí)現(xiàn)新的能力。
>>第7章全文進(jìn)行了總結(jié),并對(duì)研究趨勢(shì)進(jìn)行了展望。
Figure 1.3: An overview of the paper’s structure, detailing Chapters 2-6.
Relations among Chapters 2-6. 第2-6章之間的關(guān)系。
Chapter 2-6 are the core chapters of this survey paper. An overview of the structure for these chapters is provided in Figure 1.2. We start with a discussion of two typical multimodal foundation models for specific tasks, including visual understanding in Chapter 2 and visual generation in Chapter 3. As the notion of multimodal foundation models is originally based on visual backbone/representation learning for understanding tasks, we first present a comprehensive review to the transition of image backbone learning methods, evolving from early supervised methods to the recent language-image contrastive methods, and extend the discussion on image representations from image-level to region-level and pixel-level (Chapter 2). Recently, generative AI is becoming increasingly popular, where vision generative foundation models have been developed. In Chapter 3, we discuss large pre-trained text-to-image models, and various ways that the community leverage the generative foundation models to develop new techniques to make them better aligned with human intents. Inspired by the recent advances in NLP that LLMs serve as general-purpose assistants for a wide range of language tasks in daily life, the computer vision community has been anticipating and attempting to build general-purpose visual assistants. We discuss three different ways to build general-purpose assistants. Inspired by the spirit of LLMs, Chapter 4 focuses on unifying different vision models of understanding and generation without explicitly incorporating LLMs in modeling. In contrast, Chapter 5 and Chapter 6 focus on embracing LLMs to build general-purpose visual assistants, by explicitly augmenting LLMs in modeling. Specifically, Chapter 5 describes end-to-end training methods, and Chapter 6 focuses on training-free approaches that chain various vision models to LLMs.
第2-6章是本調(diào)查論文的核心章節(jié)。這些章節(jié)的結(jié)構(gòu)概述如圖1.2所示。我們首先討論針對(duì)特定任務(wù)的兩個(gè)典型的多模態(tài)基礎(chǔ)模型,包括第2章的視覺理解和第3章的視覺生成。由于多模態(tài)基礎(chǔ)模型的概念最初是基于用于理解任務(wù)的視覺骨干/表示學(xué)習(xí),因此我們首先對(duì)圖像骨干學(xué)習(xí)方法的轉(zhuǎn)變進(jìn)行了全面回顧,從早期的監(jiān)督方法到最近的語(yǔ)言-圖像對(duì)比方法,并將對(duì)圖像表示的討論從圖像級(jí)擴(kuò)展到區(qū)域級(jí)和像素級(jí)(第2章)。最近,生成式人工智能變得越來越流行,其中開發(fā)了視覺生成基礎(chǔ)模型。在第3章中,我們討論了大型預(yù)訓(xùn)練的文本到圖像模型,以及社區(qū)利用生成基礎(chǔ)模型開發(fā)新技術(shù)以使其更好地與人類意圖保持一致的各種方法。受到NLP領(lǐng)域最新進(jìn)展的啟發(fā),LLMs在日常生活中可以作為各種語(yǔ)言任務(wù)的通用助手,計(jì)算機(jī)視覺社區(qū)一直在期望并嘗試構(gòu)建通用視覺助手。我們討論了三種不同的構(gòu)建通用助手的方式。受LLMs的精神啟發(fā),第4章側(cè)重于統(tǒng)一不同的理解和生成視覺模型,而不在建模中明確地包含LLMs。相反,第5章和第6章側(cè)重于通過顯式地在建模中增加LLMs來構(gòu)建通用的視覺助手。具體來說,第5章描述了端到端訓(xùn)練方法,第6章側(cè)重于無需訓(xùn)練即可將各種視覺模型鏈接到LLMs的方法。
How to read the paper. 如何閱讀本文
Different readers have different backgrounds, and may have different purposes of reading this paper. Here, we provide some guidance.
>> Each chapter is mostly self-contained. If you have a clear goal and a clear research direction that you want to focus on, then just jump to the corresponding chapter. For example, if you are interested in building a mini prototype using OpenAI’s multimodal GPT-4, then you can directly jump to Chapter 5.
>> If you are a beginner of multimodal foundation models, and are interested in getting a glimpse of the cutting-edge research, we highly recommend that you read the whole paper chapter by chapter in order, as the early chapters serve as the building blocks of later chapters, and each chapter provides the description of the key concepts to help you understand the basic ideas, and a comprehensive literature review that helps you grasp the landscape and the state of the art.
>> If you already have rich experience in multimodal foundation models and are familiar with the literature, feel free to jump to specific chapters you want to read. In particular, we include in most chapters a section to discuss advanced topics and sometimes provide our own perspectives, based on the up-to-date literature. For example, in Chapter 6, we discuss several important aspects of multimodal agents in tool use, including tool creation and its connection to retrieval-augmented methods.
不同的讀者有不同的背景,可能閱讀本文的目的也不同。在這里,我們提供一些建議。
>> 每一章基本上都是獨(dú)立的。如果你有明確的目標(biāo)和想要關(guān)注的明確研究方向,那么就直接跳到相應(yīng)的章節(jié)。例如,如果你對(duì)使用OpenAI的多模態(tài)GPT-4構(gòu)建迷你原型感興趣,那么可以直接跳到第5章。
>> 如果您是多模態(tài)基礎(chǔ)模型的初學(xué)者,并且有興趣了解前沿研究的概貌,我們強(qiáng)烈建議您按順序逐章閱讀整篇文章,因?yàn)樵缙谡鹿?jié)是后續(xù)章節(jié)的基石,每章提供了關(guān)鍵概念的描述,幫助您理解基本思想,并提供了全面的文獻(xiàn)回顧,幫助您掌握領(lǐng)域的概貌和最新技術(shù)。
>> 如果您已經(jīng)在多模態(tài)基礎(chǔ)模型領(lǐng)域擁有豐富的經(jīng)驗(yàn),并且熟悉相關(guān)文獻(xiàn),可以隨意跳轉(zhuǎn)到您想要閱讀的特定章節(jié)。特別是在大多數(shù)章節(jié)中,我們都包含了討論高級(jí)主題的部分,有時(shí)還根據(jù)最新文獻(xiàn)提供了我們自己的觀點(diǎn)。例如,在第6章中,我們討論了多模態(tài)工具在工具使用中的幾個(gè)重要方面,包括工具的創(chuàng)建以及與檢索增強(qiáng)方法的關(guān)聯(lián)。
1.4、Related Materials: Slide Decks and Pre-recorded Talks相關(guān)資料:幻燈片和預(yù)錄演講
This survey paper extends what we present in the CVPR 2023 tutorial by covering the most recent advances in the field. Below, we provide a list of slide decks and pre-recorded talks, which are related to the topics in each chapter, for references.
Chapter 2: Visual and Vision-Language Pre-training (Youtube, Bilibili)
Chapter 3: Alignments in Text-to-Image Generation (Youtube, Bilibili)
Chapter 4: From Representation to Interface: The Evolution of Foundation for Vision Understanding (Youtube, Bilibili)
Chapter 5: Large Multimodal Models (Youtube, Bilibili)
Chapter 6: Multimodal Agents: Chaining Multimodal Experts with LLMs (Youtube, Bilibili)
本調(diào)查論文擴(kuò)展了我們?cè)贑VPR 2023教程中提出的內(nèi)容,涵蓋了該領(lǐng)域最新的進(jìn)展。以下是與每章主題相關(guān)的幻燈片和預(yù)錄演講的列表,供參考。
第2章:視覺和視覺語(yǔ)言預(yù)訓(xùn)練(Youtube,Bilibili)
第3章:文本到圖像生成中的對(duì)齊(Youtube,Bilibili)
第4章:從表示到接口:視覺理解基礎(chǔ)的演變(Youtube,Bilibili)
第5章:大型多模態(tài)模型(Youtube,Bilibili)
第6章:多模態(tài)智能體:將多模態(tài)專家與LLMs鏈接(Youtube,Bilibili)