iWISDM: Assessing instruction following in multimodal models at scale (2024)

Xiaoxuan Lei
Department of Physiology
McGill University; Mila, University of Montreal; Canada
xiaoxuan.lei@mail.mcgill.ca
Lucas Gomez*
Integrated Program in Neuroscience (IPN)
McGill University; Mila, University of Montreal; Canada
lucas.gomez@mail.mcgill.ca
Hao Yuan Bai*
School of Computer Science
McGill University; Mila, University of Montreal; Canada
hao.bai@mail.mcgill.ca
Pouya Bashivan
Department of Physiology
McGill University; Mila, University of Montreal; Canada
pouya.bashivan@mcgill.ca
*These authors contributed equally to this work.

Abstract

The ability to perform complex tasks from detailed instructions is key to many remarkable achievements of our species. As humans, we are not only capable of performing a wide variety of tasks but also very complex ones that may entail hundreds or thousands of steps to complete. Large language models and their more recent multimodal counterparts that integrate textual and visual inputs have achieved unprecedented success in performing complex tasks. Yet, most existing benchmarks are largely confined to single-modality inputs (either text or vision), narrowing the scope of multimodal assessments, particularly for instruction-following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment, engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct benchmarks of instruction-following visual tasks across varying complexity levels and evaluated several newly developed multimodal models on these benchmarks. Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models, and highlight a large gap between these models’ ability to precisely follow instructions and that of humans. The code of iWISDM is available on GitHub at https://github.com/BashivanLab/iWISDM.

1 Introduction

A typical day in most people’s lives involves hundreds or thousands of tasks, most of which are performed without explicit attention. Just between getting up and getting to work, one may have already performed 5-15 tasks (taking a shower, shaving, making coffee, getting dressed, etc.). Teaching artificial agents to perform similarly mundane tasks has proven to be an extremely difficult computational problem (Konar, 2018). The challenge becomes more apparent when one realizes that each of these seemingly mundane tasks, such as making coffee, involves tens of steps/actions (Figure 1). The challenge becomes even more significant once we consider more complex tasks such as operating a device or assembling a piece of furniture from its instruction manual. And yet, these tasks are performed proficiently by most individuals in most situations.

Large Language Models (LLMs) have become increasingly capable of comprehending natural language across wide topics and contexts, allowing them to hold meaningful conversations, give expert advice, and analyze data among other features (Brown etal., 2020; Ouyang etal., 2022; Radford etal., 2019). In the meantime, their multimodal counterparts are starting to emerge, signalling broader application of such models across industries. Large Multimodal Models (LMMs) are generally capable of receiving and responding in a range of possible modalities including visual, text, and audio (Alayrac etal., 2022; Liu etal., 2023b; Achiam etal., 2023). For example, the Gemini-Ultra model, which accepts text, image, audio, and video inputs, and responds with a combination of text and image outputs, recently achieved state-of-the-art on a range of single and multimodal benchmarks (Team etal., 2023).

[Figure 1]

However, existing benchmarks for assessing such models face several shortcomings: (1) Most multimodal benchmarks, like FLEURS (audio-based; Conneau etal., 2023) and VATEX (video-based; Wang etal., 2019), are still unimodal in their inputs and do not permit detailed assessment of models’ capacity to integrate information across modalities towards task goals. (2) Visual Question-Answering (VQA) datasets like VQAv2 (Goyal etal., 2017) and CLEVR (Johnson etal., 2017) assess reasoning with visual information in static images without addressing temporal information integration and sequential decision-making. (3) Open-ended learning environments such as XLand (Team etal., 2021), Crafter (Hafner, 2021), and Minecraft (Guss etal., 2019) have been utilized for training reinforcement learning agents, but it remains unclear whether and how they can be adapted to benchmark LMMs. (4) To the best of our knowledge, none of the existing benchmarks specifically assess models’ ability to precisely follow instructions in the context of decision-making tasks, an important measure of reliability and trustworthiness. Despite its importance, conducting such assessments has been particularly challenging because of the difficulty of collecting samples of multi-step tasks with ground-truth information. (5) More recent benchmarks such as MME (Fu etal., 2023), MMBench (Liu etal., 2023c), and MM-Vet (Yu etal., 2023) cover a wide range of cognitive tasks and adopt manually generated or GPT-powered responses. However, those benchmarks are difficult to scale, which makes them inconvenient when investigating the scaling properties of LMMs. In addition, benchmarks such as OwlEval (Ye etal., 2023) and LVLM-eHub (Xu etal., 2023) rely on subjective human responses that often show high variability across individuals.

To close this gap, we designed instructed-Virtual Visual Decision Making (iWISDM), a virtual environment that enables procedural generation of complex, multi-step decision making tasks that test an agent’s capacity to process visual information guided by natural language instructions. iWISDM builds on the compositional nature of natural behaviour (Barker, 1963) and the fact that complex tasks are often compositionally constructed by combining smaller task units in time. We thus developed a framework which allows instantiating visual decision-making tasks as computational graphs that could be combined logically and temporally to construct a virtually infinite number of tasks with varying complexity. The code of iWISDM is available on GitHub at https://github.com/BashivanLab/iWISDM.

Our main contributions are:

  1. We introduce iWISDM, a virtual environment for the procedural generation of limitless visual decision-making tasks accompanied by natural language instructions.

  2. We use iWISDM to construct three vision-language multimodal benchmarks with varying complexity levels to probe LMMs’ ability to follow natural language instructions.

  3. We test several recently developed LMMs as well as human subjects on these benchmarks and identify a notable shared weakness across existing LMMs, compared to humans, in their ability to precisely follow user instructions in the context of visual decision-making tasks.

2 Related Work

2.1 Large Multimodal Models

Continual advancements in pretrained large-scale multimodal models are driving progress in a wide array of downstream tasks. Given the significant computational expense associated with end-to-end pretraining, it is common to utilize readily available pretrained vision models alongside Large Language Models (LLMs) such as OPT (Zhang etal., 2022), FlanT5 (Chung etal., 2022), Vicuna (Chiang etal., 2023), and LLaMA (Touvron etal., 2023). Pioneering LMMs such as VisualGPT (Chen etal., 2022) and Frozen (Tsimpoukelli etal., 2021) have highlighted the advantages of leveraging pre-trained multimodal models. The primary challenge lies in achieving cross-modal alignment, given that LLMs typically lack exposure to images during their unimodal pretraining phase. LMM research is coalescing around a key strategy known as “visual instruction tuning”, which involves a two-phase training process: first, a vision-language alignment pretraining stage, and second, a visual instruction tuning stage.

A range of methods and models have been developed to enhance the capabilities of LMMs. Early approaches use a frozen object detector for visual feature extraction (Chen etal., 2020; Li etal., 2020; Zhang etal., 2021), while LiT (Zhai etal., 2022) borrowed a frozen pretrained image encoder from CLIP (Radford etal., 2021). More recently, Frozen (Tsimpoukelli etal., 2021) and Flamingo (Alayrac etal., 2022) have adopted an image-to-text generation approach, prompting the language model to generate text based on an input image. However, BLIP-2 (Li etal., 2023b) showed that this approach is not adequate for overcoming the modality gap and instead proposed a Querying Transformer (Q-Former) acting as a visual resampler, together with a two-stage bootstrapping pretraining method, which led to models that outperformed Flamingo80B (Alayrac etal., 2022) on zero-shot VQAv2 with fewer trainable parameters. The Q-Former architecture was also adopted by later works such as InstructBLIP (Dai etal., 2023) and Qwen-VL (Bai etal., 2023).

Moreover, GPT-4 (Achiam etal., 2023) has demonstrated remarkable proficiency in multi-modal dialogues with humans. Models like LLaVA (Liu etal., 2023b) and MiniGPT-4 (Zhu etal., 2023) have sought to emulate its performance by integrating a fully connected vision-language cross-modal connector, which significantly reduces the need for paired image-text data during pretraining. Both have shown notable proficiency in following natural-language instructions and in visual reasoning. Beyond image-based LMMs, there have been developments in models that specialize in video information processing. PaLM-E (Driess etal., 2023) incorporates continuous real-world sensor data into LMMs, facilitating the unification of real-world perception with human language. Video ChatCaptioner (Chen etal., 2023a) leverages ChatGPT’s conversational interface to enhance its understanding of video spatiotemporal contexts.

2.2 Multimodal benchmarks

The rapid progression of LMMs has driven the need for comprehensive benchmarks to assess their multifaceted capabilities. Traditionally, datasets primarily focus on computer vision tasks – classification, detection, segmentation, captioning, visual generation, and editing (Computer-Vision-in-the-Wild, 2024) – where instructions are implicitly integrated. However, these primarily unimodal datasets do not adequately test models’ proficiency in multimodal information alignment, revealing a limitation in our ability to fully evaluate LMMs.

To address this, the community has turned to human-annotated datasets like MS-COCO (Lin etal., 2014) and web-scraped collections such as YFCC-100M (Goyal etal., 2019) to forge better-aligned multimodal datasets. Enhanced by additional data cleansing methods and supported by CLIP-like models, datasets like Conceptual 3M/12M (Lai etal., 2023) and LAION-5B (Schuhmann etal., 2022) have improved not only in descriptive quality and text-image alignment but also in scale. Despite these advancements, a crucial capacity of LMMs, namely effectively following multimodal vision-language instructions, remains inadequately assessed by current benchmarks. To address this gap, several benchmarks have pivoted toward evaluating cognitive skills and systematic, quantitative assessments. Visual Question Answering (VQA) datasets like ScienceQA (Saikh etal., 2022) play a crucial role in examining LMMs’ multimodal reasoning capabilities. More recent benchmarks such as MME (Fu etal., 2023) and LLaVA-Bench (Liu etal., 2023a) utilize manually crafted questions and answers for precise evaluations, while platforms like MMBench (Liu etal., 2023c) adopt ChatGPT-driven techniques for data creation and response generation. Notably, ShareGPT4V (Chen etal., 2023b) introduces a dataset with high-quality captions, initially derived from GPT4-Vision and subsequently expanded, reflecting a trend towards more sophisticated and scalable evaluation frameworks.

The exploration into video-based datasets expands the assessment scope to include temporal integration, covering areas like video captioning, event segmentation, and action prediction. Datasets from video game environments (e.g., Minecraft (Guss etal., 2019), XLand (Team etal., 2021), and Crafter (Hafner, 2021)) and Video Question Answering tasks (e.g., YouCook2 (Zhou etal., 2018)) are pivotal in evaluating models’ strategic understanding and instructional adherence. Additionally, benchmarks like Seed-Bench-2 (Li etal., 2023a) and various perception tests pose further challenges for LMMs by testing their efficacy in navigating and interpreting complex, multimodal data streams. Yet, a gap persists in evaluating models’ precision in following instructions across sequential images, a gap that the iWISDM environment aims to bridge.

3 Methodology

We developed iWISDM to facilitate the generation of a diverse range of sequential visual reasoning tasks, which vary in complexity and require minimal user intervention. iWISDM encompasses a broad spectrum of tasks that engage executive functions such as inhibition of action, working memory, attentional set, task switching, and schema generalization. These functions are traditionally associated with the prefrontal cortex, a critical area for advanced cognitive processes in the brain (Fuster, 2015; Sun etal., 2023). Notably, the iWISDM task space is designed to accommodate classic working memory and decision-making tasks commonly employed in neuroscience and cognitive science research (Rigotti etal., 2013; Goldman-Rakic, 1992; Fuster, 2009). In the following sections, we begin by detailing the key functionalities of the iWISDM environment, followed by an in-depth description of its design and the implementation of its constituent components.

3.1 Design

Inspired by prior work (Yang etal., 2018), iWISDM generates tasks through a three-phase procedure: (1) task graph construction; (2) node initialization; (3) trial instantiation. Dividing the generation procedure into distinct phases eliminates the need to construct the task graph anew for each task trial. Once all properties associated with the nodes in a given task graph have been specified (phase 2), the user can generate any number of trials of that particular task, each with potentially different stimuli, ground-truth actions, and other inherently stochastic values such as the number of delay frames. In general, iWISDM creates tasks following:

\[ \mathbf{f},\ i,\ \mathbf{r} = \texttt{iWISDM}(G) \]

where G denotes the task graph, f the sequence of visual frames, i the corresponding language instruction, and r the sequence of ground-truth actions for each visual frame in f according to instruction i.

The task graph G can either be specified by the user or generated automatically via AutoTask. The major distinction between the two modes lies in the initial task graph construction phase (phase 1). In contrast to the user-specified mode, where the user must manually define the task graph, in AutoTask mode the user only needs to specify the parameters listed in the following section.
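The toy sketch below illustrates this contract. It is a schematic stand-in for iWISDM, not the library’s implementation: the helper function, file names, and instruction string are ours, and iWISDM’s actual trial generator (write_trial_instance, described below) has a different interface.

```python
import random

def generate_trial(task_graph):
    """Toy stand-in for f, i, r = iWISDM(G): returns a frame sequence, a
    natural-language instruction, and one ground-truth action per frame."""
    cats = [random.choice(["car", "plane"]) for _ in range(2)]   # sampled stimuli
    n_delay = random.randint(1, 2)                               # stochastic delay frames
    frames = [f"{cats[0]}_0.png"] + ["delay.png"] * n_delay + [f"{cats[1]}_1.png"]
    instruction = "is the category of object 1 the same as that of object 2?"
    actions = ["null"] * (len(frames) - 1) + [str(cats[0] == cats[1]).lower()]
    return frames, instruction, actions

# Any number of distinct trials can be instantiated from the same task graph.
f, i, r = generate_trial(task_graph="G")   # "G" stands in for a real task graph
```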

3.1.1 Task graph construction

In iWISDM, nodes and edges form directed, acyclic, connected task graphs. Each node represents a predefined operator that contributes to defining the task (each node in an iWISDM graph represents a task operator, so we use the terms ‘node’ and ‘operator’ interchangeably). Task operators take downstream stimuli/actions as input and output stimuli/actions based on their definitions. While some operators must have parent/upstream operators, root operators define the actions of a task. Root operators form minimal sub-graphs that define sub-tasks (e.g. Get operators are root operators that define the subtask: what is the attribute of an object?). Under user-defined connectivity rules, sub-graphs can be combined to generate corresponding compositional tasks. In our environment, the depth of the graph is measured by the longest path from the root operator to any other operator within the graph.

Each operator has a customizable set of rules that constrain its connections. The specific operators and their permissible connections are described below (see Figure 3 for a visualization):

  • Functional Operators:
    – Select: This operator defines stimuli based on three criteria: when, where, and what. “When” refers to the specific frame from which the stimuli originate. “Where” indicates the location of the stimuli within the frame. “What” depends on the particular dataset from which the stimuli are derived. For example, in our ShapeNet environment, the stimuli have three attributes: category (such as car versus plane), identity (which specific car), and view angle (the angle from which the stimulus is rendered). Select operators that have no downstream connections are the terminal nodes of the graph. Conversely, a Select operator’s potential downstream operators may include any Get* functional operator.
    – Switch: Based on the output action of a boolean task, this operator connects the logic to one of two possible paths. Its compulsory downstream connection must be a boolean operator, while its typical upstream connections are subtask graphs.
    – Get*: This group of operators is responsible for fetching specific properties of stimuli, such as category, location, or identity, exemplified by operators like GetCategory, GetLocation, and GetIdentity. Its direct downstream connection is always a Select operator and its upstream connections can be any boolean operator.
    – CONST: The simplest operator, CONST represents a fixed value. It is often used as a downstream connection for boolean operators that compare attributes.

  • Boolean Operators:
    – Exist: Paired with a specific property value (e.g., ‘Desk’), this operator asks whether an object with that property exists. It generates a boolean output and functions as an action generator or as a downstream connection for boolean operators.
    – And, IsSame, NotSame, Or: These boolean operators combine two downstream inputs to produce a boolean outcome. They are critical in constructing logical conditions within the task graph.

With these operators, iWISDM provides two modes for constructing task graphs: user-specified and automatic.

  • User-specified task graph construction. In this mode, users have the freedom to fully specify the task graph by manually creating an instance of a NetworkX directed graph. Doing so requires a manual definition of all nodes (operators) and edges (connections between operators) within a task graph, thereby establishing the desired task operation logic. Subsequently, the task graph can be fed to the generator function (write_trial_instance) to yield the corresponding task trials; a minimal sketch of such a graph is shown after this list. One potential application of this mode is to replicate trials from a specific task such as the n-back or contextual decision-making tasks. In our core build of iWISDM, we have incorporated a collection of classic tasks from the neuroscience and cognitive science literature that users can readily access.

  • AutoTask graph construction. In AutoTask mode, users can define a custom task space with a set of hyperparameters to procedurally generate tasks. A task space delineates the complexity and permissible operations pertinent to task construction. The available hyperparameters are: (1) the number of compositions, i.e. the maximum number of Switch operators used to compose subtasks; (2) each task graph’s maximum depth and maximum number of operators; and (3) the set of operators to sample from.
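As an illustration of the user-specified mode, the sketch below builds the task graph for “is the category of object 1 the same as the category of object 2?” as a NetworkX directed graph. The node attributes and the way write_trial_instance consumes the graph are assumptions on our part; only the operator names and edge conventions follow the descriptions above.

```python
import networkx as nx

# Hypothetical user-specified task graph; edges point from an operator to the
# downstream operators it takes as input, as described in this section.
G = nx.DiGraph()
G.add_nodes_from([
    ("IsSame", {"operator": "IsSame"}),
    ("GetCategory_1", {"operator": "GetCategory"}),
    ("GetCategory_2", {"operator": "GetCategory"}),
    ("Select_1", {"operator": "Select", "when": "frame 1"}),
    ("Select_2", {"operator": "Select", "when": "frame 2"}),
])
G.add_edges_from([
    ("IsSame", "GetCategory_1"), ("IsSame", "GetCategory_2"),
    ("GetCategory_1", "Select_1"), ("GetCategory_2", "Select_2"),
])
assert nx.is_directed_acyclic_graph(G)

# The graph can then be handed to the trial generator described above, e.g.
# frames, instruction, actions = write_trial_instance(G)   # signature assumed
```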

In addition to the above hyperparameters, in AutoTask mode, the allowed task structure is further constrained by the permitted connectivity for various operators (e.g. And operators must be followed by other boolean operators such as IsSame, NotSame, And, Or). A default operator connectivity is defined for all existing operators in our core build, but new connectivity rules could easily be added for any new user-defined operators. This is done through a Python dictionary, which details the allowed input and output operators for additional operators. By specifying these hyperparameters, iWISDM autonomously generates random runnable task graphs derived from the predefined task space. Each resultant task graph can then be used to generate specific trials.

To ensure each task graph complies with the connectivity rules between operators, we follow a backward initialization process during AutoTask (Figure A1). The task generation process starts from the root node and descends recursively. For each current node/operator n in the graph, its downstream operators C_n are randomly sampled based on the connectivity rules. As the graph depth approaches the specified maximum depth, only the permissible operators with the shortest possible subtask depth are sampled into C_n. For instance, in our core build, if n is the And operator, then only IsSame and NotSame are sampled, since they have shorter subgraph depths than And and Or. Through this procedure, iWISDM AutoTask facilitates sampling from diverse task spaces with varying degrees of complexity specified by the user. The utilization of the connectivity rule dictionary and the Switch operator guarantees that the generated tasks are logical and feasible, providing researchers with an extensive pool of tasks for investigation and exploration.
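The snippet below is a simplified, self-contained illustration of this constrained sampling scheme. The connectivity table is a reduced version of the rules described above, and the sampling logic is ours; iWISDM’s actual AutoTask implementation differs in detail.

```python
import random

# Reduced connectivity rules: each operator maps to the downstream operators
# it may connect to (a subset of iWISDM's defaults, for illustration only).
CONNECTIVITY = {
    "And": ["IsSame", "NotSame", "And", "Or"],
    "Or": ["IsSame", "NotSame", "And", "Or"],
    "IsSame": ["GetCategory", "GetLocation", "CONST"],
    "NotSame": ["GetCategory", "GetLocation", "CONST"],
    "GetCategory": ["Select"],
    "GetLocation": ["Select"],
    "CONST": [],
    "Select": [],
}
# Operators whose subgraphs terminate quickly (shortest subtask depth).
SHALLOW = {"IsSame", "NotSame", "GetCategory", "GetLocation", "CONST", "Select"}

def sample_subgraph(op, depth, max_depth, edges):
    """Recursively sample downstream operators for `op` under the rules above."""
    allowed = CONNECTIVITY[op]
    if not allowed:
        return
    # Near the depth limit, prefer operators with the shortest subtask depth.
    if depth >= max_depth - 1:
        allowed = [o for o in allowed if o in SHALLOW] or allowed
    n_children = 2 if op in {"And", "Or", "IsSame", "NotSame"} else 1
    for _ in range(n_children):
        child = random.choice(allowed)
        edges.append((op, child))
        sample_subgraph(child, depth + 1, max_depth, edges)

edges = []
sample_subgraph("And", depth=0, max_depth=4, edges=edges)
print(edges)  # a randomly sampled, rule-compliant task graph as an edge list
```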

Together, iWISDM’s two operating modes give users the flexibility to train or evaluate models either on specific tasks (e.g. classic tasks from the literature) or on a wide variety of procedurally generated tasks that adhere to the constraints stipulated by the chosen hyperparameters.

3.1.2 Node Initialization

The second step assigns values to each node within the task graph, thereby yielding logically coherent tasks. There are two critical challenges: the initialization of an independent task graph and the integration of multiple graphs in time.

To instantiate a task graph (Figure A1a), a backward recursive approach is used (Yang etal., 2018), similar to that of AutoTask task graph generation. During trial instantiation, given the expected stimuli/action output, an operator propagates stimulus-related properties or actions to its child operators in reverse topological order. The process starts from the root operator of the graph, which corresponds to the final task action. To guarantee a balanced action space, we uniformly sample from the pool of possible outputs (e.g. true/false, location values, and object categories) and assign one to each unassigned operator. We then iteratively go one layer down until reaching the Select leaf nodes, propagating the expected output to all nodes. At this stage, each task has only one output action.

The backward process is confined to the instantiation of individual task graphs and ensures logical consistency within each task. To ensure inter-graph logical consistency, we formulated a forward algorithm for generating complex tasks that require output actions at different frames, which we call temporal composition (see Section 3.2). During this process, distinct Select operators might assign conflicting attributes to the same stimulus. The forward process (Figure A1b) therefore also serves to resolve disparities between Select operators in the same frame. It identifies the earliest merging frame and resolves potential conflicts on a frame-by-frame basis, adjusting the properties of ensuing operators.

In summary, the node initialization process focuses on property assignment, involving the initialization of individual graphs followed by fusing them together. This process aims to yield logically coherent task instances, both within subtasks and in temporally merged tasks.
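A toy version of the backward value-assignment step is sketched below for the IsSame task graph used earlier in this section. It is illustrative only (the attribute names and data structures are ours, not iWISDM’s): the final action is sampled uniformly to keep the action space balanced, and constraints are then propagated down to the Select leaves.

```python
import random

CATEGORIES = ["car", "plane", "chair", "desk"]

def assign_values():
    """Toy backward value assignment for an IsSame-over-categories task graph."""
    values = {}
    # Sample the final (root) action uniformly to balance the action space.
    values["IsSame"] = random.choice([True, False])
    # Propagate constraints one layer down, toward the Select leaves.
    values["GetCategory_1"] = random.choice(CATEGORIES)
    if values["IsSame"]:
        values["GetCategory_2"] = values["GetCategory_1"]
    else:
        values["GetCategory_2"] = random.choice(
            [c for c in CATEGORIES if c != values["GetCategory_1"]])
    # Each Select leaf inherits the category constraint of its parent Get.
    values["Select_1"] = {"category": values["GetCategory_1"]}
    values["Select_2"] = {"category": values["GetCategory_2"]}
    return values

print(assign_values())
```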

[Figure 2]

3.1.3 Task trial instantiation

Regardless of the selected operation mode, for each task trial, iWISDM yields a frame sequence, an accompanying natural language instruction, and an action sequence. As delineated above, tasks are initially formulated as graphs that define the task logic. The task graph then serves as the basis for instantiating task trials. Distractors and fixation cues are also added during this step. Each task trial comprises the following distinct components (a schematic example of a resulting trial record follows this list):

  • Frame sequence. The frame sequence consists of an array of images. They display the visual information at each time step of the trial. Each frame is accompanied by a dictionary that contains the object properties within that frame. Images are stored as PNG files.

  • Natural language instructions. Each trial is accompanied by a natural language instruction that explains the task steps and decision criteria. Natural language instructions are generated concurrently during task instantiation: a partial string, which depends on the operator’s definition and initialization, is assigned to each operator in the task graph (see Figure A1 for an example). This approach allows iWISDM to automatically produce contextually relevant natural language instructions that describe each task in a human-readable format.

  • Action sequences. Each trial also includes an array of ground-truth actions, one per frame. These action sequences can be used for supervised training and for validating agents on the generated trials.

  • Distractors (optional). For users interested in making trials more attention-demanding, we offer a pre-implemented solution that adds distractors post hoc. By specifying parameters during trial generation, distractors can be added without conflicting with existing task rules. The major challenge lies in distinguishing between stimuli and distractors in the task instructions. This is achieved by describing each task-relevant stimulus, in the instruction, using attributes that are not otherwise used during task execution, and by ensuring that distractors do not share those attribute values. Based on the instructions, an ideal agent should be able to accurately identify the stimulus of interest. An example can be seen in Appendix Figure A5.
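To make the structure of a generated trial concrete, the sketch below shows the kind of record these components form. The field names and file names are illustrative, not necessarily the exact keys produced by iWISDM.

```python
# Illustrative shape of a single generated trial (field names are ours).
trial = {
    "frames": ["frame_0.png", "frame_1.png", "frame_2.png"],   # images stored as PNGs
    "frame_info": [                                            # per-frame object properties
        {"objects": [{"category": "car", "location": "top left"}]},
        {"objects": []},                                        # e.g. a delay frame
        {"objects": [{"category": "plane", "location": "bottom right"}]},
    ],
    "instruction": "observe object 1, delay, observe object 2: "
                   "is the category of object 1 the same as that of object 2?",
    "actions": ["null", "null", "false"],                       # one ground-truth action per frame
}
```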

[Figure 3]

3.2 Main Features

At its core, iWISDM is designed to be scalable and extensible. To do so, we adopted a modular framework in which task rules are constructed compositionally by combining functional and boolean operators. The combination of these operators gives rise to distinct tasks. Likewise, task spaces (i.e. collections of instantiable task graphs) are spanned by specifying the set of allowable operators and operator-operator connection rules (see section 3.1 for detailed definitions). We detail iWISDM’s main features below.

Compositionality. Real-world tasks are fundamentally compositional, as most tasks can be readily decomposed into sets of simpler subtasks that involve fewer sensory observations and cognitive operations. Two crucial facets of compositionality need to be considered: logical and temporal, both of which are common in daily human behaviour and allow individuals to efficiently handle complex tasks and adapt to dynamic environments. Cognitive processes involving task decomposition (i.e. breaking tasks into subtasks) and temporal combination of decision rules (i.e. combining the outcomes of subtasks) are fundamental to our ability to navigate the world around us.

  • Logical: An agent’s action can be viewed as a function of sensory observations, internalized world knowledge, prior actions, and its objectives (Ha & Schmidhuber, 2018). Yet, as outlined in the introduction, the decision-making process frequently decomposes into sub-decisions and information-processing steps that are temporally constrained and require fewer observations. We define logical compositionality as the way decisions can be combined hierarchically through boolean operators (i.e. And, Or, etc.) and functional operators (i.e. the Switch operator, which implements if…then… logic). As an example, consider the contextual decision-making task (ctxDM, see Figure 2b or Figure 3d). In the task in Figure 3d, the subject needs to first compare the categories of the objects in the first and third frames. If their categories are the same, the subject then compares the categories of the objects in the second and third frames; otherwise, the subject compares the categories of the objects in the second and fourth frames. Here, the rule of the second comparison is conditioned on the outcome of the first, forming a logical composition (a pseudo-code rendering of this rule is shown after this list). As the depth and the total number of operators involved in the task grow, we can compose more complex logical structures.

  • Temporal: Temporal compositionality is concerned with how different decision rules should be combined together to construct a complex task that extends in time. In real-world scenarios, individuals often face tasks that require multiple decisions to be made in sequence or in parallel. For instance, making coffee involves following decision rules to accomplish a sequence of tasks such as grinding the coffee beans, brewing, and pouring (Figure 2). While rule compositionality has been explored in several prior works (Lake & Baroni, 2023; Liška etal., 2018; Loula etal., 2018), the topic of temporal compositionality has attracted less attention in the field. This is potentially due to a lack of proper datasets or virtual environments that could enable such investigations. iWISDM is precisely engineered to fill this gap by enabling the generation of temporally compositional tasks made from combining simpler tasks in time (Figure 2).
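For concreteness, the ctxDM rule described in the first bullet can be written as the following pseudo-code; the function and the example categories are illustrative and not part of iWISDM itself.

```python
def ctxdm_rule(category):
    """Pseudo-code of the contextual decision-making rule described above.
    `category` maps 1-based frame indices to the observed object category."""
    if category[1] == category[3]:           # subtask 1: compare frames 1 and 3
        return category[2] == category[3]    # same -> compare frames 2 and 3
    return category[2] == category[4]        # different -> compare frames 2 and 4

# Hypothetical observations, one object category per frame.
print(ctxdm_rule({1: "car", 2: "plane", 3: "car", 4: "plane"}))  # -> False
```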

Vast Task Space. Another important feature of iWISDM is the vastness of its task space. This feature extends naturally from the compositionality inherent to iWISDM’s task generation procedure. The ability to produce a large number of distinct tasks is critically important for the robust evaluation of large multimodal models that are trained on increasing volumes of information from the web. Moreover, the vast number of instantiable tasks in iWISDM provides an opportunity for training or fine-tuning large multimodal models to improve their ability to follow instructions in a vision-language context.

Natural Language Task Instruction. Natural language provides a rich and convenient way of communicating complex information to biological or artificial agents. It has been shown that improvements in language understanding in LLMs directly enhance their generality (performing many tasks) and adaptation (zero-shot generalization) (Brown etal., 2020; Radford etal., 2019). Due to their capacity to compress enormous knowledge bases, these models have been useful in various applications where human supervision had traditionally been necessary (Shah etal., 2023; Dasgupta etal., 2023). Perhaps for similar reasons, natural language input constitutes the core of most existing multimodal models, which are heavily trained on large text corpora among other data. For this reason, in iWISDM, each task is accompanied by a simplified natural language instruction (see examples in A1). When completing complex tasks, the instruction communicates first the task structure in terms of upcoming observations, and then the task rules that determine the relationship between observations and actions.

Automatic Task Generation. In contrast to prior virtual environments tailored for cognitive or neuroscience investigations (Molano-Mazon etal., 2022), iWISDM can not only generate hand-crafted cognitive tasks such as classical decision-making tasks (e.g. contextual decision-making and n-back; Figure 3), but also allows procedural task generation from a pre-specified task space defined by a small set of hyperparameters (see Section 3.2 for details).

Customizability and Extensibility. We envision the future of iWISDM as a framework that will be continuously developed and expanded by the larger community of machine learning scientists, cognitive scientists, and neuroscientists. For this reason, we have designed iWISDM to be highly customizable and extendable in task operators, visual inputs, and stimulus properties (for detailed discussions, see Appendix A.1.1).

4 Evaluation

4.1 Models & Humans

As a preliminary evaluation, we test the capabilities of GPT-4V, Gemini-Pro-1.0, Claude-3, InternLM-XComposer2, and MMICL in solving iWISDM tasks. We compare their performance to human baselines and find a notable gap in the multi-image instruction-following task capabilities of existing LMMs.

We also collected responses from 6 human subjects who were asked to complete three sets of randomly selected trials (Figures A2, A3, A4), each set sampled from a different complexity level (150 trials in total). The task trials were displayed in a manner similar to how they were presented to the models: images were shown alongside the task’s text instruction, following a general task description.

4.2 Complexity

Using the AutoTask framework in iWISDM, we created three benchmarks corresponding to Low, Medium, and High complexity tasks. Low complexity tasks were restricted to contain exactly one logical joining operator (“and/or”), to exclude Switch operators, and to require only boolean actions. Medium complexity tasks had the same logical-joining-operator restriction and boolean-action requirement as Low complexity tasks, with the complexity increase coming from the addition of a Switch operator. Finally, High complexity tasks contain between one and two logical joining operators, include a Switch operator, and require both boolean and object-property action responses (e.g. ‘… category of object 1?’, Answer: ‘planes’). For further details refer to Figure A1. Additionally, since the benchmarks are generated from a synthetic stimulus dataset while all LMMs are trained with naturalistic stimuli, we aimed to confirm whether LMMs can recognize rendered ShapeNet objects. We therefore developed two sets of simple single-frame tasks: location-only and category-only existence tasks (e.g. Is object 1 in the bottom left?). Evaluation on these two single-frame task sets provides an upper bound on LMMs’ performance.
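The three benchmark configurations can be summarized as the following generation parameters. The parameter names are ours and are not iWISDM’s actual AutoTask arguments; they simply restate the restrictions described above.

```python
# Hypothetical summary of the three benchmark configurations (names are ours).
BENCHMARKS = {
    "low":    {"n_joining_ops": 1,      "use_switch": False,
               "response_types": ["boolean"]},
    "medium": {"n_joining_ops": 1,      "use_switch": True,
               "response_types": ["boolean"]},
    "high":   {"n_joining_ops": (1, 2), "use_switch": True,
               "response_types": ["boolean", "object_property"]},
}
```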

The set of tested models was limited due to the scarcity of applicable models. Many popular open-source LMMs, such as LLaVa-1.5, were unable to perform the task simply due to their limited image sequence lengths. A minimum image sequence length of ten is needed to complete all complexity levels. We were able to properly evaluate two open-source models, InternLM-XComposer2-7b and MMICL-Instructblip-T5-xl. For samples of prompts used to evaluate each model see Appendix A.3.

4.3 Results

Figure 4a shows the accuracy of actions taken by the models plotted against the complexity level for each type of prompt. GPT-4V generally achieves the best model scores, with the largest performance gap on the low and high complexity tasks. However, relative to human performance these gaps are marginal at best. Broadly, MMICL was the worst-performing model. The expected inverse correlation between complexity and action accuracy was only captured clearly by the GPT-4V and Gemini-Pro-1.0 results.

In contrast to the models, human subjects scored much more accurately, with scores ranging from 0.78 to 0.98 across complexities. This model-human gap in performance indicates a significant shortcoming of LMMs on multi-image instruction-following tasks. The shortcoming is unlikely to stem from insufficient feature understanding, as the high performance on single-frame category tasks (Figure 4b) contrasts with the weak category-task performance observed across complexities.

[Figure 4]

We also analyzed how each model performed on subsets of tasks involving single or multiple object properties. Figure 4b-e shows the average accuracy of all models conditioned on task properties and complexity level. We did not include the object-identity subgroup in the high complexity plot of Figure 4e, as non-boolean response types are not feasible for object-identity tasks. The performances on high complexity tasks show a clear ranking between the abilities of the GPT-4V and Gemini-Pro models on tasks with diverse response options. For almost all models and complexities, location-only tasks posed the most difficulty. This finding confirms previous analyses of GPT-4V, which found that it often struggles to correctly recognize an object’s position within an image (https://blog.roboflow.com/gpt-4v-object-detection; Majumdar etal., 2024).

To further investigate the response patterns of the LMMs, we performed additional analyses on a subset of the models, shown in Appendix Figures A7, A8, A9, and A10. To determine the effect of delay frames on model performance, a set of simple delayed-match-to-sample tasks was generated that differed in difficulty only in the number of delay frames. Appendix Figure A7 shows that for InternLM-XComposer2, MMICL, and GPT-4V, the addition of single or multiple delay frames has little effect on task performance. Next, we examined how the number of different boolean operators affected the performance of the open-source models (InternLM-XComposer2 & MMICL). The results in Appendix Figure A8a-d show that, across all boolean operators, an increase in their abundance within a task generally leads to worse model performance, as expected. In Appendix Figure A9 we examined how the number of stimuli affected task performance across complexities. We expected that as the number of stimuli increases, task performance would decrease. This was the case for Low complexity tasks (Appendix Figure A9a). However, Medium and High complexity tasks displayed an inverse trend (Appendix Figure A9b and c). Finally, we investigated the exact effect that different required response types had on accuracy for the High complexity benchmark. As Appendix Figure A10 shows, when tasks required a non-boolean word response, the models performed significantly worse.

5 Conclusion and future directions

We introduced iWISDM as a platform for validating multimodal models. As a benchmark, our primary focus is on assessing the ability of LMMs to follow instructions in visual-language decision-making tasks. We developed three benchmarks of incremental complexity and evaluated several LMMs alongside human performance. The gap between LMMs and humans indicates that LMMs still lack key abilities needed to solve instruction-following tasks. Through a detailed analysis of LMM behaviour patterns, we identified diminished spatial recognition ability and decreasing performance with increasing task complexity.

We believe iWISDM will be an important benchmark that complements existing benchmarks evaluating LMM capabilities in areas such as commonsense reasoning, numerical computation, or relational inference (Fu etal., 2023). Although the current version of iWISDM lacks the capacity to probe these functions, it could potentially cover some of these additional capabilities through new operators like Count (to tally specific objects) or Relative (to identify properties such as location in relation to other objects). We also believe iWISDM can serve as an important benchmark for evaluating continual learning algorithms. For detailed discussions, please refer to Appendix A.1.1.

Moreover, the currently used stimulus dataset is derived from publicly accessible sources (Chang etal., 2015), which may pose a risk of data leakage. To address this, we plan to introduce a variety of datasets that users can easily select from. Finally, the compositional nature of iWISDM tasks may also provide an avenue for exploring the failure modes of current LMMs by identifying their specific weaknesses and targeting those during training. We are looking to develop detailed evaluation criteria tailored to iWISDM and to establish an evaluation platform, along with a continuously updated leaderboard, for measuring and comparing the performance of models.

Acknowledgments

X.L. was supported by CAMBAM fellowship (2024) and Doctoral Excellence Scholarship, Union Neuroscience et Artificial Intelligence Quebec, UNIQUE (2021-2022). L.G. was supported by Masters Excellence Scholarship, Union Neuroscience et Artificial Intelligence Quebec, UNIQUE (2024). This research was supported by the Healthy-Brains-Healthy-Lives startup supplement grant and the NSERC Discovery grant RGPIN-2021-03035. P.B. was supported by FRQ-S Research Scholars Junior 1 grant 310924, and the William Dawson Scholar award. All analyses were executed using resources provided by the Digital Research Alliance of Canada (Compute Canada) and funding from Canada Foundation for Innovation project number 42730. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • Achiam etal. (2023)Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, etal.Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
  • Alayrac etal. (2022)Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, etal.Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  • Aljundi etal. (2019)Rahaf Aljundi, Klaas Kelchtermans, and Tinne Tuytelaars.Task-free continual learning.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11254–11263, 2019.
  • Bai etal. (2023)Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou.Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023.
  • Barker (1963)RogerG Barker.The stream of behavior: Explorations of its structure & content.1963.
  • Brown etal. (2020)Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, JaredD Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, etal.Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chang etal. (2015)AngelX Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, etal.Shapenet: An information-rich 3d model repository.arXiv preprint arXiv:1512.03012, 2015.
  • Chen etal. (2022)Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny.Visualgpt: Data-efficient adaptation of pretrained language models for image captioning.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18030–18040, 2022.
  • Chen etal. (2023a)Jun Chen, Deyao Zhu, Kilichbek Haydarov, Xiang Li, and Mohamed Elhoseiny.Video chatcaptioner: Towards the enriched spatiotemporal descriptions.arXiv preprint arXiv:2304.04227, 2023a.
  • Chen etal. (2023b)Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin.Sharegpt4v: Improving large multi-modal models with better captions.arXiv preprint arXiv:2311.12793, 2023b.
  • Chen etal. (2020)Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed ElKholy, Faisal Ahmed, Zhe Gan, YuCheng, and Jingjing Liu.Uniter: Universal image-text representation learning.In European conference on computer vision, pp. 104–120. Springer, 2020.
  • Chiang etal. (2023)Wei-Lin Chiang, Zhuohan Li, ZiLin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, JosephE Gonzalez, etal.Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna. lmsys. org (accessed 14 April 2023), 2023.
  • Chung etal. (2022)HyungWon Chung, LeHou, Shayne Longpre, Barret Zoph, YiTay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, etal.Scaling instruction-finetuned language models.arXiv preprint arXiv:2210.11416, 2022.
  • Computer-Vision-in-the-Wild (2024)Computer-Vision-in-the-Wild.Cvinw readings: A collection of papers on the topic of “computer vision in the wild (cvinw)”.https://github.com/Computer-Vision-in-the-Wild/CVinW_Readings, 2024.Accessed: 2024-02-16.
  • Conneau etal. (2023)Alexis Conneau, Min Ma, Simran Khanuja, YuZhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna.Fleurs: Few-shot learning evaluation of universal representations of speech.In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 798–805. IEEE, 2023.
  • Dai etal. (2023)WDai, JLi, DLi, AMH Tiong, JZhao, WWang, BLi, PFung, and SHoi.Instructblip: towards general-purpose vision-language models with instruction tuning. arxiv.Preprint posted online on June, 15:2023, 2023.
  • Dasgupta etal. (2023)Ishita Dasgupta, Christine Kaeser-Chen, Kenneth Marino, Arun Ahuja, Sheila Babayan, Felix Hill, and Rob Fergus.Collaborating with language models for embodied reasoning.arXiv preprint arXiv:2302.00763, 2023.
  • Deng (2012)LiDeng.The mnist database of handwritten digit images for machine learning research [best of the web].IEEE signal processing magazine, 29(6):141–142, 2012.
  • Driess etal. (2023)Danny Driess, Fei Xia, MehdiSM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, etal.Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023.
  • Fu etal. (2023)Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, XuLin, Jinrui Yang, Xiawu Zheng, KeLi, Xing Sun, etal.Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023.
  • Fuster (2015)Joaquin Fuster.The prefrontal cortex.Academic Press, 2015.
  • Fuster (2009)JoaquínM Fuster.Cortex and memory: emergence of a new paradigm.Journal of cognitive neuroscience, 21(11):2047–2072, 2009.
  • Goldman-Rakic (1992)PatriciaS Goldman-Rakic.Working memory and the mind.Scientific American, 267(3):110–117, 1992.
  • Goyal etal. (2019)Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra.Scaling and benchmarking self-supervised visual representation learning.In Proceedings of the ieee/cvf International Conference on computer vision, pp. 6391–6400, 2019.
  • Goyal etal. (2017)Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh.Making the v in vqa matter: Elevating the role of image understanding in visual question answering.In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913, 2017.
  • Guss etal. (2019)WilliamH Guss, Brandon Houghton, Nicholay Topin, Phillip Wang, Cayden Codel, Manuela Veloso, and Ruslan Salakhutdinov.Minerl: A large-scale dataset of minecraft demonstrations.arXiv preprint arXiv:1907.13440, 2019.
  • Ha & Schmidhuber (2018)David Ha and Jürgen Schmidhuber.World models.arXiv preprint arXiv:1803.10122, 2018.
  • Hafner (2021)Danijar Hafner.Benchmarking the spectrum of agent capabilities.arXiv preprint arXiv:2109.06780, 2021.
  • Johnson etal. (2017)Justin Johnson, Bharath Hariharan, Laurens Van DerMaaten, LiFei-Fei, CLawrenceZitnick, and Ross Girshick.Clevr: A diagnostic dataset for compositional language and elementary visual reasoning.In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2901–2910, 2017.
  • Konar (2018)Amit Konar.Artificial intelligence and soft computing: behavioral and cognitive modeling of the human brain.CRC press, 2018.
  • Lai etal. (2023)Zhengfeng Lai, Haotian Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, etal.From scarcity to efficiency: Improving clip training via visual-enriched captions.arXiv preprint arXiv:2310.07699, 2023.
  • Lake & Baroni (2023)BrendenM Lake and Marco Baroni.Human-like systematic generalization through a meta-learning neural network.Nature, 623(7985):115–121, 2023.
  • Li etal. (2023a)Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan.Seed-bench-2: Benchmarking multimodal large language models.arXiv preprint arXiv:2311.17092, 2023a.
  • Li etal. (2023b)Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi.Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.arXiv preprint arXiv:2301.12597, 2023b.
  • Li etal. (2020)Xiujun Li, XiYin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, LiDong, Furu Wei, etal.Oscar: Object-semantics aligned pre-training for vision-language tasks.In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pp. 121–137. Springer, 2020.
  • Lin etal. (2014)Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and CLawrence Zitnick.Microsoft coco: Common objects in context.In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer, 2014.
  • Liška etal. (2018)Adam Liška, Germán Kruszewski, and Marco Baroni.Memorize or generalize? searching for a compositional rnn in a haystack.arXiv preprint arXiv:1802.06467, 2018.
  • Liu etal. (2023a)Haotian Liu, Chunyuan Li, Yuheng Li, and YongJae Lee.Improved baselines with visual instruction tuning.arXiv preprint arXiv:2310.03744, 2023a.
  • Liu etal. (2023b)Haotian Liu, Chunyuan Li, Qingyang Wu, and YongJae Lee.Visual instruction tuning.arXiv preprint arXiv:2304.08485, 2023b.
  • Liu etal. (2023c)Yuan Liu, Haodong Duan, Yuanhan Zhang, BoLi, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, etal.Mmbench: Is your multi-modal model an all-around player?arXiv preprint arXiv:2307.06281, 2023c.
  • Lomonaco & Maltoni (2017)Vincenzo Lomonaco and Davide Maltoni.Core50: a new dataset and benchmark for continuous object recognition.In Conference on robot learning, pp. 17–26. PMLR, 2017.
  • Loula etal. (2018)Joao Loula, Marco Baroni, and BrendenM Lake.Rearranging the familiar: Testing compositional generalization in recurrent networks.arXiv preprint arXiv:1807.07545, 2018.
  • Majumdar etal. (2024)Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, etal.Openeqa: Embodied question answering in the era of foundation models.In 2nd Workshop on Mobile Manipulation and Embodied Intelligence at ICRA 2024, 2024.
  • Molano-Mazon etal. (2022)Manuel Molano-Mazon, Joao Barbosa, Jordi Pastor-Ciurana, Marta Fradera, Ru-Yuan Zhang, Jeremy Forest, Jorge del PozoLerida, LiJi-An, ChristopherJ Cueva, Jaime dela Rocha, etal.Neurogym: An open resource for developing and sharing neuroscience tasks.2022.
  • Ouyang etal. (2022)Long Ouyang, Jeffrey Wu, XuJiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, etal.Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Radford etal. (2019)Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, etal.Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019.
  • Radford etal. (2021)Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, etal.Learning transferable visual models from natural language supervision.In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
  • Rigotti etal. (2013)Mattia Rigotti, Omri Barak, MelissaR Warden, Xiao-Jing Wang, NathanielD Daw, EarlK Miller, and Stefano Fusi.The importance of mixed selectivity in complex cognitive tasks.Nature, 497(7451):585–590, 2013.
  • Saikh etal. (2022)Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya.Scienceqa: A novel resource for question answering on scholarly articles.International Journal on Digital Libraries, 23(3):289–301, 2022.
  • Schuhmann etal. (2022)Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, etal.Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • Shah etal. (2023)Dhruv Shah, Błażej Osiński, Sergey Levine, etal.Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action.In Conference on robot learning, pp. 492–504. PMLR, 2023.
  • Sun etal. (2023)Weinan Sun, Madhu Advani, Nelson Spruston, Andrew Saxe, and JamesE Fitzgerald.Organizing memories for generalization in complementary learning systems.Nature neuroscience, 26(8):1438–1448, 2023.
  • Team etal. (2023)Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, AndrewM Dai, Anja Hauth, etal.Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023.
  • Team etal. (2021)Open EndedLearning Team, Adam Stooke, Anuj Mahajan, Catarina Barros, Charlie Deck, Jakob Bauer, Jakub Sygnowski, Maja Trebacz, Max Jaderberg, Michael Mathieu, etal.Open-ended learning leads to generally capable agents.arXiv preprint arXiv:2107.12808, 2021.
  • Touvron etal. (2023)Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, etal.Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023.
  • Tsimpoukelli etal. (2021)Maria Tsimpoukelli, JacobL Menick, Serkan Cabi, SMEslami, Oriol Vinyals, and Felix Hill.Multimodal few-shot learning with frozen language models.Advances in Neural Information Processing Systems, 34:200–212, 2021.
  • Wang etal. (2019)Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and WilliamYang Wang.Vatex: A large-scale, high-quality multilingual dataset for video-and-language research.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4581–4591, 2019.
  • Xu etal. (2023)Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, YuQiao, and Ping Luo.Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models.arXiv preprint arXiv:2306.09265, 2023.
  • Yang etal. (2018)GuangyuRobert Yang, Igor Ganichev, Xiao-Jing Wang, Jonathon Shlens, and David Sussillo.A dataset and architecture for visual reasoning with a working memory.In Proceedings of the European Conference on Computer Vision (ECCV), pp. 714–731, 2018.
  • Ye etal. (2023)Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, etal.mplug-owl: Modularization empowers large language models with multimodality.arXiv preprint arXiv:2304.14178, 2023.
  • Yu etal. (2023)Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang.Mm-vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490, 2023.
  • Zhai etal. (2022)Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer.Lit: Zero-shot transfer with locked-image text tuning.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18123–18133, 2022.
  • Zhang etal. (2021)Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao.Vinvl: Revisiting visual representations in vision-language models.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5579–5588, 2021.
  • Zhang etal. (2022)Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, XiVictoria Lin, etal.Opt: Open pre-trained transformer language models.arXiv preprint arXiv:2205.01068, 2022.
  • Zhou etal. (2018)Luowei Zhou, Chenliang Xu, and Jason Corso.Towards automatic learning of procedures from web instructional videos.In Proceedings of the AAAI Conference on Artificial Intelligence, volume32, 2018.
  • Zhu etal. (2023)Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny.Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023.

Appendix A Appendix

A.1 Additional Discussion

A.1.1 Extensibility of iWISDM

  • Task operators. iWISDM task rules are constructed by interconnecting building blocks called task operators. Task operators are highly customizable and extensible: users can define new task logic by inheriting the Operator class and overriding its get_expected_input function (see the sketch at the end of this list). Our core task operator set is adopted from prior work (Yang et al., 2018) and includes Get, IsSame, NotSame, And, and Or.

  • Visual inputs. iWISDM also allows any natural image set to be used as the stimulus set. In our core build, we use 2D projections of 3D object models from the ShapeNet dataset (Chang et al., 2015), where each stimulus is defined by a parameter vector consisting of category, identity, pose angle, and location. We provide a template that allows users to seamlessly import alternative stimulus datasets.

  • Stimulus properties. In addition to the visual stimuli themselves, users can define arbitrary object/stimulus properties for the environment to use during task generation. In our core build of iWISDM, the defaults are category, identity, pose angle, and location. These properties are attached to each individual object in our stimulus set via an accompanying JSON file, and users can add custom properties by editing the JSON files that accompany their stimuli of choice.
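To make the extension pattern concrete, the following minimal Python sketch shows what a custom operator could look like. Only the Operator base class and the get_expected_input method names come from the description above; the stand-in base class, the constructor signature, and the IsSameLocation semantics are illustrative assumptions rather than the package's actual API.

```python
# Illustrative sketch of the operator-extension pattern described above.
# `Operator` here is a toy stand-in for iWISDM's Operator base class; the real
# signatures and return contracts may differ.

class Operator:
    """Toy stand-in base class: a node in a task graph with child operators."""

    def __init__(self, *children):
        self.children = list(children)

    def get_expected_input(self, should_be):
        raise NotImplementedError


class IsSameLocation(Operator):
    """Hypothetical operator: True when two observed objects share a location."""

    def __init__(self, obj_a, obj_b):
        super().__init__(obj_a, obj_b)
        self.obj_a, self.obj_b = obj_a, obj_b

    def get_expected_input(self, should_be):
        # When the generator works backwards from a desired answer (`should_be`),
        # this method states what the child operators must produce:
        # identical locations for True, two distinct locations for False.
        if should_be:
            shared = "top left"                    # any shared location
            return shared, shared
        return "top left", "bottom right"          # any two distinct locations
```

A new operator defined this way can then be composed with the core set (Get, IsSame, NotSame, And, Or) when building task graphs.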

A.1.2 iWISDM as a framework for Continual Learning

Additionally, iWISDM provides a framework for testing and comparing different approaches to continual learning (CL) and multi-task learning. iWISDM can generate any number of tasks with quantifiable similarity, which can in turn be used to test CL approaches on their capacity to sequentially learn tasks of varying similarity without forgetting. Unlike most current multi-task learning models, which are tested in fixed environments with relatively homogeneous task structures, iWISDM introduces dynamic task variations. For instance, in computer vision, CL algorithms are typically tested on classifying an increasing number of image classes using datasets such as MNIST (Deng, 2012) and CORe50 (Lomonaco & Maltoni, 2017). Task-free continual learning approaches (Aljundi et al., 2019) exist but are still limited in generating a continual stream of tasks. In contrast, iWISDM allows users to generate a series of tasks with monotonically increasing or decreasing difficulty by adjusting complexity parameters in a fully controllable manner (a sketch of such a curriculum follows below).
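Because the complexity parameters are fully controllable, a difficulty-ordered task stream for CL experiments reduces to iterating over a list of parameter settings. The sketch below is illustrative only: generate_trial is a hypothetical placeholder for the environment's task-generation call, and the three parameter settings mirror the Low/Medium/High benchmark configurations tabulated at the end of this appendix.

```python
# Illustrative curriculum of increasingly complex iWISDM tasks.
# `generate_trial` is a hypothetical placeholder, not the package's real API;
# the parameter values mirror the Low/Medium/High settings in the table
# at the end of this appendix.

CURRICULUM = [
    {"n_frames": 6, "n_and_or": 1, "n_switch": 0},  # low complexity
    {"n_frames": 8, "n_and_or": 1, "n_switch": 1},  # medium complexity
    {"n_frames": 9, "n_and_or": 2, "n_switch": 1},  # high complexity (up to 2 and/or ops)
]


def generate_trial(n_frames, n_and_or, n_switch):
    """Placeholder: in practice this would call the iWISDM task generator."""
    raise NotImplementedError


def task_stream(trials_per_level=1000):
    """Yield trials level by level, in order of monotonically increasing difficulty."""
    for level in CURRICULUM:
        for _ in range(trials_per_level):
            yield generate_trial(**level)
```

Reversing or interleaving the entries in CURRICULUM gives the decreasing-difficulty or mixed-difficulty streams mentioned above.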

A robust model should adapt to changing tasks based on its prior sequential experience. CL models should therefore be evaluated on how quickly they learn incoming tasks, taking into account quantifiable complexity parameters and confounding factors. Existing CL models often suffer from rapid performance degradation on previously learned tasks, known as catastrophic forgetting. iWISDM can be used to test the hypothesis that a properly ordered sequence of incoming tasks, i.e., curriculum learning, can mitigate catastrophic forgetting.

Achieving human-level intelligence requires models capable of out-of-distribution generalization. Compositional generalization is a key ability of intelligent agents, and thus models should be evaluated on their ability to generalize to compositionally generated scenarios. iWISDM's graph structure allows novel tasks to be constructed manually from existing task graphs, enabling systematic testing of both logical and temporal compositional generalization. This makes iWISDM an ideal test bed for modularity-based CL algorithms, as it facilitates establishing correspondences between model modules and task components.

iWISDM is also suitable for evaluating meta-learning models in the CL domain. While natural language instructions are provided, other methods of constructing task-specific identifiers are available, such as a compressed one-hot encoding of the task graph obtained with simple graph-theoretic methods (a toy encoding is sketched below). These alternative task identifiers can serve as templates to compare against the representations learned by a meta-learner. Analyzing the similarity across these identifiers can further our understanding of the task-solving mechanisms underlying neural network models.
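As an illustration of a graph-derived task identifier, the toy encoding below summarizes a task graph by counting operator types and their fan-in. It is a count-based stand-in for the compressed encoding mentioned above, not the package's built-in representation; the graph format (operator name, number of children) is likewise an assumption made for the example.

```python
# Toy graph-derived task identifier: counts of each operator type and their
# total fan-in, flattened into a fixed-length vector. Purely illustrative.

OPERATORS = ["Get", "IsSame", "NotSame", "And", "Or", "Switch"]


def encode_task_graph(nodes):
    """nodes: list of (operator_name, n_children) pairs describing a task graph."""
    vec = []
    for op in OPERATORS:
        fan_ins = [n_children for name, n_children in nodes if name == op]
        vec += [len(fan_ins), sum(fan_ins)]      # [count, total fan-in] per operator
    return vec


# Example graph: And(IsSame(Get, Get), IsSame(Get, Get))
example = [("And", 2), ("IsSame", 2), ("IsSame", 2),
           ("Get", 0), ("Get", 0), ("Get", 0), ("Get", 0)]
print(encode_task_graph(example))   # [4, 0, 2, 4, 0, 0, 1, 2, 0, 0, 0, 0]
```

Comparing such vectors across tasks (e.g., with cosine similarity) gives one quantitative handle on task similarity that can be matched against a meta-learner's learned representations.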

A.2 Additional Figures

(Figures 5–14: additional figures; images not reproduced in this text version.)

A.3 Model evaluation prompts

A.3.1 GPT-4V, Claude-3, & Gemini-pro Low/Medium Complexity (all properties + examples included)

In this task we will show you a series of frame images. Each frame will either be blank (delay frame) or contain a 3D object. The objects within the task will ALWAYS be from one of 8 categories: benches, boats, cars, chairs, couches, lighting, planes, and tables. For each of these 8 categories, there are 8 unique objects that could be used in the task. Any object which is sampled will be displayed as an image taken from a random viewing angle. The objects will be placed in one of four locations: top left, top right, bottom left, and bottom right.

A written instruction will be provided. Your goal is to follow the instructions and answer the question contained in the instruction. Answers will ALWAYS be one of the following: true, false.

Here is a simple example of the task ...

Task instruction: "observe object 1, observe object 2, location of object 1 not equal location: bottom left ?"

Here are the corresponding frames ...

Answer: false. This is because the location of object 1 IS in the bottom left location.

Here is a simple example of the task...

Task instruction: "observe object 1, delay, observe object 2, category of object 1 equals category of object 2 ?"

Here are the corresponding frames ...

Answer: true. This is because the category of object 1 (lighting) IS equal to the category of object 2 (lighting).

Here is a simple example of the task...

Task instruction: "observe object 1, observe object 2, identity of object 1 equals identity of object 2 ?"

Here are the corresponding frames ...

Answer: true. This is because object 1 (a white table) IS identical to object 2 (the same white table).

Now please solve the following new task...

Task instruction: "observe object 1, observe object 2, delay, observe object 3, observe object 4, category of object 2 not equal category of object 3 or category of object 1 equals category of object 4 ?"

Here are the corresponding frames ...

What is the correct answer to this task? (respond EXACTLY and ONLY with one of the following answers: true, false). Provide your answer here:

A.3.2 GPT-4V Single-Frame Complexity (location property + no examples included)

In this task we will show you an image. Each image will contain a 3D object. The objects within the task will ALWAYS be from one of 8 categories: benches, boats, cars, chairs, couches, lighting, planes, and tables. For each of these 8 categories, there are 8 unique objects that could be used in the task. Any object which is sampled will be displayed as an image taken from a random viewing angle. The object will be placed in one of four locations: top left, top right, bottom left, and bottom right.

A written instruction will be provided. Your goal is to follow the instructions and answer the question contained in the instruction. Answers will ALWAYS be one of the following: true, false.

Now please solve the following new task...

Task instruction: "observe object 1, category of object 1 not equals planes?"

Here are the corresponding frames ...

What is the correct answer to this task? (respond EXACTLY and ONLY with one of the following answers: true, false). Provide your answer here:

A.3.3 GPT-4V, Claude-3, & Gemini-pro High Complexity (all properties + examples included)

In this task we will show you a series of frame images. Each frame will either be blank (delay frame) or contain a 3D object. The objects within the task will ALWAYS be from one of 8 categories: benches, boats, cars, chairs, couches, lighting, planes, and tables. For each of these 8 categories, there are 8 unique objects that could be used in the task. Any object which is sampled will be displayed as an image taken from a random viewing angle. The objects will be placed in one of four locations: top left, top right, bottom left, and bottom right.

A written instruction will be provided. Your goal is to follow the instructions and answer the question contained in the instruction. Answers will ALWAYS be one of the following: true, false, bottom right, bottom left, top left, top right, benches, boats, cars, chairs, couches, lighting, planes, tables.

Here is an example of the task...

Task instruction: "observe object 1, observe object 2, location of object 1 not equal location: bottom left ?"

Here are the corresponding frames ...

The correct answer: bottom right. This is because object 2 is located in the bottom right.

Here is a simple example of the task...

Task instruction: "observe object 1, delay, observe object 2, category of object 1 equals category of object 2 ?"

Here are the corresponding frames ...

Answer: lighting. This is because object 1 (a lamp) belongs to the category of lighting.

Here is a simple example of the task...

Task instruction: "observe object 1, observe object 2, identity of object 1 equals identity of object 2?"

Here are the corresponding frames ...

Answer: true. This is because object 1 (a white table) IS identical to object 2 (the same white table).

Now please solve the following new task...

Task instruction: "observe object 1, observe object 2, delay, observe object 3, observe object 4, observe object 5, if location of object 5 not equal location of object 2 , then location of object 1? else category of object 4 not equal tables or category of object 3 not equal couches?"

Here are the corresponding frames ...

What is the correct answer to this task? (respond EXACTLY and ONLY with one of the following answers: true, false, bottom right, bottom left, top left, top right, benches, boats, cars, chairs, couches, lighting, planes, tables). Provide your answer here:

A.3.4 InternLM-XComposer2 & MMICL Low Complexity (all properties included + examples excluded)

In this task we will show you a series of frame images. Each frame will either be blank (delay frame) or contain a 3D object. The objects within the task will ALWAYS be from one of 8 categories: benches, boats, cars, chairs, couches, lighting, planes, and tables. For each of these 8 categories, there are 8 unique objects that could be used in the task. Any object which is sampled will be displayed as an image taken from a random viewing angle. The objects will be placed in one of four locations: top left, top right, bottom left, and bottom right.

A written instruction will be provided. Your goal is to follow the instructions and answer the question contained in the instruction. Answers will ALWAYS be one of the following: true, false.

Please solve the following task...

Task instruction: "observe object 1, delay, observe object 2, observe object 3, observe object 4, location of object 1 equals location of object 2 and category of object 3 equals category of object 4?"

Here are the corresponding frames ...<ImageHere> <ImageHere> <ImageHere> <ImageHere> <ImageHere> <ImageHere>What is the correct answer to this task? (respond EXACTLY and ONLY with one of the following answers: true, false). Provide your answer here:

A.3.5 InternLM-XComposer2 High Complexity (all properties included + examples excluded)

In this task we will show you a series of frame images. Each frame will either be blank (delay frame) or contain a 3D object. The objects within the task will ALWAYS be from one of 8 categories: benches, boats, cars, chairs, couches, lighting, planes, and tables. For each of these 8 categories, there are 8 unique objects that could be used in the task. Any object which is sampled will be displayed as an image taken from a random viewing angle. The objects will be placed in one of four locations: top left, top right, bottom left, and bottom right.

A written instruction will be provided. Your goal is to follow the instructions and answer the question contained in the instruction. Answers will ALWAYS be one of the following: true, false, bottom right, bottom left, top left, top right, benches, boats, cars, chairs, couches, lighting, planes, tables.

Please solve the following task...

Task instruction: "observe object 1, observe object 2, observe object 3, observe object 4, delay, observe object 5, observe object 6, delay, observe object 7, if location of object 6 not equal location of object 2 or category of object 3 not equal category of object 4, then location of object 7 equals location of object 5? else category of object 1 ?"

Here are the corresponding frames ...<ImageHere> <ImageHere> <ImageHere> <ImageHere> <ImageHere> <ImageHere> <ImageHere> <ImageHere> <ImageHere>What is the correct answer to this task? (respond EXACTLY and ONLY with one of the following answers: true, false, bottom right, bottom left, top left, top right, benches, boats, cars, chairs, couches, lighting, planes, tables). Provide your answer here:
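The prompts in A.3.1–A.3.5 share a common template: a task preamble, optional in-context examples, the trial instruction, the frame images, and a constrained answer request. As a rough illustration of how such a prompt could be assembled programmatically (this is not the evaluation code used for the benchmarks; the wording is abridged from the sections above, and how images are attached depends on each model's API), consider the following sketch.

```python
# Hypothetical prompt-assembly helper in the style of the templates above.
# The preamble is abridged; interleaved <ImageHere> tokens follow the
# InternLM-XComposer2/MMICL convention, while API-based models typically take
# the frame images as separate attachments instead.

PREAMBLE = (
    "In this task we will show you a series of frame images. Each frame will "
    "either be blank (delay frame) or contain a 3D object. [...]"
)


def build_prompt(instruction, n_frames, answer_options, image_token="<ImageHere>"):
    frames = " ".join([image_token] * n_frames)
    options = ", ".join(answer_options)
    return (
        f"{PREAMBLE}\n\n"
        "Please solve the following task...\n\n"
        f'Task instruction: "{instruction}"\n\n'
        f"Here are the corresponding frames ... {frames} "
        "What is the correct answer to this task? (respond EXACTLY and ONLY with "
        f"one of the following answers: {options}). Provide your answer here:"
    )


prompt = build_prompt(
    "observe object 1, delay, observe object 2, "
    "category of object 1 equals category of object 2 ?",
    n_frames=3,
    answer_options=["true", "false"],
)
```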

Complexity | # of allowed and/or operators in task | # of switch operators | # of trial frames | root operators | boolean operators
Low | 1 | 0 | 6 | IsSame, And, Or, NotSame | IsSame, And, Or, NotSame
Medium | 1 | 1 | 8 | IsSame, And, Or, NotSame | IsSame, And, Or, NotSame
High | 1-2 | 1 | 9 | IsSame, And, Or, NotSame, GetLoc, GetCategory | IsSame, And, Or, NotSame