CLIP, LLaVA, and the Brain

Deep Learning and the Brain

Insights into Multimodal Transformers from Neuroscience


Image generated by the author using Dall-E 3.

How do recent multimodal transformer networks, like CLIP (Radford et al. 2021) and LLaVA (Liu et al. 2023), compare to the brain? Are there similarities between attention in these networks and attention in the brain? In this article, I look at these transformer architectures with an eye toward their similarities to, and differences from, the mammalian brain.

What stood out to me was that vision transformers, CLIP, and LLaVA perform a type of processing analogous to pre-attentive visual processing in the brain. This processing is done in the initial feedforward visual responses to a stimulus before recurrence. Although a lot can be accomplished in a feedforward way, studies have shown that feedforward pre-attentive processing in the brain does have difficulty with:

  1. Distinguishing the identity or characteristics of similar types of objects, especially when the objects are close together or cluttered, or when they are unnatural or artificial (VanRullen 2007).
  2. More complex tasks, such as counting, maze solving, or curve tracing.
  3. Perceiving objects that are harder to see, for example when the boundaries of the objects are difficult to make out.

In contrast to this feedforward processing, one of the things that stands out about the brain is the richness of the interactions between areas, which I discuss in more detail in the next section.

Bidirectional Activity in the Brain

In most current deep learning architectures, activity propagates in a single direction: an image, for example, is given as input to a network and passed from layer to layer until a classification comes out as the output.
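As a point of reference, here is a minimal sketch of that one-way flow in PyTorch; the layer sizes and input shape are arbitrary and purely illustrative:

```python
import torch
import torch.nn as nn

# A minimal feedforward classifier: activity moves in one direction only,
# from the input image through each layer to the class scores.
model = nn.Sequential(
    nn.Flatten(),                 # image -> vector
    nn.Linear(3 * 32 * 32, 256),  # lower-level features
    nn.ReLU(),
    nn.Linear(256, 10),           # class scores; nothing flows back during inference
)

image = torch.randn(1, 3, 32, 32)  # dummy input image
logits = model(image)              # one forward sweep, no recurrence or feedback
print(logits.shape)                # torch.Size([1, 10])
```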

Figure 1: A simplified diagram showing some of the feed-forward and feedback connections in the Macaque brain. The earlier (or lower-level) areas are whiter, while the later (or higher-level) areas are bluer. Image by Author.

The brain is much more interesting than these feedforward models. In the visual system, a stimulus will initially propagate from lower- to higher-level visual areas in a feedforward fashion, then the higher-level areas will exert influence over the lower-level areas as depicted in Figure 1.

Some of this feedback is the conscious top-down attention that allows us to allocate more resources to objects and features of interest and disambiguate stimuli that are either complex or ambiguous. Another part of this feedback is automatic and allows higher-level areas to infuse the lower-level areas with information that would not be known in just the feedforward manner.

Conscious top-down attention is thought to support consciousness of visual stimuli. Without conscious access to lower-level areas that encode borders and edges, we wouldn’t have as spatially precise a perception of borders. Tasks like mentally tracing a curve or solving a maze would be impossible.

One example of automatic unconscious feedback is border-ownership coding, which is seen in about half of the orientation-selective neurons in visual area V2 (Zhou et al. 2000, Williford and von der Heydt 2013). These neurons encode local information within about 40 ms and, as early as 10 ms after this initial response, incorporate global context to resolve occlusions, signaling which object owns a border, that is, which object creates the border by occluding its background.

Another example of this unconscious feedback was shown by Poort et al. (2012) using images like the one in Figure 2. In the Macaque early visual cortex V1, neurons initially (within 50–75 ms of stimulus presentation) tend to encode only the local features within their receptive fields (e.g., the green square). After around 75 ms, however, they receive feedback from the higher-level areas and tend to respond more strongly when that texture belongs to a figure, such as the texture-defined figure in Figure 2. This happens even when attention is drawn away from the figure; if the monkey is paying attention to the figure, the neurons respond even more on average.

Figure 2: Shapes defined only by texture, like the one above, can be difficult to see in a purely “feed-forward” manner. The interaction between lower- and higher-level areas enables us to perceive such difficult shapes (Poort et al. 2012). Image by Author.

One way to look at this bidirectional interaction is that each neuron constantly and greedily uses all the predictive signals available to it. Even higher-level areas can provide predictive signals, especially when visual borders do not correspond to significant first-order contrast edges.

Transformers

Given all the talk about attention since the introduction of transformers (Vaswani et al. 2017), and their ability to generate sentences one word at a time, you might be led to believe that transformers are recurrent. However, no internal state is kept between steps; only the output generated so far is provided as input at the next step. The recurrence is therefore limited and lacks the bidirectionality that is ubiquitous in the brain. Transformers do have multi-headed attention, which is something like being able to attend to a fixed number of things simultaneously (8 heads in the original paper). Hence, image transformers can be seen as analogous to pre-attentive feedforward processing with some modifications.
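To make that limited form of recurrence concrete, here is a rough sketch of autoregressive decoding. The `model` here is a placeholder for any transformer that maps a token sequence to next-token logits, and the greedy decoding is just for illustration:

```python
import torch

def generate(model, prompt_tokens, max_new_tokens=20, eos_id=2):
    """Sketch of transformer decoding: no hidden state survives between steps;
    the only 'memory' is the growing token sequence fed back in each iteration.
    (In practice key/value caching avoids recomputation, but that is an
    optimization, not a recurrent state in the RNN sense.)"""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        inp = torch.tensor(tokens).unsqueeze(0)   # the entire sequence so far
        logits = model(inp)                       # a fresh feedforward pass
        next_token = int(logits[0, -1].argmax())  # greedy choice of the next token
        tokens.append(next_token)
        if next_token == eos_id:                  # stop at end-of-sequence
            break
    return tokens
```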

CLIP

Figure 3: CLIP trains an image and text encoder using image-caption pairs. I₁ and T₁ are the encodings of image 1 and the corresponding caption. A contrastive learning loss is used to make Iᵢ and Tⱼ more similar when i = j and more dissimilar when i ≠ j. Weights are trained from scratch. Figure reproduced with permission from Radford et al. (2021).

Radford and colleagues from OpenAI introduced CLIP in their 2021 paper “Learning Transferable Visual Models from Natural Language Supervision”. The idea behind CLIP is simple and is shown in Figure 3. It takes a bunch of image and caption pairs from the Internet and feeds the image to an image encoder and the text to a text encoder. It then uses a loss that brings the encoding of the image and the encoding of the text closer together when they belong to the same pair; otherwise, the loss pushes the encodings apart. This is what CLIP gives you: the ability to compare the similarity between text and images. That alone allows it to be used for zero-shot classification, as shown in Figure 4. CLIP does not, by itself, generate text descriptions from images.
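The loss can be sketched in a few lines of PyTorch. This is a simplified version: CLIP learns the temperature rather than fixing it, and trains over very large batches, details I omit here.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss in the spirit of CLIP: matched image/text
    pairs (the diagonal of the similarity matrix) are pulled together,
    mismatched pairs are pushed apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # N x N similarity matrix
    targets = torch.arange(len(image_emb))           # image i matches caption i
    loss_i = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i + loss_t) / 2
```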

The image encoder and text encoder are independent, meaning there is no way for task-driven modulation to influence the image encoding. This means that the image encoder must encode everything that could be potentially relevant to the task. Typically, the resolution of the input image is small, which helps prevent the computation and memory requirements from exploding.

Figure 4: CLIP can be used for zero-shot classification. Text is created for each of the N classes, which are then encoded into tokens T1…TN. The image is then encoded, and the similarity is measured with the generated text encodings. The most similar text encoding is the chosen class. Figure reproduced with permission from Radford et al. (2021).
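In code, zero-shot classification looks roughly like the following. I am assuming the Hugging Face transformers wrapper around CLIP; the checkpoint name, image path, and candidate labels are only illustrative.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot classification in the style of Figure 4.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
labels = ["a photo of a dog", "a photo of a cat", "a photo of a maze"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarities -> class probabilities
print(dict(zip(labels, probs[0].tolist())))
```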

LLaVA

Figure 5: LLaVA architecture. Xv: image, Xq: instruction/question, Hv: image tokens, Hq: instruction tokens, Xa: answer, generated one token at a time. Image by Author, based on Figure 1 from Liu et al. (2023).

Large Language and Vision Assistant (LLaVA) (Liu et al. 2023) is an architecture that builds on CLIP to add the ability to describe and answer questions about images. This type of architecture interests me because it can attempt tasks like those used in neuroscience and psychology.

LLaVA uses the vision transformer ViT-L/14 trained by CLIP as its image encoder (Figure 5). The first paper uses a single linear projection matrix W to convert the image encodings into tokens. The tokens calculated from the image (Hᵥ) and the text instructions (Hq) are provided as input, and LLaVA then generates the language response Xₐ one token at a time, appending the response so far to the input for the next iteration.
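Here is a rough sketch of that bridge between the vision encoder and the language model. The dimensions are illustrative (1024 for ViT-L/14 patch features, 4096 for a LLaMA-style embedding size, 256 patch tokens as an example), and random tensors stand in for the real encoder outputs:

```python
import torch
import torch.nn as nn

# Sketch: patch features from the CLIP ViT image encoder are mapped by a
# single linear projection W into the language model's embedding space,
# then concatenated with the instruction tokens before generation.
vision_dim, llm_dim = 1024, 4096
W = nn.Linear(vision_dim, llm_dim)                  # projection from the first LLaVA paper

patch_features = torch.randn(1, 256, vision_dim)    # placeholder CLIP ViT patch features
H_v = W(patch_features)                             # image tokens Hv
H_q = torch.randn(1, 32, llm_dim)                   # instruction tokens Hq (placeholder)

llm_input = torch.cat([H_v, H_q], dim=1)            # the LLM then generates Xa token by token
print(llm_input.shape)                              # torch.Size([1, 288, 4096])
```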

I won’t go into the details of how LLaVA is trained, but it is interesting how they use ChatGPT to expand the caption (Xc in Figure 5) into instructions (Hq) and responses (used to train Xₐ) about an image, and how they use bounding box information in that process.

In version 1.5 of LLaVA (Liu et al. 2024), some of the improvements they made include:

  • The linear projection matrix W is replaced with a multilayer perceptron (see the sketch after this list)
  • The image resolution is increased by using an image encoder that takes images of size 336×336 pixels and splits the images into grids that are encoded separately
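For the first of these changes, here is a minimal sketch of the kind of MLP projector that replaces W, written as a two-layer MLP with a GELU activation; the dimensions are again illustrative:

```python
import torch.nn as nn

# LLaVA-1.5-style projector: the single linear map W becomes a small MLP.
vision_dim, llm_dim = 1024, 4096
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
```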

Task-driven attention in the brain can dynamically allocate resources to the object, location, or features of interest, which allows the processing of information that would otherwise be overwhelmed by clutter or other objects. In LLaVA, the image encoder is independent of the text instructions, so to be successful it needs to make sure any potentially useful information is stored in the image tokens (Hᵥ).

Conclusion

LLaVA and CLIP lack bidirectionality and recurrence with internal states, which constrains their processing. This is especially true for the image processing, which is done independently of the text instructions. Most convolutional neural networks share these limitations as well. This leads me to my conjecture:

Conjecture: Most convolutional, vision transformer, and multimodal transformer networks are restricted to processing that is analogous to pre-attentive feedforward visual processing in the brain.

This is not a criticism so much as an insight that can be informative. Feedforward processing can do a lot, and it is fast. However, it is not as dynamic in how it allocates resources, which can lead to informational bottlenecks in cluttered scenes, and it cannot encode enough information for complex tasks without an explosion in the size of the encodings. Creating models that work in a feedforward fashion is an important stepping stone because of the difficulty of adding recurrence and bidirectional processing.

Some architectures are not limited to pre-attentive feedforward processing, but they currently lag behind transformers. These include long short-term memory models (LSTMs) and, more recently, the Mamba architecture, which has several benefits over transformers (Gu and Dao 2024). Extended LSTMs (Beck et al. 2024, Alkin et al. 2024) have recently been proposed and help close the gap between LSTMs and transformers. Diffusion models also have a limited type of recurrence that uses the image as the state between iterations.

References

B. Alkin, M. Beck, K. Pöppel, S. Hochreiter, and J. Brandstetter, Vision-LSTM: xLSTM as Generic Vision Backbone (2024), http://arxiv.org/abs/2406.04303

M. Beck, K. Pöppel, M. Spanring, A. Auer, O. Prudnikova, M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter, xLSTM: Extended Long Short-Term Memory (2024), http://arxiv.org/abs/2405.04517

A. Gu and T. Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces (2024), http://arxiv.org/abs/2312.00752

H. Liu, C. Li, Y. Li, and Y. J. Lee, Improved Baselines with Visual Instruction Tuning (2024), Proc. of IEEE/CVF CVPR

H. Liu, C. Li, Q. Wu, and Y. J. Lee, Visual Instruction Tuning (2023), https://doi.org/10.48550/arXiv.2304.08485

J. Poort, F. Raudies, A. Wannig, V. A. F. Lamme, H. Neumann, and P. R. Roelfsema, The Role of Attention in Figure-Ground Segregation in Areas V1 and V4 of the Visual Cortex (2012), Neuron

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, and J. Clark, Learning Transferable Visual Models from Natural Language Supervision (2021), ICML

R. VanRullen, The Power of the Feed-Forward Sweep (2007), Advances in Cognitive Psychology

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention Is All You Need (2017), NeurIPS

J. R. Williford and R. von der Heydt, Border-Ownership Coding (2013), Scholarpedia

H. Zhou, H. S. Friedman, and R. von der Heydt, Coding of Border Ownership in Monkey Visual Cortex (2000), The Journal of Neuroscience

Originally published at http://neural.vision on June 19, 2024.