# Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

> Cobra is a multi-modal large language model built on a state-space model, an architecture that has shown significant potential for handling long sequences efficiently, with fast inference and cost that scales linearly with sequence length.

## Metadata
- Authors: Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang
- Journal: ArXiv
- Published: 2024
- DOI: https://doi.org/10.48550/arXiv.2403.14520
- Citations: 126
- Source: Semantic Scholar

## Technology Hub
- Hub: Large Language Models
- Discipline: Computer Science / AI
- Hub URL: https://science-database.com/technology/large-language-models
- Hub llms.txt: https://science-database.com/technology/large-language-models/llms.txt

## Abstract
In recent years, multi-modal large language models (MLLMs) have been applied with remarkable success in various fields. However, as the foundation model for many downstream tasks, MLLMs are built on the well-known Transformer network, whose computational complexity is quadratic in sequence length and therefore less efficient. In this study, we introduce Cobra, a multi-modal large-scale language model built upon a state-space model, which has demonstrated significant potential in efficiently handling long sequences with fast inference and linear scalability with respect to sequence length. Specifically, Cobra replaces Transformer-based backbone models (e.g., LLaMA or Phi) with pre-trained Mamba language models. We then empirically explore effective strategies for aligning visual and textual modalities and for integrating various pre-trained Mamba model variants with visual encoders. Experiments across various multi-modal benchmarks demonstrate that: (i) Cobra performs 3× to 4× faster than the most computationally efficient state-of-the-art methods, e.g., LLaVA-Phi and MobileVLM v2. Additionally, its performance is significantly enhanced by linear sequential modeling. (ii) Cobra fine-tunes only a small share of the parameters (∼48% of model parameters), leading to a significant improvement in overall performance compared to LLaVA.
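To make the complexity contrast in the abstract concrete, here is a minimal sketch of a linear state-space recurrence. It is not Cobra's or Mamba's actual implementation: it assumes fixed, diagonal parameters `a`, `b`, `c` (hypothetical names), whereas Mamba makes its parameters input-dependent. The point is only that the scan visits each token once, while full self-attention scores every pair of tokens.

```python
def ssm_scan(xs, a, b, c):
    """Run a simplified linear state-space recurrence over a scalar sequence.

    Assumed (illustrative) form, with a diagonal state transition:
        h_t = a * h_{t-1} + b * x_t     (elementwise over the state)
        y_t = sum(c * h_t)
    The loop does constant work per token, so cost grows linearly with len(xs).
    """
    h = [0.0] * len(a)  # hidden state, one entry per state dimension
    ys = []
    for x in xs:
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


def scan_step_count(seq_len):
    """State updates needed by the recurrence: one per token."""
    return seq_len


def attention_pair_count(seq_len):
    """Query-key scores computed by one full self-attention layer."""
    return seq_len * seq_len


if __name__ == "__main__":
    # An impulse input decays geometrically through a one-dimensional state.
    print(ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[2.0]))  # [2.0, 1.0, 0.5]
    for n in (512, 2048):
        print(n, scan_step_count(n), attention_pair_count(n))
```

Doubling the sequence length doubles the scan's step count but quadruples the number of attention scores, which is the scaling gap the abstract refers to.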

## Links
- DOI: https://doi.org/10.48550/arXiv.2403.14520
- Semantic Scholar: https://www.semanticscholar.org/paper/40e996a7c3e914a67c708704fa9b4c54ea70f36e
- JSON API: https://science-database.com/api/v1/technology/large-language-models

---
Paper ID: s2-40e996a7c3e914a67c708704fa9b4c54ea70f36e | Hub: large-language-models