# Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

> Cobra is introduced, a multi-modal large-scale language model built upon a state-space model, which has demonstrated significant potential in efficiently handling long sequences with fast inference and linear scalability concerning sequence length.

## Metadata

- Authors: Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang
- Journal: ArXiv
- Published: 2024
- DOI: https://doi.org/10.48550/arXiv.2403.14520
- Citations: 126
- Source: Semantic Scholar

## Technology Hub

- Hub: Large Language Models
- Discipline: Computer Science / AI
- Hub URL: https://science-database.com/technology/large-language-models
- Hub llms.txt: https://science-database.com/technology/large-language-models/llms.txt

## Abstract

In recent years, applying multi-modal large language models (MLLMs) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, MLLMs comprise the well-known Transformer network, which has a less efficient quadratic computation complexity. In this study, we introduce Cobra, a multi-modal large-scale language model built upon a state-space model, which has demonstrated significant potential in efficiently handling long sequences with fast inference and linear scalability concerning sequence length. Specifically, Cobra involves replacing Transformer-based backbone models (e.g., LLaMA or Phi) with pre-trained Mamba language models. We then empirically explore effective strategies for aligning visual and textual modalities and integrating various pre-trained Mamba model variants with visual encoders. Experiments across various multi-modal benchmarks demonstrate that: (i) Cobra performs 3× ∼ 4× faster than the most computationally efficient state-of-the-art methods, e.g., LLaVA-Phi and MobileVLM v2. Additionally, its performance is significantly enhanced thanks to the implementation of linear sequential modeling.
(ii) Cobra fine-tunes a small parameter (∼48% of model parameters), leading to a significant improvement in overall performance compared to LLaVA.

## Links

- DOI: https://doi.org/10.48550/arXiv.2403.14520
- Semantic Scholar: https://www.semanticscholar.org/paper/40e996a7c3e914a67c708704fa9b4c54ea70f36e
- JSON API: https://science-database.com/api/v1/technology/large-language-models

---

Generated by science-database.com — The Knowledge Interface
Paper ID: s2-40e996a7c3e914a67c708704fa9b4c54ea70f36e | Hub: large-language-models