---
title: "Computer Vision"
slug: "computer-vision"
discipline: "Computer Science / AI"
description: "Visual AI and image understanding. Object detection, segmentation, 3D vision, video understanding, visual transformers, and multimodal vision-language models."
icon: "👁️"
url: "https://science-database.com/technology/computer-vision"
api: "https://science-database.com/api/v1/technology/computer-vision"
llms_txt: "https://science-database.com/technology/computer-vision/llms.txt"
articles_indexed: 15
last_updated: "2026-04-11T08:29:55.461Z"
search_terms:
  - "computer vision transformer detection"
  - "image segmentation deep learning"
  - "vision language model multimodal"
source: "science-database.com"
license: "metadata CC0, abstracts belong to respective publishers"
---

# Computer Vision

Visual AI and image understanding. Object detection, segmentation, 3D vision, video understanding, visual transformers, and multimodal vision-language models.

**Discipline:** Computer Science / AI  
**Indexed Papers:** 15  
**Last Updated:** 2026-04-11

## Top Publications

Ranked by citation impact across Semantic Scholar, OpenAlex & arXiv.

### Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

- **Authors:** Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo
- **Journal:** 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- **Published:** 2021-10-01
- **DOI:** [10.1109/iccv48922.2021.00986](https://doi.org/10.1109/iccv48922.2021.00986)
- **Citations:** 28,719
- **Source:** OpenAlex
- **llms.txt:** [View](https://science-database.com/technology/computer-vision/paper/oa-W3138516171/llms.txt)

> This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differ...
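
The window partitioning and cyclic shift described in the abstract can be sketched as follows. This is an illustrative NumPy sketch of the windowing idea only, not the authors' implementation (which operates on batched tensors and adds attention masks for the shifted case):

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping
    (window_size, window_size, C) windows; self-attention is then
    computed independently inside each window."""
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size,
                  W // window_size, window_size, C)
    # -> (num_windows, window_size, window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, C)

def shift_windows(x, window_size):
    """Cyclically shift the map by half a window before partitioning,
    so the next layer's windows straddle the previous boundaries
    (the 'shifted window' scheme)."""
    s = window_size // 2
    return np.roll(x, shift=(-s, -s), axis=(0, 1))

feat = np.arange(8 * 8 * 1).reshape(8, 8, 1)
windows = window_partition(feat, 4)                    # 4 windows of 4x4
shifted = window_partition(shift_windows(feat, 4), 4)  # windows cross old borders
```

Alternating plain and shifted layers is what gives the model cross-window connections at linear cost in image size.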

### A ConvNet for the 2020s

- **Authors:** Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie
- **Journal:** 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- **Published:** 2022-06-01
- **DOI:** [10.1109/cvpr52688.2022.01167](https://doi.org/10.1109/cvpr52688.2022.01167)
- **Citations:** 6,598
- **Source:** OpenAlex
- **llms.txt:** [View](https://science-database.com/technology/computer-vision/paper/oa-W4312443924/llms.txt)

> The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) th...

### Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

- **Authors:** Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lü, Ping Luo, Ling Shao
- **Journal:** 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- **Published:** 2021-10-01
- **DOI:** [10.1109/iccv48922.2021.00061](https://doi.org/10.1109/iccv48922.2021.00061)
- **Citations:** 4,540
- **Source:** OpenAlex
- **llms.txt:** [View](https://science-database.com/technology/computer-vision/paper/oa-W3131500599/llms.txt)

> Although convolutional neural networks (CNNs) have achieved great success in computer vision, this work investigates a simpler, convolution-free backbone network useful for many dense prediction tasks. Unlike the recently-proposed Vision Transformer (ViT) that was designed for image classification specifically, we introduce the Pyramid Vision Transformer (PVT), which overcomes the difficulties of...

### PVT v2: Improved baselines with pyramid vision transformer

- **Authors:** Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lü, Ping Luo, Ling Shao
- **Journal:** Computational Visual Media
- **Published:** 2022-03-16
- **DOI:** [10.1007/s41095-022-0274-8](https://doi.org/10.1007/s41095-022-0274-8)
- **Citations:** 2,057
- **Source:** OpenAlex
- **Access:** Open Access
- **PDF:** [Download](https://link.springer.com/content/pdf/10.1007/s41095-022-0274-8.pdf)
- **llms.txt:** [View](https://science-database.com/technology/computer-vision/paper/oa-W3175515048/llms.txt)

> Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) by adding three designs: (i) a linear complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of...
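
Design (ii), the overlapping patch embedding, can be sketched in NumPy. This is a minimal illustrative sketch: patch size 7 and stride 4 follow common practice, and the linear projection PVT v2 applies to each patch is omitted:

```python
import numpy as np

def overlapping_patch_embed(x, patch_size=7, stride=4):
    """Extract overlapping patches (patch_size > stride) from an
    (H, W) image, zero-padded so neighbouring tokens share pixels
    and local continuity is preserved."""
    pad = patch_size // 2
    xp = np.pad(x, pad)
    H, W = x.shape
    tokens = []
    for i in range(0, H, stride):
        for j in range(0, W, stride):
            tokens.append(xp[i:i + patch_size, j:j + patch_size].ravel())
    return np.stack(tokens)  # (num_tokens, patch_size**2)

img = np.ones((16, 16))
tok = overlapping_patch_embed(img)  # 16 tokens, each a flattened 7x7 patch
```

Because the stride is smaller than the patch, adjacent tokens overlap, unlike the non-overlapping patchify step in a vanilla ViT.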

### Learning RoI Transformer for Oriented Object Detection in Aerial Images

- **Authors:** Jian Ding, Nan Xue, Yang Long, Gui-Song Xia, Qikai Lu
- **Published:** 2019-06-01
- **DOI:** [10.1109/cvpr.2019.00296](https://doi.org/10.1109/cvpr.2019.00296)
- **Citations:** 1,464
- **Source:** OpenAlex
- **llms.txt:** [View](https://science-database.com/technology/computer-vision/paper/oa-W2964979676/llms.txt)

> Object detection in aerial images is an active yet challenging task in computer vision because of the bird’s-eye view perspective, the highly complex backgrounds, and the variant appearances of objects. Especially when detecting densely packed objects in aerial images, methods relying on horizontal proposals for common object detection often introduce mismatches between the Region of Interests (Ro...
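
The oriented boxes that rotated-RoI methods like this regress can be illustrated with a small geometry helper. A hypothetical sketch using the common (cx, cy, w, h, angle) parameterisation, not the paper's own code:

```python
import numpy as np

def obb_corners(cx, cy, w, h, angle):
    """Corner points of an oriented bounding box given centre,
    size, and rotation angle in radians; at angle 0 this reduces
    to an ordinary axis-aligned (horizontal) box."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])               # 2D rotation matrix
    half = np.array([[-w, -h], [w, -h], [w, h], [-w, h]]) / 2.0
    return half @ R.T + np.array([cx, cy])        # rotate, then translate

corners = obb_corners(10.0, 10.0, 4.0, 2.0, 0.0)  # axis-aligned case
```

The extra angle parameter is what lets such detectors fit densely packed, arbitrarily rotated objects that horizontal proposals would merge or clip.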

### UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery

- **Authors:** Libo Wang, Rui Li, Ce Zhang, Shenghui Fang, Chenxi Duan, Xiaoliang Meng, Peter M. Atkinson
- **Journal:** ISPRS Journal of Photogrammetry and Remote Sensing
- **Published:** 2022-06-24
- **DOI:** [10.1016/j.isprsjprs.2022.06.008](https://doi.org/10.1016/j.isprsjprs.2022.06.008)
- **Citations:** 1,059
- **Source:** OpenAlex
- **Access:** Open Access
- **PDF:** [Download](https://arxiv.org/pdf/2109.08937)
- **llms.txt:** [View](https://science-database.com/technology/computer-vision/paper/oa-W4283450732/llms.txt)

### Transformers in medical imaging: A survey

- **Authors:** Fahad Shamshad, Salman Khan, Syed Waqas Zamir, Muhammad Haris Khan, Munawar Hayat, Fahad Shahbaz Khan, Huazhu Fu
- **Journal:** Medical Image Analysis
- **Published:** 2023-04-05
- **DOI:** [10.1016/j.media.2023.102802](https://doi.org/10.1016/j.media.2023.102802)
- **Citations:** 1,050
- **Source:** OpenAlex
- **llms.txt:** [View](https://science-database.com/technology/computer-vision/paper/oa-W4362603432/llms.txt)

### BiFormer: Vision Transformer with Bi-Level Routing Attention

- **Authors:** Lei Zhu, Xinjiang Wang, Zhanghan Ke, Wei Zhang, Rynson W. H. Lau
- **Published:** 2023-06-01
- **DOI:** [10.1109/cvpr52729.2023.00995](https://doi.org/10.1109/cvpr52729.2023.00995)
- **Citations:** 985
- **Source:** OpenAlex
- **llms.txt:** [View](https://science-database.com/technology/computer-vision/paper/oa-W4386075524/llms.txt)

> As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint as pairwise token interaction across all spatial locations is computed. A series of works attempt to alleviate this problem by introducing handcrafted and content-agnostic sparsity into ...
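
The quadratic cost that the abstract refers to is visible in even a minimal attention sketch: the score matrix has one entry per token pair. An illustrative single-head NumPy version (not BiFormer's routed variant):

```python
import numpy as np

def full_attention(q, k, v):
    """Vanilla attention over N tokens: the (N, N) score matrix is
    the pairwise interaction whose O(N^2) compute and memory sparse
    and routed attention variants try to avoid."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (N, N)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # rows sum to 1
    return w @ v

N, d = 64, 8
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, N, d))
out = full_attention(q, k, v)  # score matrix already has N*N = 4096 entries
```

Doubling the token count quadruples the score matrix, which is why high-resolution images make dense attention expensive.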

### Transformers in Time Series: A Survey

- **Authors:** Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, Liang Sun
- **Published:** 2023-08-01
- **DOI:** [10.24963/ijcai.2023/759](https://doi.org/10.24963/ijcai.2023/759)
- **Citations:** 942
- **Source:** OpenAlex
- **Access:** Open Access
- **PDF:** [Download](https://www.ijcai.org/proceedings/2023/0759.pdf)
- **llms.txt:** [View](https://science-database.com/technology/computer-vision/paper/oa-W4385763767/llms.txt)

> Transformers have achieved superior performances in many tasks in natural language processing and computer vision, which also triggered great interest in the time series community. Among multiple advantages of Transformers, the ability to capture long-range dependencies and interactions is especially attractive for time series modeling, leading to exciting progress in various time series applicati...

### Visual attention network

- **Authors:** Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu
- **Journal:** Computational Visual Media
- **Published:** 2023-07-28
- **DOI:** [10.1007/s41095-023-0364-2](https://doi.org/10.1007/s41095-023-0364-2)
- **Citations:** 921
- **Source:** OpenAlex
- **Access:** Open Access
- **PDF:** [Download](https://link.springer.com/content/pdf/10.1007/s41095-023-0364-2.pdf)
- **llms.txt:** [View](https://science-database.com/technology/computer-vision/paper/oa-W4385346076/llms.txt)

> While originally designed for natural language processing tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision: (1) treating images as 1D sequences neglects their 2D structures; (2) the quadratic complexity is too expensive for high-resolution images; (3...

### 3D Human Pose Estimation with Spatial and Temporal Transformers

- **Authors:** Ce Zheng, Sijie Zhu, Matías Mendieta, Taojiannan Yang, Chen Chen, Zhengming Ding
- **Journal:** 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- **Published:** 2021-10-01
- **DOI:** [10.1109/iccv48922.2021.01145](https://doi.org/10.1109/iccv48922.2021.01145)
- **Citations:** 624
- **Source:** OpenAlex
- **llms.txt:** [View](https://science-database.com/technology/computer-vision/paper/oa-W3136525061/llms.txt)

> Transformer architectures have become the model of choice in natural language processing and are now being introduced into computer vision tasks such as image classification, object detection, and semantic segmentation. However, in the field of human pose estimation, convolutional architectures still remain dominant. In this work, we present PoseFormer, a purely transformer-based approach for 3D h...

### Transformers in medical image analysis

- **Authors:** Kelei He, Gan Chen, Zhuoyuan Li, Islem Rekik, Zihao Yin, Ji Wen, Yang Gao, Qian Wang, Junfeng Zhang, Dinggang Shen
- **Journal:** Intelligent Medicine
- **Published:** 2022-08-24
- **DOI:** [10.1016/j.imed.2022.07.002](https://doi.org/10.1016/j.imed.2022.07.002)
- **Citations:** 422
- **Source:** OpenAlex
- **Access:** Open Access
- **PDF:** [Download](https://doi.org/10.1016/j.imed.2022.07.002)
- **llms.txt:** [View](https://science-database.com/technology/computer-vision/paper/oa-W4293163051/llms.txt)

> Transformers have dominated the field of natural language processing and have recently made an impact in the area of computer vision. In the field of medical image analysis, transformers have also been successfully applied to full-stack clinical applications, including image synthesis/reconstruction, registration, segmentation, detection, and diagnosis. This paper aimed to promote awareness of the...

### Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

- **Authors:** Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo
- **Journal:** arXiv (Cornell University)
- **Published:** 2021-03-25
- **DOI:** [10.48550/arxiv.2103.14030](https://doi.org/10.48550/arxiv.2103.14030)
- **Citations:** 373
- **Source:** OpenAlex
- **Access:** Open Access
- **PDF:** [Download](https://arxiv.org/pdf/2103.14030)
- **llms.txt:** [View](https://science-database.com/technology/computer-vision/paper/oa-W3202406646/llms.txt)

> This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differ...

### Computer Vision Based Transfer Learning-Aided Transformer Model for Fall Detection and Prediction

- **Authors:** Sheldon Mccall, Shina Samuel Kolawole, Afreen Naz, Liyun Gong, Syed Waqar Ahmed, Pandey Shourya Prasad, Miao Yu, James Wingate, Saeid Pourroostaei Ardakani
- **Journal:** IEEE Access
- **Published:** 2024-01-01
- **DOI:** [10.1109/access.2024.3368065](https://doi.org/10.1109/access.2024.3368065)
- **Citations:** 22
- **Source:** OpenAlex
- **Access:** Open Access
- **PDF:** [Download](https://ieeexplore.ieee.org/ielx7/6287639/6514899/10440637.pdf)
- **llms.txt:** [View](https://science-database.com/technology/computer-vision/paper/oa-W4391953475/llms.txt)

> Falls bring about significant risks to individuals’ well-being and independence, prompting widespread public health concerns. Swift detection and even predicting the risk of falls are crucial for implementing effective measures to alleviate the adverse consequences associated with such incidents. This study presents a new framework for identifying and forecasting fall risks. Our approach utilizes ...

### A Computer Vision Enabled damage detection model with improved YOLOv5 based on Transformer Prediction Head

- **Authors:** Arunabha M. Roy, Jayabrata Bhaduri
- **Journal:** arXiv (Cornell University)
- **Published:** 2023-03-07
- **DOI:** [10.48550/arxiv.2303.04275](https://doi.org/10.48550/arxiv.2303.04275)
- **Citations:** 22
- **Source:** OpenAlex
- **Access:** Open Access
- **PDF:** [Download](https://arxiv.org/pdf/2303.04275)
- **llms.txt:** [View](https://science-database.com/technology/computer-vision/paper/oa-W4323706312/llms.txt)

> Objective: Computer vision-based up-to-date accurate damage classification and localization are of decisive importance for infrastructure monitoring, safety, and the serviceability of civil infrastructure. Current state-of-the-art deep learning (DL)-based damage detection models, however, often lack superior feature extraction capability in complex and noisy environments, limiting the development o...

---

*Generated by [science-database.com](https://science-database.com) — The Knowledge Interface*  
*Full data available via [JSON API](https://science-database.com/api/v1/technology/computer-vision)*