PubMed 2024 May
Pereira Rafael, Mendes Carla, Ribeiro José, Ribeiro Roberto, Miragaia Rolando, Rodrigues Nuno, Costa Nuno, Pereira António
Sensors (Basel, Switzerland)
Emotion recognition has become increasingly important in the field of Deep Learning (DL) and computer vision due to its broad applicability to human-computer interaction (HCI) in areas such as psychology, healthcare, and entertainment. In this paper, we conduct a systematic review of facial and pose emotion recognition using DL and computer vision, analyzing and evaluating 77 papers from different sources under Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Our review covers several topics, including the scope and purpose of the studies, the methods employed, and the datasets used. The studies were categorized based on a proposed taxonomy that describes the type of expressions used for emotion detection, the testing environment, the currently relevant DL methods, and the datasets used. The taxonomy of methods in our review includes Convolutional Neural Network (CNN), Faster Region-based Convolutional Neural Network (R-CNN), Vision Transformer (ViT), and "Other NNs", which are the most commonly used models in the analyzed studies, indicating their prevalence in the field. Hybrid and augmented models are not explicitly categorized within this taxonomy, but they remain important to the field. This review offers an understanding of state-of-the-art computer vision algorithms and datasets for emotion recognition through facial expressions and body poses, allowing researchers to understand the field's fundamental components and trends.
PubMed 2023 Aug
Bayat Nasrin, Kim Jong-Hwan, Choudhury Renoa, Kadhim Ibrahim F, Al-Mashhadani Zubaidah, Aldritz Dela Virgen Mark, Latorre Reuben, De La Paz Ricardo, Park Joon-Hyuk
Journal of imaging
This paper presents a system that utilizes vision transformers and multimodal feedback modules to facilitate navigation and collision avoidance for the visually impaired. By implementing vision transformers, the system achieves accurate object detection, enabling the real-time identification of objects in front of the user. Semantic segmentation and the algorithms developed in this work provide a means to generate a trajectory vector for each object identified by the vision transformer and to detect objects that are likely to intersect with the user's walking path. Audio and vibrotactile feedback modules are integrated to convey collision warnings through multimodal feedback. The dataset used to create the model was captured from both indoor and outdoor settings under different weather conditions at different times across multiple days, resulting in 27,867 photos spanning 24 different classes. Classification results showed good performance (95% accuracy), supporting the efficacy and reliability of the proposed model. The design and control methods of the multimodal feedback modules for collision warning are also presented, while experimental validation of their usability and efficiency remains future work. The demonstrated performance of the vision transformer and the presented algorithms, in conjunction with the multimodal feedback modules, shows promising prospects for the navigation assistance of individuals with vision impairment.
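As a purely hypothetical sketch of the kind of path-intersection test the abstract alludes to (the paper's actual algorithm is not reproduced here), the snippet below extrapolates an object's recent motion and flags it when its predicted positions enter a corridor in front of the user; the coordinate convention, corridor width, and function name are all assumptions.

```python
import numpy as np

def collision_warning(track, corridor_half_width=0.5, horizon=30):
    """Flag an object whose extrapolated path crosses the user's walking corridor.

    track: (T, 2) array of the object's (x, z) positions in metres, in the
    user's frame (x = lateral offset, z = distance ahead). Hypothetical layout.
    """
    track = np.asarray(track, dtype=float)
    if len(track) < 2:
        return False
    velocity = track[-1] - track[-2]          # per-frame motion vector
    for step in range(1, horizon + 1):        # extrapolate a few frames ahead
        x, z = track[-1] + step * velocity
        if 0.0 < z < 2.0 and abs(x) < corridor_half_width:
            return True                       # predicted to enter the corridor
    return False

# Example: an object approaching from the right and drifting into the path.
print(collision_warning([[1.5, 4.0], [1.3, 3.6], [1.1, 3.2]]))  # True
```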
PubMed 2023 Mar
Mardani Konstantina, Vretos Nicholas, Daras Petros
Sensors (Basel, Switzerland)
Fire detection in videos is a valuable feature in surveillance systems, as it can prevent hazardous situations. A model that is both accurate and fast is necessary to address this task effectively. In this work, a transformer-based network for the detection of fire in videos is proposed. It is an encoder-decoder architecture that consumes the current frame under examination in order to compute attention scores. These scores denote which parts of the input frame are most relevant for the expected fire detection output. The model recognizes fire in video frames and specifies its exact location in the image plane in real time, in the form of a segmentation mask, as shown in the experimental results. The proposed methodology has been trained and evaluated for two computer vision tasks: the full-frame classification task (fire/no fire in frames) and the fire localization task. In comparison with state-of-the-art models, the proposed method achieves outstanding results in both tasks, with 97% accuracy, 20.4 fps processing time, and a 0.02 false positive rate for fire localization, and 97% for the F-score and recall metrics in the full-frame classification task.
PubMed 2021 Jan
Wang Chongwen, Wang Zicheng
Frontiers in neurorobotics
Facial action unit (AU) detection is an important task in affective computing and has attracted extensive attention in the fields of computer vision and artificial intelligence. Previous studies on AU detection usually encode complex regional feature representations with manually defined facial landmarks and learn to model the relationships among AUs via graph neural networks. Although some progress has been achieved, it remains difficult for existing methods to capture the exclusive and concurrent relationships among different combinations of facial AUs. To circumvent this issue, we propose a new progressive multi-scale vision transformer (PMVT) to capture the complex relationships among different AUs for a wide range of expressions in a data-driven fashion. PMVT is based on a multi-scale self-attention mechanism that can flexibly attend to a sequence of image patches to encode the critical cues for AUs. Compared with previous AU detection methods, the benefits of PMVT are two-fold: (i) PMVT does not rely on manually defined facial landmarks to extract regional representations, and (ii) PMVT is capable of encoding facial regions with adaptive receptive fields, thus facilitating flexible representation of different AUs. Experimental results show that PMVT improves AU detection accuracy on the popular BP4D and DISFA datasets. Compared with other state-of-the-art AU detection methods, PMVT obtains consistent improvements. Visualization results show that PMVT automatically perceives the discriminative facial regions for robust AU detection.
PubMed 2024 Aug
Phaphuangwittayakul Aniwat, Harnpornchai Napat, Ying Fangli, Zhang Jinming
Journal of imaging
Railway track defects pose significant safety risks and can lead to accidents, economic losses, and loss of life. Traditional manual inspection methods are either time-consuming, costly, or prone to human error. This paper proposes RailTrack-DaViT, a novel vision transformer-based approach for railway track defect classification. By leveraging the Dual Attention Vision Transformer (DaViT) architecture, RailTrack-DaViT effectively captures both global and local information, enabling accurate defect detection. The model is trained and evaluated on multiple datasets including rail, fastener and fishplate, multi-faults, and ThaiRailTrack. A comprehensive analysis of the model's performance is provided including confusion matrices, training visualizations, and classification metrics. RailTrack-DaViT demonstrates superior performance compared to state-of-the-art CNN-based methods, achieving the highest accuracies: 96.9% on the rail dataset, 98.9% on the fastener and fishplate dataset, and 98.8% on the multi-faults dataset. Moreover, RailTrack-DaViT outperforms baselines on the ThaiRailTrack dataset with 99.2% accuracy, quickly adapts to unseen images, and shows better model stability during fine-tuning. This capability can significantly reduce time consumption when applying the model to novel datasets in practical applications.
PubMed 2023 Oct
De Silva Malithi, Brown Dane
Sensors (Basel, Switzerland)
Plant diseases pose a critical threat to global agricultural productivity, demanding timely detection for effective crop yield management. Traditional methods for disease identification are laborious and require specialised expertise. Leveraging cutting-edge deep learning algorithms, this study explores innovative approaches to plant disease identification, combining Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to enhance accuracy. A multispectral dataset was meticulously collected to facilitate this research using six 50 mm filters, covering both the visible and several near-infrared (NIR) wavelengths. Among the models employed, ViT-B16 notably achieved the highest test accuracy, precision, recall, and F1 score across all filters, with averages of 83.3%, 90.1%, 90.75%, and 89.5%, respectively. Furthermore, a comparative analysis highlights the pivotal role of balanced datasets in selecting the appropriate wavelength and deep learning model for robust disease identification. These findings promise to advance crop disease management in real-world agricultural applications and contribute to global food security. The study underscores the significance of machine learning in transforming plant disease diagnostics and encourages further research in this field.
PubMed 2023 Nov
Waseem Sabir Muhammad, Farhan Muhammad, Almalki Nabil Sharaf, Alnfiai Mrim M, Sampedro Gabriel Avelino
Frontiers in medicine
Pulmonary Fibrosis (PF) is an incurable respiratory condition distinguished by permanent fibrotic alterations in the pulmonary tissue. Hence, it is crucial to diagnose PF swiftly and precisely. The existing research on deep learning-based pulmonary fibrosis detection methods has limitations, including small dataset sample sizes and a lack of standardization in data preprocessing and evaluation metrics. This study presents a comparative analysis of four vision transformers regarding their efficacy in accurately detecting and classifying patients with Pulmonary Fibrosis and their ability to localize abnormalities within images obtained from Computerized Tomography (CT) scans. The dataset consisted of 13,486 samples selected out of 24,647 from the Pulmonary Fibrosis dataset, which included both PF-positive CT and normal images that underwent preprocessing. The preprocessed images were divided into three sets: the training set, which accounted for 80% of the total images; the validation set, which comprised 10%; and the test set, which also consisted of 10%. The vision transformer models, including ViT, MobileViT2, ViTMSN, and BEiT, were subjected to training and validation procedures, during which hyperparameters such as the learning rate and batch size were fine-tuned. The overall performance of the optimized architectures has been assessed using various performance metrics to showcase the consistent performance of the fine-tuned models. Regarding performance, ViT showed superior validation and testing accuracy and loss minimization for CT images when trained for a single epoch with a tuned learning rate of 0.0001. The results were as follows: validation accuracy of 99.85%, testing accuracy of 100%, training loss of 0.0075, and validation loss of 0.0047. The experimental evaluation of the independently collected data gives empirical evidence that the optimized Vision Transformer (ViT) architecture exhibited superior performance compared to all other optimized architectures. It achieved a flawless score of 1.0 in various standard performance metrics, including Sensitivity, Specificity, Accuracy, F1-score, Precision, Recall, Matthews Correlation Coefficient (MCC), Precision-Recall Area Under the Curve (AUC-PR), and Area Under the Receiver Operating Characteristic Curve (ROC-AUC). Therefore, the optimized Vision Transformer (ViT) functions as a reliable diagnostic tool for the automated categorization of individuals with pulmonary fibrosis (PF) using chest computed tomography (CT) scans.
PubMed 2023 Jun
Li Alexa L, Feng Moira, Wang Zixi, Baxter Sally L, Huang Lingling, Arnett Justin, Bartsch Dirk-Uwe G, Kuo David E, Saseendrakumar Bharanidharan Radha, Guo Joy, Nudleman Eric
Ophthalmology science
OBJECTIVE: To develop automated algorithms for the detection of posterior vitreous detachment (PVD) using OCT imaging.
DESIGN: Evaluation of a diagnostic test or technology.
SUBJECTS: Overall, 42 385 consecutive OCT images (865 volumetric OCT scans) obtained with Heidelberg Spectralis from 865 eyes from 464 patients at an academic retina clinic between October 2020 and December 2021 were retrospectively reviewed.
METHODS: We developed a customized computer vision algorithm based on image filtering and edge detection to detect the posterior vitreous cortex for the determination of PVD status. A second deep learning (DL) image classification model based on convolutional neural networks and ResNet-50 architecture was also trained to identify PVD status from OCT images. The training dataset consisted of 674 OCT volume scans (33 026 OCT images), while the validation testing set consisted of 73 OCT volume scans (3577 OCT images). Overall, 118 OCT volume scans (5782 OCT images) were used as a separate external testing dataset.
MAIN OUTCOME MEASURES: Accuracy, sensitivity, specificity, F1-scores, and area under the receiver operator characteristic curves (AUROCs) were measured to assess the performance of the automated algorithms.
RESULTS: Both the customized computer vision algorithm and DL model results were largely in agreement with the PVD status labeled by trained graders. The DL approach achieved an accuracy of 90.7% and an F1-score of 0.932 with a sensitivity of 100% and a specificity of 74.5% for PVD detection from an OCT volume scan. The AUROC was 89% at the image level and 96% at the volume level for the DL model. The customized computer vision algorithm attained an accuracy of 89.5% and an F1-score of 0.912 with a sensitivity of 91.9% and a specificity of 86.1% on the same task.
CONCLUSIONS: Both the computer vision algorithm and the DL model applied on OCT imaging enabled reliable detection of PVD status, demonstrating the potential for OCT-based automated PVD status classification to assist with vitreoretinal surgical planning.
FINANCIAL DISCLOSURES: Proprietary or commercial disclosure may be found after the references.
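For readers who want to reproduce the kind of outcome measures reported in the structured abstract above (accuracy, sensitivity, specificity, F1-score, AUROC), a minimal scikit-learn sketch follows; the labels and scores are made up purely for illustration and are not the study's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix

# Illustrative labels/scores only; the study's per-image data are not reproduced here.
y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0])      # 1 = PVD present
y_score = np.array([0.9, 0.8, 0.4, 0.7, 0.2, 0.6, 0.95, 0.1])
y_pred  = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy   ", accuracy_score(y_true, y_pred))
print("sensitivity", tp / (tp + fn))               # recall on the positive class
print("specificity", tn / (tn + fp))
print("F1-score   ", f1_score(y_true, y_pred))
print("AUROC      ", roc_auc_score(y_true, y_score))
```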
NASA ADS 2021-10
1188 citations Liu, Ze, Lin, Yutong, Cao, Yue, Hu, Han, Wei, Yixuan, Zhang, Zheng, Lin, Stephen, Guo, Baining
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
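To make the window-based attention idea concrete, here is a minimal NumPy sketch of the (shifted) window partitioning described in the abstract; the attention computed inside each window and all learned parameters are omitted, and the feature-map and window sizes are illustrative.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping (win, win, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, C)

H = W = 8; C = 4; win = 4
feat = np.random.randn(H, W, C)

regular = window_partition(feat, win)   # layer l: self-attention inside regular windows
# layer l+1: cyclically shift the map by win/2 so each new window mixes tokens
# from four previously separate windows, giving the cross-window connections.
shifted = window_partition(np.roll(feat, shift=(-win // 2, -win // 2), axis=(0, 1)), win)
print(regular.shape, shifted.shape)     # (4, 4, 4, 4) twice
```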
NASA ADS 2021-03
1809 citations Liu, Ze, Lin, Yutong, Cao, Yue, Hu, Han, Wei, Yixuan, Zhang, Zheng, Lin, Stephen, Guo, Baining
arXiv e-prints
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
NASA ADS 2020-10
2820 citations Zhu, Xizhou, Su, Weijie, Lu, Lewei, Li, Bin, Wang, Xiaogang, Dai, Jifeng
arXiv e-prints
DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10x fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code is released at https://github.com/fundamentalvision/Deformable-DETR.
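A rough NumPy illustration of the sampling idea: each query attends to only a few bilinearly sampled points around its reference location, combined with softmax weights. In Deformable DETR the offsets and weights are predicted from the query by learned linear layers; here they are passed in directly, and all shapes and names are illustrative.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample feat (H, W, C) at a fractional location (y, x)."""
    H, W, _ = feat.shape
    y = np.clip(y, 0, H - 1); x = np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

def deformable_attention(feat, reference, offsets, logits):
    """Aggregate K sampled values around a reference point with softmax weights."""
    weights = np.exp(logits - logits.max()); weights /= weights.sum()
    samples = np.stack([bilinear_sample(feat, reference[0] + dy, reference[1] + dx)
                        for dy, dx in offsets])
    return weights @ samples                      # (C,) attended feature for this query

feat = np.random.randn(16, 16, 8)
out = deformable_attention(feat, reference=(7.5, 7.5),
                           offsets=np.random.randn(4, 2),   # K = 4 sampling points
                           logits=np.random.randn(4))
print(out.shape)                                  # (8,)
```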
NASA ADS 2006
119 citations Töreyin, B. Uğur, Dedeoğlu, Yiğithan, Güdükbay, Uğur, Çetin, A. Enis
Pattern Recognition Letters
This paper proposes a novel method to detect fire and/or flames in real-time by processing the video data generated by an ordinary camera monitoring a scene. In addition to ordinary motion and color clues, flame and fire flicker is detected by analyzing the video in the wavelet domain. Quasi-periodic behavior in flame boundaries is detected by performing temporal wavelet transform. Color variations in flame regions are detected by computing the spatial wavelet transform of moving fire-colored regions. Another clue used in the fire detection algorithm is the irregularity of the boundary of the fire-colored region. All of the above clues are combined to reach a final decision. Experimental results show that the proposed method is very successful in detecting fire and/or flames. In addition, it drastically reduces the false alarms issued to ordinary fire-colored moving objects as compared to the methods using only motion and color clues.
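A simplified sketch of the temporal flicker cue: a one-level Haar detail band stands in for the temporal wavelet transform used in the paper, and frequent sign changes of the detail coefficients indicate quasi-periodic flicker. The threshold, signal model, and function name are assumptions, not the paper's parameters.

```python
import numpy as np

def flicker_score(intensity, threshold=0.05):
    """Score temporal flicker of one fire-coloured pixel from its intensity series.

    Pairwise differences approximate a one-level Haar high-pass (detail) band;
    frequent zero crossings of the detail signal indicate the quasi-periodic
    flicker typical of flame boundaries, unlike steady fire-coloured objects.
    """
    x = np.asarray(intensity, dtype=float)
    detail = (x[1::2] - x[0::2]) / np.sqrt(2.0)          # Haar detail coefficients
    active = detail[np.abs(detail) > threshold]
    if len(active) < 2:
        return 0.0
    crossings = np.sum(np.signbit(active[1:]) != np.signbit(active[:-1]))
    return crossings / (len(active) - 1)                 # 1.0 = alternates every step

t = np.arange(64)
flame  = 0.6 + 0.2 * np.sin(2 * np.pi * 8 * t / 64) + 0.05 * np.random.randn(64)
static = 0.6 + 0.005 * np.random.randn(64)               # e.g. a red parked car
print(flicker_score(flame), flicker_score(static))       # noticeably higher for the flame
```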
CORE 2016-04-11
Kuen, Jason, Wang, Gang, Wang, Zhenhua
Convolutional-deconvolution networks can be adopted to perform end-to-end saliency detection. However, they do not work well with objects of multiple scales. To overcome such a limitation, in this work, we propose a recurrent attentional convolutional-deconvolution network (RACDNN). Using spatial transformer and recurrent network units, RACDNN is able to iteratively attend to selected image sub-regions to perform saliency refinement progressively. Besides tackling the scale problem, RACDNN can also learn context-aware features from past iterations to enhance saliency refinement in future iterations. Experiments on several challenging saliency detection datasets validate the effectiveness of RACDNN, and show that RACDNN outperforms state-of-the-art saliency detection methods. Comment: CVPR 2016
arXiv 2019-09-23
Irene Amerini, Elena Balashova, Sayna Ebrahimi, Kathryn Leonard, Arsha Nagrani, Amaia Salvador
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019, pp. 0-0
In this paper we present the Women in Computer Vision Workshop - WiCV 2019, organized in conjunction with CVPR 2019. This event is meant to increase the visibility and inclusion of women researchers in the computer vision field. Computer vision and machine learning have made incredible progress over the past years, but the number of female researchers is still low both in academia and in industry. WiCV is organized for the following reasons: to raise the visibility of female researchers, to increase collaboration between them, and to provide mentorship to junior female researchers in the field. In this paper, we present a report of trends over the past years, along with a summary of statistics regarding presenters, attendees, and sponsorship for the current workshop.
arXiv 2022-05-10
Malika Nisal Ratnayake, Don Chathurika Amarathunga, Asaduz Zaman, Adrian G. Dyer, Alan Dorin
International Journal of Computer Vision (2022)
Insects are the most important global pollinators of crops and play a key role in maintaining the sustainability of natural ecosystems. Insect pollination monitoring and management are therefore essential for improving crop production and food security. Computer-vision-facilitated pollinator monitoring can intensify data collection beyond what is feasible using manual approaches. The new data it generates may provide a detailed understanding of insect distributions and facilitate fine-grained analysis sufficient to predict their pollination efficacy and underpin precision pollination. Current computer-vision-facilitated insect tracking in complex outdoor environments is restricted in spatial coverage and often constrained to a single insect species. This limits its relevance to agriculture. Therefore, in this article we introduce a novel system to facilitate markerless data capture for insect counting, insect motion tracking, behaviour analysis and pollination prediction across large agricultural areas. Our system comprises edge computing multi-point video recording, offline automated multispecies insect counting, tracking and behavioural analysis. We implement and test our system on a commercial berry farm to demonstrate its capabilities. Our system successfully tracked four insect varieties, at nine monitoring stations within polytunnels, obtaining an F-score above 0.8 for each variety. The system enabled calculation of key metrics to assess the relative pollination impact of each insect variety. With this technological advancement, detailed, ongoing data collection for precision pollination becomes achievable. This is important to inform growers and apiarists managing crop pollination, as it allows data-driven decisions to be made to improve food production and food security.
arXiv 2022-06-21
Weixuan Sun, Zhen Qin, Hui Deng, Jianyuan Wang, Yi Zhang, Kaihao Zhang, Nick Barnes, Stan Birchfield, Lingpeng Kong, Yiran Zhong
arXiv:2206.10552v2 [cs.CV]
Vision transformers have shown great success on numerous computer vision tasks. However, their central component, softmax attention, prohibits vision transformers from scaling up to high-resolution images, because both its computational complexity and memory footprint are quadratic. Although linear attention was introduced in natural language processing (NLP) tasks to mitigate a similar issue, directly applying existing linear attention to vision transformers may not lead to satisfactory results. We investigate this problem and find that computer vision tasks focus more on local information compared with NLP tasks. Based on this observation, we present Vicinity Attention, which introduces a locality bias to vision transformers with linear complexity. Specifically, for each image patch, we adjust its attention weight based on its 2D Manhattan distance to its neighbouring patches. In this case, neighbouring patches receive stronger attention than far-away patches. Moreover, since our Vicinity Attention requires the token length to be much larger than the feature dimension to show its efficiency advantages, we further propose a new Vicinity Vision Transformer (VVT) structure to reduce the feature dimension without degrading accuracy. We perform extensive experiments on the CIFAR-100, ImageNet-1K, and ADE20K datasets to validate the effectiveness of our method. Our method has a slower growth rate of GFLOPs than previous transformer-based and convolution-based networks as the input resolution increases. In particular, our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
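The locality bias can be sketched as follows: attention scores are penalised by the 2D Manhattan distance between patch coordinates, so nearby patches dominate. Note that the snippet uses ordinary quadratic softmax attention only to make the bias visible; the paper folds the bias into a linear attention to keep the cost linear, and the bias strength alpha is an assumed parameter.

```python
import numpy as np

def vicinity_biased_attention(q, k, v, grid, alpha=0.5):
    """Softmax attention with a 2D Manhattan-distance locality bias.

    q, k, v: (N, d) patch features; grid: (N, 2) integer (row, col) patch
    coordinates. Nearer patches get larger weights, distant ones are damped.
    """
    d = q.shape[1]
    scores = q @ k.T / np.sqrt(d)
    manhattan = np.abs(grid[:, None, :] - grid[None, :, :]).sum(-1)   # (N, N)
    scores = scores - alpha * manhattan            # penalise far-away patches
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

side, d = 8, 16
grid = np.stack(np.meshgrid(np.arange(side), np.arange(side), indexing="ij"), -1).reshape(-1, 2)
x = np.random.randn(side * side, d)
print(vicinity_biased_attention(x, x, x, grid).shape)   # (64, 16)
```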
arXiv 2020-10-02
Viktor Shipitsin, Iaroslav Bespalov, Dmitry V. Dylov
Computer Vision and Image Understanding, V. 223, 103519, 2022
We devise a universal adaptive neural layer to "learn" an optimal frequency filter for each image together with the weights of the base neural network that performs some computer vision task. The proposed approach takes the source image in the spatial domain, automatically selects the best frequencies from the frequency domain, and transmits the inverse-transformed image to the main neural network. Remarkably, such a simple add-on layer dramatically improves the performance of the main network regardless of its design. We observe that light networks gain a noticeable boost in performance metrics, whereas the training of heavy ones converges faster when our adaptive layer is allowed to "learn" alongside the main architecture. We validate the idea on four classical computer vision tasks: classification, segmentation, denoising, and erasing, considering popular natural and medical data benchmarks.
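The layer's data flow can be sketched in a few lines of NumPy: transform the image to the frequency domain, apply a per-frequency mask, and transform back. In the paper the mask is learned jointly with the base network; the fixed low-pass mask below is only a stand-in, and the sizes are illustrative.

```python
import numpy as np

def frequency_filter_layer(image, mask):
    """Apply a per-frequency mask to an image and return the filtered image.

    The mask plays the role of the "learned" frequency filter; here it is a
    fixed low-pass mask purely for illustration.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(image))             # to the frequency domain
    filtered = spectrum * mask                                  # select frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))   # back to pixel space

H = W = 64
yy, xx = np.mgrid[-H // 2:H // 2, -W // 2:W // 2]
lowpass = (np.sqrt(yy**2 + xx**2) < 12).astype(float)          # keep only low frequencies
image = np.random.rand(H, W)
print(frequency_filter_layer(image, lowpass).shape)            # (64, 64)
```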
OpenAlex 2021-10-01
28738 citations Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
OpenAlex 2022-06-01
6601 citations Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
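For reference, here is a PyTorch sketch of the kind of residual block this "modernization" arrives at (a 7x7 depthwise convolution, LayerNorm, and an inverted 4x MLP with GELU); layer scale and stochastic depth used in the full ConvNeXt are omitted, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Sketch of a ConvNeXt-style residual block: depthwise 7x7 convolution,
    LayerNorm in channels-last layout, and an inverted (4x) pointwise MLP."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # inverted bottleneck: expand
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # project back

    def forward(self, x):                        # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # channels-last for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return shortcut + x.permute(0, 3, 1, 2)  # residual connection

print(ConvNeXtBlock(96)(torch.randn(1, 96, 14, 14)).shape)   # torch.Size([1, 96, 14, 14])
```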
OpenAlex 2021-10-01
4540 citations Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lü, Ping Luo, Ling Shao
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Although convolutional neural networks (CNNs) have achieved great success in computer vision, this work investigates a simpler, convolution-free backbone network useful for many dense prediction tasks. Unlike the recently proposed Vision Transformer (ViT), which was designed specifically for image classification, we introduce the Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several merits compared to the current state of the art. (1) Different from ViT, which typically yields low-resolution outputs and incurs high computational and memory costs, PVT not only can be trained on dense partitions of an image to achieve high output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the computations of large feature maps. (2) PVT inherits the advantages of both CNN and Transformer, making it a unified backbone for various vision tasks without convolutions, where it can be used as a direct replacement for CNN backbones. (3) We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection and instance and semantic segmentation. For example, with a comparable number of parameters, PVT+RetinaNet achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinaNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope that PVT can serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.
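A simplified NumPy sketch of how attention cost on large feature maps can be reduced in the spirit of PVT: queries come from every token, while keys and values come from a spatially pooled copy of the grid. The pooling factor is illustrative and the learned projections are omitted; the actual PVT uses a learned spatial-reduction attention inside a multi-stage pyramid.

```python
import numpy as np

def spatial_reduction_attention(x, grid_hw, reduction=4):
    """Attention over a large token grid with spatially reduced keys/values.

    x: (N, d) tokens laid out row-major on an (H, W) grid. Keys/values are an
    average-pooled copy of the grid, cutting the attention matrix from N x N
    to N x (N / reduction^2). Learned projections are omitted for brevity.
    """
    H, W = grid_hw
    N, d = x.shape
    r = reduction
    pooled = x.reshape(H // r, r, W // r, r, d).mean(axis=(1, 3)).reshape(-1, d)
    scores = x @ pooled.T / np.sqrt(d)            # (N, N / r^2)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ pooled                       # (N, d)

H = W = 32
tokens = np.random.randn(H * W, 64)               # a high-resolution early-stage grid
print(spatial_reduction_attention(tokens, (H, W)).shape)   # (1024, 64)
```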
OpenAlex 2022-03-16
2058 citations Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lü, Ping Luo, Ling Shao
Computational Visual Media
Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (i) a linear-complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin Transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
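The overlapping patch embedding, design (ii), can be sketched in PyTorch as a convolution whose kernel is larger than its stride, so neighbouring tokens share pixels (unlike ViT's non-overlapping 16x16 cuts); the kernel, stride, and channel sizes below are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Overlapping patch embedding: kernel > stride means adjacent patches overlap,
# which preserves local continuity between neighbouring tokens.
overlap_embed = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3),  # overlapping windows
    nn.Flatten(2),                                          # (N, C, H*W)
)

img = torch.randn(1, 3, 224, 224)
tokens = overlap_embed(img).transpose(1, 2)                 # (N, H*W, C) token sequence
print(tokens.shape)                                          # torch.Size([1, 3136, 64])
```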
Semantic Scholar 2021
30995 citations Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, B. Guo
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
TL;DR: A hierarchical Transformer whose representation is computed with Shifted windows, which has the flexibility to model at various scales and has linear computational complexity with respect to image size and will prove beneficial for all-MLP architectures.
Semantic Scholar 2024
14 citations Arman Keresh, Pakizar Shamoi
IEEE Access
Face recognition systems are increasingly used in biometric security for convenience and effectiveness. However, they remain vulnerable to spoofing attacks, where attackers use photos, videos, or masks to impersonate legitimate users. This research addresses these vulnerabilities by exploring the Vision Transformer (ViT) architecture, fine-tuned with the DINO framework utilizing CelebA-Spoof, CASIA SURF, and a proprietary dataset. The DINO framework facilitates self-supervised learning, enabling the model to learn distinguishing features from unlabeled data. We compared the performance of the proposed fine-tuned ViT model using the DINO framework against traditional models, including CNN Model EfficientNet b2, EfficientNet b2 (Noisy Student), and Mobile ViT on the face anti-spoofing task. Numerous tests on standard datasets show that the ViT model performs better than other models in terms of accuracy and resistance to different spoofing methods. Our model’s superior performance, particularly in APCER (1.6%), the most critical metric in this domain, underscores its improved ability to detect spoofing relative to other models. Additionally, we collected our own dataset from a biometric application to validate our findings further. This study highlights the superior performance of transformer-based architecture in identifying complex spoofing cues, leading to significant advancements in biometric security.
TL;DR: This study highlights the superior performance of transformer-based architecture in identifying complex spoofing cues, leading to significant advancements in biometric security.
Semantic Scholar 2025
4 citations S. S, Mohana, Ambika G, K. A, Nataraj K, Sudhangowda B S
2025 7th International Conference on Intelligent Sustainable Systems (ICISS)
Vision Transformer (ViT) is an image recognition model that uses the transformer architecture, which has numerous advantages over Convolutional Neural Networks (CNNs). It offers improved accuracy, scalability, flexibility, global context, and transferability. ViT can handle images of different sizes and aspect ratios, making it more versatile than CNNs. It can process an entire image at once, allowing it to capture global context information and long-range dependencies. Additionally, ViT pre-training on huge amounts of image data can be transferred to other image recognition tasks, making it a useful tool for transfer learning. This paper describes the differences between ViT and CNNs and how ViT splits images into patches for classification. Positional encoding of the different features is used in ViT to avoid the need for filters. The proposed implementation obtained a final top-1 prediction accuracy of 93%.
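A minimal NumPy sketch of the patch-splitting and positional-encoding step described here; the projection matrix, class token, and positional embeddings are random placeholders standing in for the learned parameters of a real ViT.

```python
import numpy as np

def to_patch_tokens(image, patch=16, dim=64, rng=np.random.default_rng(0)):
    """Split an (H, W, 3) image into ViT-style patch tokens.

    The linear projection, [CLS] token, and positional embeddings are random
    placeholders; in a real ViT they are learned parameters.
    """
    H, W, C = image.shape
    patches = (image.reshape(H // patch, patch, W // patch, patch, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * C))        # (num_patches, 16*16*3)
    proj = rng.standard_normal((patch * patch * C, dim))
    tokens = patches @ proj                                  # linear patch embedding
    cls = rng.standard_normal((1, dim))                      # [CLS] token for classification
    tokens = np.concatenate([cls, tokens], axis=0)
    tokens += rng.standard_normal(tokens.shape) * 0.02       # add positional encodings
    return tokens

print(to_patch_tokens(np.random.rand(224, 224, 3)).shape)    # (197, 64)
```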
Semantic Scholar 2024
2 citations Nayem Uddin Prince, Md. Abdullah Al Mamun, Md. Tanvir Miah Shagar, Md. Rezaul Karim Emon, Md. Sahadat Hossen Sajib
2024 IEEE International Conference on Blockchain and Distributed Systems Security (ICBDS)
Northern Bangladesh is home to most of the country's lychee cultivation, which has a major economic impact. Lychee output and quality are reduced by various leaf and fruit diseases. Deep learning is used to construct a detection system for lychee diseases so they can be identified early and accurately. We correctly labelled 6,000 photos of healthy and unhealthy lychee foliage and fruits. Six deep learning algorithms (VGG16, CustomCNN, MobileNet, InceptionV3, ResNet50, and Vision Transformer) classified the photos. Normalization and augmentation improved model resilience during data preparation. F1-score, recall, accuracy, and precision were utilized to train and evaluate the models. Vision Transformer (ViT) scored 99.91% in accuracy, recall, and F1-score, outperforming the other models. This achievement shows that ViT can better recognize complex lychee disease traits. The ViT-based disease detection algorithm can help Bangladeshi farmers identify diseases quickly and accurately. Reduced lychee loss and improved fruit quality and attractiveness should boost lychee sales in domestic and international markets.