PubMed 2024 May
Pereira Rafael, Mendes Carla, Ribeiro José, Ribeiro Roberto, Miragaia Rolando, Rodrigues Nuno, Costa Nuno, Pereira António
Sensors (Basel, Switzerland)
Emotion recognition has become increasingly important in the field of Deep Learning (DL) and computer vision due to its broad applicability to human-computer interaction (HCI) in areas such as psychology, healthcare, and entertainment. In this paper, we conduct a systematic review of facial and pose emotion recognition using DL and computer vision, analyzing and evaluating 77 papers from different sources under Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Our review covers several topics, including the scope and purpose of the studies, the methods employed, and the datasets used. The studies were categorized based on a proposed taxonomy that describes the type of expressions used for emotion detection, the testing environment, the currently relevant DL methods, and the datasets used. The taxonomy of methods in our review includes Convolutional Neural Network (CNN), Faster Region-based Convolutional Neural Network (R-CNN), Vision Transformer (ViT), and "Other NNs", which are the most commonly used models in the analyzed studies, indicating their prevalence in the field. Hybrid and augmented models are not explicitly categorized within this taxonomy, but they remain important to the field. This review offers an understanding of state-of-the-art computer vision algorithms and datasets for emotion recognition through facial expressions and body poses, allowing researchers to understand the field's fundamental components and trends.
PubMed 2023 Aug
Bayat Nasrin, Kim Jong-Hwan, Choudhury Renoa, Kadhim Ibrahim F, Al-Mashhadani Zubaidah, Aldritz Dela Virgen Mark, Latorre Reuben, De La Paz Ricardo, Park Joon-Hyuk
Journal of imaging
This paper presents a system that utilizes vision transformers and multimodal feedback modules to facilitate navigation and collision avoidance for the visually impaired. By implementing vision transformers, the system achieves accurate object detection, enabling the real-time identification of objects in front of the user. Semantic segmentation and the algorithms developed in this work provide a means to generate a trajectory vector for each object identified by the vision transformer and to detect objects that are likely to intersect with the user's walking path. Audio and vibrotactile feedback modules are integrated to convey collision warnings through multimodal feedback. The dataset used to create the model was captured from both indoor and outdoor settings under different weather conditions at different times across multiple days, resulting in 27,867 photos spanning 24 different classes. Classification results showed good performance (95% accuracy), supporting the efficacy and reliability of the proposed model. The design and control methods of the multimodal feedback modules for collision warning are also presented, while experimental validation of their usability and efficiency remains future work. The demonstrated performance of the vision transformer and the presented algorithms, in conjunction with the multimodal feedback modules, shows promising prospects for the navigation assistance of individuals with vision impairment.
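As a purely hypothetical sketch of the kind of path-intersection test the abstract alludes to (the paper's actual algorithm is not reproduced here), the snippet below extrapolates an object's recent motion and flags it when its predicted positions enter a corridor in front of the user; the coordinate convention, corridor width, and function name are all assumptions.

```python
import numpy as np

def collision_warning(track, corridor_half_width=0.5, horizon=30):
    """Flag an object whose extrapolated path crosses the user's walking corridor.

    track: (T, 2) array of the object's (x, z) positions in metres, in the
    user's frame (x = lateral offset, z = distance ahead). Hypothetical layout.
    """
    track = np.asarray(track, dtype=float)
    if len(track) < 2:
        return False
    velocity = track[-1] - track[-2]          # per-frame motion vector
    for step in range(1, horizon + 1):        # extrapolate a few frames ahead
        x, z = track[-1] + step * velocity
        if 0.0 < z < 2.0 and abs(x) < corridor_half_width:
            return True                       # predicted to enter the corridor
    return False

# Example: an object approaching from the right and drifting into the path.
print(collision_warning([[1.5, 4.0], [1.3, 3.6], [1.1, 3.2]]))  # True
```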
PubMed 2023 Mar
Mardani Konstantina, Vretos Nicholas, Daras Petros
Sensors (Basel, Switzerland)
Fire detection in videos is a valuable feature in surveillance systems, as it can prevent hazardous situations. A model that is both accurate and fast is necessary to address this task effectively. In this work, a transformer-based network for the detection of fire in videos is proposed. It is an encoder-decoder architecture that consumes the current frame under examination in order to compute attention scores. These scores denote which parts of the input frame are most relevant for the expected fire detection output. The model recognizes fire in video frames and specifies its exact location in the image plane in real time, in the form of a segmentation mask, as shown in the experimental results. The proposed methodology has been trained and evaluated for two computer vision tasks: the full-frame classification task (fire/no fire in frames) and the fire localization task. In comparison with state-of-the-art models, the proposed method achieves outstanding results in both tasks, with 97% accuracy, 20.4 fps processing time, and a 0.02 false positive rate for fire localization, and 97% for the F-score and recall metrics in the full-frame classification task.
PubMed 2021 Jan
Wang Chongwen, Wang Zicheng
Frontiers in neurorobotics
Facial action unit (AU) detection is an important task in affective computing and has attracted extensive attention in the fields of computer vision and artificial intelligence. Previous studies on AU detection usually encode complex regional feature representations with manually defined facial landmarks and learn to model the relationships among AUs via graph neural networks. Although some progress has been achieved, it remains difficult for existing methods to capture the exclusive and concurrent relationships among different combinations of facial AUs. To circumvent this issue, we propose a new progressive multi-scale vision transformer (PMVT) to capture the complex relationships among different AUs for a wide range of expressions in a data-driven fashion. PMVT is based on a multi-scale self-attention mechanism that can flexibly attend to a sequence of image patches to encode the critical cues for AUs. Compared with previous AU detection methods, the benefits of PMVT are two-fold: (i) PMVT does not rely on manually defined facial landmarks to extract regional representations, and (ii) PMVT is capable of encoding facial regions with adaptive receptive fields, thus facilitating flexible representation of different AUs. Experimental results show that PMVT improves AU detection accuracy on the popular BP4D and DISFA datasets. Compared with other state-of-the-art AU detection methods, PMVT obtains consistent improvements. Visualization results show that PMVT automatically perceives the discriminative facial regions for robust AU detection.
PubMed 2024 Aug
Phaphuangwittayakul Aniwat, Harnpornchai Napat, Ying Fangli, Zhang Jinming
Journal of imaging
Railway track defects pose significant safety risks and can lead to accidents, economic losses, and loss of life. Traditional manual inspection methods are either time-consuming, costly, or prone to human error. This paper proposes RailTrack-DaViT, a novel vision transformer-based approach for railway track defect classification. By leveraging the Dual Attention Vision Transformer (DaViT) architecture, RailTrack-DaViT effectively captures both global and local information, enabling accurate defect detection. The model is trained and evaluated on multiple datasets including rail, fastener and fishplate, multi-faults, and ThaiRailTrack. A comprehensive analysis of the model's performance is provided including confusion matrices, training visualizations, and classification metrics. RailTrack-DaViT demonstrates superior performance compared to state-of-the-art CNN-based methods, achieving the highest accuracies: 96.9% on the rail dataset, 98.9% on the fastener and fishplate dataset, and 98.8% on the multi-faults dataset. Moreover, RailTrack-DaViT outperforms baselines on the ThaiRailTrack dataset with 99.2% accuracy, quickly adapts to unseen images, and shows better model stability during fine-tuning. This capability can significantly reduce time consumption when applying the model to novel datasets in practical applications.
PubMed 2023 Oct
De Silva Malithi, Brown Dane
Sensors (Basel, Switzerland)
Plant diseases pose a critical threat to global agricultural productivity, demanding timely detection for effective crop yield management. Traditional methods for disease identification are laborious and require specialised expertise. Leveraging cutting-edge deep learning algorithms, this study explores innovative approaches to plant disease identification, combining Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to enhance accuracy. A multispectral dataset was meticulously collected to facilitate this research using six 50 mm filters, covering both the visible and several near-infrared (NIR) wavelengths. Among the models employed, ViT-B16 notably achieved the highest test accuracy, precision, recall, and F1 score across all filters, with averages of 83.3%, 90.1%, 90.75%, and 89.5%, respectively. Furthermore, a comparative analysis highlights the pivotal role of balanced datasets in selecting the appropriate wavelength and deep learning model for robust disease identification. These findings promise to advance crop disease management in real-world agricultural applications and contribute to global food security. The study underscores the significance of machine learning in transforming plant disease diagnostics and encourages further research in this field.
PubMed 2023 Nov
Waseem Sabir Muhammad, Farhan Muhammad, Almalki Nabil Sharaf, Alnfiai Mrim M, Sampedro Gabriel Avelino
Frontiers in medicine
Pulmonary Fibrosis (PF) is an incurable respiratory condition distinguished by permanent fibrotic alterations in the pulmonary tissue. Hence, it is crucial to diagnose PF swiftly and precisely. The existing research on deep learning-based pulmonary fibrosis detection methods has limitations, including small dataset sample sizes and a lack of standardization in data preprocessing and evaluation metrics. This study presents a comparative analysis of four vision transformers regarding their efficacy in accurately detecting and classifying patients with Pulmonary Fibrosis and their ability to localize abnormalities within images obtained from Computerized Tomography (CT) scans. The dataset consisted of 13,486 samples selected out of 24,647 from the Pulmonary Fibrosis dataset, which included both PF-positive CT and normal images that underwent preprocessing. The preprocessed images were divided into three sets: the training set, which accounted for 80% of the total images; the validation set, which comprised 10%; and the test set, which also consisted of 10%. The vision transformer models, including ViT, MobileViT2, ViTMSN, and BEiT, were subjected to training and validation procedures, during which hyperparameters such as the learning rate and batch size were fine-tuned. The overall performance of the optimized architectures has been assessed using various performance metrics to showcase the consistent performance of the fine-tuned models. Regarding performance, ViT showed superior validation and testing accuracy and loss minimization for CT images when trained for a single epoch with a tuned learning rate of 0.0001. The results were as follows: validation accuracy of 99.85%, testing accuracy of 100%, training loss of 0.0075, and validation loss of 0.0047. The experimental evaluation of the independently collected data gives empirical evidence that the optimized Vision Transformer (ViT) architecture exhibited superior performance compared to all other optimized architectures. It achieved a flawless score of 1.0 in various standard performance metrics, including Sensitivity, Specificity, Accuracy, F1-score, Precision, Recall, Matthews Correlation Coefficient (MCC), Precision-Recall Area Under the Curve (AUC-PR), and Area Under the Receiver Operating Characteristic Curve (ROC-AUC). Therefore, the optimized Vision Transformer (ViT) functions as a reliable diagnostic tool for the automated categorization of individuals with pulmonary fibrosis (PF) using chest computed tomography (CT) scans.
PubMed 2023 Jun
Li Alexa L, Feng Moira, Wang Zixi, Baxter Sally L, Huang Lingling, Arnett Justin, Bartsch Dirk-Uwe G, Kuo David E, Saseendrakumar Bharanidharan Radha, Guo Joy, Nudleman Eric
Ophthalmology science
OBJECTIVE: To develop automated algorithms for the detection of posterior vitreous detachment (PVD) using OCT imaging.
DESIGN: Evaluation of a diagnostic test or technology.
SUBJECTS: Overall, 42 385 consecutive OCT images (865 volumetric OCT scans) obtained with Heidelberg Spectralis from 865 eyes from 464 patients at an academic retina clinic between October 2020 and December 2021 were retrospectively reviewed.
METHODS: We developed a customized computer vision algorithm based on image filtering and edge detection to detect the posterior vitreous cortex for the determination of PVD status. A second deep learning (DL) image classification model based on convolutional neural networks and ResNet-50 architecture was also trained to identify PVD status from OCT images. The training dataset consisted of 674 OCT volume scans (33 026 OCT images), while the validation testing set consisted of 73 OCT volume scans (3577 OCT images). Overall, 118 OCT volume scans (5782 OCT images) were used as a separate external testing dataset.
MAIN OUTCOME MEASURES: Accuracy, sensitivity, specificity, F1-scores, and area under the receiver operator characteristic curves (AUROCs) were measured to assess the performance of the automated algorithms.
RESULTS: Both the customized computer vision algorithm and DL model results were largely in agreement with the PVD status labeled by trained graders. The DL approach achieved an accuracy of 90.7% and an F1-score of 0.932 with a sensitivity of 100% and a specificity of 74.5% for PVD detection from an OCT volume scan. The AUROC was 89% at the image level and 96% at the volume level for the DL model. The customized computer vision algorithm attained an accuracy of 89.5% and an F1-score of 0.912 with a sensitivity of 91.9% and a specificity of 86.1% on the same task.
CONCLUSIONS: Both the computer vision algorithm and the DL model applied on OCT imaging enabled reliable detection of PVD status, demonstrating the potential for OCT-based automated PVD status classification to assist with vitreoretinal surgical planning.
FINANCIAL DISCLOSURES: Proprietary or commercial disclosure may be found after the references.
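For readers who want to reproduce the kind of outcome measures reported in the structured abstract above (accuracy, sensitivity, specificity, F1-score, AUROC), a minimal scikit-learn sketch follows; the labels and scores are made up purely for illustration and are not the study's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix

# Illustrative labels/scores only; the study's per-image data are not reproduced here.
y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0])      # 1 = PVD present
y_score = np.array([0.9, 0.8, 0.4, 0.7, 0.2, 0.6, 0.95, 0.1])
y_pred  = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy   ", accuracy_score(y_true, y_pred))
print("sensitivity", tp / (tp + fn))               # recall on the positive class
print("specificity", tn / (tn + fp))
print("F1-score   ", f1_score(y_true, y_pred))
print("AUROC      ", roc_auc_score(y_true, y_score))
```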
NASA ADS 2021-10
1188 citations Liu, Ze, Lin, Yutong, Cao, Yue, Hu, Han, Wei, Yixuan, Zhang, Zheng, Lin, Stephen, Guo, Baining
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
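To make the window-based attention idea concrete, here is a minimal NumPy sketch of the (shifted) window partitioning described in the abstract; the attention computed inside each window and all learned parameters are omitted, and the feature-map and window sizes are illustrative.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping (win, win, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, C)

H = W = 8; C = 4; win = 4
feat = np.random.randn(H, W, C)

regular = window_partition(feat, win)   # layer l: self-attention inside regular windows
# layer l+1: cyclically shift the map by win/2 so each new window mixes tokens
# from four previously separate windows, giving the cross-window connections.
shifted = window_partition(np.roll(feat, shift=(-win // 2, -win // 2), axis=(0, 1)), win)
print(regular.shape, shifted.shape)     # (4, 4, 4, 4) twice
```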
NASA ADS 2021-03
1809 citations Liu, Ze, Lin, Yutong, Cao, Yue, Hu, Han, Wei, Yixuan, Zhang, Zheng, Lin, Stephen, Guo, Baining
arXiv e-prints
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
NASA ADS 2020-10
2820 citations Zhu, Xizhou, Su, Weijie, Lu, Lewei, Li, Bin, Wang, Xiaogang, Dai, Jifeng
arXiv e-prints
DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10x fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code is released at https://github.com/fundamentalvision/Deformable-DETR.
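A rough NumPy illustration of the sampling idea: each query attends to only a few bilinearly sampled points around its reference location, combined with softmax weights. In Deformable DETR the offsets and weights are predicted from the query by learned linear layers; here they are passed in directly, and all shapes and names are illustrative.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample feat (H, W, C) at a fractional location (y, x)."""
    H, W, _ = feat.shape
    y = np.clip(y, 0, H - 1); x = np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

def deformable_attention(feat, reference, offsets, logits):
    """Aggregate K sampled values around a reference point with softmax weights."""
    weights = np.exp(logits - logits.max()); weights /= weights.sum()
    samples = np.stack([bilinear_sample(feat, reference[0] + dy, reference[1] + dx)
                        for dy, dx in offsets])
    return weights @ samples                      # (C,) attended feature for this query

feat = np.random.randn(16, 16, 8)
out = deformable_attention(feat, reference=(7.5, 7.5),
                           offsets=np.random.randn(4, 2),   # K = 4 sampling points
                           logits=np.random.randn(4))
print(out.shape)                                  # (8,)
```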
NASA ADS 2006
119 citations Töreyin, B. Uğur, Dedeoğlu, Yiğithan, Güdükbay, Uğur, Çetin, A. Enis
Pattern Recognition Letters
This paper proposes a novel method to detect fire and/or flames in real-time by processing the video data generated by an ordinary camera monitoring a scene. In addition to ordinary motion and color clues, flame and fire flicker is detected by analyzing the video in the wavelet domain. Quasi-periodic behavior in flame boundaries is detected by performing temporal wavelet transform. Color variations in flame regions are detected by computing the spatial wavelet transform of moving fire-colored regions. Another clue used in the fire detection algorithm is the irregularity of the boundary of the fire-colored region. All of the above clues are combined to reach a final decision. Experimental results show that the proposed method is very successful in detecting fire and/or flames. In addition, it drastically reduces the false alarms issued to ordinary fire-colored moving objects as compared to the methods using only motion and color clues.
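A simplified sketch of the temporal flicker cue: a one-level Haar detail band stands in for the temporal wavelet transform used in the paper, and frequent sign changes of the detail coefficients indicate quasi-periodic flicker. The threshold, signal model, and function name are assumptions, not the paper's parameters.

```python
import numpy as np

def flicker_score(intensity, threshold=0.05):
    """Score temporal flicker of one fire-coloured pixel from its intensity series.

    Pairwise differences approximate a one-level Haar high-pass (detail) band;
    frequent zero crossings of the detail signal indicate the quasi-periodic
    flicker typical of flame boundaries, unlike steady fire-coloured objects.
    """
    x = np.asarray(intensity, dtype=float)
    detail = (x[1::2] - x[0::2]) / np.sqrt(2.0)          # Haar detail coefficients
    active = detail[np.abs(detail) > threshold]
    if len(active) < 2:
        return 0.0
    crossings = np.sum(np.signbit(active[1:]) != np.signbit(active[:-1]))
    return crossings / (len(active) - 1)                 # 1.0 = alternates every step

t = np.arange(64)
flame  = 0.6 + 0.2 * np.sin(2 * np.pi * 8 * t / 64) + 0.05 * np.random.randn(64)
static = 0.6 + 0.005 * np.random.randn(64)               # e.g. a red parked car
print(flicker_score(flame), flicker_score(static))       # noticeably higher for the flame
```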
CORE 2016-04-11
Kuen, Jason, Wang, Gang, Wang, Zhenhua
Convolutional-deconvolution networks can be adopted to perform end-to-end saliency detection. However, they do not work well with objects of multiple scales. To overcome such a limitation, in this work, we propose a recurrent attentional convolutional-deconvolution network (RACDNN). Using spatial transformer and recurrent network units, RACDNN is able to iteratively attend to selected image sub-regions to perform saliency refinement progressively. Besides tackling the scale problem, RACDNN can also learn context-aware features from past iterations to enhance saliency refinement in future iterations. Experiments on several challenging saliency detection datasets validate the effectiveness of RACDNN, and show that RACDNN outperforms state-of-the-art saliency detection methods. Comment: CVPR 2016
arXiv 2019-09-23
Irene Amerini, Elena Balashova, Sayna Ebrahimi, Kathryn Leonard, Arsha Nagrani, Amaia Salvador
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019, pp. 0-0
In this paper we present the Women in Computer Vision Workshop - WiCV 2019, organized in conjunction with CVPR 2019. This event is meant to increase the visibility and inclusion of women researchers in the computer vision field. Computer vision and machine learning have made incredible progress over the past years, but the number of female researchers is still low both in academia and in industry. WiCV is organized for the following reasons: to raise the visibility of female researchers, to increase collaboration between them, and to provide mentorship to junior female researchers in the field. In this paper, we present a report of trends over the past years, along with a summary of statistics regarding presenters, attendees, and sponsorship for the current workshop.
arXiv 2022-05-10
Malika Nisal Ratnayake, Don Chathurika Amarathunga, Asaduz Zaman, Adrian G. Dyer, Alan Dorin
International Journal of Computer Vision (2022)
Insects are the most important global pollinators of crops and play a key role in maintaining the sustainability of natural ecosystems. Insect pollination monitoring and management are therefore essential for improving crop production and food security. Computer-vision-facilitated pollinator monitoring can intensify data collection beyond what is feasible using manual approaches. The new data it generates may provide a detailed understanding of insect distributions and facilitate fine-grained analysis sufficient to predict their pollination efficacy and underpin precision pollination. Current computer-vision-facilitated insect tracking in complex outdoor environments is restricted in spatial coverage and often constrained to a single insect species. This limits its relevance to agriculture. Therefore, in this article we introduce a novel system to facilitate markerless data capture for insect counting, insect motion tracking, behaviour analysis and pollination prediction across large agricultural areas. Our system comprises edge computing multi-point video recording, offline automated multispecies insect counting, tracking and behavioural analysis. We implement and test our system on a commercial berry farm to demonstrate its capabilities. Our system successfully tracked four insect varieties, at nine monitoring stations within polytunnels, obtaining an F-score above 0.8 for each variety. The system enabled calculation of key metrics to assess the relative pollination impact of each insect variety. With this technological advancement, detailed, ongoing data collection for precision pollination becomes achievable. This is important to inform growers and apiarists managing crop pollination, as it allows data-driven decisions to be made to improve food production and food security.
arXiv 2022-06-21
Weixuan Sun, Zhen Qin, Hui Deng, Jianyuan Wang, Yi Zhang, Kaihao Zhang, Nick Barnes, Stan Birchfield, Lingpeng Kong, Yiran Zhong
arXiv:2206.10552v2 [cs.CV]
Vision transformers have shown great success on numerous computer vision tasks. However, their central component, softmax attention, prohibits vision transformers from scaling up to high-resolution images, because both its computational complexity and memory footprint are quadratic. Although linear attention was introduced in natural language processing (NLP) tasks to mitigate a similar issue, directly applying existing linear attention to vision transformers may not lead to satisfactory results. We investigate this problem and find that computer vision tasks focus more on local information compared with NLP tasks. Based on this observation, we present Vicinity Attention, which introduces a locality bias to vision transformers with linear complexity. Specifically, for each image patch, we adjust its attention weight based on its 2D Manhattan distance to its neighbouring patches. In this case, neighbouring patches receive stronger attention than far-away patches. Moreover, since our Vicinity Attention requires the token length to be much larger than the feature dimension to show its efficiency advantages, we further propose a new Vicinity Vision Transformer (VVT) structure to reduce the feature dimension without degrading accuracy. We perform extensive experiments on the CIFAR-100, ImageNet-1K, and ADE20K datasets to validate the effectiveness of our method. Our method has a slower growth rate of GFLOPs than previous transformer-based and convolution-based networks as the input resolution increases. In particular, our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
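The locality bias can be sketched as follows: attention scores are penalised by the 2D Manhattan distance between patch coordinates, so nearby patches dominate. Note that the snippet uses ordinary quadratic softmax attention only to make the bias visible; the paper folds the bias into a linear attention to keep the cost linear, and the bias strength alpha is an assumed parameter.

```python
import numpy as np

def vicinity_biased_attention(q, k, v, grid, alpha=0.5):
    """Softmax attention with a 2D Manhattan-distance locality bias.

    q, k, v: (N, d) patch features; grid: (N, 2) integer (row, col) patch
    coordinates. Nearer patches get larger weights, distant ones are damped.
    """
    d = q.shape[1]
    scores = q @ k.T / np.sqrt(d)
    manhattan = np.abs(grid[:, None, :] - grid[None, :, :]).sum(-1)   # (N, N)
    scores = scores - alpha * manhattan            # penalise far-away patches
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

side, d = 8, 16
grid = np.stack(np.meshgrid(np.arange(side), np.arange(side), indexing="ij"), -1).reshape(-1, 2)
x = np.random.randn(side * side, d)
print(vicinity_biased_attention(x, x, x, grid).shape)   # (64, 16)
```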
arXiv 2020-10-02
Viktor Shipitsin, Iaroslav Bespalov, Dmitry V. Dylov
Computer Vision and Image Understanding, V. 223, 103519, 2022
We devise a universal adaptive neural layer to "learn" an optimal frequency filter for each image together with the weights of the base neural network that performs some computer vision task. The proposed approach takes the source image in the spatial domain, automatically selects the best frequencies from the frequency domain, and transmits the inverse-transformed image to the main neural network. Remarkably, such a simple add-on layer dramatically improves the performance of the main network regardless of its design. We observe that light networks gain a noticeable boost in performance metrics, whereas the training of heavy ones converges faster when our adaptive layer is allowed to "learn" alongside the main architecture. We validate the idea on four classical computer vision tasks: classification, segmentation, denoising, and erasing, considering popular natural and medical data benchmarks.
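The layer's data flow can be sketched in a few lines of NumPy: transform the image to the frequency domain, apply a per-frequency mask, and transform back. In the paper the mask is learned jointly with the base network; the fixed low-pass mask below is only a stand-in, and the sizes are illustrative.

```python
import numpy as np

def frequency_filter_layer(image, mask):
    """Apply a per-frequency mask to an image and return the filtered image.

    The mask plays the role of the "learned" frequency filter; here it is a
    fixed low-pass mask purely for illustration.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(image))             # to the frequency domain
    filtered = spectrum * mask                                  # select frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))   # back to pixel space

H = W = 64
yy, xx = np.mgrid[-H // 2:H // 2, -W // 2:W // 2]
lowpass = (np.sqrt(yy**2 + xx**2) < 12).astype(float)          # keep only low frequencies
image = np.random.rand(H, W)
print(frequency_filter_layer(image, lowpass).shape)            # (64, 64)
```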
OpenAlex 2021-10-01
28738 citations Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
OpenAlex 2022-06-01
6601 citations Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
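For reference, here is a PyTorch sketch of the kind of residual block this "modernization" arrives at (a 7x7 depthwise convolution, LayerNorm, and an inverted 4x MLP with GELU); layer scale and stochastic depth used in the full ConvNeXt are omitted, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Sketch of a ConvNeXt-style residual block: depthwise 7x7 convolution,
    LayerNorm in channels-last layout, and an inverted (4x) pointwise MLP."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # inverted bottleneck: expand
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # project back

    def forward(self, x):                        # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # channels-last for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return shortcut + x.permute(0, 3, 1, 2)  # residual connection

print(ConvNeXtBlock(96)(torch.randn(1, 96, 14, 14)).shape)   # torch.Size([1, 96, 14, 14])
```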
OpenAlex 2021-10-01
4540 citations Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lü, Ping Luo, Ling Shao
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Although convolutional neural networks (CNNs) have achieved great success in computer vision, this work investigates a simpler, convolution-free backbone network useful for many dense prediction tasks. Unlike the recently proposed Vision Transformer (ViT), which was designed specifically for image classification, we introduce the Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several merits compared to the current state of the art. (1) Different from ViT, which typically yields low-resolution outputs and incurs high computational and memory costs, PVT not only can be trained on dense partitions of an image to achieve high output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the computations of large feature maps. (2) PVT inherits the advantages of both CNN and Transformer, making it a unified backbone for various vision tasks without convolutions, where it can be used as a direct replacement for CNN backbones. (3) We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection and instance and semantic segmentation. For example, with a comparable number of parameters, PVT+RetinaNet achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinaNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope that PVT can serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.
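A simplified NumPy sketch of how attention cost on large feature maps can be reduced in the spirit of PVT: queries come from every token, while keys and values come from a spatially pooled copy of the grid. The pooling factor is illustrative and the learned projections are omitted; the actual PVT uses a learned spatial-reduction attention inside a multi-stage pyramid.

```python
import numpy as np

def spatial_reduction_attention(x, grid_hw, reduction=4):
    """Attention over a large token grid with spatially reduced keys/values.

    x: (N, d) tokens laid out row-major on an (H, W) grid. Keys/values are an
    average-pooled copy of the grid, cutting the attention matrix from N x N
    to N x (N / reduction^2). Learned projections are omitted for brevity.
    """
    H, W = grid_hw
    N, d = x.shape
    r = reduction
    pooled = x.reshape(H // r, r, W // r, r, d).mean(axis=(1, 3)).reshape(-1, d)
    scores = x @ pooled.T / np.sqrt(d)            # (N, N / r^2)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ pooled                       # (N, d)

H = W = 32
tokens = np.random.randn(H * W, 64)               # a high-resolution early-stage grid
print(spatial_reduction_attention(tokens, (H, W)).shape)   # (1024, 64)
```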
OpenAlex 2022-03-16
2058 citations Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lü, Ping Luo, Ling Shao
Computational Visual Media
Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (i) a linear-complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin Transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
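The overlapping patch embedding, design (ii), can be sketched in PyTorch as a convolution whose kernel is larger than its stride, so neighbouring tokens share pixels (unlike ViT's non-overlapping 16x16 cuts); the kernel, stride, and channel sizes below are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Overlapping patch embedding: kernel > stride means adjacent patches overlap,
# which preserves local continuity between neighbouring tokens.
overlap_embed = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3),  # overlapping windows
    nn.Flatten(2),                                          # (N, C, H*W)
)

img = torch.randn(1, 3, 224, 224)
tokens = overlap_embed(img).transpose(1, 2)                 # (N, H*W, C) token sequence
print(tokens.shape)                                          # torch.Size([1, 3136, 64])
```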
Semantic Scholar 2021
30995 citations Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, B. Guo
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
TL;DR: A hierarchical Transformer whose representation is computed with Shifted windows, which has the flexibility to model at various scales and has linear computational complexity with respect to image size and will prove beneficial for all-MLP architectures.
Semantic Scholar 2024
14 citations Arman Keresh, Pakizar Shamoi
IEEE Access
Face recognition systems are increasingly used in biometric security for convenience and effectiveness. However, they remain vulnerable to spoofing attacks, where attackers use photos, videos, or masks to impersonate legitimate users. This research addresses these vulnerabilities by exploring the Vision Transformer (ViT) architecture, fine-tuned with the DINO framework utilizing CelebA-Spoof, CASIA SURF, and a proprietary dataset. The DINO framework facilitates self-supervised learning, enabling the model to learn distinguishing features from unlabeled data. We compared the performance of the proposed fine-tuned ViT model using the DINO framework against traditional models, including CNN Model EfficientNet b2, EfficientNet b2 (Noisy Student), and Mobile ViT on the face anti-spoofing task. Numerous tests on standard datasets show that the ViT model performs better than other models in terms of accuracy and resistance to different spoofing methods. Our model’s superior performance, particularly in APCER (1.6%), the most critical metric in this domain, underscores its improved ability to detect spoofing relative to other models. Additionally, we collected our own dataset from a biometric application to validate our findings further. This study highlights the superior performance of transformer-based architecture in identifying complex spoofing cues, leading to significant advancements in biometric security.
TL;DR: This study highlights the superior performance of transformer-based architecture in identifying complex spoofing cues, leading to significant advancements in biometric security.
Semantic Scholar 2025
4 citations S. S, Mohana, Ambika G, K. A, Nataraj K, Sudhangowda B S
2025 7th International Conference on Intelligent Sustainable Systems (ICISS)
Vision Transformer (ViT) is an image recognition model that uses the transformer architecture, which has numerous advantages over Convolutional Neural Networks (CNNs). It offers improved accuracy, scalability, flexibility, global context, and transferability. ViT can handle images of different sizes and aspect ratios, making it more versatile than CNNs. It can process an entire image at once, allowing it to capture global context information and long-range dependencies. Additionally, ViT pre-training on huge amounts of image data can be transferred to other image recognition tasks, making it a useful tool for transfer learning. This paper describes the differences between ViT and CNNs and how ViT splits images into patches for classification. Positional encoding of the different features is used in ViT to avoid the need for filters. The proposed implementation obtained a final top-1 prediction accuracy of 93%.
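A minimal NumPy sketch of the patch-splitting and positional-encoding step described here; the projection matrix, class token, and positional embeddings are random placeholders standing in for the learned parameters of a real ViT.

```python
import numpy as np

def to_patch_tokens(image, patch=16, dim=64, rng=np.random.default_rng(0)):
    """Split an (H, W, 3) image into ViT-style patch tokens.

    The linear projection, [CLS] token, and positional embeddings are random
    placeholders; in a real ViT they are learned parameters.
    """
    H, W, C = image.shape
    patches = (image.reshape(H // patch, patch, W // patch, patch, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * C))        # (num_patches, 16*16*3)
    proj = rng.standard_normal((patch * patch * C, dim))
    tokens = patches @ proj                                  # linear patch embedding
    cls = rng.standard_normal((1, dim))                      # [CLS] token for classification
    tokens = np.concatenate([cls, tokens], axis=0)
    tokens += rng.standard_normal(tokens.shape) * 0.02       # add positional encodings
    return tokens

print(to_patch_tokens(np.random.rand(224, 224, 3)).shape)    # (197, 64)
```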
Semantic Scholar 2024
2 citations Nayem Uddin Prince, Md. Abdullah Al Mamun, Md. Tanvir Miah Shagar, Md. Rezaul Karim Emon, Md. Sahadat Hossen Sajib
2024 IEEE International Conference on Blockchain and Distributed Systems Security (ICBDS)
Northern Bangladesh is home to most of the country's lychee cultivation, which has a major economic impact. Lychee output and quality are reduced by various leaf and fruit diseases. Deep learning is used to construct a detection system for lychee diseases so they can be identified early and accurately. We correctly labelled 6,000 photos of healthy and unhealthy lychee foliage and fruits. Six deep learning algorithms (VGG16, CustomCNN, MobileNet, InceptionV3, ResNet50, and Vision Transformer) classified the photos. Normalization and augmentation improved model resilience during data preparation. F1-score, recall, accuracy, and precision were utilized to train and evaluate the models. Vision Transformer (ViT) scored 99.91% in accuracy, recall, and F1-score, outperforming the other models. This achievement shows that ViT can better recognize complex lychee disease traits. The ViT-based disease detection algorithm can help Bangladeshi farmers identify diseases quickly and accurately. Reduced lychee loss and improved fruit quality and attractiveness should boost lychee sales in domestic and international markets.