{"technology":{"slug":"large-language-models","name":"Large Language Models","description":"LLM research and development. Transformer architectures, training methods, alignment, reasoning capabilities, multimodal models, and AI safety.","discipline":"Computer Science / AI","icon":"🤖"},"lastUpdated":"2026-04-11T06:42:40.480Z","articleCount":15,"articles":[{"id":"oa-W3138516171","title":"Swin Transformer: Hierarchical Vision Transformer using Shifted Windows","authors":"Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo","journal":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","pubDate":"2021-10-01","doi":"10.1109/iccv48922.2021.00986","abstract":"This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). 
Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.","tldr":"","source":"OpenAlex","sourceUrl":"https://openalex.org/W3138516171","citationCount":28719,"isOpenAccess":false,"pdfUrl":""},{"id":"oa-W4225672218","title":"Restormer: Efficient Transformer for High-Resolution Image Restoration","authors":"Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming–Hsuan Yang","journal":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","pubDate":"2022-06-01","doi":"10.1109/cvpr52688.2022.00564","abstract":"Since convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data, these models have been extensively applied to image restoration and related tasks. Recently, another class of neural architectures, Transformers, have shown significant performance gains on natural language and high-level vision tasks. While the Transformer model mitigates the shortcomings of CNNs (i.e., limited receptive field and inadaptability to input content), its computational complexity grows quadratically with the spatial resolution, therefore making it infeasible to apply to most image restoration tasks involving high-resolution images. In this work, we propose an efficient Transformer model by making several key designs in the building blocks (multi-head attention and feed-forward network) such that it can capture long-range pixel interactions, while still remaining applicable to large images. 
Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks, including image deraining, single-image motion deblurring, defocus deblurring (single-image and dual-pixel data), and image denoising (Gaussian grayscale/color denoising, and real image denoising). The source code and pre-trained models are available at https://github.com/swz30/Restormer.","tldr":"","source":"OpenAlex","sourceUrl":"https://openalex.org/W4225672218","citationCount":3239,"isOpenAccess":false,"pdfUrl":""},{"id":"oa-W4224308101","title":"PaLM: Scaling Language Modeling with Pathways","authors":"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek S. Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James T. Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier García, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, D. Dohan, Shivani Agrawal, Mark Omernick, Andrew M. 
Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Érica Rodrigues Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Díaz, Orhan Fırat, Michele Catasta, Jason Lee, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, Noah Fiedel","journal":"arXiv (Cornell University)","pubDate":"2022-04-05","doi":"10.48550/arxiv.2204.02311","abstract":"Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. 
Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.","tldr":"","source":"OpenAlex","sourceUrl":"https://openalex.org/W4224308101","citationCount":2124,"isOpenAccess":true,"pdfUrl":"https://arxiv.org/pdf/2204.02311"},{"id":"oa-W4281485151","title":"Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding","authors":"Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, Mohammad Norouzi","journal":"arXiv (Cornell University)","pubDate":"2022-05-23","doi":"10.48550/arxiv.2205.11487","abstract":"We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. 
With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See https://imagen.research.google/ for an overview of the results.","tldr":"","source":"OpenAlex","sourceUrl":"https://openalex.org/W4281485151","citationCount":2103,"isOpenAccess":true,"pdfUrl":"https://arxiv.org/pdf/2205.11487"},{"id":"oa-W3105966348","title":"TinyBERT: Distilling BERT for Natural Language Understanding","authors":"Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Dong Chen, Linlin Li, Fang Wang, Qun Liu","journal":"","pubDate":"2020-01-01","doi":"10.18653/v1/2020.findings-emnlp.372","abstract":"Language model pre-training, such as BERT, has significantly improved the performances of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of the Transformer-based models. By leveraging this new KD method, the plenty of knowledge encoded in a large \"teacher\" BERT can be effectively transferred to a small \"student\" TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pretraining and task-specific learning stages. 
This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT.","tldr":"","source":"OpenAlex","sourceUrl":"https://openalex.org/W3105966348","citationCount":1590,"isOpenAccess":true,"pdfUrl":"https://www.aclweb.org/anthology/2020.findings-emnlp.372.pdf"},{"id":"oa-W3202773593","title":"The great Transformer: Examining the role of large language models in the political economy of AI","authors":"Dieuwertje Luitse, Wiebke Denkena","journal":"Big Data & Society","pubDate":"2021-07-01","doi":"10.1177/20539517211047734","abstract":"In recent years, AI research has become more and more computationally demanding. In natural language processing (NLP), this tendency is reflected in the emergence of large language models (LLMs) like GPT-3. These powerful neural network-based models can be used for a range of NLP tasks and their language generation capacities have become so sophisticated that it can be very difficult to distinguish their outputs from human language. LLMs have raised concerns over their demonstrable biases, heavy environmental footprints, and future social ramifications. In December 2020, critical research on LLMs led Google to fire Timnit Gebru, co-lead of the company’s AI Ethics team, which sparked a major public controversy around LLMs and the growing corporate influence over AI research. This article explores the role LLMs play in the political economy of AI as infrastructural components for AI research and development. Retracing the technical developments that have led to the emergence of LLMs, we point out how they are intertwined with the business model of big tech companies and further shift power relations in their favour. This becomes visible through the Transformer, which is the underlying architecture of most LLMs today and started the race for ever bigger models when it was introduced by Google in 2017. 
Using the example of GPT-3, we shed light on recent corporate efforts to commodify LLMs through paid API access and exclusive licensing, raising questions around monopolization and dependency in a field that is increasingly divided by access to large-scale computing power.","tldr":"","source":"OpenAlex","sourceUrl":"https://openalex.org/W3202773593","citationCount":164,"isOpenAccess":true,"pdfUrl":"https://journals.sagepub.com/doi/pdf/10.1177/20539517211047734"},{"id":"oa-W4399367209","title":"Transformers and large language models in healthcare: A review","authors":"Subhash Nerella, Sabyasachi Bandyopadhyay, Jiaqing Zhang, Miguel Á. Contreras, Scott Siegel, Aysegül Bumin, Brandon Silva, Jessica Sena, Benjamin Shickel, Azra Bihorac, Kia Khezeli, Parisa Rashidi","journal":"Artificial Intelligence in Medicine","pubDate":"2024-06-05","doi":"10.1016/j.artmed.2024.102900","abstract":"","tldr":"","source":"OpenAlex","sourceUrl":"https://openalex.org/W4399367209","citationCount":114,"isOpenAccess":true,"pdfUrl":"https://www.ncbi.nlm.nih.gov/pmc/articles/11638972"},{"id":"oa-W4384648484","title":"Retentive Network: A Successor to Transformer for Large Language Models","authors":"Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei","journal":"arXiv (Cornell University)","pubDate":"2023-07-17","doi":"10.48550/arxiv.2307.08621","abstract":"In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. 
The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. The intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at https://aka.ms/retnet.","tldr":"","source":"OpenAlex","sourceUrl":"https://openalex.org/W4384648484","citationCount":107,"isOpenAccess":true,"pdfUrl":"https://arxiv.org/pdf/2307.08621"},{"id":"oa-W4410857897","title":"Transformers and large language models for efficient intrusion detection systems: A comprehensive survey","authors":"Hamza Kheddar","journal":"Information Fusion","pubDate":"2025-05-29","doi":"10.1016/j.inffus.2025.103347","abstract":"","tldr":"","source":"OpenAlex","sourceUrl":"https://openalex.org/W4410857897","citationCount":85,"isOpenAccess":false,"pdfUrl":""},{"id":"oa-W4361766487","title":"Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?","authors":"Byung-Doh Oh, William Schuler","journal":"Transactions of the Association for Computational Linguistics","pubDate":"2023-01-01","doi":"10.1162/tacl_a_00548","abstract":"Abstract This work presents a linguistic analysis into why larger Transformer-based pre-trained language models with more parameters and lower perplexity nonetheless yield surprisal estimates that are less predictive of human reading times. 
First, regression analyses show a strictly monotonic, positive log-linear relationship between perplexity and fit to reading times for the more recently released five GPT-Neo variants and eight OPT variants on two separate datasets, replicating earlier results limited to just GPT-2 (Oh et al., 2022). Subsequently, analysis of residual errors reveals a systematic deviation of the larger variants, such as underpredicting reading times of named entities and making compensatory overpredictions for reading times of function words such as modals and conjunctions. These results suggest that the propensity of larger Transformer-based models to ‘memorize’ sequences during training makes their surprisal estimates diverge from humanlike expectations, which warrants caution in using pre-trained language models to study human language processing.","tldr":"","source":"OpenAlex","sourceUrl":"https://openalex.org/W4361766487","citationCount":82,"isOpenAccess":true,"pdfUrl":"https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00548/2075940/tacl_a_00548.pdf"},{"id":"oa-W3166204619","title":"BioM-Transformers: Building Large Biomedical Language Models with BERT, ALBERT and ELECTRA","authors":"Sultan Alrowili, Vijay Shanker","journal":"","pubDate":"2021-01-01","doi":"10.18653/v1/2021.bionlp-1.24","abstract":"The impact of design choices on the performance of biomedical language models recently has been a subject for investigation. In this paper, we empirically study biomedical domain adaptation with large transformer models using different design choices. We evaluate the performance of our pretrained models against other existing biomedical language models in the literature. Our results show that we achieve state-of-the-art results on several biomedical domain tasks despite using similar or less computational cost compared to other models in the literature. 
Our findings highlight the significant effect of design choices on improving the performance of biomedical language models.","tldr":"","source":"OpenAlex","sourceUrl":"https://openalex.org/W3166204619","citationCount":68,"isOpenAccess":true,"pdfUrl":"https://aclanthology.org/2021.bionlp-1.24.pdf"},{"id":"oa-W4309685935","title":"Prompt text classifications with transformer models! An exemplary introduction to prompt-based learning with large language models","authors":"Christian Mayer, Sabrina Ludwig, Steffen Brandt","journal":"Journal of Research on Technology in Education","pubDate":"2022-11-22","doi":"10.1080/15391523.2022.2142872","abstract":"This study investigates the potential of automated classification using prompt-based learning approaches with transformer models (large language models trained in an unsupervised manner) for a domain-specific classification task. Prompt-based learning with zero or few shots has the potential to (1) make use of artificial intelligence without sophisticated programming skills and (2) make use of artificial intelligence without fine-tuning models with large amounts of labeled training data. We apply this novel method to perform an experiment using so-called zero-shot classification as a baseline model and a few-shot approach for classification. For comparison, we also fine-tune a language model on the given classification task and conducted a second independent human rating to compare it with the given human ratings from the original study. The used dataset consists of 2,088 email responses to a domain-specific problem-solving task that were manually labeled for their professional communication style. With the novel prompt-based learning approach, we achieved a Cohen’s kappa of .40, while the fine-tuning approach yields a kappa of .59, and the new human rating achieved a kappa of .58 with the original human ratings. 
However, the classifications from the machine learning models have the advantage that each prediction is provided with a reliability estimate allowing us to identify responses that are difficult to score. We, therefore, argue that response ratings should be based on a reciprocal workflow of machine raters and human raters, where the machine rates easy-to-classify responses and the human raters focus and agree on the responses that are difficult to classify. Further, we believe that this new, more intuitive, prompt-based learning approach will enable more people to use artificial intelligence.","tldr":"","source":"OpenAlex","sourceUrl":"https://openalex.org/W4309685935","citationCount":51,"isOpenAccess":false,"pdfUrl":""},{"id":"oa-W4393407316","title":"An LPDDR-based CXL-PNM Platform for TCO-efficient Inference of Transformer-based Large Language Models","authors":"Sangsoo Park, Kyung-Soo Kim, Jinin So, Jin Chul Jung, Jong-Geon Lee, Kyoungwan Woo, Nayeon Kim, Younghyun Lee, Hyungyo Kim, Yongsuk Kwon, Jinhyun Kim, Jieun Lee, Yeongon Cho, Yong-Min Tai, Jeong‐Hyeon Cho, Hoyoung Song, Jung Ho Ahn, Nam Sung Kim","journal":"","pubDate":"2024-03-02","doi":"10.1109/hpca57654.2024.00078","abstract":"Transformer-based large language models (LLMs) such as Generative Pre-trained Transformer (GPT) have become popular due to their remarkable performance across diverse applications, including text generation and translation. For LLM training and inference, the GPU has been the predominant accelerator with its pervasive software development ecosystem and powerful computing capability. However, as the size of LLMs keeps increasing for higher performance and/or more complex applications, a single GPU cannot efficiently accelerate LLM training and inference due to its limited memory capacity, which demands frequent transfers of the model parameters needed by the GPU to compute the current layer(s) from the host CPU memory/storage. 
A GPU appliance may provide enough aggregated memory capacity with multiple GPUs, but it suffers from frequent transfers of intermediate values among GPU devices, each accelerating specific layers of a given LLM. As the frequent transfers of these model parameters and intermediate values are performed over relatively slow device-to-device interconnects such as PCIe or NVLink, they become the key bottleneck for efficient acceleration of LLMs. Focusing on accelerating LLM inference, which is essential for many commercial services, we develop CXL-PNM, a processing near memory (PNM) platform based on the emerging interconnect technology, Compute eXpress Link (CXL). Specifically, we first devise an LPDDR5X-based CXL memory architecture with 512GB of capacity and 1.1TB/s of bandwidth, which boasts 16× larger capacity and 10× higher bandwidth than GDDR6- and DDR5-based CXL memory architectures, respectively, under a module form-factor constraint. Second, we design a CXL-PNM controller architecture integrated with an LLM inference accelerator, exploiting the unique capabilities of such CXL memory to overcome the disadvantages of competing technologies such as HBM-PIM and AxDIMM. Lastly, we implement a CXL-PNM software stack that supports seamless and transparent use of CXL-PNM for Python-based LLM programs. 
Our evaluation shows that a CXL-PNM appliance with 8 CXL-PNM devices offers 23% lower latency, 31% higher throughput, and 2.8× higher energy efficiency at 30% lower hardware cost than a GPU appliance with 8 GPU devices for an LLM inference service.","tldr":"","source":"OpenAlex","sourceUrl":"https://openalex.org/W4393407316","citationCount":50,"isOpenAccess":false,"pdfUrl":""},{"id":"oa-W4386921320","title":"Self-Attention and Transformers: Driving the Evolution of Large Language Models","authors":"Qing Luo, Wei Zeng, Manni Chen, Gang‐Ding Peng, Xiaofeng Yuan, Qiang Yin","journal":"","pubDate":"2023-07-21","doi":"10.1109/iceict57916.2023.10245906","abstract":"Transformers, originally introduced for machine translation, and built upon the Self-Attention mechanism, have undergone a remarkable evolution, establishing themselves as the bedrock of large language models (LLMs). Their unparalleled capacity to model intricate relationships and capture extensive dependencies within sequences has propelled their prominence. This article, presented in a popular science format, serves as an introduction to the transformer architecture, elucidating its innovative structure that enables efficient processing of long sequences and capturing dependencies over extended distances. We believe that this resource will prove valuable to college students or youth researchers aspiring to delve into the study and research of modern Artificial Intelligence (AI) domains.","tldr":"","source":"OpenAlex","sourceUrl":"https://openalex.org/W4386921320","citationCount":36,"isOpenAccess":false,"pdfUrl":""},{"id":"oa-W4388717695","title":"To Transformers and Beyond: Large Language Models for the Genome","authors":"Micaela Elisa Consens, C Dufault, Michael Wainberg, Duncan Forster, Mehran Karimzadeh, Hani Goodarzi, Fabian J. 
Theis, Alan M Moses, Bo Wang","journal":"arXiv (Cornell University)","pubDate":"2023-11-13","doi":"10.48550/arxiv.2311.07621","abstract":"In the rapidly evolving landscape of genomics, deep learning has emerged as a useful tool for tackling complex computational challenges. This review focuses on the transformative role of Large Language Models (LLMs), which are mostly based on the transformer architecture, in genomics. Building on the foundation of traditional convolutional neural networks and recurrent neural networks, we explore both the strengths and limitations of transformers and other LLMs for genomics. Additionally, we contemplate the future of genomic modeling beyond the transformer architecture based on current trends in research. The paper aims to serve as a guide for computational biologists and computer scientists interested in LLMs for genomic data. We hope the paper can also serve as an educational introduction and discussion for biologists to a fundamental shift in how we will be analyzing genomic data in the future.","tldr":"","source":"OpenAlex","sourceUrl":"https://openalex.org/W4388717695","citationCount":31,"isOpenAccess":true,"pdfUrl":"https://arxiv.org/pdf/2311.07621"}],"links":{"web":"https://science-database.com/technology/large-language-models","llms_txt":"https://science-database.com/technology/large-language-models/llms.txt","api":"https://science-database.com/api/v1/technology/large-language-models"}}