Technology Hub

Large Language Models

LLM research and development. Transformer architectures, training methods, alignment, reasoning capabilities, multimodal models, and AI safety.

Computer Science / AI

Results for "large language model transformer"

1,804,995 total results — showing 20 from PubMed + NASA ADS + arXiv + OpenAlex
PubMed 2025 May

Impact of large language model (ChatGPT) in healthcare: an umbrella review and evidence synthesis.

Iqbal Usman, Tanweer Afifa, Rahmanti Annisa Ristya, Greenfield David, Lee Leon Tsung-Ju, Li Yu-Chuan Jack

Journal of biomedical science


BACKGROUND: The emergence of Artificial Intelligence (AI), particularly Chat Generative Pre-Trained Transformer (ChatGPT), a Large Language Model (LLM), in healthcare promises to reshape patient care, clinical decision-making, and medical education. This review aims to synthesise research findings to consolidate the implications of ChatGPT integration in healthcare and identify research gaps.

MAIN BODY: The umbrella review was conducted following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The Cochrane Library, PubMed, Scopus, Web of Science, and Google Scholar were searched from inception until February 2024. Due to the heterogeneity of the included studies, no quantitative analysis was performed. Instead, information was extracted, summarised, synthesised, and presented in narrative form. Two reviewers undertook title, abstract, and full-text screening independently. The methodological quality and overall rating of the included reviews were assessed using the A MeaSurement Tool to Assess systematic Reviews (AMSTAR-2) checklist. The review examined 17 studies, comprising 15 systematic reviews and 2 meta-analyses, on ChatGPT in healthcare, revealing diverse focuses. The AMSTAR-2 assessment identified 5 moderate- and 12 low-quality reviews, with deficiencies in areas such as study design justification and funding source reporting. The most commonly reported theme was ChatGPT's use in disease diagnosis or clinical decision-making. While 82.4% of studies focused on its general usage, 17.6% explored unique topics like its role in medical examinations and conducting systematic reviews. Among these, 52.9% targeted general healthcare, with 41.2% focusing on specific domains like radiology, neurosurgery, gastroenterology, public health dentistry, and ophthalmology. ChatGPT's use for manuscript review or writing was mentioned in 17.6% of reviews. Promising applications include enhancing patient care and clinical decision-making, though ethical, legal, and accuracy concerns require cautious integration.

CONCLUSION: We summarise the identified areas in reviews regarding ChatGPT's transformative impact in healthcare, highlighting patient care, decision-making, and medical education. Emphasising the importance of ethical regulations and the involvement of policymakers, we urge further investigation to ensure the reliability of ChatGPT and to promote trust in healthcare and research.

PubMed 2025 Mar

A systematic review of large language model (LLM) evaluations in clinical medicine.

Shool Sina, Adimi Sara, Saboori Amleshi Reza, Bitaraf Ehsan, Golpira Reza, Tara Mahmood

BMC medical informatics and decision making


BACKGROUND: Large Language Models (LLMs), advanced AI tools based on transformer architectures, demonstrate significant potential in clinical medicine by enhancing decision support, diagnostics, and medical education. However, their integration into clinical workflows requires rigorous evaluation to ensure reliability, safety, and ethical alignment.

OBJECTIVE: This systematic review examines the evaluation parameters and methodologies applied to LLMs in clinical medicine, highlighting their capabilities, limitations, and application trends.

METHODS: A comprehensive review of the literature was conducted across PubMed, Scopus, Web of Science, IEEE Xplore, and arXiv databases, encompassing both peer-reviewed and preprint studies. Studies were screened against predefined inclusion and exclusion criteria to identify original research evaluating LLM performance in medical contexts.

RESULTS: The results reveal a growing interest in leveraging LLM tools in clinical settings, with 761 studies meeting the inclusion criteria. While general-domain LLMs, particularly ChatGPT and GPT-4, dominated evaluations (93.55%), medical-domain LLMs accounted for only 6.45%. Accuracy emerged as the most commonly assessed parameter (21.78%). Despite these advancements, the evidence base highlights certain limitations and biases across the included studies, emphasizing the need for careful interpretation and robust evaluation frameworks.

CONCLUSIONS: The exponential growth in LLM research underscores their transformative potential in healthcare. However, addressing challenges such as ethical risks, evaluation variability, and underrepresentation of critical specialties will be essential. Future efforts should prioritize standardized frameworks to ensure safe, effective, and equitable LLM integration in clinical practice.

PubMed Review 2023 Oct

ChatGPT and large language model (LLM) chatbots: The current state of acceptability and a proposal for guidelines on utilization in academic medicine.

Kim Jin K, Chua Michael, Rickard Mandy, Lorenzo Armando

Journal of pediatric urology


INTRODUCTION: There is currently no clear consensus on the standards for using large language models such as ChatGPT in academic medicine. Hence, we performed a scoping review of available literature to understand the current state of LLM use in medicine and to provide a guideline for future utilization in academia.

MATERIALS AND METHODS: A scoping review of the literature was performed through a Medline search on February 16, 2023 using a combination of keywords including artificial intelligence, machine learning, natural language processing, generative pre-trained transformer, ChatGPT, and large language model. There were no restrictions to language or date of publication. Records not pertaining to LLMs were excluded. Records pertaining to LLM ChatBots and ChatGPT were identified and evaluated separately. Among the records pertaining to LLM ChatBots and ChatGPT, those that suggest recommendations for ChatGPT use in academia were utilized to create guideline statements for ChatGPT and LLM use in academic medicine.

RESULTS: A total of 87 records were identified. Of these, 30 did not pertain to large language models and were excluded. 54 records underwent full-text review for evaluation, and 33 of these related to LLM ChatBots or ChatGPT.

DISCUSSION: From assessing these texts, five guideline statements for LLM use were developed: (1) ChatGPT/LLM cannot be cited as an author in scientific manuscripts; (2) if ChatGPT/LLM use is considered in academic work, the author(s) should have at least a basic understanding of what ChatGPT/LLM is; (3) do not use ChatGPT/LLM to produce the entirety of the text in manuscripts; humans must be held accountable for use of ChatGPT/LLM, and content created by ChatGPT/LLM should be meticulously verified by humans; (4) ChatGPT/LLMs may be used for editing and refining text; (5) any use of ChatGPT/LLM should be transparent, clearly outlined in scientific manuscripts, and acknowledged.

CONCLUSION: Future authors should remain mindful of the potential impact their academic work may have on healthcare and continue to uphold the highest ethical standards and integrity when utilizing ChatGPT/LLM.

PubMed 2024 Nov

Examining the Role of Large Language Models in Orthopedics: Systematic Review.

Zhang Cheng, Liu Shanshan, Zhou Xingyu, Zhou Siyu, Tian Yinglun, Wang Shenglin, Xu Nanfang, Li Weishi

Journal of medical Internet research


BACKGROUND: Large language models (LLMs) can understand natural language and generate corresponding text, images, and even videos based on prompts, which holds great potential in medical scenarios. Orthopedics is a significant branch of medicine, and orthopedic diseases contribute to a significant socioeconomic burden, which could be alleviated by the application of LLMs. Several pioneers in orthopedics have conducted research on LLMs across various subspecialties to explore their performance in addressing different issues. However, there are currently few reviews and summaries of these studies, and a systematic summary of existing research is absent.

OBJECTIVE: The objective of this review was to comprehensively summarize research findings on the application of LLMs in the field of orthopedics and explore the potential opportunities and challenges.

METHODS: PubMed, Embase, and Cochrane Library databases were searched from January 1, 2014, to February 22, 2024, with the language limited to English. The terms, which included variants of "large language model," "generative artificial intelligence," "ChatGPT," and "orthopaedics," were divided into 2 categories: large language model and orthopedics. After completing the search, the study selection process was conducted according to the inclusion and exclusion criteria. The quality of the included studies was assessed using the revised Cochrane risk-of-bias tool for randomized trials and CONSORT-AI (Consolidated Standards of Reporting Trials-Artificial Intelligence) guidance. Data extraction and synthesis were conducted after the quality assessment.

RESULTS: A total of 68 studies were selected. The application of LLMs in orthopedics involved the fields of clinical practice, education, research, and management. Of these 68 studies, 47 (69%) focused on clinical practice, 12 (18%) addressed orthopedic education, 8 (12%) were related to scientific research, and 1 (1%) pertained to the field of management. Of the 68 studies, only 8 (12%) recruited patients, and only 1 (1%) was a high-quality randomized controlled trial. ChatGPT was the most commonly mentioned LLM tool. There was considerable heterogeneity in the definition, measurement, and evaluation of the LLMs' performance across the different studies. For diagnostic tasks alone, the accuracy ranged from 55% to 93%. When performing disease classification tasks, ChatGPT with GPT-4's accuracy ranged from 2% to 100%. With regard to answering questions in orthopedic examinations, the scores ranged from 45% to 73.6% due to differences in models and test selections.

CONCLUSIONS: LLMs cannot replace orthopedic professionals in the short term. However, using LLMs as copilots could be a potential approach to effectively enhance work efficiency at present. More high-quality clinical trials are needed in the future, aiming to identify optimal applications of LLMs and advance orthopedics toward higher efficiency and precision.

PubMed 2025 May

Large language models for conducting systematic reviews: on the rise, but not yet ready for use-a scoping review.

Lieberum Judith-Lisa, Toews Markus, Metzendorf Maria-Inti, Heilmeyer Felix, Siemens Waldemar, Haverkamp Christian, Böhringer Daniel, Meerpohl Joerg J, Eisele-Metzger Angelika

Journal of clinical epidemiology


BACKGROUND AND OBJECTIVES: Machine learning promises versatile help in the creation of systematic reviews (SRs). Recently, further developments in the form of large language models (LLMs) and their application to SR conduct have attracted attention. We aimed to provide an overview of LLM applications in SR conduct in health research.

METHODS: We systematically searched MEDLINE, Web of Science, IEEE Xplore, ACM Digital Library, Europe PMC (preprints), and Google Scholar, and conducted an additional hand search (last search: February 26, 2024). We included scientific articles in English or German, published from April 2021 onwards, building upon the results of a mapping review that had not yet identified LLM applications to support SRs. Two reviewers independently screened studies for eligibility; after piloting, one reviewer extracted data, checked by another.

RESULTS: Our database search yielded 8054 hits, and we identified 33 articles from our hand search. We finally included 37 articles on LLM support. LLM approaches covered 10 of 13 defined SR steps, most frequently literature search (n = 15, 41%), study selection (n = 14, 38%), and data extraction (n = 11, 30%). The most frequently recurring LLM was Generative Pretrained Transformer (GPT) (n = 33, 89%). Validation studies were predominant (n = 21, 57%). Authors evaluated LLM use as promising in half of the studies (n = 20, 54%), as neutral in one-quarter (n = 9, 24%), and as nonpromising in one-fifth (n = 8, 22%).

CONCLUSION: Although LLMs show promise in supporting SR creation, fully established or validated applications are often lacking. The rapid increase in research on LLMs for evidence synthesis production highlights their growing relevance.

PLAIN LANGUAGE SUMMARY: Systematic reviews are a crucial tool in health research where experts carefully collect and analyze all available evidence on a specific research question. Creating these reviews is typically time- and resource-intensive, often taking months or even years to complete, as researchers must thoroughly search, evaluate, and synthesize an immense number of scientific studies. For the present article, we conducted a review to understand how new artificial intelligence (AI) tools, specifically large language models (LLMs) like Generative Pretrained Transformer (GPT), can be used to help create systematic reviews in health research. We searched multiple scientific databases and finally found 37 relevant articles. We found that LLMs have been tested to help with various parts of the systematic review process, particularly in 3 main areas: searching scientific literature (41% of studies), selecting relevant studies (38%), and extracting important information from these studies (30%). GPT was the most commonly used LLM, appearing in 89% of the studies. Most of the research (57%) focused on testing whether these AI tools actually work as intended in this context of systematic review production. The results were mixed: about half of the studies found LLMs promising, a quarter were neutral, and one-fifth found them not promising. While LLMs show potential for making the systematic review process more efficient, there is still a lack of fully tested and validated applications. However, the increasing number of studies in this field suggests that these AI tools are becoming increasingly important in creating systematic reviews.

PubMed 2025 Apr

Artificial intelligence-assisted academic writing: recommendations for ethical use.

Cheng Adam, Calhoun Aaron, Reedy Gabriel

Advances in simulation (London, England)


Generative artificial intelligence (AI) tools have been selectively adopted across the academic community to help researchers complete tasks in a more efficient manner. The widespread release of the Chat Generative Pre-trained Transformer (ChatGPT) platform in 2022 has made these tools more accessible to scholars around the world. Despite their tremendous potential, studies have uncovered that large language model (LLM)-based generative AI tools have issues with plagiarism, AI hallucinations, and inaccurate or fabricated references. This raises legitimate concern about the utility, accuracy, and integrity of AI when used to write academic manuscripts. Currently, there is little clear guidance for healthcare simulation scholars outlining the ways that generative AI could be used to legitimately support the production of academic literature. In this paper, we discuss how widely available, LLM-powered generative AI tools (e.g. ChatGPT) can help in the academic writing process. We first explore how academic publishers are positioning the use of generative AI tools and then describe potential issues with using these tools in the academic writing process. Finally, we discuss three categories of specific ways generative AI tools can be used in an ethically sound manner and offer four key principles that can help guide researchers to produce high-quality research outputs with the highest of academic integrity.

PubMed 2024 May

The Role of Large Language Models in Transforming Emergency Medicine: Scoping Review.

Preiksaitis Carl, Ashenburg Nicholas, Bunney Gabrielle, Chu Andrew, Kabeer Rana, Riley Fran, Ribeira Ryan, Rose Christian

JMIR medical informatics


BACKGROUND: Artificial intelligence (AI), more specifically large language models (LLMs), holds significant potential in revolutionizing emergency care delivery by optimizing clinical workflows and enhancing the quality of decision-making. Although enthusiasm for integrating LLMs into emergency medicine (EM) is growing, the existing literature is characterized by a disparate collection of individual studies, conceptual analyses, and preliminary implementations. Given these complexities and gaps in understanding, a cohesive framework is needed to comprehend the existing body of knowledge on the application of LLMs in EM.

OBJECTIVE: Given the absence of a comprehensive framework for exploring the roles of LLMs in EM, this scoping review aims to systematically map the existing literature on LLMs' potential applications within EM and identify directions for future research. Addressing this gap will allow for informed advancements in the field.

METHODS: Using PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) criteria, we searched Ovid MEDLINE, Embase, Web of Science, and Google Scholar for papers published between January 2018 and August 2023 that discussed LLMs' use in EM. We excluded other forms of AI. A total of 1994 unique titles and abstracts were screened, and each full-text paper was independently reviewed by 2 authors. Data were abstracted independently, and 5 authors performed a collaborative quantitative and qualitative synthesis of the data.

RESULTS: A total of 43 papers were included. Studies were predominantly from 2022 to 2023 and conducted in the United States and China. We uncovered four major themes: (1) clinical decision-making and support was highlighted as a pivotal area, with LLMs playing a substantial role in enhancing patient care, notably through their application in real-time triage, allowing early recognition of patient urgency; (2) efficiency, workflow, and information management demonstrated the capacity of LLMs to significantly boost operational efficiency, particularly through the automation of patient record synthesis, which could reduce administrative burden and enhance patient-centric care; (3) risks, ethics, and transparency were identified as areas of concern, especially regarding the reliability of LLMs' outputs, and specific studies highlighted the challenges of ensuring unbiased decision-making amidst potentially flawed training data sets, stressing the importance of thorough validation and ethical oversight; and (4) education and communication possibilities included LLMs' capacity to enrich medical training, such as through using simulated patient interactions that enhance communication skills.

CONCLUSIONS: LLMs have the potential to fundamentally transform EM, enhancing clinical decision-making, optimizing workflows, and improving patient outcomes. This review sets the stage for future advancements by identifying key research areas: prospective validation of LLM applications, establishing standards for responsible use, understanding provider and patient perceptions, and improving physicians' AI literacy. Effective integration of LLMs into EM will require collaborative efforts and thorough evaluation to ensure these technologies can be safely and effectively applied.

PubMed 2026 Jan

Large Language Model Agent for Managing Patients With Suspected Hypertension.

Wang Yijun, Tan Wuping, Cheng Siyi, Peng Chen, Peng Jin, Qin Fanglin, Tang Long, Zhu Tongjian, Wu Bing, Liu Jinjun, Wang Jun

Hypertension (Dallas, Tex. : 1979)


BACKGROUND: The effectiveness of Large Language Model agent frameworks for hypertension screening and personalized health management has not been fully studied. This study aimed to develop and evaluate a Large Language Model-based Agent, called the Cascade Framework, and assess its effectiveness in hypertension education and clinical decision support.

METHODS: The Cascade Framework was developed utilizing the Dify platform, and its performance was tested via a robust 2-phase evaluation protocol from August 2024 to June 2025. The first phase involved systematic performance benchmarking of 6 configurations: 3 foundational Large Language Models (Chat Generative Pretrained Transformer [ChatGPT]-4o, ChatGPT-4oMini, and DeepSeek-V3) and their respective Cascade-enhanced versions. The second phase included an external validation in a cohort of patients with suspected hypertension.

RESULTS: Cascade integration yielded significant performance improvements across all models. For ChatGPT-4o, educational outcomes improved (Accuracy: 3.87→4.10, P=0.02; Comprehensiveness: 4.07→4.32, P=0.16; Credibility: 3.79→4.03, P<0.001; Understandability: 3.90→3.96, P=0.005; Emotional Support: 3.87→4.01, P<0.001). Blood pressure classification accuracy rose from 62.5% to 87.0% (P<0.001) and risk factor stratification from 60.4% to 98.6% (P<0.001). Clinical decision-making accuracy improved from 72.0% to 92.5%. A similar trend was observed in the external validation cohort, where the 4o-Cascade model achieved increases in blood pressure classification accuracy (58.9%→95.3%), risk stratification accuracy (71.0%→90.7%), and clinical decision appropriateness (66.4%→92.5%), all with P<0.001 and surpassing the performance of the 3 physicians.

CONCLUSIONS: Cascade Framework can improve the management of hypertension. Its extensible architecture allows integration with existing clinical workflows while providing transparent reasoning pathways.

NASA ADS 2023-04
136 citations

Can Large Language Models Transform Computational Social Science?

Ziems, Caleb, Held, William, Shaikh, Omar, Chen, Jiaao, Zhang, Zhehao, Yang, Diyi

arXiv e-prints


Large Language Models (LLMs) are capable of successfully performing many language processing tasks zero-shot (without training data). If zero-shot LLMs can also reliably classify and explain social phenomena like persuasiveness and political ideology, then LLMs could augment the Computational Social Science (CSS) pipeline in important ways. This work provides a road map for using LLMs as CSS tools. Towards this end, we contribute a set of prompting best practices and an extensive evaluation pipeline to measure the zero-shot performance of 13 language models on 25 representative English CSS benchmarks. On taxonomic labeling tasks (classification), LLMs fail to outperform the best fine-tuned models but still achieve fair levels of agreement with humans. On free-form coding tasks (generation), LLMs produce explanations that often exceed the quality of crowdworkers' gold references. We conclude that the performance of today's LLMs can augment the CSS research pipeline in two ways: (1) serving as zero-shot data annotators on human annotation teams, and (2) bootstrapping challenging creative generation tasks (e.g., explaining the underlying attributes of a text). In summary, LLMs are poised to meaningfully participate in social science analysis in partnership with humans.
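The first proposed role, slotting a zero-shot model into a human annotation team, can be sketched as a simple majority vote in which the LLM counts as one extra annotator. The labels and merge rule below are illustrative assumptions, not the paper's actual pipeline:

```python
from collections import Counter

def merge_labels(human_labels: list, llm_label: str) -> str:
    """Treat the zero-shot LLM as one more annotator and take a majority vote.

    Human votes are listed first, so on a tie Counter's insertion order
    favors the human majority.
    """
    votes = Counter(human_labels + [llm_label])
    return votes.most_common(1)[0][0]

# Two humans disagree on whether a text is persuasive; the LLM breaks the tie.
label = merge_labels(["persuasive", "not persuasive"], "persuasive")
```

In practice one would also track per-annotator agreement with the LLM to decide when its votes are trustworthy enough to include.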

NASA ADS 2024
31 citations

A Comprehensive Survey of Convolutions in Deep Learning: Applications, Challenges, and Future Trends

Younesi, Abolfazl, Ansari, Mohsen, Fazli, Mohammadamin, Ejlali, Alireza, Shafique, Muhammad, Henkel, Jörg

IEEE Access


In today's digital age, Convolutional Neural Networks (CNNs), a subset of Deep Learning (DL), are widely used for various computer vision tasks such as image classification, object detection, and image segmentation. There are numerous types of CNNs designed to meet specific needs and requirements, including 1D, 2D, and 3D CNNs, as well as dilated, grouped, attention, depthwise convolutions, and NAS, among others. Each type of CNN has its unique structure and characteristics, making it suitable for specific tasks. It's crucial to gain a thorough understanding and perform a comparative analysis of these different CNN types to understand their strengths and weaknesses. Furthermore, studying the performance, limitations, and practical applications of each type of CNN can aid in the development of new and improved architectures in the future. We also dive into the platforms and frameworks that researchers utilize for their research or development from various perspectives. Additionally, we explore the main research fields of CNN like 6D vision, generative models, and meta-learning. This survey paper provides a comprehensive examination and comparison of various CNN architectures, highlighting their architectural differences and emphasizing their respective advantages, disadvantages, applications, challenges, and future trends.

NASA ADS 2022-10
14 citations

How Large Language Models are Transforming Machine-Paraphrased Plagiarism

Wahle, Jan Philip, Ruas, Terry, Kirstein, Frederic, Gipp, Bela

arXiv e-prints


The recent success of large language models for text generation poses a severe threat to academic integrity, as plagiarists can generate realistic paraphrases indistinguishable from original work. However, the role of large autoregressive transformers in generating machine-paraphrased plagiarism and their detection is still developing in the literature. This work explores T5 and GPT-3 for machine-paraphrase generation on scientific articles from arXiv, student theses, and Wikipedia. We evaluate the detection performance of six automated solutions and one commercial plagiarism detection software and perform a human study with 105 participants regarding their detection performance and the quality of generated examples. Our results suggest that large models can rewrite text humans have difficulty identifying as machine-paraphrased (53% mean acc.). Human experts rate the quality of paraphrases generated by GPT-3 as high as original texts (clarity 4.0/5, fluency 4.2/5, coherence 3.8/5). The best-performing detection model (GPT-3) achieves a 66% F1-score in detecting paraphrases.

NASA ADS 2024-10
15 citations

Large language models in plant biology

Lam, Hilbert Yuen In, Ong, Xing Er, Mutwil, Marek

Trends in Plant Science


Large language models (LLMs), such as ChatGPT, have taken the world by storm. However, LLMs are not limited to human language and can be used to analyze sequential data, such as DNA, protein, and gene expression. The resulting foundation models can be repurposed to identify the complex patterns within the data, resulting in powerful, multipurpose prediction tools able to predict the state of cellular systems. This review outlines the different types of LLMs and showcases their recent uses in biology. Since LLMs have not yet been embraced by the plant community, we also cover how these models can be deployed for the plant kingdom.

arXiv 2024-02-05

A Survey on Transformer Compression

Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao

arXiv:2402.05964v2 [cs.LG]


The Transformer plays a vital role in natural language processing (NLP) and computer vision (CV), especially for constructing large language models (LLMs) and large vision models (LVMs). Model compression methods reduce the memory and computational cost of Transformers, which is a necessary step for deploying large language/vision models on practical devices. Given the unique architecture of the Transformer, featuring alternating attention and feedforward neural network (FFN) modules, specific compression techniques are usually required. The efficiency of these compression methods is also paramount, as retraining large models on the entire training dataset is usually impractical. This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models. The compression methods are primarily categorized into pruning, quantization, knowledge distillation, and efficient architecture design (Mamba, RetNet, RWKV, etc.). In each category, we discuss compression methods for both language and vision tasks, highlighting common underlying principles. Finally, we delve into the relation between various compression methods and discuss further directions in this domain.
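Of the compression families the survey lists, quantization is the easiest to make concrete. A minimal sketch of symmetric post-training int8 quantization for a single weight tensor (a toy illustration, not any specific method from the survey):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor post-training quantization to int8.

    Maps float weights onto the integer grid [-127, 127] with a single
    scale factor; dequantization is just a multiply by that scale.
    """
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([[0.5, -1.27], [0.01, 1.0]], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Reconstruction error is bounded by half a quantization step (scale / 2).
```

Real pipelines refine this with per-channel scales, zero-points for asymmetric ranges, and calibration data, but the memory saving (4 bytes down to 1 per weight) comes from exactly this mapping.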

arXiv 2024-02-18

Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents

Renxi Wang, Haonan Li, Xudong Han, Yixuan Zhang, Timothy Baldwin

arXiv:2402.11651v2 [cs.CL]


Large language models (LLMs) have achieved success in acting as agents, which interact with environments through tools such as search engines. However, LLMs are optimized for language generation rather than tool use during training or alignment, limiting their effectiveness as agents. To resolve this problem, previous work has first collected interaction trajectories between LLMs and environments, then used only the trajectories that successfully finished the task to fine-tune smaller models, making fine-tuning data scarce and acquiring it both difficult and costly. Discarding failed trajectories also leads to significant wastage of data and resources and limits the possible optimization paths during fine-tuning. In this paper, we argue that unsuccessful trajectories offer valuable insights, and LLMs can learn from these trajectories through appropriate quality control and fine-tuning strategies. By simply adding a prefix or suffix that tells the model whether to generate a successful trajectory during training, we improve model performance by a large margin on mathematical reasoning, multi-hop question answering, and strategic question answering tasks. We further analyze the inference results and find that our method provides a better trade-off between valuable information and errors in unsuccessful trajectories. To our knowledge, we are the first to demonstrate the value of negative trajectories and their application in agent-tuning scenarios. Our findings offer guidance for developing better agent-tuning methods and low-resource data usage techniques.
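The core trick, marking each trajectory with its outcome so the model can be asked for the successful behavior at inference time, amounts to a small data-preparation step. The field names and the "[GOOD]"/"[BAD]" tokens below are illustrative assumptions, not the paper's exact format:

```python
# Hypothetical trajectory records; `success` flags whether the task was solved.
trajectories = [
    {"prompt": "Solve 2+2.", "response": "4", "success": True},
    {"prompt": "Solve 2+2.", "response": "5", "success": False},
]

def to_training_example(traj: dict) -> dict:
    """Keep failed trajectories instead of discarding them, marking each with
    an outcome prefix so the model learns to condition on the desired outcome."""
    prefix = "[GOOD]" if traj["success"] else "[BAD]"
    return {"input": f"{prefix} {traj['prompt']}", "target": traj["response"]}

examples = [to_training_example(t) for t in trajectories]
# At inference time, prompts are prefixed with "[GOOD]" to request
# a successful trajectory.
```

The payoff is that every collected trajectory becomes usable fine-tuning data rather than only the successful minority.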

arXiv 2023-12-17

Demystifying Instruction Mixing for Fine-tuning Large Language Models

Renxi Wang, Haonan Li, Minghao Wu, Yuxia Wang, Xudong Han, Chiyu Zhang, Timothy Baldwin

arXiv:2312.10793v3 [cs.CL]


Instruction tuning significantly enhances the performance of large language models (LLMs) across various tasks. However, how to optimize the mixing of instruction datasets for LLM fine-tuning is still poorly understood. This study categorizes instructions into three primary types: NLP downstream tasks, coding, and general chat. We explore the effects of instruction tuning with different combinations of these datasets on LLM performance, and find that certain instruction types are more advantageous for specific applications but can negatively impact other areas. This work provides insights into instruction mixtures, laying the foundations for future research.
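A mixture over the three instruction categories can be sketched as weighted sampling from per-category pools. The pool names and weights below are illustrative; the paper studies which proportions work, it does not prescribe these:

```python
import random

def mix_instructions(datasets: dict, proportions: dict, total: int, seed: int = 0):
    """Sample a fine-tuning mixture from instruction pools by category weight."""
    rng = random.Random(seed)
    mixture = []
    for name, pool in datasets.items():
        k = round(total * proportions[name])  # examples drawn from this category
        mixture.extend(rng.choices(pool, k=k))
    rng.shuffle(mixture)
    return mixture

pools = {
    "nlp": [f"nlp-{i}" for i in range(100)],
    "code": [f"code-{i}" for i in range(100)],
    "chat": [f"chat-{i}" for i in range(100)],
}
mix = mix_instructions(pools, {"nlp": 0.5, "code": 0.3, "chat": 0.2}, total=10)
# 10 examples with a 5/3/2 split across the three categories.
```

Sweeping the `proportions` dictionary and evaluating per-category downstream performance is the kind of experiment the study runs at scale.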

arXiv 2023-04-24

WizardLM: Empowering large pre-trained language models to follow complex instructions

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, Daxin Jiang

The Twelfth International Conference on Learning Representations (ICLR 2024)


Training large language models (LLMs) with open-domain instruction-following data has brought colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive, and humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using an LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna's test set show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high-complexity part, we demonstrate that outputs from our WizardLM are preferred to outputs from OpenAI ChatGPT. In GPT-4 automatic evaluation, WizardLM achieves more than 90% of ChatGPT's capacity on 17 out of 29 skills. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing LLMs. Our code and data are public at https://github.com/nlpxucan/WizardLM
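The Evol-Instruct loop, repeatedly asking a model to rewrite an instruction into a harder variant, can be sketched with a stub standing in for the LLM call. The evolution prompts here are loose paraphrases, not the paper's exact prompts:

```python
EVOLUTION_PROMPTS = [
    "Add one extra constraint to the instruction below:",
    "Rewrite the instruction below to require multi-step reasoning:",
]

def evolve(instruction: str, llm, rounds: int = 2) -> list:
    """Iteratively rewrite an instruction into progressively harder versions."""
    chain = [instruction]
    for step in range(rounds):
        prompt = EVOLUTION_PROMPTS[step % len(EVOLUTION_PROMPTS)]
        instruction = llm(f"{prompt}\n{instruction}")
        chain.append(instruction)
    return chain

# A stub that only tags its input lets the loop run end to end;
# in Evol-Instruct proper this would be a real model call.
stub_llm = lambda p: p.splitlines()[-1] + " [harder]"
chain = evolve("Write a sorting function.", stub_llm)
```

The full method also filters out failed evolutions (e.g. rewrites that drop information) before mixing the chain into the fine-tuning set.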

OpenAlex 2021-10-01
28738 citations

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo

2021 IEEE/CVF International Conference on Computer Vision (ICCV)


This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
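The shifted-window scheme described above — self-attention restricted to non-overlapping local windows, with the partition cyclically shifted between consecutive layers so neighboring windows exchange information — can be illustrated with a small NumPy sketch. This shows only the partitioning geometry, not the attention computation or masking; function names are illustrative.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def shifted_windows(x, ws):
    """Cyclic-shift by ws//2 so the new windows straddle the old window borders,
    giving cross-window connections in alternating layers."""
    shifted = np.roll(x, shift=(-(ws // 2), -(ws // 2)), axis=(0, 1))
    return window_partition(shifted, ws)

# An 8x8 feature map with one channel, partitioned into 4x4 windows.
x = np.arange(8 * 8 * 1).reshape(8, 8, 1).astype(float)
wins = window_partition(x, ws=4)      # 4 windows of 16 tokens each
shifted = shifted_windows(x, ws=4)    # same shape, but boundaries moved
```

Because attention is computed only within each `ws * ws` window, cost grows linearly with the number of windows, i.e. with image size — the linear-complexity property the abstract claims.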

OpenAlex 2021-07-01
164 citations

The great Transformer: Examining the role of large language models in the political economy of AI

Dieuwertje Luitse, Wiebke Denkena

Big Data & Society


In recent years, AI research has become more and more computationally demanding. In natural language processing (NLP), this tendency is reflected in the emergence of large language models (LLMs) like GPT-3. These powerful neural network-based models can be used for a range of NLP tasks and their language generation capacities have become so sophisticated that it can be very difficult to distinguish their outputs from human language. LLMs have raised concerns over their demonstrable biases, heavy environmental footprints, and future social ramifications. In December 2020, critical research on LLMs led Google to fire Timnit Gebru, co-lead of the company’s AI Ethics team, which sparked a major public controversy around LLMs and the growing corporate influence over AI research. This article explores the role LLMs play in the political economy of AI as infrastructural components for AI research and development. Retracing the technical developments that have led to the emergence of LLMs, we point out how they are intertwined with the business model of big tech companies and further shift power relations in their favour. This becomes visible through the Transformer, which is the underlying architecture of most LLMs today and started the race for ever bigger models when it was introduced by Google in 2017. Using the example of GPT-3, we shed light on recent corporate efforts to commodify LLMs through paid API access and exclusive licensing, raising questions around monopolization and dependency in a field that is increasingly divided by access to large-scale computing power.

OpenAlex 2023-07-17
107 citations

Retentive Network: A Successor to Transformer for Large Language Models

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei

arXiv (Cornell University)


In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded in parallel while the chunks are summarized recurrently. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. The intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at https://aka.ms/retnet.
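The parallel/recurrent duality at the heart of retention can be verified numerically: the full-sequence form applies an exponentially decayed causal mask to $QK^\top$, while the recurrent form keeps an $O(1)$ state updated per token, and both yield identical outputs. This sketch uses a single head with scalar decay and omits the normalization and gating of the full architecture.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Full-sequence form: O[t] = sum_{s<=t} gamma**(t-s) * (Q[t]·K[s]) * V[s]."""
    T = Q.shape[0]
    # Causal decay mask D[t, s] = gamma^(t-s) for s <= t, else 0.
    D = np.array([[gamma ** (t - s) if s <= t else 0.0 for s in range(T)]
                  for t in range(T)])
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Same outputs via an O(1)-per-step state update: S = gamma*S + k^T v."""
    d, dv = Q.shape[1], V.shape[1]
    S = np.zeros((d, dv))
    out = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)   # decayed running summary of (k, v) pairs
        out.append(q @ S)                # read out with the current query
    return np.stack(out)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 4)) for _ in range(3))
a = retention_parallel(Q, K, V, gamma=0.9)
b = retention_recurrent(Q, K, V, gamma=0.9)
```

The parallel form is what makes training efficient on accelerators; the recurrent form is what gives cheap autoregressive decoding, since each step touches only the fixed-size state `S` rather than the whole history.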