LLMs and Evidence Summarization
Yifan Peng
April 19, 2024, Friday, 2:00 PM - 3:00 PM EDT
Generative AI, exemplified by large language models (LLMs), shows great promise in assisting medical evidence summarization. However, concerns have been raised about the quality of outputs generated by pre-trained LLMs, which can result in harmful misinformation. In this talk, I will first discuss our investigation into the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. Our study demonstrates that LLMs can be susceptible to generating factually inconsistent summaries and making overly convincing or uncertain statements, leading to potential harm due to misinformation. Furthermore, we observe that automatic metrics often do not strongly correlate with the quality of summaries. I will then discuss our research on the impact of fine-tuning LLMs to enhance their performance in evidence summarization. We found that, compared to zero-shot learning, fine-tuning improved scores on automatic evaluation metrics such as ROUGE, METEOR, chrF, and PICO-F1. We also found that smaller fine-tuned models sometimes outperformed larger zero-shot models. These improvements were also observed in both human and GPT-4-simulated evaluations. Our findings confirm the potential of LLMs to empower medical evidence summarization.
Dr. Yifan Peng obtained his M.S. and Ph.D. from the University of Delaware. His research lab is primarily interested in developing and applying computational approaches to biomedical text data and medical images (i.e., natural language processing and medical image analysis). The work is motivated by the integration of clinically inspired approaches into machine learning and, reciprocally, the use of these approaches to better understand decision-making in clinical systems. The three goals for future research are informatics-empowered diagnostics and prognosis assistance, patient-centered multimodal data processing and mining, and AI technology for unstructured biomedical text. Taken together, the long-term impact of his lab's research will be to allow medical personnel to consider different dimensions of clinical and scientific data to find the best viable treatment methods for a complex medical condition. This will improve diagnostic performance, provide consistent recommendations for follow-up, and ultimately assist in the creation of high-quality information services relevant to public health.