MedCite: Can Language Models Generate Verifiable Text for Medicine?

1University of Illinois Urbana-Champaign 2Brown University 3National Library of Medicine, National Institutes of Health 4University of Virginia 5Microsoft

News

  • 2025-05-15 MedCite has been accepted to ACL 2025 Findings!

Abstract

Existing LLM-based medical question-answering systems lack citation generation and evaluation capabilities, raising concerns about their adoption in practice. In this work, we introduce MedCite, the first end-to-end framework that facilitates the design and evaluation of citation generation with LLMs for medical tasks. In addition, we propose a novel multi-pass retrieval-citation method that generates high-quality citations.

Our evaluation highlights the challenges and opportunities of citation generation for medical tasks, and identifies design choices that significantly impact final citation quality. Our proposed method achieves substantial improvements in citation precision and recall over strong baseline methods, and we show that our automatic evaluation results correlate well with annotations from professional experts.

Method

MedCite integrates multiple design choices for citation generation in medical question answering. The system combines non-parametric citation, retrieval-augmented generation (RAG), and retrieval with LLM-based reranking.

Multi-pass citation generation

To fully leverage both the LLM's internal knowledge and external retrieval capabilities, MedCite adopts a multi-pass design. The model first generates an initial answer while assigning preliminary citations based on retrieved documents. It then performs a second retrieval step for each statement, identifying additional supporting documents. After deduplication, citations from both stages are merged to ensure comprehensive evidence coverage. This hybrid approach combines the strengths of generation and retrieval, improving citation precision and recall while maintaining answer correctness.
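To make the flow concrete, below is a minimal sketch of the two-pass loop in Python. The helpers `retrieve` and `llm_answer_with_citations` are hypothetical stand-ins for a retriever and a citation-aware generation call, not the released MedCite API.

```python
# Minimal sketch of the multi-pass citation flow described above.
# The helpers below are hypothetical stand-ins, not the released MedCite API.

from dataclasses import dataclass, field


@dataclass
class CitedStatement:
    text: str
    citations: set[str] = field(default_factory=set)  # supporting document IDs


def retrieve(query: str, k: int = 5) -> list[str]:
    """Hypothetical retriever returning top-k document IDs for a query."""
    raise NotImplementedError


def llm_answer_with_citations(question: str, docs: list[str]) -> list[CitedStatement]:
    """Pass 1 (hypothetical): the LLM drafts an answer conditioned on the
    retrieved documents and tags each statement with preliminary citations."""
    raise NotImplementedError


def multi_pass_cite(question: str) -> list[CitedStatement]:
    # Pass 1: generate the answer with preliminary, retrieval-grounded citations.
    statements = llm_answer_with_citations(question, retrieve(question))

    # Pass 2: re-retrieve evidence for each individual statement.
    for stmt in statements:
        extra_docs = retrieve(stmt.text)
        # Merging via set union deduplicates citations found in both passes.
        stmt.citations |= set(extra_docs)
    return statements
```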

Hierarchical two-stage retrieval

Accurate citation requires both high recall and factual precision. MedCite implements a hierarchical retrieval strategy where BM25 is first used to retrieve keyword-matching candidates, followed by semantic reranking with MedCPT to prioritize factually relevant documents. This two-stage design improves the system's ability to locate precise evidence for complex biomedical claims.
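The sketch below illustrates one way to implement such a two-stage pipeline. It assumes the rank_bm25 package for stage one and the public ncbi/MedCPT-Cross-Encoder checkpoint on Hugging Face for stage two; the actual MedCite retrieval stack may differ in corpus handling and hyperparameters.

```python
# Illustrative two-stage retrieval: BM25 recall pass, then semantic reranking.
# Assumes: pip install rank_bm25 torch transformers

import torch
from rank_bm25 import BM25Okapi
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# MedCPT cross-encoder released by NCBI on Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("ncbi/MedCPT-Cross-Encoder")
reranker = AutoModelForSequenceClassification.from_pretrained("ncbi/MedCPT-Cross-Encoder")


def two_stage_retrieve(query: str, corpus: list[str], k1: int = 100, k2: int = 10) -> list[str]:
    # Stage 1: BM25 keyword matching yields a high-recall candidate pool.
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    scores = bm25.get_scores(query.split())
    top = sorted(range(len(corpus)), key=lambda i: -scores[i])[:k1]
    candidates = [corpus[i] for i in top]

    # Stage 2: the cross-encoder reranks candidates by semantic relevance.
    pairs = [[query, doc] for doc in candidates]
    inputs = tokenizer(pairs, truncation=True, padding=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        rerank_scores = reranker(**inputs).logits.squeeze(-1)
    order = rerank_scores.argsort(descending=True)[:k2]
    return [candidates[i] for i in order]
```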

Evaluation

We evaluate MedCite against several strong baselines covering both medical-domain and general-domain citation generation methods. Specifically, we compare with the following methods (a code-level sketch of the two pipeline baselines appears after the list):


  1. MedRAG: A retrieval-augmented generation approach for medical question answering.

  2. Post-retrieval generation (PRG): An approach where the LLM generates answers directly conditioned on both the query and retrieved documents, with citations embedded during generation.

  3. Post-generation citation (PGC): A pipeline where the LLM first generates answers using chain-of-thought prompting, and citations are added afterward through re-retrieval and LLM-based reranking.
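For concreteness, here is a rough contrast of the two pipeline baselines in Python. The `llm` and `retrieve` helpers and the prompt wording are hypothetical placeholders, not the prompts used in the paper.

```python
# Hedged contrast of the PRG and PGC baseline pipelines.
# llm() and retrieve() are hypothetical placeholders.

def llm(prompt: str) -> str:
    """Hypothetical LLM completion call."""
    raise NotImplementedError


def retrieve(query: str, k: int = 5) -> list[str]:
    """Hypothetical retriever returning top-k documents."""
    raise NotImplementedError


def post_retrieval_generation(question: str) -> str:
    """PRG: citations are embedded while the answer is generated."""
    docs = retrieve(question)
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return llm(
        f"Documents:\n{context}\n\nQuestion: {question}\n"
        "Answer the question, citing supporting documents inline as [n]."
    )


def post_generation_citation(question: str) -> str:
    """PGC: the answer is drafted first; citations are attached afterward."""
    draft = llm(f"Question: {question}\nLet's think step by step, then answer.")
    cited = []
    for sentence in draft.split(". "):  # naive sentence split, for illustration only
        docs = retrieve(sentence)  # re-retrieve evidence per statement
        numbered = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
        # LLM-based reranking: ask which candidate best supports the claim.
        idx = llm(
            "Which document best supports the claim below? Reply with its number.\n"
            f"Claim: {sentence}\nCandidates:\n{numbered}"
        )
        cited.append(f"{sentence} [{idx.strip()}]")
    return ". ".join(cited)
```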

Experiments are conducted on three models: Llama-3-8B-Instruct, UltraMedical, and the commercial model GPT-4o. The evaluation focuses on both citation accuracy and answer correctness across different model sizes and domains.


BibTeX

Citation information will be available soon after arXiv publication.