Analysis of Progress in Data Mining of Scientific Literature Using Large Language Models
| Main Author: | |
|---|---|
| Format: | Article |
| Language: | zho |
| Published: | Editorial Department of Journal of Library and Information Science in Agriculture, 2025-02-01 |
| Series: | Nongye tushu qingbao xuebao |
| Subjects: | |
| Online Access: | http://nytsqb.aiijournal.com/fileup/1002-1248/PDF/1747741303657-999650976.pdf |
| Summary: | [Purpose/Significance] Scientific literature contains rich domain knowledge and scientific data, which can provide high-quality data support for AI-driven scientific research (AI4S). This paper systematically reviews the methods, tools, and applications of large language models (LLMs) in scientific literature data mining, and discusses their research directions and development trends. It addresses critical shortcomings in interdisciplinary knowledge extraction and provides practical insights to enhance AI4S workflows, thereby aligning AI capabilities with domain-specific scientific needs. [Method/Process] This study employs a systematic literature review and case analysis to formulate a tripartite framework: 1) Methodological dimension: Textual knowledge mining uses dynamic prompts, few-shot learning, and domain-adaptive pre-training (such as MagBERT and MatSciBERT) to improve entity recognition. Scientific data extraction uses chain-of-thought prompting and knowledge graphs (such as ChatExtract and SynAsk) to parse experimental datasets. Chart decoding uses neural networks to extract numerical values and semantic patterns from visual elements. 2) Tool dimension: The review examines the core functionalities of notable AI tools, including data mining platforms (such as LitU and SciAIEngine) and knowledge generation systems (such as Agent Laboratory and VirSci), with a focus on multimodal processing and automation. 3) Application dimension: LLMs produce high-quality datasets to tackle the issue of data scarcity. They facilitate tasks such as predicting material properties and diagnosing medical conditions. The scientific credibility of these datasets is ensured through a process of "LLMs + expert validation". [Results/Conclusions] The findings indicate that LLMs significantly improve the automation of scientific literature mining. Methodologically, this research introduces dynamic prompt learning frameworks and domain adaptation fine-tuning technologies to address the shortcomings of traditional rule-driven approaches. In terms of tools, cross-modal parsing tools and interactive analysis platforms have been developed to facilitate end-to-end data mining and knowledge generation. In terms of applications, the study has accelerated the transition of scientific literature from single-modal to multimodal formats, thereby supporting the creation of high-quality scientific datasets, vertical domain-specific models, and knowledge service platforms. However, significant challenges remain, including insufficient depth of domain knowledge embedding, the low efficiency of multimodal data collaboration, and a lack of model interpretability. Future research should focus on developing interpretable LLMs with knowledge graph integration, improving cross-modal alignment techniques, and integrating "human-in-the-loop" systems to enhance reliability. It is also imperative to establish standardized data governance and intellectual property frameworks to promote the ethical utilization of scientific literature data. Such advances will facilitate a shift from efficiency optimization to knowledge generation in AI4S. |
|---|---|
| ISSN: | 1002-1248 |