An instruction dataset for extracting quantum cascade laser properties from scientific text
Quantum Cascade Lasers (QCL) are promising semiconductor lasers, compact and powerful, but of complex design. Availability of structured data of the QCL properties can support data mining activities that seek to understand the relationship between these properties, for instance between the design an...
Main Authors: | Deperias Kerre, Anne Laurent, Kenneth Maussang, Dickson Owuor |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2025-02-01 |
Series: | Data in Brief |
Subjects: | Information extraction; Large language models; Machine learning; Quantum cascade lasers |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340924012174 |
_version_ | 1832576460102041600 |
---|---|
author | Deperias Kerre, Anne Laurent, Kenneth Maussang, Dickson Owuor |
author_sort | Deperias Kerre |
collection | DOAJ |
description | Quantum Cascade Lasers (QCLs) are promising semiconductor lasers: compact and powerful, but complex in design. Structured data on QCL properties can support data mining that seeks to understand the relationships between these properties, for instance between design and performance features. The main open source of QCL data is scientific text, which is in most cases unstructured. One way to extract and organize this data is to use Information Extraction techniques, which can accelerate the curation of QCL property data from scientific articles for further analysis. A main challenge in developing machine learning algorithms for extracting QCL properties from text is the lack of quality training data. Large Language Models (LLMs) have demonstrated strong capabilities in materials property extraction from text. However, they struggle with domain-specific properties, for instance the heterostructure and design types in the QCL domain, hence the need for adaptation. In this paper, we present an original instruction dataset for training and evaluating LLMs for QCL property extraction from text. The data is generated by augmenting sample sentences from scientific articles using GPT-3.5 Instruct with a few-shot strategy. The dataset is then manually annotated with the help of QCL experts and comprises 1300 rows of training examples, each consisting of an Instruction, an Input Text and an Output. |
format | Article |
id | doaj-art-21107ef164534a18b62dc0887ba64739 |
institution | Kabale University |
issn | 2352-3409 |
language | English |
publishDate | 2025-02-01 |
publisher | Elsevier |
record_format | Article |
series | Data in Brief |
spelling | doaj-art-21107ef164534a18b62dc0887ba64739; 2025-01-31T05:11:42Z; eng; Elsevier; Data in Brief; 2352-3409; 2025-02-01; 58; 111255; An instruction dataset for extracting quantum cascade laser properties from scientific text; Deperias Kerre (LIRMM, Univ Montpellier, CNRS, Montpellier, France; SCES, Strathmore University, Nairobi, Kenya; corresponding author at LIRMM, Univ Montpellier, CNRS, Montpellier, France); Anne Laurent (LIRMM, Univ Montpellier, CNRS, Montpellier, France); Kenneth Maussang (Institut d'Electronique et des Systèmes, UMR 5214, Univ Montpellier, CNRS, Montpellier, France); Dickson Owuor (SCES, Strathmore University, Nairobi, Kenya); http://www.sciencedirect.com/science/article/pii/S2352340924012174; Information extraction; Large language models; Machine learning; Quantum cascade lasers |
title | An instruction dataset for extracting quantum cascade laser properties from scientific text |
topic | Information extraction; Large language models; Machine learning; Quantum cascade lasers |
url | http://www.sciencedirect.com/science/article/pii/S2352340924012174 |