An instruction dataset for extracting quantum cascade laser properties from scientific text

Quantum Cascade Lasers (QCL) are promising semiconductor lasers, compact and powerful, but of complex design. Availability of structured data of the QCL properties can support data mining activities that seek to understand the relationship between these properties, for instance between the design an...

Bibliographic Details
Main Authors: Deperias Kerre, Anne Laurent, Kenneth Maussang, Dickson Owuor
Format: Article
Language: English
Published: Elsevier 2025-02-01
Series: Data in Brief
Subjects: Information extraction; Large language models; Machine learning; Quantum cascade lasers
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340924012174
collection DOAJ
description Quantum Cascade Lasers (QCLs) are promising semiconductor lasers that are compact and powerful but complex to design. Structured data on QCL properties can support data mining activities that seek to understand the relationships between these properties, for instance between design and performance features. The main open source of QCL data is scientific text, which is usually unstructured. One way to extract and organize this data is to use Information Extraction techniques, which can accelerate the curation of QCL property data from scientific articles for further analysis. One of the main challenges in developing machine learning algorithms for extracting QCL properties from text is the lack of quality training data. Large Language Models (LLMs) have demonstrated strong capabilities in extracting materials properties from text. However, they struggle with domain-specific properties, for instance the heterostructure and design types in the QCL domain, hence the need for adaptation. In this paper, we present an original instruction dataset for training and evaluating LLMs for QCL property extraction from text. The data is generated by augmenting sample sentences from scientific articles using GPT-3.5 instruct with a few-shot strategy. The dataset is then manually annotated with the help of QCL experts and is composed of 1300 rows of training examples, each consisting of an Instruction, an Input Text, and the Output.
format Article
id doaj-art-21107ef164534a18b62dc0887ba64739
institution Kabale University
issn 2352-3409
language English
publishDate 2025-02-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj-art-21107ef164534a18b62dc0887ba64739
2025-01-31T05:11:42Z
eng
Elsevier
Data in Brief, ISSN 2352-3409, 2025-02-01, volume 58, article 111255
An instruction dataset for extracting quantum cascade laser properties from scientific text
Deperias Kerre: LIRMM, Univ Montpellier, CNRS, Montpellier, France; SCES, Strathmore University, Nairobi, Kenya (corresponding author at: LIRMM, Univ Montpellier, CNRS, Montpellier, France)
Anne Laurent: LIRMM, Univ Montpellier, CNRS, Montpellier, France
Kenneth Maussang: Institut d'Electronique et des Systèmes, UMR 5214, Univ Montpellier, CNRS, Montpellier, France
Dickson Owuor: SCES, Strathmore University, Nairobi, Kenya
http://www.sciencedirect.com/science/article/pii/S2352340924012174
Information extraction
Large language models
Machine learning
Quantum cascade lasers
title An instruction dataset for extracting quantum cascade laser properties from scientific text
topic Information extraction
Large language models
Machine learning
Quantum cascade lasers
url http://www.sciencedirect.com/science/article/pii/S2352340924012174