OryzaGP: rice gene and protein dataset for named-entity recognition

Bibliographic Details
Main Authors: Pierre Larmande, Huy Do, Yue Wang
Format: Article
Language: English
Published: BioMed Central 2019-06-01
Series: Genomics & Informatics
Online Access: http://genominfo.org/upload/pdf/gi-2019-17-2-e17.pdf
author Pierre Larmande
Huy Do
Yue Wang
collection DOAJ
description Text mining has become an important research method in biology. Its original purpose was to extract biological entities, such as genes, proteins and phenotypic traits, from scientific papers in order to extend knowledge. However, few thorough studies on text mining and application development have been performed for plant molecular biology data, especially for rice, resulting in a lack of datasets for named-entity recognition tasks in this species. Because few benchmarks are available for rice, we faced various difficulties in exploiting advanced machine learning methods for accurate analysis of the rice literature. To evaluate several approaches to automatically extracting information on gene/protein entities, we built a new dataset for rice as a benchmark. The dataset is composed of titles and abstracts of scientific papers focusing on rice, downloaded from PubMed. During the 5th Biomedical Linked Annotation Hackathon, a portion of the dataset was uploaded to PubAnnotation for sharing. Our ultimate goal is to use the dataset to offer a shared task on rice gene/protein name recognition through the BioNLP Open Shared Tasks framework, facilitating an open comparison and evaluation of different approaches to the task.
format Article
id doaj-art-6e73defb9b5c4efa9f4d304f5edad4f1
institution Kabale University
issn 2234-0742
language English
publishDate 2019-06-01
publisher BioMed Central
record_format Article
series Genomics & Informatics
spelling doaj-art-6e73defb9b5c4efa9f4d304f5edad4f1
2025-02-02T22:28:50Z
eng
BioMed Central
Genomics & Informatics
2234-0742
2019-06-01
17
2
10.5808/GI.2019.17.2.e17
559
OryzaGP: rice gene and protein dataset for named-entity recognition
Pierre Larmande (0): UMR DIADE, Institute of Research for Sustainable Development (IRD), F-34394 Montpellier, France
Huy Do (1): ICT Lab, University of Science and Technology of Hanoi (USTH), 100000 Hanoi, Vietnam
Yue Wang (2): Database Center for Life Science (DBCLS), Chiba 277-0871, Japan
http://genominfo.org/upload/pdf/gi-2019-17-2-e17.pdf
named-entity recognition
natural language processing
Oryza sativa
plant molecular biology
rice
text mining
title OryzaGP: rice gene and protein dataset for named-entity recognition
topic named-entity recognition
natural language processing
Oryza sativa
plant molecular biology
rice
text mining
url http://genominfo.org/upload/pdf/gi-2019-17-2-e17.pdf