The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources

In this article we describe the DTA “Base Format” (DTABf), a strict subset of the TEI P5 tag set. The purpose of the DTABf is to provide a balance between expressiveness and precision as well as an interoperable annotation scheme for a large variety of text types of historical corpora of printed tex...

Bibliographic Details
Main Authors:	Susanne Haaf, Alexander Geyken, Frank Wiegand
Format:	Article
Language:	deu
Published:	Text Encoding Initiative Consortium 2015-04-01
Series:	Journal of the Text Encoding Initiative
Subjects:	interchange interoperability standardization corpus annotation schema design TEI customization
Online Access:	https://journals.openedition.org/jtei/1114

Description:	In this article we describe the DTA “Base Format” (DTABf), a strict subset of the TEI P5 tag set. The purpose of the DTABf is to provide a balance between expressiveness and precision as well as an interoperable annotation scheme for a large variety of text types of historical corpora of printed text from multiple sources. The DTABf has been developed on the basis of a large amount of historical text data in the core corpus of the project Deutsches Textarchiv (DTA) and text collections from 15 cooperating projects with a current total of 210 million tokens. The DTABf is a “living” TEI format which is continuously adjusted when new text candidates for the DTA containing new structural phenomena are encountered. We also focus on other aspects of the DTABf including consistency, interoperability with other TEI dialects, HTML and other presentations of the TEI texts, and conversion into other formats, as well as linguistic analysis. We include some examples of best practices to illustrate how external corpora can be losslessly converted into the DTABf, thus enabling third parties to use the DTABf in their specific projects. The DTABf is comprehensively documented, and several software tools are available for working with it, making it a widely used format for the encoding of historical printed German text.
ISSN:	2162-5603
DOI:	10.4000/jtei.1114
Collection:	DOAJ
Institution:	Kabale University
Record ID:	doaj-art-cdb2aa2a292f432ea5b942a927b7447c


Similar Items