Historical Text Corpora for the Humanities and Social Sciences. Digitization, Annotation, Quality Assurance and Analysis
The guiding questions of this one-week course (16 teaching hours) are:
- How can historical texts be digitized, annotated and registered in a sustainable, interoperable format?
- How can these textual resources be integrated into the infrastructure of CLARIN(-D)?
- How can these textual resources be processed further using CLARIN(-D) tools?
This course gives an overview of methods for creating standards-conformant, interoperable resources of historical texts. We will show how new textual resources can be built up, and existing ones processed further, to meet the requirements of CLARIN. In this context, we present different possible workflows for integrating textual resources into the CLARIN-D infrastructure, using the Deutsches Textarchiv (DTA) project as an example. The problem areas discussed include:
- the acquisition and provision of high quality image sources,
- guidelines for transcriptions true to the source material,
- text structuring and metadata recording according to the TEI-P5 based DTA ›Base Format‹ (DTABf), the best practice format for the annotation of historical written corpora in
- quality assurance within the digitization process.
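To give a rough idea of what structured, TEI-P5 based text and metadata look like in practice, the following minimal sketch reads a title, an author and the number of paragraphs from a small TEI document. The sample document, the names in it and the helper `read_tei_summary` are invented for illustration. Only generic TEI P5 elements (`teiHeader`, `titleStmt`, `title`, `author`, `text`, `p`) are used, and the specific constraints of the DTABf are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Namespace of TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document, invented for illustration (not a DTA text,
# and not validated against the DTABf schema).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title type="main">Beispieltitel</title>
        <author>Erika Mustermann</author>
      </titleStmt>
      <publicationStmt><p>Hypothetical sample.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <head>Erstes Capitel</head>
        <p>Ein kurzer Absatz.</p>
        <p>Noch ein Absatz.</p>
      </div>
    </body>
  </text>
</TEI>"""


def read_tei_summary(xml_string):
    """Return title, author and paragraph count of a TEI document (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    # Metadata is recorded in the header, separately from the transcribed text.
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    author = root.findtext(".//tei:titleStmt/tei:author", default="", namespaces=TEI_NS)
    # Count only paragraphs inside <text>, not those in the header.
    paragraphs = root.findall(".//tei:text//tei:p", TEI_NS)
    return {
        "title": title.strip(),
        "author": author.strip(),
        "paragraphs": len(paragraphs),
    }


summary = read_tei_summary(SAMPLE_TEI)
print(summary)
```

Because metadata and text structure are encoded with the same element names in every document, a script like this can in principle be applied unchanged to any resource that follows the same format, which is the practical benefit of homogeneous annotation.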
We will introduce several tools and services provided by the DTA and by CLARIN that support the different tasks involved in building up CLARIN-conformant resources. Finally, we will demonstrate how resources that are built up and structured homogeneously according to the DTABf can be linguistically analyzed with automatic methods and automatically converted into other standardized formats, how they can be made available and preserved in the long term within the CLARIN infrastructure, and how they can be analyzed using CLARIN and DTA tools.
DTAE workflow (www.deutschestextarchiv.de/dtae)
- Deutsches Textarchiv DTA: www.deutschestextarchiv.de
- DTA-Extensions DTAE: www.deutschestextarchiv.de/dtae
- DTA ›Base Format‹ DTABf: www.deutschestextarchiv.de/doku/basisformat
- DTA Quality Assurance Platform DTAQ: www.deutschestextarchiv.de/dtaq
- Cascaded Analysis Broker CAB (tool for the automatic modernization of historical orthography): www.deutschestextarchiv.de/doku/software#cab
- CLARIN-D Service Center at the Zentrum Sprache of the BBAW: http://clarin.bbaw.de/
- CLARIN-D WP 5: Services and Resources:
- CLARIN-D Curation Project 1 WG 1: http://www.deutschestextarchiv.de/clarin_kupro