Multi-format publishing and re-purposing of historical linguistics data

Helena Bermúdez Sabel

University of Neuchâtel


  • Introduction
    • the WoPoss project as a use case
    • research question and goals
  • An open science workflow
  • Multi-format publishing as a re-usability strategy


The WoPoss project


  • Research question: how did Latin modal markers evolve over a long period (1,000 years)?
  • Method: automatic linguistic annotation + manual semantic annotation of a representative corpus

WoPoss as an extrapolable use case

  • Generic goals
    • to have an annotated corpus that is useful and shareable
    • to have a GUI that is useful and intuitive


  • Generic challenges
    • source text retrieval: copyright, philological quality, heterogeneity of formats
    • preprocessing: orthographic variants, typographical conventions, abbreviations, editorial information
    • annotation: pipeline, tool-dependency, formats
    • publication: formats

An open science workflow

[Figure: WoPoss workflow]

Corpus preparation

  • Text retrieval (online, open-source)
  • Homogenization

Source example: TEI EpiDoc


Source example: HTML

From “anything” to plain text

  • Analysis of each file
  • Conversion of typographical conventions and/or markup to pseudo-markup
  • Plain text output
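
The conversion step can be sketched as follows for an HTML source. This is only an illustration: the pseudo-markup conventions (`#TITLE#` for headings, underscores for italics) and the names `PSEUDO`, `PseudoMarkupExtractor` and `html_to_plain` are assumptions made for this example, not the conventions actually used in WoPoss.

```python
from html.parser import HTMLParser

# Hypothetical pseudo-markup: tag name -> (opening marker, closing marker).
PSEUDO = {
    "h1": ("\n#TITLE# ", "\n"),  # headings become a marked line
    "i": ("_", "_"),             # italics often carry editorial information
    "p": ("", "\n"),             # paragraphs become line breaks
}


class PseudoMarkupExtractor(HTMLParser):
    """Collect text content, replacing selected tags with pseudo-markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in PSEUDO:
            self.parts.append(PSEUDO[tag][0])

    def handle_endtag(self, tag):
        if tag in PSEUDO:
            self.parts.append(PSEUDO[tag][1])

    def handle_data(self, data):
        self.parts.append(data)


def html_to_plain(html: str) -> str:
    """Return plain text with pseudo-markup, one non-empty line per block."""
    parser = PseudoMarkupExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Normalize whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


print(html_to_plain("<h1>Liber I</h1><p>Gallia est <i>omnis</i> divisa</p>"))
```

Tags that are not listed in `PSEUDO` are dropped and only their text is kept, so the mapping has to be decided per source after the analysis of each file.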

Automatic linguistic analysis

  • Input: plain text
  • Method: Stanza (formerly StanfordNLP), an NLP library for Python
  • Output: CoNLL-U

[Figure: CoNLL-U example]
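
To make the output format concrete, here is a minimal sketch of a CoNLL-U reader. CoNLL-U stores one token per line in ten tab-separated columns, with blank lines between sentences and `#` comment lines. The sample sentence and its annotation values are hand-written for illustration and are not project data; `parse_conllu`, `FIELDS` and `SAMPLE` are names chosen for this example.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            continue  # sentence-level comments are skipped in this sketch
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Hand-written illustrative sample, not taken from the corpus.
SAMPLE = (
    "# text = Fieri potest\n"
    "1\tFieri\tfio\tVERB\t_\tVerbForm=Inf\t2\tcsubj\t_\t_\n"
    "2\tpotest\tpossum\tVERB\t_\tMood=Ind|Number=Sing|Person=3\t0\troot\t_\t_\n"
)

for token in parse_conllu(SAMPLE)[0]:
    print(token["form"], token["lemma"], token["upos"])
```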


Manual annotation

[Figure: INCEpTION example]


  • Export of the annotation results: UIMA CAS XMI
    • UIMA (Unstructured Information Management Architecture): a standard for annotation
    • Feature structures are represented in the UIMA Common Analysis Structure (CAS)

XMI snippets

[Figure: XMI examples]
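
A minimal sketch of how such an export can be read with the Python standard library: in UIMA CAS XMI, the text is stored in the `sofaString` attribute of a `cas:Sofa` element, and each annotation points into it through `begin` and `end` offsets. The `custom:Modality` layer, its `type` feature and the sample document are hypothetical and only stand in for a real annotation layer.

```python
import xml.etree.ElementTree as ET

XMI_NS = "http://www.omg.org/XMI"
CAS_NS = "http:///uima/cas.ecore"

# Hypothetical sample: the "custom" namespace and the Modality layer are
# assumptions made for this illustration.
SAMPLE_XMI = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
         xmlns:cas="http:///uima/cas.ecore"
         xmlns:custom="http:///webanno/custom.ecore"
         xmi:version="2.0">
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"
            mimeType="text/plain" sofaString="fieri potest ut fallar"/>
  <custom:Modality xmi:id="10" sofa="1" begin="6" end="12" type="possibility"/>
</xmi:XMI>"""


def read_annotations(xmi: str):
    """Return (layer, covered text, features) for every offset annotation."""
    root = ET.fromstring(xmi)
    # Map each Sofa id to the text it holds.
    sofas = {
        sofa.get(f"{{{XMI_NS}}}id"): sofa.get("sofaString", "")
        for sofa in root.findall(f"{{{CAS_NS}}}Sofa")
    }
    annotations = []
    for el in root:
        begin, end, sofa = el.get("begin"), el.get("end"), el.get("sofa")
        if begin is None or end is None or sofa not in sofas:
            continue  # not an offset-based annotation
        layer = el.tag.split("}")[-1]
        features = {
            key: value for key, value in el.attrib.items()
            if key not in ("begin", "end", "sofa") and not key.startswith("{")
        }
        # UIMA offsets count UTF-16 code units; plain slicing is only safe
        # for text without characters outside the Basic Multilingual Plane.
        covered = sofas[sofa][int(begin):int(end)]
        annotations.append((layer, covered, features))
    return annotations


print(read_annotations(SAMPLE_XMI))
```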

From XMI to TEI

  • More widespread format than UIMA CAS in the DH community
  • More suitable for editorial information
[Figure: TEI examples]
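
As a rough sketch of the target encoding, the function below wraps annotated tokens in TEI `<w>` elements inside an `<s>` element, using the TEI attributes `lemma`, `pos` and `ana`. How WoPoss actually models its annotation in TEI is not shown on these slides, so the element choice, the `#possibility` pointer and the name `sentence_to_tei` are assumptions.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def sentence_to_tei(tokens):
    """Build a TEI <s> element with one <w> per token.

    Each token is a dict with "form", "lemma" and "upos" keys and an
    optional "ana" key (hypothetical pointer to a semantic annotation).
    """
    s = ET.Element(f"{{{TEI_NS}}}s")
    for number, token in enumerate(tokens, start=1):
        attrs = {"n": str(number), "lemma": token["lemma"], "pos": token["upos"]}
        if "ana" in token:
            attrs["ana"] = token["ana"]
        w = ET.SubElement(s, f"{{{TEI_NS}}}w", attrs)
        w.text = token["form"]
        # Keep a space between words, but not after the last one.
        w.tail = " " if number < len(tokens) else None
    return s


tokens = [
    {"form": "Fieri", "lemma": "fio", "upos": "VERB"},
    {"form": "potest", "lemma": "possum", "upos": "VERB", "ana": "#possibility"},
]
print(ET.tostring(sentence_to_tei(tokens), encoding="unicode"))
```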

Correction and enrichment

  • Validation of the annotation
  • Correction of textual issues
  • Transformation of pseudo-markup into TEI elements
  • Addition of metadata (DHTK, Picca & Egloff 2017)
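
The transformation of pseudo-markup into TEI elements can be sketched with a few regular expressions. The pseudo-markup assumed here (`#TITLE#` at the start of a heading line, underscores around italics) is hypothetical, as are the target elements chosen for it and the name `line_to_tei`.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pseudo-markup: _text_ marks an italicized span.
ITALIC = re.compile(r"_(.+?)_")
TITLE_MARK = "#TITLE# "


def line_to_tei(line: str) -> str:
    """Convert one plain-text line with pseudo-markup into a TEI fragment."""
    # Escape XML special characters before inserting any element.
    text = escape(line.strip())
    if text.startswith(TITLE_MARK):
        return f"<head>{text[len(TITLE_MARK):]}</head>"
    text = ITALIC.sub(r'<hi rend="italic">\1</hi>', text)
    return f"<p>{text}</p>"


print(line_to_tei("#TITLE# Liber I"))
print(line_to_tei("Gallia est _omnis_ divisa"))
```

Escaping comes first so that characters such as `&` in the source text cannot break the generated XML.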

Diachronic semantic maps of modal markers

  • Initial source: synthesis of lexicographical works
  • To be reviewed with the results of the corpus-based, empirical analysis
[Figure: semantic map]

Multi-format publishing as a re-usability strategy

Resource sharing

  • Datasets
    • Diachronic semantic maps of modal markers in JSON, SVG and PNG; interactive version available online
    • Plain text version of the source texts
    • Automatic annotation results in CoNLL-U format

  • GUI
  • Web application
    • GUI-based access to the dataset and analysis functions as an eXist-DB application


Multi-format publishing for different specialist groups

  • Use of standardized formats in the workflow
  • External software dependencies: free and open source
  • Creation of customized programs tailored to the project specifications: open source
  • Open science workflow and FAIR principles (applied both during development and to the results)

Thank you!