XML Schema Proposal for Encoding Verse Corpora

July 13, 2017 | Author: Petr Plecháč | Category: Digital Humanities, Corpus Linguistics, Prosody, Metrics and Prosody

Petr Plecháč (Institute of Czech Literature, Czech Academy of Sciences)

Varun deCastro-Arrazola (Leiden University, Meertens Institute)

VERSE CORPORA (SELECTED)

WHY DO WE NEED A STANDARD?

(1) Avoid reinventing the wheel (it's very laborious).

(2) Common formats facilitate (and encourage) common tools for:

• data access: visualisations, search engines and user interfaces;

• data analysis: scripts to extract and analyse information (e.g. rhyme detection, syllable count).

(3) Reproducibility:

• understanding each other's data is easier with standards;

• being able to reproduce each other's analyses is beneficial for science.

WHY XML SCHEMA?

(1) XML is widely used as a data interchange format.

(2) Flexibility in terms of adding new categories and modifying data types.

(3) Each XML file can easily be converted into a relational database by means of a simple script, with no need to change the existing corpus structures.

(4) Great support in all common programming languages.

(5) Elaborate tagset (TEI).

[For the sake of clarity, we use common terms in tags here instead of the TEI names.]
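Point (3) can be illustrated with a minimal sketch. It assumes a hypothetical markup (`<poem>`, `<line>`, `<syllable>` and the attributes `id` and `n` are illustrative names, not the tagset of the proposal) and loads it into an in-memory SQLite database using only the Python standard library.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are assumptions for illustration.
SAMPLE = """
<poem id="p1">
  <line n="1">
    <syllable>Tvá</syllable><syllable>loď</syllable><syllable>jde</syllable>
    <syllable>po</syllable><syllable>vy</syllable><syllable>so</syllable>
    <syllable>kém</syllable><syllable>mo</syllable><syllable>ři</syllable>
  </line>
</poem>
"""

def xml_to_sqlite(xml_text: str) -> sqlite3.Connection:
    """Load lines and syllables from the XML string into an in-memory database."""
    root = ET.fromstring(xml_text)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE line (poem_id TEXT, n INTEGER)")
    conn.execute(
        "CREATE TABLE syllable (poem_id TEXT, line_n INTEGER, pos INTEGER, text TEXT)"
    )
    poem_id = root.get("id")
    for line in root.iter("line"):
        n = int(line.get("n"))
        conn.execute("INSERT INTO line VALUES (?, ?)", (poem_id, n))
        # Number the syllables by their position within the line.
        for pos, syll in enumerate(line.iter("syllable"), start=1):
            conn.execute(
                "INSERT INTO syllable VALUES (?, ?, ?, ?)",
                (poem_id, n, pos, syll.text),
            )
    conn.commit()
    return conn

conn = xml_to_sqlite(SAMPLE)
# Syllable count per line, computed in SQL.
counts = conn.execute(
    "SELECT line_n, COUNT(*) FROM syllable GROUP BY line_n ORDER BY line_n"
).fetchall()
print(counts)
```

The XML file itself is left untouched; the database is a derived view that can be rebuilt whenever the corpus changes.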

WELL-FORMEDNESS

In XML, all elements must be correctly nested: every element has to be closed inside the element in which it was opened, i.e. no element is allowed to overlap its parent.




However, the basic tree model (e.g. ...) is not suitable for encoding verse, since various units may overlap.
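The nesting constraint can be checked mechanically. The sketch below uses hypothetical tag names (`<stanza>`, `<line>`, `<sentence>`) and placeholder text; it shows that a parser accepts a strictly nested document and rejects one in which a sentence runs across a line boundary.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names and placeholder text, for illustration only.
NESTED = "<stanza><line>first line</line><line>second line</line></stanza>"
OVERLAPPING = (
    "<stanza><line><sentence>sentence starts</line>"
    "<line>and ends here</sentence></line></stanza>"
)

def is_well_formed(xml_text: str) -> bool:
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed(NESTED))
print(is_well_formed(OVERLAPPING))
```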

OVERLAPPING CONSTITUENTS PROBLEM (1)

LINGUISTIC UNIT × VERSIFICATION UNIT

Spoke like a tall fellow that respects his reputation. Come, shall we to this gear?
(Shakespeare: Richard III)

! Spoke ... his reputation. Come ... (sentence and line boundaries do not coincide, so the sentence and line elements overlap)

OVERLAPPING CONSTITUENTS PROBLEM (2)

VERSIFICATION UNIT × VERSIFICATION UNIT

Cet enfant que la vie effaçait de son livre,
Et qui n’avait pas même un lendemain à vivre,
C’est moi. — Je vous dirai peut-être quelque jour
Quel lait pur, que de soins, que de vœux, que d’amour...
(Hugo: Les Feuilles d'automne)

! C'est moi. — Je vous dirai peut-être quelque jour

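OVERLAPPING CONSTITUENTS PROBLEM (3)

LINGUISTIC UNIT × LINGUISTIC UNIT

Кружась в лазурной высоте
(Lermontov: Demon)

! в лазурной (the syllable "вла" spans the word "в" and the beginning of the word "лазурной", so word and syllable elements overlap)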

OVERLAPPING CONSTITUENTS PROBLEM: SOLUTIONS (1)

NON-XML FORMATS

TexMECS

LMNL

[syll}[word}в{word][word}ла{syll][syll}зур{syll][syll}ной{syll]{word]
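To show how such range notation can be read by a script, here is a small sketch that handles only the simplified subset visible in the LMNL line above: `[name}` opens a range and `{name]` closes it. The function name `parse_ranges` is hypothetical, and full LMNL has further features (e.g. annotations) that this sketch ignores.

```python
import re

# Matches an opening tag "[name}", a closing tag "{name]", or a run of plain text.
TOKEN = re.compile(r"\[(\w+)\}|\{(\w+)\]|([^\[\{]+)")

EXAMPLE = "[syll}[word}в{word][word}ла{syll][syll}зур{syll][syll}ной{syll]{word]"

def parse_ranges(markup: str):
    """Return (plain_text, ranges); each range is (name, start, end) over plain_text."""
    text_parts = []
    offset = 0
    open_starts = {}  # tag name -> stack of start offsets of still-open ranges
    ranges = []
    for match in TOKEN.finditer(markup):
        opened, closed, text = match.groups()
        if opened:
            open_starts.setdefault(opened, []).append(offset)
        elif closed:
            start = open_starts[closed].pop()
            ranges.append((closed, start, offset))
        else:
            text_parts.append(text)
            offset += len(text)
    return "".join(text_parts), ranges

plain, ranges = parse_ranges(EXAMPLE)
words = [plain[s:e] for name, s, e in ranges if name == "word"]
sylls = [plain[s:e] for name, s, e in ranges if name == "syll"]
print(words)
print(sylls)
```

Because each range is stored as a pair of offsets rather than as a tree node, the syllable that crosses the word boundary needs no special treatment.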

OVERLAPPING CONSTITUENTS PROBLEM: SOLUTIONS (2)

XML FORMATS

TWIN DOCUMENTS

Elements that may overlap are defined in different documents.

document1.xml: влазурной
document2.xml: в лазурной

MILESTONES (ECLIX)

The range of one of the overlapping elements is defined by empty elements (milestones).

в лазурной

FRAGMENTATION

One of the overlapping elements is broken into two partial elements.

в лазурной
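The milestone and fragmentation strategies can be sketched as follows. The tag and attribute names (`<syll_start/>`, `<syll_end/>`, `part="initial"`, `part="final"`) are assumptions made for illustration and are not taken from the proposal; the point is only that both encodings are well-formed XML and that the broken-up syllable can be reassembled by a script.

```python
import xml.etree.ElementTree as ET

# Milestones: words are real elements, syllable limits are empty marker elements.
MILESTONES = (
    "<line><syll_start/><word>в</word> "
    "<word>ла<syll_end/><syll_start/>зур<syll_end/><syll_start/>ной<syll_end/></word>"
    "</line>"
)

# Fragmentation: the syllable "вла" is split into an initial and a final part.
FRAGMENTATION = (
    '<line><word><syll part="initial">в</syll></word> '
    '<word><syll part="final">ла</syll><syll>зур</syll><syll>ной</syll></word>'
    "</line>"
)

def words_of(xml_text: str) -> list:
    """Return the text of every <word>, ignoring any markup inside it."""
    root = ET.fromstring(xml_text)
    return ["".join(word.itertext()) for word in root.iter("word")]

def syllables_from_fragments(xml_text: str) -> list:
    """Rejoin partial <syll> elements into whole syllables."""
    syllables = []
    for syll in ET.fromstring(xml_text).iter("syll"):
        if syll.get("part") == "final":
            syllables[-1] += syll.text  # continue the preceding partial syllable
        else:
            syllables.append(syll.text)
    return syllables

print(words_of(MILESTONES))
print(words_of(FRAGMENTATION))
print(syllables_from_fragments(FRAGMENTATION))
```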

STAND-OFF MARKUP

The document is divided into minimal units (a one-dimensional array), and the range of every other element is defined by references to those units.
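A minimal sketch of this idea, under stated assumptions: the minimal units are written as `<u>` elements (single letters stand in for sounds), and words and syllables are empty elements pointing to unit ids through hypothetical `from` and `to` attributes. None of these names come from the proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-off encoding of "в лазурной"; all names are illustrative.
STANDOFF = """
<text>
  <units>
    <u id="1">в</u><u id="2">л</u><u id="3">а</u>
    <u id="4">з</u><u id="5">у</u><u id="6">р</u>
    <u id="7">н</u><u id="8">о</u><u id="9">й</u>
  </units>
  <word from="1" to="1"/>
  <word from="2" to="9"/>
  <syllable from="1" to="3"/>
  <syllable from="4" to="6"/>
  <syllable from="7" to="9"/>
</text>
"""

def resolve(xml_text: str, tag: str) -> list:
    """Return the text covered by every element named `tag`."""
    root = ET.fromstring(xml_text)
    units = {int(u.get("id")): u.text for u in root.iter("u")}
    covered = []
    for element in root.iter(tag):
        first, last = int(element.get("from")), int(element.get("to"))
        covered.append("".join(units[i] for i in range(first, last + 1)))
    return covered

print(resolve(STANDOFF, "word"))
print(resolve(STANDOFF, "syllable"))
```

Words and syllables never contain each other here, so the document stays well-formed even though their ranges cross.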





STAND-OFF MARKUP (LESS VERBOSE)

Since overlap within a syllable is rather rare, the minimal units may be defined as a two-dimensional array [syllables]×[sounds], where the position of a sound in the second dimension is given simply by its position in the phonetic transcription. References then need to point directly to sounds only where an overlap occurs.
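One way to read this scheme is sketched below. The reference syntax is an assumption: `"2"` means the whole of syllable 2, while `"1.2"` means sound 2 of syllable 1. Letters stand in for the sounds of a phonetic transcription, and the function names are hypothetical.

```python
# Syllables are the first dimension; the characters inside each are the second.
SYLLABLES = ["вла", "зур", "ной"]

def parse_ref(ref: str, is_end: bool) -> tuple:
    """Turn a reference into a 1-based (syllable, sound) pair.

    A bare syllable number expands to its first sound when it starts a range
    and to its last sound when it ends one.
    """
    if "." in ref:
        syllable, sound = ref.split(".")
        return int(syllable), int(sound)
    syllable = int(ref)
    return syllable, (len(SYLLABLES[syllable - 1]) if is_end else 1)

def resolve(start_ref: str, end_ref: str) -> str:
    """Return the text between two references, inclusive."""
    s1, p1 = parse_ref(start_ref, is_end=False)
    s2, p2 = parse_ref(end_ref, is_end=True)
    parts = []
    for s in range(s1, s2 + 1):
        syllable = SYLLABLES[s - 1]
        low = p1 if s == s1 else 1
        high = p2 if s == s2 else len(syllable)
        parts.append(syllable[low - 1:high])
    return "".join(parts)

print(resolve("1.1", "1.1"))  # the word "в": a sound-level reference is needed
print(resolve("1.2", "3"))    # the word "лазурной": starts inside syllable 1
print(resolve("2", "3"))      # syllable-level references suffice without overlap
```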





EXAMPLE

Tvá loď jde po vysokém moři,
(Albert: Na zemi a na nebi)

Levels of encoding shown for this line:

• syllables

• syllable attributes

• line

• metre

• words + attributes

• metrical positions

XML SCHEMA (XSD)

An XML schema (XSD) specifies which elements and attributes are required and which are optional, as well as which values are allowed in particular attributes.

REQUIRED ELEMENTS & ATTRIBUTES

Basically, only two elements, each with its required attributes, are mandatory. That is, any poetic text in which the number of syllables in each line is known can be encoded.

OPTIONAL ELEMENTS & ATTRIBUTES

All other levels of description are optional. The set of elements, attributes and their allowed values can thus be extended simply by modifying the governing XSD.

XML SCHEMA (XSD)

To be usable across languages and versification systems, the XSD can easily be extended by:

(1) Adding new elements ...

(2) Adding new attributes, e.g. syllable properties, morpheme properties ...

(3) Adding new allowed values, e.g. types of metre ...

Furthermore, in the case of sung verse, the XML file can easily be linked to a music notation format (e.g. MusicXML) by means of element attributes.

VERSE CORPORA: SHARING DATA

CONVERTING DATA: ANAMÈTRE project

...

CONCLUSIONS

Summary: how to encode (sometimes overlapping) constituents?

1. Establish a temporal frame of reference (e.g. position, beat).

2. Define constituents (metrical, linguistic, musical) by referring to the frame.
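Once constituents are defined as ranges over a shared frame, overlap becomes something a script can detect rather than something the format forbids. The sketch below assumes a frame of nine numbered positions for "в лазурной" and a hypothetical tuple layout (kind, label, first position, last position).

```python
from itertools import combinations

# Constituents as inclusive ranges over frame positions 1..9 (assumed layout).
CONSTITUENTS = [
    ("word", "в", 1, 1),
    ("word", "лазурной", 2, 9),
    ("syllable", "вла", 1, 3),
    ("syllable", "зур", 4, 6),
    ("syllable", "ной", 7, 9),
]

def crosses(a: tuple, b: tuple) -> bool:
    """True if two inclusive ranges share positions but neither contains the other."""
    (a1, a2), (b1, b2) = a, b
    shares_positions = a1 <= b2 and b1 <= a2
    nested = (a1 <= b1 and b2 <= a2) or (b1 <= a1 and a2 <= b2)
    return shares_positions and not nested

def crossing_pairs(constituents: list) -> list:
    """Return the labels of every pair of constituents whose ranges cross."""
    pairs = []
    for x, y in combinations(constituents, 2):
        if crosses((x[2], x[3]), (y[2], y[3])):
            pairs.append((x[1], y[1]))
    return pairs

print(crossing_pairs(CONSTITUENTS))
```

Only the pair that could not be expressed in a single XML tree is reported; nested and disjoint constituents pass silently.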

Disclaimer: standards only become standards if they are widely adopted.

OK, maybe different formats will be used, but let's follow similar principles so that we can easily convert and link corpora.
