XML Schema Proposal for Encoding Verse Corpora
Petr Plecháč (Institute of Czech Literature, Czech Academy of Sciences)
Varun deCastro-Arrazola (Leiden University, Meertens Institute)
VERSE CORPORA (SELECTED)
WHY DO WE NEED A STANDARD?
(1) Avoid reinventing the wheel (it's very laborious).
(2) Common formats facilitate (and encourage) common tools for:
• data access: visualisations, search engines and user interfaces;
• data analysis: scripts to extract and analyse information (e.g. rhyme detection, syllable count).
(3) Reproducibility:
• understanding each other's data is easier with standards;
• being able to reproduce each other's analyses is beneficial for science.
WHY XML SCHEMA?
(1) XML is widely used as a data interchange format.
(2) Flexibility in terms of adding new categories and modifying data types.
(3) Each XML file can easily be converted into a relational database by means of a simple script, with no need to change the existing corpus structures.
(4) Great support in all common programming languages.
(5) Elaborate tagset (TEI).
[For the sake of clarity, we use common terms in tags here instead.]
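Point (3) can be illustrated with a minimal sketch. The tag and attribute names (`corpus`, `poem`, `line`, `syllables`) and the table layout are assumptions made for the illustration, not the proposed schema; only Python's standard library is used.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical miniature corpus; tag and attribute names are assumptions.
XML = """
<corpus>
  <poem id="p1">
    <line n="1" syllables="9">Tvá loď jde po vysokém moři,</line>
  </poem>
</corpus>
"""

def xml_to_sqlite(xml_text, connection):
    """Copy every <line> of every <poem> into one flat relational table."""
    connection.execute(
        "CREATE TABLE line (poem_id TEXT, n INTEGER, syllables INTEGER, text TEXT)"
    )
    root = ET.fromstring(xml_text)
    for poem in root.iter("poem"):
        for line in poem.iter("line"):
            connection.execute(
                "INSERT INTO line VALUES (?, ?, ?, ?)",
                (poem.get("id"), int(line.get("n")),
                 int(line.get("syllables")), line.text),
            )
    connection.commit()

conn = sqlite3.connect(":memory:")
xml_to_sqlite(XML, conn)
rows = conn.execute("SELECT poem_id, n, syllables, text FROM line").fetchall()
```

The XML file itself is left untouched; the database is a derived view that can be rebuilt whenever the corpus changes.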
WELL-FORMEDNESS
In XML, all elements must be correctly nested, i.e. no element is allowed to overlap its parent.
However, the basic tree model, e.g.: ...
is inappropriate for encoding verse, since various units may overlap.
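The well-formedness constraint can be checked directly with a parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the tag names `line` and `sentence` and the sample text are placeholders chosen for the illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Properly nested: each <line> lies entirely inside the <sentence>.
nested = "<sentence><line>one two</line><line>three four</line></sentence>"

# Overlapping: the <sentence> opens in the first <line> and closes in the second.
overlapping = "<poem><line><sentence>one two</line><line>three</sentence> four</line></poem>"

nested_ok = is_well_formed(nested)
overlapping_ok = is_well_formed(overlapping)
```

The second string is rejected before any schema is consulted, so overlap has to be expressed by some other means than plain nesting.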
OVERLAPPING CONSTITUENTS PROBLEM (1)
LINGUISTIC UNIT × VERSIFICATION UNIT
Spoke like a tall fellow that respects his reputation. Come, shall we to this gear? (Shakespeare: Richard III)
! Spoke ... his reputation. Come ... [element tags not preserved]
OVERLAPPING CONSTITUENTS PROBLEM (2)
VERSIFICATION UNIT × VERSIFICATION UNIT
Cet enfant que la vie effaçait de son livre,
Et qui n’avait pas même un lendemain à vivre,
C’est moi. — Je vous dirai peut-être quelque jour
Quel lait pur, que de soins, que de vœux, que d’amour...
(Hugo: Les Feuilles d'automne)
! C'est moi. — Je vous dirai peut-être quelque jour [element tags not preserved]
OVERLAPPING CONSTITUENTS PROBLEM (3)
LINGUISTIC UNIT × LINGUISTIC UNIT
Кружась в лазурной высоте (Lermontov: Demon)
! влазурной [element tags not preserved]
OVERLAPPING CONSTITUENTS PROBLEM: SOLUTIONS (1)
NON-XML FORMATS
TexMECS
LMNL
[syll}[word}в{word][word}ла{syll][syll}зур{syll][syll}ной{syll]{word]
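To show what the range notation above encodes, here is a small sketch that reads only the `[name}` and `{name]` tags used in that one example and returns each range as character offsets. It is not a LMNL parser: annotations, anonymous ranges and the rest of the syntax are ignored, and the function name is made up for the illustration.

```python
import re

# Matches a start tag "[name}" (group 1) or an end tag "{name]" (group 2).
TAG = re.compile(r"\[(\w+)\}|\{(\w+)\]")

def parse_ranges(markup):
    """Return (plain_text, ranges); each range is (name, start, end) in plain_text."""
    text = []
    length = 0
    open_ranges = {}          # name -> stack of start offsets
    ranges = []
    pos = 0
    for match in TAG.finditer(markup):
        chunk = markup[pos:match.start()]
        text.append(chunk)
        length += len(chunk)
        pos = match.end()
        if match.group(1):    # start tag: remember where the range begins
            open_ranges.setdefault(match.group(1), []).append(length)
        else:                 # end tag: close the most recent range of that name
            start = open_ranges[match.group(2)].pop()
            ranges.append((match.group(2), start, length))
    text.append(markup[pos:])
    return "".join(text), ranges

plain, ranges = parse_ranges(
    "[syll}[word}в{word][word}ла{syll][syll}зур{syll][syll}ной{syll]{word]"
)
covered = [(name, plain[start:end]) for name, start, end in ranges]
```

Because ranges are closed by name rather than by nesting order, the first syllable (вла) and the second word (лазурной) can overlap without any conflict.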
OVERLAPPING CONSTITUENTS PROBLEM: SOLUTIONS (2)
XML FORMATS
TWIN DOCUMENTS
elements that may overlap are defined in different documents
document1.xml: влазурной
document2.xml: в лазурной
MILESTONES (ECLIX)
the range of one of the overlapping elements is defined by empty elements (milestones)
в лазурной
FRAGMENTATION
one of the overlapping elements is broken into two partial elements
в лазурной
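Since the tagged versions of these examples are not legible above, the following sketch shows one way the milestone and fragmentation strategies could look for влазурной. All element and attribute names (`syll`, `word`, `ref`, `wordStart`, `wordEnd`) are hypothetical, as is the choice to keep syllables as real elements and treat words as the broken or milestone-delimited unit.

```python
import xml.etree.ElementTree as ET

# Fragmentation: the second word is split into partial <word> elements
# that share a hypothetical "ref" attribute.
FRAGMENTED = (
    "<line>"
    "<syll><word ref='w1'>в</word><word ref='w2'>ла</word></syll>"
    "<syll><word ref='w2'>зур</word></syll>"
    "<syll><word ref='w2'>ной</word></syll>"
    "</line>"
)

# Milestones: word boundaries are marked by empty elements.
MILESTONES = (
    "<line>"
    "<syll><wordStart/>в<wordEnd/><wordStart/>ла</syll>"
    "<syll>зур</syll>"
    "<syll>ной<wordEnd/></syll>"
    "</line>"
)

def syllables(xml_text):
    """Text content of every <syll>, ignoring any markup inside it."""
    return ["".join(s.itertext()) for s in ET.fromstring(xml_text).iter("syll")]

def words_from_fragments(xml_text):
    """Join partial <word> elements that share the same ref value."""
    words = {}
    for part in ET.fromstring(xml_text).iter("word"):
        words[part.get("ref")] = words.get(part.get("ref"), "") + part.text
    return list(words.values())

def words_from_milestones(xml_text):
    """Collect the text between each <wordStart/> and the next <wordEnd/>."""
    words, current = [], None
    for syll in ET.fromstring(xml_text).iter("syll"):
        pieces = [("text", syll.text)]
        for child in syll:
            pieces.append((child.tag, None))
            pieces.append(("text", child.tail))
        for kind, value in pieces:
            if kind == "wordStart":
                current = ""
            elif kind == "wordEnd":
                words.append(current)
                current = None
            elif value and current is not None:
                current += value
    return words
```

Both documents are well-formed XML, but in both cases the word has to be reassembled by the reading script rather than read off the tree.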
STAND-OFF MARKUP
the document is divided into minimal units (a one-dimensional array), and the range of every other element is defined by references to these units
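A minimal sketch of the stand-off idea follows. The element and attribute names (`units`, `u`, `n`, `from`, `to`) are assumptions, and letters stand in for sounds; the point is only that syllables and words are both empty elements pointing into the same flat array.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-off document: minimal units first, then ranges over them.
DOC = """
<line>
  <units>
    <u n="1">в</u><u n="2">л</u><u n="3">а</u>
    <u n="4">з</u><u n="5">у</u><u n="6">р</u>
    <u n="7">н</u><u n="8">о</u><u n="9">й</u>
  </units>
  <syll from="1" to="3"/><syll from="4" to="6"/><syll from="7" to="9"/>
  <word from="1" to="1"/><word from="2" to="9"/>
</line>
"""

def resolve(xml_text, tag):
    """Return the text covered by every element named `tag` (from/to inclusive)."""
    root = ET.fromstring(xml_text)
    units = {int(u.get("n")): u.text for u in root.iter("u")}
    return [
        "".join(units[i] for i in range(int(e.get("from")), int(e.get("to")) + 1))
        for e in root.iter(tag)
    ]

syllable_texts = resolve(DOC, "syll")
word_texts = resolve(DOC, "word")
```

No element contains another, so overlap between syllables and words cannot violate well-formedness.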
STAND-OFF MARKUP (LESS VERBOSE)
Since syllable overlap is rather rare, the minimal units may be defined as a two-dimensional array [syllables]×[sounds], where the position of a sound in the second dimension is given simply by its position in the phonetic transcription. References then need to point directly to sounds only where an overlap occurs.
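The two-dimensional addressing can be sketched as follows. The reference notation (`"2"` for a whole syllable, `"1.2"` for the second sound of the first syllable) is an assumption invented for this illustration, and letters again stand in for sounds.

```python
# First dimension: syllables; second dimension: position within the syllable.
SYLLABLES = ["вла", "зур", "ной"]

def parse_ref(ref):
    """'2' -> (2, None), a whole syllable; '1.2' -> (1, 2), one sound (1-based)."""
    syll, _, sound = ref.partition(".")
    return int(syll), int(sound) if sound else None

def span_text(start_ref, end_ref):
    """Text from start_ref to end_ref, both inclusive."""
    s_syll, s_sound = parse_ref(start_ref)
    e_syll, e_sound = parse_ref(end_ref)
    parts = []
    for n in range(s_syll, e_syll + 1):
        text = SYLLABLES[n - 1]
        lo = (s_sound - 1) if (n == s_syll and s_sound) else 0
        hi = e_sound if (n == e_syll and e_sound) else len(text)
        parts.append(text[lo:hi])
    return "".join(parts)

first_word = span_text("1.1", "1.1")    # needs a sound-level reference
second_word = span_text("1.2", "3")     # starts mid-syllable, ends on a syllable
plain_syllable = span_text("2", "2")    # no overlap: syllable-level reference only
```

Only the two words that split the first syllable need the longer form; every other reference stays at syllable level.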
EXAMPLE
Tvá loď jde po vysokém moři, (Albert: Na zemi a na nebi)
Annotation levels shown for this line, one per slide: syllables; syllable attributes; line; metre; words + attributes; metrical positions. [XML listings not preserved]
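As a rough idea of how two of these levels could be combined for the example line, the sketch below builds a line element with numbered syllables and words that refer to syllable ranges. The element and attribute names are assumptions, the punctuation is left out, and the metre and metrical-position levels are omitted because their values are not available here.

```python
import xml.etree.ElementTree as ET

# Orthographic syllables of the example line and, for each word,
# the numbers of its first and last syllable.
SYLLABLES = ["Tvá", "loď", "jde", "po", "vy", "so", "kém", "mo", "ři"]
WORDS = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 7), (8, 9)]

def build_line(syllables, words):
    """Build <line> with <syll n=...> children and stand-off <word from to/> children."""
    line = ET.Element("line", {"syllables": str(len(syllables))})
    for n, text in enumerate(syllables, start=1):
        ET.SubElement(line, "syll", {"n": str(n)}).text = text
    for first, last in words:
        ET.SubElement(line, "word", {"from": str(first), "to": str(last)})
    return line

def word_forms(line):
    """Reassemble each word from the syllables it refers to."""
    sylls = {int(s.get("n")): s.text for s in line.iter("syll")}
    return [
        "".join(sylls[i] for i in range(int(w.get("from")), int(w.get("to")) + 1))
        for w in line.iter("word")
    ]

line = build_line(SYLLABLES, WORDS)
xml_text = ET.tostring(line, encoding="unicode")
```

Further levels would be added the same way, as extra attributes on the syllables or as further elements that refer to syllable numbers.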
XML SCHEMA (XSD)
The XML schema (XSD) specifies which elements and attributes are required and which are optional, as well as which values are allowed in particular attributes.
REQUIRED ELEMENTS & ATTRIBUTES
Basically, there are only two required elements, with the following required attributes:
i.e. any poetic text in which the number of syllables in each line is known may be encoded.
OPTIONAL ELEMENTS & ATTRIBUTES
All the other levels of description are optional. The set of elements, attributes and their allowed values may thus be extended simply by modifying the governing XSD.
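The names of the required elements and attributes are not legible above, so the following is only a hand-written stand-in for schema validation under an assumed rule: every `line` element must carry an integer `syllables` attribute. Both names are hypothetical, and a real workflow would validate against the XSD itself with a schema-aware validator.

```python
import xml.etree.ElementTree as ET

def missing_requirements(xml_text):
    """List violations of the assumed rule: each <line> has an integer 'syllables'."""
    problems = []
    root = ET.fromstring(xml_text)
    lines = list(root.iter("line"))
    if not lines:
        problems.append("no <line> element found")
    for index, line in enumerate(lines, start=1):
        value = line.get("syllables")
        if value is None or not value.isdigit():
            problems.append(f"line {index}: missing or non-numeric 'syllables'")
    return problems

# A text with nothing but syllable counts passes; optional levels may be absent.
minimal = "<poem><line syllables='9'/><line syllables='8'/></poem>"
incomplete = "<poem><line syllables='9'/><line/></poem>"

minimal_problems = missing_requirements(minimal)
incomplete_problems = missing_requirements(incomplete)
```

The check ignores every other element and attribute, mirroring the idea that all further levels of description are optional.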
XML SCHEMA (XSD)
To be usable for various languages and versification systems, the XSD may easily be extended by:
(1) Adding new elements, e.g. [element names not preserved] ...
(2) Adding new attributes, e.g. syllable properties, morpheme properties ...
(3) Adding new allowed values, e.g. types of metre ...
Furthermore, in the case of sung verse, the XML file can easily be linked to a music notation (e.g. MusicXML) by attributes of the [element name not preserved], e.g.:
VERSE CORPORA (SELECTED)
VERSE CORPORA: SHARING DATA
CONVERTING DATA: ANAMÈTRE project
...
CONCLUSIONS
Summary: how to encode (sometimes overlapping) constituents?
1. Establish a temporal frame of reference (e.g. position, beat).
2. Define constituents (metrical, linguistic, musical) by referring to the frame.
Disclaimer: standards only become standard if widely adopted.
Ok, maybe different formats will be used, but let's follow similar principles so that we can easily convert and link corpora.