6 Segmenting with tokens

Morpho-syntactic annotations are carried by segments of the document flow, called tokens. This does not imply that the resulting segmentation is a sequence of adjacent segments partitioning the original document. It is particularly important to distinguish morpho-syntactic units from their realizations. Some parts of a document may carry no annotations (typographic marks, stage directions, markup elements, ...); other parts may not correspond exactly to their segmented form (abbreviations, brachygraphies, typographic errors and variations, typographic and morphological contractions, ...). Moreover, a morpho-syntactic unit may not correspond to a segment delimited by typographic marks (such as white space or hyphens), as with German compound words, speech transcriptions, or Sanskrit writing.

The element token represents those segments of the original document that, roughly speaking, follow typographical, morphological, or phonological boundaries. The current proposal does not define the linguistic properties of tokens. Depending on the language, a token may be identified through typographic properties (white space, hyphens, characters, ...) and/or morphological properties (radical, affix, morpheme, ...). The description of the morphological, phonological, or lexicological structures that may define a token is not covered by the current proposal.

Other typographical marks used to format pages or to separate words and paragraphs, as well as encoding information, do not belong to morpho-syntactic annotations and are also not covered by this proposal, but rather by TEI.

6.1 Standoff notation

The element token makes annotations independent of the original document by referencing intervals in that document. The attributes from and to define such an interval. Their content depends on the chosen addressing scheme, which must denote unambiguous document positions, and on the nature of the original document.
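As an illustration of standoff addressing, the following Python sketch resolves the from and to attributes of a token against a document. It assumes that document locations are zero-based character offsets with an exclusive end position; the proposal leaves the addressing scheme open, so this is only one possible choice. The helper name resolve_token and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def resolve_token(token_xml: str, document: str) -> str:
    """Return the document segment addressed by a standoff token.

    Assumption (not fixed by the proposal): `from` and `to` are zero-based
    character offsets, and `to` is exclusive.
    """
    token = ET.fromstring(token_xml)
    start = int(token.get("from"))
    end = int(token.get("to"))
    if not 0 <= start <= end <= len(document):
        raise ValueError(f"interval [{start}, {end}) lies outside the document")
    return document[start:end]


document = "The victim's friends told police"
segment = resolve_token('<token id="t2" from="4" to="10"/>', document)
```

Under these assumptions, segment holds the substring "victim". A different addressing scheme (byte offsets, XPointer expressions, time stamps for speech) would need a different resolver but the same token markup.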

6.2 Embedding notation

It is not always necessary to separate the original document from its annotations. For simple cases, textual content may be directly embedded within token.

<token id="t1">The</token> <token id="t2">victim</token> <token id="t3">'s</token> <token id="t4">friends</token> <token id="t5">told</token> <token id="t6">police</token> <token id="t7">that</token> <token id="t8">Krueger</token> <token id="t9">drove</token> <token id="t10">into</token> <token id="t11">the</token> <token id="t12">quarry</token> <token id="t13">and</token> <token id="t14">never</token> <token id="t15">surfaced</token> <token id="t16">.</token>

The embedding notation is used for most of the MAF examples provided here, but its use is not recommended. First, the morpho-syntactic annotations may conflict with other annotations. Second, the textual material separating the content embedded in successive token elements is not precisely defined (white space, newlines, no space, hyphen, ...), except by relying on the attribute join.

6.3 Informative attributes

Tokens address segments of the original document but also provide a possible level of abstraction with respect to this document, for instance with respect to graphical or phonological variations that are not linguistically pertinent. The optional attributes form, phonetic, transcription, and transliteration may be used to perform this abstraction, providing, for instance, the phonetic transcription of a speech segment, the Roman transliteration of a Cyrillic word, the expansion of an abbreviation, the correction of a typographical error, or the choice of a normalized form in the presence of variations:

<token form="et cetera" id="t1">etc.</token>
<token form="tzar" id="t2">csar</token>
<token form="tzar" id="t3">tsar</token>
<token form="23/02/03" id="t4">February, 23rd 2003</token>
<token form="et cetera" phonetic="/etsettr/" from="1251" to="1253" id="t5"/>
<token phonetic="/platto/" id="t6">plateau</token>

The abstraction provided by the attribute form is also adequate for handling contraction and agglutination, where two tokens may cover the same segment of the original document with distinct form values (see Section 6.4.2).
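The role of the attribute form can be sketched in Python as follows. The fallback to the embedded text when form is absent is an assumption made for illustration, not a rule stated by the proposal; the wrapper element doc and the function name normalized_form are hypothetical.

```python
import xml.etree.ElementTree as ET


def normalized_form(token: ET.Element) -> str:
    """Return the abstracted form of a token.

    Assumption: when `form` is absent, the embedded text is used as is.
    """
    form = token.get("form")
    if form is not None:
        return form
    return token.text or ""


# `doc` is a hypothetical wrapper so that the fragment is well-formed XML.
tokens = ET.fromstring(
    "<doc>"
    '<token form="et cetera" id="t1">etc.</token> '
    '<token form="tzar" id="t2">csar</token> '
    '<token form="tzar" id="t3">tsar</token> '
    '<token phonetic="/platto/" id="t6">plateau</token>'
    "</doc>"
)
forms = [normalized_form(token) for token in tokens.iter("token")]
```

With this input, the two spelling variants "csar" and "tsar" map to the same normalized value "tzar", while "plateau", which carries no form attribute, is returned unchanged.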

6.4 Completing the embedding token notation

As mentioned above, the embedding token notation is less precise than the standoff one, in particular for making explicit the contiguity and the overlapping of tokens (both of which are easy to check from the document positions in the standoff notation).

6.4.1 Joining tokens

The embedding notation for tokens is supplemented by the attribute join, which specifies how a token is joined with its sibling tokens. By default, two sibling tokens are considered to be separated by whatever separator is standard for the document language (for instance, a space in many languages). The attribute join makes it possible to indicate that a token is contiguous with its left sibling, its right sibling, or both.

<token id="t1">L'</token> <token id="t2" join="left">on</token> <token id="t3">dit</token>

It should be noted that a token may enclose material usually considered a separator, such as a space, newline, dash, or apostrophe, even if such tokens do not anchor linguistic units at the level of word forms.

<token id="t1">L</token> <token id="t2" join="both">'</token> <token id="t3">on</token> <token id="t4">dit</token>

Another example, in Modern Greek, is provided by the idiomatic expression “καλοκαγαθός” (good and brave), which may be segmented into three agglutinated segments “καλός”, “και”, and “αγαθός” and represented by:

<token form="καλός" id="t0">καλο</token> <token form="και" id="t1" join="left">κ</token> <token form="αγαθός" id="t2" join="left">αγαθός</token>
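The effect of the attribute join on the three examples above can be sketched in Python. The sketch assumes that the default separator is a single space and that join="left" (respectively "right", "both") suppresses the separator on the corresponding side; the value overlap is deliberately not handled here. The wrapper element doc and the function name surface_text are hypothetical.

```python
import xml.etree.ElementTree as ET


def surface_text(tokens_xml: str, separator: str = " ") -> str:
    """Rebuild the surface string from embedded tokens and their join values.

    Assumption: the language's standard separator is `separator`, and the
    material between token elements in the XML itself is ignored.
    """
    tokens = list(ET.fromstring(tokens_xml).iter("token"))
    pieces = []
    for index, token in enumerate(tokens):
        join = token.get("join", "no")
        if join not in ("no", "left", "right", "both"):
            raise ValueError(f"unsupported join value: {join!r}")
        if index > 0:
            previous_join = tokens[index - 1].get("join", "no")
            # Two siblings are contiguous if either side asks for it.
            glued = join in ("left", "both") or previous_join in ("right", "both")
            if not glued:
                pieces.append(separator)
        pieces.append(token.text or "")
    return "".join(pieces)


FRENCH = """<doc><token id="t1">L'</token> <token id="t2" join="left">on</token> <token id="t3">dit</token></doc>"""
FRENCH_SPLIT = """<doc><token id="t1">L</token> <token id="t2" join="both">'</token> <token id="t3">on</token> <token id="t4">dit</token></doc>"""
GREEK = """<doc><token form="καλός" id="t0">καλο</token> <token form="και" id="t1" join="left">κ</token> <token form="αγαθός" id="t2" join="left">αγαθός</token></doc>"""

french = surface_text(FRENCH)
french_split = surface_text(FRENCH_SPLIT)
greek = surface_text(GREEK)
```

Under these assumptions, both French segmentations yield the same surface string, and the Greek tokens are rendered as a single word, which shows that the segmentation choice and the surface realization are independent.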

6.4.2 Overlapping tokens

Two tokens may overlap, for instance to denote an agglutinated or contracted form (in French, “des” may be seen as a contraction of “de les” [of the]), or to denote multi-speaker documents with overlapping discourse. In these cases, a token does not merely mark the realization of a typographical or vocal sequence, but expresses a deeper linguistic reality pertinent for segmenting a document. It is, however, still possible not to mention overlapping at the level of tokens and to defer the issue to the level of linguistic units, i.e. word forms.

The value overlap for the token attribute join may be used to denote overlapping at the level of embedding tokens. For instance, the following example illustrates the contraction of an abbreviation with a punctuation mark for “etc.”, in both the standoff and embedding notations for the element token:

a Standoff notation
<token form="et cetera" id="t1" from="p1" to="p3"/> <token form="#dot#" id="t2" from="p1" to="p3"/>
b Embedding notation
<token form="et cetera" id="t1">etc.</token> <token form="#dot#" id="t2" join="overlap"/>
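For the standoff notation, overlap can be detected directly from document positions. The Python sketch below assumes integer offsets with an exclusive end, which is only one possible addressing scheme; the symbolic positions p1 and p3 used above are replaced by hypothetical numeric values, and the names Span and overlapping_pairs are illustrative.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    token_id: str
    start: int
    end: int  # exclusive (assumption)


def overlaps(a: Span, b: Span) -> bool:
    """Two half-open intervals overlap if each starts before the other ends."""
    return a.start < b.end and b.start < a.end


def overlapping_pairs(spans: List[Span]) -> List[Tuple[str, str]]:
    """Return the ids of every pair of tokens whose intervals overlap."""
    return [
        (a.token_id, b.token_id)
        for a, b in combinations(spans, 2)
        if overlaps(a, b)
    ]


spans = [
    Span("t1", 0, 4),  # form="et cetera", covering "etc."
    Span("t2", 0, 4),  # form="#dot#", covering the same segment
    Span("t3", 5, 8),  # a following, non-overlapping token
]
pairs = overlapping_pairs(spans)
```

Here only t1 and t2 are reported as overlapping. In the embedding notation no such computation is possible, which is why the value overlap has to be stated explicitly.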

6.5 Formal description: token

\token = element token {
    attribute id { xsd:ID }?,
    token.information,
    (
        ( attribute from { DocumentLocation },
          attribute to { DocumentLocation } )
        |
        ## DTD => ,
        ( [ a:defaultValue = "no" ]
          attribute join { "no" | "left" | "right" | "both" | "overlap" }?,
          text )
    )
}
token.information &= attribute form { string }?
token.information &= attribute phonetic { string }?
token.information &= attribute transcription { string }?
token.information &= attribute transliteration { string }?
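The main constraint expressed by this schema is that a token uses either the standoff branch (from and to) or the embedding branch (optional join plus text), never a mix of the two. The Python sketch below checks that constraint on a parsed element. It is an informal reading of the schema, not a substitute for RELAX NG validation; it does not check the type of id or the syntax of DocumentLocation, and the function name token_problems is hypothetical.

```python
import xml.etree.ElementTree as ET

INFO_ATTRS = {"form", "phonetic", "transcription", "transliteration"}
JOIN_VALUES = {"no", "left", "right", "both", "overlap"}


def token_problems(token):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    attrs = set(token.attrib)
    unknown = attrs - INFO_ATTRS - {"id", "from", "to", "join"}
    if unknown:
        problems.append("unknown attributes: " + ", ".join(sorted(unknown)))
    if len(token) > 0:
        problems.append("token must not contain child elements")
    if "from" in attrs or "to" in attrs:
        # Standoff branch: both interval bounds, no join, no embedded text.
        if not {"from", "to"} <= attrs:
            problems.append("from and to must appear together")
        if "join" in attrs:
            problems.append("join is only allowed in the embedding notation")
        if (token.text or "").strip():
            problems.append("a standoff token must not embed text")
    elif token.get("join", "no") not in JOIN_VALUES:
        # Embedding branch: optional join from a closed set of values, then text.
        problems.append("invalid join value: " + token.get("join"))
    return problems


standoff = ET.fromstring('<token form="et cetera" id="t1" from="p1" to="p3"/>')
embedded = ET.fromstring('<token form="#dot#" id="t2" join="overlap"/>')
mixed = ET.fromstring('<token id="t3" from="p1" join="left">x</token>')
results = [token_problems(t) for t in (standoff, embedded, mixed)]
```

The first two tokens, taken from the examples in Section 6.4.2, produce no problems, while the third mixes both branches and is reported three times (missing to, join in a standoff token, embedded text in a standoff token).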