6 Segmenting with tokens
Morpho-syntactic annotations are carried by segments, called
tokens, present in the document flow, but this does not imply
that the resulting segmentation corresponds to a sequence of adjacent
segments partitioning the original document. It is particularly
important to distinguish the morpho-syntactic units from their
realizations. Some parts of a document may carry no annotations
(typographic marks, stage directions, markup elements, ...); other parts
may not correspond exactly to their segmented form (abbreviations,
brachygraphies, typographic errors and variations, typographic and
morphological contractions, ...). Moreover, a morpho-syntactic unit may
not correspond to a segment delimited by typographic marks (such as
white space or hyphens), as is the case for German compound words,
speech transcription, or Sanskrit writing.
The element token is used to represent these segments of the
original document that, roughly speaking, follow typographical,
morphological, or phonological boundaries. The current proposal does
not define the linguistic properties of tokens. In different
languages, a token may be identified through typographic properties
(white-space, hyphens, characters, ...) and/or morphological
properties (radical, affix, morpheme, ...). The description of the
morphological, phonological or lexicological structures that may define
a token is not covered by the current proposal.
Other typographical marks used to format pages or to separate words
and paragraphs, as well as encoding information, do not belong to
morpho-syntactic annotations and are also not covered by this
proposal, but rather by TEI.
6.1 Standoff notation
The element token makes annotations independent of the original
document by providing a way to reference intervals in that document. The
attributes from and to are used to define such
intervals. The content of these attributes depends on the
addressing scheme chosen to denote unambiguous document positions, which
in turn depends on the nature of the original document.
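As a minimal sketch, standoff tokens could be resolved against a document as shown below. It assumes that from and to hold zero-based character offsets with an exclusive end, which is only one possible addressing scheme since the proposal leaves that choice open. The wrapper element tokens, the sample offsets, and the function resolve_standoff are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical original document and standoff tokens pointing into it.
DOCUMENT = "The victim's friends told police"
TOKENS = """
<tokens>
  <token id="t1" from="0" to="3"/>
  <token id="t2" from="4" to="10"/>
  <token id="t3" from="10" to="12"/>
  <token id="t4" from="13" to="20"/>
</tokens>
"""


def resolve_standoff(document: str, tokens_xml: str) -> list[tuple[str, str]]:
    """Return (token id, covered text) pairs.

    Assumption: from/to are zero-based character offsets, end exclusive.
    """
    resolved = []
    for token in ET.fromstring(tokens_xml).iter("token"):
        start = int(token.get("from"))
        end = int(token.get("to"))
        resolved.append((token.get("id"), document[start:end]))
    return resolved


if __name__ == "__main__":
    for token_id, text in resolve_standoff(DOCUMENT, TOKENS):
        print(token_id, repr(text))
```

Under these assumptions the document itself is never modified, and the same text can be addressed by several independent annotation layers.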
6.2 Embedding notation
It is not always necessary to separate the original document from its
annotations. For simple cases, textual content may be directly
embedded within token.
<token id="t1">The</token>
<token id="t2">victim</token>
<token id="t3">'s</token>
<token id="t4">friends</token>
<token id="t5">told</token>
<token id="t6">police</token>
<token id="t7">that</token>
<token id="t8">Krueger</token>
<token id="t9">drove</token>
<token id="t10">into</token>
<token id="t11">the</token>
<token id="t12">quarry</token>
<token id="t13">and</token>
<token id="t14">never</token>
<token id="t15">surfaced</token>
<token id="t16">.</token>
The embedding notation is used for most of the MAF examples provided
here, but it should be noted that its use is not
recommended. A first reason is that the morpho-syntactic annotations
may conflict with other annotations. A second reason is that the
material separating the textual content of consecutive token
elements (white space, newlines, no space, hyphen, ...) is not
precisely defined, except through the attribute
join.
6.3 Informative attributes
Tokens address segments of the original document but also provide a
level of abstraction with respect to this document, for instance
with respect to graphical or phonological variations that are not
linguistically pertinent. The optional attributes form, phonetic,
transcription, and transliteration may be used to perform
this abstraction, providing, for instance, the phonetic transcription
of a speech segment, the Roman transliteration of a Cyrillic word,
the expansion of an abbreviation, the correction of a typographical
error, or the choice of a normalized form in the presence of variations:
<token form="et cetera" id="t1">etc.</token>
<token form="tzar" id="t2">csar</token>
<token form="tzar" id="t3">tsar</token>
<token form="23/02/03" id="t4">February, 23rd 2003</token>
<token form="et cetera" phonetic="/etsettr/" from="1251" to="1253" id="t5"/>
<token phonetic="/platto/" id="t6">plateau</token>
The abstraction provided by the attribute form is also adequate
to handle the phenomena of contraction and agglutination, where two
tokens may cover the same segment of the original document with
distinct form values (see Section 6.4.2).
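The abstraction role of form can be sketched as a simple lookup: a consumer uses form when it is present and falls back to the embedded text otherwise. This is only an illustrative reading of the attribute. The wrapper element tokens and the function normalized_form are hypothetical names that are not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Tokens taken from the examples of this section, wrapped in a
# hypothetical <tokens> element so that they form one XML document.
EXAMPLE = """
<tokens>
  <token form="et cetera" id="t1">etc.</token>
  <token form="tzar" id="t2">csar</token>
  <token form="tzar" id="t3">tsar</token>
  <token phonetic="/platto/" id="t6">plateau</token>
</tokens>
"""


def normalized_form(token: ET.Element) -> str:
    """Return the form attribute if present, else the embedded text."""
    form = token.get("form")
    if form is not None:
        return form
    return token.text or ""


def normalized_forms(tokens_xml: str) -> list[str]:
    """Apply normalized_form to every token of the document."""
    root = ET.fromstring(tokens_xml)
    return [normalized_form(token) for token in root.iter("token")]


if __name__ == "__main__":
    print(normalized_forms(EXAMPLE))
```

With this reading, the two spelling variants csar and tsar are mapped to the same normalized value, while plateau, which has no form attribute, keeps its surface text.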
6.4 Completing the embedding token notation
As mentioned above, the embedding token notation is less precise than
the standoff one, in particular for making explicit the contiguity and
the overlapping of tokens (which are straightforward to check from the
document positions in the standoff notation).
6.4.1 Joining tokens
The embedding notation for tokens is completed by the attribute
join used to specify how a token is joined with its sibling
tokens. By default, two sibling tokens are considered to be separated
by whatever separator is standard for the document language (for
instance, space separated for many languages). By using the attribute
join, it is possible to indicate that a token is contiguous
with its left or right sibling or with both.
<!-- it is said ... -->
<token id="t1">L'</token>
<token id="t2" join="left">on</token>
<token id="t3">dit</token>
It should be noted that a token may enclose material usually
considered as a separator, such as a space, newline, dash, or
apostrophe, even if such tokens do not anchor linguistic units at the
level of word forms.
<!-- it is said ... -->
<token id="t1">L</token>
<token id="t2" join="both">'</token>
<token id="t3">on</token>
<token id="t4">dit</token>
Another example, in Modern Greek, is provided by the idiomatic
expression “καλοκαγαθός” (good and brave), which may
be segmented into three agglutinated segments,
“καλός”, “και”, and “αγαθός”, and represented by:
<token form="καλός" id="t0">καλο</token>
<token form="και" id="t1" join="left">κ</token>
<token form="αγαθός" id="t2" join="left">αγαθός</token>
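The effect of join on the three examples of this section can be sketched as a small rendering routine that rebuilds the surface text from embedded tokens. It assumes a single space as the default separator and does not handle the value overlap. The wrapper element tokens and the function render are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

ELISION = """
<tokens>
  <token id="t1">L'</token>
  <token id="t2" join="left">on</token>
  <token id="t3">dit</token>
</tokens>
"""

APOSTROPHE = """
<tokens>
  <token id="t1">L</token>
  <token id="t2" join="both">'</token>
  <token id="t3">on</token>
  <token id="t4">dit</token>
</tokens>
"""

GREEK = """
<tokens>
  <token form="καλός" id="t0">καλο</token>
  <token form="και" id="t1" join="left">κ</token>
  <token form="αγαθός" id="t2" join="left">αγαθός</token>
</tokens>
"""


def render(tokens_xml: str, separator: str = " ") -> str:
    """Concatenate embedded tokens, honouring join=left/right/both.

    Assumption: the default separator is one space.
    """
    pieces = []
    previous_join = "no"
    for index, token in enumerate(ET.fromstring(tokens_xml).iter("token")):
        join = token.get("join", "no")
        if join == "overlap":
            raise ValueError("overlap is not handled by this sketch")
        # Two siblings are contiguous if either one declares the contact.
        glued = previous_join in ("right", "both") or join in ("left", "both")
        if index > 0 and not glued:
            pieces.append(separator)
        pieces.append(token.text or "")
        previous_join = join
    return "".join(pieces)


if __name__ == "__main__":
    for sample in (ELISION, APOSTROPHE, GREEK):
        print(render(sample))
```

Both French segmentations render to the same surface string, which shows that the choice of tokens is independent of the text they realize.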
6.4.2 Overlapping tokens
Two tokens may overlap, for instance to denote an agglutinated or
contracted form (for instance, in French, “des” may be seen
as a contraction of “de les” [of the]), or to denote
multi-speaker documents with overlapping utterances. In these cases, a
token may mark not just the realization of a typographical or
vocal sequence, but also express a deeper linguistic reality pertinent
for segmenting a document. It is however still possible not to mention
overlapping at the level of tokens and to defer the issue to the
level of linguistic units, i.e. word forms.
The value overlap for the token attribute join may be
used to denote overlapping at the level of embedding tokens. For
instance, the following example illustrates the contraction of an
abbreviation with a punctuation mark for “etc.”, for the
standoff and embedding notations for element token:
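The following sketch shows one way such a contraction might be written and checked. The token ids, the character offsets (zero-based, exclusive end), and the wrapper element tokens are all assumptions made for illustration, not values taken from the proposal. The sketch also assumes that the final period is shared by the abbreviation and the punctuation mark.

```python
import xml.etree.ElementTree as ET

# Hypothetical standoff rendering: "etc." occupies offsets 0..4 and the
# sentence-final period reuses the last character (offsets 3..4).
STANDOFF = """
<tokens>
  <token id="t1" from="0" to="4" form="et cetera"/>
  <token id="t2" from="3" to="4"/>
</tokens>
"""

# Hypothetical embedding rendering of the same contraction.
EMBEDDED = """
<tokens>
  <token id="t1" form="et cetera">etc.</token>
  <token id="t2" join="overlap">.</token>
</tokens>
"""


def overlapping_pairs(tokens_xml: str) -> list[tuple[str, str]]:
    """Return id pairs of standoff tokens whose intervals intersect."""
    tokens = list(ET.fromstring(tokens_xml).iter("token"))
    pairs = []
    for index, first in enumerate(tokens):
        first_from, first_to = int(first.get("from")), int(first.get("to"))
        for second in tokens[index + 1:]:
            second_from, second_to = int(second.get("from")), int(second.get("to"))
            # Half-open intervals intersect when each starts before the other ends.
            if first_from < second_to and second_from < first_to:
                pairs.append((first.get("id"), second.get("id")))
    return pairs


def declared_overlaps(tokens_xml: str) -> list[str]:
    """Return ids of embedded tokens marked with join="overlap"."""
    root = ET.fromstring(tokens_xml)
    return [t.get("id") for t in root.iter("token") if t.get("join") == "overlap"]


if __name__ == "__main__":
    print(overlapping_pairs(STANDOFF))
    print(declared_overlaps(EMBEDDED))
```

In the standoff case the overlap is derived from the positions, whereas in the embedding case it has to be declared explicitly with join.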
6.5 Formal description: token
\token =
  element token {
    attribute id { xsd:ID }?,
    token.information,
    (
      (
        attribute from { DocumentLocation },
        attribute to { DocumentLocation }
      )
      | ## DTD => ,
      (
        [ a:defaultValue = "no" ]
        attribute join { "no" | "left" | "right" | "both" | "overlap" }?,
        text
      )
    )
  }
token.information &= attribute form { string }?
token.information &= attribute phonetic { string }?
token.information &= attribute transcription { string }?
token.information &= attribute transliteration { string }?
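The choice expressed by this pattern (either a from/to pair, or an optional join with text content) can be approximated by a hand-written check. The sketch below is not a RELAX NG validator: it ignores datatypes such as xsd:ID and DocumentLocation, and the function name check_token is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

INFORMATION = {"form", "phonetic", "transcription", "transliteration"}
JOIN_VALUES = {"no", "left", "right", "both", "overlap"}


def check_token(token: ET.Element) -> list[str]:
    """Return the list of violations found on a single token element."""
    problems = []
    if token.tag != "token":
        problems.append(f"unexpected element: {token.tag}")
    names = set(token.attrib)
    unknown = names - INFORMATION - {"id", "from", "to", "join"}
    if unknown:
        problems.append(f"unknown attributes: {sorted(unknown)}")
    if names & {"from", "to"}:
        # Standoff branch: both positions, no join, no text content.
        if not {"from", "to"} <= names:
            problems.append("from and to must be given together")
        if "join" in names:
            problems.append("join is only allowed on embedding tokens")
        if (token.text or "").strip():
            problems.append("a standoff token carries no text")
    elif token.get("join", "no") not in JOIN_VALUES:
        # Embedding branch: join defaults to "no" when absent.
        problems.append(f"invalid join value: {token.get('join')}")
    return problems


if __name__ == "__main__":
    sample = ET.fromstring('<token id="t2" join="left">on</token>')
    print(check_token(sample))
```

A schema-aware validator would be the appropriate tool in practice; the sketch only makes the two mutually exclusive branches of the pattern explicit.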