7 Word Forms as linguistic units
The segments identified by token elements are used to anchor
word forms, which may generally be associated, through the attribute
entry, with a lexical entry in a lexicon. Word forms are also
characterized by a part of speech as well as morphological and
grammatical properties expressed by feature structures (see
Section 8.1). Immediate information about the lemma
and the inflected form may also be attached with the attributes
lemma and form. In particular, the attribute form
is useful when the inflected form attached to the word form does not
coincide with the content of the covered tokens, for
instance because of spelling corrections.
A token may be associated with more than one word form and, conversely,
a word form may cover more than one token.
For instance, in French, the agglutinated form
auquel (“to which”, from à + lequel) may have several representations,
depending on the granularity of the tokenization:
- coarse granularity
- The character sequence auquel is
not decomposed and is covered by a single token, with two word
forms covering this segment.
<token id="t0">auquel</token>
<wordForm lemma="à" tag="pos.prep" tokens="t0"/>
<wordForm lemma="lequel" tag="pos.pronrel" tokens="t0"/>
- fine granularity
- The tokenizer identifies two agglutinated
parts materialized by two tokens, each of them anchoring a word
form:
<token form="à" id="t0">auquel</token>
<token form="lequel" id="t1" join="overlap"/>
<wordForm lemma="à" tag="pos.prep" tokens="t0"/>
<wordForm lemma="lequel" tag="pos.pronrel" tokens="t1"/>
The choice of a level of granularity can be motivated by the usage or
by the available tools for a given language.
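The two representations above can be compared programmatically. The following is a minimal sketch, assuming the fragments are wrapped in a hypothetical root element named maf (not defined in this section) and that the form attributes of the tokens are omitted for brevity. It suggests that the two granularities differ in the number of tokens but yield the same sequence of word forms.

```python
import xml.etree.ElementTree as ET

# Hypothetical <maf> wrapper: only there to make each fragment a well-formed document.
COARSE = """<maf>
  <token id="t0">auquel</token>
  <wordForm lemma="à" tag="pos.prep" tokens="t0"/>
  <wordForm lemma="lequel" tag="pos.pronrel" tokens="t0"/>
</maf>"""

FINE = """<maf>
  <token id="t0">auquel</token>
  <token id="t1" join="overlap"/>
  <wordForm lemma="à" tag="pos.prep" tokens="t0"/>
  <wordForm lemma="lequel" tag="pos.pronrel" tokens="t1"/>
</maf>"""


def summarize(xml_text):
    """Return (number of tokens, lemmas of the word forms in document order)."""
    root = ET.fromstring(xml_text)
    n_tokens = len(root.findall("token"))
    lemmas = [wf.get("lemma") for wf in root.findall("wordForm")]
    return n_tokens, lemmas


coarse_summary = summarize(COARSE)
fine_summary = summarize(FINE)
```

A consumer that only reads the word forms is therefore unaffected by the choice of granularity in this example.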
As mentioned before, there are no mandatory linguistic properties for
defining the tokens, which can, for instance, be automatically
recognized by regular expressions. On the other hand, a word form, which
may cover zero, one or more tokens, should represent a linguistic unit
carrying morpho-syntactic information.
The current proposal does not discuss the linguistic choices that
define these linguistic units but provides enough flexibility to
annotate them. The choice may be motivated by lexical or morphological
properties based on context and language (depending on the
nature and function of words).
7.1 Token attachment
7.1.1 One token; one word form
The simplest case of relationship between tokens and word forms is
when a word form covers a single token.
<token id="t0">apple</token>
<wordForm lemma="apple" tokens="t0"/>
7.1.2 Several contiguous tokens; one word form
However, the current proposal allows the handling of more complex
cases, such as the identification of compound words covering several
adjacent tokens:
<token id="t0">prime</token>
<token id="t1">minister</token>
<wordForm lemma="prime_minister" tokens="t0 t1"/>
7.1.3 Several discontinuous tokens; one word form
A sequence of non-contiguous tokens may also be attached to a word
form, for instance to handle cases where some material is inserted
between the components of a word form:
<token id="t1">afin</token>
<token id="t2">justement</token>
<token id="t3">de</token>
<wordForm lemma="afin_de" tokens="t1 t3"/>
<wordForm lemma="justement" tokens="t2"/>
This kind of phenomenon may also occur for verbs with detached
particles, for instance in English or German. The English infinitive
verbal form “to <verb>” may also fit in this scheme.
<token id="t1">to</token>
<token id="t2">eventually</token>
<token id="t3">decide</token>
<wordForm lemma="to_decide" tokens="t1 t3"/>
<wordForm lemma="eventually" tokens="t2"/>
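The attachment cases described in this section can be told apart mechanically from the token identifiers alone. The following is a minimal sketch, not part of MAF itself; the function name classify_attachment and the four labels it returns are illustrative assumptions.

```python
def classify_attachment(token_ids, covered):
    """Classify how a word form attaches to tokens.

    token_ids: all token ids of the document, in document order.
    covered:   the ids listed in the word form's tokens attribute.
    """
    if not covered:
        return "zero"
    # Position of each covered token in the token stream.
    positions = sorted(token_ids.index(t) for t in covered)
    if len(positions) == 1:
        return "single"
    # Contiguous when the positions form an unbroken run.
    span = positions[-1] - positions[0] + 1
    return "contiguous" if span == len(positions) else "discontinuous"


stream = ["t1", "t2", "t3"]
to_decide = classify_attachment(stream, "t1 t3".split())
eventually = classify_attachment(stream, "t2".split())
```

Here the word form for to_decide is reported as discontinuous because token t2 lies between its two components.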
In order to identify a discontinuous word form while preserving some
information about the position of each component in the flow of word
forms, one may use several word forms covering the same sequence of tokens
and referring to the same entry (but possibly to different sub-entries).
<token id="t1">to</token>
<token id="t2">eventually</token>
<token id="t3">decide</token>
<wordForm entry="urn:lexicon:en:decide:to" tokens="t1 t3"/>
<wordForm entry="urn:lexicon:en:eventually" tokens="t2"/>
<wordForm entry="urn:lexicon:en:decide:main" tokens="t1 t3"/>
7.1.4 Zero token; one word form
Another case that may arise is when one wishes to insert a word form
which is not realized in the original document, and is, therefore,
associated with an empty sequence of tokens, e.g., some pronouns in
Spanish or the hypothesis of traces.
<token id="t1">Jean</token>
<token id="t2">propose</token>
<token id="t3">de</token>
<token id="t4">partir</token>
<wordForm lemma="Jean" tokens="t1"/>
<wordForm lemma="proposer" tokens="t2"/>
<wordForm lemma="de" tokens="t3"/>
<wordForm lemma="PRO" tokens=""/>
<wordForm lemma="partir" tokens="t4"/>
Even if a word form covers no tokens, it still has a relative position
w.r.t. the other word forms. It is this relative position which is
pertinent for further processing, rather than some absolute document
position.
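As an illustration of relative position, the following minimal sketch derives the rank of each word form from document order, so that the empty word form PRO still receives a position. The root element maf is a hypothetical wrapper used only to make the fragment well-formed.

```python
import xml.etree.ElementTree as ET

DOC = """<maf>
  <token id="t1">Jean</token>
  <token id="t2">propose</token>
  <token id="t3">de</token>
  <token id="t4">partir</token>
  <wordForm lemma="Jean" tokens="t1"/>
  <wordForm lemma="proposer" tokens="t2"/>
  <wordForm lemma="de" tokens="t3"/>
  <wordForm lemma="PRO" tokens=""/>
  <wordForm lemma="partir" tokens="t4"/>
</maf>"""

root = ET.fromstring(DOC)
word_forms = root.findall("wordForm")

# Relative position: index of the word form in the sequence of word forms.
positions = {wf.get("lemma"): i for i, wf in enumerate(word_forms)}

# Word forms that cover no token at all.
empty = [wf.get("lemma") for wf in word_forms if not wf.get("tokens", "").split()]
```

The position of PRO is defined only with respect to its neighbours de and partir, not by any offset in the original document.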
7.1.5 One token; several word forms
Finally, several word forms may be attached to the same token, as
illustrated by the following examples.
<!-- Give it to me -->
<token form="damelo" id="t1">Damelo</token>
<wordForm lemma="da" tokens="t1"/> <!-- (give) -->
<wordForm lemma="me" tokens="t1"/> <!-- (me) -->
<wordForm lemma="lo" tokens="t1"/> <!-- (it) -->
<!-- to which -->
<token id="t0">auquel</token>
<wordForm lemma="à" tag="pos.prep" tokens="t0"/>
<wordForm lemma="lequel" tag="pos.pronrel" tokens="t0"/>
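When several word forms share a token, it can be convenient to invert the tokens attribute into an index from token to word forms. The following is a minimal sketch under the assumption of a hypothetical maf root element; the helper name word_forms_by_token is illustrative.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

DOC = """<maf>
  <token form="damelo" id="t1">Damelo</token>
  <wordForm lemma="da" tokens="t1"/>
  <wordForm lemma="me" tokens="t1"/>
  <wordForm lemma="lo" tokens="t1"/>
</maf>"""


def word_forms_by_token(xml_text):
    """Map each token id to the lemmas of the word forms covering it."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)
    for wf in root.iter("wordForm"):
        # The tokens attribute is a whitespace-separated list of ids.
        for token_id in wf.get("tokens", "").split():
            index[token_id].append(wf.get("lemma"))
    return dict(index)


index = word_forms_by_token(DOC)
```

The lemmas are listed in the order of the word forms, which preserves their relative position within the token.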
7.2 Referring to lexicon entries
A word form is a linguistic unit carrying morpho-syntactic properties.
Generally, a linguistic unit may be characterized by a label
corresponding to an entry in some lexicon. This identification is
materialized by the attribute entry, whose content should
express a reference (a URN) to the lexicon entry.
<token id="t1">Prime</token>
<token id="t2">minister</token>
<wordForm entry="urn:lexicon:en:prime_minister" tokens="t1 t2"/>
The notion of “lexicon entry” is outside the scope of MAF. A
reference to a lexicon entry is therefore not precisely defined but,
as a first approximation, should correspond to a URN (Uniform
Resource Name). It should be noted that one may wish to reference
lexicon “sub-entries” for polysemous entries or for compound forms.
<token id="t1">to</token>
<token id="t2">eventually</token>
<token id="t3">decide</token>
<wordForm entry="urn:lexicon:en:decide:to" tokens="t1 t3"/>
<wordForm entry="urn:lexicon:en:eventually" tokens="t2"/>
<wordForm entry="urn:lexicon:en:decide:main" tokens="t1 t3"/>
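Since MAF does not define the structure of entry references, any decomposition of them is a convention of the lexicon in use. The following minimal sketch assumes the layout suggested by the examples of this section, namely urn:lexicon:language:entry with an optional trailing sub-entry; this layout and the names LexiconRef and parse_entry are assumptions, not part of the proposal.

```python
from typing import NamedTuple, Optional


class LexiconRef(NamedTuple):
    language: str
    entry: str
    sub_entry: Optional[str]


def parse_entry(urn: str) -> LexiconRef:
    """Split an entry reference of the assumed form
    urn:lexicon:<language>:<entry>[:<sub-entry>]."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0].lower() != "urn" or parts[1] != "lexicon":
        raise ValueError(f"unsupported entry reference: {urn!r}")
    sub_entry = parts[4] if len(parts) == 5 else None
    return LexiconRef(parts[2], parts[3], sub_entry)


particle = parse_entry("urn:lexicon:en:decide:to")
main = parse_entry("urn:lexicon:en:decide:main")
```

Under this convention, the two word forms covering t1 and t3 share the entry decide and differ only in their sub-entry.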
A token or a sequence of tokens may sometimes be identified as forming
a word form because of various properties but cannot be associated with
any lexicon entry, either because no lexicon is available or because
the word form corresponds to a named entity (a proper name, a date, an
address, ...) or to a neologism. In that case, the content of the
attribute entry may be left empty. The other informative
attributes lemma and form may still be used to provide
more information about the word form.
<token id="t0">October</token>
<token id="t1">,</token>
<token id="t2">23rd</token>
<token id="t3">2005</token>
<wordForm lemma="DATE" form="2005/10/23" tokens="t0 t1 t2 t3"/>
For such unknown words, it is however suggested that they be
collected into a document-specific lexicon, so that they can
refer to entries in this lexicon.
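One possible way to build such a document-specific lexicon is sketched below. The URN prefix urn:lexicon:doc:, the maf root element and the function name attach_local_entries are hypothetical choices made for illustration; MAF does not prescribe them.

```python
import xml.etree.ElementTree as ET

DOC = """<maf>
  <token id="t0">October</token>
  <token id="t1">,</token>
  <token id="t2">23rd</token>
  <token id="t3">2005</token>
  <wordForm lemma="DATE" form="2005/10/23" tokens="t0 t1 t2 t3"/>
</maf>"""


def attach_local_entries(root, prefix="urn:lexicon:doc:"):
    """Give each word form lacking an entry a reference into a local lexicon.

    Returns the local lexicon as a dict: entry URN -> list of recorded forms.
    """
    lexicon = {}
    for wf in root.iter("wordForm"):
        if not wf.get("entry") and wf.get("lemma"):
            urn = prefix + wf.get("lemma")
            lexicon.setdefault(urn, []).append(wf.get("form"))
            wf.set("entry", urn)  # the word form now refers to the local entry
    return lexicon


root = ET.fromstring(DOC)
local_lexicon = attach_local_entries(root)
```

Word forms that already carry an entry are left untouched, so the local lexicon only collects the unknown words.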
7.3 Compound word forms
The structure of compound forms (including multi-word expressions) may
be expressed using nested word forms, therefore providing information
about the subparts even when none is available for the whole, for
instance for neologisms:
<!-- birthday gift wrapping paper -->
<token form="Geburtstag" id="t1" join="right">Geburtstags</token>
<token form="Geschenk" id="t2" join="right">geschenk</token>
<token form="Papier" id="t3">papier</token>
<wordForm tokens="t1 t2 t3">
  <wordForm entry="urn:lexicon:de:geburtstag" lemma="geburtstag" tokens="t1"/>
  <wordForm entry="urn:lexicon:de:geschenk" lemma="geschenk" tokens="t2"/>
  <wordForm entry="urn:lexicon:de:papier" lemma="papier" tokens="t3"/>
</wordForm>
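Nested word forms can be explored with a simple recursive traversal. The following minimal sketch, which assumes a hypothetical maf root element and omits the entry attributes for brevity, prints the compound as an indented outline.

```python
import xml.etree.ElementTree as ET

DOC = """<maf>
  <token form="Geburtstag" id="t1" join="right">Geburtstags</token>
  <token form="Geschenk" id="t2" join="right">geschenk</token>
  <token form="Papier" id="t3">papier</token>
  <wordForm tokens="t1 t2 t3">
    <wordForm lemma="geburtstag" tokens="t1"/>
    <wordForm lemma="geschenk" tokens="t2"/>
    <wordForm lemma="papier" tokens="t3"/>
  </wordForm>
</maf>"""


def describe(word_form, depth=0):
    """Return one outline line per word form, indented by nesting depth."""
    label = word_form.get("lemma") or "(no lemma)"
    lines = ["  " * depth + f"{label} [{word_form.get('tokens', '')}]"]
    # findall only returns direct children, so recursion follows the nesting.
    for child in word_form.findall("wordForm"):
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(DOC)
outline = [line for wf in root.findall("wordForm") for line in describe(wf)]
```

The outer word form has no lemma of its own, yet its subparts remain individually identifiable.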
Note: Specifying the derivational morphology of a compound
word is outside the scope of MAF. Still, the addition of a
deriv attribute on embedded word forms is being investigated,
for instance to mark the head of a compound form.
7.4 Formal description: wordForm
wordForm =
  element wordForm {
    wordForm.identification,
    wordForm.tokens,
    wordForm*,
    wordForm.content ?
  }
wordForm.tokens =
  ( attribute tokens { xsd:IDREFS }
  | ## DTD => ,
    \token*
  )
wordForm.identification &= attribute entry { xsd:anyURI } ?
wordForm.identification &= attribute lemma { string } ?
wordForm.identification &= attribute form { string } ?
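Because the tokens attribute is declared as a list of ID references, each identifier it lists is expected to match the id of some token. The following minimal sketch checks this property outside of a schema validator; the maf root element and the function name dangling_token_refs are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def dangling_token_refs(xml_text):
    """Return the ids referenced by word forms that match no token id."""
    root = ET.fromstring(xml_text)
    known = {tok.get("id") for tok in root.iter("token")}
    missing = []
    for wf in root.iter("wordForm"):
        for ref in wf.get("tokens", "").split():
            if ref not in known:
                missing.append(ref)
    return missing


GOOD = """<maf>
  <token id="t0">prime</token>
  <token id="t1">minister</token>
  <wordForm lemma="prime_minister" tokens="t0 t1"/>
</maf>"""

BAD = """<maf>
  <token id="t0">prime</token>
  <wordForm lemma="prime_minister" tokens="t0 t9"/>
</maf>"""

good_report = dangling_token_refs(GOOD)
bad_report = dangling_token_refs(BAD)
```

An empty report means that every reference resolves; a full validation against the schema would still be needed to check the remaining constraints.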