7 Word Forms as linguistic units

The segments identified by token elements are used to anchor word forms, which may generally be associated, through the attribute entry, with an entry in a lexicon. Word forms are also characterized by a part of speech as well as by morphological and grammatical properties expressed by feature structures (see Section 8.1). Immediate information about the lemma and the inflected form may also be attached with the attributes lemma and form. In particular, the attribute form is useful when the inflected form attached to the word form does not coincide with the content of the covered tokens, for instance because of spelling corrections.

A token may be associated with more than one word form and, conversely, a word form may cover more than one token.

For instance, in French, the morphological agglutination auquel (“to which”, combining à and lequel) may have several representations, depending on the granularity of the tokenization:

coarse granularity: The character sequence auquel is not decomposed and is covered by a single token, with two word forms covering this segment:

<token id="t0">auquel</token>
<wordForm lemma="à" tag="pos.prep" tokens="t0"/>
<wordForm lemma="lequel" tag="pos.pronrel" tokens="t0"/>

fine granularity: The tokenizer identifies two agglutinated parts materialized by two tokens, each of them anchoring a word form:

<token form="a" id="t0">auquel</token>
<token form="lequel" id="t1" join="overlap"/>
<wordForm lemma="à" tag="pos.prep" tokens="t0"/>
<wordForm lemma="lequel" tag="pos.pronrel" tokens="t1"/>

The choice of a level of granularity can be motivated by the usage or by the available tools for a given language.
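The two granularities can be compared with a short script. The following is only a sketch, not part of MAF: the `<maf>` wrapper element and the helper names `lemma_sequence` and `token_count` are assumptions introduced here so that the fragments can be parsed with Python's standard library. It illustrates that both representations yield the same sequence of lemmas while differing in the number of tokens.

```python
import xml.etree.ElementTree as ET

# The two fragments from the text, wrapped in a hypothetical <maf> root
# element so that each one is a well-formed XML document.
COARSE = """<maf>
  <token id="t0">auquel</token>
  <wordForm lemma="à" tag="pos.prep" tokens="t0"/>
  <wordForm lemma="lequel" tag="pos.pronrel" tokens="t0"/>
</maf>"""

FINE = """<maf>
  <token form="a" id="t0">auquel</token>
  <token form="lequel" id="t1" join="overlap"/>
  <wordForm lemma="à" tag="pos.prep" tokens="t0"/>
  <wordForm lemma="lequel" tag="pos.pronrel" tokens="t1"/>
</maf>"""


def lemma_sequence(xml_text):
    """Return the lemmas of all word forms, in document order."""
    root = ET.fromstring(xml_text)
    return [wf.get("lemma") for wf in root.iter("wordForm")]


def token_count(xml_text):
    """Return the number of token elements in the fragment."""
    return len(ET.fromstring(xml_text).findall("token"))


print(lemma_sequence(COARSE), token_count(COARSE))
print(lemma_sequence(FINE), token_count(FINE))
```

A consumer that only needs the word-form level can therefore treat both tokenizations alike.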

As mentioned before, no linguistic properties are mandatory for defining tokens, which can, for instance, be recognized automatically using regular expressions. A word form, on the other hand, may cover zero, one or more tokens and should represent a linguistic unit carrying morpho-syntactic information.

The current proposal does not discuss the linguistic choices that define these linguistic units but provides enough flexibility to annotate them. The choice may be motivated by lexical or morphological properties based on context and language (depending on the nature and function of words).

7.1 Token attachment

7.1.1 One token; one word form

The simplest case of relationship between tokens and word forms is when a word form covers a single token.

<token id="t0">apple</token>
<wordForm lemma="apple" tokens="t0"/>

7.1.2 Several contiguous tokens; one word form

However, the current proposal also handles more complex cases, such as the identification of compound words covering several adjacent tokens:

<token id="t0">prime</token>
<token id="t1">minister</token>
<wordForm lemma="prime_minister" tokens="t0 t1"/>

7.1.3 Several discontinuous tokens; one word form

A sequence of non-contiguous tokens may also be attached to a word form, for instance to handle cases where some material is inserted between the components of a word form:

<token id="t1">afin</token>
<token id="t2">justement</token>
<token id="t3">de</token>
<wordForm lemma="afin_de" tokens="t1 t3"/>
<wordForm lemma="justement" tokens="t2"/>

This kind of phenomenon may also occur for verbs with detached particles, for instance in English or German. The English infinitive verbal form “to <verb>” may also fit in this scheme.

<token id="t1">to</token>
<token id="t2">eventually</token>
<token id="t3">decide</token>
<wordForm lemma="to_decide" tokens="t1 t3"/>
<wordForm lemma="eventually" tokens="t2"/>
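As an illustration of how a consumer might tell these attachment cases apart, the sketch below classifies each word form by the positions of the tokens it covers. It is not part of MAF: the `<maf>` wrapper, the function name `attachment_kind` and the four labels are assumptions made for this example, and token positions are taken to be the order of the token elements in the fragment.

```python
import xml.etree.ElementTree as ET

# Example from the text, wrapped in a hypothetical <maf> root element.
DOC = """<maf>
  <token id="t1">to</token>
  <token id="t2">eventually</token>
  <token id="t3">decide</token>
  <wordForm lemma="to_decide" tokens="t1 t3"/>
  <wordForm lemma="eventually" tokens="t2"/>
</maf>"""


def attachment_kind(word_form, token_order):
    """Classify the token attachment of one wordForm element."""
    ids = (word_form.get("tokens") or "").split()
    if not ids:
        return "empty"
    positions = sorted(token_order[token_id] for token_id in ids)
    if len(positions) == 1:
        return "single"
    # Distinct positions form an unbroken run exactly when the span
    # between the first and the last equals the number of steps.
    if positions[-1] - positions[0] == len(positions) - 1:
        return "contiguous"
    return "discontinuous"


root = ET.fromstring(DOC)
# Position of each token in the flow of tokens, keyed by its id.
token_order = {tok.get("id"): n for n, tok in enumerate(root.findall("token"))}
kinds = {
    wf.get("lemma"): attachment_kind(wf, token_order)
    for wf in root.findall("wordForm")
}
print(kinds)
```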

In order to identify a discontinuous word form while preserving some information about the position of each component in the flow of word forms, one may use several word forms covering the same sequence of tokens and referring to the same entry (but possibly to different sub-entries).

<token id="t1">to</token>
<token id="t2">eventually</token>
<token id="t3">decide</token>
<wordForm entry="urn:lexicon:en:decide:to" tokens="t1 t3"/>
<wordForm entry="urn:lexicon:en:eventually" tokens="t2"/>
<wordForm entry="urn:lexicon:en:decide:main" tokens="t1 t3"/>

7.1.4 Zero token; one word form

Another case arises when one wishes to insert a word form that is not realized in the original document and is therefore associated with an empty sequence of tokens, e.g., some pronouns in Spanish or hypothesized traces.

<token id="t1">Jean</token>
<token id="t2">propose</token>
<token id="t3">de</token>
<token id="t4">partir</token>
<wordForm lemma="Jean" tokens="t1"/>
<wordForm lemma="proposer" tokens="t2"/>
<wordForm lemma="de" tokens="t3"/>
<wordForm lemma="PRO" tokens=""/>
<wordForm lemma="partir" tokens="t4"/>

Even if a word form covers no tokens, it still has a relative position w.r.t. the other word forms. It is this relative position which is pertinent for further processing, rather than some absolute document position.
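The relative position of an empty word form can be recovered from the sequence of word forms alone. The sketch below assumes that the order of the wordForm elements in the fragment encodes the flow of word forms; this reading, the `<maf>` wrapper and the helper `neighbours` are assumptions of the example, not statements of MAF.

```python
import xml.etree.ElementTree as ET

# Example from the text, wrapped in a hypothetical <maf> root element.
DOC = """<maf>
  <token id="t1">Jean</token>
  <token id="t2">propose</token>
  <token id="t3">de</token>
  <token id="t4">partir</token>
  <wordForm lemma="Jean" tokens="t1"/>
  <wordForm lemma="proposer" tokens="t2"/>
  <wordForm lemma="de" tokens="t3"/>
  <wordForm lemma="PRO" tokens=""/>
  <wordForm lemma="partir" tokens="t4"/>
</maf>"""


def neighbours(lemmas, index):
    """Return the lemmas immediately before and after a position."""
    before = lemmas[index - 1] if index > 0 else None
    after = lemmas[index + 1] if index + 1 < len(lemmas) else None
    return before, after


root = ET.fromstring(DOC)
word_forms = root.findall("wordForm")
lemmas = [wf.get("lemma") for wf in word_forms]

# Positions, in the flow of word forms, of those covering no token.
empty_positions = [
    n for n, wf in enumerate(word_forms)
    if not (wf.get("tokens") or "").split()
]

for position in empty_positions:
    print(lemmas[position], position, neighbours(lemmas, position))
```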

7.1.5 One token; several word forms

Finally, several word forms may be attached to the same token, as illustrated by the following examples.

<token form="damelo" id="t1">Damelo</token>
<wordForm lemma="da" tokens="t1"/>
<wordForm lemma="me" tokens="t1"/>
<wordForm lemma="lo" tokens="t1"/>

<token id="t0">auquel</token>
<wordForm lemma="à" tag="pos.prep" tokens="t0"/>
<wordForm lemma="lequel" tag="pos.pronrel" tokens="t0"/>
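Since the tokens attribute points from word forms to tokens, finding all word forms attached to a given token requires inverting that relation. A minimal sketch, assuming a hypothetical `<maf>` wrapper element and a helper named `word_forms_by_token` (neither is defined by MAF):

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Example from the text, wrapped in a hypothetical <maf> root element.
DOC = """<maf>
  <token form="damelo" id="t1">Damelo</token>
  <wordForm lemma="da" tokens="t1"/>
  <wordForm lemma="me" tokens="t1"/>
  <wordForm lemma="lo" tokens="t1"/>
</maf>"""


def word_forms_by_token(root):
    """Map each token id to the lemmas of the word forms covering it."""
    index = defaultdict(list)
    for wf in root.iter("wordForm"):
        for token_id in (wf.get("tokens") or "").split():
            index[token_id].append(wf.get("lemma"))
    return dict(index)


root = ET.fromstring(DOC)
index = word_forms_by_token(root)
print(index)
```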

7.2 Referring to lexicon entries

A word form is a linguistic unit carrying morpho-syntactic properties. Generally, a linguistic unit may be characterized by a label corresponding to an entry in some lexicon. This identification is materialized by the attribute entry, whose content should express a reference (a URN) to the lexicon entry.

<token id="t1">Prime</token>
<token id="t2">minister</token>
<wordForm entry="urn:lexicon:en:prime_minister" tokens="t1 t2"/>

The notion of “lexicon entry” is outside the scope of MAF. A reference to a lexicon entry is therefore not precisely defined but, as a first approximation, should correspond to a URN (Uniform Resource Name). It should be noted that one may wish to reference lexicon “sub-entries” for polysemous entries or for compound forms.

A token or a sequence of tokens may sometimes be identified as forming a word form because of various properties, yet cannot be associated with any lexicon entry, either because no lexicon is available or because the word form corresponds to a named entity (a proper name, a date, an address, ...) or to a neologism. In that case, the content of the attribute entry may be left empty. The other informative attributes lemma and form may still be used to provide more information about the word form.

<token id="t0">October</token>
<token id="t1">,</token>
<token id="t2">23rd</token>
<token id="t3">2005</token>
<wordForm lemma="DATE" form="2005/10/23" tokens="t0 t1 t2 t3"/>

It is nevertheless suggested that such unknown words be collected into a document-specific lexicon, so that they can refer to entries in this lexicon.
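One possible way to build such a document-specific lexicon is sketched below. Everything beyond the example fragment is an assumption made for illustration: the `<maf>` wrapper, the function name `collect_unknown`, the `urn:lexicon:doc:` prefix and the numbering of the generated entries are hypothetical and are not prescribed by MAF.

```python
import xml.etree.ElementTree as ET

# Example from the text, wrapped in a hypothetical <maf> root element.
DOC = """<maf>
  <token id="t0">October</token>
  <token id="t1">,</token>
  <token id="t2">23rd</token>
  <token id="t3">2005</token>
  <wordForm lemma="DATE" form="2005/10/23" tokens="t0 t1 t2 t3"/>
</maf>"""


def collect_unknown(root, prefix="urn:lexicon:doc:"):
    """Give every word form without an entry a document-local entry.

    The prefix and the numbering scheme are hypothetical.
    """
    surface = {tok.get("id"): tok.text or "" for tok in root.iter("token")}
    lexicon = {}  # generated entry reference -> description
    seen = {}     # (lemma, form, surface) -> generated entry reference
    for wf in root.iter("wordForm"):
        if wf.get("entry"):
            continue  # already refers to some lexicon
        ids = (wf.get("tokens") or "").split()
        covered = " ".join(surface[token_id] for token_id in ids)
        signature = (wf.get("lemma"), wf.get("form"), covered)
        if signature not in seen:
            seen[signature] = f"{prefix}{len(seen) + 1}"
            lexicon[seen[signature]] = {
                "lemma": wf.get("lemma"),
                "form": wf.get("form"),
                "surface": covered,
            }
        # Make the word form refer to its document-local entry.
        wf.set("entry", seen[signature])
    return lexicon


root = ET.fromstring(DOC)
lexicon = collect_unknown(root)
print(lexicon)
```

Identical unknown word forms share one generated entry, so a recurring named entity is listed only once.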

7.3 Compound word forms

The structure of compound forms (including multi-word expressions) may be expressed using nested word forms, thereby providing information about the subparts even when none is available for the whole, for instance for neologisms:

<token form="Geburtstag" id="t1" join="right">Geburtstags</token>
<token form="Geschenk" id="t2" join="right">geschenk</token>
<token form="Papier" id="t3">papier</token>
<wordForm tokens="t1 t2 t3">
  <wordForm entry="urn:lexicon:de:geburtstag" lemma="geburtstag" tokens="t1"/>
  <wordForm entry="urn:lexicon:de:geschenk" lemma="geschenk" tokens="t2"/>
  <wordForm entry="urn:lexicon:de:papier" lemma="papier" tokens="t3"/>
</wordForm>
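Nested word forms lend themselves to a recursive traversal. The sketch below, which assumes a hypothetical `<maf>` wrapper element and a helper named `describe`, turns the compound above into a small tree and shows that the outer word form carries no lemma of its own while its parts do.

```python
import xml.etree.ElementTree as ET

# Example adapted from the text, wrapped in a hypothetical <maf> root.
DOC = """<maf>
  <token form="Geburtstag" id="t1" join="right">Geburtstags</token>
  <token form="Geschenk" id="t2" join="right">geschenk</token>
  <token form="Papier" id="t3">papier</token>
  <wordForm tokens="t1 t2 t3">
    <wordForm entry="urn:lexicon:de:geburtstag" lemma="geburtstag" tokens="t1"/>
    <wordForm entry="urn:lexicon:de:geschenk" lemma="geschenk" tokens="t2"/>
    <wordForm entry="urn:lexicon:de:papier" lemma="papier" tokens="t3"/>
  </wordForm>
</maf>"""


def describe(word_form):
    """Recursively describe a word form and its embedded word forms."""
    return {
        "lemma": word_form.get("lemma"),
        "tokens": (word_form.get("tokens") or "").split(),
        # findall only returns direct children, so each nesting level
        # is handled by one recursive call.
        "parts": [describe(child) for child in word_form.findall("wordForm")],
    }


root = ET.fromstring(DOC)
compound = describe(root.find("wordForm"))
print(compound["lemma"], [part["lemma"] for part in compound["parts"]])
```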

Note: Specifying the derivational morphology of a compound word is outside the scope of MAF. Still, the addition of a deriv attribute on embedded word forms is being investigated, for instance to indicate the head of a compound form.

7.4 Formal description: `wordForm`

wordForm =
  element wordForm {
    wordForm.identification,
    wordForm.tokens,
    wordForm*,
    wordForm.content ?
  }
wordForm.tokens = (
  attribute tokens { xsd:IDREFS }
  | ## DTD => ,
    \token*
)
wordForm.identification &= attribute entry { xsd:anyURI } ?
wordForm.identification &= attribute lemma { string } ?
wordForm.identification &= attribute form { string } ?
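Because the tokens attribute holds references to token identifiers, a simple consistency check is to verify that every reference resolves to a token present in the document. The sketch below is an informal illustration, not a replacement for schema validation: the `<maf>` wrapper, the helper `dangling_token_refs` and the deliberately faulty word form with lemma "broken" are all invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the second word form refers to a token id
# ("t9") that does not exist, to demonstrate the check.
DOC = """<maf>
  <token id="t0">prime</token>
  <token id="t1">minister</token>
  <wordForm lemma="prime_minister" tokens="t0 t1"/>
  <wordForm lemma="broken" tokens="t1 t9"/>
</maf>"""


def dangling_token_refs(root):
    """Return (lemma, reference) pairs whose reference matches no token id."""
    known = {tok.get("id") for tok in root.iter("token")}
    problems = []
    # iter() also visits embedded word forms, so nested ones are checked.
    for wf in root.iter("wordForm"):
        for ref in (wf.get("tokens") or "").split():
            if ref not in known:
                problems.append((wf.get("lemma"), ref))
    return problems


root = ET.fromstring(DOC)
problems = dangling_token_refs(root)
print(problems)
```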