
5  General characteristics of MAF

5.1  Overview

In the linguistic community, morpho-syntactic annotations provide an important layer of linguistic information in a document, even though they do not cover the full range of possible linguistic annotations. Other kinds of annotation, for example on references, discourse, prosody, or parsing, may complement morpho-syntactic annotations.

Syntax and semantics cannot be avoided in the definition of parts of speech and of grammatical categories. For instance, pronouns and substantives intrinsically carry a reference to some entity; the tense or aspect of a verb indicates temporal deixis; and person, modality, and other grammatical categories indicate the enunciation context. It is therefore difficult to give a precise definition of what morpho-syntactic annotations cover, because they are closely tied to many other linguistic properties of a given language in a given context.

Nevertheless, the present proposal tries to delimit the minimal and maximal sequences in documents (either text or speech) that can be identified as morpho-syntactic units, and to categorize the linguistic properties that may be used to mark these units within some larger syntagmatic context. Minimal units cannot be broken into sub-parts that could be identified by similar morpho-syntactic criteria, although they may still be broken into smaller units with morphological or phonological properties. Morpho-syntactic units can be nested to form maximal units (such as compound words) that act as elementary units for other levels of linguistic analysis, particularly parsing. The exact boundary between morpho-syntax and parsing is sometimes difficult to define.
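The nesting of minimal units into a maximal unit can be illustrated with a minimal Python sketch. It is not part of the proposal: the `Unit` class, its fields, and the `minimal_units` helper are hypothetical names chosen for illustration, and the French compound "pomme de terre" is only an assumed example of a maximal unit.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """Hypothetical morpho-syntactic unit (illustration only)."""
    form: str
    children: list = field(default_factory=list)  # nested sub-units, if any

    def is_minimal(self):
        # A minimal unit has no morpho-syntactic sub-units.
        return not self.children


def minimal_units(unit):
    """Return the minimal units nested inside a unit, in surface order."""
    if unit.is_minimal():
        return [unit]
    leaves = []
    for child in unit.children:
        leaves.extend(minimal_units(child))
    return leaves


# Assumed example: a compound word acting as one maximal unit.
compound = Unit("pomme de terre", [Unit("pomme"), Unit("de"), Unit("terre")])
forms = [u.form for u in minimal_units(compound)]
```

A parser could then treat `compound` as a single elementary unit, while a morpho-syntactic tool could still reach the three minimal units it contains.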

5.2  MAF Meta-Model




Figure 1: Simplified view of MAF meta-model



Figure 1 presents a simplified view of the proposed meta-model for morpho-syntactic annotations, while Figure 2 presents a more formal view based on UML. An annotated document is formed by a raw original document and a set of annotations. The annotations are carried by word forms covering zero, one, or more segments (tokens) of the original document. A word form may reference a lexicon entry, and it provides information about its underlying lemma and inflected form. The morpho-syntactic content attached to a word form is expressed by feature structures following the guidelines of one or more tagsets. The terminology, or set of categories, used in a tagset is described with respect to registered data categories. Because of structural ambiguities, both tokens and word forms are organized into one or more flows, materialized by lattices or, more formally, by Directed Acyclic Graphs (DAGs). The current proposal addresses the representation of segments (through tokens), word forms, morpho-syntactic content, tagsets, and ambiguity. A MAF model is instantiated from the MAF meta-model through the selection of a set of data categories.
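As a rough illustration of the components described above, the following Python sketch models tokens, word forms with a flat feature structure, and a lattice of word forms. It is a sketch under stated assumptions, not the proposal's own notation: the class names, the `readings` method, and the feature names `pos` and `number` are hypothetical and do not correspond to registered data categories. The example assumes the French token "des", which may be read either as a single determiner or as the contraction of "de" and "les".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Token:
    """A segment of the raw document (character offsets are an assumption)."""
    start: int
    end: int
    text: str


@dataclass
class WordForm:
    """A word form covering zero, one, or more tokens."""
    lemma: str
    tokens: list
    features: dict = field(default_factory=dict)  # flat feature structure
    entry: Optional[str] = None                   # optional lexicon entry id


@dataclass
class Lattice:
    """Word forms organized as a DAG between numbered states."""
    initial: int
    final: int
    edges: list = field(default_factory=list)  # (source, target, WordForm)

    def add(self, source, target, word_form):
        self.edges.append((source, target, word_form))

    def readings(self, state=None):
        """Enumerate every path from `state` to the final state.

        Terminates only if the graph is acyclic, as the meta-model requires.
        """
        state = self.initial if state is None else state
        if state == self.final:
            return [[]]
        paths = []
        for source, target, word_form in self.edges:
            if source == state:
                for rest in self.readings(target):
                    paths.append([word_form] + rest)
        return paths


# One token, two competing analyses (hypothetical feature names).
des = Token(0, 3, "des")
lattice = Lattice(initial=0, final=1)
lattice.add(0, 1, WordForm("des", [des], {"pos": "determiner", "number": "plural"}))
lattice.add(0, 2, WordForm("de", [des], {"pos": "preposition"}))
lattice.add(2, 1, WordForm("le", [des], {"pos": "determiner", "number": "plural"}))

lemma_paths = [[wf.lemma for wf in path] for path in lattice.readings()]
```

Each path through the lattice is one reading of the token sequence. Keeping all readings in a single graph avoids duplicating the word forms that competing analyses share.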



Figure 2: UML view of MAF meta-model



