8 Morpho-syntactic content

This section explains how to attach morpho-syntactic content to word forms, and how to define reusable tagsets that provide compact notations (tags) and constrain which contents are valid.

The previous section explains how to enrich a document with morpho-syntactic annotations. However, it does not define the content of these annotations: which features and feature values should be used to express this content (within element wordForm), and with what meaning?

Such a set is usually referred to as a tagset, which specifies the content of possible annotations. However, the diversity of approaches and languages makes it almost impossible to propose a single tagset. More modestly and pragmatically, the current proposal seeks to provide mechanisms to define tagsets by relying on a Data Category Registry (DCR) and on Feature Structure Representation (FSR).

An annotated document is therefore completed by either including a tagset or referring to one.

8.1 Using feature structures

A word form may be completed with morpho-syntactic content that defines its linguistic nature and its grammatical function in its current context. This content is expressed using feature structures, following the recommendation of ISO 24610 Part 1, “Feature Structure Representation” [FSR]. As a first approximation, a feature structure attaches one or several (possibly complex) values to linguistic properties (e.g., noun to part of speech, present to tense, indicative to mood).

<token id="t0">belle</token>
<wordForm entry="urn:lexicon:fr:beau" lemma="beau" tokens="t0">
  <fs>
    <f name="pos"><symbol value="adjective"/></f>
    <f name="adj_type"><symbol value="qualifier"/></f>
    <f name="gender"><symbol value="feminine"/></f>
    <f name="number"><symbol value="singular"/></f>
  </fs>
</wordForm>

The feature structure content attached to a word form may also provide additional information of interest about that word form.

8.2 Compact morpho-syntactic tags

The FSR proposal provides a compact representation of feature structures by relying on libraries that name feature values and feature specifications (a feature specification being a pair formed by a feature and a value). These names may be used in the wordForm attribute tag to obtain compact tags, following a standard practice in the NLP community.

<token id="t0">belle</token>
<wordForm tokens="t0" entry="urn:lexicon:fr:beau"
          tag="pos.adj adj_type.qual gender.fem num.sing"/>

The content of attribute tag should be similar to the content of attribute feats defined in FSR, namely a space-separated sequence of feature specification identifiers.

The libraries naming recurrent values and feature specifications are part of the tagset(s) coming with the annotated document.

8.2.1 FSR libraries

The generic way provided by FSR to use libraries is illustrated by the following example, with the attribute feats of element fs:

<fvLib n="French morpho values">
  <symbol xml:id="noun" value="noun"/>
  <symbol xml:id="sing" value="singular"/>
  <symbol xml:id="plu" value="plural"/>
  <symbol xml:id="masc" value="masculine"/>
  <symbol xml:id="fem" value="feminine"/>
</fvLib>

<fLib>
  <f xml:id="pos.n" name="pos" fVal="noun"/>
  <f xml:id="num.s" name="number" fVal="sing"/>
  <f xml:id="num.p" name="number" fVal="plu"/>
  <f xml:id="gen.f" name="gender" fVal="fem"/>
  <f xml:id="gen.m" name="gender" fVal="masc"/>
</fLib>

With such a library, following FSR rules, one may write:

or, equivalently, by using attribute tag, one may write:
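As an illustrative sketch only, the two notations could look as follows for the library of Section 8.2.1. The token `table`, its identifier `t0`, and the lexicon entry URN are assumptions made for this example and are not taken from the specification; only the identifiers `pos.n`, `gen.f` and `num.s` come from the library above.

```xml
<token id="t0">table</token>

<!-- FSR notation: attribute feats on element fs lists feature
     specification identifiers declared in the fLib library -->
<wordForm tokens="t0" entry="urn:lexicon:fr:table">
  <fs feats="pos.n gen.f num.s"/>
</wordForm>

<!-- Compact notation: the same identifiers carried by attribute tag,
     written instead of (not in addition to) the fs child -->
<wordForm tokens="t0" entry="urn:lexicon:fr:table"
          tag="pos.n gen.f num.s"/>
```

Both forms are meant to denote the same content (a feminine singular noun); the second simply avoids the nested fs element.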

Disjunctive values are allowed by FSR and may also be simplified, following the same mechanism:

<tagset>
  <fvLib>
    <vAlt xml:id="first.third">
      <symbol value="first"/>
      <symbol value="third"/>
    </vAlt>
    <symbol xml:id="verb" value="verb"/>
    <symbol xml:id="sing" value="singular"/>
  </fvLib>

  <fLib>
    <f xml:id="pers.13" name="pers" fVal="first.third"> </f>
    <f xml:id="pos.v" name="pos" fVal="verb"/>
    <f xml:id="num.s" name="number" fVal="sing"/>
  </fLib>
</tagset>

<token id="t0">porte</token>
<wordForm tokens="t0" entry="urn:lexicon:fr:porter" tag="pos.v pers.13 num.s"/>
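To make the shorthand concrete, the following sketch spells out the feature structure that the tag `pos.v pers.13 num.s` would stand for once each identifier is resolved against the libraries. The expansion is written by hand for illustration, on the assumption that a tag and its expanded fs are interchangeable; it is not output prescribed by the specification.

```xml
<wordForm tokens="t0" entry="urn:lexicon:fr:porter">
  <fs>
    <!-- pos.v: part of speech is verb -->
    <f name="pos"><symbol value="verb"/></f>
    <!-- pers.13: disjunctive value, first or third person -->
    <f name="pers">
      <vAlt>
        <symbol value="first"/>
        <symbol value="third"/>
      </vAlt>
    </f>
    <!-- num.s: number is singular -->
    <f name="number"><symbol value="singular"/></f>
  </fs>
</wordForm>
```

The disjunction is stated once in the library and reused through a single identifier, which is where most of the saving comes from.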

8.3 Designing tagsets

The features, values, and possibly feature types used to specify morpho-syntactic content are not just labels but carry linguistic meanings, or, in other words, semantic content. To avoid misinterpretations, the semantic content attached to a feature, a value or a type should be clearly defined. The combination of features, values and types should also be controlled in order to avoid linguistically invalid combinations, such as using /neuter/ as a value for /gender/ in French, or using a feature /tense/ for nouns in most languages.

MAF does not try to define the semantic content of a unique, complete set of such features, values, and types. This would be an almost impossible task given the diversity of languages, and it would be equally impossible to assign to each component a meaning agreed on by the whole community.

Instead, it is proposed that an annotated document be completed by including or referring to one or more tagsets.

The first objective of a tagset is to list the terminology used to annotate a document as a set of data categories whose meanings are precisely defined in a Data Category Registry, following the recommendation of the ISO 12620 proposal on “Data Category Registry”. The process may be seen as selecting a subset of morpho-syntactic data categories (Data Category Selection – DCS).

The correspondence with a registered data category may not be perfect. The attribute rel may be used to specify which relationship holds between the local and the registered data categories. For instance, one may introduce a local data category /advneg/ as being subsumed by a more general registered data category /adverb/.
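The /advneg/ case could be written as in the sketch below. The attribute names and the value `subs` are those listed in the formal description of Section 8.4; the registry URI is a placeholder, the reading of `subs` as "the local category is subsumed by the registered one" is an assumption, and so is the description text.

```xml
<!-- Hypothetical selection: local category advneg, linked to a
     registered category adverb (placeholder URI) -->
<dcs local="advneg"
     registered="http://example.org/dcr/adverb"
     rel="subs">
  <!-- assumed gloss for advneg; adjust to the actual local definition -->
  <description>An adverb expressing negation.</description>
</dcs>
```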

It is also possible (but not advised) to introduce a local data category bearing no relationship with any registered data category.

When the correspondence is imperfect or missing, a few words of description should be added to define the meaning of the local data category.

<dcs local="title">
  <description>
    A part of speech used to denote honorific titles like Pr. or S.A.S.
  </description>
</dcs>

The second objective of a tagset is to specify the set of valid feature structures based on the selected data categories. This will be achieved by relying on the proposed ISO 24610 Part 2 on “Feature System Declaration” [FSD].

The third objective of a tagset is to name the most common morpho-syntactic structures through the use of FSR libraries, as seen in Section 8.2.1.
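Putting these objectives together, a tagset might be supplied in either of the two ways sketched below. The `ref` URI is a placeholder, and the selection of categories and library entries is only illustrative. Feature system declarations (the second objective) are left out because their schema is not defined in this section.

```xml
<!-- Option 1: refer to an external tagset (placeholder URI) -->
<tagset ref="http://example.org/tagsets/fr-morpho.xml"/>

<!-- Option 2: include the tagset in the annotated document -->
<tagset>
  <!-- first objective: data category selection -->
  <dcs local="title">
    <description>
      A part of speech used to denote honorific titles like Pr. or S.A.S.
    </description>
  </dcs>

  <!-- third objective: libraries naming values and feature specifications -->
  <fvLib>
    <symbol xml:id="noun" value="noun"/>
    <symbol xml:id="sing" value="singular"/>
  </fvLib>
  <fLib>
    <f xml:id="pos.n" name="pos" fVal="noun"/>
    <f xml:id="num.s" name="number" fVal="sing"/>
  </fLib>
</tagset>
```

A referenced tagset can be shared across documents, whereas an embedded one keeps the annotated document self-contained.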

8.4 Formal description: `tagset`

wordForm.content =
  ( attribute tag { xsd:IDREFS }
  | ## DTD => ,
    fs
  )
fs |= notAllowed # defined in iso-fs-standalone.rnc

tagset = element tagset {
  ( attribute ref { xsd:anyURI }
  | ## DTD => ,
    (dcs* & fsd* & tagset.lib*)
  )
}

dcs = element dcs {
  attribute local { xsd:NCName },
  ( attribute registered { xsd:anyURI },
    attribute rel { "eq" | "subs" | "gen" }
  )?,
  element description { text }*
}

fsd |= notAllowed # defined in future iso-fsd.rnc
tagset.lib |= fvLib
tagset.lib |= fLib
fLib |= notAllowed # defined in iso-fs-standalone.rnc
fvLib |= notAllowed # defined in iso-fs-standalone.rnc