8 Morpho-syntactic content
This section explains how to attach morpho-syntactic content to word
forms, and how to define reusable tagsets that provide compact
notations through tags and control the validity of this
content.
The previous section explains how to enrich a document with
morpho-syntactic annotations. However, it does not define the
content of these annotations: which set of features and feature
values should be used to express this content (within element
wordForm), and with which meaning?
Such a set is usually referred to as a tagset, which specifies the
content of possible annotations. However, the diversity of
approaches and languages makes it almost impossible to propose
a single tagset. More modestly and pragmatically, the current
proposal seeks to provide mechanisms to define tagsets by relying
on a Data Category Registry (DCR) and Feature Structure
Representation (FSR).
An annotated document is therefore completed by either including a
tagset or referring to one.
8.1 Using feature structures
A word form may be completed by morpho-syntactic content defining
its linguistic nature and its grammatical function in its current
context. This content is expressed using feature structures,
following the recommendation of ISO 24610 Part 1 on
“Feature Structure Representation” [FSR]. As a first
approximation, a feature structure attaches one or several (possibly
complex) values to linguistic properties (e.g., noun to part of
speech, present to tense, indicative to mood).
<!-- nice -->
<token id="t0">belle</token>
<wordForm entry="urn:lexicon:fr:beau" lemma="beau" tokens="t0">
<fs>
<f name="pos"><symbol value="adjective"/></f>
<f name="adj_type"><symbol value="qualifier"/></f>
<f name="gender"><symbol value="feminine"/></f>
<f name="number"><symbol value="singular"/></f>
</fs>
</wordForm>
The feature structure content attached to a word form may also
provide additional information of interest about that word form.
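As an illustration of how such content might be consumed, the following minimal sketch reads the feature structure of the example above into a flat mapping from feature names to values. It assumes only the Python standard library; the helper name `features_of` is hypothetical, and only atomic symbol values are handled.

```python
import xml.etree.ElementTree as ET

# The wordForm example from this section, reproduced as a string.
WORD_FORM = """
<wordForm entry="urn:lexicon:fr:beau" lemma="beau" tokens="t0">
  <fs>
    <f name="pos"><symbol value="adjective"/></f>
    <f name="adj_type"><symbol value="qualifier"/></f>
    <f name="gender"><symbol value="feminine"/></f>
    <f name="number"><symbol value="singular"/></f>
  </fs>
</wordForm>
"""

def features_of(word_form):
    """Return {feature name: symbol value} for the <fs> child of a wordForm.

    Hypothetical helper: complex values (nested <fs>, alternations, ...)
    are ignored in this sketch.
    """
    features = {}
    for f in word_form.findall("./fs/f"):
        symbol = f.find("symbol")
        if symbol is not None:
            features[f.get("name")] = symbol.get("value")
    return features

word_form = ET.fromstring(WORD_FORM)
features = features_of(word_form)
print(word_form.get("lemma"), features)
```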
8.2 Compact morpho-syntactic tags
The FSR proposal provides ways to represent feature structures
compactly, by relying on libraries that name feature values and
feature specifications (a feature specification
being a pair formed by a feature and a value). These names may be used
in the tag attribute of wordForm to obtain compact tags, following
a standard practice in the NLP community.
<token id="t0">belle</token>
<wordForm tokens="t0"
entry="urn:lexicon:fr:beau"
tag="pos.adj adj_type.qual gender.fem num.sing"/>
The content of attribute tag should be similar to the content
of attribute feats defined in FSR, namely a space-separated
sequence of feature specification identifiers.
The libraries naming recurrent values and feature
specifications are part of the tagset(s) coming with the annotated
document.
8.2.1 FSR libraries
The generic way provided by FSR to use libraries is illustrated by the
following example, with the attribute feats of element
fs:
<!-- A feature value library -->
<fvLib n="French morpho values">
<symbol xml:id="noun" value="noun"/>
<symbol xml:id="sing" value="singular"/>
<symbol xml:id="plu" value="plural"/>
<symbol xml:id="masc" value="masculine"/>
<symbol xml:id="fem" value="feminine"/>
</fvLib>
<!-- A feature specification library -->
<fLib>
<f xml:id="pos.n" name="pos" fVal="noun"/>
<f xml:id="num.s" name="number" fVal="sing"/>
<f xml:id="num.p" name="number" fVal="plu"/>
<f xml:id="gen.f" name="gender" fVal="fem"/>
<f xml:id="gen.m" name="gender" fVal="masc"/>
</fLib>
With such a library, following FSR rules, one may write:
<wordForm lemma="prime_minister" tokens="t1 t2">
<fs feats="pos.n num.s gen.f"/>
</wordForm>
or, equivalently, by using attribute tag, one may
write:
<wordForm tokens="t1 t2"
lemma="prime_minister"
tag="pos.n num.s gen.f"/>
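Because the identifiers in a compact tag are meaningful only with respect to the libraries, a processor may wish to expand a tag and report any identifier it cannot resolve. The following is a minimal sketch of that idea, not a normative procedure. The helper names `load_specs` and `expand_tag` are hypothetical, the two libraries are wrapped in a single root element only so that they parse as one document, and fVal is assumed to hold a bare identifier as in the example above.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

LIBRARIES = """
<tagset>
  <fvLib n="French morpho values">
    <symbol xml:id="noun" value="noun"/>
    <symbol xml:id="sing" value="singular"/>
    <symbol xml:id="plu" value="plural"/>
    <symbol xml:id="masc" value="masculine"/>
    <symbol xml:id="fem" value="feminine"/>
  </fvLib>
  <fLib>
    <f xml:id="pos.n" name="pos" fVal="noun"/>
    <f xml:id="num.s" name="number" fVal="sing"/>
    <f xml:id="num.p" name="number" fVal="plu"/>
    <f xml:id="gen.f" name="gender" fVal="fem"/>
    <f xml:id="gen.m" name="gender" fVal="masc"/>
  </fLib>
</tagset>
"""

def load_specs(root):
    """Map each feature specification id to its (feature, value) pair."""
    values = {s.get(XML_ID): s.get("value") for s in root.iter("symbol")}
    return {f.get(XML_ID): (f.get("name"), values[f.get("fVal")])
            for f in root.iter("f")}

def expand_tag(tag, specs):
    """Expand a space-separated tag; also return the unresolved identifiers."""
    features, unknown = {}, []
    for ident in tag.split():
        if ident in specs:
            name, value = specs[ident]
            features[name] = value
        else:
            unknown.append(ident)
    return features, unknown

specs = load_specs(ET.fromstring(LIBRARIES))
resolved = expand_tag("pos.n num.s gen.f", specs)
dangling = expand_tag("pos.n num.sg gen.f", specs)  # num.sg is not declared
print(resolved)
print(dangling)
```

An identifier such as num.sg, which is absent from the library, is reported in the second component of the result rather than being silently dropped.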
Disjunctive values are allowed by FSR and may also be simplified,
following the same mechanism:
<!-- A feature value library -->
<tagset>
<fvLib>
<vAlt xml:id="first.third">
<symbol value="first"/>
<symbol value="third"/>
</vAlt>
<symbol xml:id="verb" value="verb"/>
<symbol xml:id="sing" value="singular"/>
</fvLib>
<!-- A feature specification library -->
<fLib>
<f xml:id="pers.13" name="pers" fVal="first.third">
</f>
<f xml:id="pos.v" name="pos" fVal="verb"/>
<f xml:id="num.s" name="number" fVal="sing"/>
</fLib>
</tagset>
<!-- Annotated document -->
<token id="t0">porte</token>
<wordForm tokens="t0"
entry="urn:lexicon:fr:porter"
tag="pos.v pers.13 num.s"/>
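The expansion of a tag that names a disjunctive value can be sketched in the same spirit. In the code below, an alternation is represented as a frozenset of symbol values; this representation, like the helper names `value_of` and `expand_tag`, is an assumption made for illustration and is not prescribed by FSR.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The tagset from the example above.
TAGSET = """
<tagset>
  <fvLib>
    <vAlt xml:id="first.third">
      <symbol value="first"/>
      <symbol value="third"/>
    </vAlt>
    <symbol xml:id="verb" value="verb"/>
    <symbol xml:id="sing" value="singular"/>
  </fvLib>
  <fLib>
    <f xml:id="pers.13" name="pers" fVal="first.third"/>
    <f xml:id="pos.v" name="pos" fVal="verb"/>
    <f xml:id="num.s" name="number" fVal="sing"/>
  </fLib>
</tagset>
"""

def value_of(node):
    """Return a symbol's value, or a frozenset of values for a <vAlt>."""
    if node.tag == "vAlt":
        return frozenset(s.get("value") for s in node.findall("symbol"))
    return node.get("value")

root = ET.fromstring(TAGSET)
# Only the direct children of fvLib carry identifiers in this example.
values = {n.get(XML_ID): value_of(n) for n in root.find("fvLib")}
specs = {f.get(XML_ID): (f.get("name"), values[f.get("fVal")])
         for f in root.find("fLib")}

def expand_tag(tag):
    """Expand a compact tag into {feature: value or set of alternatives}."""
    return dict(specs[ident] for ident in tag.split())

porte = expand_tag("pos.v pers.13 num.s")
print(porte)
```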
8.3 Designing tagsets
The features, values, and possibly feature types used to specify
morpho-syntactic content are not just labels but carry linguistic
meanings, or, in other words, semantic content. To avoid
misinterpretations, the semantic content attached to a feature, a
value or a type should be clearly defined. The combination of
features, values and types should also be controlled in order to avoid
linguistically invalid combinations, such as using /neuter/ as
a value for /gender/ in French, or using a feature
/tense/ for nouns in most languages.
MAF does not try to define the semantic content of a unique, complete
set of such features, values, and types. This would be an almost
impossible task given the diversity of languages, and it would be
equally impossible to assign to each component a meaning agreed on by
the whole community.
Instead, it is proposed that an annotated document be completed
by including or referring to one or more tagsets.
The first objective of a tagset is to list the terminology used to
annotate a document as a set of data categories whose
meanings are precisely defined in a Data Category Registry,
following the recommendation of the ISO 12620 proposal on “Data
Category Registry”. The process may be seen as selecting a subset
of morpho-syntactic data categories (Data Category Selection –
DCS).
<tagset>
<dcs local="genre" registered="dcs:morphosyntax:gender:fr" rel="eq"/>
<dcs local="fem" registered="dcs:morphosyntax:gender:fr:feminine" rel="eq"/>
</tagset>
The correspondence with a registered data category may not be perfect.
The rel attribute may be used to specify which relationship exists
between the local and registered data categories. For instance, one
may introduce a local data category /advneg/ as being subsumed
by a more general registered data category /adverb/.
<dcs local="advneg" registered="dcs:morphosyntax:pos:adverb" rel="subs"/>
<dcs local="strange" rel="none"/>
It is also possible (but not advised) to introduce a local data
category bearing no relationship with any registered data category.
<dcs local="title"/>
When the correspondence is not perfect or missing, a few words of
description should be added to define the meaning of a local data
category.
<dcs local="title">
<description> A part of speech used to denote honorific titles like
Pr. or S.A.S.
</description>
</dcs>
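The recommendation above lends itself to a simple check. The following minimal sketch lists the local data categories whose correspondence is imperfect or missing and which carry no description. The function name `needs_description` is hypothetical, and the sample tagset simply gathers dcs elements shown in this section.

```python
import xml.etree.ElementTree as ET

TAGSET = """
<tagset>
  <dcs local="genre" registered="dcs:morphosyntax:gender:fr" rel="eq"/>
  <dcs local="fem" registered="dcs:morphosyntax:gender:fr:feminine" rel="eq"/>
  <dcs local="advneg" registered="dcs:morphosyntax:pos:adverb" rel="subs"/>
  <dcs local="title"/>
</tagset>
"""

def needs_description(dcs):
    """True if the correspondence is not exact and no description is given."""
    exact = dcs.get("registered") is not None and dcs.get("rel") == "eq"
    described = any((d.text or "").strip() for d in dcs.findall("description"))
    return not exact and not described

root = ET.fromstring(TAGSET)
# local name -> (registered data category, relationship)
mapping = {d.get("local"): (d.get("registered"), d.get("rel"))
           for d in root.findall("dcs")}
undocumented = [d.get("local") for d in root.findall("dcs")
                if needs_description(d)]
print(undocumented)
```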
The second objective of a tagset is to specify the set of valid
feature structures based on the selected data categories. This is
achieved by relying on the proposed ISO 24610 Part 2 on
“Feature System Declaration” [FSD].
The third objective of a tagset is to name the most common
morpho-syntactic structures through the use of FSR libraries, as seen
in Section 8.2.1.
8.4 Formal description: tagset
wordForm.content =
( attribute tag { xsd:IDREFS }
| ## DTD => ,
fs
)
fs |= notAllowed # defined in iso-fs-standalone.rnc
tagset =
element tagset {
( attribute ref { xsd:anyURI }
| ## DTD => ,
(dcs* & fsd* & tagset.lib*)
)
}
dcs =
element dcs {
attribute local { xsd:NCName },
( attribute registered { xsd:anyURI },
attribute rel { "eq" | "subs" | "gen" } )?,
element description { text }*
}
fsd |= notAllowed # defined in future iso-fsd.rnc
tagset.lib |= fvLib
tagset.lib |= fLib
fLib |= notAllowed # defined in iso-fs-standalone.rnc
fvLib |= notAllowed # defined in iso-fs-standalone.rnc
The dcs element corresponds to a Data Category Selection part whose
exact content is still to be defined.
The fsd element corresponds to a Feature System Declaration part whose
standardization is yet to be done.
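Some constraints of the schema above can be checked without a RELAX NG validator. The sketch below is one reading of the patterns for dcs and wordForm.content, under the assumption that the choice in wordForm.content is exclusive. The function names are hypothetical, and the NCName and IDREFS datatypes are not verified.

```python
import xml.etree.ElementTree as ET

ALLOWED_REL = {"eq", "subs", "gen"}

def check_dcs(dcs):
    """Return a list of problems found on one <dcs> element."""
    problems = []
    if dcs.get("local") is None:
        problems.append("missing 'local'")
    has_registered = dcs.get("registered") is not None
    has_rel = dcs.get("rel") is not None
    # The schema groups 'registered' and 'rel' in one optional pattern.
    if has_registered != has_rel:
        problems.append("'registered' and 'rel' must appear together")
    if has_rel and dcs.get("rel") not in ALLOWED_REL:
        problems.append("unexpected rel value: " + dcs.get("rel"))
    return problems

def check_word_form(word_form):
    """wordForm.content is read here as either a tag attribute or an <fs>."""
    has_tag = word_form.get("tag") is not None
    has_fs = word_form.find("fs") is not None
    if has_tag != has_fs:
        return []
    return ["expected exactly one of 'tag' or <fs>"]

ok_dcs = check_dcs(ET.fromstring(
    '<dcs local="advneg" registered="dcs:morphosyntax:pos:adverb" rel="subs"/>'))
bad_dcs = check_dcs(ET.fromstring('<dcs local="x" rel="other"/>'))
ok_wf = check_word_form(ET.fromstring('<wordForm tokens="t0" tag="pos.v"/>'))
bad_wf = check_word_form(ET.fromstring('<wordForm tokens="t0"/>'))
print(ok_dcs, bad_dcs, ok_wf, bad_wf)
```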