FRMG and Multi Word Expressions

Multi-word expressions (MWEs) are a real difficulty in parsing.

The main issues are:

• the lack of consensus on how to define and capture MWEs
• the absence of closed lists or operational specifications of MWEs (defining such a list took up a large part of the discussions during the French parsing evaluation campaign PASSAGE)
• a large diversity of MWE kinds: named entities, terms, locutions, idioms, ...
• a range of situations going from frozen to semi-productive MWEs

In FRMG, these diverse situations have led to a diversity of solutions, more or less satisfactory, at all levels: in the meta-grammar, in the pre-parsing phases, during parsing, in the disambiguation phase, and even during conversion to an output schema.

Note: A large part of what is presented here may also be found in these slides presented at the AIM-West & Parseme-FR workshop (Grenoble, October 2016).

At tokenization level (SxPipe)

Many simple MWEs (such as complex prepositions, conjunctions, and adverbs) have lexical entries in Lefff and are recognized during the tokenization phase by SxPipe. However, because this recognition may be erroneous, SxPipe generally returns lattices (DAGs) that encode the different readings, with and without MWEs, as shown below for "en fait", which may be interpreted either as an MWE adverb or as the sequence "en" + "fait".

[Figure: an ambiguous DAG produced by SxPipe, with an MWE reading]
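
The idea of such a lattice can be sketched in a few lines of Python. This is only an illustration, not the SxPipe output format: the edge triples and the underscore-joined form "en_fait" are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical lattice for the two tokens "en fait".
# Each edge is (start_node, end_node, form); this is NOT the SxPipe format.
EDGES = [
    (0, 2, "en_fait"),  # MWE reading: one edge spanning both tokens
    (0, 1, "en"),       # compositional reading, first token
    (1, 2, "fait"),     # compositional reading, second token
]

def readings(edges, start, end):
    """Enumerate every path through the DAG from start to end."""
    outgoing = defaultdict(list)
    for edge in edges:
        outgoing[edge[0]].append(edge)

    def walk(node):
        if node == end:
            yield []
            return
        for _, target, form in outgoing[node]:
            for rest in walk(target):
                yield [form] + rest

    return list(walk(start))

for path in readings(EDGES, 0, 2):
    print(" + ".join(path))
```

Both readings stay available to the parser, which is the point of returning a DAG rather than a single token sequence.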

As illustrated by the following two sentences, both readings may be correct depending on the context.

[Graph]

[Graph]

For a given sentence, the (untuned) disambiguation phase tends to favor MWE readings when possible, through a specific disambiguation rule.
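
A toy sketch of what "tends to favor" means: a soft preference that other evidence can override. The bonus value, the reading representation, and the function names below are all hypothetical. They do not reproduce the actual FRMG disambiguation rules.

```python
# Toy model: a reading is a list of (form, is_mwe) pairs, and it comes with
# a score contributed by all the other (unspecified) disambiguation rules.
MWE_BONUS = 1.0  # hypothetical weight, not an actual FRMG value

def score(reading, other_rules_score=0.0):
    """Add a fixed bonus for every MWE edge used by the reading."""
    mwe_edges = sum(1 for _, is_mwe in reading if is_mwe)
    return other_rules_score + MWE_BONUS * mwe_edges

def best(candidates):
    """candidates is a list of (reading, other_rules_score) pairs."""
    return max(candidates, key=lambda c: score(c[0], c[1]))[0]

mwe = [("en_fait", True)]
compositional = [("en", False), ("fait", False)]

# With no other evidence, the MWE reading wins.
print(best([(mwe, 0.0), (compositional, 0.0)]))
# Strong enough evidence from other rules still selects the other reading.
print(best([(mwe, 0.0), (compositional, 2.5)]))
```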

The same mechanism also applies to named entities, which are recognized by one of the SxPipe components and are often MWEs.

[Graph]

At parser level (and metagrammar level)

There is a specific "hack" for light verbs and predicative nouns. A notion of ncpred verbal argument is introduced at the level of the meta-grammar, with some specific properties, for instance the ability to be used without a determiner or to be modified by some intensifier adverbs (while keeping most of the properties of a noun). Lefff provides entries for so-called light verbs (such as "faire") that have no subcategorization frames, and entries for predicative nouns (such as "attention") with a frame ("to something") and the mention of the light verb to be used ("faire"). At parsing time, when a predicative noun (such as "attention") occurs together with its expected light verb ("faire"), the frame of the noun is transferred to the light verb, because the verb acts as the tree anchor.

Concretely, the Lefff lexicon provides the following entries:

attention    inv    100;Lemma;cfi;<Suj:cln|scompl|sinf|sn,Objà:scompl|à-scompl|à-sinf|à-sn>;lightverb=faire;%default
faire        v65    300;0;v;<Suj:cln|sn>;cat=v,lightverb=faire;%actif

Just before parsing, the occurrence of "faire" in the following sentence picks up the subcategorization frame of "attention" (because "faire" is the expected light verb for this entry of "attention" as an ncpred), leading to the following dependency tree (with an ncpred edge between the predicative noun and its light verb).

[Graph]
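
The frame transfer can be sketched as follows. This is a simplified illustration under several assumptions: the names given to the Lefff fields are a reading of the two entries above rather than an official specification, the input sentence is given as a list of lemmas, and `parse_entry` and `transfer_frame` are hypothetical helpers, not FRMG functions.

```python
# The two Lefff entries quoted above, with tab-separated columns.
ENTRIES = (
    "attention\tinv\t100;Lemma;cfi;"
    "<Suj:cln|scompl|sinf|sn,Objà:scompl|à-scompl|à-sinf|à-sn>;"
    "lightverb=faire;%default\n"
    "faire\tv65\t300;0;v;<Suj:cln|sn>;cat=v,lightverb=faire;%actif\n"
)

def parse_entry(line):
    """Split one entry into form, category, frame and features (assumed layout)."""
    form, _inflection, rest = line.split("\t")
    _weight, _lemma, cat, frame, feats, _redistribution = rest.split(";")
    slots = {}
    for slot in frame.strip("<>").split(","):
        function, realizations = slot.split(":")
        slots[function] = realizations.split("|")
    features = dict(f.split("=") for f in feats.split(","))
    return {"form": form, "cat": cat, "frame": slots, "features": features}

LEXICON = {e["form"]: e for e in map(parse_entry, ENTRIES.splitlines())}

def transfer_frame(lemmas, lexicon):
    """If a predicative noun and its expected light verb co-occur,
    return the frame that the verb (the tree anchor) would carry."""
    for lemma in lemmas:
        noun = lexicon.get(lemma)
        if noun is None or noun["cat"] == "v":
            continue
        expected = noun["features"].get("lightverb")
        verb = lexicon.get(expected)
        if (expected in lemmas and verb is not None
                and verb["features"].get("lightverb") == expected):
            return {"anchor": expected, "ncpred": lemma, "frame": noun["frame"]}
    return None

print(transfer_frame(["il", "faire", "attention"], LEXICON))
print(transfer_frame(["il", "faire"], LEXICON))  # no predicative noun: None
```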

Because we use regular verbal trees to handle predicative nouns, the analysis is more robust to all kinds of standard variations, such as negation, insertion of adverbs, extractions, ...

[Graph]

[Graph]

[Graph]

Note: there are currently around 320 entries for predicative nouns in the Lefff lexicon (plus a few others added in a customization file of FRMG). There also exist other lexical files in Lefff related to polylexical verbal constructions for the verb "avoir".

[Graph]

Note: Experiments were conducted in collaboration with Elsa Tolone to use the tables of the French Lexique-Grammaire, in particular for predicative nouns. The results were mostly acceptable, but with efficiency problems and lower accuracy because of the large number of entries (with no clear indication of frequent versus rare entries). More precisely, there are around 30700 entries for predicative nouns, including many like "pratiquer le yoga royal" (?).

At metagrammar level

Some locutions (or fragments of locutions) that do not really correspond to a base part of speech and/or that remain relatively productive (semi-frozen) are better described at the level of the meta-grammar.

For instance, one may consider that "c'est X que" or "est-ce que" are kinds of locutions in French, occurring in a large set of situations but with many variations (ordering, tense, negation, ...). There is a specific class (cleft_verb) for handling "c'est que", which inherits from the verb construction with restrictions.

[Graph]

[Graph]

We have a similar case with "il y a" (as the equivalent of "ago" in English), handled by the class ilya_as_time_mod.

[Graph]

A much more specific case is provided for the construction "X oblige" (en: X obligates), essentially found in "noblesse oblige", but which seems to be productive:

[Graph]

We also found "Fin de l'apartheid oblige, ..." in the French TreeBank.

The issue with this approach is the addition of very specific constructions to the meta-grammar, leading to an increase in the number of trees in the grammar. Inheritance in the meta-grammar eases the addition of these locutions (starting from a standard verb base and adding restrictions), but it remains a tedious task and is maybe not the best way to do it.

Other cases handled at the meta-grammar level include (lexicalized) discontinuous constructions, such as "ni X ni Y" coordinations, which may apply to all kinds of coordinated groups.

[Graph]

[Graph]

The meta-grammar also provides quoted constructions, which are useful for some specific named entities, such as titles of books, movies, ... However, there is no clear solution when these entities are not quoted.

[Graph]

At disambiguation level

As already mentioned, the disambiguation process tends to favor MWEs when possible.

As a special case, it is also possible to alter or complete the disambiguation process with a set of terms. Terms are so numerous and so domain-specific that we do not really wish to add them as lexical entries in Lefff. Most of them also have a standard underlying syntactic structure (such as "N de N"), which means they can be parsed even if we do not know they are terms. However, recognizing them may be useful for the post-parsing phase (e.g. for information extraction). It may also be important to recognize them as terms in order to handle some attachments correctly. When a list of terms is used, some disambiguation rules favor the reading as a term and also favor the attachment of dependents to the head of the term.

[Graph]

Using a term list in the lexer forwards information to the disambiguation process:

> echo "ce tremblement de terre a fait beaucoup de dégats" | ./frmg_lexer -terms CPL9
...
term(1,4,'term63_tremblement_de_terre').
....

Terms may impact disambiguation.
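
For post-processing, the `term(...)` facts can be pulled out of the lexer output with a regular expression. The sketch below relies only on the single fact shown above. Reading the two integers as lattice positions delimiting the term, and tokenizing on whitespace, are assumptions made for the example.

```python
import re

# Matches facts such as: term(1,4,'term63_tremblement_de_terre').
TERM_FACT = re.compile(r"term\((\d+),(\d+),'([^']*)'\)\.")

def extract_terms(lexer_output):
    """Return (start, end, term_id) triples found in the lexer output."""
    return [(int(m.group(1)), int(m.group(2)), m.group(3))
            for m in TERM_FACT.finditer(lexer_output)]

sample_output = "...\nterm(1,4,'term63_tremblement_de_terre').\n....\n"
tokens = "ce tremblement de terre a fait beaucoup de dégats".split()

for start, end, term_id in extract_terms(sample_output):
    # Assumption: start and end are positions between tokens.
    print(term_id, tokens[start:end])
```

Under that reading, the span (1, 4) covers "tremblement de terre", which is consistent with the term identifier.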

Conversion issues related to output schemas

FRMG provides outputs for several syntactic annotation schemas, such as PASSAGE, FTB/CONLL, or the more recent Universal Dependencies (UD) schema for French. Unfortunately, all these schemas differ in their notion, list, and representation of MWEs. The conversion process should therefore handle these cases as far as possible.

Actually, there is a simple case and a complex one:

• the simple case arises when a sequence of words for the FRMG parser corresponds to an MWE for the output schema. Essentially, we need to forget the parse structure provided by FRMG. Of course, we are in trouble when the parse structure is not compatible with a reading as a word or a constituent, for instance when the components of the sequence belong to distinct constituents.
• the complex case arises when an MWE for FRMG is not one for the output schema. It is then necessary to emit the expected parse structure for the MWE, but we generally do not have access to this structure. Several strategies are considered. One of them consists in using rules (and default rules) that provide the internal structure for large classes of MWEs. Because FRMG produces all possible analyses, another strategy consists in using one of the alternative parses not based on an MWE reading.
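
For the simple case, one possible compatibility test is to check that exactly one word of the sequence is governed from outside the sequence. This is only a sketch of the idea, not the check actually used by the FRMG converters. The head-array encoding and the function name are assumptions.

```python
def collapsible(heads, start, end):
    """Can tokens start..end-1 be merged into a single MWE node?

    heads[i] is the index of the governor of token i, or -1 for the root.
    The span is mergeable here when exactly one of its tokens is governed
    from outside the span (that token is the internal root of the span).
    """
    span = range(start, end)
    external = [i for i in span if heads[i] not in span]
    return len(external) == 1

# Token 3 depends on token 2, which depends on token 1:
# the span [2, 4) forms one subtree and can be collapsed.
print(collapsible([1, -1, 1, 2], 2, 4))

# Tokens 2 and 3 both attach directly to token 1:
# the span is split across two attachments and cannot be collapsed as is.
print(collapsible([1, -1, 1, 1], 2, 4))
```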

The following conversion rule illustrates this complex case for dates that are recognized as unitary Named Entities by FRMG but need to be expanded for UD.

udep_mwe_complex_expansion([day[],N,month[]],_, % Mercredi 26 Novembre
                           [head @ ('NOUN', 'nc'),
                            (nmod: 1) @ ('NUM', 'adj'),
                            (compound: -2) @ ('NOUN', 'nc')
                           ]
                          ) :- is_number(N).

[FRMG native output]

[UD output]
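
The effect of such a rule can be mimicked in Python. The sketch below rests on one important assumption that the source does not state: the integers in `(nmod: 1)` and `(compound: -2)` are read as offsets, relative to the current component, pointing at its governor. The function names and the row layout are hypothetical.

```python
# One item per MWE component, mirroring the rule in the text:
# ((relation, offset), (UPOS, fine-grained tag)).
# ASSUMPTION: offset is the relative position of the governor;
# "head" marks the component that keeps the MWE's external attachment.
DATE_TEMPLATE = [
    (("head", None), ("NOUN", "nc")),
    (("nmod", 1), ("NUM", "adj")),
    (("compound", -2), ("NOUN", "nc")),
]

def expand(forms, template, first_id=1):
    """Turn a unitary MWE into one row per component:
    (id, form, upos, tag, governor_id or None, relation)."""
    rows = []
    for i, (form, ((relation, offset), (upos, tag))) in enumerate(zip(forms, template)):
        governor = None if offset is None else first_id + i + offset
        rows.append((first_id + i, form, upos, tag, governor, relation))
    return rows

def expand_date(forms):
    """Apply the date template only when the middle component is a number,
    standing in for the is_number(N) guard of the rule."""
    _day, number, _month = forms
    if not number.isdigit():
        return None
    return expand(forms, DATE_TEMPLATE)

for row in expand_date(["Mercredi", "26", "Novembre"]):
    print(row)
```

Under this reading, "26" attaches to "Novembre" as nmod and "Novembre" attaches to "Mercredi" as compound, while "Mercredi" inherits whatever governor the date had as a whole.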

Some borderline and complex cases

Some well-identified MWEs tend to get a lexical entry in Lefff, but may be the trace of a more productive construction. As a result, we get several distinct parses that actually correspond to the same syntactic phenomenon.

For instance, "beaucoup de" or "peu de" are listed as determiners in Lefff, but may also be seen as the combination of a predeterminer (predet) with the preposition "de".

[Graph]

[Graph]

This notion of predet is also productive for other constructions.

[Graph]

[Graph]

Another similar case is "il y a", which is so common that it has an entry in Lefff as a preposition.

[Graph]

[Graph]

The productivity of some constructions that denote unusual parts of speech may be a problem. Among the MWEs that are not yet handled by SxPipe and FRMG, we have expressions like "je ne sais", as in "il est venu je ne sais comment" or "il a pris je ne sais quel livre", with many variations ("on ne sait", "nous ne savons", ...). We also have the expression "n'importe qu" as in "il fait n'importe quoi", "il travaille n'importe comment", "n'importe quel élève te le dira".

[Graph]