
Multi-word expressions (MWEs) are a real difficulty in parsing.

The main issues are:

  • the lack of consensus on defining and capturing MWEs
  • no closed lists or operational specifications of MWEs
  • a large diversity of MWE kinds: named entities, terms, locutions, idioms, ...
  • a range of situations going from frozen to semi-productive MWEs

In FRMG, these diverse situations have led to a diversity of solutions, more or less perfect, at all levels: at the meta-grammar level, in the pre-parsing phases, during parsing, in the disambiguation phase, and even during conversion to some output schema.

At tokenization level (SxPipe)

Many simple MWEs (such as complex prepositions, conjunctions, and adverbs) have lexical entries in Lefff and are recognized during the tokenization phase with SxPipe. However, because the recognition may be erroneous, SxPipe generally returns a lattice (or DAG) of the different readings, with and without MWEs, as shown below for "en fait", which can be interpreted as an MWE adverb or as the sequence "en" + "fait".

Figure: an ambiguous DAG produced by SxPipe with an MWE reading
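The kind of lattice described above can be sketched in a few lines of Python. This is only an illustration of the data structure, not SxPipe's actual output format: the edge encoding, the function name `readings`, and the sample sentence "il vient en fait" are all assumptions made for the example.

```python
from collections import defaultdict

# Toy lattice for the (made-up) sentence "il vient en fait".
# Each edge is (start_state, end_state, form); states are the
# positions between words. This encoding is an assumption.
EDGES = [
    (0, 1, "il"),
    (1, 2, "vient"),
    (2, 4, "en fait"),  # MWE reading: a single edge spans both words
    (2, 3, "en"),       # non-MWE reading, first word
    (3, 4, "fait"),     # non-MWE reading, second word
]


def readings(edges, start, final):
    """Enumerate every path through the DAG from start to final."""
    outgoing = defaultdict(list)
    for edge in edges:
        outgoing[edge[0]].append(edge)

    def walk(state):
        if state == final:
            yield []
            return
        for edge in outgoing[state]:
            for rest in walk(edge[1]):
                yield [edge] + rest

    return list(walk(start))


paths = readings(EDGES, 0, 4)
for path in paths:
    print(" | ".join(form for _, _, form in path))
```

Both readings survive tokenization; choosing between them is left to the later phases.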

As illustrated by the two following sentences, both readings may be correct.

Graph

Graph

For a given sentence, the disambiguation phase will tend to favor MWEs when possible (through a specific disambiguation rule).

The same mechanism also applies to the named entities recognized by one of the components of SxPipe, which are often MWEs.

Graph

At parser level (and metagrammar level)

There exists a specific "hack" for light verbs and predicative nouns. A notion of ncpred verbal argument is introduced at the level of the meta-grammar, with some properties, for instance the ability to be used without a determiner or to be modified by some adverbs. Lefff provides entries for so-called light verbs (such as "faire") that have no subcategorization frames, and entries for predicative nouns (such as "attention") with a frame ("to something") and the mention of the light verb to be used ("faire"). At parsing time, when a predicative noun (such as "attention") occurs together with its expected light verb ("faire"), the frame of the noun is transferred to the light verb, because the verb is the one acting as the tree anchor.
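The frame transfer can be pictured with a toy lexicon. The sketch below is not Lefff's entry format nor FRMG's implementation: the `Entry` class, the `anchor_frame` function, and the frame label "prep_a_obj" (standing for the "to something" argument) are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    lemma: str
    cat: str
    frame: List[str] = field(default_factory=list)  # subcategorization frame
    lightverb: Optional[str] = None                 # only set on predicative nouns


# Toy lexicon (hypothetical format). Only the light-verb entry of
# "faire" is listed here, so its frame is empty.
LEXICON = {
    "faire": Entry("faire", "v"),
    "attention": Entry("attention", "ncpred",
                       frame=["prep_a_obj"], lightverb="faire"),
    "gâteau": Entry("gâteau", "nc"),
}


def anchor_frame(verb_lemma, noun_lemma, lexicon=LEXICON):
    """Return the frame used by the verb when it anchors its tree.

    If the noun is a predicative noun that expects this verb as its
    light verb, the verb takes an ncpred argument plus the noun's frame.
    Otherwise the verb keeps its own frame.
    """
    verb = lexicon[verb_lemma]
    noun = lexicon[noun_lemma]
    if noun.cat == "ncpred" and noun.lightverb == verb.lemma:
        return ["ncpred"] + list(noun.frame)
    return list(verb.frame)


print(anchor_frame("faire", "attention"))  # frame inherited from the noun
print(anchor_frame("faire", "gâteau"))     # no transfer for a regular noun
```

Because the verb remains the anchor, everything that a regular verbal tree already supports stays available to the light-verb construction.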

Graph

Because we are using regular verbal trees for handling predicative nouns, we are more robust to all kinds of standard variations such as the use of negation, insertion of adverbs, extractions, ...

Graph

Graph

Graph

At metagrammar level

Some locutions (or fragments of locutions) that remain relatively productive (semi-frozen) are better described at the level of the meta-grammar.

For instance, one may assume that "c'est X que" or "est-ce que" is a kind of locution in French occurring in a large set of situations but with many variations (ordering, tense, negation, ...). There exists a specific class for handling "c'est que" that inherits from the verb construction with restrictions.

Graph

Graph

We have a similar case with "il y a" (as a synonym for "ago" in English).

Graph

A much more specific case is provided for the construction "X oblige", essentially found for "noblesse oblige", but that seems to be productive:

Graph

We also found "Fin de l'appartheid oblige, ..." in the FTB.

The issue with this approach is the addition of very specific constructions to the meta-grammar, leading to an increase in the number of trees in the grammar. Inheritance in the meta-grammar eases the addition of these locutions (starting from a standard verb base and adding restrictions), but it remains a tedious task!

Quoted constructions are also of interest for some specific named entities, but there is no clear solution when there are no quotes!

Graph

At disambiguation level

As already mentioned, the disambiguation process tends to favor MWEs when possible.
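As a rough picture of what "favoring MWEs when possible" means, here is a minimal scoring sketch. It does not reproduce FRMG's disambiguation rules: the edge encoding, the `MWE_BONUS` weight, and the `score` and `best` functions are assumptions made for the example.

```python
# Two candidate readings of the same span, as lists of
# (start, end, form) edges (hypothetical encoding).
MWE_READING = [(0, 2, "en fait")]
COMPOSITIONAL_READING = [(0, 1, "en"), (1, 2, "fait")]

MWE_BONUS = 1.0  # hypothetical weight, not FRMG's actual value


def score(reading, base_scores=None):
    """Sum per-form scores, plus a bonus for each edge spanning several words."""
    base_scores = base_scores or {}
    total = 0.0
    for start, end, form in reading:
        total += base_scores.get(form, 0.0)
        if end - start > 1:
            total += MWE_BONUS
    return total


def best(readings, base_scores=None):
    """Pick the highest-scoring reading."""
    return max(readings, key=lambda reading: score(reading, base_scores))


# With no other evidence, the MWE reading wins.
print(best([COMPOSITIONAL_READING, MWE_READING]))

# Stronger evidence for the separate words can still override the bonus.
evidence = {"en": 1.0, "fait": 1.0}
print(best([COMPOSITIONAL_READING, MWE_READING], evidence))
```

The bonus is only a preference: it tilts the choice toward the MWE reading without forbidding the other one.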

As a special case, it is also possible to alter or complete the disambiguation process with a set of terms. Terms are so numerous and so domain-specific that we don't really wish to add them as lexical entries in Lefff. Most of them also have a standard underlying syntactic structure (such as "N de N"), which means they can be parsed even if we don't know they are terms. However, it may be interesting for the post-parsing phase (e.g. information extraction) to recognize them. It may also be important to recognize them as terms to correctly handle some attachments. When using a list of terms, some disambiguation rules will favor the reading as a term and will also favor the attachment of the dependents to the head of the term.

Graph

Using a term list in the lexer will forward information to the disambiguation process:

  > echo "ce tremblement de terre a fait beaucoup de dégats" | ./frmg_lexer -terms CPL9
  ...
  term(1,4,'term63_tremblement_de_terre').
  ....
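To make the `term(1,4,...)` fact above easier to read, here is a minimal Python sketch of term spotting over a token list. It is not the code of `frmg_lexer`: the `find_terms` function and the `TERMS` dictionary are hypothetical, and the position convention (boundaries between tokens, so that the token at index i spans i to i+1) is inferred from the output shown above.

```python
def find_terms(tokens, terms):
    """Return (start, end, term_id) for every occurrence of a listed term.

    Positions are boundaries between tokens: token i spans (i, i + 1).
    """
    hits = []
    for term_id, words in terms.items():
        size = len(words)
        for i in range(len(tokens) - size + 1):
            if tokens[i:i + size] == words:
                hits.append((i, i + size, term_id))
    return sorted(hits)


# Hypothetical one-entry term list; the identifier is copied from the
# lexer output shown above.
TERMS = {"term63_tremblement_de_terre": ["tremblement", "de", "terre"]}

sentence = "ce tremblement de terre a fait beaucoup de dégats"
hits = find_terms(sentence.split(), TERMS)
for start, end, term_id in hits:
    print(f"term({start},{end},'{term_id}').")
```

Under this convention, "tremblement de terre" starts after the first token and ends after the fourth, which gives the span 1 to 4.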

Conversion issues for output schemas

with different notions and lists of MWEs

FRMG provides outputs following several syntactic annotation schemas, such as PASSAGE, FTB/CONLL, or the more recent Universal Dependencies (UD) schema for French. Unfortunately, all these schemas differ in their notion, list, and representation of MWEs. The conversion process should therefore take care of these cases as much as possible.
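As one example of such a conversion, UD represents fixed expressions with the `fixed` relation, attaching each following word to the first word of the expression. The sketch below expands a parse in which an MWE is a single token into that representation. It is not FRMG's converter: the input format, the `expand_mwes` function, and the sample sentence are assumptions, and whether a given expression counts as `fixed` depends on the target schema's own list.

```python
def expand_mwes(tokens):
    """Split multi-word tokens into one row per word, UD style.

    tokens: list of (form, head, deprel), where head is the 1-based
    index of the governor in the list (0 for the root).
    Returns rows (index, form, head, deprel) with renumbered indices.
    """
    # First pass: new index of the first word of each original token.
    new_index = {0: 0}
    position = 1
    for old, (form, _, _) in enumerate(tokens, start=1):
        new_index[old] = position
        position += len(form.split())

    # Second pass: emit the rows, attaching non-initial words with "fixed".
    rows = []
    for old, (form, head, deprel) in enumerate(tokens, start=1):
        words = form.split()
        first = new_index[old]
        rows.append((first, words[0], new_index[head], deprel))
        for offset, word in enumerate(words[1:], start=1):
            rows.append((first + offset, word, first, "fixed"))
    return rows


# Made-up input where "en fait" was parsed as a single adverb token.
parse = [("il", 2, "nsubj"), ("vient", 0, "root"), ("en fait", 2, "advmod")]
rows = expand_mwes(parse)
for row in rows:
    print("\t".join(str(column) for column in row))
```

A schema that keeps MWEs as merged tokens would need the opposite operation, which is why each target schema needs its own handling.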

Some borderline and complex cases

Some well-identified MWEs tend to get a lexical entry in Lefff, but may be the trace of some more productive construction. As a result, we get several distinct parses that actually correspond to the same syntactic phenomenon!

For instance, we have the case of "beaucoup de" or "peu de", which are listed as determiners in Lefff, but may also be seen as the combination of a predeterminer (predet) with the preposition "de".

Graph

Graph

This notion of predet is also productive for other constructions:

Graph

Graph

Another similar case is given by "il y a" that is so common that it has an entry in Lefff as a preposition.

Graph

Graph

The productivity of some constructions may be a problem, in relation to the fact that they denote unusual parts of speech. Among the MWEs that are not yet handled by SxPipe and FRMG, we have expressions like "je ne sais", as in "il est venu je ne sais comment" or "il a pris je ne sais quel livre", with many variations ("on ne sait", "nous ne savons", ...). We also have the expression "n'importe qu" as in "il fait n'importe quoi", "il travaille n'importe comment", "n'importe quel élève te le dira".

Graph