Multiword expressions (MWEs) are a real difficulty in parsing.

The main issues are:

  • the lack of consensus on how to define and capture MWEs
  • the absence of closed lists or operational specifications of MWEs
  • a large diversity of MWE kinds: named entities, terms, locutions, idioms, ...
  • a range of situations going from frozen to semi-productive MWEs

In FRMG, these diverse situations have led to a diversity of solutions, more or less perfect, at all levels: at the metagrammar level, in the pre-parsing phases, during parsing, in the disambiguation phase, and even during conversion to some output schema.

at the level of SxPipe (named entities and some frozen expressions such as complex subordinating conjunctions, csu)

Many simple MWEs (such as complex prepositions, conjunctions, and adverbs) have lexical entries in Lefff and are recognized during the tokenization phase with SxPipe. However, because the recognition may be erroneous, SxPipe generally returns a lattice (or DAG) of the different readings, with and without MWEs, as shown below for "en fait", which can be interpreted as a MWE adverb or as the sequence "en" + "fait".

an ambiguous DAG produced by SxPipe with a MWE reading
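To make the idea of a lattice with and without the MWE reading concrete, here is a minimal Python sketch for the "en fait" fragment. The edge format `(source, target, form, category)` and the category labels (`adv`, `cl`, `v`) are assumptions chosen for illustration; they are not SxPipe's actual output format or Lefff's tag set.

```python
from collections import defaultdict

# Hypothetical lattice for "en fait": one edge spans both tokens (MWE
# reading), two edges cover the tokens separately (compositional reading).
# The category labels are illustrative only.
EDGES = [
    (0, 2, "en fait", "adv"),
    (0, 1, "en", "cl"),
    (1, 2, "fait", "v"),
]


def readings(edges, start, end):
    """Enumerate every path through the lattice from start to end."""
    outgoing = defaultdict(list)
    for src, dst, form, cat in edges:
        outgoing[src].append((dst, form, cat))

    def walk(state, path):
        if state == end:
            yield list(path)
            return
        for dst, form, cat in outgoing[state]:
            path.append((form, cat))
            yield from walk(dst, path)
            path.pop()

    return list(walk(start, []))


for reading in readings(EDGES, 0, 2):
    print(reading)
```

Both paths are kept at this stage; choosing between them is left to later phases.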

As illustrated by the two following sentences, both readings may be correct.

Graph

Graph

For a given sentence, the disambiguation phase will tend to favor MWEs when possible (through specific disambiguation rules).
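The notion of "tending to favor" MWEs can be pictured as a bonus added to a reading's score, which other evidence may still outweigh. The following toy sketch assumes a reading is a list of `(form, category)` pairs and that a form containing a space is a MWE; the `MWE_BONUS` weight and the `base_score` values are hypothetical and do not reflect FRMG's actual disambiguation rules or weights.

```python
MWE_BONUS = 1.0  # hypothetical weight, not an FRMG value


def score(reading, base_score=0.0):
    """Add a fixed bonus for every multiword form in the reading."""
    bonus = sum(MWE_BONUS for form, _cat in reading if " " in form)
    return base_score + bonus


def best_reading(candidates):
    """Pick the highest-scoring reading among (reading, base_score) pairs."""
    return max(candidates, key=lambda c: score(c[0], c[1]))[0]


mwe = [("en fait", "adv")]
compositional = [("en", "cl"), ("fait", "v")]

# With no other evidence, the MWE reading wins.
print(best_reading([(mwe, 0.0), (compositional, 0.0)]))
# Strong evidence from other (hypothetical) rules can still override it.
print(best_reading([(mwe, 0.0), (compositional, 2.5)]))
```

Modeling the preference as an additive bonus rather than a hard constraint keeps the compositional reading reachable when the context supports it.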

at the level of the parser (+ metagrammar)

The

Graph

Graph

at the level of the metagrammar

Graph

Graph

Graph

Graph

We also found "Fin de l'apartheid oblige, ..." in the FTB: where is the limit between a locution like "noblesse oblige" and a productive construction "N oblige"?

Quoted constructions are also interesting for some specific named entities, but there is no clear solution when there are no quotes.

Graph

at disambiguation level

As already mentioned, the disambiguation process tends to favor MWEs when possible.

As a special case, it is also possible to alter or complete the disambiguation process with a set of terms. Terms are so numerous and so domain-specific that we don't really wish to add them as lexical entries in Lefff. Most of them also have a standard underlying syntactic structure (such as "N de N"), which means they can be parsed even if we don't know they are terms. However, it may be interesting for the post-parsing phases (e.g. information extraction) to recognize them. It may also be important to recognize them as terms in order to handle some attachments correctly. When a list of terms is used, some disambiguation rules will favor the reading as a term and will also favor the attachment of the dependents to the head of the term.

Graph
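As a rough illustration of how a term list can be matched against a sentence without adding the terms to the lexicon, the sketch below scans a whitespace-tokenized sentence for known terms and reports their spans. The function name `find_terms` and the one-entry term list are hypothetical. Spans use boundary positions (token i lies between positions i and i + 1), a convention assumed here because it gives the span 1 to 4 for "tremblement de terre" in the example sentence.

```python
def find_terms(tokens, terms):
    """Return (start, end, term) spans for every term found in tokens.

    Positions are boundaries between tokens: token i lies between
    positions i and i + 1.
    """
    spans = []
    for start in range(len(tokens)):
        for term in terms:
            words = term.split()
            end = start + len(words)
            if tokens[start:end] == words:
                spans.append((start, end, term))
    return spans


tokens = "ce tremblement de terre a fait beaucoup de dégats".split()
terms = ["tremblement de terre"]  # hypothetical one-entry term list

print(find_terms(tokens, terms))
```

The spans only mark where a term occurs; the sentence is still parsed with its ordinary "N de N" structure, and the span can then be used to favor the term reading.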

Using a term list in the lexer forwards information to the disambiguation process:

<code>
> echo "ce tremblement de terre a fait beaucoup de dégats" | ./frmg_lexer -terms CPL9
...
term(1,4,'term63_tremblement_de_terre').
....
</code>
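For a post-parsing tool that wants to reuse this information, a line such as `term(1,4,'term63_tremblement_de_terre').` can be read with a small regular expression. This is a minimal sketch that handles only the shape of the single fact shown above; the rest of the lexer output is elided in the example, so nothing is assumed about it, and names containing escaped quotes are not handled. The function name `parse_term_fact` is hypothetical.

```python
import re

# Matches facts of the form: term(<start>,<end>,'<name>').
TERM_FACT = re.compile(r"^term\((\d+),(\d+),'([^']*)'\)\.$")


def parse_term_fact(line):
    """Return (start, end, name) for a term fact, or None for other lines."""
    match = TERM_FACT.match(line.strip())
    if match is None:
        return None
    start, end, name = match.groups()
    return int(start), int(end), name


print(parse_term_fact("term(1,4,'term63_tremblement_de_terre')."))
```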

the conversion issues for output schemas

with different notions and lists of MWEs

FRMG provides outputs following several syntactic annotation schemas, such as PASSAGE, FTB/CONLL, or the more recent Universal Dependencies (UD) schema for French. Unfortunately, all these schemas differ in their notion, list, and representation of MWEs. The conversion process should therefore take care of these cases as much as possible.

Some borderline and complex cases

Some well-identified MWEs tend to get a lexical entry in Lefff, but may be the trace of some more productive construction. As a result, we get several distinct parses that actually correspond to the same syntactic phenomenon.

For instance, "beaucoup de" and "peu de" are listed as determiners in Lefff, but may also be seen as the combination of a predeterminer (predet) with the preposition "de".

Graph

Graph

This notion of predeterminer is also productive for other constructions.

Graph

Graph

A similar case is "il y a", which is so common that it has an entry in Lefff as a preposition.

Graph

Graph

The productivity of some constructions, combined with the fact that they denote unusual parts of speech, may be a problem. Among the MWEs that are not yet handled by SxPipe and FRMG, we have expressions like "je ne sais", as in "il est venu je ne sais comment" or "il a pris je ne sais quel livre", with many variations: "on ne sait", "nous ne savons", ... We also have the expression "n'importe qu" as in "il fait n'importe quoi", "il travaille n'importe comment", "n'importe quel élève te le dira".

Graph
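Since these expressions are not yet handled, the following is only a surface-level sketch of how candidate occurrences could be spotted in raw text, for instance to collect examples from a corpus. It is not a proposal for how SxPipe or FRMG should analyze them. The pattern covers only the variants quoted above ("je ne sais", "on ne sait", "nous ne savons", "n'importe") followed by "quoi", "comment", or a form of "quel"; any broader coverage would be an additional assumption.

```python
import re

# Introducers quoted in the text; the list is deliberately not extended.
INTRODUCER = r"(?:n['’]importe|je ne sais|on ne sait|nous ne savons)"
# Following words attested in the examples ("quel" with its inflections).
WH_WORD = r"(?:quoi|comment|quel(?:le)?s?)"

CANDIDATE = re.compile(rf"\b{INTRODUCER}\s+{WH_WORD}\b", re.IGNORECASE)


def find_candidates(sentence):
    """Return the substrings that look like one of these expressions."""
    return CANDIDATE.findall(sentence)


for sentence in [
    "il est venu je ne sais comment",
    "il a pris je ne sais quel livre",
    "il fait n'importe quoi",
    "n'importe quel élève te le dira",
]:
    print(find_candidates(sentence))
```

A pattern like this cannot tell the expression apart from a regular clause such as "je ne sais comment il est venu", where "je ne sais" is an ordinary main verb, which is part of why these cases are difficult.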