ALPaGE

Statistical dependency parsing of French

Université Paris Diderot - INRIA

Content: This page gathers various resources for the statistical dependency parsing of French, in particular preprocessing code and learnt models/grammars for MaltParser, MSTParser and the Berkeley Parser, as well as a constituent-to-dependencies conversion tool for French.

Many thanks to Joakim Nivre, Ryan McDonald and Slav Petrov for making their parsers available.

Contributors: Marie Candito (contact), Benoît Crabbé, Pascal Denis, Mathieu Falco, François Guérin, Enrique Henestroza Anguiano, Joakim Nivre, Djamé Seddah

Funding: Part of this work was carried out within the ANR SEQUOIA project (large coverage probabilistic syntactic parsing of French).

Dependency Corpus

(converted from French Treebank constituency trees)

The French Treebank consists of approximately 20,000 sentences from the newspaper Le Monde, annotated for morphology and phrase structure (Abeillé A. and Barrier N., Enriching a French Treebank, LREC'2004). Part of the treebank also contains grammatical functions for the constituents that depend on verbs, and this part has grown over time.

We designed an automatic procedure for converting the constituency trees with functional information into surface dependency trees. It yielded a first version of 12,531 surface dependency trees (cf. TALN'2009, LREC'2010). The resulting annotation scheme is described in: French surface dependencies annotation scheme (in French).
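To give a rough idea of what a constituency-to-dependency conversion involves, here is a minimal sketch of the generic head-percolation technique. It is not the conversion tool distributed on this page: the `HEAD_RULES` table, the nested-tuple tree encoding and the function name `to_dependencies` are all assumptions made for illustration, and dependency labels (which the actual procedure derives from functional information) are left out entirely.

```python
# Toy head-percolation conversion (illustrative assumption, not the real tool).
# A tree is a nested tuple: (label, child, child, ...); a leaf is (pos_tag, word).

# Hypothetical head table: for each phrase label, child categories to try in order.
HEAD_RULES = {
    "SENT": ["VN", "NP"],
    "VN": ["V"],
    "NP": ["NC", "NPP"],
    "PP": ["P"],
}


def to_dependencies(tree):
    """Return (words, heads); heads[i] is the 1-based head of word i+1, 0 for the root."""
    words = []
    heads = {}

    def visit(node):
        label, *children = node
        # Leaf: (pos_tag, word). Record the word and return its 1-based position.
        if len(children) == 1 and isinstance(children[0], str):
            words.append(children[0])
            return len(words)
        # Internal node: compute the lexical head of every child first.
        child_heads = [(child[0], visit(child)) for child in children]
        head = None
        for category in HEAD_RULES.get(label, []):
            for child_label, index in child_heads:
                if child_label == category:
                    head = index
                    break
            if head is not None:
                break
        if head is None:
            head = child_heads[0][1]  # fallback: leftmost child
        # Every non-head child attaches its lexical head to the phrase head.
        for _, index in child_heads:
            if index != head:
                heads[index] = head
        return head

    root = visit(tree)
    heads[root] = 0
    return words, [heads[i] for i in range(1, len(words) + 1)]


tree = ("SENT", ("NP", ("DET", "Le"), ("NC", "chat")), ("VN", ("V", "dort")))
print(to_dependencies(tree))  # (['Le', 'chat', 'dort'], [2, 3, 0])
```

In this toy example the determiner attaches to the noun, the noun to the verb, and the verb is the root. A real conversion needs a much richer head table plus rules that turn functional annotations into dependency labels.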

On the occasion of the SPMRL 2013 shared task on statistical parsing of morphologically rich languages (Seddah et al., 2013), we revised the conversion procedure and applied it to the larger set of constituency trees with functional information available at that time, namely 18,535 sentences (see the README of the French part of the SPMRL 2013 dataset for more information).

The converted dependency treebank is distributed freely to anyone who holds a licence for the original French Treebank (see here).

Once you have the licence, you may contact marie . candito @ linguist . univ-paris-diderot .fr for the dependency treebank.

Parsing

Parsing resources for MaltParser (bonsai 3.2)

Parsing resources for MSTParser (bonsai 3.2)

Parsing resources for Berkeley Parser (bonsai 3.2)

Summary of parsing performance (iMac, 2.66 GHz)
(see the benchmark paper for details)

Parser   LAS    UAS    Parsing time (one-sentence raw file)   Parsing time (1235-sentence raw file)
Malt     87.3   89.7   42s                                    1m 25s
MST      88.2   90.9   1m 50s                                 14m 39s
BKY      86.8   91.0   6s                                     12m 46s
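
LAS and UAS are the labeled and unlabeled attachment scores: the percentage of tokens that receive the correct head and label (LAS), or the correct head regardless of label (UAS). The sketch below shows how such scores can be computed. The function name `attachment_scores` and the input encoding are assumptions made for illustration, and the sketch counts every token, which may differ from the evaluation settings used in the benchmark paper (for instance regarding punctuation).

```python
def attachment_scores(gold, pred):
    """Compute (LAS, UAS) as percentages.

    gold, pred: lists of (head_index, label) pairs, one pair per token,
    in the same token order. This encoding is an illustrative assumption.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses differ in length")
    if not gold:
        raise ValueError("cannot score an empty token list")
    total = len(gold)
    # UAS: the predicted head matches the gold head.
    uas_hits = sum(1 for (g_head, _), (p_head, _) in zip(gold, pred) if g_head == p_head)
    # LAS: both the head and the dependency label match.
    las_hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100.0 * las_hits / total, 100.0 * uas_hits / total


# "Le chat dort": the prediction has all heads right but one wrong label.
gold = [(2, "det"), (3, "suj"), (0, "root")]
pred = [(2, "det"), (3, "obj"), (0, "root")]
las, uas = attachment_scores(gold, pred)
print(f"LAS={las:.1f} UAS={uas:.1f}")  # LAS=66.7 UAS=100.0
```

Since a token with a wrong head can never count as a labeled hit, LAS is always at most UAS, as in each row of the table.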

Publications

Candito M.-H., Nivre J., Denis P. and Henestroza Anguiano E., 2010, Benchmarking of Statistical Dependency Parsers for French, Proceedings of COLING'2010 (poster session), Beijing, China

Candito M.-H., Crabbé B. and Denis P., 2010, Statistical French dependency parsing: treebank conversion and first results, Proceedings of LREC'2010, Valletta, Malta

Seddah D., Candito M.-H. and Crabbé B., 2009, Cross-parser evaluation and tagset variation: a French treebank study, Proceedings of IWPT 2009, Paris, France

Candito M.-H. and Crabbé B., 2009, Improving generative statistical parsing with semi-supervised word clustering, Proceedings of IWPT 2009 (short paper), Paris, France

Candito M.-H., Crabbé B., Denis P. and Guérin F., 2009, Analyse syntaxique du français : des constituants aux dépendances, Proceedings of TALN 2009, Senlis, France

Crabbé B. and Candito M.-H., 2008, Expériences d'analyse syntaxique du français, Proceedings of TALN 2008, Avignon, France