Funding: Part of this work is carried out within the ANR SEQUOIA project (large-coverage probabilistic syntactic parsing of French).


Parsing French using the Berkeley parser

Université Paris Diderot - INRIA

Parsing with Berkeley Parser

The following code uses the Berkeley Parser v1.0, slightly adapted to French: unknown words are handled through their suffixes, using Abhishek Arun's heuristics. Many thanks to Slav Petrov for making his code available.


  • Download the BONSAI v3.2 archive, which contains the preprocessing code, the Berkeley jar 1.0 adapted to French, a Berkeley grammar learnt on the French Treebank, and code for functional role labelling and constituency-to-dependency conversion. It requires:
    • Perl and Python > 2.5
    • python-cjson, to install with: python install
    • ply (> 3.3), needed only to get labeled dependencies

  • Set the BONSAI variable to your local path to BONSAI v3.2
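
Before launching the parser from a script, it can help to check that the BONSAI variable is set and points to an existing directory. The following is a minimal sketch only: the helper name `resolve_bonsai_root` is hypothetical and is not part of the BONSAI distribution.

```python
import os
from pathlib import Path


def resolve_bonsai_root(env=None):
    """Return the directory named by the BONSAI variable, or raise if unusable.

    Hypothetical helper for illustration; not shipped with BONSAI.
    """
    env = os.environ if env is None else env
    value = env.get("BONSAI", "").strip()
    if not value:
        raise RuntimeError("BONSAI is not set; point it to your local BONSAI v3.2 directory")
    root = Path(value).expanduser()
    if not root.is_dir():
        raise RuntimeError(f"BONSAI points to {root}, which is not a directory")
    return root
```

Calling `resolve_bonsai_root()` with no argument reads the current process environment; passing a dictionary makes the check easy to try out without changing the environment.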

Parsing command

The following command preprocesses and parses a raw UTF-8 text file INFILE and prints the output to STDOUT:

$BONSAI/bin/ -f [const|udep|ldep] [-n] [-h] INFILE

Use the -h option for online help
Use the -n option if your text is already tokenized
Use the -f option to choose the output format:
-f const outputs French Treebank-like constituents
-f udep outputs unlabeled dependencies (CoNLL format)
-f ldep outputs labeled dependencies (CoNLL format)
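
For downstream processing of the udep and ldep outputs, a small reader is often enough. The sketch below assumes the output follows the 10-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with a blank line between sentences; this should be checked against actual parser output. The function name `read_conll` and the sample lines are hand-written for illustration and were not produced by the parser.

```python
def read_conll(lines):
    """Yield sentences as lists of (id, form, head, deprel) tuples.

    Assumes CoNLL-X columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL ...
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError(f"expected at least 8 tab-separated columns, got {len(cols)}")
        sentence.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if sentence:
        # Input may end without a trailing blank line.
        yield sentence


# Hand-written sample (illustrative tags and labels, not real parser output).
SAMPLE = (
    "1\tLe\t_\tD\tDET\t_\t2\tdet\t_\t_\n"
    "2\tchat\t_\tN\tNC\t_\t3\tsuj\t_\t_\n"
    "3\tdort\t_\tV\tV\t_\t0\troot\t_\t_\n"
    "\n"
)
sentences = list(read_conll(SAMPLE.splitlines()))
for token_id, form, head, deprel in sentences[0]:
    print(token_id, form, head, deprel)
```

A head value of 0 marks the root of the sentence in this layout. With -f udep, the label column would presumably hold a placeholder rather than a function label.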

Parsing uses the best Berkeley configuration described in the benchmark (COLING 2010 poster): it segments and tokenizes the text, replaces tokens with clusters, parses, and reinserts the original tokens.
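
The cluster replacement and token reinsertion steps can be pictured with a toy example. This is only a sketch under assumptions: the cluster identifiers, the `UNKNOWN` fallback and both function names are invented for illustration, and the parser itself is replaced by a hard-coded list of tags. The actual BONSAI preprocessing may differ.

```python
def replace_by_clusters(tokens, cluster_of, unknown="UNKNOWN"):
    """Map each token to its cluster id (hypothetical fallback for unseen words)."""
    return [cluster_of.get(tok, unknown) for tok in tokens]


def reinsert_original_tokens(tagged_clusters, tokens):
    """Put the original tokens back in place of the cluster ids, position by position."""
    if len(tagged_clusters) != len(tokens):
        raise ValueError("parser output and input tokens differ in length")
    return [(tok, tag) for (_, tag), tok in zip(tagged_clusters, tokens)]


tokens = ["le", "chat", "dort"]
# Made-up cluster ids; real ones come from a clustering learnt on a large corpus.
cluster_of = {"le": "0110", "chat": "1011", "dort": "1100"}

cluster_sequence = replace_by_clusters(tokens, cluster_of)
# Stand-in for the parser: it only ever sees cluster ids, never the words.
tagged_clusters = list(zip(cluster_sequence, ["D", "N", "V"]))
restored = reinsert_original_tokens(tagged_clusters, tokens)
print(restored)
```

The point of the design is that the grammar is learnt and applied over a smaller vocabulary of clusters, while the user still sees the original words in the output.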

Note: newlines in the input text are always interpreted as sentence boundaries.
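
Because of this, hard-wrapped text (lines broken at a fixed width) would be cut into spurious sentences. One possible workaround is to join the lines of each paragraph before parsing. The sketch below assumes paragraphs are separated by blank lines; the function name `unwrap_paragraphs` is hypothetical, and words hyphenated across line breaks are not repaired.

```python
def unwrap_paragraphs(text):
    """Join hard-wrapped lines so that each paragraph ends up on a single line."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # A blank line ends the current paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs) + ("\n" if paragraphs else "")


wrapped = "Le chat\ndort.\n\nIl pleut.\n"
print(unwrap_paragraphs(wrapped), end="")
```

Each paragraph then reaches the parser as one line, and sentence segmentation inside that line is left to the preprocessing step.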

Coming soon: improved functional role labeler


Candito M.-H., Nivre J., Denis P. and Henestroza Anguiano E., 2010,
Benchmarking of Statistical Dependency Parsers for French, Proceedings of COLING 2010 (poster session), Beijing, China
Candito M.-H., Crabbé B. and Denis P., 2010,
Statistical French dependency parsing: treebank conversion and first results, Proceedings of LREC 2010, Valletta, Malta
Seddah D., Candito M.-H. and Crabbé B., 2009,
Cross-parser evaluation and tagset variation: a French treebank study, Proceedings of IWPT 2009, Paris, France
Candito M.-H. and Crabbé B., 2009,
Improving generative statistical parsing with semi-supervised word clustering, Proceedings of IWPT 2009 (short paper), Paris, France
Candito M.-H., Crabbé B., Denis P. and Guérin F., 2009,
Analyse syntaxique du français : des constituants aux dépendances, Proceedings of TALN 2009, Senlis, France
Crabbé B. and Candito M.-H., 2008,
Expériences d'analyse syntaxique du français, Proceedings of TALN 2008, Avignon, France