Funding: Part of this work is carried out within the ANR SEQUOIA project (large-coverage probabilistic syntactic parsing of French).

Parsing French using the Berkeley parser


Université Paris Diderot - INRIA

Parsing with Berkeley Parser

The following code makes use of the Berkeley Parser v1.0, slightly adapted to French (for unknown-word suffixes, using Abhishek Arun's heuristics). Many thanks to Slav Petrov for making his code available.

Prerequisites

Download the BONSAI v3.2 archive to get the preprocessing code, the Berkeley jar 1.0 adapted to French, the Berkeley grammar trained on the French Treebank, and the code for functional role labelling and constituency-to-dependency conversion.

Note that it requires:

    perl and python >2.5
    python-cjson, to install with: python setup.py install
    ply (>3.3) (needed only to get labeled dependencies)

Set the BONSAI variable to your local path to BONSAI v3.2.

Parsing command

The following command preprocesses and parses a raw UTF-8 text file INFILE and prints the output to STDOUT:

    $BONSAI/bin/bonsai_bky_parse_via_clust.sh -f [const|udep|ldep] [-n] [-h] INFILE

Use the -h option for online help.
Use the -n option if your text is already tokenized.
Use the -f option to choose the output format:

    -f const outputs FrenchTreebank-like constituents
    -f udep outputs unlabeled dependencies (conll format)
    -f ldep outputs labeled dependencies (conll format)

The parsing corresponds to the best Berkeley configuration described in the benchmark (COLING 2010 poster): it segments and tokenizes the text, replaces tokens by clusters, parses, and reinserts the original tokens.

Note: newlines in the input text are systematically interpreted as sentence boundaries.

Coming soon: improved functional role labeler
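For batch use, the parsing command described above could be driven from a script. The following is only a minimal sketch, not part of BONSAI: the helper names build_command and parse_file are hypothetical, and it assumes a recent Python 3 interpreter for the wrapper itself and that the BONSAI environment variable is set as described in the prerequisites. It uses only the options documented above (-f and -n).

```python
import os
import subprocess

# Output formats documented for the -f option of the wrapper script.
FORMATS = ("const", "udep", "ldep")


def build_command(bonsai_dir, infile, fmt="const", tokenized=False):
    """Return the argument list for bonsai_bky_parse_via_clust.sh.

    Hypothetical helper: it only assembles the documented command line.
    """
    if fmt not in FORMATS:
        raise ValueError("format must be one of: " + ", ".join(FORMATS))
    script = os.path.join(bonsai_dir, "bin", "bonsai_bky_parse_via_clust.sh")
    cmd = [script, "-f", fmt]
    if tokenized:
        cmd.append("-n")  # tell the script the input is already tokenized
    cmd.append(infile)
    return cmd


def parse_file(infile, fmt="const", tokenized=False):
    """Run the parser on a raw UTF-8 text file and return its STDOUT as text.

    Assumes the BONSAI environment variable points to the BONSAI v3.2 directory.
    """
    bonsai_dir = os.environ["BONSAI"]
    cmd = build_command(bonsai_dir, infile, fmt, tokenized)
    return subprocess.check_output(cmd).decode("utf-8")
```

With BONSAI installed, a call such as parse_file("input.txt", fmt="udep") would return the unlabeled dependencies as a string. Since newlines are treated as sentence boundaries, the input file should not contain line breaks inside a sentence.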
Publications

Candito M.-H., Nivre J., Denis P. and Henestroza Anguiano E., 2010, Benchmarking of Statistical Dependency Parsers for French, in Proceedings of COLING'2010 (poster session), Beijing, China
Candito M.-H., Crabbé B. and Denis P., 2010, Statistical French dependency parsing: treebank conversion and first results, in Proceedings of LREC'2010, La Valletta, Malta
Seddah D., Candito M.-H. and Crabbé B., 2009, Cross-parser evaluation and tagset variation: a French treebank study, in Proceedings of IWPT 2009, Paris, France
Candito M.-H. and Crabbé B., 2009, Improving generative statistical parsing with semi-supervised word clustering, in Proceedings of IWPT 2009 (short paper), Paris, France
Candito M.-H., Crabbé B., Denis P. and Guérin F., 2009, Analyse syntaxique du français : des constituants aux dépendances, in Proceedings of TALN 2009, Senlis, France
Crabbé B. and Candito M.-H., 2008, Expériences d'analyse syntaxique du français, in Proceedings of TALN 2008, Avignon, France