IWPT 2009 - Improving generative statistical parsing with semi-supervised word clustering

Improving generative statistical parsing with semi-supervised word clustering

Marie Candito and Benoît Crabbé

11th International Conference on Parsing Technology (IWPT 2009)
Paris, France, 7th-9th October, 2009

Summary

We present a semi-supervised method to improve statistical parsing performance. We focus on the well-known problem of lexical data sparseness and present experiments in which words are clustered prior to parsing. We combine lexicon-aided morphological clustering, which preserves tagging ambiguity, with unsupervised word clustering trained on a large unannotated corpus. We apply these clusterings to the French Treebank and train a parser with the unlexicalized PCFG-LA algorithm of (Petrov et al., 2006). We find a gain in French parsing performance: results improve from a baseline of F1=86.76% to F1=87.37% using morphological clustering, and to F1=88.29% when unsupervised clustering is added. This is the best known score for French probabilistic parsing. These preliminary results are very encouraging for the statistical parsing of morphologically rich languages and of languages with a small amount of annotated data.
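As a rough illustration of clustering words prior to parsing, the sketch below replaces each token first by a morphologically reduced form and then by a cluster identifier, so that a parser would be trained on cluster symbols instead of raw word forms. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `replace_with_clusters`, the two dictionaries, and the cluster labels are all hypothetical toy values, and the real morphological step relies on a lexicon and preserves tagging ambiguity, which this sketch does not model.

```python
from typing import Dict, List


def replace_with_clusters(
    tokens: List[str],
    morph_map: Dict[str, str],
    cluster_map: Dict[str, str],
) -> List[str]:
    """Map each token to a reduced form, then to a cluster label.

    Tokens missing from a map are passed through unchanged
    (an assumption of this sketch, not a claim about the paper).
    """
    output = []
    for token in tokens:
        # Step 1: morphological clustering (toy lookup table).
        reduced = morph_map.get(token, token)
        # Step 2: unsupervised cluster label (toy lookup table).
        output.append(cluster_map.get(reduced, reduced))
    return output


# Hypothetical toy data, invented for illustration only.
MORPH_MAP = {"mangent": "mange", "chats": "chat"}
CLUSTER_MAP = {"mange": "C0110", "chat": "C1011", "les": "C0001"}

sentence = ["les", "chats", "mangent", "vite"]
clustered = replace_with_clusters(sentence, MORPH_MAP, CLUSTER_MAP)
print(clustered)
```

In this toy run, "chats" and "mangent" are first reduced and then mapped to cluster labels, while "vite", which appears in neither table, is left as is. Sharing one symbol across many word forms is what reduces lexical sparseness in the training data.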
