Benoît Sagot's homepage  —  team Alpage (INRIA / University Paris 7)

The WOLF (Wordnet Libre du Français, Free French Wordnet) is a free semantic lexical resource (wordnet) for French.

The WOLF was built from the Princeton WordNet (PWN) and various multilingual resources (Sagot and Fišer 2008a, Sagot and Fišer 2008b, Fišer and Sagot 2008). Polysemous literals were handled with an approach based on word-aligning a parallel corpus in five languages. The extracted multilingual lexicon was then semantically disambiguated using wordnets for the languages involved. For monosemous words, a bilingual approach was sufficient to build new entries: we extracted bilingual lexicons from Wikipedia and thesauri. The resulting wordnet was evaluated against the French wordnet developed during the EuroWordNet project.

In 2009, specific work was carried out on adverbial synsets (Sagot, Fort and Venant 2009a, Sagot, Fort and Venant 2009b).

Since then, several efforts have extended WOLF's coverage and reduced its noise. First, a disambiguation technique for translation pairs extracted from freely available resources led to version 0.2 (Sagot and Fišer 2011, 2012a). An approach targeting nominalizations extracted from parsed corpora (version 0.2.1, Gábor et al. 2012) and another one based on word clusters extracted from aligned corpora (version 0.2.2, Apidianaki and Sagot 2012) were used to further extend the resource. Version 0.2.5 results from merging WOLF 0.2.2 with another wordnet extracted automatically using a new graph-based approach that relies on translation pairs extracted from wiktionaries (Hanoka and Sagot 2012).

An error identification approach was also developed (Sagot and Fišer 2012b), followed by a manual validation of several thousand candidate errors. In parallel, most verbal Basic Concept Set synsets were validated and extended manually. Finally, we manually filtered a large number of (literal, synset) pairs that were inconsistent with POS information from the Lefff lexicon, which further reduced the noise in the resource. The result of these semi-manual efforts is WOLF version 1.0b4.
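The POS consistency check mentioned above can be sketched as follows. This is a minimal Python illustration, not the procedure actually used: the `lexicon_pos` dictionary stands in for POS information from a lexicon such as the Lefff, and the function name, data layout, synset identifiers and POS labels are all hypothetical. Literals absent from the lexicon are not flagged, which is a choice made for this sketch only.

```python
def inconsistent_pairs(pairs, lexicon_pos):
    """Return the (literal, synset_id, pos) triples whose synset POS is not
    among the POS values the lexicon lists for the literal.

    Literals missing from the lexicon are skipped: with no POS
    information there is nothing to contradict.
    """
    flagged = []
    for literal, synset_id, pos in pairs:
        known = lexicon_pos.get(literal)
        if known is not None and pos not in known:
            flagged.append((literal, synset_id, pos))
    return flagged


# Toy data, invented for the example (not taken from the Lefff or the WOLF).
lexicon_pos = {"manger": {"v"}, "rapide": {"a", "n"}}
pairs = [
    ("manger", "syn-1", "v"),   # consistent
    ("manger", "syn-2", "n"),   # flagged: the toy lexicon lists "manger" as a verb only
    ("rapide", "syn-3", "a"),   # consistent
    ("inconnu", "syn-4", "b"),  # not in the toy lexicon, so skipped
]

print(inconsistent_pairs(pairs, lexicon_pos))
```

In a setting like the one described, the flagged pairs would be candidates for manual review rather than automatic deletion.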

The WOLF contains all PWN synsets, including those for which no French literal is known.

The WOLF is distributed in the XML format used by the DebVisDic tool, which is an updated version of the XML format used in the BalkaNet project. For now, SENSE elements do not contain sense numbers; instead, they record the sources and approaches through which the lexeme was found. Among these, a tag starting with "ManVal" indicates a manually validated (literal, synset) pair, and a tag starting with "ManAdd" indicates a pair that was added manually.
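As an illustration of how these SENSE tags might be read, here is a minimal Python sketch. The sample XML is hypothetical: the element nesting (SYNSET, ID, SYNONYM, LITERAL, SENSE) is assumed from the BalkaNet-style layout, and the synset ID and tag values other than the "ManVal" and "ManAdd" prefixes are invented for the example. Check an actual WOLF file before relying on this structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: nesting and tag values are assumptions for
# illustration, not copied from a WOLF release.
SAMPLE = """
<WN>
  <SYNSET>
    <ID>example-0001-n</ID>
    <POS>n</POS>
    <SYNONYM>
      <LITERAL>chien<SENSE>ManVal;example</SENSE></LITERAL>
      <LITERAL>toutou<SENSE>ManAdd</SENSE></LITERAL>
      <LITERAL>cabot<SENSE>wiktionary</SENSE></LITERAL>
    </SYNONYM>
  </SYNSET>
</WN>
"""


def classify(sense_tag):
    """Map a SENSE tag to a status based on its prefix."""
    if sense_tag.startswith("ManVal"):
        return "validated"
    if sense_tag.startswith("ManAdd"):
        return "added"
    return "automatic"


def literal_statuses(xml_text):
    """Return (synset_id, literal, status) triples for every LITERAL."""
    root = ET.fromstring(xml_text)
    rows = []
    for synset in root.iter("SYNSET"):
        synset_id = synset.findtext("ID", default="")
        for literal in synset.iter("LITERAL"):
            word = (literal.text or "").strip()
            tag = (literal.findtext("SENSE") or "").strip()
            rows.append((synset_id, word, classify(tag)))
    return rows


for row in literal_statuses(SAMPLE):
    print(row)
```

Matching on the tag prefix, rather than on the whole tag, follows the description above, which only specifies how the tag starts.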

The WOLF is a free resource, distributed under the CeCILL-C license (LGPL-compatible).


Latest distributed version (1.0b4)