Revision: 1.3
1 Motivation

Extracting information from an encyclopedic botanical corpus can be done by hand, but it is long and tedious work. It is becoming increasingly attractive, and feasible, to speed up the process by automating it while still keeping a human expert for validation.

The different kinds of information that may be extracted from a botanical corpus include terminology, conceptual information for modeling a specialized domain (for instance, African tropical flora), and descriptions of a set of plants that follow the conceptual model.

The ATOLL group at INRIA Rocquencourt is interested in this area of research for several reasons. The group is primarily concerned with natural language parsing, and we believe that parsing is a key component of knowledge acquisition from textual documents. We are already involved in this area through our participation in an ARC RLT (Resources Linguistiques pour les TAGs) and in the working group A3CTE (Applications, Apprentissage, Acquisition de Connaissances à partir de Textes Electroniques). Through the European project TermIT, we have also gained some expertise in conceptual structures such as thesauri, ontologies, and semantic networks, and in knowledge representation languages (conceptual graphs, description logics). Finally, the group has worked on the structuring and representation of documents and linguistic resources (using, for instance, XML and, earlier, the Mentor and Centaur systems), issues that are relevant when processing encyclopedic corpora with a strong underlying structure.

The flora corpora that we have examined seem promising for these acquisition tasks. Indeed, they are strongly structured, following relatively precise patterns of presentation. The style is relatively formal in some parts (descriptive parts) and freer in others (explanatory parts), which makes it possible to test how acquisition depends on style. Moreover, the corpora present a rich but very uncommon vocabulary to describe, for instance, colors, textures, and shapes. This vocabulary will not be found in available electronic dictionaries and will require an extraction phase. Botanical conceptual models are not trivial but seem relatively well delimited and organized, having been developed over a long period of time. This means that an expert can supply a large amount of knowledge at the beginning of the acquisition process, and that validation of the extracted knowledge should be easier.

