Extracting information from an encyclopedic botanical corpus can be
done by hand, but it is long and tedious work. It is becoming
increasingly attractive, and feasible, to speed up the process by
automating it while still keeping a human expert for validation.
Among the different kinds of information that may be extracted from a
botanical corpus, we can cite terminology, conceptual information for
modeling a specialized domain (for instance, African tropical flora),
and descriptions of a set of plants that follow the conceptual model.
The ATOLL group at INRIA Rocquencourt is interested in this area of
research for several reasons. The group is primarily concerned with
natural language parsing, and we believe that parsing is a key
component of knowledge acquisition from textual documents.
In fact, we are already involved through our participation in the ARC RLT
(Resources Linguistiques pour les TAGs) and in the working group A3CTE
(Applications, Apprentissage, Acquisition de Connaissances à partir de Textes
Electroniques). Through the European project TermIT, we have also gained
some expertise in conceptual structures such as thesauri, ontologies,
and semantic networks, and in knowledge representation languages
(conceptual graphs, description logics).
Finally, the group has worked on structuring and representing
documents and linguistic resources (for instance using XML and,
earlier, the Mentor and Centaur systems), issues that are relevant
when processing encyclopedic corpora with a strong underlying
structure.
The flora corpora that we have examined seem promising for
these acquisition tasks. Indeed, they are strongly structured,
following relatively precise patterns of presentation. The style is
relatively formal in some parts (descriptive parts) and freer in
others (explanatory parts), which makes it possible to test the dependence of
acquisition on style. Moreover, the corpora present a rich but
very uncommon vocabulary to describe, for instance, colors, textures,
and shapes. This vocabulary will not be found in available electronic
dictionaries and will require an extraction phase. Botanical
conceptual models are not trivial but seem relatively well delimited
and organized, having been developed over a long period of time. This
means that a large amount of knowledge can be supplied by an expert at
the beginning of the acquisition process and that validating the
extracted knowledge should be easier.