The first step is the transformation of the original unstructured
textual documents into structured ones, better adapted for further
processing. The aim of the transformation is to delimit (with markup
elements) the different entries and the different components of each
entry (description part, explanation part, bibliographic references,
...). The structured documents may be encoded in XML, following a
DTD (to be defined) characterizing this class of document.
The task may be achieved using scripts (Perl or Python) based on
regular expressions.
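As an illustration, the segmentation step might be sketched as follows. The raw entry layout assumed here (a numbered headword line followed by free text) is hypothetical, as are the element names, which would ultimately come from the DTD to be defined:

```python
import re
from xml.sax.saxutils import escape

# Hypothetical raw format: each entry starts with a numbered headword
# line, e.g. "1. Rosa canina", followed by free text until the next entry.
ENTRY_START = re.compile(r"^\d+\.\s+(.+)$", re.M)

def to_xml(raw: str) -> str:
    """Wrap each detected entry in <entry> markup (sketch only)."""
    parts = ENTRY_START.split(raw)
    # parts = [preamble, headword1, body1, headword2, body2, ...]
    out = ["<corpus>"]
    for head, body in zip(parts[1::2], parts[2::2]):
        out.append("  <entry>")
        out.append(f"    <headword>{escape(head.strip())}</headword>")
        out.append(f"    <description>{escape(body.strip())}</description>")
        out.append("  </entry>")
    out.append("</corpus>")
    return "\n".join(out)

raw = ("1. Rosa canina\nA wild rose with pink flowers.\n"
       "2. Viola odorata\nA violet with fragrant flowers.")
print(to_xml(raw))
```

A real script would of course handle the actual entry components (description, explanation, bibliographic references) rather than a single description field.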
Note that an HTML version, adequate for on-line browsing, may be
derived from the structured document. Such a browsable version would be
helpful during the validation phases occurring in the following tasks.
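Such a derivation could be a straightforward traversal of the XML, for instance (the element names are assumptions, pending the DTD):

```python
import xml.etree.ElementTree as ET

# Sketch of deriving a browsable HTML page from the structured XML.
def to_html(xml_text: str) -> str:
    corpus = ET.fromstring(xml_text)
    rows = []
    for entry in corpus.findall("entry"):
        head = entry.findtext("headword", "")
        desc = entry.findtext("description", "")
        rows.append(f"<h2>{head}</h2><p>{desc}</p>")
    return "<html><body>" + "".join(rows) + "</body></html>"

xml_text = ("<corpus><entry><headword>Rosa canina</headword>"
            "<description>A wild rose.</description></entry></corpus>")
print(to_html(xml_text))
```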
2.2 Terminological Extraction
The first linguistic task consists in identifying the terminology
associated with the corpus and building a list of terms together with
their morphological variants.
This task may be achieved using scripts, morphological analyzers, and
terminological extractors (such as FASTER or
LEXTER). Part-of-speech tagging and/or shallow parsing
may also be needed to determine the syntactic category (noun, adjective,
...) of the different terms, because we are dealing with a
vocabulary not necessarily listed in available electronic
dictionaries.
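A minimal stand-in for such an extractor is sketched below: it collects adjective-noun candidate terms from already-tagged tokens. In practice a real POS tagger and tools like FASTER or LEXTER would supply the tags and richer patterns; the tag set and example here are illustrative assumptions:

```python
from collections import Counter

# Toy terminological extractor: count ADJ NOUN bigrams as candidate
# terms. Real tools would also group morphological variants.
def candidate_terms(tagged_tokens):
    terms = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 == "ADJ" and t2 == "NOUN":      # simple ADJ NOUN pattern
            terms[f"{w1.lower()} {w2.lower()}"] += 1
    return terms

tagged = [("The", "DET"), ("perennial", "ADJ"), ("herb", "NOUN"),
          ("has", "VERB"), ("white", "ADJ"), ("flowers", "NOUN"),
          ("and", "CONJ"), ("white", "ADJ"), ("flowers", "NOUN")]
print(candidate_terms(tagged).most_common())
```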
Note that the collected terminology may be used to index the corpus
efficiently and to speed up searches in the on-line version.
2.3 Conceptual Acquisition
The next task would be the core of the research effort for ATOLL: we
plan to organize the collected terminology into some kind of
conceptual structure (such as a thesaurus, an ontology, or a semantic
network). The result should model (more or less precisely) the
concepts, and the relations between these concepts, that emerge from the
corpus. For instance, a flower is a kind of plant (kind-of
relation) and has different parts (part-of relation); each part
has different characteristics such as a color, a texture, a shape, a
dimension, a number, .... A plant also has a geographic
distribution (location relation).
The set of relations is very helpful for drawing inferences, in
particular through inheritance along the kind-of relation.
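A minimal sketch of such a semantic network, showing how a property attached to "plant" is inherited by "flower" along the kind-of relation (the concept and property names are taken from the example above; the data structure itself is only one possible representation):

```python
# Each concept stores its own properties and a link to its kind-of
# parent; lookups fall back along the kind-of chain (inheritance).
class Concept:
    def __init__(self, name, kind_of=None):
        self.name = name
        self.kind_of = kind_of        # parent concept, or None
        self.properties = {}

    def get(self, prop):
        """Look the property up locally, then along the kind-of chain."""
        node = self
        while node is not None:
            if prop in node.properties:
                return node.properties[prop]
            node = node.kind_of
        return None

plant = Concept("plant")
plant.properties["location"] = "geographic distribution"
flower = Concept("flower", kind_of=plant)
flower.properties["part-of"] = ["petal", "stem"]

# "location" is not stored on flower but is inherited from plant.
print(flower.get("location"))
```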
Of course, the modeling task may be greatly helped by a general model
explicitly given by botanists. The acquisition process will complete
and specialize this general model w.r.t. the corpus. In particular, it
will assign semantic categories (such as color, shape, or texture) to
the different terms (mostly adjectives).
This task being more prospective, we only sketch here a few ideas that
should be investigated. We plan to run several passes of partial
parsing and of knowledge acquisition. Each parsing pass will use
the knowledge already acquired to be more complete and precise than the
previous one, in particular when resolving ambiguities. From the
partial parse trees, a set of properties and relations
on concepts may be collected and forwarded to the knowledge acquisition
module. This module organizes this set, looking for the properties and
relations that seem to be relevant w.r.t. the whole corpus (based, for
instance, on statistical or weighting criteria). The emerging
properties and relations may then be proposed to a human supervisor
for validation. A new round of acquisition may then be started.
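The statistical filtering step might look as follows: relation triples collected from the partial parses are kept only if they recur often enough across the corpus, and the survivors would then be shown to the human supervisor. The threshold and the example triples are illustrative assumptions:

```python
from collections import Counter

# Keep only relation triples that recur across the corpus; singletons
# are treated as noise (a very crude weighting criterion).
def emerging_relations(triples, min_count=2):
    counts = Counter(triples)
    return [(t, n) for t, n in counts.most_common() if n >= min_count]

observed = [("flower", "kind-of", "plant"),
            ("flower", "kind-of", "plant"),
            ("petal", "part-of", "flower"),
            ("petal", "part-of", "flower"),
            ("stem", "color", "plant")]   # noise: seen only once
print(emerging_relations(observed))
```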
The efficiency of the parsing passes will be improved by first
identifying the recurrent syntactic and stylistic patterns used in the
corpus. This is especially true for the entry parts, which follow a
quasi-controlled pattern.
2.4 Text Mining
The last task of the proposal is also a research task, but it will be
eased by the quality of the conceptual model produced by the previous
task. The objective is to process each entry in the corpus in order to
fill a database. The processing of an entry will be guided by the
conceptual structure in order to identify where each piece of
information that is found should go in the database. In some respects,
this task is similar to the previous one, except that the work is done
at the level of an entry rather than of the whole corpus, and it can no
longer rely on statistical methods to correct errors. Parsing backed
by knowledge and validation by a supervisor will again be used to
achieve this task.
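The conceptually guided slot filling might be sketched as follows: the conceptual structure tells us which semantic category each extracted term belongs to, and the category determines the database field. The category table and example entry are illustrative assumptions, not the actual ATOLL model:

```python
# Toy fragment of the conceptual structure: term -> semantic category.
CATEGORY_OF = {"white": "color", "ovate": "shape",
               "smooth": "texture", "alpine": "location"}

def fill_record(headword, extracted_terms):
    """Route each extracted term to the database field its category names."""
    record = {"headword": headword}
    for term in extracted_terms:
        category = CATEGORY_OF.get(term)
        if category:                  # unknown terms are left for the supervisor
            record[category] = term
    return record

print(fill_record("Rosa canina", ["white", "ovate", "unknownword"]))
```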
The suitability of the language Delta for encoding the information
extracted from each entry will be examined.