The first step is the transformation of the original unstructured
textual documents into structured ones, better adapted for further
processing. The aim of the transformation is to delimit (with markup
elements) the different entries and the different components of each
entry (description part, explanation part, bibliographic references,
...). The structured documents may be encoded in XML, following a
DTD (to be defined) characterizing this class of document.
The task may be achieved using scripts (Perl or Python) based on
regular expressions.
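As an illustration, the segmentation step might be sketched as follows. The raw entry layout assumed here (a numbered headword line followed by free text) is hypothetical, as are the element names, which would ultimately come from the DTD to be defined:

```python
import re
from xml.sax.saxutils import escape

# Hypothetical raw format: each entry starts with a numbered headword
# line, e.g. "1. Rosa canina", followed by free text until the next entry.
ENTRY_START = re.compile(r"^\d+\.\s+(.+)$", re.M)

def to_xml(raw: str) -> str:
    """Wrap each detected entry in <entry> markup (sketch only)."""
    parts = ENTRY_START.split(raw)
    # parts = [preamble, headword1, body1, headword2, body2, ...]
    out = ["<corpus>"]
    for head, body in zip(parts[1::2], parts[2::2]):
        out.append("  <entry>")
        out.append(f"    <headword>{escape(head.strip())}</headword>")
        out.append(f"    <description>{escape(body.strip())}</description>")
        out.append("  </entry>")
    out.append("</corpus>")
    return "\n".join(out)

raw = ("1. Rosa canina\nA wild rose with pink flowers.\n"
       "2. Viola odorata\nA violet with fragrant flowers.")
print(to_xml(raw))
```

A real script would of course handle the actual entry components (description, explanation, bibliographic references) rather than a single description field.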
Note that an HTML version, adequate for on-line browsing, may be
derived from the structured document. Such a browsable version would be
helpful during the validation phases occurring in the following tasks.
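Such a derivation could be a straightforward traversal of the XML, for instance (the element names are assumptions, pending the DTD):

```python
import xml.etree.ElementTree as ET

# Sketch of deriving a browsable HTML page from the structured XML.
def to_html(xml_text: str) -> str:
    corpus = ET.fromstring(xml_text)
    rows = []
    for entry in corpus.findall("entry"):
        head = entry.findtext("headword", "")
        desc = entry.findtext("description", "")
        rows.append(f"<h2>{head}</h2><p>{desc}</p>")
    return "<html><body>" + "".join(rows) + "</body></html>"

xml_text = ("<corpus><entry><headword>Rosa canina</headword>"
            "<description>A wild rose.</description></entry></corpus>")
print(to_html(xml_text))
```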
2.2 Terminological Extraction
The first linguistic task consists in identifying the terminology
associated with the corpus and building a list of terms together with
their morphological variants.
This task may be achieved using scripts, morphological analyzers, and
terminological extractors (such as FASTER or
LEXTER). Part-of-speech tagging and/or shallow parsing
may also be needed to determine the syntactic category (noun, adjective,
...) of the different terms, because we are dealing with a
vocabulary not necessarily listed in available electronic
dictionaries.
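A minimal stand-in for such an extractor is sketched below: it collects adjective-noun candidate terms from already-tagged tokens. In practice a real POS tagger and tools like FASTER or LEXTER would supply the tags and richer patterns; the tag set and example here are illustrative assumptions:

```python
from collections import Counter

# Toy terminological extractor: count ADJ NOUN bigrams as candidate
# terms. Real tools would also group morphological variants.
def candidate_terms(tagged_tokens):
    terms = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 == "ADJ" and t2 == "NOUN":      # simple ADJ NOUN pattern
            terms[f"{w1.lower()} {w2.lower()}"] += 1
    return terms

tagged = [("The", "DET"), ("perennial", "ADJ"), ("herb", "NOUN"),
          ("has", "VERB"), ("white", "ADJ"), ("flowers", "NOUN"),
          ("and", "CONJ"), ("white", "ADJ"), ("flowers", "NOUN")]
print(candidate_terms(tagged).most_common())
```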
Note that the collected terminology may be used to index the corpus
efficiently and to speed up searches in the on-line version.
2.3 Conceptual Acquisition
The next task would be the core of the research effort for ATOLL: we
plan to organize the collected terminology into some kind of
conceptual structure (such as a thesaurus, an ontology, or a semantic
network). The result should model (more or less precisely) the
concepts, and the relations between these concepts, that emerge from the
corpus. For instance, a flower is a kind of plant (kind-of
relation) and has different parts (part-of relation); each part
has different characteristics such as a color, a texture, a shape, a
dimension, a number, .... A plant also has a geographic
distribution (location relation).
The set of relations is very helpful for drawing inferences, in
particular through inheritance along the kind-of relation.
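A minimal sketch of such a semantic network, showing how a property attached to "plant" is inherited by "flower" along the kind-of relation (the concept and property names are taken from the example above; the data structure itself is only one possible representation):

```python
# Each concept stores its own properties and a link to its kind-of
# parent; lookups fall back along the kind-of chain (inheritance).
class Concept:
    def __init__(self, name, kind_of=None):
        self.name = name
        self.kind_of = kind_of        # parent concept, or None
        self.properties = {}

    def get(self, prop):
        """Look the property up locally, then along the kind-of chain."""
        node = self
        while node is not None:
            if prop in node.properties:
                return node.properties[prop]
            node = node.kind_of
        return None

plant = Concept("plant")
plant.properties["location"] = "geographic distribution"
flower = Concept("flower", kind_of=plant)
flower.properties["part-of"] = ["petal", "stem"]

# "location" is not stored on flower but is inherited from plant.
print(flower.get("location"))
```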
Of course, the modeling task may be greatly helped by a general model
explicitly given by botanists. The acquisition process will complete
and specialize this general model w.r.t. the corpus. In particular, it
will assign semantic categories (such as color, shape, or texture) to
the different terms (mostly adjectives).
This task being more prospective, we only sketch here a few ideas that
should be investigated. We plan to run several passes of partial
parsing and of knowledge acquisition. Each parsing pass will use
the knowledge already acquired to be more complete and precise than the
previous one, in particular when resolving ambiguities. From the
partial parse trees, a set of properties and relations
on concepts may be collected and forwarded to the knowledge acquisition
module. This module organizes this set, looking for the properties and
relations that seem to be relevant w.r.t. the whole corpus (based, for
instance, on statistical or weighting criteria). The emerging
properties and relations may then be proposed to a human supervisor
for validation. A new round of acquisition may then be started.
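The statistical filtering step might look as follows: relation triples collected from the partial parses are kept only if they recur often enough across the corpus, and the survivors would then be shown to the human supervisor. The threshold and the example triples are illustrative assumptions:

```python
from collections import Counter

# Keep only relation triples that recur across the corpus; singletons
# are treated as noise (a very crude weighting criterion).
def emerging_relations(triples, min_count=2):
    counts = Counter(triples)
    return [(t, n) for t, n in counts.most_common() if n >= min_count]

observed = [("flower", "kind-of", "plant"),
            ("flower", "kind-of", "plant"),
            ("petal", "part-of", "flower"),
            ("petal", "part-of", "flower"),
            ("stem", "color", "plant")]   # noise: seen only once
print(emerging_relations(observed))
```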
The efficiency of the parsing passes will be improved by first
identifying the recurrent syntactic and stylistic patterns used in the
corpus. This is especially true for the entry parts, which follow a
quasi-controlled pattern.
2.4 Text Mining
The last task of the proposal is also a research task, but it will be
eased by the quality of the conceptual model produced by the previous
task. The objective is to process each entry in the corpus in order to
fill a database. The processing of an entry will be guided by the
conceptual structure in order to identify where each piece of
information that is found should go in the database. In some respects,
this task is similar to the previous one, except that the work is done
at the level of an entry rather than of the whole corpus, and it can no
longer rely on statistical methods to correct errors. Parsing backed
by knowledge and validation by a supervisor will again be used to
achieve this task.
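The conceptually guided slot filling might be sketched as follows: the conceptual structure tells us which semantic category each extracted term belongs to, and the category determines the database field. The category table and example entry are illustrative assumptions, not the actual ATOLL model:

```python
# Toy fragment of the conceptual structure: term -> semantic category.
CATEGORY_OF = {"white": "color", "ovate": "shape",
               "smooth": "texture", "alpine": "location"}

def fill_record(headword, extracted_terms):
    """Route each extracted term to the database field its category names."""
    record = {"headword": headword}
    for term in extracted_terms:
        category = CATEGORY_OF.get(term)
        if category:                  # unknown terms are left for the supervisor
            record[category] = term
    return record

print(fill_record("Rosa canina", ["white", "ovate", "unknownword"]))
```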
The suitability of the language Delta for encoding the information
extracted from each entry will be examined.