Working on a botanic corpus

Eric de la Clergerie

January 30, 2001

1 Motivation

Extracting information from an encyclopedic botanical corpus may be done by hand, but it is long and tedious work. It is becoming increasingly attractive, and feasible, to speed up the process by automating it while still keeping a human expert for validation. Among the different kinds of information that may be extracted from a botanical corpus, we can cite terminology, conceptual information to model a specialized domain (for instance African tropical flora), and descriptions of a set of plants that follow the conceptual model.

The ATOLL group at INRIA Rocquencourt is interested in this area of research for several reasons. The group is primarily concerned with natural language parsing, and we believe that parsing is a key component of knowledge acquisition from textual documents. We are already involved in this area through participation in an ARC RLT (Ressources Linguistiques pour les TAGs¹) and in the working group A3CTE (Applications, Apprentissage, Acquisition de Connaissances à partir de Textes Electroniques²). Through the European project TermIT³, we have also gained some expertise in conceptual structures (such as thesauri, ontologies, and semantic networks) and in knowledge representation languages (conceptual graphs, description logics). Finally, the group has worked on structuring and representing documents and linguistic resources (using for instance XML and, earlier, the Mentor and Centaur systems), issues that are relevant when processing an encyclopedic corpus with a strong underlying structure.

The flora corpora that we have examined seem promising for these acquisition tasks. Indeed, they are strongly structured, following relatively precise patterns of presentation.
Style is relatively formal in some parts (descriptive parts) and freer in others (explanatory parts), which makes it possible to test how acquisition depends on style. Moreover, the corpora present a rich but very uncommon vocabulary to describe, for instance, colors, textures, and shapes. This vocabulary will not be found in available electronic dictionaries and will require an extraction phase. Botanical conceptual models are not trivial but seem relatively well delimited and organized, having been developed over a long period of time. This means that a large amount of knowledge can be supplied by an expert at the beginning of the acquisition process, and that validation of the extracted knowledge should be easier.

2 Program

We identify four main tasks.

2.1 Structuring the Corpus

The first step is the transformation of the original unstructured textual documents into structured ones, better adapted to further processing. The aim of the transformation is to delimit (with markup elements) the different entries and the different components of each entry (description part, explanation part, bibliographic references, ...). The structured document may be encoded in XML, following a DTD (to be defined) characterizing this class of document. The task may be achieved using scripts (Perl or Python) based on regular expressions.

Note that an HTML version, suitable for on-line browsing, may be derived from the structured document. Such a browsable version would be helpful during the validation phases occurring in the following tasks.

¹ http://atoll.inria.fr/RLT/
² http://www-lipn.univ-paris13.fr/groupes-de-travail/A3CTE/
³ http://www.mda.org.uk/term-it/

2.2 Terminological Extraction

The first linguistic task consists in identifying the terminology associated with the corpus and building a list of terms with their morphological variants. This task may be achieved using scripts, morphological analysis, and terminological extractors (such as FASTR or LEXTER).
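As a rough illustration of the script-based side of this task, the following Python sketch groups surface forms under a normalized key and counts them. The sample text, the stop list, and the plural-stripping rule in `normalize` are all assumptions made for the example: they stand in for real corpus entries and for a genuine morphological analyzer, and they say nothing about how the extractors named above work.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample standing in for the descriptive part of one entry.
SAMPLE = (
    "Petals yellow; petal margins ciliate. "
    "Sepals glabrous, each sepal with a ciliate margin."
)

# Tiny illustrative stop list (an assumption, not a linguistic resource).
STOPWORDS = {"a", "each", "with"}


def normalize(token):
    """Crude stand-in for morphological analysis: lowercase, strip a plural -s."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us")):
        return token[:-1]
    return token


def collect_terms(text):
    """Group surface forms under a normalized key and count occurrences."""
    variants = defaultdict(set)
    counts = Counter()
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() in STOPWORDS:
            continue
        key = normalize(token)
        variants[key].add(token.lower())
        counts[key] += 1
    return variants, counts


variants, counts = collect_terms(SAMPLE)
for key, n in counts.most_common():
    print(key, n, sorted(variants[key]))
```

On this sample, `petal` and `petals` fall under the same key. Irregular forms (such as leaf/leaves) would not be grouped by such a rule, which is one reason a real morphological analyzer is needed.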
Part-of-speech tagging and/or shallow parsing may also be needed to find the syntactic category (noun, adjective, ...) of the different terms, because we are dealing with a vocabulary not necessarily listed in available electronic dictionaries.

Note that the collected terminology may be used to index the corpus efficiently and to speed up searches in the on-line version.

2.3 Conceptual Acquisition

The next task would be the core of the research effort for ATOLL: we plan to organize the collected terminology into some kind of conceptual structure (such as a thesaurus, an ontology, or a semantic network). The result should model (more or less precisely) the concepts, and the relations between these concepts, that emerge from the corpus. For instance, a flower is a kind of plant (kind-of relation) and has different parts (part-of relation); each part has different characteristics such as a color, a texture, a shape, a dimension, a number, and so on. A plant also has a geographic distribution (location relation). The set of relations is very helpful for drawing inferences, in particular through inheritance along the kind-of relation.

Of course, the modeling task may be helped considerably by a general model explicitly given by botanists. The acquisition process will complete and specialize this general model w.r.t. the corpus. In particular, it will assign semantic categories (such as color, shape, texture) to different terms (mostly adjectives).

Since this task is more exploratory, we only sketch here a few ideas that should be investigated. We plan to run several passes of partial parsing and of knowledge acquisition. Each parsing pass will use the knowledge already present to be more complete and precise than the previous one, in particular in resolving ambiguities. From the partial parse trees, a set of properties and relations on concepts may be collected and forwarded to the knowledge acquisition module.
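The kind-of, part-of, and location relations described in this section can be made concrete with a small data structure. The Python sketch below is only one possible encoding, under stated assumptions: the `Concept` class, its field names, and the sample values are hypothetical, and only the flower/plant kind-of link comes from the text. It shows inheritance through kind-of: a property missing on a concept is looked up on its ancestors.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    """Hypothetical node of a conceptual structure (names are assumptions)."""
    name: str
    kind_of: Optional["Concept"] = None             # kind-of relation
    parts: set = field(default_factory=set)         # part-of relation (part names)
    properties: dict = field(default_factory=dict)  # e.g. color, location

    def lookup(self, prop):
        """Return a property value, inherited along kind-of links if needed."""
        node = self
        while node is not None:
            if prop in node.properties:
                return node.properties[prop]
            node = node.kind_of
        return None


# Illustrative values only; they are not taken from any corpus.
plant = Concept("plant", properties={"location": "tropical Africa"})
flower = Concept("flower", kind_of=plant, parts={"petal"},
                 properties={"color": "yellow"})

print(flower.lookup("color"))     # set locally on flower
print(flower.lookup("location"))  # inherited from plant through kind-of
print(plant.lookup("color"))      # not defined on plant or any ancestor
```

A dictionary of properties keeps the sketch short; a conceptual model supplied by botanists would more likely constrain which properties each kind of concept may carry.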
This module organizes this set, looking at which properties and relations seem pertinent w.r.t. the whole corpus (based for instance on statistical or weighting criteria). The emerging properties and relations may then be proposed to a human supervisor for validation. A new round of acquisition may then be started.

The efficiency of the parsing passes will be improved by first identifying the recurrent syntactic and stylistic patterns used in the corpus. This is especially true for the parts of an entry that follow a quasi-controlled pattern.

2.4 Text Mining

The last task of the proposal is also a research task, but it will be eased by the quality of the conceptual modeling produced by the previous task. The objective is to process each entry in the corpus in order to fill a database. The processing of an entry will be guided by the conceptual structure in order to identify where each piece of information found should go in the database.

In some respects, this task is similar to the previous one, except that the work is done at the level of an entry rather than of the whole corpus, and can no longer rely on statistical methods to correct errors. Parsing backed by knowledge, and validation by a supervisor, will again be used to achieve this task.

The adequacy of the Delta language for encoding the information extracted from each entry will be examined.

3 Objectives

The objective for ATOLL is to explore several research issues in the area of parsing and acquisition during this work. We also wish to establish a methodology that may be used to deal with other corpora of encyclopedic information and with domains other than botany.

Besides the research effort, this project will involve the development of prototypes and, hopefully, the creation of resources (structured corpus, terminology, conceptual structure, database). It should be noted that, among the different tools that are needed, validation tools for the human supervisors must not be overlooked.
We consider that this project requires a long-term commitment from, for instance, a PhD student to coordinate the different tasks and to maintain a dialogue with all the communities involved (botanists and computational linguists).

4 Related works

4.1 WordNet and EuroWordNet

WordNet is a computational lexicon that classifies English words into sets of synonyms (synsets) that denote concepts. Synsets are themselves organized in a kind-of taxonomy. Additional relations between synsets are present, such as part-of relations. WordNet was built by hand, but experiments in automating the construction of sibling wordnets for European languages (the EuroWordNet project) are being conducted. The importance of WordNet as a source of conceptual information for all kinds of linguistic processing has been recognized through many experiments and specialized workshops.

4.2 MindNet

MindNet is a massive semantic network built by a Microsoft Research team by automatically extracting knowledge from the machine-readable dictionary LDOCE and, more recently, from the Encarta encyclopedia. Knowledge is extracted using a robust parser based on the grammar checker of the Microsoft Office suite and is incorporated into the semantic network. The nodes of the network are either concepts (such as car) or relations (such as drive), and edges are labeled by several kinds of lexical, syntactic, thematic, or semantic relations (listed in Figure 1).

Attribute      Goal       Possessor
Cause          Hypernym   Purpose
Co-Agent       Location   Size
Color          Manner     Source
Deep_Object    Material   Subclass
Deep_Subject   Means      Synonym
Domain         Modifier   Time
Equivalent     Part       User

Figure 1: Semantic relations in MindNet

5 Background knowledge at INRIA

Other INRIA research groups (at Rocquencourt or at the other sites) may provide useful knowledge or tools in the context of this project, in particular for representing knowledge and associating documents with knowledge.

Verso⁴: Databases and XML.
They are involved in XYLEME⁵, "The Data Warehouse for the XML Documents of the WEB", and in C-Web⁶, "Supporting Community-Webs". They already collaborate with CSIRO.

Samie: A very recent group led by Alain Michard, one of the main promoters of C-Web⁷.

⁵ http://www-rocq.inria.fr/verso/LEVEL1/Xyleme.html
⁶ http://cweb.inria.fr/
⁷ http://cweb.inria.fr/

References

[1] Thomas Ahlswede and Martha Evens. Parsing vs. text processing in the analysis of dictionary definitions. In Proc. of ACL'88, pages 217-224, 1988.

[2] H. Alshawi. Analysing the dictionary definitions. In Computational lexicography for natural language processing, pages 153-169. Longman, 1989.

[3] A. Analyti, N. Spyratos, and P. Constantopoulos. On the definition of semantic network semantics. Technical Report FORTH-ICS-TR-187, FORTH, Hellas, 1997.

[4] Doug Beeferman. Lexical discovery with an enriched semantic network. In Harabagiu [10], pages 135-141.

[5] J. Chang and J. Chen. Acquisition of computational-semantic lexicons from machine readable resources. In ACL'96 Workshop on the Breadth and Depth of Semantic Lexicons, 1996.

[6] Martin Chodorow, Roy J. Byrd, and George E. Heidorn. Extracting semantic hierarchies from a large on-line dictionary. In Proc. of ACL'85, pages 299-304, 1985.

[7] Giuseppe De Giacomo and Maurizio Lenzerini. A uniform framework for concept definitions in description logics. Journal of Artificial Intelligence Research, 6:87-110, 1997.

[8] Xavier Farreres, German Rigau, and Horacio Rodríguez. Using WordNets for building WordNets. In Harabagiu [10], pages 65-72.

[9] Thomas R. Gruber. Toward principles for the design of ontologies used for knowledge sharing. In Formal Ontology in Conceptual Analysis and Knowledge Representation. Kluwer Academic Press, 1993. Also Technical Report KSL 93-04, Stanford University.

[10] Sanda Harabagiu, editor. COLING-ACL'98 Workshop on "Usage of WordNet in Natural Language Processing Systems". Université de Montréal, 1998.

[11] M. Doerr and I. Dionysiadou. Mapping of material culture to a semantic network. In Proc. 1994 Joint Annual Meeting, International Council of Museums Documentation Committee and Computer Network, 1994.

[12] N. Ide and J. Veronis. Extracting knowledge bases from machine-readable dictionaries. In Proc. of KB&KS'93, pages 257-266, 1993.

[13] Daniel Kayser. La représentation des connaissances. Hermes, 1997.

[14] J. Klavans, M. Chodorow, and N. Wacholder. From dictionary to knowledge base via taxonomy. In Proc. of the sixth conf. of the University of Waterloo, Canada, 1990.

[15] Oi Yee Kwong. Bridging the gap between dictionary and thesaurus. In COLING-ACL'98 [26], pages 1487-1489.

[16] Claudia Leacock, Martin Chodorow, and George A. Miller. Using corpus statistics and WordNet relations for sense identification. Computational Linguistics, 24(1):147-166, March 1998.

[17] J. Markowitz, T. Ahlswede, and M. Evens. Semantically significant patterns in dictionary definitions. In Proc. of ACL'86, pages 112-119, 1986.

[18] G. Miller. Five papers on WordNet. Special issue of Int. Journal of Lexicography, 3(4), 1990.

[19] S. Montemagni and L. Vanderwende. Structural pattern vs. string pattern for extracting semantic information from dictionaries. In Proc. of ACL'92, pages 546-552, 1992.

[20] Tom O'Hara, Kavi Mahesh, and Sergei Nirenburg. Lexical acquisition with WordNet and the Mikrokosmos ontology. In Harabagiu [10], pages 94-101.

[21] Éric Villemonte de la Clergerie. Multilingual terminology production through an intermediate knowledge level: Knowledge acquisition methods and techniques. Task 3.3.2 of Project LE4-8356 Term-IT, to be included in document D3.1, June 1999.

[22] Stephen D. Richardson, William B. Dolan, and Lucy Vanderwende. MindNet: Acquiring and structuring semantic information from text. In COLING-ACL'98 [26], pages 1098-1102.

[23] C. Rigau, J. Atserias, and E. Agirre. Building accurate semantic taxonomies from MRDs. In COLING-ACL'98 [26], pages 1103-1109.

[24] John Sowa. Lexical Structures and Conceptual Structures. Kluwer, 1989.

[25] John Sowa, editor. Principles of Semantic Networks. Morgan Kaufmann, 1991.

[26] Université de Montréal. COLING-ACL'98. Morgan Kaufmann Publishers, August 1998.

[27] P. Vossen, P. Diez-Orzas, and W. Peters. The multilingual design of EuroWordNet. In Proc. of the ACL/EACL-97 workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP application, 1997.

[28] Piek Vossen, editor. EuroWordNet: A multilingual database with lexical semantic networks. Kluwer Academic Publishers, 1998.

[29] Pierre Zweigenbaum and Jacques Bouaud. Construction d'une représentation sémantique en graphes conceptuels à partir d'une analyse LFG. In Proc. of TALN'97, 1997.