Alpage: Deep syntactic modeling and parsing

Deep syntactic modeling and parsing

Although supra-sentential phenomena have received increasing attention in recent years, no satisfactory account of them exists yet. Among them, anaphora resolution and discourse and information structures have a far-reaching impact and are domains of expertise of Alpage members (notably those from Talana). Their formal modeling is now mature enough to be integrated, in the near future, into Alpage tools, including the parsing systems inherited from Atoll. Once again, combining the expertise of Atoll and Talana members will allow significant breakthroughs.

Anaphora resolution

Our goal is to contribute to solving most of the anaphora-related problems presented above, with a focus on automatic anaphora resolution. Concretely, this will be achieved in three steps:

Automatic resolution of pronominal anaphora: Several documented algorithms can be found in the literature; they should be integrated into the processing chain mentioned above, relying on, among others, named entity detection and chunking modules.
Resolution of so-called non-individual anaphora (references to events, propositions, facts): There is currently very little literature on these questions. We intend to develop both algorithms and tools relying on these algorithms, extending the previous step, so as to validate hypotheses on this issue and to supplement our processing chain.
Generalization of anaphora resolution to the problem of presupposition justification (Kamp): Since the seminal paper of van der Sandt [???], several studies have sought to delimit this empirical domain of formal semantics, but there is still a strong need to address this question in terms of linguistic coverage.

Modeling and parsing discourse structures

The first part of Alpage's theoretical work on discourse structures will focus on the description and definition of connectors, i.e., words that lexicalize a discourse relation. The second part will study the discourse structures themselves, as induced by these discourse relations, and in particular the fundamental opposition between coordinating and subordinating relations. Indeed, this opposition is the basis on which important notions, such as the right frontier, can be defined. Moreover, most discourse theories (including D-STAG) assume that discourse structures are trees, as is the case for constituency structures at the syntactic level. Laurence Danlos has shown, however, that this is a simplistic approximation and that DAGs (Directed Acyclic Graphs) are required, albeit with strong (though as yet poorly described) constraints [124]. This could lead to an improvement of the D-STAG model, so that it produces discourse "dependency structures" that are DAGs.
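To make the tree/DAG distinction and the role of the coordinating/subordinating opposition more concrete, here is a minimal Python sketch. It is not the D-STAG formalism or any Alpage implementation: the class `DiscourseGraph`, the edge labels `"coord"` and `"sub"`, and the simplified right-frontier rule (the last segment plus every segment reachable from it by climbing subordinating edges) are assumptions chosen for illustration only.

```python
from collections import defaultdict


class DiscourseGraph:
    """Toy discourse structure: nodes are segment ids, edges are typed relations.

    Hypothetical illustration only; names and the frontier rule are assumptions.
    """

    def __init__(self):
        self.segments = []                # segment ids in textual order
        self.parents = defaultdict(list)  # child -> [(parent, kind), ...]

    def add_segment(self, seg):
        self.segments.append(seg)

    def relate(self, parent, child, kind):
        # kind is "coord" (coordinating) or "sub" (subordinating)
        if kind not in ("coord", "sub"):
            raise ValueError("kind must be 'coord' or 'sub'")
        self.parents[child].append((parent, kind))

    def is_tree(self):
        # In a tree, every segment has at most one parent.
        return all(len(self.parents.get(s, [])) <= 1 for s in self.segments)

    def right_frontier(self):
        # Simplified rule: the last segment, plus every segment reached
        # from it by following subordinating edges upwards.
        if not self.segments:
            return []
        last = self.segments[-1]
        frontier = [last]
        stack = [last]
        while stack:
            node = stack.pop()
            for parent, kind in self.parents.get(node, []):
                if kind == "sub" and parent not in frontier:
                    frontier.append(parent)
                    stack.append(parent)
        return frontier


g = DiscourseGraph()
for seg in ("s1", "s2", "s3"):
    g.add_segment(seg)
g.relate("s1", "s2", "sub")    # s2 is subordinated to s1
g.relate("s2", "s3", "coord")  # s3 is coordinated with s2
g.relate("s1", "s3", "sub")    # s3 is also subordinated to s1

print(g.right_frontier())  # ['s3', 's1']
print(g.is_tree())         # False: s3 has two parents
```

In this toy example, the coordinating relation closes off s2, so only s3 and s1 remain open for attachment, and the second parent of s3 is exactly what a tree-shaped representation cannot express.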

Moreover, not much is known about the linguistic resources other than discourse connectors for signalling coherence relations. Ongoing research by Laurence Danlos [44] aims to study these non-connector resources for marking coherence relations, namely "discourse verbs" and "discourse prepositions". "Discourse verbs" are verbs such as "precede" or "cause", which take eventualities or facts as arguments. An example of a discourse preposition is "with" in "John is crazy with grief".

This rather theoretical approach will make it possible to extend Alpage's parsing systems to the level of discourse, through actual implementations of the D-STAG model. This will require transcribing the linguistic descriptions of discourse connectors and relations into a grammar of discourse. This is a non-trivial task, for example because of the mismatch between the syntactic and discourse levels that characterizes some discourse connectors (e.g., "ensuite" is unary at the syntactic level but binary at the discourse level).
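The arity mismatch mentioned above can be pictured as a property of lexicon entries. The following Python sketch is purely illustrative: `ConnectiveEntry`, its field names and the tiny `LEXICON` are hypothetical, and the entry for "parce que" (assumed here to be binary at both levels) is added only as a contrast case; only the values for "ensuite" come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectiveEntry:
    """Hypothetical lexicon entry for a discourse connector."""
    form: str
    syntactic_arity: int  # arguments taken at the syntactic level
    discourse_arity: int  # discourse segments related at the discourse level

    @property
    def has_mismatch(self):
        # True when the two levels disagree on the number of arguments.
        return self.syntactic_arity != self.discourse_arity


LEXICON = [
    ConnectiveEntry("ensuite", syntactic_arity=1, discourse_arity=2),
    # Assumption for contrast: a subordinating conjunction, binary at both levels.
    ConnectiveEntry("parce que", syntactic_arity=2, discourse_arity=2),
]

mismatched = [entry.form for entry in LEXICON if entry.has_mismatch]
print(mismatched)  # ['ensuite']
```

A discourse grammar would need a dedicated treatment for the entries flagged in this way, since their second discourse argument is not available in the sentence where the connector occurs.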

Without trying to develop a full-featured text understanding system (which would be unrealistic given the current state of the art), we will extend the deep syntactic parsing systems described above with a discourse analysis that identifies discourse relations, drawing in particular on the work on Synchronous TAG described in section 4.1.2. As noted above, this will benefit other parsing tasks such as anaphora resolution (e.g., it is reasonable to look for the antecedent of an anaphoric element in a search space defined by the discourse structure) or information structure extraction, which is highly relevant for the syntactic and semantic levels (information structure studies the differences between "Peter ate a cake" and "It's a cake that Peter ate", or between "John fell; Mary pushed him" and "Mary pushed John; he fell").
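The idea of a search space defined by the discourse structure can be sketched as a simple filter. This is a minimal illustration under stated assumptions, not a documented resolution algorithm: the `Mention` type, the agreement features, the segment ids and the set of accessible segments (which a discourse parser would be expected to supply) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical mention record: surface text, segment id, agreement features."""
    text: str
    segment: str
    gender: str
    number: str


def candidate_antecedents(pronoun, mentions, accessible_segments):
    """Keep mentions that sit in an accessible segment and agree with the pronoun."""
    return [
        m for m in mentions
        if m.segment in accessible_segments
        and m.gender == pronoun.gender
        and m.number == pronoun.number
    ]


# "John fell; Mary pushed him", preceded by a segment assumed to be inaccessible.
mentions = [
    Mention("Peter", "s0", "masc", "sg"),  # segment closed off by the discourse structure
    Mention("John", "s1", "masc", "sg"),
    Mention("Mary", "s2", "fem", "sg"),
]
him = Mention("him", "s2", "masc", "sg")

# Assumption: the discourse analysis has marked s1 and s2 as accessible from s2.
candidates = candidate_antecedents(him, mentions, accessible_segments={"s1", "s2"})
print([m.text for m in candidates])  # ['John']
```

Without the discourse filter, "Peter" would remain a candidate on agreement grounds alone; restricting the search to accessible segments removes it before any ranking step.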