Tabulation in Computational Linguistics and Logic Programming

This page presents brief descriptions of different aspects of my research work on tabulation in Computational Linguistics and Logic Programming. This is still under construction (and maybe forever) and far from being exhaustive. Please follow references to my publications for more information.

Tabulation

The idea of Tabulation is quite general and is present under different forms in many domains of application. The basic principle consists into saving traces of computation in some table. It become then possible for instance to check the repetition of a subcomputation to avoid looping or to reuse the result of a subcomputation in different contexts.

Maybe the most popular example of tabulation is the so called "memoization" technique used in imperative or functional programming, for instance to compute the function Fibonacci.

Computation Sharing provided by tabulating is even more interesting when dealing with highly non-deterministic computations where similar subcomputations may be found in the different alternative. However, techniques to use are also more difficult in this context, principally because the result of a computation is usually the least fix point of some operator. For instance, in Logic Programming, to compute the answers to some (sub)query q(X) where q is a recursive predicate, one has to propagate the already computed answers until reaching the fix-point.

Areas of Application

Tabulation techniques may be used to avoid a good deal of infinite looping in Logic Programming. They also provide a good support to implement extended negation, in particular for stratified programs.

They also have been used in Deductive Data Bases, here again for termination and completeness. In the context of Deductive Data Bases, they are generally related to program transformation techniques called Magic Set.

Abstract Interpretation is also a domain where tabulation is useful to compute complete least-fix point of some abstract computation for a program. Moreover, termination is important because Abstract Interpretation is ideally a phase in a compiling process (and compilers don't loop). Computation sharing induced by tabulation is also increased when computing a fix-point over an abstract domain, because this domain is much smaller than the original concrete one.

In computational linguistics, tabulation is needed to face the inherent ambiguity of language. A simple example of propositional attachment in sentences may lead to a combinatorial time explosion without some kind of tabulation (generally ensured by using a chart parser). Algorithm CKY is maybe the most basic form of tabular algorithm in parsing where the algorithm fills in a bottom-up way a 2 dimensional table, indexed by string position and constituent length. More sophisticated are Earley algorithm and chart parser, where are tabulated labeled edges between string positions. A tutorial (in French) presented at TALN99 tries to retrace the history of tabulation in NLP and to outline the current state of art.

Push-Down Automata

Push-Down Automata [PDA] are well-known devices used in Formal Language Theory to characterize for instance Context-Free Languages. Their main component is a stack, i.e. a last-in first-out data structure. PDAs are actually well-adapted to express the notion of procedure calls and recursion, where one has to suspend temporally the current computation to process a sub-computation.

There are many interesting extensions of the traditional PDAs working on stack of symbols. In particular, I've been using and refining the notion of Logic Push-Down Automata [LPDA] working on stack of first-order terms and have introduced the notion of Subsumption-Oriented PDAs over ordered abstract domain of stacks.

More exotic variations of PDAs such as 2-stack automata [2SA] or Embedded Push-Down Automata [EPDA] are also interesting to deal with Tree Adjoining Grammars and Linear Indexed Grammars.

In my opinion, PDAs (and more generally automata) are very useful to cleanly model the operational semantic to be associated with a declarative grammar or logic program.

Dynamic Programming

Dynamic Programming could actually be seen as a synonym for tabulation. This is a principle to design algorithms where one tries to break computation into sub-computations that may be used several times in different contexts. These sub-computations are tabulated in order to save re-computation.

Actually, Dynamic Programming is mostly used to solve optimization problems where sub-computations corresponds to local optimums that may be joined to compute a global one. A good example is the Knapsack problem.

Dynamic Programming may also be used to solve many graph traversal problems (such as shortest path or path collector between 2 points). Following derivations of a transition may also be seen as a kind of graph traversal and submitted to Dynamic Programming techniques. Actually, the topology of the graph associated to the derivations of a Push-Down Automata exhibits a lot of regularities, in other words, a lot of possibilities for computation sharing using Dynamic Programming techniques. In particular, a PDA derivation may ignore for a while the bottom part of the control stack.

Subsumption

Subsumption is an important issue when using tabulation to keep traces of monotonous subcomputations over ordered domains. Indeed, it is then only necessary to tabulate traces associated to most general sub-computations, other ones being just instances.

DyALog

DyALog has been developed during the last few years to show the validity of tabulation for logic programs and grammars.

There is no recent and complete description of DyALog, but one can browse the following papers (in French and English).

The current Release 1.11.0 of DyALog may be downloaded either as a source package or a binary rpm. The current version runs under Linux/i*86 and (possibly) SunOS. The documentation is still largely incomplete and (unfortunately) to be updated (PDF Version). The NEWS retraces the history of DyALog functionnalities.

Typed Feature Structures are available as well as bidirectional parsing for Definite Clause Grammars and Bound Movement Grammars, hilog notation, and some aggregate predicates. Release 1.1 also introduced the compilation of Tree Adjoining Grammars. An XTAG-like architecture for TAGs (with tree families, lemma, lexicons, and (co)anchors) is possible in Release 1.4.2 .

Release 1.5.0 offers the possiblity to declare interfaces to C functions.

Release 1.9.0 includes an experiemental implementation of Range Concatenation Grammars [RCG] extended with logical arguments.

Release 1.10.1 has introduced floats, namespaces, modules, and a toplevel. There is also a new class of more efficient tabulated predicates and support for left-corner parsing strategies for DCG. Support has also been added to start new projects with DyALog.

Release 1.10.2 is a minor release

Release 1.10.3 has introduced hybrid Tree Insertion Grammars (TIGs) and TAGs parsing (with analyze of TAGs to extract the TIG part). Multiple adjunction is possible for TIG left and right adjunctions. This version has also introduced a kleene star operator for DCGs, TAGs and TIGs as well as an interleave operator for DCGs.

Release 1.10.4 is a minor release that fixes a few bugs and improves support for Kleene star and interleave operators.

Papers and presentation slides are available, [TALN02] (in French), [TAPD98] (in English) [ATALA99] (in French).

DyALog may now be tried on line for differents grammars (TAG and BMG) and parsers.

Indexing

Efficient term searching in the table modulo unification or subsumption are one of the main problems yet to solve in DyALog. A system of tries is currently used where searching is done by following term arguments in a depth-first and left-to-right way (looking terms as trees), and ignoring variable correlations. This scheme does not work well when storing complex terms that only discriminate late on the indexing path. Moreover, this scheme only provide an approximation of searching modulo unification and must be completed by costly unification on each potential candidate. Note that is still better that what provided by most Prolog systems, where indexing is only done on the first predicate argument.

A better approach seems to use substitution trees to factorize unification along a tree whose leaves are the indexed terms. Searching is then close from a Prolog process with backtracking. There is no need for further unification when reaching leaves. This scheme should be tried in a near future in DyALog.

Structure Sharing

Reducing the memory cost of tabulating terms is a necessity, hence the need for some efficient structure-sharing mechanism (in opposition of copying mechanism found in most Prolog systems). The delicate point to address is the problem of variable renaming: indeed, tabulated terms must be considered as without common variable.

DyALog uses an efficient structure-sharing mechanism [POPL93].

Other Tabular Systems

Besides DyALog, there are different tabular systems:
XSB
A freely available Prolog system developed at Stony Brook University by David S. Warren's team. It provides tabulation for logic programs. It is based on a standard Warren Abstract Machine and implements tabulation as memoization.

B-Prolog
Another freely available Prolog system developed at Kyushu Institute of Technology (Japan) implementing Linear Tabling.

GALENA
A parser generator for Context-Free Grammars and Definite Clause Grammars developed by the COLE Research Group at La Coruna (Spain).

Typed Feature Structures

Feature structures have been known for quite a long time in Computational Linguistics (a great ancestor system being PATR-II). They are roughly equivalent to records in most Programming languages and are very useful to model properties of linguistic constituents (for instance, number and genre). Adding types provides a lot of flexibility to write grammars or lexicons, exploiting default and inheritance.

A seminal reference on Typed Feature Structures [TFS] is Carpenter's book. Based on this book, the system ALE is freely available and runs on top of most Prolog systems.

Given their linguistic importance, Typed Feature Structures have been added to DyALog [TFS]. Their integration has also shown that Structure-Sharing (as provided by DyALog) is a plus for TFS.

Tree Adjoining Grammars

Tree Adjoining Grammars are a very interesting class of linguistically motivated grammars introduced by Joshi. They are natural extensions of Context-Free Grammars and belong also to a larger class of formalisms parsable in polynomial space and time: Mildly Context Sensitive Grammars. They are also equivalent to Linear Indexed Grammars.

Tabular algorithms exists for TAGs and LIGs but are usually designed for a specific parsing strategy and have varying complexities. There also exist different kinds of automata useful for TAGs and LIGs, such as 2-stack automata [2SA] and Embedded Push-Down Automata [EPDA].

In a recent joint work, Miguel Alonso Pardo and I have proposed a restricted class of 2-SA that may be used to describe a wide range of parsing strategies for TAGs and LIGs. A Dynamic Programming interpretation of these 2-SA exists that gives the best known time and space general complexity. A more specialized version has also been proposed for bottom-up strategies that reduce the space complexity (but not the time complexity).

[ACL98] [TAG+4] [MOL99] [EACL99] [PhD Alonso] [TAG+5a]

Shared Parse Forest

An very interesting by-product of tabular parsers is the possibility to represent the whole set of parse trees for a sentence by a compact shared parse forest. For instance, such a forest may represent in cubic space an exponential set of parse trees, for a context-free grammar.

A parse forest may be used to share both common top or bottom parts of parse trees. Even more interestingly, they are actually formally equivalent to grammars. This property holds for logic based grammars (and logic programs) with a supplementary characteristic: parse forest may be used to enumerate all parse tree without failure during the enumeration process.

From a practical point of view, parse forest seem to be good candidate for post-parsing processes such as translation, where some ambiguities may cross unchanged over languages while easy identification of others will ease a man-machine dialogue.

DyALog can display parse shared forest. In the context of DyALog integration in the GATE platform, I have also added a small parse forest browser. Still tentative, parse forest may now be manipulated from within DyALog in order to be able to express forest transformation (for instance, to build a shared "semantic forest" from a parse forest).

Collaborations