This thesis is situated at the boundary between formal language theory, algebra, and natural language processing (NLP).
NLP employs a wide range of language models,
from simple n-gram models to the recently successful large language models (LLMs).
Formal approaches to NLP view natural languages as formal languages, i.e., infinite sets of strings, where each phrase is seen as a string, and they seek finite descriptions of these sets.
Beyond language modeling, NLP deals with tasks such as syntactic analysis (or parsing), translation, information retrieval, and many others.
Solving such tasks using language models involves two steps:
Given a phrase of natural language, the model first builds a representation of the phrase and then computes the solution from that representation.
Formal language models usually employ trees or similar structures as representations, whose evaluation to output values can be elegantly described using algebra.
Chomsky introduced phrase structure grammars, which describe a process of generating strings using rewriting rules.
For modeling natural language, these rules follow an important aspect of its syntax: constituency, i.e., the hierarchical structure of phrases.
The best-known grammar formalism is that of context-free grammars (CFG).
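As a minimal illustration (our own example, not one from the thesis), consider the CFG with the rules

    S -> a S b
    S -> ε

It generates the language { a^n b^n | n >= 0 }: each application of the first rule wraps the generated string in one further a…b pair.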
However, CFG fail to model discontinuities in constituency, where several non-adjacent parts of a phrase form a subphrase.
For instance, the German sentence “ich war auch einkaufen” (“I went shopping, too”) can be read such that “ich” and “auch” together form a noun phrase; this constituent is discontinuous because it is interrupted by the verb “war”.
This problem can be solved by employing more expressive grammar formalisms such as linear context-free rewriting systems (LCFRS).
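To sketch how this works (using one common LCFRS notation; the concrete rules are our own illustration), the noun phrase can be given fan-out two, so that its two parts are placed around the parts of the verb phrase:

    S(x1 y1 x2 y2)     -> NP(x1, x2) VP(y1, y2)
    NP(ich, auch)      -> ε
    VP(war, einkaufen) -> ε

The first rule interleaves the two components of NP with the two components of VP, yielding “ich war auch einkaufen” while keeping “ich auch” a single constituent.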
There are also grammar formalisms that generate sets of trees, e.g., regular tree grammars (RTG).
A similar formalism exists in finite-state tree automata (FTA), whose semantics is defined in terms of accepting an input rather than generating it; nevertheless, FTA and RTG have the same expressive power.
Universal algebra lets us view trees as elements of a term algebra, which can be evaluated to values in another algebra by applying a unique homomorphism.
For instance, the strings generated by a CFG can be obtained by evaluating trees over the rules of the CFG in this way.
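The following Haskell sketch illustrates this evaluation; the type and function names are our own assumptions for illustration, not definitions from the thesis. It evaluates derivation trees of the CFG from above to the strings they generate.

    -- Ranked trees over a set of labels (here: rule labels of a grammar).
    data Tree l = Node l [Tree l]

    -- An algebra interprets every label as an operation on the carrier type a.
    type Algebra l a = l -> [a] -> a

    -- The unique homomorphism from the term algebra into the given algebra:
    -- evaluate the subtrees, then apply the operation of the root label.
    eval :: Algebra l a -> Tree l -> a
    eval alg (Node l ts) = alg l (map (eval alg) ts)

    -- Hypothetical labels for the two rules S -> a S b and S -> ε.
    data Rule = RWrap | REmpty

    -- String algebra: RWrap concatenates terminals around its argument.
    stringAlg :: Algebra Rule String
    stringAlg RWrap  [s] = "a" ++ s ++ "b"
    stringAlg REmpty _   = ""

    -- Evaluating a derivation tree yields the generated string "aabb".
    example :: String
    example = eval stringAlg (Node RWrap [Node RWrap [Node REmpty []]])

Note that eval is independent of the particular algebra; only the chosen algebra determines which values the trees are evaluated to.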
Parsing is the problem of computing the constituency structure of a given phrase. Due to the ambiguity of natural language, several such structures may exist.
This problem can be extended with weights, such as probabilities, in order to compute, for instance, the best constituency structure.
The framework of semiring parsing abstracts from particular weights and is instead parameterized by a semiring; many NLP problems can be obtained by plugging in an appropriate semiring.
However, the semiring parsing algorithm is only applicable to some problem instances. Weighted deductive parsing is a similar framework that employs a different algorithm, and thus its applicability differs.
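As a hedged sketch of this parameterization (the Haskell names below are ours, not an API from the literature), a semiring provides one operation for summing over alternative derivations and one for combining the parts of a single derivation; changing the instance changes the problem being solved:

    -- A semiring: (zero, <+>) sums alternatives,
    -- (one, <.>) combines the parts of a derivation.
    class Semiring s where
      zero, one    :: s
      (<+>), (<.>) :: s -> s -> s

    -- Boolean semiring: parsing becomes recognition (is there a derivation?).
    instance Semiring Bool where
      zero  = False
      one   = True
      (<+>) = (||)
      (<.>) = (&&)

    -- Viterbi semiring: parsing computes the weight of a best derivation.
    newtype Viterbi = Viterbi Double

    instance Semiring Viterbi where
      zero = Viterbi 0
      one  = Viterbi 1
      Viterbi x <+> Viterbi y = Viterbi (max x y)
      Viterbi x <.> Viterbi y = Viterbi (x * y)

With Bool, parsing degenerates to recognition; with Viterbi, it computes the probability of a best constituency structure.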
We introduce a very general language model, the RTG-based language model (RTG-LM), which consists of an RTG and a language algebra.
The RTG generates the constituency structures of a language and, inspired by the initial algebra semantics, the language algebra evaluates these structures to elements of the modeled language; we call these elements syntactic objects.
Through the free choice of the language algebra, many common grammar formalisms, such as CFG and LCFRS, are covered.
We add multioperator monoids (M-monoids), a generalization of semirings, as a weight algebra to RTG-LM and obtain weighted RTG-based language models (wRTG-LM).
This lets us define an abstract weighted parsing problem, called the M-monoid parsing problem.
Its inputs are a wRTG-LM 𝐺 and a syntactic object 𝑎, and it asks to compute all representations that 𝐺 has for 𝑎 via the language algebra.
These representations are then evaluated to values in the weight algebra, and these values are summed into a single output value.
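Schematically, and in our own notation rather than the thesis's exact definition, the output value can be written as

    \mathrm{parse}(G, a) \;=\; \sum_{\substack{d \in T(G) \\ \llbracket d \rrbracket_{\mathcal{L}} = a}} \mathrm{wt}(d)

where T(G) is the set of trees generated by the RTG of G, ⟦d⟧ denotes the evaluation of d in the language algebra, wt(d) its evaluation in the weight algebra, and the sum is formed in the M-monoid.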
We propose the M-monoid parsing algorithm to solve this problem. It generalizes both the semiring parsing algorithm and the weighted deductive parsing algorithm in a way that is inspired by Mohri's single-source shortest distance algorithm.
We prove two sufficient conditions for the termination and correctness of our algorithm.
We show that our framework covers semiring parsing, weighted deductive parsing, and other problems from NLP and beyond.
In the second part of this thesis, we explore constituent tree automata (CTA), a generalization of FTA, as a language model tailored towards modeling discontinuity.
We show several properties of CTA, including that their constituency parsing problem is an instance of our M-monoid parsing problem and can, for a large class of CTA, be solved by the M-monoid parsing algorithm.
This thesis aims to contribute a unifying formal framework for the specification of language models and NLP tasks.
Through our general M-monoid parsing algorithm, we also provide a means of investigating the algorithmic solvability of problems within this field.
Identifier | oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:94329
Date | 06 November 2024 |
Creators | Mörbitz, Richard |
Contributors | Vogler, Heiko, Nederhof, Mark-Jan, Technische Universität Dresden |
Source Sets | Hochschulschriftenserver (HSSS) der SLUB Dresden |
Language | English |
Type | info:eu-repo/semantics/publishedVersion, doc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text |
Rights | info:eu-repo/semantics/openAccess |