The main task of a Lexical Analyzer such as Lex [20], Flex [26] and RE/Flex [34], is to perform tokenization of a given input file within reasonable time and with limited storage requirements. Hence, most lexical analyzers use Deterministic Finite Automata (DFA) to tokenize input to ensure that the running time of the lexical analyzer is linear (or close to linear) in the size of the input. However, DFA constructed from Regular Expressions (RE) are inadequate to indicate the positions and/or extents in a matching string of a given subexpression of the regular expression. This means that all implementations of trailing contexts in DFA-based lexical analyzers, including Lex, Flex and RE/Flex, produce incorrect results. For any matching string in the input (also called the lexeme) that matches a token is regular expression pattern, it is not always possible to tell the position of a part of the lexeme that matches a subexpression of the regular expression. For example, the string abba matches the pattern a b*/b a, but the position of the trailing context b a of the pattern in the string abba cannot be determined by a DFA-based matcher in the aforementioned lexical analyzers. There are algorithms based on Nondeterministic Finite Automata (NFA) that match subexpressions accurately. However, these algorithms are costly to execute and use backtracking or breadth-first search algorithms that run in non-linear time, with polynomial or even exponential worst-case time complexity. A tagged DFA-based approach (TDFA) was pioneered by Ville Laurikari [15] to efficiently match subexpressions. However, TDFA are not perfectly suitable for lexical analyzers since the tagged DFA edges require sets of memory updates, which hampers the performance of DFA edge traversals when matching input. I will introduce a new DFA-based algorithm for efficient subexpression matching that performs memory updates in DFA states. I propose, the Store-Transfer-Accept Deterministic Finite Automata (staDFA). In my proposed algorithm, the subexpression matching positions and/or extents are stored in a Marker Position Store (MPS). The MPS is updated while the input is tokenized to provide the positions/extents of the sub-match. Compression techniques for DFA, such as Hopcroft’s method [14], default transitions [18, 19], and other methods, can be applied to staDFA. For an instance, this thesis provide a modified Hopcroft’s method for the minimization of staDFA. / A Thesis submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Master of Science. / Summer Semester 2018. / July 20, 2018. / Includes bibliographical references. / Robert A. van Engelen, Professor Directing Thesis; David Whalley, Committee Member; An-I Andy Wang, Committee Member.
Identifer | oai:union.ndltd.org:fsu.edu/oai:fsu.digital.flvc.org:fsu_650269 |
Contributors | Chowdhury, Mohammad Imran (author), van Engelen, Robert A. (professor directing thesis), Whalley, David B. (committee member), Wang, An-I Andy (committee member), Florida State University (degree granting institution), College of Arts and Sciences (degree granting college), Department of Computer Science (degree granting departmentdgg) |
Publisher | Florida State University |
Source Sets | Florida State University |
Language | English, English |
Detected Language | English |
Type | Text, text, master thesis |
Format | 1 online resource (56 pages), computer, application/pdf |
Page generated in 0.0021 seconds