  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Process-based decomposition and multicore performance : case studies from Stringology

Strauss, Marthinus David January 2017 (has links)
Current computing hardware supports parallelism at various levels. Conventional programming techniques, however, do not utilise this growing resource efficiently. This thesis seeks a better fit between software and current hardware while following a hardware-agnostic software development approach. This allows the programmer to remain focussed on the problem domain. The thesis proposes process-based problem decomposition as a natural way to structure a concurrent implementation that may also improve multicore utilisation and, consequently, run-time performance. The thesis presents four algorithms as case studies from the domain of string pattern matching and finite automata. Each case study is conducted in the following manner. The particular sequential algorithm is decomposed into a number of communicating concurrent processes. This decomposition is described in the process algebra CSP. Hoare's CSP was chosen as one of the best known process algebras, for its expressive power, conciseness, and overall simplicity. Once the CSP-based process description has brought ideas to a certain level of maturity, the description is translated into a process-based implementation. The Go programming language was used for the implementation as its concurrency features were inspired by CSP. The performance of the process-based implementation is then compared against its conventional sequential version (also provided in Go). The goal is not to achieve maximal performance, but to compare the run-time performance of an "ordinary" programming effort that focussed on a process-based solution against a conventional sequential implementation. Although some implementations did not perform as well as others, some did significantly outperform their sequential counterparts. The thesis thus provides prima facie evidence that a process-based decomposition approach is promising for achieving a better fit between software and current multicore hardware. / Thesis (PhD)--University of Pretoria, 2017. / Computer Science / PhD / Unrestricted
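A minimal Go sketch of the kind of process-based decomposition the thesis describes (the concrete case studies and measurements are the thesis's own): one goroutine acts as a producer process and another as a matcher process, communicating over channels in the CSP style.

```go
// Minimal sketch (not the thesis's actual case studies) of a process-based
// decomposition in Go: a producer process streams chunks of text to a matcher
// process over a channel, and the matcher reports its count on a result channel.
package main

import (
	"fmt"
	"strings"
)

// produce splits the input into chunks and sends them downstream.
func produce(text string, out chan<- string) {
	for _, chunk := range strings.Split(text, "\n") {
		out <- chunk
	}
	close(out)
}

// match counts non-overlapping occurrences of pattern in everything it receives.
func match(pattern string, in <-chan string, result chan<- int) {
	count := 0
	for chunk := range in {
		count += strings.Count(chunk, pattern)
	}
	result <- count
}

func main() {
	text := "abracadabra\nabraabra\ncadabra"
	chunks := make(chan string)
	result := make(chan int)

	go produce(text, chunks) // one goroutine per CSP-style process
	go match("abra", chunks, result)

	fmt.Println("matches:", <-result) // 5
}
```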
2

On the Impact and Defeat of Regular Expression Denial of Service

Davis, James Collins 28 May 2020 (has links)
Regular expressions (regexes) are a widely-used yet little-studied software component. Engineers use regexes to match domain-specific languages of strings. Unfortunately, many regex engine implementations perform these matches with worst-case polynomial or exponential time complexity in the length of the string. Because they are commonly used in user-facing contexts, super-linear regexes are a potential denial of service vector known as Regular expression Denial of Service (ReDoS). Part I gives the necessary background to understand this problem. In Part II of this dissertation, I present the first large-scale empirical studies of super-linear regex use. Guided by case studies of ReDoS issues in practice (Chapter 3), I report that the risk of ReDoS affects up to 10% of the regexes used in practice (Chapter 4), and that these findings generalize to software written in eight popular programming languages (Chapter 5). ReDoS appears to be a widespread vulnerability, motivating the consideration of defenses. In Part III I present the first systematic comparison of ReDoS defenses. Based on the necessary conditions for ReDoS, a ReDoS defense can be erected at the application level, the regex engine level, or the framework/runtime level. In my experiments I report that application-level defenses are difficult and error-prone to implement (Chapter 6), that finding a compatible higher-performing regex engine is unlikely (Chapter 7), that optimizing an existing regex engine using memoization incurs (perhaps acceptable) space overheads (Chapter 8), and that incorporating resource caps into the framework or runtime is feasible but faces barriers to adoption (Chapter 9). In Part IV of this dissertation, we reflect on our findings. By leveraging empirical software engineering techniques, we have exposed the scope of potential ReDoS vulnerabilities, and given strong motivation for a solution. To assist practitioners, we have conducted a systematic evaluation of the solution space. We hope that our findings assist in the elimination of ReDoS, and more generally that we have provided a case study in the value of data-driven software engineering. / Doctor of Philosophy / Software commonly performs pattern-matching tasks on strings. For example, when validating input in a Web form, software commonly tests whether an input fits the pattern of a credit card number or an email address. Software engineers often implement such string-based pattern matching using a tool called regular expressions (regexes). Regexes permit software engineers to succinctly describe the sequences of characters that make up common "languages" like the set of valid Visa credit card numbers (16 digits, starting with a 4) or the set of valid emails (some characters, an '@', and more characters including at least one '.'). Using regexes on untrusted user input in this manner may be a dangerous decision because some regexes take a long time to evaluate. These slow regexes can be exploited by attackers in order to carry out a denial of service attack known as Regular expression Denial of Service (ReDoS). To date, ReDoS has led to outages affecting hundreds of websites and tens of thousands of users. While the risk of ReDoS is well known in theory, in this dissertation I present the first large-scale empirical studies measuring the extent to which slow regular expressions are used in practice.
I found that about 10% of real regular expressions extracted from hundreds of thousands of software projects can exhibit longer-than-expected worst-case behavior in popular programming languages including JavaScript, Python, and Ruby. Motivated by these findings, I then consider a range of ReDoS solution approaches: application refactoring, regex engine replacement, regex engine optimization, and resource caps. I report that application refactoring is error-prone, and that regex engine replacement seems unlikely due to incompatibilities between regex engines. Some resource caps are more successful than others, but all resource cap approaches struggle with adoption. My novel regex engine optimizations seem the most promising approach for protecting existing regex engines, offering significant time reductions with acceptable space overheads.
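Go's standard regexp package is RE2-based and runs in linear time, so it cannot exhibit this blow-up; the hedged sketch below instead hand-rolls a naive backtracking matcher (literal characters plus a postfix '?') to show why backtracking engines such as those in JavaScript, Python, and Ruby can take exponential time on the classic hard case of n optional 'a?'s followed by n literal 'a's, matched against a string of n 'a's.

```go
// Illustration only: a naive backtracking matcher for patterns made of literal
// characters, each optionally followed by '?'. With greedy choice ordering and
// no memoization, its running time on the pattern a?a?...a? a...a against
// "aa...a" grows exponentially with n, which is the behaviour ReDoS exploits
// in real backtracking regex engines.
package main

import (
	"fmt"
	"strings"
	"time"
)

// match reports whether pat matches all of text, using naive backtracking.
func match(pat, text string) bool {
	if pat == "" {
		return text == ""
	}
	if len(pat) >= 2 && pat[1] == '?' {
		// Greedy: try consuming the optional character first, then try skipping it.
		if text != "" && text[0] == pat[0] && match(pat[2:], text[1:]) {
			return true
		}
		return match(pat[2:], text)
	}
	if text != "" && text[0] == pat[0] {
		return match(pat[1:], text[1:])
	}
	return false
}

func main() {
	// Each step of 4 in n multiplies the running time by roughly 2^4.
	for n := 16; n <= 24; n += 4 {
		pat := strings.Repeat("a?", n) + strings.Repeat("a", n)
		text := strings.Repeat("a", n)
		start := time.Now()
		match(pat, text)
		fmt.Printf("n=%2d took %v\n", n, time.Since(start))
	}
}
```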
3

Modernizing the Syntax of Regular Expressions

Andersson, Adam, Hansson, Ludwig January 2020 (has links)
Context: Writing and working with regular expressions can be a slow and tedious task, mainly because of their syntax, but also because several different dialects exist, which can easily cause confusion. Even though regular expressions have been widely used for parsing and programming language design, they are now frequently used for input validation and seen in common applications such as text editors. Objectives: The main objectives of our thesis are to determine whether or not a regular expression language that is more like the surrounding programming language would increase usability, readability, and maintainability. We will then investigate further into what kind of impact this would have regarding, e.g., development speed, and what benefits and liabilities a more modernized syntax could introduce. Methods: Two different methods were used to answer our research questions: exploratory interviews and experiments. The data from the experiments were collected by screen recording and a program in the environment we provided to the participant. Results: Interviews with developers who use traditional regular expressions on a regular basis confirm that the syntax is confusing even for developers with a lot of experience. Our results from the experiment indicate that a language more like the surrounding language increases both the overall ease of use and development speed. Conclusions: From this research, we can conclude that a regular expression language that is more like the surrounding programming language does increase usability, readability, and maintainability. We could clearly see that it had a positive effect on the development speed as well. Keywords: regular expressions, programming language design, readability
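The thesis's own language design is not reproduced here; as a purely hypothetical illustration of a regex notation that reads more like the surrounding programming language, the sketch below builds a pattern through ordinary method calls and compiles it down to a conventional regular expression.

```go
// Hypothetical sketch (not the language designed in the thesis) of regex
// syntax expressed as ordinary code: a small builder whose method calls
// compile down to a standard regular expression.
package main

import (
	"fmt"
	"regexp"
)

// Pattern accumulates regex source through readable method calls.
type Pattern struct{ src string }

func New() *Pattern { return &Pattern{} }

func (p *Pattern) Literal(s string) *Pattern { p.src += regexp.QuoteMeta(s); return p }

func (p *Pattern) Digits(n int) *Pattern { p.src += fmt.Sprintf(`\d{%d}`, n); return p }

func (p *Pattern) Compile() (*regexp.Regexp, error) { return regexp.Compile("^" + p.src + "$") }

func main() {
	// The traditional form...
	classic := regexp.MustCompile(`^\d{4}-\d{2}-\d{2}$`)

	// ...versus a builder that reads like the surrounding code.
	date, _ := New().Digits(4).Literal("-").Digits(2).Literal("-").Digits(2).Compile()

	fmt.Println(classic.MatchString("2020-06-01")) // true
	fmt.Println(date.MatchString("2020-06-01"))    // true
}
```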
4

A teachable semi-automatic web information extraction system based on evolved regular expression patterns

Siau, Nor Zainah January 2014 (has links)
This thesis explores Web Information Extraction (WIE) and how it has been used in decision making and to support businesses in their daily operations. The research focuses on a WIE system based on Genetic Programming (GP) with an extensible model to enhance the automatic extractor. This uses a human as a teacher to identify and extract relevant information from semi-structured HTML web pages. Regular expressions, which have been chosen as the pattern matching tool, are automatically generated based on the training data to provide an improved grammar and lexicon. This particularly benefits the GP system, which may need to extend its lexicon in the presence of new tokens in the web pages. These tokens allow the GP method to produce new extraction patterns for new requirements.
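The GP machinery itself is too large for a short example, but one ingredient can be sketched: scoring a candidate regular expression against labelled training snippets. The fitness definition and the sample data below are assumptions for illustration, not the thesis's actual measure.

```go
// Minimal sketch of one ingredient of a GP-based extractor: scoring candidate
// regular expressions against labelled training data. The evolutionary loop
// (selection, crossover, mutation) is omitted, and this fitness definition is
// an assumption for illustration only.
package main

import (
	"fmt"
	"regexp"
)

type example struct {
	text string
	want string // the substring a correct pattern should extract ("" = nothing)
}

// fitness counts training examples on which the candidate extracts exactly the
// wanted substring.
func fitness(candidate string, training []example) int {
	re, err := regexp.Compile(candidate)
	if err != nil {
		return 0 // syntactically invalid individuals score worst
	}
	score := 0
	for _, ex := range training {
		if re.FindString(ex.text) == ex.want {
			score++
		}
	}
	return score
}

func main() {
	training := []example{
		{"Price: £350 per delegate", "£350"},
		{"Fee £1200 (incl. VAT)", "£1200"},
		{"Contact us for details", ""},
	}
	for _, c := range []string{`£[0-9]+`, `[0-9]+`, `£[0-9]`} {
		fmt.Printf("%-10s fitness=%d\n", c, fitness(c, training))
	}
}
```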
5

Probabilistic Logic, Probabilistic Regular Expressions, and Constraint Temporal Logic

Weidner, Thomas 29 August 2016 (has links) (PDF)
The classic theorems of Büchi and Kleene state the expressive equivalence of finite automata to monadic second order logic and regular expressions, respectively. These fundamental results enjoy applications in nearly every field of theoretical computer science. Around the same time as Büchi and Kleene, Rabin investigated probabilistic finite automata. This equally well established model has applications ranging from natural language processing to probabilistic model checking. Here, we give probabilistic extensions of Büchi's theorem and Kleene's theorem. We obtain a probabilistic MSO logic by adding an expected second order quantifier. In the scope of this quantifier, membership is determined by a Bernoulli process. This approach turns out to be universal and is applicable for finite and infinite words as well as for finite trees. In order to prove the expressive equivalence of this probabilistic MSO logic to probabilistic automata, we show a Nivat theorem, which decomposes a recognisable function into a regular language, homomorphisms, and a probability measure. For regular expressions, we build upon existing work to obtain probabilistic regular expressions on finite and infinite words. We show the expressive equivalence between these expressions and probabilistic Muller-automata. To handle Muller-acceptance conditions, we give a new construction from probabilistic regular expressions to Muller-automata. Concerning finite trees, we define probabilistic regular tree expressions using a new iteration operator, called infinity-iteration. Again, we show that these expressions are expressively equivalent to probabilistic tree automata. On a second track of our research we investigate Constraint LTL over multidimensional data words with data values from the infinite tree. Such LTL formulas are evaluated over infinite words, where every position possesses several data values from the infinite tree. Within Constraint LTL one can compare these values from different positions. We show that the model checking problem for this logic is PSPACE-complete by investigating the emptiness problem of Constraint Büchi automata.
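As a concrete reference point for the automaton model being characterised, the sketch below computes the acceptance probability of a finite word in a standard (Rabin-style) probabilistic finite automaton by multiplying the initial distribution with one stochastic matrix per letter; the thesis's probabilistic MSO and expression constructions themselves are not reproduced, and the two-state automaton is an assumption for the example.

```go
// Sketch of a standard (Rabin-style) probabilistic finite automaton. The
// acceptance probability of a word is the initial distribution multiplied by
// one row-stochastic matrix per letter and then by the indicator vector of
// accepting states.
package main

import "fmt"

type PFA struct {
	initial []float64            // initial state distribution
	trans   map[byte][][]float64 // one row-stochastic matrix per letter
	accept  []float64            // 1 for accepting states, 0 otherwise
}

// Accept returns the probability that the automaton accepts word.
func (m PFA) Accept(word string) float64 {
	dist := append([]float64(nil), m.initial...)
	for i := 0; i < len(word); i++ {
		matrix := m.trans[word[i]]
		next := make([]float64, len(dist))
		for from := range dist {
			for to := range next {
				next[to] += dist[from] * matrix[from][to]
			}
		}
		dist = next
	}
	p := 0.0
	for s, w := range dist {
		p += w * m.accept[s]
	}
	return p
}

func main() {
	m := PFA{
		initial: []float64{1, 0},
		trans: map[byte][][]float64{
			'a': {{0.5, 0.5}, {0, 1}},
			'b': {{1, 0}, {0.5, 0.5}},
		},
		accept: []float64{0, 1},
	}
	fmt.Println(m.Accept("ab")) // 0.25
	fmt.Println(m.Accept("aa")) // 0.75
}
```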
6

Automated retrieval and extraction of training course information from unstructured web pages

Xhemali, Daniela January 2010 (has links)
Web Information Extraction (WIE) is the discipline dealing with the discovery, processing and extraction of specific pieces of information from semi-structured or unstructured web pages. The World Wide Web comprises billions of web pages and there is much need for systems that will locate, extract and integrate the acquired knowledge into organisations' practices. There are some commercial, automated web extraction software packages; however, their success comes from heavily involving their users in the process of finding the relevant web pages, preparing the system to recognise items of interest on these pages and manually dealing with the evaluation and storage of the extracted results. This research has explored WIE, specifically with regard to the automation of the extraction and validation of online training information. The work also includes research and development in the area of automated Web Information Retrieval (WIR), more specifically in Web Searching (or Crawling) and Web Classification. Different technologies were considered; however, after much consideration, Naïve Bayes Networks were chosen as the most suitable for the development of the classification system. The extraction part of the system used Genetic Programming (GP) for the generation of web extraction solutions. Specifically, GP was used to evolve Regular Expressions, which were then used to extract specific training course information from the web such as: course names, prices, dates and locations. The experimental results indicate that all three aspects of this research perform very well, with the Web Crawler outperforming existing crawling systems, the Web Classifier performing with an accuracy of over 95% and a precision of over 98%, and the Web Extractor achieving an accuracy of over 94% for the extraction of course titles and an accuracy of just under 67% for the extraction of other course attributes such as dates, prices and locations. Furthermore, the overall work is of great significance to the sponsoring company, as it simplifies and improves the existing time-consuming, labour-intensive and error-prone manual techniques, as will be discussed in this thesis. The prototype developed in this research works in the background and requires very little, often no, human assistance.
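The extraction step can be sketched as regular expressions with capture groups applied to a semi-structured page fragment; in the system these expressions are evolved rather than hand-written, and the patterns and sample snippet below are illustrative assumptions only.

```go
// Minimal sketch of the extraction step: regular expressions with capture
// groups pulling course attributes out of a semi-structured page fragment.
// The hand-written patterns and the sample snippet are assumptions for
// illustration; in the thesis the patterns are evolved by GP.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	page := `<div class="course"><h2>Advanced Excel</h2>
	Price: £295 + VAT - Date: 12/09/2010 - Location: Leicester</div>`

	fields := map[string]*regexp.Regexp{
		"title":    regexp.MustCompile(`<h2>([^<]+)</h2>`),
		"price":    regexp.MustCompile(`£\s?([0-9]+(?:\.[0-9]{2})?)`),
		"date":     regexp.MustCompile(`\b([0-9]{2}/[0-9]{2}/[0-9]{4})\b`),
		"location": regexp.MustCompile(`Location:\s*([A-Za-z ]+)`),
	}

	for name, re := range fields {
		if m := re.FindStringSubmatch(page); m != nil {
			fmt.Printf("%-9s %s\n", name, m[1])
		}
	}
}
```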
7

Complexities of Order-Related Formal Language Extensions / Komplexiteter hos ordnings-relaterade utökningar av formella språk

Berglund, Martin January 2014 (has links)
The work presented in this thesis discusses various formal language formalisms that extend classical formalisms like regular expressions and context-free grammars with additional abilities, most relating to order. This is done while focusing on the impact these extensions have on the efficiency of parsing the languages generated. That is, rather than taking a step up on the Chomsky hierarchy to the context-sensitive languages, which makes parsing very difficult, a smaller step is taken, adding some mechanisms which permit interesting spatial (in)dependencies to be modeled. The most immediate example is shuffle formalisms, where existing language formalisms are extended by introducing operators which generate arbitrary interleavings of argument languages. For example, introducing a shuffle operator to the regular expressions does not make it possible to recognize context-free languages like a^n b^n, but it does capture some non-context-free languages, like the language of all strings containing the same number of a's, b's and c's. The impact these additions have on parsing has many facets. Other than shuffle operators we also consider formalisms enforcing repeating substrings, formalisms moving substrings around, and formalisms that restrict which substrings may be concatenated. The formalisms studied here all have a number of properties in common. They are closely related to existing regular and context-free formalisms. They operate in a step-wise fashion, deriving strings by sequences of rule applications of individually limited power. Each step generates a constant number of symbols and does not modify parts that have already been generated. That is, strings are built in an additive fashion that does not explode in size (in contrast to e.g. Lindenmayer systems). All languages here will have a semi-linear Parikh image. They feature some interesting characteristic involving order or other spatial constraints. In the shuffle example, multiple derivations are in a sense interspersed in a way that each is unaware of. All of the formalisms are intended to be limited enough to make an efficient parsing algorithm, at least for some cases, a reasonable goal. This thesis will give intuitive explanations of a number of formalisms fulfilling these requirements, and will sketch some results relating to the parsing problem for them. This should all be viewed as preparation for the more complete results and explanations featured in the papers given in the appendices. / This thesis discusses extensions of classical formalisms in formal languages, for example regular expressions and context-free grammars. The extensions concern order in one way or another, and a particular focus lies on making them in a way that both has interesting spatial or order-related effects and preserves the efficient parsing that is possible for the original classical formalisms. This stands in contrast to taking the larger step up the Chomsky hierarchy to the context-sensitive languages, which entails a hard parsing problem. An immediate example of such an extension is so-called shuffle formalisms. These extend existing formalisms by introducing operators that arbitrarily interleave strings from argument languages. Introducing a shuffle operator to the regular expressions does not give the ability to recognise, for example, the context-free language a^n b^n, but it does instead capture certain languages that are not context-free, for example the language consisting of all strings that contain equally many a's, b's and c's. The way in which these extensions affect the parsing problem is multifaceted. Beyond the shuffle operators, formalisms where substrings can be repeated, formalisms where substrings are moved around, and formalisms that restrict how substrings may be concatenated are also considered. The formalisms taken up here nevertheless have certain properties in common. They are closely related to the classical regular and context-free formalisms. They work step-wise, and construct strings through successive applications of individually simple rules. Each step generates a constant number of symbols and does not modify what has already been generated. That is, strings are built additively and their length cannot explode (in contrast to, e.g., Lindenmayer systems). All languages considered will have a semi-linear Parikh image. They have some interesting spatial or order-related property, for example the way in which shuffle operators interleave otherwise independent derivations. All of the formalisms are intended to be limited enough that efficient parsing is a reasonable goal. This thesis gives intuitive explanations of a number of formalisms that fulfil these requirements, and sketches a mix of results related to the parsing problem for them. This should be seen as preparation for reading the deeper and more complex results and explanations in the papers included as appendices.
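A minimal sketch of the shuffle operation itself, the arbitrary interleaving of two strings that the shuffle formalisms discussed above lift to whole languages.

```go
// Minimal sketch of the shuffle (arbitrary interleaving) operation on two
// strings, the building block behind shuffle formalisms.
package main

import "fmt"

// shuffle returns every interleaving of u and v that preserves the internal
// order of each argument.
func shuffle(u, v string) []string {
	if u == "" {
		return []string{v}
	}
	if v == "" {
		return []string{u}
	}
	var result []string
	for _, rest := range shuffle(u[1:], v) {
		result = append(result, string(u[0])+rest)
	}
	for _, rest := range shuffle(u, v[1:]) {
		result = append(result, string(v[0])+rest)
	}
	return result
}

func main() {
	// All interleavings of "ab" and "cd"; there are C(|u|+|v|, |u|) of them
	// when the results are distinct.
	fmt.Println(shuffle("ab", "cd"))
	// [abcd acbd acdb cabd cadb cdab]
}
```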
8

Implementace jednotky pro vyhledávání vzorů v FPGA / Implementation of the Pattern Matching Unit in the FPGA

Košař, Vlastimil January 2010 (has links)
This term project focuses on algorithms for pattern matching used in modern IDS. The main focus is on regular expression matching. It deals with methods based on deterministic and nondeterministic finite automata, hybrid methods, and a method based on regular expressions as a programming language for specialised processors. The implementation of pattern matching units based on some of the described methods is presented in the next part. A methodology for resource consumption estimation is also described. The developed software system for unit generation is described in the following part. In the final part, results are presented and discussed.
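As a software illustration of the NFA-based matching that such hardware units parallelise, the sketch below runs a set-of-states simulation of a tiny automaton; the automaton and pattern are assumptions for the example, not the FPGA architecture of the thesis.

```go
// Software sketch of the set-of-states simulation that NFA-based matching
// hardware parallelises: every active state consumes each input byte at once.
// The tiny automaton below (for the pattern "ab*c") is an assumption for the
// example.
package main

import "fmt"

// nfa maps (state, input byte) to successor states.
type nfa map[int]map[byte][]int

func matches(a nfa, start int, accept map[int]bool, input string) bool {
	active := map[int]bool{start: true}
	for i := 0; i < len(input); i++ {
		next := map[int]bool{}
		for state := range active {
			for _, succ := range a[state][input[i]] {
				next[succ] = true
			}
		}
		active = next
	}
	for state := range active {
		if accept[state] {
			return true
		}
	}
	return false
}

func main() {
	// States: 0 --a--> 1, 1 --b--> 1, 1 --c--> 2 (accepting); recognises ab*c.
	a := nfa{
		0: {'a': {1}},
		1: {'b': {1}, 'c': {2}},
	}
	accept := map[int]bool{2: true}
	fmt.Println(matches(a, 0, accept, "abbbc")) // true
	fmt.Println(matches(a, 0, accept, "abca"))  // false
}
```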
9

Problems Related to Shortest Strings in Formal Languages

Ang, Thomas January 2010 (has links)
In formal language theory, studying shortest strings in languages, and variations thereof, can be useful since these strings can serve as small witnesses for properties of the languages, and can also provide bounds for other problems involving languages. For example, the length of the shortest string accepted by a regular language provides a lower bound on the state complexity of the language. In Chapter 1, we introduce some relevant concepts and notation used in automata and language theory, and we show some basic results concerning the connection between the length of the shortest string and the nondeterministic state complexity of a regular language. Chapter 2 examines the effect of the intersection operation on the length of the shortest string in regular languages. A tight worst-case bound is given for the length of the shortest string in the intersection of two regular languages, and loose bounds are given for two variations on the problem. Chapter 3 discusses languages that are defined over a free group instead of a free monoid. We study the length of the shortest string in a regular language that becomes the empty string in the free group, and a variety of bounds are given for different cases. Chapter 4 mentions open problems and some interesting observations that were made while studying two of the problems: finding good bounds on the length of the shortest squarefree string accepted by a deterministic finite automaton, and finding an efficient way to check if a finite set of finite words generates the free monoid. Some of the results in this thesis have appeared in work that the author has participated in \cite{AngPigRamSha,AngShallit}.
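One idea underlying these results can be sketched directly: the shortest string accepted by a deterministic finite automaton is found by breadth-first search over its states, so its length (when the language is non-empty) is less than the number of states. The example automaton below is an assumption for illustration.

```go
// Minimal sketch: a shortest accepted string of a DFA found by breadth-first
// search over its states. The example automaton (over {a, b}, accepting
// strings containing "ab") is an assumption for illustration.
package main

import "fmt"

type dfa struct {
	delta  map[int]map[byte]int
	start  int
	accept map[int]bool
}

// shortest returns a shortest accepted string and true, or "" and false if the
// language is empty.
func shortest(m dfa) (string, bool) {
	type node struct {
		state int
		word  string
	}
	visited := map[int]bool{m.start: true}
	queue := []node{{m.start, ""}}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if m.accept[cur.state] {
			return cur.word, true
		}
		for sym, next := range m.delta[cur.state] {
			if !visited[next] {
				visited[next] = true
				queue = append(queue, node{next, cur.word + string(sym)})
			}
		}
	}
	return "", false
}

func main() {
	// Accepts strings over {a, b} that contain "ab" as a substring.
	m := dfa{
		delta: map[int]map[byte]int{
			0: {'a': 1, 'b': 0},
			1: {'a': 1, 'b': 2},
			2: {'a': 2, 'b': 2},
		},
		start:  0,
		accept: map[int]bool{2: true},
	}
	fmt.Println(shortest(m)) // ab true
}
```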
