1 |
Simulators for formal languages, automata and theory of computation with focus on JFLAP. Fransson, Tobias, January 2013 (has links)
This report discusses simulators in automata theory and which of them is best suited for laboratory assignments. Currently, the Formal Languages, Automata and Theory of Computation course (FABER) at Mälardalen University uses the JFLAP simulator for extra exercises. To see whether any other simulator would be useful, either alongside JFLAP or on its own, tests were made with nine programs that can graphically simulate automata and formal languages. This thesis work started with an overview of the simulators currently available. After the reviews it became clear to the author that JFLAP is the best choice in the majority of cases. JFLAP is also the most popular simulator in automata theory courses worldwide. To support the use of JFLAP in the course, a manual and course assignments were created to help students get started with JFLAP. The assignments are expected to replace the current material in the FABER course and to help the uninitiated user get more out of JFLAP.
|
2 |
Identifying sensitive data within the scope of GDPR : with K-Nearest Neighbors. Darborg, Alex, January 2018 (has links)
General Data Protection Regulation, GDPR, is a regulation coming into effect on May 25th, 2018. Because of this, organizations face large decisions concerning how sensitive data stored in databases is to be identified. Meanwhile, machine learning is expanding on the software market. The goal of this project has been to develop a tool which, through machine learning, can identify sensitive data. The development of this tool has been accomplished through the use of agile methods and has included comparisons of various algorithms and the development of a prototype, using tools such as Spyder and XAMPP. The results show that different types of sensitive data give varying results in the developed software solution. The kNN algorithm showed strong results in cases where the sensitive data concerned ten-digit Swedish social security numbers, phone numbers of ten or eleven digits starting with 46, 070, 072 or 076, and addresses. Regular expressions showed strong results for e-mail addresses and IP addresses.
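As an illustration of the regular-expression side of such a tool, the sketch below shows how the kinds of sensitive data mentioned above might be matched in Python. The patterns are hypothetical stand-ins written for this summary; the thesis does not publish its exact expressions, and real personal-number matching would also need checksum validation.

    import re

    # Hypothetical patterns for the kinds of sensitive data discussed above;
    # they are illustrative only and not taken from the thesis.
    PATTERNS = {
        # Ten-digit Swedish personal identity number, e.g. 800101-9876 or 8001019876
        "personal_number": re.compile(r"\b\d{6}-?\d{4}\b"),
        # Ten- or eleven-digit phone numbers starting with 46, 070, 072 or 076
        "phone": re.compile(r"\b(?:46\d{8,9}|07[026]\d{7})\b"),
        # Simple e-mail address pattern
        "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
        # IPv4 address (does not check that each octet is <= 255)
        "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    }

    def classify(value):
        """Return the names of all patterns that match somewhere in the value."""
        return [name for name, rx in PATTERNS.items() if rx.search(value)]

    if __name__ == "__main__":
        for v in ["800101-9876", "0701234567", "alex@example.com", "192.168.0.1", "hello"]:
            print(v, "->", classify(v))

Note that a ten-digit phone number also matches the personal-number pattern above; this kind of overlap is one reason pure pattern matching struggles with such fields and a classifier can be useful alongside the regular expressions.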
|
3 |
Automated analysis and validation of chemical literature. Townsend, Joseph A., January 2008 (has links)
Methods to automatically extract and validate data from the chemical literature, converting legacy formats into machine-understandable forms, are examined. The work focuses on three types of data: analytical data reported in articles, computational chemistry output files and crystallographic information files (CIFs). It is shown that machines are capable of reading and extracting analytical data from the current legacy formats with high recall and precision. Regular expressions cannot identify chemical names with high precision or recall, but non-deterministic methods perform significantly better. The lack of machine-understandable connection tables in the literature has been identified as the major issue preventing molecule-based, data-driven science from being performed in the area. The extraction of data from computational chemistry output files using parser-like approaches is shown not to be generally possible, although such methods work well for input files. A hierarchical, regular-expression-based approach can parse > 99.9% of the output files correctly, although significant human input is required to prepare the templates. CIFs may be parsed with extremely high recall and precision; they contain connection tables and their data is of high quality. The comparison of bond lengths calculated by two computational chemistry programs shows good agreement in general, but structures containing specific moieties cause discrepancies. An initial protocol for the high-throughput geometry optimisation of molecules extracted from the CIFs is presented and the refinement of this protocol is discussed. Differences in bond length between calculated and experimentally determined values from the CIFs of less than 0.03 Angstrom are shown to be expected from random error. The final protocol is used to find high-quality structures from crystallography which can be reused for further science.
|
4 |
Matching in MySQL : A comparison between REGEXP and LIKE. Carlsson, Emil, January 2012 (has links)
When data has to be searched for in multiple datasets, there is a risk that not all datasets are of the same type. Some might be in XML format; others might use a relational database. This can discourage developers from searching for the data in two separate datasets, because crafting different search methods for different datasets can be time consuming. One option that is greatly overlooked is the use of regular expressions. Once a search expression has been created, it can be used in a majority of database engines in a "WHERE" clause and also against other forms of data sources such as XML. This option is, however, at best poorly documented, and few tests have been made of how it performs against traditional database search methods such as "LIKE". Multiple experiments comparing "LIKE" and "REGEXP" in MySQL were performed for this paper. The results of these experiments show that the possible overhead of using regular expressions can be motivated by the gain of using only one search phrase over several data sources.
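As a minimal sketch of the two predicates being compared (assuming a MySQL table named products with columns id and name and the mysql-connector-python driver, none of which are named in the thesis), the same substring search can be issued once with LIKE and once with REGEXP:

    import mysql.connector  # assumed driver; any DB-API connector would do

    LIKE_QUERY = "SELECT id, name FROM products WHERE name LIKE %s"
    REGEXP_QUERY = "SELECT id, name FROM products WHERE name REGEXP %s"

    def search(cursor, term):
        # LIKE takes SQL wildcards, REGEXP takes a regular expression, so the
        # same search term is passed in two different forms.
        cursor.execute(LIKE_QUERY, ("%" + term + "%",))
        like_rows = cursor.fetchall()
        cursor.execute(REGEXP_QUERY, (term,))
        regexp_rows = cursor.fetchall()
        return like_rows, regexp_rows

    if __name__ == "__main__":
        conn = mysql.connector.connect(host="localhost", user="root",
                                       password="", database="testdb")
        like_rows, regexp_rows = search(conn.cursor(), "widget")
        print(len(like_rows), len(regexp_rows))
        conn.close()

The point made in the abstract is that the REGEXP pattern, unlike the LIKE pattern, can be reused unchanged against non-relational sources such as XML, at the cost of the possible overhead measured in the experiments.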
|
5 |
Developing a Compiler for a Regular Expression Based Policy Specification Language. Juhlin, Cory Michael, 28 October 2015 (has links)
Security policy specification languages are a response to today's complex and vulnerable software climate. These languages allow an individual or organization to restrict and modify the behavior of third-party applications such that they adhere to the rules specified in the policy. As software grows in complexity, so do the security policies that govern it. Existing policy specification languages have not adapted to the growing complexity of the software they govern and as a result do not scale well, often resulting in code that is overly complex or unreadable. Writing small, isolated policies as separate modules and combining them is known as policy composition, and it is an area in which existing policy specification languages have a number of drawbacks. Policy composition is unpredictable and nonstandard with existing languages. PoCo is a new policy specification language that uses signed regular expressions to return sets of allowed and denied actions as output from its policies, allowing policies to be combined with standard set operations in an algebraic way. This thesis covers my contribution to the PoCo project in creating a formal grammar for the language, developing a static analysis tool for policy designers, and implementing the first PoCo language compiler and runtime for the Java platform.
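The composition idea can be illustrated with a small sketch in Python rather than PoCo syntax; the policies and action names below are invented for this example and are not from the thesis. Each policy returns a set of allowed and a set of denied actions, so combining policies becomes ordinary set algebra:

    from dataclasses import dataclass

    @dataclass
    class Decision:
        allowed: set
        denied: set

    def no_network(actions):
        # Toy policy: deny anything that looks like a network action.
        net = {a for a in actions if a.startswith("socket.")}
        return Decision(allowed=actions - net, denied=net)

    def read_only_files(actions):
        # Toy policy: deny file writes.
        writes = {a for a in actions if a.startswith("file.write")}
        return Decision(allowed=actions - writes, denied=writes)

    def conjunction(a, b):
        # An action is allowed only if both policies allow it; denials accumulate.
        return Decision(allowed=a.allowed & b.allowed, denied=a.denied | b.denied)

    if __name__ == "__main__":
        acts = {"file.read", "file.write", "socket.open"}
        print(conjunction(no_network(acts), read_only_files(acts)))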
|
6 |
Flexible finite automata-based algorithms for detecting microsatellites in DNA. De Ridder, Corne, 17 August 2010 (has links)
Apart from contributing to Computer Science, this research also contributes to Bioinformatics, a subset of the subject discipline Computational Biology. The main focus of this dissertation is the development of a data-analytical and theoretical algorithm to contribute to the analysis of DNA and, in particular, to detect microsatellites. Microsatellites, as considered in this dissertation, are consecutively repeated patterns contained in genomic sequences. A perfect tandem repeat is defined as a string of nucleotides which is repeated at least twice in a sequence. An approximate tandem repeat is a string of nucleotides repeated consecutively at least twice, with small differences between the instances. The research presented in this dissertation was inspired by molecular biologists who were found to be visually scanning genetic sequences in search of short approximate tandem repeats, or so-called microsatellites. The aim of this dissertation is to present three algorithms that search for short approximate tandem repeats. The algorithms are built on implementations of finite automata. Thus the hypothesis posed is as follows: finite automata can detect microsatellites effectively in DNA. "Effectively" includes the ability to fine-tune the detection process so that redundant data is avoided and relevant data is not missed during the search. In order to verify whether the hypothesis holds, three theoretically related algorithms have been proposed based on theorems from finite automaton theory. They are generically referred to as the FireµSat algorithms. These algorithms have been implemented, and the performance of FireµSat2 has been investigated and compared to other software packages. From the results obtained, it is clear that the performance of these algorithms differs in terms of attributes such as speed, memory consumption and extensibility. With respect to speed, FireµSat outperformed rival software packages. It will be seen that the FireµSat algorithms have several parameters that can be used to tune their search. It should be emphasized that these parameters have been devised in consultation with the intended user community, in order to enhance the usability of the software. It was found that the parameters of FireµSat can be set to detect more tandem repeats than rival software packages, but also tuned to limit the number of detected tandem repeats. Copyright / Dissertation (MSc)--University of Pretoria, 2010. / Computer Science / unrestricted
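As a simple point of reference for what is being searched for (not for how FireµSat searches; the algorithms in the dissertation are automaton-based and also handle approximate repeats), a backreference regular expression in Python can already locate short perfect tandem repeats:

    import re

    # A perfect tandem repeat here: a motif of 2-6 bases repeated at least twice
    # in a row. The lookahead lets overlapping candidates be reported.
    TANDEM = re.compile(r"(?=(([ACGT]{2,6})\2+))")

    def perfect_tandem_repeats(dna):
        """Yield (start, repeated_block, motif) for perfect tandem repeats."""
        for m in TANDEM.finditer(dna):
            yield m.start(), m.group(1), m.group(2)

    if __name__ == "__main__":
        for start, block, motif in perfect_tandem_repeats("GGATATATACCGCAGCAGCAGTT"):
            print(start, block, motif)

Approximate repeats, whose instances may differ slightly from one another, are exactly what such a backreference pattern cannot express cleanly, which is where tunable automaton-based detection comes in.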
|
7 |
Finite state automaton construction through regular expression hashing. Coetser, Rayner Johannes Lodewikus, 25 August 2010 (links)
In this study, the regular expressions forming abstract states in Brzozowski's algorithm are not remapped to sequential state transition table addresses, as would be the case in the classical approach, but are hashed to integers. Two regular expressions that are hashed to the same hash code are assigned the same integer address in the state transition table, reducing the number of states in the automaton. This reduction does not necessarily lead to the construction of a minimal automaton: no restrictions are placed on the hash function hashing two regular expressions to the same code. Depending on the quality of the hash function, a super-automaton, previously referred to as an approximate automaton, or an exact automaton can be constructed. When two regular expressions are hashed to the same state and they do not represent the same regular language, a super-automaton is constructed. A super-automaton accepts the regular language of the input regular expression, in addition to some extra strings. If the hash function is bad, many regular expressions that do not represent the same regular language will be hashed together, resulting in a smaller automaton that accepts extra strings. In the ideal case, two regular expressions will only be hashed together when they represent the same regular language. In this case, an exact minimal automaton will be constructed. It is shown that, using the hashing approach, an exact or super-automaton is always constructed. Another outcome of the hashing approach is that a non-deterministic automaton may be constructed. A new variant of the hashing version of Brzozowski's algorithm is put forward which constructs a deterministic automaton. A method is also put forward for measuring the difference between an exact and a super-automaton: this takes the form of the k-equivalence measure, which gives the number of characters up to which the strings of two regular expressions are equal. The better the hash function, the higher the value of k, up to the point where the hash function results in regular expressions being hashed together if and only if they have the same regular language. Using the k-equivalence measure, eight generated hash functions and one hand-coded hash function are evaluated for a large number of short regular expressions, which are generated using Gödel numbers. The k-equivalence concept is extended to the average k-equivalence value in order to evaluate the hash functions for longer regular expressions. The hand-coded hash function is found to produce good results. Copyright / Dissertation (MEng)--University of Pretoria, 2009. / Computer Science / unrestricted
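A minimal sketch of the idea in Python is given below, under several simplifying assumptions that are not from the dissertation: regular expressions are plain tuples, Python's built-in hash truncated to a fixed number of table addresses stands in for the hash functions that were studied, and no simplification of derivatives is performed. Colliding derivatives share a state, which is what can turn the exact automaton into a super-automaton; the bookkeeping the dissertation uses to guarantee the exact-or-super property is omitted here.

    EMPTY, EPS = ("empty",), ("eps",)

    def nullable(r):
        tag = r[0]
        if tag == "eps" or tag == "star":
            return True
        if tag == "alt":
            return nullable(r[1]) or nullable(r[2])
        if tag == "cat":
            return nullable(r[1]) and nullable(r[2])
        return False  # "empty" and "chr"

    def deriv(r, a):
        # Brzozowski derivative of regular expression r with respect to symbol a.
        tag = r[0]
        if tag in ("empty", "eps"):
            return EMPTY
        if tag == "chr":
            return EPS if r[1] == a else EMPTY
        if tag == "alt":
            return ("alt", deriv(r[1], a), deriv(r[2], a))
        if tag == "star":
            return ("cat", deriv(r[1], a), r)
        d = ("cat", deriv(r[1], a), r[2])  # concatenation
        return ("alt", d, deriv(r[2], a)) if nullable(r[1]) else d

    def build(regex, alphabet, buckets=64):
        addr = lambda r: hash(r) % buckets   # stand-in for the studied hash functions
        start = addr(regex)
        rep = {start: regex}                 # one representative regex per address
        trans, accept, work = {}, set(), [start]
        while work:
            s = work.pop()
            if nullable(rep[s]):
                accept.add(s)
            for a in alphabet:
                d = deriv(rep[s], a)
                t = addr(d)
                if t not in rep:
                    rep[t] = d
                    work.append(t)
                trans[(s, a)] = t
        return start, trans, accept

    if __name__ == "__main__":
        ab = ("alt", ("chr", "a"), ("chr", "b"))
        abb = ("cat", ("chr", "a"), ("cat", ("chr", "b"), ("chr", "b")))
        start, trans, accept = build(("cat", ("star", ab), abb), "ab")
        print(len({s for s, _ in trans}), "states,", len(accept), "accepting")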
|
8 |
An Analysis of Data Cleaning Tools : A comparative analysis of the performance and effectiveness of data cleaning tools. Stenegren, Filip, January 2023 (has links)
In a world teeming with data, faulty or inconsistent data is inevitable, and data cleansing, a process that purges such discrepancies, becomes crucial. The purpose of the study is to answer the question of which criteria data cleaning tools can be compared and evaluated with, and to carry out a comparative analysis of two data cleansing tools, one of which was developed for the purpose of this study while the other was provided for the study. The result of the analysis should answer the question of which of the tools is superior and in what regard. The resulting criteria for comparison are execution time, amount of RAM (Random Access Memory) and CPU (Central Processing Unit) usage, scalability and user experience. Through systematic testing and evaluation, the developed tool outperformed in efficiency criteria such as execution time and scalability, and it also has a slight edge in resource consumption. However, because the provided tool offers a GUI (Graphical User Interface), there is no definitive answer as to which tool is superior, since user experience and user needs can outweigh any technical prowess. Thus, the conclusion as to which tool is superior may vary depending on the specific needs of the user.
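As a rough illustration of how two of these criteria, execution time and memory use, can be measured around a cleaning routine, the Python sketch below uses a placeholder cleaning function and the standard-library tracemalloc module; tracemalloc tracks Python allocations rather than total RAM or CPU load, so this only approximates the kind of measurements described above.

    import time
    import tracemalloc

    def clean(rows):
        # Placeholder for a data cleaning routine; here it just strips whitespace.
        return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
                for r in rows]

    def measure(func, data):
        """Return (result, elapsed seconds, peak Python memory in bytes)."""
        tracemalloc.start()
        t0 = time.perf_counter()
        result = func(data)
        elapsed = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        return result, elapsed, peak

    if __name__ == "__main__":
        rows = [{"name": "  Alice ", "city": " Umeå"}] * 100_000
        _, elapsed, peak = measure(clean, rows)
        print(f"{elapsed:.3f} s, peak {peak / 1_000_000:.1f} MB")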
|
9 |
Test coverage structure for field testing of SDP3 : Creation and visualization of a test coverage structure for SDP3 using user data / Test coverage framework for field testing of SDP3. David, Samer, January 2017 (has links)
A big part of software development is testing and quality assurance. At the service market department of Scania R&D, the software Scania Diagnose and Programmer 3 (SDP3) is developed and tested. Quality assurance is conducted through internal and external testing. However, the external testing of SDP3 lacks guidelines for measuring the quality of a field test. The purpose of this project was to create and implement a framework for the field test process of SDP3. This framework is then used to determine the quality of a field test. To create the framework, a literature study, interviews and workshops were conducted. The workshops laid the foundation of the framework, and the interviews were used to specify the parameters in the framework. For the implementation of the framework, studies were done to analyse the available data; the framework was later implemented in the database management system Splunk as a real-time dashboard. The results of this study describe a framework that can be used to determine the quality of a field test. Unfortunately, the whole framework could not be implemented in Splunk, since not all of the needed data could be accessed through Splunk; instead, recommendations were made.
|
10 |
Improve Data Quality By Using Dependencies And Regular Expressions. Feng, Yuan, January 2018 (has links)
The objective of this study has been to answer the question of how the quality of a database can be improved. The data stored in a database suffers from many problems, such as missing values or spelling errors. To deal with such dirty data, this study adopts conditional functional dependencies and regular expressions to detect and correct data. Building on earlier studies of data cleaning methods, this study considers more complex database conditions and combines efficient algorithms to deal with the data. The study shows that these methods can improve the quality of a database, and that, considering the time and space complexity involved, there is still much to be done to make the data cleaning process more efficient.
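A minimal sketch of the two mechanisms in Python is shown below; the example table, the conditional functional dependency and the e-mail pattern are invented for illustration and are not taken from the thesis.

    import re

    # Invented example rows; two of them are dirty on purpose.
    rows = [
        {"country": "SE", "city": "Stockholm", "zip": "11122", "email": "a@b.se"},
        {"country": "SE", "city": "Stockholm", "zip": "41104", "email": "c@d.se"},
        {"country": "SE", "city": "Sundsvall", "zip": "85170", "email": "not-an-email"},
    ]

    def cfd_violations(rows):
        # Conditional functional dependency: on rows where country = 'SE',
        # city determines the three-digit zip prefix.
        seen = {}
        for i, r in enumerate(rows):
            if r["country"] != "SE":
                continue
            prefix = r["zip"][:3]
            expected = seen.setdefault(r["city"], prefix)
            if expected != prefix:
                yield i, f"CFD violation: {r['city']} maps to both {expected} and {prefix}"

    EMAIL = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

    def regex_violations(rows):
        # Regular expression check on the syntactic form of a single attribute.
        for i, r in enumerate(rows):
            if not EMAIL.match(r["email"]):
                yield i, "email does not match the expected pattern"

    if __name__ == "__main__":
        for i, msg in list(cfd_violations(rows)) + list(regex_violations(rows)):
            print(f"row {i}: {msg}")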
|