Global ETD Search

1	Extending functional databases for use in text-intensive applications Sheldrake, Simon N. January 2002 This thesis continues research exploring the benefits of using functional databases based around the functional data model for advanced database applications, particularly those supporting investigative systems. This is a growing generic application domain covering areas such as criminal and military intelligence, which are characterised by significant data complexity, large data sets and the need for high performance, interactive use. An experimental functional database language was developed to provide the requisite semantic richness. However, heavy use in a practical context has shown that language extensions and implementation improvements are required, especially in the crucial areas of string matching and graph traversal. In addition, an implementation on multiprocessor, parallel architectures is essential to meet the performance needs arising from existing and projected database sizes in the chosen application area. 005 String matching
2	Private Record Linkage: A Comparison of Selected Techniques for Name Matching Grzebala, Pawel B. 06 May 2016 No description available. Computer Engineering Private Record Linkage String matching
3	Aplicação de autômatos finitos nebulosos no reconhecimento aproximado de cadeias. / The approximate string matching using fuzzy finite automata. Alexandre Maciel 02 June 2006 The approximate string matching problem recurs in many applications where a computer is used to process imprecise, erroneous or distorted data. Countless methods, techniques and metrics are available to solve this class of problem, but most of them are inflexible in at least one of the following: architecture, the metric used to measure the error found, or application specificity. This work proposes and analyzes the use of Fuzzy Finite State Automata to solve this class of problem. Fuzzy theory provides a solid theoretical basis for handling inexact or error-prone information, while the mathematical model of finite automata is a well-established tool for string matching problems. A hybrid model not only offers a flexible solution for this class of problem but can also serve as a basis for solving many other problems that depend on processing imprecise information. Autômato finito Reconhecimento de texto Finite automata Fuzzy String matching
5	A Lexicon for Gene Normalization / Ett lexicon för gennormalisering Lingemark, Maria January 2009 Researchers tend to use their own or favourite gene names in scientific literature, even though there are official names. Some names may even be used for more than one gene. This leads to problems with ambiguity when automatically mining biological literature. To disambiguate the gene names, gene normalization is used. In this thesis, we look into an existing gene normalization system and develop a new method to find gene candidates for the ambiguous genes. For the new method a lexicon is created, using information about the gene names, symbols and synonyms from three different databases. The gene mention found in the scientific literature is used as input for a search in this lexicon, and all genes in the lexicon that match the mention are returned as gene candidates for that mention. These candidates are then used in the system's disambiguation step. Results show that the new method gives a better overall result from the system, with an increase in precision and a small decrease in recall. Bioinformatics Gene Normalization String Matching Text Mining
7	Design and Implementation of a Name Matching Algorithm for Persian Language Momeninasab, Leila January 2013 Name matching plays a vital role in many applications. It is used, for example, in information retrieval or deduplication systems to compare names in order to match them or to find the names that refer to identical objects, persons, or companies. Because names in any application are subject to unavoidable variations and errors, and because of the importance of name matching, many algorithms have been developed to handle it. These algorithms consider the name variations that may arise from spelling, pattern or phonetic modifications. However, most existing methods were developed for use with the English language and so cover the characteristics of that language. Up to now, none has been designed and implemented specifically for the Persian language. The purpose of this thesis is to present a name matching algorithm for Persian. In this project, after consideration of all major algorithms in this area, we selected one of the basic methods for name matching and then expanded it to make it work particularly well for Persian names. The proposed algorithm, called the Persian Edit Distance Algorithm, or PEDA for short, was built on the characteristics of the Persian language and compares Persian names with each other on three levels: phonetic similarity, character form similarity and keyboard distance, in order to give more accurate results for Persian names. The algorithm takes Persian names as its input and outputs their similarity as a percentage. In this thesis, three series of experiments were conducted to evaluate the proposed algorithm. The average f-measure is 0.86 for the first series and 0.80 for the second series. The first series of experiments was also repeated with Levenshtein, which produced 33.9% false negatives on average, while PEDA had a false negative average of 6.4%. The third series of experiments shows that PEDA works well for one, two and three edits, with true positive averages of 99%, 81%, and 69% respectively. Computer Sciences Datavetenskap (datalogi)
8	Temporal Graph Record Linkage and k-Safe Approximate Match Jupin, Joseph January 2016 Since the advent of electronic data processing, organizations have accrued vast amounts of data contained in multiple databases with no reliable global unique identifier. These databases were developed by different departments for different purposes at different times. Organizing and analyzing these data for human services requires linking records from all sources. RL (Record Linkage) is a process that connects records that are related to the identical or a sufficiently similar entity from multiple heterogeneous databases. RL is a data and compute intensive, mission critical process. The process must be efficient enough to process big data and effective enough to provide accurate matches. We have evaluated an RL system that is currently in use by a local health and human services department. We found that they were using the typical approach offered by Fellegi and Sunter with tuple-by-tuple processing, using Soundex as the primary approximate string matching method. Soundex has been found to be unreliable both as a phonetic and as an approximate string matching method. We found that their data, in many cases, have more than one value per field, suggesting that the data were queried from a 5NF database. Consider that if a woman has been married 3 times, she may have up to 4 last names on record. This query process produced more than one tuple per database/entity, apparently generating a Cartesian product of these data. In many cases, more than a dozen tuples were observed for a single database/entity. This approach is both ineffective and inefficient. An effective RL method should handle this multi-data without redundancy and use edit-distance for approximate string matching. However, due to its high computational complexity, edit-distance will not scale well with big data problems. We developed two methodologies for resolving the aforementioned issues: PSH and ALIM. PSH (Probabilistic Signature Hash) is a composite method that increases the speed of Damerau-Levenshtein (DL) edit-distance. It combines signature filtering, probabilistic hashing, length filtering and prefix pruning to increase the speed of edit-distance. It is also lossless because it does not lose any true positive matches. ALIM (Aggregate Link and Iterative Match) is a graph-based record linkage methodology that uses a multi-graph to store demographic data about people. ALIM performs string matching as records are inserted into the graph. ALIM eliminates data redundancy and stores the relationships between data. We tested PSH for string comparison and found it to be approximately 6,000 times faster than DL. We tested it against the trie-join methods and found that they are up to 6.26 times faster but lose between 10 and 20 percent of true positives. We tested ALIM against a method currently in use by a local health and human services department and found that ALIM produced significantly more matches (even with more restrictive match criteria) and ran more than twice as fast. ALIM handles the multi-data problem and PSH allows the use of edit-distance comparison in this RL model. ALIM is more efficient and effective than a currently implemented RL system. This model can also be expanded to perform social network analysis and temporal data modeling. For human services, temporal modeling can reveal how policy changes and treatments affect clients over time, and social network analysis can determine the effects of these on whole families by facilitating family linkage. Computer and Information Science Information Science Computer Science Entity Matching Record Linkage String Matching
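As a rough illustration of one ingredient named in the abstract above, the sketch below pairs a length filter with a restricted Damerau-Levenshtein distance (optimal string alignment). It is not PSH itself: the signature filtering, probabilistic hashing and prefix pruning steps are not reproduced, and the function names `osa_distance` and `within_k` are hypothetical. The length filter is lossless because two strings whose lengths differ by more than k cannot be within k edits of each other.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions and transpositions of
    adjacent characters, each at cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "ht" <-> "th".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def within_k(a: str, b: str, k: int) -> bool:
    """Return True if a and b are within k edits (hypothetical helper).

    The length check skips the quadratic computation whenever the length
    difference alone already exceeds k, without discarding true matches.
    """
    if abs(len(a) - len(b)) > k:
        return False
    return osa_distance(a, b) <= k
```

For example, `within_k("smith", "smiht", 1)` accepts the pair through a single transposition, while `within_k("jon", "jonathan", 2)` is rejected by the length check before any distance is computed.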
9	Offline Approximate String Matching for Information Retrieval: An experiment on technical documentation Dubois, Simon January 2013 Approximate string matching consists in identifying strings as similar even if there are a number of mismatches between them. This technique is one of the solutions for relaxing the strictness of exact matching in data comparison. In many cases it is useful for identifying stream variation (e.g. audio) or word declension (e.g. prefix, suffix, plural). Approximate string matching can be used to score terms in Information Retrieval (IR) systems. The benefit is that results are returned even if query terms do not exactly match indexed terms. However, as approximate string matching algorithms only consider characters (neither context nor meaning), there is no guarantee that additional matches are relevant matches. This paper presents the effects of some approximate string matching algorithms on search results in IR systems. An experimental research design has been conducted to evaluate such effects from two perspectives. First, result relevance is analysed with precision and recall. Second, performance is measured by the execution time required to compute matches. Six approximate string matching algorithms are studied. Levenshtein and Damerau-Levenshtein compute the edit distance between two terms. Soundex and Metaphone index terms based on their pronunciation. Jaccard similarity calculates the overlap coefficient between two strings. Tests are performed through IR scenarios covering different contexts, information needs and search queries, designed to query technical documentation related to software development (man pages from Ubuntu). A purposive sample is selected to assess document relevance to IR scenarios and compute IR metrics (precision, recall, F-Measure). Experiments reveal that all tested approximate matching methods increase recall on average, but, except Metaphone, they also decrease precision. Soundex and Jaccard similarity are not advised because they fail on too many IR scenarios. The highest recall is obtained by edit distance algorithms, which are also the most time consuming. Because Damerau-Levenshtein shows no significant improvement over Levenshtein but costs much more time, Levenshtein is recommended for use with specialised documentation. Finally, some other related recommendations are given to practitioners implementing IR systems on technical documentation. Algorithm comparison Approximate string matching Information retrieval Offline string matching Overlap coefficient Phonetic indexation String distance String metric String searching algorithm
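To make two of the measures compared in the abstract above concrete, here is a minimal sketch of Levenshtein edit distance and a Jaccard similarity. The abstract does not say which units the Jaccard measure was computed over, so the use of character bigrams here is an assumption, as are the function names `levenshtein`, `bigrams` and `jaccard`.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance using two rows of the dynamic-programming table."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def bigrams(s: str) -> set:
    """Set of adjacent character pairs in s (assumed unit of comparison)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # two strings with no bigrams are treated as identical
    return len(x & y) / len(x | y)
```

An IR system could then score a query term against each indexed term, keeping those under a distance threshold or above a similarity threshold. Both functions look only at characters, which is why extra matches are not guaranteed to be relevant.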
10	Emprego de técnicas de pré-processamento textual e algoritmos de comparação como suporte à correção de questões dissertativas: experimentos, análises e contribuições / Employing texts preprocessing techniques and string-matching algorithms to support correction of essay questions: experiments, analyzes and contributions. Ricardo Lima Feitosa Ávila 23 August 2013 Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / This master's thesis presents a study of techniques that can be used to support the correction of essay questions, based on adapting string-matching algorithms combined with text preprocessing techniques. The main challenge in designing a tool for this kind of application is the ambiguity of natural language. To analyze the correction of subjective questions, tests were performed with these algorithms, and a tool was developed for this purpose. Comparing student responses with the model answers to questions set in subjective tests, we analyzed the performance of the individual algorithms and of a set of preprocessing techniques found in the literature, in isolation and in combination. To work around specific false negative and false positive situations, some auxiliary techniques are proposed as a contribution of this work. After analysis of the experiments, the similarity index results between responses support the use of the solution as an aid to correcting essay questions; it may also be applied to plagiarism detection and integrated into a learning management system. Processamento de textos (Computação) Teleinformática string-matching algorithms plagiarism detection similarity preprocessing texts ENGENHARIAS
