11

An efficient algorithm for an optimal modular compression. Application to the analysis of genetic sequences. /Un algorithme rapide pour une compression modulaire optimale. Application à l'analyse de séquences génétiques.

Delgrange, Olivier 05 June 1997 (has links)
Abstract: A lossless compression algorithm often applies the same coding scheme to the whole sequence being compressed. As a result, some factors of the sequence are shortened while others are lengthened. In this work, we propose an optimization algorithm for compression methods that breaks off the coding where it is not profitable, so that some segments of the initial sequence are copied verbatim instead of being coded. The resulting compression is said to be modular, meaning that the compressed sequence is a succession of compressed segments and copied segments. Under specific hypotheses, our algorithm computes an optimal modular compression in time O(n log n), where n is the length of the sequence. We show that our optimization method can be used to advantage for analyzing data, particularly genetic sequences; Kolmogorov complexity theory brings to light the usefulness of compression for sequence analysis. The work consists of three parts. The first introduces the classical concepts of compression and coding, as well as the new concept of ICL codes for integers. The second presents the compression optimization algorithm by liftings, which uses ICL codes. The third presents applications of compression optimization by liftings, especially in the context of genetic sequence analysis. Using the specific problem of localizing approximate tandem repeats, we show how the algorithm can be used to localize the regular and irregular segments of a sequence precisely and optimally. This return to experimentation makes it possible to analyze sequences of several hundred thousand symbols within a few seconds.
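For intuition, here is a minimal sketch of the modular choice described above. It is not the thesis's O(n log n) lifting algorithm, but a simple two-state dynamic program over assumed per-symbol bit costs: each symbol is either coded or copied verbatim, and opening a new segment pays a fixed header overhead (all numbers are illustrative).

```python
# Sketch only: costs and the per-segment switch overhead are invented,
# and this O(n) two-state DP stands in for the thesis's lifting algorithm.
def modular_partition(coded_cost, copy_cost, switch_overhead=16):
    """Return (total_bits, modes): modes[i] is 'coded' or 'copied'."""
    n = len(coded_cost)
    INF = float("inf")
    # best[m]: minimal cost of the first i+1 symbols with symbol i in mode m.
    best = [switch_overhead + coded_cost[0], switch_overhead + copy_cost[0]]
    back = [(None, None)]
    for i in range(1, n):
        step = (coded_cost[i], copy_cost[i])
        new, choice = [INF, INF], [None, None]
        for m in (0, 1):
            for p in (0, 1):
                c = best[p] + step[m] + (switch_overhead if p != m else 0)
                if c < new[m]:
                    new[m], choice[m] = c, p
        best = new
        back.append(tuple(choice))
    m = 0 if best[0] <= best[1] else 1
    modes = [m]
    for i in range(n - 1, 0, -1):
        m = back[i][m]
        modes.append(m)
    modes.reverse()
    return min(best), ["coded" if x == 0 else "copied" for x in modes]

# Toy data: cheap-to-code flanks around an incompressible middle region;
# the optimal partition copies the middle instead of coding it.
coded = [2, 2, 2, 40, 40, 2, 2, 2]
copied = [8] * 8
print(modular_partition(coded, copied))
```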
12

Content Based Packet Filtering In Linux Kernel Using Deterministic Finite Automata

Bilal, Tahir 01 September 2011 (has links) (PDF)
In this thesis, we present a content-based packet filtering architecture for the Linux kernel using deterministic finite automata and the iptables framework. New-generation firewalls and intrusion detection systems not only filter or inspect network packets according to their header fields but also take the payload content into account. These systems use a set of signatures, in the form of regular expressions or plain strings, to scan network packets. This scanning phase is a CPU-intensive task that can degrade network performance. Currently, the Linux kernel firewall scans network packets separately for each signature in the user-supplied signature set; this approach constitutes a considerable bottleneck to network performance. We implement a content-based packet filtering architecture and a multiple string matching extension for the Linux kernel firewall that matches all signatures at once, and show that we are able to filter network traffic while consuming constant bandwidth regardless of the number of signatures. Furthermore, we show that we can perform packet filtering at multi-gigabit rates.
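The abstract does not say which automaton construction the extension uses; the classic way to match a whole signature set in a single pass over the payload is an Aho-Corasick automaton, sketched here with invented signatures:

```python
from collections import deque

def build_automaton(patterns):
    """Compile a pattern set into Aho-Corasick goto/fail/output tables."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]  # inherit matches ending at the fallback
    return goto, fail, out

def scan(payload, tables):
    """Single pass over the payload, reporting (start, signature) hits."""
    goto, fail, out = tables
    s, hits = 0, []
    for i, ch in enumerate(payload):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits

sigs = ["/etc/passwd", "cmd.exe", "SELECT"]  # invented signatures
tables = build_automaton(sigs)
print(scan("GET /index?f=/etc/passwd HTTP/1.1", tables))
```

Because every payload byte advances the automaton exactly once, the scan cost is independent of the number of signatures, which is the property the thesis exploits.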
13

String Matching Techniques: An Empirical Assessment Based on Statistics Austria's Business Register

Denk, Michaela, Hackl, Peter, Rainer, Norbert January 2005 (has links) (PDF)
The maintenance and updating of Statistics Austria's business register requires regular matching of the register against other data sources, one of which is the register of tax units of the Austrian Federal Ministry of Finance. The matching process is based on string comparison of enterprise names and addresses via bigrams, and on a quality-class approach that assigns pairs of register units to classes of different matching quality based on bigram similarity values and on the comparison of other matching variables, such as the NACE code or the year of foundation. Building on methodological research into matching techniques carried out in the DIECOFIS project, an empirical comparison of the bigram method with other string matching techniques was conducted: the edit distance, the Jaro and Jaro-Winkler algorithms, the longest common subsequence, and the maximal match were selected as appropriate alternatives and evaluated in the study. This paper briefly introduces Statistics Austria's business register and the corresponding maintenance process, and reports the results of the empirical study.
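The paper does not reproduce its bigram similarity formula; a common instantiation, and a plausible reading of "bigram similarity values", is the Dice coefficient over character bigrams:

```python
from collections import Counter

def bigrams(s):
    """Multiset of character bigrams of a lowercased string."""
    s = s.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]

def dice_bigram_similarity(a, b):
    """2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|), in [0, 1]."""
    ca, cb = Counter(bigrams(a)), Counter(bigrams(b))
    shared = sum((ca & cb).values())
    total = sum(ca.values()) + sum(cb.values())
    return 2 * shared / total if total else 1.0

# Toy register pair: the same enterprise under slightly different spellings.
print(dice_bigram_similarity("Müller Transport GmbH",
                             "Mueller Transport Ges.m.b.H."))
```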
14

Diagnostika chyb v počítačových sítích založená na překlepech / Diagnosing Errors in Computer Networks Based on Typos

Bohuš, Michal January 2020 (has links)
The goal of this diploma thesis is to create a system for diagnosing network data based on detecting and correcting spelling errors. The system is intended to be used by network administrators as an additional diagnostic tool. In contrast to the usual use of spelling detection and correction on ordinary text, these methods are applied to network data supplied by the user. The system works with NetFlow data, pcap files, or log files. Context is modeled through a set of purpose-built data categories, and the correctness of words is verified against dictionaries, one per category. Finding a correction by edit distance alone yields many candidates, so a heuristic for scoring candidates was proposed to select the right one. The system was tested for both functionality and performance.
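As a rough illustration of the detect-and-correct step (the thesis's candidate-scoring heuristic is not detailed in the abstract, so the ranking below, smaller edit distance first and then higher observed frequency, is an assumption):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(word, dictionary, freq, max_dist=2):
    """Rank in-dictionary candidates by (edit distance, -frequency)."""
    cands = [(levenshtein(word, w), -freq.get(w, 0), w)
             for w in dictionary if abs(len(w) - len(word)) <= max_dist]
    cands = [c for c in cands if c[0] <= max_dist]
    return [w for _, _, w in sorted(cands)]

# Toy "hostname" category dictionary with assumed observed frequencies.
hosts = {"gateway": 120, "mailserver": 40, "fileserver": 35}
print(correct("fileserwer", set(hosts), hosts))  # -> ['fileserver']
```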
15

Analyse de structures répétitives dans les séquences musicales / Repetitive structure analysis in music sequences

Martin, Benjamin 12 December 2012 (has links)
The work presented in this thesis deals with the inference of repetitive structures from audio signals using string matching techniques. It aims at proposing and evaluating inference algorithms based on a formal study of the notions of similarity and repetition in music. We first present a method for representing audio signals as symbolic strings. We introduce alignment tools for estimating the similarity between such musical strings, and evaluate the application of these tools to automatic cover song identification. We further adapt a bioinformatics indexing technique to allow efficient assessment of music similarity in large-scale datasets. We then introduce several characteristic repetitive structures and use the alignment tools to analyse these repetitions. A first structure, the repetition of a chosen segment, is retrieved and evaluated in the context of automatic reconstruction of missing audio data. A second structure, the major repetition, is defined, retrieved, and evaluated against expert annotations, and then as an alternative indexing method for cover song identification. We finally present the problem of repetitive structure inference as addressed in the literature, and propose our own problem statement. We describe our model and propose an algorithm for identifying a hierarchical music structure. We demonstrate the relevance of our method through several examples and by comparing it to the state of the art.
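The abstract leaves the alignment tools unspecified; a standard choice for estimating local similarity between symbolic sequences is Smith-Waterman alignment, so the sketch below assumes that setting (the chord symbols and scoring values are invented):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two symbol strings."""
    rows = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = rows[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            rows[i][j] = max(0, diag, rows[i - 1][j] + gap, rows[i][j - 1] + gap)
            best = max(best, rows[i][j])
    return best

# Two toy chord-symbol sequences sharing a repeated phrase.
x = "CFGCCFGC"
y = "AFGCCFGA"
print(smith_waterman(x, y))  # high score from the shared 'FGCCFG' region
```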
16

Uma medida de similaridade híbrida para correspondência aproximada de múltiplos padrões / A hybrid similarity measure for multiple approximate pattern matching

Dezembro, Denise Gazotto 07 March 2019 (has links)
Multiple approximate pattern matching is a problem found in many research areas, such as computational biology, signal processing, and information retrieval. Most of the time a pattern has no exact match in the text, so approximate matches are sought according to an error model. In general, the error model uses a distance function to determine how different two patterns are. Distance functions are built on similarity measures, which can be classified as token-based, edit-distance-based, or hybrid. Some of these measures extract a feature vector from all the terms in the pattern; the similarity between vectors can then be computed by cosine distance or Euclidean distance, for instance. These measures present some problems: they become infeasible as the pattern grows, they perform no orthographic correction, or they suffer from normalization issues. In this research we propose a new hybrid similarity measure, named DGD, that combines TF-IDF weighting with an edit-distance-based measure to estimate the importance of a term within a pattern. The DGD measure does not completely rule out terms that are not part of the pattern, but assigns them a weight based on their high similarity to a term that is in the pattern and on the term's mean TF-IDF weight in the collection. Experiments were conducted showing the behaviour of the proposed measure compared to others in the literature. As a general recommendation, thresholds of 0.60 for {tf-idf+cosine, Jaccard, Soft tf-idf} and 0.90 for {Jaro, Jaro-Winkler, Monge-Elkan} work well for detecting similar patterns. The measure proposed in this work (DGD+cosine) performed better than tf-idf+cosine and Soft tf-idf at identifying similar patterns, and better than the edit-distance-based measures (Jaro and Jaro-Winkler) at identifying non-similar patterns. Acting as a classifier, the proposed hybrid measure performed better (although not significantly so) than all the other similarity measures analyzed, which is a promising result. In addition, the best threshold for the secondary edit-distance similarity between pattern terms was found to be 0.875.
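DGD itself is only summarized above; as a reference point, here is a sketch of the Soft tf-idf baseline it is compared against, with difflib.SequenceMatcher standing in for the usual Jaro-Winkler secondary similarity and toy data throughout:

```python
import difflib
import math
from collections import Counter

def soft_tfidf(doc_a, doc_b, corpus, theta=0.875):
    """Soft tf-idf similarity between two token lists over a corpus.

    A token of doc_a is matched to the closest token of doc_b whenever the
    secondary string similarity reaches theta; SequenceMatcher.ratio() is a
    stand-in for the Jaro-Winkler measure used in the literature.
    """
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))
    def weight(t, doc):
        tf = doc.count(t) / len(doc)
        return tf * math.log(n / (1 + df[t]))
    def norm(doc):
        return math.sqrt(sum(weight(t, doc) ** 2 for t in set(doc))) or 1.0
    sim = 0.0
    for t in set(doc_a):
        close, s_best = None, theta
        for u in set(doc_b):
            s = difflib.SequenceMatcher(None, t, u).ratio()
            if s >= s_best:
                close, s_best = u, s
        if close is not None:
            sim += weight(t, doc_a) * weight(close, doc_b) * s_best
    return sim / (norm(doc_a) * norm(doc_b))

# Toy collection of tokenized enterprise names (invented).
corpus = [["acme", "ltda"], ["acme", "industria", "ltda"], ["beta", "sa"]]
print(soft_tfidf(["acme", "industria"], ["acme", "indstria"], corpus))
```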
17

Improving the Efficiency and Robustness of Intrusion Detection Systems

Fogla, Prahlad 20 August 2007 (has links)
With the increase in the complexity of computer systems, existing security measures are not enough to prevent attacks, and intrusion detection systems have become an integral part of computer security. An intrusion detection system needs to be fast in order to detect intrusions in real time, and robust against attacks that are disguised to evade it. We improve the runtime complexity and space requirements of a host-based anomaly detection system that uses q-gram matching; q-gram matching is often used for approximate substring matching problems in a wide range of application areas, including intrusion detection. During the text pre-processing phase, we store all the q-grams present in the text in a tree. We use a tree redundancy pruning algorithm to reduce the size of the tree without losing any information, and suffix links for fast, linear-time q-gram search during query matching. We compare our work with the Rabin-Karp based hash-table technique commonly used for multiple q-gram matching. To analyze the robustness of network anomaly detection systems, we develop a new class of polymorphic attacks, called polymorphic blending attacks, which can effectively evade payload-based network anomaly IDSs by carefully matching the statistics of the mutated attack instances to the normal profile. Using the PAYL anomaly detection system as a case study, we show that these attacks are practically feasible. We develop a formal framework to analyze polymorphic blending attacks against several network anomaly detection systems, and show that generating an optimal polymorphic blending attack is NP-hard for these systems; nevertheless, such attacks can be generated with the proposed approximation algorithms. The framework can also be used to improve the robustness of an intrusion detector, and we suggest possible countermeasures against polymorphic blending attacks.
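As a toy illustration of q-gram-based host anomaly detection (the thesis's tree with suffix links is replaced here by a plain hash set, and the unseen-q-gram scoring rule is an assumption made for the sketch):

```python
def qgrams(seq, q=3):
    """Set of length-q substrings of a sequence."""
    return {seq[i:i + q] for i in range(len(seq) - q + 1)}

class QGramAnomalyModel:
    """Flags sequences containing many q-grams never seen in training."""
    def __init__(self, q=3):
        self.q, self.known = q, set()
    def train(self, seq):
        self.known |= qgrams(seq, self.q)
    def score(self, seq):
        grams = qgrams(seq, self.q)
        return len(grams - self.known) / len(grams) if grams else 0.0

# Train on normal system-call-like traces, then score a deviant one.
model = QGramAnomalyModel(q=3)
for trace in ["open read read write close", "open read write close"]:
    model.train(trace)
print(model.score("open read write close"))   # 0.0 -> normal
print(model.score("open exec socket close"))  # high -> anomalous
```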
18

以數值高程模型辨識地形之研究 / A Study on Terrain Identification Using Digital Elevation Models

宋秉憲, Soong, Bing Shang Unknown Date (has links)
The main objective of this thesis is to identify the location of a local region within a larger terrain using partial Digital Elevation Model (DEM) information. A DEM is one of the most commonly used data representation models in Geographical Information Systems: a digital model storing uniformly spaced elevation values in a two-dimensional raster array, from which terrain measures such as slope, aspect, and other features can be derived. We extract linear and point features from the local terrain: stream segments obtained through hydrological analysis, and salient convex and concave points obtained through terrain-parameter analysis. The stream network serves as a "fingerprint" of each region and is compared against the fingerprints of the whole terrain; for regions with little relief or indistinct stream segments, feature points are used instead, with different matching algorithms applied to the different feature types. The KMP string matching algorithm is used to speed up the matching process. Stream segments and feature points are represented as objects that integrate information from many map layers, which simplifies data management and improves matching efficiency. The experimental results show that, using these two kinds of features, our mechanism identifies the correct location effectively while avoiding much unnecessary comparison time.
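The abstract does name the Knuth-Morris-Pratt (KMP) algorithm; below is a textbook sketch, applied to a made-up direction-symbol encoding of stream segments (the thesis's actual feature encoding is not given here):

```python
def kmp_search(text, pattern):
    """Return all start indices of pattern in text (Knuth-Morris-Pratt)."""
    # Failure function: longest proper prefix of pattern that is a suffix.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits

# Stream segments encoded as direction symbols (assumed encoding).
whole_terrain = "NNEESSWNNEESSE"
local_patch = "NEESS"
print(kmp_search(whole_terrain, local_patch))  # -> [1, 8]
```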
19

Efficient fuzzy type-ahead search on big data using a ranked trie data structure

Bergman, John January 2018 (has links)
The efficiency of modern search engines depends on how well they present typo-corrected results to a user while typing. So-called fuzzy type-ahead search combines fuzzy string matching with search-as-you-type functionality, creating a powerful tool for exploring indexed data. Current fuzzy type-ahead search algorithms work well on small data sets, but for big data from social networking services such as Facebook, e-commerce sites such as Amazon, or media streaming services such as YouTube, responsive fuzzy type-ahead search remains a great challenge. This thesis describes a method that enables responsive type-ahead search combined with fuzzy string matching on big data, keeping the search time optimal for human interaction at the expense of lower accuracy for less popular records when a query contains typos. This makes the method effective for e-commerce and media services, where the popularity of search terms results from human behaviour and thus often follows a power-law distribution.
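The "ranked trie" of the title suggests caching, at every node, the most popular completions below it, so that an exact-prefix lookup answers a top-k query in O(|prefix| + k). The sketch below reconstructs that idea under stated assumptions; the fuzzy, typo-tolerant traversal of the thesis is omitted for brevity:

```python
class RankedTrieNode:
    __slots__ = ("children", "top")
    def __init__(self):
        self.children = {}   # char -> RankedTrieNode
        self.top = []        # (popularity, record) pairs, best first

class RankedTrie:
    """Trie where every node caches its k most popular completions."""
    def __init__(self, k=3):
        self.root, self.k = RankedTrieNode(), k
    def insert(self, record, popularity):
        node = self.root
        for ch in record:
            node = node.children.setdefault(ch, RankedTrieNode())
            node.top.append((popularity, record))
            node.top.sort(reverse=True)
            del node.top[self.k:]   # keep only the k best per node
    def complete(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [rec for _, rec in node.top]

# Toy records with invented popularity scores.
trie = RankedTrie(k=2)
for name, pop in [("youtube", 95), ("yahoo", 60), ("young thug", 80)]:
    trie.insert(name, pop)
print(trie.complete("yo"))  # -> ['youtube', 'young thug']
```

Keeping only k entries per node trades exactness for a bounded lookup cost, which mirrors the thesis's trade of accuracy on unpopular records for responsiveness.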
20

Smart Clustering System for Filtering and Cleaning User Generated Content : Creating a profanity filter for Truecaller / System för filtrering och sanering av oönskad text i användarskapat innehåll

Moradi, Arvin January 2013 (has links)
This thesis focuses on investigating and creating an application for filtering user-generated content. The method was to examine how profanity and racist expressions are used and manipulated to evade the filtering processes of similar systems. We also studied different algorithms to make the filtering process quick and efficient, i.e., able to process as many names as possible in the shortest time, because the client receives millions of new uploads every day. The results show that the application detects profanity, including manipulated spellings. Data from the customer's database was also used for testing, and the results showed that the application works in practice as well. The performance test shows that the application has a fast execution time, which we verified by approximating execution time as a linear function of the number of names entered. The conclusion is that the filter works and discovers profanity not detected earlier. Future updates to strengthen the decision process could introduce a third-party service, or a web interface where decisions can be controlled manually. The execution time is good and shows that 10 million names can be processed in about 6 hours. In the future, queries to the database could be parallelized so that several names are processed simultaneously.
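A plausible reading of how manipulated profanity can still be caught is to normalize obfuscations before dictionary lookup; the substitution table and blocklist below are invented placeholders, not Truecaller's actual rules:

```python
import re

# Assumed substitution table for common character obfuscations.
LEET = str.maketrans("0134$5@!", "oleassai")

def normalize(name):
    """Lowercase, undo leet substitutions, strip noise, collapse repeats."""
    s = name.lower().translate(LEET)
    s = re.sub(r"[^a-z]", "", s)       # drop separators like '.', '_', '-'
    s = re.sub(r"(.)\1+", r"\1", s)    # 'niiice' -> 'nice'
    return s

def is_blocked(name, blocklist):
    s = normalize(name)
    return any(bad in s for bad in blocklist)

blocklist = {"badword"}  # placeholder entries
print(is_blocked("B.a_d-w0rd", blocklist))      # True
print(is_blocked("perfectly fine", blocklist))  # False
```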
