11

Detecting opinion spam and fake news using n-gram analysis and semantic similarity

Ahmed, Hadeer 14 November 2017 (has links)
In recent years, deceptive content such as fake news and fake reviews (also known as opinion spam) has increasingly become a dangerous prospect for online users. Fake reviews affect consumers and stores alike. Furthermore, the problem of fake news gained attention in 2016, especially in the aftermath of the last US presidential election. Fake reviews and fake news are closely related phenomena, as both consist of writing and spreading false information or beliefs. The opinion spam problem was formulated for the first time only a few years ago, but it has quickly become a growing research area due to the abundance of user-generated content. It is now easy for anyone to write fake reviews or fake news on the web. The biggest challenge is the lack of an efficient way to tell the difference between a real review and a fake one; even humans are often unable to tell the difference. In this thesis, we have developed an n-gram model to automatically detect fake content, with a focus on fake reviews and fake news. We studied and compared two different feature extraction techniques and six machine learning classification techniques. Furthermore, we investigated the impact of keystroke features on the accuracy of the n-gram model. We also applied semantic similarity metrics to detect near-duplicated content. Experimental evaluation of the proposed approach, using existing public datasets and a newly introduced fake news dataset, indicates improved performance compared to the state of the art. / Graduate
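As a rough illustration of the kind of pipeline this abstract describes, the sketch below trains a word n-gram classifier to separate fake from real text. It is a minimal sketch only: the toy corpus, the TF-IDF weighting, and the linear SVM are assumptions for demonstration, not the thesis's exact feature extraction or classifier choices.

```python
# Illustrative sketch: word n-gram TF-IDF features with a linear SVM for
# fake-vs-real text classification. The toy data and parameters are
# assumptions, not the exact pipeline used in the thesis.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for a labeled fake/real dataset.
texts = [
    "breaking shocking news you will not believe what happened",
    "the committee released its quarterly report on local infrastructure",
    "miracle cure doctors hate this one weird trick revealed",
    "city council approves budget for road maintenance next year",
]
labels = ["fake", "real", "fake", "real"]

# Unigrams and bigrams as features, weighted by TF-IDF.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LinearSVC(),
)
model.fit(texts, labels)

print(model.predict(["unbelievable shocking trick the doctors will not reveal"]))
```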
12

An Unsupervised Approach to Detecting and Correcting Errors in Text

Islam, Md Aminul January 2011 (has links)
In practice, most approaches to text error detection and correction are based on a conventional domain-dependent background dictionary that represents a fixed and static collection of correct words of a given language; as a result, satisfactory correction can only be achieved if the dictionary covers most tokens of the underlying correct text. Moreover, most approaches to text correction handle only one, or at best a very few, types of errors. The purpose of this thesis is to propose an unsupervised approach to detecting and correcting text errors that can compete with supervised approaches, and to answer the following questions: Can an unsupervised approach efficiently detect and correct a text containing multiple errors of both syntactic and semantic nature? What is the magnitude of error coverage, in terms of the number of errors that can be corrected? We conclude that (1) it is possible for an unsupervised approach to efficiently detect and correct a text containing multiple errors of both syntactic and semantic nature. Error types include real-word spelling errors, typographical errors, lexical choice errors, unwanted words, missing words, prepositional errors, article errors, punctuation errors, and many grammatical errors (e.g., errors in agreement and verb formation). (2) The magnitude of error coverage, in terms of the number of errors that can be corrected, is almost double the number of correct words of the text. Although this is not the upper limit, it is what is practically feasible. We use engineering approaches to answer the first question and theoretical approaches to answer and support the second. We show that finding inherent properties of a correct text using a corpus in the form of an n-gram data set is more appropriate and practical than other approaches to detecting and correcting errors. Instead of using rule-based approaches and dictionaries, we argue that a corpus can effectively be used to infer the properties of these types of errors, and to detect and correct them. We test the robustness of the proposed approach separately for some individual error types, and then for all types of errors together. The approach is language-independent and can be applied to other languages, as long as n-gram data are available. The results of this thesis thus suggest that unsupervised approaches, which are often dismissed in favor of supervised ones in the context of many Natural Language Processing (NLP) tasks, may present an interesting array of NLP-related problem-solving strengths.
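To make the corpus-driven, dictionary-free idea concrete, here is a minimal sketch: a word is corrected by choosing the vocabulary item whose trigram contexts are best attested in a corpus of n-gram counts. The tiny corpus, the scoring function, and the single-word correction interface are assumptions for demonstration, not the thesis's actual method.

```python
# Minimal sketch of corpus-driven error correction with n-gram counts:
# candidate words are scored by how often the trigrams they would form
# around the error position occur in the corpus.
from collections import Counter
from itertools import islice

corpus = ("we went to the park . we went to the store . "
          "she went to the bank . they walked to the park .").split()

def ngrams(tokens, n):
    return zip(*(islice(tokens, i, None) for i in range(n)))

trigrams = Counter(ngrams(corpus, 3))
vocab = set(corpus)

def context_score(left2, word, right2):
    """Sum of counts of the trigrams that `word` would participate in."""
    l1, l2 = left2
    r1, r2 = right2
    return (trigrams[(l1, l2, word)] +
            trigrams[(l2, word, r1)] +
            trigrams[(word, r1, r2)])

def correct(tokens, i):
    """Return the corpus word that best fits position i, given its context."""
    padded = ["", ""] + tokens + ["", ""]
    left2 = (padded[i], padded[i + 1])        # two words before position i
    right2 = (padded[i + 3], padded[i + 4])   # two words after position i
    best = max(vocab, key=lambda w: context_score(left2, w, right2))
    return best if context_score(left2, best, right2) > 0 else tokens[i]

sentence = "we went to the pork".split()      # real-word error: "pork" for "park"
print(correct(sentence, 4))                   # expected output: park
```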
13

Using random projections for dimensionality reduction in identifying rogue applications

Atkison, Travis Levestis 08 August 2009 (has links)
In general, the consumer must depend on others to provide their software solutions. However, this outsourcing of software development has made it increasingly unclear where software is actually developed and by whom, and it poses a potentially large security problem for the consumer, since it opens up the possibility for rogue functionality to be injected into an application without the consumer's knowledge or consent. This raises the questions 'How do we know that the software we use can be trusted?' and 'How can we have assurance that the software we use is doing only the tasks that we ask it to do?' Traditional methods for thwarting such activities, such as virus detection engines, are far too antiquated for today's adversary. More sophisticated research needs to be conducted in this area to combat these more technically advanced enemies. To combat the ever-increasing problem of rogue applications, this dissertation has successfully applied and extended the information retrieval techniques of n-gram analysis and document similarity and the data mining techniques of dimensionality reduction and attribute extraction. This combination of techniques has produced a more effective tool suite for detecting Trojan horses and other rogue applications, capable of detecting not only standalone rogue applications but also those embedded within other applications. This research provides several major contributions to the field, including a unique combination of techniques that offers a new tool for the administrator's multi-pronged defense against the infestation of rogue applications. Another contribution is a unique method of slicing the potential rogue applications that has proven to yield a more robust rogue application classifier. Through experimental research, this effort has shown that a viable and worthy rogue application detection tool suite can be developed. Experimental results have shown that, in some cases, as much as a 28% increase in overall accuracy can be achieved when comparing the accepted feature selection practice of mutual information with the feature extraction method presented in this effort, called randomized projection.
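The following sketch illustrates the two techniques the abstract combines, byte n-gram features and random projection, in their most basic form: binary contents are turned into byte 2-gram count vectors, projected into a low-dimensional space with a Gaussian random matrix, and compared by cosine similarity. The toy byte strings, the n-gram size, and the projection dimension are assumptions for demonstration, not the dissertation's configuration.

```python
# Illustrative sketch: project high-dimensional byte n-gram count vectors to a
# low-dimensional space with a Gaussian random projection, then compare
# programs by cosine similarity in the reduced space.
import numpy as np

N = 2                       # byte n-gram size
DIM = 256 ** N              # original feature dimension: all possible 2-grams
K = 64                      # reduced dimension after projection

def ngram_vector(data: bytes) -> np.ndarray:
    """Count vector of overlapping byte n-grams."""
    v = np.zeros(DIM)
    for i in range(len(data) - N + 1):
        v[int.from_bytes(data[i:i + N], "big")] += 1
    return v

rng = np.random.default_rng(0)
projection = rng.normal(size=(DIM, K)) / np.sqrt(K)    # random projection matrix

def embed(data: bytes) -> np.ndarray:
    return ngram_vector(data) @ projection

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy byte strings standing in for application contents.
known_rogue = embed(b"\x90\x90\x90\xcc\xcc payload payload payload")
suspect     = embed(b"\x90\x90\xcc\xcc payload payload padding")
benign      = embed(b"hello world, this is an ordinary text resource")

print("suspect vs rogue:", round(cosine(suspect, known_rogue), 3))
print("benign  vs rogue:", round(cosine(benign, known_rogue), 3))
```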
14

Protection des contenus des images médicales par camouflage d'informations secrètes pour l'aide à la télémédecine / Medical image content protection by secret information hiding to support telemedicine

Al-Shaikh, Mu'ath 22 April 2016 (has links)
La protection de l’image médicale numérique comporte au moins deux aspects principaux: la sécurité et l’authenticité. Afin d’assurer la sécurité, l’information doit être protégée vis-à-vis des utilisateurs non autorisés. L’authenticité permet quant à elle de s’assurer que la donnée reçue n’est pas modifiée, n’est pas altérée, et qu’elle est bien envoyée par l’expéditeur supposé. La « technique » cryptographique garantit la sécurité en faisant l’hypothèse que l’expéditeur et le destinataire ont des clés permettant respectivement de crypter et de décrypter le message. De cette manière, seule la personne possédant la bonne clé peut décrypter le message et accéder au contenu de la donnée médicale. Dans cette thèse, nous avons apporté plusieurs contributions. La principale contribution est la proposition de solutions de tatouage d'images médicales robustes et réversibles dans le domaine spatial basées respectivement sur l’analyse de concepts formels (FCA) et le diagramme de décision binaire par suppression des zéros (ZBDD). La seconde est une approche de tatouage d’image médicale semi-aveugle pour la détection de modifications malveillantes. Une autre contribution est la proposition d'un système de chiffrement symétrique sécurisé basé sur les N-grams. La dernière contribution est un système hybride de tatouage et de cryptographie d’image médicale qui s’appuie sur une nouvelle forme de carte chaotique (chaotic map) pour générer des clés ayant des propriétés spécifiques, et qui permet d'obtenir une meilleure efficacité, une grande robustesse et une faible complexité par rapport aux approches existantes. / The protection of digital medical images comprises at least two main aspects: security and authenticity. To ensure security, the information has to be protected from unauthorized users, while authenticity confirms that the received data has not been altered or modified and was indeed sent by the intended sender (watermarking). Cryptography guarantees security by assuming that the sender and the receiver hold keys for encrypting and decrypting the message, respectively; thus, after encryption on the sender's side, only the person who holds the right key (the receiver) can decrypt and access the content of the medical data. In this thesis, we have made several contributions. The main one is the proposal of robust and reversible medical image watermarking solutions in the spatial domain, based respectively on formal concept analysis (FCA) and the zero-suppressed binary decision diagram (ZBDD). The second is a semi-blind medical image watermarking approach for tamper detection. Another contribution is the proposal of a secure symmetric encryption system based on n-grams. The last contribution is a hybrid medical image watermarking and cryptography system which relies on a new form of chaotic map to generate keys with specific properties, and which achieves better efficiency, higher robustness and lower complexity than existing approaches.
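To illustrate only the general idea of deriving key material from a chaotic map, the toy sketch below iterates the classic logistic map from a secret seed, quantizes each state to a byte, and XORs the resulting keystream with the data. Everything here is an assumption for demonstration: this is not the chaotic map, key-generation scheme, or encryption system developed in the thesis, and a construction this naive is not cryptographically secure.

```python
# Toy sketch of a chaotic-map keystream (demonstration only, not secure and
# not the thesis's scheme): iterate the logistic map, quantize to bytes, XOR.
def logistic_keystream(seed: float, r: float, length: int) -> bytes:
    x = seed
    for _ in range(100):             # discard transient iterations
        x = r * x * (1.0 - x)
    out = []
    for _ in range(length):
        x = r * x * (1.0 - x)
        out.append(int(x * 256) % 256)
    return bytes(out)

def xor_cipher(data: bytes, seed: float, r: float = 3.99) -> bytes:
    ks = logistic_keystream(seed, r, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))

pixels = bytes(range(32))                    # stand-in for medical image bytes
cipher = xor_cipher(pixels, seed=0.612345)
plain  = xor_cipher(cipher, seed=0.612345)   # the same seed decrypts
assert plain == pixels
print(cipher.hex())
```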
15

Lingvistisk knäckning av lösenordsfraser / Linguistical passphrase cracking

Sparell, Peder January 2015 (has links)
För att minnas långa lösenord är det inte ovanligt att användare rekommenderas att skapa en mening som sedan sätts ihop till ett långt lösenord, en lösenordsfras. Informationsteoretiskt sett är dock ett språk väldigt begränsat och förutsägbart, varför enligt Shannons definition av informationsteori en språkriktig lösenordsfras bör vara relativt lätt att knäcka. Detta arbete riktar in sig på knäckning av språkriktiga lösenordsfraser, dels i syfte att avgöra i vilken grad det är tillrådligt att basera en lösenordspolicy på lösenordsfraser för skydd av data, dels för att allmänt tillgängliga effektiva metoder idag saknas för att knäcka så långa lösenord. Inom arbetet genererades fraser för vidare användning av tillgängliga knäckningsprogram, och språket i fraserna modelleras med hjälp av en Markov-process. I denna process byggs fraserna upp genom att det används antal observerade förekomster av följder av bokstäver eller ord i en källtext, så kallade n-gram, för att avgöra möjliga/troliga nästkommande bokstav/ord i fraserna. Arbetet visar att genom att skapa modeller över språket kan språkriktiga lösenordsfraser knäckas på ett praktiskt användbart sätt jämfört med uttömmande sökning. / In order to remember long passwords, it is not uncommon for users to be advised to create a sentence that is then assembled into a long password, a passphrase. In information-theoretic terms, however, a language is very limited and predictable, which is why, according to Shannon's definition of information theory, a linguistically correct passphrase should be relatively easy to crack. This work focuses on cracking linguistically correct passphrases, partly to determine to what extent it is advisable to base a password policy on such phrases for the protection of data, and partly because widely available, effective methods to crack such long passwords are missing today. Within the work of this thesis, phrases were generated for further processing by available cracking applications, and the language of the phrases was modeled using a Markov process. In this process, phrases are built up by using the number of observed occurrences of sequences of characters or words in a source text, known as n-grams, to determine the possible/probable next character/word in the phrases. The work shows that, by creating models of the language, linguistically correct passphrases can be cracked in a practically useful way compared to exhaustive search.
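The sketch below shows the Markov idea at word level in its simplest form: bigram successor lists are built from a source text and candidate passphrases are generated by always choosing among observed next words. The tiny source text, the phrase length, and the concatenation format are assumptions for demonstration; in the thesis the generated phrases are fed to existing cracking applications.

```python
# Minimal sketch of Markov-based passphrase candidate generation from
# word bigrams observed in a source text.
import random
from collections import defaultdict

source = ("the quick brown fox jumps over the lazy dog "
          "the lazy dog sleeps under the old brown tree").split()

# word -> list of observed successors (a word bigram model)
successors = defaultdict(list)
for w1, w2 in zip(source, source[1:]):
    successors[w1].append(w2)

def generate_phrase(length=4, rng=random.Random(1)):
    """Build a candidate phrase by walking the bigram model."""
    word = rng.choice(source)
    phrase = [word]
    for _ in range(length - 1):
        nxt = successors.get(word)
        if not nxt:                   # dead end: fall back to any source word
            nxt = source
        word = rng.choice(nxt)
        phrase.append(word)
    return "".join(phrase)            # concatenate into a passphrase candidate

for _ in range(5):
    print(generate_phrase())
```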
16

EXTRACTION AND PREDICTION OF SYSTEM PROPERTIES USING VARIABLE-N-GRAM MODELING AND COMPRESSIVE HASHING

Muthukumarasamy, Muthulakshmi 01 January 2010 (has links)
In modern computer systems, memory accesses and power management are the two major performance-limiting factors. Accesses to main memory are very slow compared to operations within a processor chip. Hardware write buffers, caches, out-of-order execution, and prefetch logic are commonly used to reduce the time spent waiting for main memory accesses; compiler loop interchange and data layout transformations can also help. Unfortunately, large data structures often have access patterns for which none of the standard approaches are useful. Using smaller data structures can significantly improve performance by allowing the data to reside in higher levels of the memory hierarchy. This dissertation proposes using a lossy data compression technique called 'compressive hashing' to create 'surrogates' that can augment original large data structures to yield faster typical data access. One way to optimize system performance for power consumption is to provide predictive control of system-level energy use. This dissertation creates a novel instruction-level cost model, called the variable-n-gram model, which is closely related to the n-gram analysis commonly used in computational linguistics. This model does not require direct knowledge of complex architectural details and is capable of determining performance relationships between instructions from an execution trace. Experimental measurements are used to derive a context-sensitive model of the performance of each type of instruction in the context of an N-instruction sequence. Dynamic runtime power prediction mechanisms often suffer from high overhead costs. To reduce this overhead, this dissertation encodes the static instruction-level predictions into a data structure and uses compressive hashing to provide on-demand runtime access to those predictions. Genetic programming is used to evolve compressive hash functions, and performance analysis of applications shows that runtime access overhead can be reduced by a factor of roughly 3x-9x.
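The sketch below illustrates only the context-sensitive, variable-length n-gram half of the idea: per-instruction costs are recorded under contexts of up to n preceding opcodes, and prediction backs off to shorter contexts when a long one was never observed. The toy trace, the cycle costs, and the back-off rule are assumptions for demonstration, not the dissertation's actual model or its compressive-hashing encoding.

```python
# Illustrative sketch of a context-sensitive instruction cost model with
# variable-length n-gram contexts and back-off.
from collections import defaultdict

MAX_N = 3
observed = defaultdict(list)   # (preceding opcodes..., opcode) -> observed cycle costs

# Toy execution trace: (opcode, measured cost in cycles).
trace = [("load", 4), ("add", 1), ("store", 3),
         ("load", 8), ("add", 1), ("mul", 3),
         ("load", 4), ("add", 1), ("store", 3)]

ops = [op for op, _ in trace]
for i, (op, cost) in enumerate(trace):
    for n in range(1, MAX_N + 1):
        if i - n + 1 < 0:
            break                                   # context would run off the start
        ctx = tuple(ops[i - n + 1:i + 1])
        observed[ctx].append(cost)

def predict(context, op):
    """Predict the cost of `op` after `context`, backing off to shorter contexts."""
    for n in range(MAX_N, 0, -1):
        if len(context) < n - 1:
            continue                                # not enough context for this n
        prefix = tuple(context)[-(n - 1):] if n > 1 else ()
        ctx = prefix + (op,)
        if ctx in observed:
            costs = observed[ctx]
            return sum(costs) / len(costs)
    return None                                     # opcode never observed at all

print(predict(("mul",), "load"))     # 4.0: the one load observed right after a mul
print(predict(("store",), "load"))   # 8.0: the one load observed right after a store
```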
17

Text-based language identification for the South African languages

Botha, Gerrit Reinier 04 September 2008 (has links)
We investigate the factors that determine the performance of text-based language identification, with a particular focus on the 11 official languages of South Africa. Our study uses n-gram statistics as features for classification. In particular, we compare support vector machine, naïve Bayesian and difference-in-frequency classifiers on different amounts of input text and various values of n, for different amounts of training data. For a fixed value of n, the support vector machine generally outperforms the other classifiers, but the simpler classifiers are able to handle larger values of n. The additional computational complexity of training the support vector machine classifier may not be justified in light of the importance of using a large value of n, except possibly for small input windows when limited training data is available. We find that it is more difficult to discriminate between languages within a language family than between languages across families. Accuracy on short input strings is low for this reason, but for input strings of 100 characters or more there is only slight confusion within families, and accuracies as high as 99.4% are achieved. For the shortest input strings studied here, which consist of 15 characters, the best accuracy achieved is only 83%, but when the languages in different families are grouped together this corresponds to a usable 95.1% accuracy. The relationship between the amount of training data and the accuracy achieved is found to depend on the window size: for the largest window (300 characters) about 400 000 characters are sufficient to achieve close-to-optimal accuracy, whereas improvements in accuracy are found even beyond 1.6 million characters of training data. Finally, we show that the confusions between the different languages in our set can be used to derive informative graphical representations of the relationships between the languages. / Dissertation (MEng)--University of Pretoria, 2008. / Electrical, Electronic and Computer Engineering / unrestricted
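A minimal sketch of character n-gram language identification follows: a trigram frequency profile is built per language from training text, and an input string is assigned to the language whose profile it matches best by cosine similarity. The two-language toy training snippets and the similarity-based classifier are assumptions for illustration; the dissertation compares support vector machine, naïve Bayesian and difference-in-frequency classifiers on far larger data for all 11 languages.

```python
# Minimal sketch of character n-gram language identification by
# nearest frequency profile (cosine similarity).
from collections import Counter
from math import sqrt

def profile(text, n=3):
    text = " " + text.lower() + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    dot = sum(p[g] * q[g] for g in p if g in q)
    return dot / (sqrt(sum(v * v for v in p.values())) *
                  sqrt(sum(v * v for v in q.values())))

# Crude toy stand-ins for per-language training corpora.
training = {
    "english":   "the quick brown fox jumps over the lazy dog near the river",
    "afrikaans": "die vinnige bruin jakkals spring oor die lui hond by die rivier",
}
profiles = {lang: profile(text) for lang, text in training.items()}

def identify(snippet):
    return max(profiles, key=lambda lang: cosine(profile(snippet), profiles[lang]))

print(identify("the brown dog by the river"))
print(identify("die bruin hond by die rivier"))
```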
18

Spell checkers and correctors : a unified treatment

Liang, Hsuan Lorraine 25 June 2009 (has links)
The aim of this dissertation is to provide a unified treatment of various spell checkers and correctors. Firstly, the spell checking and correcting problems are formally described in mathematical terms in order to provide a better understanding of these tasks. An approach similar to the way in which denotational semantics is used to describe programming languages is adopted. Secondly, the various attributes of existing spell checking and correcting techniques are discussed. Extensive studies of selected spell checking/correcting algorithms and packages are then performed. Lastly, an empirical investigation of various spell checking/correcting packages is presented. It provides a comparison, and suggests a classification, of these packages in terms of their functionalities, implementation strategies, and performance. The investigation was conducted on packages for spell checking and correcting in English as well as in Northern Sotho and Chinese. The classification provides a unified presentation of the strengths and weaknesses of the techniques studied in the research. The findings provide a better understanding of these techniques and can assist in improving existing spell checking/correcting applications as well as future spell checking/correcting package designs and implementations. / Dissertation (MSc)--University of Pretoria, 2009. / Computer Science / unrestricted
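For readers unfamiliar with the basic machinery being surveyed, here is a minimal sketch of the standard check-and-correct loop: a word absent from the dictionary is flagged, and corrections are ranked by Levenshtein edit distance. The tiny dictionary and the ranking rule are assumptions for demonstration; the dissertation classifies far richer checking and correction techniques than this.

```python
# Minimal sketch of dictionary lookup plus edit-distance-ranked correction.
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

dictionary = {"spell", "checker", "corrector", "unified", "treatment", "the", "and"}

def check_and_correct(word):
    """Return the word and, if it is flagged, the closest dictionary candidates."""
    if word in dictionary:
        return word, []
    candidates = sorted(dictionary, key=lambda w: edit_distance(word, w))[:3]
    return word, candidates

print(check_and_correct("cheker"))    # flagged; 'checker' should rank first
```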
19

コーパスにおけるモーラ情報を用いた日本の方言分類分析 / コーパス ニオケル モーラ ジョウホウ オ モチイタ ニホン ノ ホウゲン ブンルイ ブンセキ

入江 さやか, Sayaka Irie 22 March 2020 (has links)
本研究は,日本語方言学において,これまで種々の案が出されている,「東西分類」というトピックに対して,自然談話におけるモーラn-gramの頻度データと統計的手法を用いて,各地方言を分類し,東西境界線がどのように考えられるかを検討したものである。また,分類の際に重要なモーラを形態音韻論的観点からまとめ,東西方言における特徴として挙げる。 / In this study, we classified Japanese dialects and considered where to set the east-west dialect boundary by applying statistical methods to frequency data on mora n-grams in natural discourse, addressing the topic of "East-West classification", for which various proposals have been made in Japanese dialectology. In addition, the moras that are important for the classification are summarized from a morphophonological point of view and presented as characteristic features of the Eastern and Western dialects. / 博士(文化情報学) / Doctor of Culture and Information Science / 同志社大学 / Doshisha University
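To illustrate the method in miniature, the sketch below segments short utterances into (approximate) moras, counts mora bigrams, and compares samples by cosine similarity of their bigram profiles. The four utterances are invented stand-ins loosely modeled on Eastern- and Western-style Japanese, not the natural-discourse corpus used in the thesis, and the mora segmentation is deliberately crude.

```python
# Toy sketch: mora bigram frequency profiles compared by cosine similarity.
from collections import Counter
from math import sqrt

SMALL = set("ゃゅょぁぃぅぇぉ")     # small kana attach to the preceding mora

def moras(text):
    out = []
    for ch in text.replace(" ", ""):
        if ch in SMALL and out:
            out[-1] += ch
        else:
            out.append(ch)
    return out

def bigram_profile(text):
    m = moras(text)
    return Counter(zip(m, m[1:]))

def cosine(p, q):
    dot = sum(p[g] * q[g] for g in p if g in q)
    return dot / (sqrt(sum(v * v for v in p.values())) *
                  sqrt(sum(v * v for v in q.values())))

# Invented stand-in utterances, not real corpus data.
samples = {
    "east1": "そうだよ たべないよ いるよ",
    "east2": "そうだね たべないね いるね",
    "west1": "そうやで たべへんで おるで",
    "west2": "そうやな たべへんわ おるわ",
}
profiles = {k: bigram_profile(v) for k, v in samples.items()}

names = list(samples)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(cosine(profiles[a], profiles[b]), 2))
```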
20

Long-term vehicle movement prediction using Machine Learning methods / Långsiktig fordonsrörelseförutsägelse med maskininlärningsmetoder

Yus, Diego January 2018 (has links)
The problem of location or movement prediction can be described as the task of predicting the future location of an item using the item's past locations. It is a problem of increasing interest with the arrival of location-based services and autonomous vehicles. Even though short-term prediction is more commonly studied, especially in the case of vehicles, long-term prediction can be useful in many applications such as scheduling, resource management or traffic prediction. In this master thesis project, I present a feature representation of movement that can be used for learning long-term movement patterns and for long-term movement prediction both in space and time. The representation relies on periodicity in the data and is based on weighted n-grams of windowed trajectories. The algorithm is evaluated on movement data from heavy transport vehicles to assess its ability to retrieve, from a search index, vehicles that will, with high probability, move along a route that matches a desired transport mission. Experimental results show that the algorithm achieves a consistently low prediction distance error across different transport lengths in a limited geographical area under business operating conditions. The results also indicate that the total population of vehicles in the index is a critical factor in the algorithm's performance and therefore in its real-world applicability. / Lokaliserings- eller rörelseprognosering kan beskrivas som uppgiften att förutsäga ett objekts framtida placering med hjälp av de tidigare platserna för objektet. Intresset för problemet ökar i och med införandet av platsbaserade tjänster och autonoma fordon. Även om det är vanligare att studera kortsiktiga förutsägelser, särskilt när det gäller fordon, kan långsiktiga förutsägelser vara användbara i många applikationer som schemaläggning, resurshantering eller trafikprognoser. I detta masterprojekt presenterar jag en feature-representation av rörelse som kan användas för att lära in långsiktiga rörelsemönster och för långsiktig rörelseprediktion både i rymden och tiden. Representationen bygger på periodicitet i data och är baserad på att dela upp banan i fönster och sedan beräkna viktade n-grams av banorna från de olika fönstren. Algoritmen utvärderas på transportdata för tunga transportfordon för att bedöma dess förmåga att från ett sökindex hämta fordon som med stor sannolikhet kommer att röra sig längs en rutt som matchar ett önskat transportuppdrag. Experimentella resultat visar att algoritmen kan uppnå ett konsekvent lågt fel i relativt predikterat avstånd över olika transportlängder i ett begränsat geografiskt område under verkliga förhållanden. Resultaten indikerar även att den totala populationen av fordon i indexet är en kritisk faktor för algoritmens prestanda och därmed även för dess applicerbarhet för verklig användning.
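The sketch below shows the flavor of the representation the abstract describes: positions are discretized into grid cells, n-grams are taken over the (windowed) cell sequence, and an inverted index lets a desired route be matched against vehicles by the weighted n-grams they share. The grid size, the toy trajectories, the window handling and the frequency weighting are simplifying assumptions, not the thesis's exact scheme.

```python
# Illustrative sketch: grid-cell trajectory n-grams indexed per vehicle and
# queried with a desired route to rank candidate vehicles.
from collections import Counter, defaultdict

def to_cells(points, cell=0.5):
    """Map (x, y) positions to discrete grid-cell labels."""
    return [(int(x / cell), int(y / cell)) for x, y in points]

def ngrams(seq, n=3):
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

# Toy historical trajectories for two vehicles (assumed already windowed).
history = {
    "truck_A": [(0.1, 0.1), (0.6, 0.2), (1.1, 0.2), (1.6, 0.3), (2.1, 0.3)],
    "truck_B": [(0.1, 2.1), (0.6, 2.2), (1.1, 2.4), (1.6, 2.6), (2.1, 2.8)],
}

index = defaultdict(Counter)            # n-gram -> {vehicle: weight}
for vehicle, points in history.items():
    for gram in ngrams(to_cells(points)):
        index[gram][vehicle] += 1       # weight = observed frequency

def rank_vehicles(route_points):
    """Score vehicles by the weighted n-grams they share with the desired route."""
    scores = Counter()
    for gram in ngrams(to_cells(route_points)):
        scores.update(index[gram])
    return scores.most_common()

desired_route = [(0.2, 0.2), (0.7, 0.2), (1.2, 0.3), (1.7, 0.3), (2.2, 0.4)]
print(rank_vehicles(desired_route))     # truck_A should rank first
```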
