1 |
Une nouvelle approche pour la détection des spams se basant sur un traitement des données catégorielles / A new approach to spam detection based on categorical data processing
Parakh Ousman, Yassine Zaralahy, January 2012 (has links)
The spam problem has grown considerably over the last 20 years. Unsolicited bulk email may now account for more than 72% of all email traffic. Beyond their intrusiveness, spam messages can carry viruses or malicious scripts, hence the interest in detecting them in order to remove them. Since the cost of sending email is negligible for a spammer, they can afford to send spam to as many email addresses as possible. For a spammer who manages to hook even a small fraction of recipients, the operation becomes commercially viable: out of one million emails sent, a response rate of only 0.1% still represents a thousand people, and that figure is quite realistic. Behind the protection of privacy and the maintenance of a healthy working environment, there are therefore also economic stakes. Spam detection is a constant race between the introduction of new email classification techniques and their circumvention by spammers. Until recently, spammers had the upper hand in this fight; the trend reversed with the appearance of content-based filtering techniques, most of which rely on a naive Bayesian classifier. In this thesis we present a new approach to this classification, using a method based on the processing of categorical data. The method uses N-grams to identify significant patterns in order to limit the impact of the morphing of unsolicited messages.
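As an illustration of the kind of content-based filtering discussed above, the sketch below pairs character N-gram features with a naive Bayes classifier, the baseline most content filters rely on. It is not the categorical-data method proposed in the thesis; the toy emails and the 3-gram choice are assumptions made purely for demonstration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus; real filters train on large labelled email collections.
emails = [
    "cheap meds, buy v1agra now!!!",       # spam with an obfuscated token
    "limited offer, cl1ck here to win",    # spam
    "meeting moved to 3pm tomorrow",       # ham
    "please review the attached report",   # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Character 3-grams stay largely intact when spammers morph individual words,
# which is what makes them attractive next to plain word features.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)),
    MultinomialNB(),
)
model.fit(emails, labels)
print(model.predict(["fr3e v1agra offer"]))  # expected: ['spam']
```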
|
2 |
Τεχνικές για την εξαγωγή γνώσης από την πλατφόρμα του Twitter / Techniques for extracting knowledge from the Twitter platform
Δήμας, Αναστάσιος, 12 October 2013 (has links)
The use of Twitter by ever more people results in the production of a large volume of "subjective" data. The need to uncover any valuable information hidden in these data has driven the development of a new research field, Sentiment Analysis, whose object is to detect the sentiment of a user (or a group of users) towards some topic. Traditional sentiment detection algorithms and methods rely on the lexical analysis of phrases or sentences in "formal" texts and are called word based approaches. However, the small size of Twitter texts, combined with the looseness of the language its users employ, does not allow these techniques to be applied effectively. For this reason, techniques that operate on characters instead of words, called character based approaches, are preferred.
The goal of this thesis is to apply the character based method to the analysis of tweets with political content. Specifically, data from the U.S. political scene were used, with the aim of detecting a user's preference for the Republican or the Democratic party from relevant tweets. The analysis used supervised learning with a Naive Bayes classifier.
Initially, a set of 7904 tweets was collected from the official Twitter accounts of 48 senators. This set was split into two subsets, a training set and a test set, and the classification accuracy of each of the two analysis methods (word based and character based) was measured. The experiments showed that the character based method classifies the tweets with greater accuracy. We then collected two new test sets, one from the official Twitter account of the Republican party and one from the official Twitter account of the Democratic party. This time, the entire initial set of senators' tweets was used as the training set, and the classification accuracy of the character based method was measured on the two new test sets. Although for the Democratic Twitter account the results can be described as "satisfactory", since classification accuracy approached 80%, this was not the case for the Republican Twitter account. We therefore carried out a more thorough study of the structure and content of those tweets. The analysis yielded some interesting findings about the source of the low classification accuracy: the majority of the tweets posted by Republican senators contained no personal opinion; they were simply references to an article or video seen online. Most of these tweets therefore carry "objective" rather than "subjective" information, so it is not possible to extract the features needed to detect the users' polarity. / As more people enter the “social web”, social media platforms are becoming an increasingly valuable source of subjective information. The large volume of social media content available requires automatic techniques in order to process and extract any valuable information. This need recently gave rise to the field of Sentiment Analysis, also known as Opinion Mining. The goal of sentiment analysis is to identify the position of a user (or a group of users – a crowd), with respect to a particular issue or topic. Existing sentiment analysis systems aim at extracting patterns mainly from formal documents with respect to a particular language (most techniques concern English). They either search for discriminative series of words or use dictionaries that assess the meaning and sentiment of specific words and phrases. The limited size of Twitter posts in conjunction with the non-standard vocabulary and shortened words (used by its users) inserts a great deal of noise, making word based approaches ineffective. For all of the above reasons, a new approach was recommended in the literature. This new approach is not based on the study of words but rather on the study of consecutive character sequences (namely character-based approaches).
In this work, we demonstrate the superiority of the character based approach over the word based one in determining political sentiment. We argue that this approach can be used in order to efficiently determine the political preference (e.g. Republican or Democrat) of voters or to identify the importance that particular issues have on particular voters. This type of feedback can be useful in the organization of political campaigns or policies.
We created a corpus consisting of 7904 tweets, collected from the Twitter accounts of 48 U.S. senators. This corpus was then separated into two sets, a training set and a test set, in order to measure the classification accuracy of each method (word based and character based). The experiments showed that the character based method classified the tweets with greater accuracy. In the next test, we used two new test sets, one from the official Twitter account of the Republican Party and one from the official Twitter account of the Democratic Party. The main difference, with respect to the previous test, was the use of the total set of tweets collected from the senators' Twitter accounts as the training set and the use of the tweets from the official Twitter accounts of each party as the test sets. Even though 80% of the tweets from the official Democratic Twitter account were correctly classified as Democrat, this was not the case for the official Republican Twitter account (56.7% accuracy).
This was found to be partly because the majority of the Republican account tweets were references to online articles or videos and not the personal opinions or views of the users. In other words, such tweets cannot be characterized as personal (subjective), in order to classify the respective user as leaning towards one party or the other, but rather should be considered as objective.
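A minimal sketch of the word-based versus character-based comparison described above, assuming a scikit-learn pipeline with a multinomial Naive Bayes classifier; the handful of invented example tweets stands in for the 7904-tweet senator corpus, which is not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented placeholder tweets; the thesis corpus of senator tweets is not shown here.
tweets = [
    "Lower taxes and smaller government create jobs",
    "We must repeal this burdensome regulation",
    "Secure the border and support our troops",
    "Cutting spending is the path to growth",
    "Healthcare is a right for every family",
    "We need to act on climate change now",
    "Raise the minimum wage for working families",
    "Protect voting rights in every state",
]
parties = ["R", "R", "R", "R", "D", "D", "D", "D"]

X_train, X_test, y_train, y_test = train_test_split(
    tweets, parties, test_size=0.25, stratify=parties, random_state=0)

# Same classifier, two feature extractors: word n-grams vs character n-grams.
for name, vectorizer in [
    ("word based", CountVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("character based", CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
]:
    clf = make_pipeline(vectorizer, MultinomialNB()).fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```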
|
3 |
The textcat Package for n-Gram Based Text Categorization in R
Feinerer, Ingo, Buchta, Christian, Geiger, Wilhelm, Rauch, Johannes, Mair, Patrick, Hornik, Kurt, 02 1900 (has links) (PDF)
Identifying the language used will typically be the first step in most natural language
processing tasks. Among the wide variety of language identification methods discussed
in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text
categorization based on character n-gram frequencies have been particularly successful.
This paper presents the R extension package textcat for n-gram based text categorization
which implements both the Cavnar and Trenkle approach as well as a reduced n-gram
approach designed to remove redundancies of the original approach. A multi-lingual
corpus obtained from the Wikipedia pages available on a selection of topics is used to
illustrate the functionality of the package and the performance of the provided language
identification methods. (authors' abstract)
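For readers unfamiliar with the Cavnar and Trenkle (1994) approach that textcat implements, the following Python sketch illustrates the underlying "out-of-place" ranking measure. It is only an illustration of the idea, not the package's R interface, and the two training snippets are invented.

```python
from collections import Counter

def ngram_profile(text, n_max=5, size=300):
    """Ranked character n-gram profile (1..n_max grams, most frequent first)."""
    text = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter(
        text[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(text) - n + 1)
    )
    ranked = [gram for gram, _ in counts.most_common(size)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams missing from the category get the maximum penalty."""
    penalty = len(cat_profile)
    return sum(
        abs(rank - cat_profile.get(gram, penalty))
        for gram, rank in doc_profile.items()
    )

# Tiny invented training snippets; real profiles are built from much larger corpora.
profiles = {
    "english": ngram_profile("the quick brown fox jumps over the lazy dog"),
    "german": ngram_profile("der schnelle braune fuchs springt ueber den hund"),
}
doc = ngram_profile("the dog jumps over the fox")
print(min(profiles, key=lambda lang: out_of_place(doc, profiles[lang])))  # expected: english
```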
|
4 |
The Processing of Lexical Sequences
Shaoul, Cyrus, Unknown Date
No description available.
|
5 |
Lexical Chains and Sliding Locality Windows in Content-based Text Similarity Detection
Nahnsen, Thade, Uzuner, Ozlem, Katz, Boris, 19 May 2005 (has links)
We present a system to determine content similarity of documents. More specifically, our goal is to identify book chapters that are translations of the same original chapter; this task requires identification of not only the different topics in the documents but also the particular flow of these topics. We experiment with different representations employing n-grams of lexical chains and test these representations on a corpus of approximately 1000 chapters gathered from books with multiple parallel translations. Our representations include the cosine similarity of attribute vectors of n-grams of lexical chains, the cosine similarity of tf*idf-weighted keywords, and the cosine similarity of unweighted lexical chains (unigrams of lexical chains) as well as multiplicative combinations of the similarity measures produced by these approaches. Our results identify fourgrams of unordered lexical chains as a particularly useful representation for text similarity evaluation.
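The comparison of chapter representations above rests on cosine similarity between feature vectors. The sketch below shows that step in isolation, substituting plain word four-grams with TF-IDF weighting for the n-grams of lexical chains used in the paper (building the chains themselves, for example over WordNet, is omitted); the three short "chapters" are invented stand-ins for parallel translations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented stand-ins: a and b mimic two translations of the same chapter, c is unrelated.
chapter_a = "the prince rode out at dawn and crossed the frozen river"
chapter_b = "at dawn the prince rode out, crossing the river of ice"
chapter_c = "the merchant counted his coins in the crowded market square"

# Word four-grams with TF-IDF weighting, compared via cosine similarity.
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(4, 4))
X = vectorizer.fit_transform([chapter_a, chapter_b, chapter_c])
sims = cosine_similarity(X)
print(round(sims[0, 1], 3), round(sims[0, 2], 3))  # a vs b should score higher than a vs c
```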
|
6 |
A comparative study of data transformations for efficient XML and JSON data compression: an in-depth analysis of data transformation techniques, including tag and capital conversions, character and word N-gram transformations, and domain-specific data transforms using SMILES data as a case study
Scanlon, Shagufta Anjum, January 2015 (has links)
XML is a widely used data exchange format. The verbose nature of XML leads to the requirement to efficiently store and process this type of data using compression. Various general-purpose transforms and compression techniques exist that can be used to transform and compress XML data. Because of this verbosity, more compact alternatives to XML, notably JSON, have also been developed. Similarly, there is a requirement to efficiently store and process SMILES data used in Chemoinformatics. General-purpose transforms and compressors can be used to compress this type of data to a certain extent; however, these techniques are not specific to SMILES data. The primary contribution of this research is to provide developers who use XML, JSON or SMILES data with key knowledge of the best transformation techniques to use with certain types of data, and of which compression techniques would provide the best compressed output size and processing times, depending on their requirements. The main study in this thesis investigates the extent to which using data transforms prior to data compression can further improve the compression of XML and JSON data. It provides a comparative analysis of applying a variety of data transforms and transform variations to a number of different types of XML and equivalent JSON datasets of various sizes, and of applying different general-purpose compression techniques over the transformed data. A case study is also conducted to investigate whether data transforms applied prior to compression improve the compression of data within a specific domain.
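As a rough illustration of the transform-then-compress idea studied here, the sketch below applies one possible tag-conversion-style transform (mapping verbose XML tag names to short codes) before handing the data to two general-purpose compressors and comparing output sizes. Both the sample document and the exact transform are assumptions for demonstration, not the techniques evaluated in the thesis.

```python
import bz2
import gzip
import re

# Small synthetic XML document with repetitive, verbose tag names.
xml = "<orders>" + "".join(
    f"<order><customerName>c{i}</customerName><orderTotal>{i * 10}</orderTotal></order>"
    for i in range(200)
) + "</orders>"

def tag_transform(doc):
    """Map each distinct tag name to a short code to shrink the token stream."""
    tags = sorted(set(re.findall(r"</?([A-Za-z]+)", doc)))
    mapping = {tag: f"t{i}" for i, tag in enumerate(tags)}
    return re.sub(r"(</?)([A-Za-z]+)",
                  lambda m: m.group(1) + mapping[m.group(2)], doc)

# Compare compressed sizes with and without the transform.
for label, data in [("raw", xml), ("transformed", tag_transform(xml))]:
    raw = data.encode()
    print(label, "gzip:", len(gzip.compress(raw)), "bz2:", len(bz2.compress(raw)))
```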
|
7 |
A Comparative Study of Data Transformations for Efficient XML and JSON Data Compression. An In-Depth Analysis of Data Transformation Techniques, including Tag and Capital Conversions, Character and Word N-Gram Transformations, and Domain-Specific Data Transforms using SMILES Data as a Case Study
Scanlon, Shagufta A., January 2015 (has links)
XML is a widely used data exchange format. The verbose nature of XML leads to the requirement to efficiently store and process this type of data using compression. Various general-purpose transforms and compression techniques exist that can be used to transform and compress XML data. Because of this verbosity, more compact alternatives to XML, notably JSON, have also been developed.
Similarly, there is a requirement to efficiently store and process SMILES data used in Chemoinformatics. General-purpose transforms and compressors can be used to compress this type of data to a certain extent; however, these techniques are not specific to SMILES data.
The primary contribution of this research is to provide developers who use XML, JSON or SMILES data with key knowledge of the best transformation techniques to use with certain types of data, and of which compression techniques would provide the best compressed output size and processing times, depending on their requirements.
The main study in this thesis investigates the extent to which using data transforms prior to data compression can further improve the compression of XML and JSON data. It provides a comparative analysis of applying a variety of data transforms and transform variations to a number of different types of XML and equivalent JSON datasets of various sizes, and of applying different general-purpose compression techniques over the transformed data.
A case study is also conducted to investigate whether data transforms applied prior to compression improve the compression of data within a specific domain. / The files of software accompanying this thesis are unable to be presented online with the thesis.
|
8 |
Efektivní metody detekce plagiátů v rozsáhlých dokumentových skladech / Effective methods of plagiarism detection in large document repositories
Přibil, Jiří, January 2009 (has links)
The work focuses on the problem of plagiarism detection in large document repositories. It takes into account the real situation that currently needs to be addressed in the university environment in the Czech Republic and proposes a system able to carry out this analysis in real time while still capturing the widest possible range of plagiarism methods. The main contribution of this work is the definition of so-called unordered n-grams ({n}-grams), which can be used to detect some forms of advanced plagiarism. All of the recommendations concerning the various components of a plagiarism detection system (preprocessing a document before its insertion into the corpus, the representation of documents in document storage, the identification of potential sources of plagiarism and the calculation of similarity rates, and the visualization of the plagiarism analysis) are discussed and appropriately quantified. The result is a set of design parameters allowing the system to detect plagiarism in Czech-language documents quickly, accurately and in most of its forms.
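A minimal sketch of the unordered n-gram ({n}-gram) idea: within each window of n words the order is discarded, so local word reshuffling, a common disguise in advanced plagiarism, no longer hides the overlap. The Jaccard measure and the toy sentences are illustrative choices, not the parameters of the proposed system.

```python
def unordered_ngrams(text, n=3):
    """Return the set of word n-grams with the order inside each window discarded."""
    words = text.lower().split()
    return {tuple(sorted(words[i:i + n])) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Set overlap as a simple similarity rate between two n-gram sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

original   = "the quick brown fox jumps over the lazy dog"
plagiarism = "the brown quick fox jumps over the dog lazy"  # locally reordered words

# Ordinary ordered n-grams would miss most of this overlap; unordered ones keep it.
print(jaccard(unordered_ngrams(original), unordered_ngrams(plagiarism)))
```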
|
9 |
Neural Networks for the Web Services Classification
Silva, Jesús, Senior Naveda, Alexa, Solórzano Movilla, José, Niebles Núñez, William, Hernández Palma, Hugo, 07 January 2020 (has links)
This article introduces an n-gram-based approach to the automatic classification of Web services using a multilayer perceptron-type artificial neural network. Web services contain information that is useful for achieving a classification based on their functionality. The approach relies on word n-grams extracted from the web service description to determine its membership in a category. The experiments carried out show promising results, achieving a classification with an F-measure of 0.995 using word unigrams (features composed of a single lexical unit) weighted with TF-IDF.
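A hedged sketch of the described pipeline, word features weighted by TF-IDF feeding a multilayer perceptron, using scikit-learn; the two-category toy service descriptions replace the real Web service corpus, and the network size is an arbitrary illustrative choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Invented service descriptions; real experiments use a labelled Web service corpus.
descriptions = [
    "returns current weather forecast for a given city",
    "provides temperature and humidity readings by location",
    "converts an amount between two currencies using live rates",
    "returns the exchange rate for a currency pair",
]
categories = ["weather", "weather", "finance", "finance"]

# TF-IDF weighted word unigrams feeding a small multilayer perceptron.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 1)),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
)
model.fit(descriptions, categories)
print(model.predict(["daily forecast and humidity for any city"]))  # should lean towards 'weather'
```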
|
10 |
Automatic Analysis of Blend Words / Analyse automatique de mots mélangés
Warintarawej, Pattaraporn, 04 April 2013 (has links)
Blending parts of words may seem a surprising way to produce new linguistic forms, yet it has become a very common way of coining names for everyday life, brand names, and names used in software source code, for example alicament (aliment and médicament) or aspivenin (aspirer and venin). There are several ways of blending words to form new ones, which makes the resulting words difficult to analyse. In this thesis we propose an approach for the automatic analysis of the words evoked by a blend, based on top-k classification methods. We compare three ways of analysing the parts of a word: n-grams, syllables and morpho-phonological cells. We propose two syllable extraction algorithms as well as evaluation methods. The Enqualitum algorithm is proposed to identify the words evoked by the analysed word. Our proposal has been applied in particular to automatic analysis in software engineering, for which we propose the Sword algorithm to produce a relevant segmentation of the names appearing in programs. The experiments have demonstrated the interest of our proposals. / Lexical blending is remarkable in terms of morphological productivity, involving the coinage of a new lexeme by fusing parts of at least two source words. Since new things need new words, blending has become a frequent means of word creation, as in smog (smoke and fog) or alicament (aliment and médicament, a French blend word). The challenge is to design methods to discover how the first and second source words are combined. The thesis aims at the automatic analysis of blend words in order to find the source words they evoke. The contributions of the thesis can be divided into two main parts. First, as a contribution to automatic blend word analysis, we develop top-k classification and its evaluation framework to predict the concepts of blend words. We investigate three different word features: character N-grams, syllables and morpho-phonological stems. Moreover, we propose a novel approach to automatically identify blend source words, named Enqualitum. The experiments are conducted both on synthetic French blend words and on words from a French thesaurus. Second, as a contribution to software engineering applications, we apply the idea of learning character patterns of identifiers to predict the concepts of source code and also introduce a method to automate semantic context in source code. The experiments are conducted on real identifier names from open source software packages. The results show the usefulness and effectiveness of our proposed approaches.
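To make the character n-gram idea concrete, the sketch below scores candidate source words by the character n-grams they share with a blend and keeps the top-k candidates. It is only an illustration of the general approach, not the Enqualitum or Sword algorithms from the thesis; the vocabulary and the scoring function are invented.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word, padded with '_' to capture its edges."""
    padded = f"_{word}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def top_k_sources(blend, vocabulary, k=2, n=3):
    """Rank candidate source words by the share of their n-grams found in the blend."""
    blend_grams = char_ngrams(blend, n)
    scored = [
        (len(blend_grams & char_ngrams(word, n)) / len(char_ngrams(word, n)), word)
        for word in vocabulary
    ]
    return [word for score, word in sorted(scored, reverse=True)[:k]]

# Invented candidate vocabulary for the classic blend "smog" (smoke + fog).
vocabulary = ["smoke", "fog", "fox", "smile", "frog", "aliment", "medicament"]
print(top_k_sources("smog", vocabulary))  # expected candidates: ['smoke', 'fog']
```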
|