91

Rétablir la confiance dans les messages électroniques : Le traitement des causes du "spam" / Restoring confidence in electronic mails

Laurent-Ricard, Eric 09 December 2011
The growing use of email in paperless exchanges, by both businesses and individuals, together with the rising volume of unwanted messages known as "spam" (junk email), wastes significant time on manual processing and undermines confidence both in the information transmitted and in the senders of these messages. What solutions can restore or establish confidence in these exchanges? How can the growing volume of spam be handled and reduced? Existing solutions are sometimes cumbersome to implement or relatively ineffective, and they mostly treat the effects of spam while neglecting to analyze and address its causes. Identifying, if not authenticating, the sender and recipients is one key to validating the origin of a message and guaranteeing its content, as well as providing a significant level of traceability. It is not the only one, however: the basic mechanisms of email itself, specifically its communication protocols, are also at stake. This thesis focuses on the possibilities offered by modifying certain Internet protocols, in particular SMTP, by implementing rarely used specifications, and by tools and methods that could guarantee the identification of the parties simply and transparently for users. The objective is, first, to define a methodology for using email that ensures reliability and confidence and, second, to lay out the logical foundations of client and server programs that put this methodology into practice.
92

SPAM = do surgimento à extinção / SPAM : from the rise to the extinction

Almeida, Tiago Agostinho de 09 October 2010
Advisor: Akedo Yamakami / Doctoral thesis, Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação / Made available in DSpace on 2018-08-16. Previous issue date: 2010 / Abstract: Spam has become an increasingly important problem with a large economic impact on society. Spam filtering poses a special challenge in text categorization: its defining characteristic is that filters face an active adversary who constantly attempts to evade filtering. This thesis presents a comprehensive study of the spamming problem. Its contributions include: a statistical and historical survey of spamming and its consequences; a study of the legality of spam and the legal measures adopted by some countries; the study and proposal of new performance measures for evaluating spam classifiers; a study of the methods most commonly used for spam filtering; proposals for improving the accuracy of Naive Bayes filters through dimensionality reduction techniques; and, most importantly, a novel approach to spam filtering based on the minimum description length principle and confidence factors. Several experiments are presented, and the results indicate that the proposed classifier outperforms the best anti-spam filters available both commercially and in the literature. / Doctorate / Automation / Doctor of Electrical Engineering
93

Automatická identifikace šablony generující spam kampaně / Automatic Template Pattern Recognition

Kovařík, David January 2018
Spam typically does not occur as isolated messages; it is often grouped into so-called campaigns, which are generated automatically from templates. As a result, the individual messages are semantically, but not syntactically, equivalent. The goal of this thesis is to design an algorithm that, given a set of messages from a single campaign, can recover the template from which those messages were generated. The thesis focuses on spam in SMS communication, but the proposed techniques are general enough for wider use. The algorithm is built on pairwise sequence alignment, a method used in bioinformatics to find similar regions of protein chains. The output is a regular expression describing the template of the given campaign. The solution also includes a tool for visualizing the template in HTML. The solution was verified on approximately three hundred real campaigns from around the world. In the vast majority of cases, the result is sufficient to identify the campaign.
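As a rough illustration of recovering a template from aligned messages, the sketch below aligns two messages token by token and emits a regular expression in which shared tokens are literals and differing regions become wildcard slots. It is not the thesis's algorithm: Python's standard `difflib.SequenceMatcher` stands in for the bioinformatics-style alignment, the function name `template_regex` and the sample messages are hypothetical, and only two messages are aligned rather than a whole campaign.

```python
import re
from difflib import SequenceMatcher


def template_regex(msg_a: str, msg_b: str) -> str:
    """Align two messages on whitespace tokens and return a regex for their shared template.

    Assumption: difflib's matching-block alignment is used here in place of
    a dynamic-programming sequence alignment.
    """
    a, b = msg_a.split(), msg_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    parts = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            # Tokens shared by both messages become literal parts of the template.
            parts.extend(re.escape(tok) for tok in a[i1:i2])
        elif not parts or parts[-1] != r".+?":
            # Any differing region becomes one variable slot.
            # Limitation: the slot requires at least one character, so a region
            # that is empty in one message (pure insert/delete) will not match it.
            parts.append(r".+?")
    return r"\s+".join(parts)


pattern = template_regex(
    "Hi John your code is 1234",
    "Hi Mary Ann your code is 9876",
)
print(pattern)
print(bool(re.fullmatch(pattern, "Hi Bob your code is 0000")))
```

Aligning on tokens rather than characters keeps the slots at word boundaries, which tends to give more readable templates for short texts such as SMS; handling a full campaign would require folding further messages into the pattern one at a time.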
94

System för att upptäcka Phishing : Klassificering av mejl

Karlsson, Nicklas January 2008
This report examines the phishing problem, which many people have encountered through, for example, the fake Nordea or eBay e-mails that have recently appeared in our inboxes, and a possible way to reduce the effect of phishing. The report focuses on e-mail classification, and its main question is: “Is it possible, with high accuracy, to use a classification tool to separate phishing e-mails from other spam e-mails?” Finding phishing e-mails to use in the classification proved more difficult than expected. The classification tests showed that both Naive Bayes and Support Vector Machine can find up to 100 % of the phishing e-mails. The report presents the working process, background on phishing, and the results of the classification tests.
95

Filtragem automática de opiniões falsas: comparação compreensiva dos métodos baseados em conteúdo / Automatic filtering of false opinions: comprehensive comparison of content-based methods

Cardoso, Emerson Freitas 04 August 2017
Made available in DSpace on 2017-10-09. Previous issue date: 2017-08-04 / No funding received / Before buying a product or choosing a travel destination, people often seek other people’s opinions to gauge the quality of what they want to acquire. Opinions have therefore always had great influence on purchase decisions. As the Internet advanced and the volume of data traffic grew, social networks emerged that let users post and view all kinds of information, which led people to search for opinions on the Web as well. Sites such as TripAdvisor and Yelp make it easy to share online reviews: users can post their opinions from anywhere via smartphones, and product manufacturers and service providers can gather relevant feedback quickly and in a centralized way. As a result, studies indicate that most users now trust personal recommendations as much as online reviews. However, competition between service providers and product manufacturers has also increased on social media, leading to the first cases of spam reviews: deceptive opinions published by hired people who try to promote or defame products or businesses. These reviews are carefully written to look authentic, which makes them difficult to detect for humans and automatic methods alike. They are thus used to manipulate general opinion in a misleading way, and they can cause financial harm to business owners and users. Several approaches have been proposed for spam review detection, most of them based on machine learning and natural language processing. Despite the progress made, relevant questions remain open and require careful analysis to be answered properly. For instance, there is no consensus on whether the performance of traditional classification methods is affected by incremental learning or by changes in review features over time, nor on whether the performances of content-based classification methods differ statistically. This work offers a comprehensive comparison of traditional machine learning methods applied to spam review detection. The comparison is made in multiple setups, employing different types of learning and data sets. The experiments, together with statistical analysis of the results, help provide appropriate answers to these open questions. In addition, the results can serve as a baseline for future comparisons.
97

Spam Analysis and Detection for User Generated Content in Online Social Networks

Tan, Enhua 23 July 2013
No description available.
98

Prediction games : machine learning in the presence of an adversary

Brückner, Michael January 2012
In many applications one faces the problem of inferring a functional relation between input and output variables from given data. Consider, for instance, email spam filtering, where one seeks a model that automatically assigns new, previously unseen emails to the class spam or non-spam. Building such a predictive model from observed training inputs (e.g., emails) with corresponding outputs (e.g., spam labels) is a major goal of machine learning. Many learning methods assume that the training data are governed by the same distribution as the test data to which the predictive model will be exposed at application time. That assumption is violated when the test data are generated in response to the presence of a predictive model. Email spam filtering illustrates this: email service providers employ spam filters, and spam senders engineer campaign templates to achieve a high rate of successful deliveries despite the filters. Most existing work casts such situations as learning robust models that are insensitive to small changes in the data generation process. These models are constructed under the worst-case assumption that the changes are chosen to produce the greatest possible adverse effect on the performance of the predictive model. However, this approach cannot realistically model the true dependency between the model-building process and the process of generating future data. We therefore establish the concept of prediction games: we model the interaction between a learner, who builds the predictive model, and a data generator, who controls the process of data generation, as a one-shot game. The game-theoretic framework lets us explicitly model the players' interests, their possible actions, their level of knowledge about each other, and the order in which they decide on an action. We model each player's interest as minimizing that player's own cost function, which depends on both players' actions. The learner's action is to choose the model parameters; the data generator's action is to perturb the training data, which reflects how the data generation process is modified relative to the past data. We extensively study three instances of prediction games, which differ in the order in which the players decide on their actions. We first assume that both players choose their actions simultaneously, that is, without knowledge of their opponent's decision. We identify conditions under which this Nash prediction game has a meaningful solution, that is, a unique Nash equilibrium, and derive algorithms that find the equilibrium prediction model. As a second case, we consider a data generator who is potentially fully informed about the move of the learner. This setting establishes a Stackelberg competition. We derive a relaxed optimization criterion to determine the solution of this game and show that this Stackelberg prediction game generalizes existing prediction models. Finally, we study the setting in which the learner observes the data generator's action, that is, the (unlabeled) test data, before building the predictive model. Because the test data and the training data may be governed by differing probability distributions, this scenario reduces to learning under covariate shift. We derive a new integrated method as well as a two-stage method to account for this data set shift. In case studies on email spam filtering, we empirically explore the properties of all derived models and of several existing baseline methods. We show that spam filters resulting from the Nash prediction game and from the Stackelberg prediction game outperform the existing baseline methods in the majority of cases.
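To make the idea of a unique Nash equilibrium between a learner and a data generator concrete, here is a toy sketch. Everything in it is an illustrative assumption and none of it is taken from the thesis: each player controls a single scalar, the two quadratic cost functions are invented for the example, and the equilibrium is found by simple simultaneous best-response iteration rather than by the algorithms the thesis derives.

```python
def learner_best_response(d: float) -> float:
    """Minimize the hypothetical learner cost (w - d)**2 + 0.5 * w**2 over w.

    Setting the derivative 3*w - 2*d to zero gives w = 2*d / 3.
    """
    return 2.0 * d / 3.0


def generator_best_response(w: float) -> float:
    """Minimize the hypothetical generator cost (d - 1)**2 + w * d over d.

    Setting the derivative 2*(d - 1) + w to zero gives d = 1 - w / 2.
    """
    return 1.0 - w / 2.0


def find_equilibrium(tol: float = 1e-12, max_iter: int = 1000) -> tuple[float, float]:
    """Iterate simultaneous best responses until neither player wants to move."""
    w, d = 0.0, 0.0
    for _ in range(max_iter):
        # Both players respond to the opponent's previous action (simultaneous move).
        w_new = learner_best_response(d)
        d_new = generator_best_response(w)
        if abs(w_new - w) < tol and abs(d_new - d) < tol:
            return w_new, d_new
        w, d = w_new, d_new
    raise RuntimeError("best-response iteration did not converge")


w_star, d_star = find_equilibrium()
print(w_star, d_star)
```

Because each toy cost is strictly convex in the player's own action and the combined best-response map is a contraction, the iteration settles at the single point where w = 2d/3 and d = 1 - w/2 hold together, namely w = 0.5 and d = 0.75. Real prediction games involve high-dimensional parameters and data, where such convergence cannot be taken for granted.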
99

TubeSpam: Filtragem Automática de Comentários Indesejados Postados no YouTube / TubeSpam: automatic undesired comments filtering on YouTube

Alberto, Túlio Casagrande 03 February 2017
Made available in DSpace on 2017-10-03. Previous issue date: 2017-02-03 / Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / YouTube has become an important video sharing platform. Many users regularly produce video content and make this work their main livelihood. However, this success is also drawing the attention of malicious users, who spread undesired comments and videos to promote themselves or to disseminate malicious links that may carry malware and viruses. Because YouTube offers limited tools for blocking spam, the volume of such messages is growing sharply and harming users and channel owners. Besides being an inherently online problem, comment spam filtering on YouTube differs from traditional email spam filtering because the messages are very short and often rife with spelling errors, slang, symbols, and abbreviations, which can make classification harder. This dissertation presents a performance evaluation of traditional online classification methods, aided by lexical normalization and semantic indexing techniques, when applied to automatically filtering YouTube comment spam. The performance of MDLText, a promising text classification method based on the minimum description length principle, was also evaluated. Statistical analysis of the results indicates that MDLText, Passive-Aggressive, Naïve Bayes, MDL, and Online Gradient Descent obtained statistically equivalent performance. The results also indicate that lexical normalization and semantic indexing are effective at mitigating text representation problems and, consequently, at increasing the predictive power of the classification methods. Based on these results, TubeSpam, an online tool for automatically filtering undesired comments posted on YouTube, was proposed and developed.
100

E‐Shape Analysis

Sroufe, Paul 12 1900
The motivation of this work is to understand E-shape analysis and how it can be applied to various classification tasks. Its key feature is that it considers not what information is contained, but how that information looks. This gives E-shape analysis the ability to be language independent and, to some extent, size independent. In this thesis, I present a new mechanism for characterizing an email without using content or context, called E-shape analysis for email. I explore the applications of the email shape through a case study, botnet detection, and two possible applications: spam filtering and social-context-based fingerprinting. In the second part of this thesis, I apply E-shape analysis to human activity recognition. Using the Android platform and a T-Mobile G1 phone, I collect data from the triaxial accelerometer and use it to classify the motion behavior of a subject.
