1 |
Discourse-givenness of noun phrases: theoretical and computational models
Ritz, Julia, January 2013
This thesis gives formal definitions of discourse-givenness, coreference and reference, and reports on experiments with computational models of discourse-givenness of noun phrases for English and German.
Definitions are based on Bach's (1987) work on reference, Kibble and van Deemter's (2000) work on coreference, and Kamp and Reyle's Discourse Representation Theory (1993).
For the experiments, the following corpora with coreference annotation were used: MUC-7, OntoNotes and ARRAU for English, and TüBa-D/Z for German. The classification algorithms covered are J48 decision trees, the rule-based learner Ripper, and linear support vector machines. New features are proposed that represent the noun phrase's specificity as well as its context; they lead to a significant improvement in classification quality.
|
2 |
Significant Feature Clustering
Whissell, John, January 2006
In this thesis, we present a new clustering algorithm we call Significance Feature Clustering (SFC), which is designed to cluster text documents. Its central premise is the mapping of raw frequency count vectors to discrete-valued significance vectors which contain values of -1, 0, or 1. These values represent whether a word is significantly negative, neutral, or significantly positive, respectively. Initially, standard tf-idf vectors are computed from raw frequency vectors; these tf-idf vectors are then transformed to significance vectors using a parameter alpha, which controls whether each vector entry is mapped to -1, 0, or 1. SFC clusters agglomeratively: each document's significance vector initially represents a cluster of size one containing just that document, and SFC iteratively merges the two clusters whose averages are most similar under cosine similarity. We show that, with a good alpha value, the significance vectors produced by SFC provide an accurate indication of which words are significant to which documents, as well as the type of significance, and therefore yield a good clustering in terms of a well-known definition of clustering quality. We further demonstrate that a user need not manually select an alpha, as we develop a new definition of clustering quality that is highly correlated with text clustering quality. Our metric extends the family of metrics known as internal similarity so that it can be applied to a tree of clusters rather than a set, and it also factors in an aspect of recall that was absent from previous internal similarity metrics. Using this new definition of internal similarity, which we call maximum tree internal similarity, we show that a close to optimal text clustering may be picked from any number of clusterings created with different alpha values.
The quality of the automatically selected clusterings is close to that of a well-known and powerful hierarchical clustering algorithm.
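The two steps described in this abstract (mapping tf-idf vectors to significance vectors, then merging clusters by the cosine similarity of their averages) can be sketched roughly as follows. The abstract does not say how alpha decides between -1, 0 and 1, so the rule used here (an entry is significant when it lies more than `alpha` standard deviations from the vector's mean) is only an assumption, and the function names are hypothetical.

```python
import math

def significance_vector(tfidf, alpha):
    """Map a tf-idf vector to entries in {-1, 0, 1}.

    Assumption (not from the abstract): an entry is significant when it lies
    more than `alpha` standard deviations from the mean of its vector.
    """
    n = len(tfidf)
    mean = sum(tfidf) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in tfidf) / n)
    if std == 0:
        return [0] * n
    result = []
    for x in tfidf:
        z = (x - mean) / std
        result.append(1 if z > alpha else -1 if z < -alpha else 0)
    return result

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def average(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cluster(sig_vectors, k):
    """Merge the two clusters with the most similar averages until k remain."""
    clusters = [[i] for i in range(len(sig_vectors))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a = average([sig_vectors[d] for d in clusters[i]])
                b = average([sig_vectors[d] for d in clusters[j]])
                s = cosine(a, b)
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

tfidf_docs = [
    [0.9, 0.1, 0.1, 0.1],
    [0.8, 0.1, 0.2, 0.1],
    [0.1, 0.1, 0.9, 0.1],
    [0.1, 0.2, 0.8, 0.1],
]
sig = [significance_vector(v, alpha=1.0) for v in tfidf_docs]
print(cluster(sig, k=2))  # [[0, 1], [2, 3]]
```

This stops at a fixed number of clusters for brevity; the thesis instead builds a tree of clusters and selects among alpha values with its maximum tree internal similarity metric, which is not reproduced here.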
|
4 |
Tag Clouds para investigadores de Ciencias de la Computación / Tag Clouds for Computer Science Researchers
Ríos Araya, Paula Andrea, January 2018
Thesis submitted for the degree of Ingeniera Civil en Computación / Today there are millions of publications by researchers in different areas of Computer Science, and the number keeps growing every day. A researcher's profile on websites such as DBLP or Google Scholar lists their publications. However, from this information alone it is hard to grasp at a glance which topics interest each researcher, something that may be needed when academics collaborate with one another or with students.
This work aims to make summarized information about the research topics of Computer Science academics readily available by generating visualizations such as word clouds, or tag clouds, from the keywords and key phrases mentioned in publications found in online bibliographic repositories such as those mentioned above.
The system developed in this thesis is a tool that creates tag clouds for DBLP profiles. The tool retrieves the publications listed in the profile, extracts potential keywords, and selects the most relevant keywords according to four ranking models. One tag cloud variant is created for each of these models. In addition, a website is created that lets any user run the tool.
The work focuses mainly on investigating learning-to-rank models and comparing their performance on the task of identifying the most relevant keywords for a Computer Science researcher. Since there are three distinct approaches to the ranking task, four learning-to-rank models are used, with at least one per approach: linear regression, RankSVM, LambdaMART and AdaRank.
Evaluations of the tag clouds created by the tool show no absolute preference for one method over the others; preference varies from person to person, but in most cases at least one of the generated tag clouds receives the maximum score. This may be because the models tend to differ in emphasis, in some cases selecting more technical keywords and in others more generic ones, so the appreciation of one method over another is affected by each person's preferences. We conclude that it is important to let users choose among different variants.
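As a rough illustration of the last stage of such a pipeline, the sketch below scores candidate keywords with a linear model (the simplest of the four model types named above) and turns the top-ranked ones into a tag cloud. The features, weights, font-size range and function names are all hypothetical; the abstract does not specify them.

```python
def linear_score(features, weights):
    """Relevance score of one keyword as a weighted sum of its features."""
    return sum(w * x for w, x in zip(weights, features))

def tag_cloud(scores, k=3, min_size=10, max_size=40):
    """Return the top-k keywords with font sizes scaled linearly by score."""
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = highest - lowest
    cloud = []
    for word, score in top:
        fraction = (score - lowest) / span if span else 1.0
        cloud.append((word, round(min_size + fraction * (max_size - min_size))))
    return cloud

# Hypothetical features per keyword: [papers mentioning it, title occurrences].
candidates = {
    "learning to rank": [4, 3],
    "information retrieval": [4, 2],
    "paper": [6, 0],
    "tag cloud": [2, 1],
}
weights = [1, 2]  # hypothetical weights, as if learned by a regression model
scores = {kw: linear_score(f, weights) for kw, f in candidates.items()}
print(tag_cloud(scores))
# [('learning to rank', 40), ('information retrieval', 25), ('paper', 10)]
```

Swapping in a different ranking model only changes how `scores` is produced, which is one way to obtain one tag cloud variant per model.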
|
5 |
Extraction automatique de caractéristiques malveillantes et méthode de détection de malware dans un environnement réel / Automatic extraction of malicious features and method for detecting malware in a real environment
Angoustures, Mark, 14 December 2018
To cope with the large volume of malware, security researchers have developed automatic dynamic malware analysis tools such as the Cuckoo sandbox. This kind of analysis is only partially automatic because it requires a human security expert to detect and extract suspicious behaviors. To avoid this tedious work, we propose a methodology for automatically extracting the dangerous behaviors reported by sandboxes. First, we generate activity reports for malware with the Cuckoo sandbox. Then, we group malware belonging to the same family using the Avclass algorithm, which aggregates the malware labels given by VirusTotal. We then use TF-IDF to weight the most distinctive behaviors of each malware family obtained in the previous step. Finally, we aggregate malware families with similar behaviors using the LSA method.
In addition, we detail a method for detecting malware based on the same type of behaviors. Since this detection is performed in a real environment, we have developed probes that continuously generate traces of the behavior of running programs. From these traces, we build a graph that represents the tree of running programs together with their behaviors. This graph is updated incrementally as new traces are generated. To measure the dangerousness of programs, we run the personalized PageRank algorithm on this graph each time it is updated. The algorithm ranks processes by dangerousness according to their suspicious behaviors. These scores are then plotted as a time series to visualize how the dangerousness score of each program evolves. Finally, we have developed several alert indicators for dangerous programs running on the system.
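The scoring step can be illustrated with a small power-iteration version of personalized PageRank. This is only a sketch under assumptions that do not come from the abstract: the graph layout (edges from a behavior node to the process exhibiting it, and from a child process to its parent), the node names, and the choice to put all teleport weight on suspicious behaviors are hypothetical.

```python
def personalized_pagerank(edges, personalization, damping=0.85, iterations=50):
    """Power iteration for PageRank with a non-uniform teleport distribution."""
    total = sum(personalization.values())
    if total <= 0:
        raise ValueError("personalization must contain positive weight")
    nodes = sorted({n for edge in edges for n in edge} | set(personalization))
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
    teleport = {n: personalization.get(n, 0.0) / total for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            if successors[n]:
                share = damping * rank[n] / len(successors[n])
                for m in successors[n]:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the teleport distribution.
                for m in nodes:
                    nxt[m] += damping * rank[n] * teleport[m]
        rank = nxt
    return rank

# Hypothetical trace graph: behavior -> process, child process -> parent process.
edges = [
    ("write_autorun_key", "dropper.exe"),
    ("inject_code", "dropper.exe"),
    ("dropper.exe", "explorer.exe"),
    ("notepad.exe", "explorer.exe"),
]
suspicious = {"write_autorun_key": 1.0, "inject_code": 1.0}
scores = personalized_pagerank(edges, suspicious)
for name in ("dropper.exe", "notepad.exe"):
    print(name, round(scores[name], 3))
```

In this toy graph `notepad.exe` is reachable from no suspicious behavior and therefore keeps a score of zero, while `dropper.exe` accumulates score from the two behaviors attached to it. Re-running the function after each graph update and recording the result would give the kind of time series the abstract mentions.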
|
6 |
A Framework for Evaluating Recommender Systems
Bean, Michael Gabriel, 01 December 2016
Prior research on text collections of religious documents has demonstrated that viable recommender systems in the area are lacking, if not non-existent, for some datasets. For example, both www.LDS.org and scriptures.byu.edu are websites designed for religious use. Although they provide users with the ability to search for documents based on keywords, they do not provide the ability to discover documents based on similarity. Consequently, these systems would greatly benefit from a recommender system. This work provides a framework for evaluating recommender systems that is flexible enough for use with either website. Such a framework would identify the best recommender system, one that gives users another way to explore and discover documents related to their current interests, given a starting document. The framework created for this thesis, RelRec, is attractive because it compares two different recommender systems. Documents are considered relevant if they are among the nearest neighbors, where "nearest" is defined by a particular system's similarity formula. We use RelRec to compare the output of two particular recommender systems on our selected data collection. RelRec shows that the LDA recommender outperforms the TF-IDF recommender in terms of coverage, making it preferable for LDS-based document collections.
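The nearest-neighbor notion of relevance described above can be sketched with a recommender that takes the similarity formula as a parameter, so that two systems (for example one using TF-IDF vectors and one using LDA topic vectors) can be compared on the same collection. The abstract does not define coverage; the definition below (the fraction of documents that appear in at least one recommendation list) is an assumption, and none of these names come from RelRec itself.

```python
import math

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(vectors, doc_id, similarity, k=1):
    """Return the k documents nearest to doc_id under the given similarity."""
    others = [d for d in vectors if d != doc_id]
    others.sort(key=lambda d: similarity(vectors[doc_id], vectors[d]), reverse=True)
    return others[:k]

def coverage(vectors, similarity, k=1):
    """Assumed definition: share of documents recommended at least once."""
    recommended = set()
    for doc_id in vectors:
        recommended.update(recommend(vectors, doc_id, similarity, k))
    return len(recommended) / len(vectors)

# Toy document vectors; in practice these would be TF-IDF or LDA representations.
vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
    "d": [0.0, 0.9, 0.1],
    "e": [0.0, 0.0, 1.0],
}
print(recommend(vectors, "a", cosine))  # ['b']
print(coverage(vectors, cosine))        # 0.8
```

Document "e" is never anyone's nearest neighbor, which is why coverage falls below 1.0 in this toy collection.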
|
7 |
Semantic Text Matching Using Convolutional Neural Networks
Wang, Run Fen, January 2018
Semantic text matching is a fundamental task for many applications in Natural Language Processing (NLP). Traditional methods that use term frequency-inverse document frequency (TF-IDF) to match exact words in documents have one strong drawback: TF-IDF is unable to capture semantic relations between closely related words, which leads to disappointing matching results. Neural networks have recently been used for various applications in NLP and have achieved state-of-the-art performance on many tasks. Recurrent Neural Networks (RNNs) have been tested on text classification and text matching but did not achieve remarkable results, because RNNs work more effectively on short texts than on long documents. In this paper, Convolutional Neural Networks (CNNs) are applied to match texts semantically. The model takes the word embedding representations of two texts as inputs to the CNN, extracts the semantic features between the two texts, and outputs a score indicating how certain the CNN model is that they match. The results show that, after some tuning of the parameters, the CNN model can produce accuracy, precision, recall and F1-scores all over 80%. This is a great improvement over the previous TF-IDF results, and further improvements could be made by using dynamic word vectors, pre-processing the data better, generating larger and more feature-rich data sets, and tuning the parameters further.
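The exact-match drawback mentioned in this abstract is easy to reproduce. The sketch below compares two texts by the cosine similarity of their raw word-count vectors; the function name and example sentences are made up for illustration. IDF weighting would rescale each word's dimension but cannot create overlap between different words, so the same effect holds for TF-IDF vectors.

```python
import math
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0

# Same meaning, no shared words: an exact-match model sees no similarity at all.
print(bow_cosine("cheap automobile rental", "inexpensive car hire"))  # 0.0
```

Approaches built on word embeddings, such as the CNN model in this thesis, aim to give pairs like this a non-zero score because related words have nearby vectors.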
|
8 |
Propojení sociální sítě Twitter s televizním vysíláním / Twitter Connection with a TV Broadcast
Fiala, Marek, January 2018
This master's thesis focuses on a possible connection between DVB digital television broadcasting and the Twitter social network. The target platform is Hybrid Broadcast Broadband TV (HbbTV), which combines television broadcasting with data received over broadband. The system consists of an HbbTV application and a server that connects the application to Twitter and searches for additional data about the current video content. Using the resulting solution in real television broadcasting could potentially increase the number of tweets related to the broadcast, raise awareness of HbbTV technology and attract a younger generation of viewers. All of this could result in a slightly increased number of viewers.
|
9 |
Zlepšení předpovědi sociálních značek využitím Data Mining / Improved Prediction of Social Tags Using Data Mining
Harár, Pavol, January 2015
This master's thesis deals with using text mining as a method to predict the tags of articles. It describes an iterative way of handling big data files, parsing and cleaning the data, and scoring the terms in an article using TF-IDF. It describes in detail the flow of a program written in the Python 3.4.3 programming language. The result of processing more than 1 million articles from the Wikipedia database is a dictionary of English terms. Using this dictionary, one can determine the most important terms of an article in a corpus of articles. The relevance of the resulting tags supports the method used in this case.
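A minimal sketch of the scoring idea, assuming a two-pass design: one pass streams over the articles to build a dictionary of document frequencies, and a second step scores the terms of a single article with TF-IDF to pick candidate tags. The tokenizer, the exact TF-IDF variant and all names are assumptions; the thesis's actual program is not shown in the abstract.

```python
import math
from collections import Counter

def tokenize(text):
    """Very rough cleaning: lowercase and keep purely alphabetic tokens."""
    return [word for word in text.lower().split() if word.isalpha()]

def document_frequencies(articles):
    """Stream over articles one at a time, counting articles per term."""
    df = Counter()
    n_docs = 0
    for text in articles:  # works with a generator, so the corpus need not fit in memory
        df.update(set(tokenize(text)))
        n_docs += 1
    return df, n_docs

def top_tags(text, df, n_docs, k=1):
    """Score each term by tf * log(N / df) and return the k best as tags."""
    tf = Counter(tokenize(text))
    total = sum(tf.values())
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
        if df[word]
    }
    return sorted(scores, key=lambda word: (-scores[word], word))[:k]

articles = [
    "the python language is a programming language",
    "the cobra is a snake",
    "the violin is a string instrument",
]
df, n_docs = document_frequencies(articles)
print(top_tags(articles[0], df, n_docs))  # ['language']
```

Words that occur in every article ("the", "is", "a") get a score of zero, so they never surface as tags.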
|
10 |
Pokročilý porovnávač produktov / Advanced Product Comparator
Prexta, Dávid, January 2019
This thesis deals with the problem of mining structured information about product features from open text, using open information extraction. These features are intended to make it easier for customers to choose a product. The thesis first reviews existing solutions and their shortcomings and analyses available systems for open information extraction. It then discusses the theoretical background and the technology used to build the system, the design of the system itself, and its implementation. Finally, it describes the testing of the system, the results, and extensions that could be implemented in the future.
|