101

Feature selection and clustering for malicious and benign software characterization

Chhabra, Dalbir Kaur R 13 August 2014 (has links)
Malware, or malicious code, is designed to gather sensitive information without the knowledge or permission of users, or to damage files in a computer system. As the use of computer systems and the Internet grows, so does the threat of malware. Moreover, the increasing volume of data makes it harder to determine whether an executable is malicious or benign. Hence, we have devised a method that collects features from the portable executable (PE) file format using static malware analysis. We also optimize the most useful features by either normalizing or weighting them. Furthermore, we compare the accuracy of various unsupervised learning algorithms for clustering a large dataset of samples. Once the clusters are created, an antivirus (AV) engine can scan one or two files per cluster: if they are detected, all files in the cluster are considered malicious, even those containing novel or unknown malware; otherwise, all are considered benign.
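The cluster-then-scan idea above can be sketched with a toy k-means over hypothetical, already-normalized PE-derived feature vectors (e.g. section entropy, import count). This is an illustrative sketch, not the thesis's exact feature set or algorithm.

```python
# Toy k-means over hypothetical normalized PE-derived feature vectors.
# Feature meanings and values are illustrative assumptions.

def kmeans(points, k, iters=20):
    """Simple k-means: returns a cluster index for each point."""
    centroids = points[:k]  # naive init: first k points
    assign = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # update step: recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# two obvious groups of feature vectors; scanning one file per cluster
# with an AV engine would then label the whole cluster
features = [[0.9, 0.8], [0.85, 0.9], [0.1, 0.2], [0.15, 0.1]]
labels = kmeans(features, k=2)
```

With the clusters in hand, the AV verdict on a single representative propagates to every file in its cluster, as the abstract describes.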
102

Co-reference resolution in multiple documents through unsupervised learning

Silva, Jefferson Fontinele da 05 May 2011 (has links)
One of the problems found in Natural Language Processing (NLP) systems is the difficulty of identifying textual elements that refer to the same entity. This phenomenon, in which a set of textual elements refers to a single entity, is called coreference. Coreference resolution systems can improve the performance of various NLP applications, such as automatic summarization, information extraction, and question answering. Recently, research in NLP has explored the possibility of identifying coreferent elements across multiple documents. In this context, this work focuses on the development of an unsupervised method for coreference resolution in multiple documents, using Portuguese as the target language. To date, no system for this purpose is known for Portuguese. The results of the experiments with the system suggest that the developed method is superior to methods based on string matching.
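The string-matching baseline that the thesis compares against can be sketched as follows: mentions that share a (crudely chosen) head word are grouped into the same chain. The head-word heuristic and example mentions are illustrative assumptions, not the thesis's implementation.

```python
# Minimal string-matching coreference baseline: mentions sharing a head
# word (here, naively, the last token) are grouped into one chain.

def normalize(mention):
    return mention.lower().strip()

def head_word(mention):
    # crude head heuristic: last token of the mention (an assumption)
    return normalize(mention).split()[-1]

def chain_by_string(mentions):
    chains = {}
    for m in mentions:
        chains.setdefault(head_word(m), []).append(m)
    return list(chains.values())

mentions = ["Barack Obama", "Obama", "the president", "The President"]
chains = chain_by_string(mentions)
```

A baseline like this misses chains such as {"Obama", "the president"}, which is exactly the kind of coreference an unsupervised learned model aims to recover.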
103

Reducing Wide-Area Satellite Data to Concise Sets for More Efficient Training and Testing of Land-Cover Classifiers

Tommy Y. Chang (5929568) 10 June 2019 (has links)
Obtaining an accurate estimate of a land-cover classifier's performance over a wide geographic area is a challenging problem due to the need to generate the ground truth that covers the entire area that may be thousands of square kilometers in size. The current best approach constructs a testing dataset by drawing samples randomly from the entire area --- with a human supplying the true label for each such sample --- with the hope that the selections thus made statistically capture all of the data diversity in the area. A major shortcoming of this approach is that it is difficult for a human to ensure that the information provided by the next data element chosen by the random sampler is non-redundant with respect to the data already collected. In order to reduce the annotation burden, it makes sense to remove any redundancies from the entire dataset before presenting its samples to a human for annotation. This dissertation presents a framework that uses a combination of clustering and compression to create a concise-set representation of the land-cover data for a large geographic area. Whereas clustering is achieved by applying Locality Sensitive Hashing (LSH) to the data elements, compression is achieved through choosing a single data element to represent a given cluster. This framework reduces the annotation burden on the human and makes it more likely that the human would persevere during the annotation stage. We validate our framework experimentally by comparing it with the traditional random sampling approach using WorldView2 satellite imagery.
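The cluster-then-compress framework can be sketched with random-hyperplane LSH: similar feature vectors tend to fall into the same hash bucket, and one element per bucket forms the concise set presented to the annotator. The hash width, dimensionality, and data are illustrative choices, not the dissertation's configuration.

```python
import random

# Sketch of the concise-set idea: random-hyperplane LSH buckets similar
# vectors; one representative per bucket is kept for annotation.

def lsh_signature(vec, hyperplanes):
    """Sign of the dot product with each random hyperplane -> bit tuple."""
    return tuple(
        int(sum(v * h for v, h in zip(vec, plane)) >= 0) for plane in hyperplanes
    )

def concise_set(vectors, n_planes=8, dim=3, seed=0):
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = {}
    for v in vectors:
        # keep the first vector seen in each bucket as its representative
        buckets.setdefault(lsh_signature(v, planes), v)
    return list(buckets.values())

data = [[1.0, 0.0, 0.0], [1.01, 0.0, 0.01],   # near-duplicates
        [0.0, 1.0, 0.0], [-1.0, 0.0, 0.5]]
reps = concise_set(data)
```

Near-duplicate vectors usually share a signature and collapse to one representative, which is how the annotation burden shrinks without discarding data diversity.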
104

A comparative study of social bot classification techniques

Örnbratt, Filip, Isaksson, Jonathan, Willing, Mario January 2019 (has links)
With social media rising in popularity over recent years, so-called social bots are infiltrating platforms, spamming and manipulating people all over the world. Many methods have been proposed to address this problem, with varying success. This study compares some of these methods on a dataset of Twitter account metadata, to provide helpful information to companies deciding how to tackle the problem. Two machine learning algorithms and a human survey are compared on their ability to classify accounts: the supervised random forest algorithm and the unsupervised k-means algorithm. Two ways of running these algorithms are also evaluated, using the machine-learning-as-a-service platform BigML and the Python library scikit-learn. Additionally, the metadata features most valuable to the supervised algorithm and to the human survey are compared. Results show that supervised machine learning is the superior technique for social bot identification, with an accuracy of almost 99%. In conclusion, the choice depends on the expertise of the company and on whether a relevant training dataset is available, but in most cases supervised machine learning is recommended.
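The supervised side of such a comparison can be caricatured with a single hand-written decision rule over hypothetical account-metadata features; a real random forest learns many such rules from labelled data. Feature names, thresholds, and accounts here are invented for illustration.

```python
# Toy stand-in for supervised bot classification on account metadata.
# The rule, feature names, and data are illustrative assumptions.

accounts = [
    {"followers": 2,    "tweets_per_day": 500, "bot": True},
    {"followers": 5,    "tweets_per_day": 300, "bot": True},
    {"followers": 900,  "tweets_per_day": 4,   "bot": False},
    {"followers": 1500, "tweets_per_day": 9,   "bot": False},
]

def predict_bot(acct):
    # rule of thumb: very high posting rate with few followers -> bot
    return acct["tweets_per_day"] > 100 and acct["followers"] < 50

def accuracy(data):
    correct = sum(predict_bot(a) == a["bot"] for a in data)
    return correct / len(data)

acc = accuracy(accounts)
```

Evaluating both a learned classifier and human judges with the same accuracy metric is what makes the study's three-way comparison possible.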
105

Evaluation of unsupervised feature selection methods for Text Mining

Nogueira, Bruno Magalhães 27 March 2009 (has links)
Feature selection is sometimes necessary to obtain good results in machine learning tasks. In Text Mining, reducing the number of features in a text base is essential for the effectiveness of the process and the comprehensibility of the extracted knowledge, since it deals with high-dimensional, sparse feature spaces. When the text collection is unlabeled, unsupervised feature reduction methods have to be used. However, there is no general predefined feature quality measure for unsupervised methods, which demands greater effort in their design. This work therefore approaches unsupervised feature selection through an exploratory study of methods of this kind, comparing their efficacy in reducing the number of features in the Text Mining process. Ten methods are compared - Ranking by Term Frequency, Ranking by Document Frequency, Term Frequency-Inverse Document Frequency, Term Contribution, Term Variance, Term Variance Quality, Luhn's Method, LuhnDF Method, Salton's Method and Zone-Scored Term Frequency - two of which, the LuhnDF Method and Zone-Scored Term Frequency, are proposed in this work.
The evaluation is carried out in two ways: supervised, through the accuracy of four classifiers (C4.5, SVM, KNN and Naïve Bayes), and unsupervised, using the Expected Mutual Information Measure. The evaluation results are submitted to the Kruskal-Wallis statistical test to determine the statistical significance of the performance differences among the feature selection methods. Six text bases are used in the experimental evaluation, each related to one broad domain and containing subdomains, which correspond to the classes used for supervised evaluation. Through this study, this work aims to contribute to a Text Mining application that extracts topic taxonomies from unlabeled text collections by selecting the most representative features in a collection. The results show that there is no statistically significant difference between the unsupervised feature selection methods compared. Moreover, comparisons of these unsupervised methods with supervised ones (Gain Ratio and Information Gain) indicate that unsupervised methods can be used in supervised Text Mining activities with efficiency compatible with that of supervised methods, since the statistical test detected no difference in these comparisons, and at a lower computational cost.
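Two of the compared rankings, TF-IDF and Term Variance, can be sketched on a tiny toy corpus of term-frequency vectors; the corpus and scoring details are illustrative, not the thesis's data or exact formulas.

```python
import math

# Sketch of two unsupervised feature rankings from the comparison:
# TF-IDF and Term Variance, on a toy corpus of term-frequency dicts.

corpus = [
    {"learning": 3, "machine": 2, "the": 10},
    {"learning": 1, "text": 4,    "the": 9},
    {"mining": 5,   "text": 3,    "the": 11},
]

def tfidf_score(term):
    df = sum(1 for doc in corpus if term in doc)          # document frequency
    tf = sum(doc.get(term, 0) for doc in corpus)          # total term frequency
    return tf * math.log(len(corpus) / df)

def term_variance(term):
    freqs = [doc.get(term, 0) for doc in corpus]
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs) / len(freqs)

# "the" occurs in every document, so its IDF (and hence TF-IDF) is zero,
# while bursty content words like "mining" score high on both measures.
```

Both measures need no class labels, which is what makes them usable on the unlabeled collections the thesis targets.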
106

Image-based Process Monitoring via Generative Adversarial Autoencoder with Applications to Rolling Defect Detection

January 2019 (has links)
abstract: Image-based process monitoring has recently attracted increasing attention due to advances in sensing technologies. However, existing process monitoring methods fail to fully utilize the spatial information of images due to their complex characteristics, including high dimensionality and complex spatial structures. Recent advances in unsupervised deep models such as the generative adversarial network (GAN) and the generative adversarial autoencoder (AAE) have made it possible to learn complex spatial structures automatically. Inspired by these advances, we propose an AAE-based framework for unsupervised anomaly detection in images. The AAE combines the power of the GAN with the variational autoencoder, serving as a nonlinear dimension reduction technique with regularization from the discriminator. Building on this, we propose a monitoring statistic that efficiently captures changes in the image data. The performance of the proposed AAE-based anomaly detection algorithm is validated through a simulation study and a real case study of rolling defect detection. / Dissertation/Thesis / Masters Thesis Industrial Engineering 2019
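The monitoring step itself (separate from the deep model) can be sketched as a control chart on reconstruction error: errors observed on in-control images set a control limit, and new images whose error exceeds it are flagged. The mean + 3-sigma limit and the error values are illustrative assumptions, not the thesis's actual statistic.

```python
# Sketch of the monitoring logic only: the trained autoencoder is assumed
# to supply a reconstruction error per image; we chart that error.

def control_limit(errors, k=3.0):
    """Mean + k standard deviations of in-control errors (illustrative)."""
    mean = sum(errors) / len(errors)
    var = sum((e - mean) ** 2 for e in errors) / len(errors)
    return mean + k * var ** 0.5

# reconstruction errors a model might produce on normal (in-control) images
in_control = [0.10, 0.12, 0.11, 0.09, 0.10, 0.13, 0.11, 0.10]
limit = control_limit(in_control)

def is_anomalous(error, limit):
    return error > limit

# a normal image and a defective one
flags = [is_anomalous(e, limit) for e in [0.11, 0.45]]
```

Any scalar statistic derived from the AAE's latent space or reconstruction could be charted the same way; the control-limit mechanism is independent of the model that produces the score.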
107

Novelty Detection Of Machinery Using A Non-Parametric Machine Learning Approach

Angola, Enrique 01 January 2018 (has links)
A novelty detection algorithm inspired by human audio pattern recognition is conceptualized and experimentally tested. This anomaly detection technique can be used to monitor the health of a machine, or it can be coupled with a current state-of-the-art system to enhance its fault detection capabilities. Time-domain data obtained from a microphone is processed by applying a short-time FFT, which returns time-frequency patterns. Such patterns are fed to a machine learning algorithm designed to detect novel signals and identify the windows in the frequency domain where such novelties occur. The algorithm presented in this work uses one-dimensional kernel density estimation for different frequency bins. This process eliminates the need for dimensionality reduction algorithms. The method of "pseudo-likelihood cross validation" is used to find an independent optimal kernel bandwidth for each frequency bin. Metrics such as the "Individual Node Relative Difference" and "Total Novelty Score" are presented in this work and used to assess the degree of novelty of a new signal. Experimental datasets containing synthetic and real novelties are used to illustrate and test the novelty detection algorithm. Novelties are successfully detected in all experiments. The presented novelty detection technique could greatly enhance the performance of current state-of-the-art condition monitoring systems, or it could be used as a stand-alone system.
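The per-bin density idea can be sketched with a 1-D Gaussian kernel density estimate fit independently for each frequency bin from baseline spectra; a new spectrum is then scored bin by bin, with low density indicating novelty. The fixed bandwidth and toy data are illustrative; the thesis selects bandwidths by pseudo-likelihood cross validation.

```python
import math

# Sketch: one 1-D Gaussian KDE per frequency bin, fit on baseline frames.

def kde_density(x, samples, bandwidth):
    """1-D Gaussian KDE evaluated at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )

# baseline magnitudes for two frequency bins (rows: bins, cols: frames)
baseline = [[1.0, 1.1, 0.9, 1.05], [0.2, 0.25, 0.15, 0.2]]
bandwidth = 0.1  # fixed here for illustration only

def density_per_bin(spectrum):
    # low density under the baseline KDE -> high novelty in that bin
    return [kde_density(x, samples, bandwidth)
            for x, samples in zip(spectrum, baseline)]

normal = density_per_bin([1.0, 0.2])
novel = density_per_bin([1.0, 0.9])   # bin 1 far from its baseline
```

Because each bin has its own density model, the algorithm can report not just that a signal is novel but in which frequency windows the novelty occurs.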
108

Bayesian non-parametric parsimonious mixtures for model-based clustering

Bartcus, Marius 26 October 2015 (has links)
This thesis focuses on statistical learning and multi-dimensional data analysis. It particularly focuses on unsupervised learning of generative models for model-based clustering. We study Gaussian mixture models, both in the context of maximum likelihood estimation via the EM algorithm and in the Bayesian context of maximum a posteriori estimation via Markov Chain Monte Carlo (MCMC) sampling techniques. We mainly consider parsimonious mixture models, which are based on a spectral decomposition of the covariance matrix and provide a flexible framework, particularly for the analysis of high-dimensional data. Then, we investigate non-parametric Bayesian mixtures, which are based on general flexible processes such as the Dirichlet process and the Chinese Restaurant Process. This non-parametric formulation is relevant both for learning the model and for dealing with the difficult issue of model selection. We propose new Bayesian non-parametric parsimonious mixtures and derive an MCMC sampling technique in which the mixture model and the number of mixture components are learned simultaneously from the data. The selection of the model structure is performed using Bayes factors. By their non-parametric and sparse formulation, these models are useful for the analysis of large data sets when the number of classes is undetermined and grows with the data, and when the dimension is high. The models are validated on simulated data and standard real data sets, and are then applied to a difficult real problem: the automatic structuring of complex bioacoustic data derived from whale song signals. Finally, we open Markovian perspectives via hierarchical Dirichlet process hidden Markov models.
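The Chinese Restaurant Process that underlies these non-parametric mixtures can be sketched directly: customer n joins an existing table with probability proportional to its occupancy, or opens a new table with probability proportional to the concentration parameter alpha, so the number of clusters grows with the data rather than being fixed in advance. This is a generic CRP sampler, not the thesis's full MCMC scheme.

```python
import random

# Sketch of the Chinese Restaurant Process: cluster count is unbounded
# and grows with the data, governed by the concentration alpha.

def crp(n_customers, alpha, seed=0):
    rng = random.Random(seed)
    tables = []        # tables[k] = number of customers at table k
    assignments = []
    for n in range(n_customers):
        weights = tables + [alpha]   # existing tables, then a new one
        r = rng.uniform(0, n + alpha)
        cum, choice = 0.0, len(tables)
        for k, w in enumerate(weights):
            cum += w
            if r < cum:
                choice = k
                break
        if choice == len(tables):
            tables.append(1)         # open a new table
        else:
            tables[choice] += 1
        assignments.append(choice)
    return assignments, tables

assignments, tables = crp(100, alpha=1.0)
```

In the full model, each table carries its own (parsimonious) Gaussian component, and the MCMC sampler alternates between reseating customers and updating component parameters.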
109

Nonlinear Dimensionality Reduction with Side Information

Ghodsi Boushehri, Ali January 2006 (has links)
In this thesis, I look at three problems with important applications in data processing. Incorporating side information, provided by the user or derived from data, is a main theme of each of these problems.

This thesis makes a number of contributions. The first is a technique for combining different embedding objectives, which is then exploited to incorporate side information expressed in terms of transformation invariants known to hold in the data. It also introduces two different ways of incorporating transformation invariants in order to make new similarity measures. Two algorithms are proposed which learn metrics based on different types of side information. These learned metrics can then be used in subsequent embedding methods. Finally, it introduces a manifold learning algorithm that is useful when applied to sequential decision problems. In this case we are given action labels in addition to data points. Actions in the manifold learned by this algorithm have meaningful representations in that they are represented as simple transformations.
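The metric-learning-from-side-information idea can be caricatured with a diagonal (per-feature) weighting: given pairs labelled "similar" and "dissimilar", weight each feature by how much it separates dissimilar pairs relative to similar ones. This crude heuristic only illustrates the setting; it is not one of the thesis's two algorithms.

```python
# Illustrative diagonal metric learned from pairwise side information.
# The weighting heuristic and data are assumptions, not the thesis's method.

def diagonal_metric(similar_pairs, dissimilar_pairs, eps=1e-9):
    dim = len(similar_pairs[0][0])
    weights = []
    for d in range(dim):
        within = sum(abs(a[d] - b[d]) for a, b in similar_pairs) / len(similar_pairs)
        between = sum(abs(a[d] - b[d]) for a, b in dissimilar_pairs) / len(dissimilar_pairs)
        # features that vary across dissimilar pairs but not similar ones
        # get large weight
        weights.append(between / (within + eps))
    return weights

def weighted_dist(a, b, w):
    return sum(wi * (ai - bi) ** 2 for ai, bi, wi in zip(a, b, w)) ** 0.5

# feature 0 is noise; feature 1 carries the class structure
similar = [([0.0, 1.0], [0.9, 1.1]), ([0.5, 2.0], [0.1, 2.1])]
dissimilar = [([0.2, 1.0], [0.3, 5.0]), ([0.8, 2.0], [0.7, 6.0])]
w = diagonal_metric(similar, dissimilar)
```

A metric learned this way can then be plugged into any distance-based embedding method, which is the role the learned metrics play in the thesis.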
110

Nonlinear Dimensionality Reduction with Side Information

Ghodsi Boushehri, Ali January 2006 (has links)
In this thesis, I look at three problems with important applications in data processing. Incorporating side information, provided by the user or derived from data, is a main theme of each of these problems. <br /><br /> This thesis makes a number of contributions. The first is a technique for combining different embedding objectives, which is then exploited to incorporate side information expressed in terms of transformation invariants known to hold in the data. It also introduces two different ways of incorporating transformation invariants in order to make new similarity measures. Two algorithms are proposed which learn metrics based on different types of side information. These learned metrics can then be used in subsequent embedding methods. Finally, it introduces a manifold learning algorithm that is useful when applied to sequential decision problems. In this case we are given action labels in addition to data points. Actions in the manifold learned by this algorithm have meaningful representations in that they are represented as simple transformations.
