491

Dimensionality reduction and representation for nearest neighbour learning

Payne, Terry R. January 1999
An increasing number of intelligent information agents employ Nearest Neighbour learning algorithms to provide personalised assistance to the user. This assistance may be in the form of recognising or locating documents that the user might find relevant or interesting. To achieve this, documents must be mapped into a representation that can be presented to the learning algorithm. Simple heuristic techniques are generally used to identify relevant terms from the documents. These terms are then used to construct large, sparse training vectors. The work presented here investigates an alternative representation based on sets of terms, called set-valued attributes, and proposes a new family of Nearest Neighbour learning algorithms that utilise this set-based representation. The importance of discarding irrelevant terms from the documents is then addressed, and this is generalised to examine the behaviour of the Nearest Neighbour learning algorithm with high dimensional data sets containing such values. A variety of selection techniques used by other machine learning and information retrieval systems are presented, and empirically evaluated within the context of a Nearest Neighbour framework. The thesis concludes with a discussion of ways in which attribute selection and dimensionality reduction techniques may be used to improve the selection of relevant attributes, and thus increase the reliability and predictive accuracy of the Nearest Neighbour learning algorithm.
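As a rough illustration of the set-based representation discussed above, the sketch below implements a 1-nearest-neighbour classifier that compares term sets with Jaccard similarity. It is an assumed stand-in for the thesis's set-valued attribute algorithms, not their actual formulation; the data and labels are hypothetical.

```python
# Hypothetical sketch: 1-nearest-neighbour over set-valued attributes,
# using Jaccard similarity between term sets. Illustrative only.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two term sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def nn_classify(train, query_terms):
    """train: list of (term_set, label); returns the label of the most similar set."""
    best_label, best_sim = None, -1.0
    for terms, label in train:
        sim = jaccard(terms, query_terms)
        if sim > best_sim:
            best_sim, best_label = sim, label
    return best_label

train = [({"machine", "learning", "agent"}, "interesting"),
         ({"stock", "market", "report"}, "not interesting")]
print(nn_classify(train, {"learning", "agent", "news"}))  # -> "interesting"
```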
492

Automated Attacks on Compression-Based Classifiers

Burago, Igor 29 September 2014
Methods of compression-based text classification have proven their usefulness for various applications. However, in some classification problems, such as spam filtering, a classifier confronts one or many adversaries willing to induce errors in the classifier's judgment on certain kinds of input. In this thesis, we consider the problem of finding thrifty strategies for character-based text modification that allow an adversary to revert the classifier's verdict on a given family of input texts. We propose three statistical statements of the problem that an attacker can use to obtain transformation models that are optimal in some sense. Evaluating these three techniques on a realistic spam corpus, we find that an adversary can transform a spam message (detectable as such by an entropy-based text classifier) into a legitimate one by generating and appending additional characters amounting, in some cases, to as little as 20% of the original message length.
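For context, here is a minimal sketch of the kind of compression-based (entropy) classifier such an attack targets: a text is assigned to the class whose training corpus compresses it best, i.e. the smallest increase in compressed size when the text is appended. zlib as the compressor and the toy corpora are assumptions; the thesis's classifier differs in detail.

```python
import zlib

def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def classify(text: str, corpora: dict) -> str:
    """corpora: label -> concatenated training text for that class."""
    def cost(corpus: str) -> int:
        base = corpus.encode()
        # Approximate cross-entropy: extra bytes needed to encode the text
        # given the class corpus as context.
        return compressed_size(base + text.encode()) - compressed_size(base)
    return min(corpora, key=lambda label: cost(corpora[label]))

corpora = {"spam": "buy now cheap pills limited offer " * 20,
           "ham": "meeting agenda attached see you tomorrow " * 20}
print(classify("cheap pills offer now", corpora))  # expected: "spam"
```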
493

Enhanced root extraction and document classification algorithm for Arabic text

Alsaad, Amal January 2016
Many text extraction and classification systems have been developed for English and other languages based on Roman letters, but few have been developed for Arabic text categorization. Arabic is a Semitic language whose rules and morphology are considerably more complex than English. Because of this complex morphology, pre-processing routines are needed to extract the roots of words before classifying them according to groups of acts or meaning. In this thesis, a system has been developed and tested for text classification. The system is based on two stages: the first extracts the roots from text, and the second classifies the text according to predefined categories. The linguistic root extraction stage is composed of two main phases. The first phase handles removal of affixes, including prefixes, suffixes and infixes. Prefixes and suffixes are removed depending on the length of the word, while checking its morphological pattern after each deduction to remove infixes. In the second phase, the root extraction algorithm is formulated to handle weak, defined, eliminated-long-vowel and two-letter geminated words, as there is a substantial amount of irregular Arabic words in texts. Once the roots are extracted, they are checked against a predefined list of 3800 triliteral and 900 quadliteral roots. A series of experiments was conducted to improve and test the performance of the proposed algorithm. The results revealed that the developed algorithm has better accuracy than existing stemming algorithms. The second stage is document classification, in which two non-parametric classifiers are tested: Artificial Neural Networks (ANN) and Support Vector Machines (SVM). The system is trained on six categories (culture, economy, international, local, religion and sports) using 80% of the available data, with the 10 most frequent terms from each category selected as features; classification is tested on the remaining 20% of the documents. The results of ANN and SVM are compared to the standard term-frequency-based method for text classification. ANN and SVM achieve better accuracy (80-90%) than the standard method (60-70%), demonstrating that the proposed method can categorize Arabic text documents into the appropriate categories with a high precision rate.
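A deliberately simplified sketch of the two-phase idea (length-aware affix stripping followed by a root-list check) appears below; the affix and root lists are tiny placeholders, not the 3800 triliteral and 900 quadliteral root tables used in the thesis, and the real algorithm's pattern checks and irregular-word handling are omitted.

```python
PREFIXES = ["ال", "و", "ب"]      # placeholder Arabic prefixes
SUFFIXES = ["ون", "ات", "ة"]     # placeholder Arabic suffixes
ROOTS = {"كتب", "درس", "علم"}    # placeholder triliteral roots

def strip_affixes(word: str) -> str:
    """Strip known prefixes/suffixes while at least 3 letters remain."""
    changed = True
    while changed and len(word) > 3:
        changed = False
        for p in PREFIXES:
            if word.startswith(p) and len(word) - len(p) >= 3:
                word, changed = word[len(p):], True
        for s in SUFFIXES:
            if word.endswith(s) and len(word) - len(s) >= 3:
                word, changed = word[:-len(s)], True
    return word

def extract_root(word: str):
    """Return the stem only if it matches the predefined root list."""
    stem = strip_affixes(word)
    return stem if stem in ROOTS else None

print(extract_root("والكتب"))  # -> "كتب" (prefixes "و" then "ال" stripped)
```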
494

Recognition and Classification of Aggressive Motion Using Smartwatches

Franck, Tchuente 10 September 2018
Aggressive motion can occur in clinical and elderly care settings with people suffering from dementia, mental disorders, or other conditions that affect memory. Since identifying the nature of an event can be difficult with people who have memory and communication issues, other methods of identifying and recording aggressive motion would help care providers reduce recurrences of this behaviour. A wearable-technology approach to human activity recognition was explored in this thesis to detect aggressive movements, aiming to provide a means of identifying the person who initiated an aggressive motion and of categorizing the aggressive action. The main objective of this thesis was to determine the effectiveness of smartwatch accelerometer and gyroscope sensor data for classifying aggressive and non-aggressive activities. Thirty able-bodied participants donned two Microsoft Band 2 smartwatches and performed an activity circuit of similar aggressive and non-aggressive movements. Statistical and physical features were extracted from the smartwatch sensor signals and subsequently used by multiple classifiers, evaluated on a machine learning platform with six performance metrics (accuracy, sensitivity, specificity, precision, F-score, Matthews correlation coefficient). This thesis demonstrated: 1) the best features for binary classification; 2) the best and most practical machine learning classifier and feature-selector model; 3) the differences in evaluation metrics between a unilateral smartwatch and bilateral smartwatches; 4) the most suitable machine learning algorithm for multinomial classification.
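As a rough sketch of the feature-extraction step described above, the snippet below computes common statistical features over a window of tri-axial accelerometer samples. The window length, sampling rate, and exact feature set are illustrative assumptions, not those used in the thesis.

```python
import numpy as np

def window_features(window: np.ndarray) -> np.ndarray:
    """window: (n_samples, 3) array of x/y/z accelerometer readings."""
    magnitude = np.linalg.norm(window, axis=1)  # per-sample acceleration magnitude
    feats = []
    for signal in [window[:, 0], window[:, 1], window[:, 2], magnitude]:
        feats += [signal.mean(), signal.std(), signal.min(), signal.max(),
                  np.percentile(signal, 75) - np.percentile(signal, 25)]  # IQR
    return np.array(feats)

rng = np.random.default_rng(0)
fake_window = rng.normal(size=(128, 3))    # ~2 s at 64 Hz, synthetic data
print(window_features(fake_window).shape)  # (20,) = 4 signals x 5 statistics
```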
495

Semi-supervised learning for biological sequence classification

Stanescu, Ana January 1900
Doctor of Philosophy / Department of Computing and Information Sciences / Doina Caragea / Successful advances in biochemical technologies have led to inexpensive, time-efficient production of massive volumes of data, DNA and protein sequences. As a result, numerous computational methods for genome annotation have emerged, including machine learning and statistical analysis approaches that practically and efficiently analyze and interpret data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data in order to build quality classifiers. The process of labeling data can be expensive and time consuming, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on semi-supervised learning approaches for biological sequence classification. Although an attractive concept, semi-supervised learning does not invariably work as intended. Since the assumptions made by learning algorithms cannot be easily verified without considerable domain knowledge or data exploration, semi-supervised learning is not always "safe" to use. Advantageous utilization of the unlabeled data is problem dependent, and more research is needed to identify algorithms that can be used to increase the effectiveness of semi-supervised learning in general, and for bioinformatics problems in particular. At a high level, we aim to identify semi-supervised algorithms and data representations that can be used to learn effective classifiers for genome annotation tasks such as cassette exon identification, splice site identification, and protein localization. In addition, one specific challenge that we address is the "data imbalance" problem, which is prevalent in many domains, including bioinformatics. The data imbalance phenomenon arises when one of the classes to be predicted is underrepresented in the data because instances belonging to that class are rare (noteworthy cases) or difficult to obtain. Ironically, minority classes are typically the most important to learn, because they may be associated with special cases, as in splice site prediction. We propose two main techniques to deal with the data imbalance problem: a technique based on "dynamic balancing" (augmenting the originally labeled data only with positive instances during the semi-supervised iterations of the algorithms) and another based on ensemble approaches. The results show that with limited amounts of labeled data, semi-supervised approaches can successfully leverage the unlabeled data, thereby surpassing their completely supervised counterparts. A type of semi-supervised learning known as "transductive" learning aims to classify the unlabeled data without generalizing to new, previously unencountered instances. Theoretically, this aspect makes transductive learning particularly suitable for genome annotation, in which an entirely sequenced genome is typically available, sometimes accompanied by limited annotation. We study and evaluate various transductive approaches (such as transductive support vector machines and graph-based approaches) and sequence representations for the problem of cassette exon identification. The results obtained demonstrate the effectiveness of transductive algorithms in sequence annotation tasks.
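A minimal sketch of the "dynamic balancing" idea described above, in which each self-training iteration augments the labeled set only with confidently predicted positive instances. The base learner (scikit-learn's LogisticRegression), confidence threshold, and iteration count are assumptions for illustration; the thesis's algorithms and learners may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train_positive_only(X_lab, y_lab, X_unlab, n_iter=5, thresh=0.9):
    """Self-training that adds only confident positives (dynamic balancing)."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    for _ in range(n_iter):
        clf = LogisticRegression().fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)[:, 1]  # P(class = 1)
        confident_pos = proba >= thresh           # positives only, by design
        if not confident_pos.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[confident_pos]])
        y_lab = np.concatenate([y_lab,
                                np.ones(confident_pos.sum(), dtype=int)])
        X_unlab = X_unlab[~confident_pos]         # consume the moved instances
    return clf
```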
496

Algorithm and Hardware Co-design for Learning On-a-chip

January 2017
Machine learning technology has made many incredible achievements in recent years, rivalling or exceeding human performance in intellectual tasks such as image recognition, face detection and the game of Go. Many machine learning algorithms require huge amounts of computation, such as the multiplication of large matrices. As silicon technology has scaled to the sub-14nm regime, simply scaling down the device can no longer provide enough speed-up; new device technologies and system architectures are needed to improve computing capacity. Hardware designed specifically for machine learning is in high demand, and efforts must be made on the joint design and optimization of both hardware and algorithms. For machine learning acceleration, traditional SRAM- and DRAM-based systems suffer from low capacity, high latency, and high standby power. Emerging memories, such as Phase Change Random Access Memory (PRAM), Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM), and Resistive Random Access Memory (RRAM), are promising candidates providing low standby power, high data density, fast access and excellent scalability. This dissertation proposes a hierarchical memory modeling framework and models PRAM and STT-MRAM at four levels of abstraction. With the proposed models, various simulations are conducted to investigate performance, optimization, variability, reliability, and scalability. Emerging memory devices such as RRAM can work as a 2-D crosspoint array to speed up the multiplication and accumulation in machine learning algorithms. This dissertation proposes a new parallel programming scheme to achieve in-memory learning with an RRAM crosspoint array. The programming circuitry is designed and simulated in TSMC 65nm technology, showing a 900X speedup for the dictionary learning task compared to CPU performance. From the algorithm perspective, inspired by the high accuracy and low power of the brain, this dissertation proposes a bio-plausible feedforward inhibition spiking neural network with a Spike-Rate-Dependent Plasticity (SRDP) learning rule. It achieves more than 95% accuracy on the MNIST dataset, comparable to the sparse coding algorithm but requiring far fewer computations. The role of inhibition in this network is systematically studied and shown to improve hardware efficiency in learning. / Dissertation/Thesis / Doctoral Dissertation Electrical Engineering 2017
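To illustrate why a resistive crosspoint array accelerates the multiply-accumulate at the heart of these algorithms, the toy model below computes a vector-matrix product the way the analog array does: row voltages times cell conductances, summed as column currents (Ohm's law plus Kirchhoff's current law). The numbers are arbitrary placeholders.

```python
import numpy as np

G = np.array([[1.0e-6, 5.0e-6],   # cell conductances (siemens), one per crosspoint
              [2.0e-6, 1.0e-6],
              [4.0e-6, 3.0e-6]])
v = np.array([0.2, 0.5, 0.1])     # input voltages applied to the 3 rows

i_out = v @ G                     # column currents: I_j = sum_k V_k * G[k, j]
print(i_out)                      # every multiply-accumulate happens in parallel
```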
497

MLID: A multilabel extension of the ID3 algorithm

Starefors, Henrik, Persson, Rasmus January 2016
Abstract: Machine learning is a subfield within artificial intelligence that revolves around constructing algorithms that can learn from, and make predictions on, data. Instead of following strict and static instructions, the system operates by adapting and learning from input data in order to make predictions and decisions. This work focuses on a subcategory of machine learning called multilabel classification: items introduced to the system are categorized by an analytical model, learned through supervised learning, where each instance of the dataset can belong to multiple labels, or classes. This paper presents the task of implementing a multilabel classifier based on the ID3 algorithm, which we call MLID (Multilabel Iterative Dichotomiser). The solution is presented both in a sequentially executed version and in a parallelized one. We also present a comparison based on accuracy and execution time, performed against algorithms of a similar nature, in order to evaluate the viability of using ID3 as a base to further expand and build upon for multilabel classification. To evaluate the performance of the MLID algorithm, we measured execution time and accuracy, and summarized precision and recall into the F-measure, the harmonic mean of the precision and sensitivity of the algorithm. These results are then compared to already established algorithms on a range of datasets of varying sizes in order to assess the viability of the MLID algorithm. The results produced when comparing MLID against other multilabel algorithms, such as Binary Relevance, Classifier Chains and Random Trees, show that MLID can compete with other classifiers in terms of accuracy and F-measure, but in terms of training time it proves inferior. Through these results, we conclude that MLID is a viable option for use as a multilabel classifier. Although some constraints inherited from the original ID3 algorithm impede the full utility of the algorithm, we are certain that following the same path of development and improvement as ID3 experienced would allow MLID to develop into a suitable choice of algorithm for a diverse range of multilabel classification problems.
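For reference, here is a small sketch of the entropy and information-gain computations at the core of the original ID3 algorithm that MLID builds on. The data is a toy example; MLID's multilabel split criterion is not reproduced here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, labels, attr_index):
    """Gain of splitting (rows, labels) on the attribute at attr_index."""
    base = entropy(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(sub) / len(labels) * entropy(sub)
                    for sub in by_value.values())
    return base - remainder

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # -> 1.0 (attribute 0 splits perfectly)
```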
498

Traffic identification in IP networks

de Castro Callado, Arthur 31 January 2009
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / Traffic analysis and identification in IP networks still depend heavily on human interaction and expertise. Understanding the composition and dynamics of Internet traffic is essential for managing IP networks, especially for capacity planning, traffic engineering, fault diagnosis, anomaly detection and the characterization of service performance. The major shift in dominant applications over recent years, from the Web to Peer-to-Peer file sharing and currently from Peer-to-Peer to video streaming, demands special attention from network administrators, yet it was not fully anticipated by management tools. Even today, in practice, network operators only detect video streaming based on the IP addresses of known video streaming servers. But new applications, such as Joost, Babelgum and TVU, offer a kind of peer-to-peer video streaming service for which identification by IP address is not feasible. Some networks block access to applications based on IP addresses or well-known port numbers, two methods already considered unviable for application identification. This fuels a cat-and-mouse game between the developers of such applications, who try to build applications that exchange traffic even on hostile networks by means of evasion techniques, and the networks that consider some applications harmful to their business or goals and try to block them. Thus, identifying the applications that compose traffic, independently of network configuration, is valuable to network operators. It allows more effective prediction of user traffic demands; the offering of value-added services based on customer demand for other services; application-based billing; and, in the case of online identification, application-based Quality of Service (QoS), traffic shaping and traffic filtering (firewalling). In recent years, several inference-based techniques have been proposed as alternatives to traffic identification based on well-known ports. However, none has proven able to achieve high identification accuracy across several application types at once on real traffic. Therefore, combining techniques seems a reasonable approach to deal with the shortcomings of each individual technique, and periodic reconfiguration of the combination parameters may prove an interesting way to cope with the natural evolution of applications and with the evasion techniques used by applications that generate large volumes of unwanted traffic. This work provides a thorough understanding of the common shortcomings in traffic identification and makes several practical contributions to the area. After a careful performance study of the main traffic identification algorithms on four different networks, this thesis lists several recommendations for the use of traffic identification algorithms. To this end, some prerequisites for creating an adequate traffic identification environment are detailed. Furthermore, original methods are proposed to improve the performance of traffic identification algorithms by combining their results, without restrictions on the type of identification algorithms that may be used. These methods are evaluated in a case study carried out using the same network scenarios.
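As one hypothetical way to realize the combination idea above, the sketch below merges the verdicts of several traffic-identification algorithms by weighted voting. The classifier names and weights are placeholders, not the methods actually proposed in the thesis.

```python
from collections import defaultdict

def combine(verdicts, weights):
    """verdicts: {classifier_name: predicted_application}; weighted majority vote."""
    score = defaultdict(float)
    for name, app in verdicts.items():
        score[app] += weights.get(name, 1.0)  # unknown classifiers get weight 1
    return max(score, key=score.get)

verdicts = {"port_based": "web", "payload_signature": "p2p", "flow_stats": "p2p"}
weights = {"port_based": 0.5, "payload_signature": 1.0, "flow_stats": 0.8}
print(combine(verdicts, weights))  # -> "p2p" (1.8 vs 0.5)
```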
499

Low false positive learning with support vector machines

Moraes, Daniel Bastos, 1987- 24 August 2018
Advisors: Anderson de Rezende Rocha, Jacques Wainer / Master's dissertation - Universidade Estadual de Campinas, Instituto de Computação / Abstract: Most machine learning systems for binary classification are trained using algorithms that maximize accuracy and assume that false positives and false negatives are equally bad. However, in many applications these two types of error can have very different costs. For instance, in medical screening applications, falsely determining that a patient is healthy is much more serious than falsely determining that she has a certain medical condition. In this work, we consider the problem of controlling the false positive rate in Support Vector Machines (SVMs), since their traditional formulation offers no such assurance. To solve this problem, we define a sensitive area in the feature space, where the probability of false positives is higher, and use a second classifier (k-Nearest Neighbors) in this area to better filter errors and improve the decision-making process. We compare the proposed solution to other state-of-the-art methods for low false positive classification using 33 standard datasets from the literature. The proposed solution shows better performance in the vast majority of cases under the standard Neyman-Pearson measure. / Master's in Computer Science
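A hedged sketch of the two-stage scheme the abstract describes: an SVM makes the initial decision, and test points falling inside a sensitive band around the decision boundary are re-examined by a k-NN classifier. The band width, kernel, and k are illustrative assumptions, not the thesis's tuned values.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def two_stage_predict(X_train, y_train, X_test, band=0.5, k=5):
    """SVM decision, with k-NN as a second opinion near the boundary."""
    svm = SVC(kernel="rbf").fit(X_train, y_train)
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    margin = svm.decision_function(X_test)   # signed distance to the boundary
    pred = svm.predict(X_test)
    sensitive = np.abs(margin) < band        # the "sensitive area" near the boundary
    if sensitive.any():
        pred[sensitive] = knn.predict(X_test[sensitive])
    return pred
```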
500

Software-based fingerprint liveness detection

Nogueira, Rodrigo Frassetto, 1986- 26 August 2018
Advisor: Roberto de Alencar Lotufo / Master's dissertation - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação / Abstract: With the growing use of biometric authentication systems in the past years, spoof fingerprint detection has become increasingly important. In this work, we implemented and compared various techniques for software-based fingerprint liveness detection. As feature extractors we use Convolutional Networks with random weights, applied for the first time to this task, and Local Binary Patterns. The techniques were used in conjunction with dimensionality reduction through Principal Component Analysis (PCA) and a Support Vector Machine (SVM) classifier. Dataset augmentation was successfully used to increase the classifier's performance. We tested a variety of preprocessing operations, such as frequency filtering, contrast equalization, and region-of-interest filtering. An automatic and extensive search for the best combination of preprocessing operations, architectures and hyper-parameters was made, thanks to the fast computers available as cloud services. The experiments were run on the datasets used in the Liveness Detection Competitions of 2009, 2011 and 2013, which together comprise almost 50,000 real and fake fingerprint images. Our best method achieves an overall rate of 95.2% correctly classified samples - an improvement of 59% in test error compared with the best previously published results. / Master's in Electrical Engineering
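A compact sketch of the pipeline described above: random-weight convolutional feature extraction, PCA dimensionality reduction, and an SVM classifier. Filter counts, image size, PCA dimension, and the synthetic stand-in data are all illustrative assumptions, not the thesis's searched configuration.

```python
import numpy as np
from scipy.signal import convolve2d
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
filters = rng.normal(size=(8, 5, 5))  # 8 random 5x5 convolutional filters

def conv_features(img):
    """img: 2-D grayscale array; pooled responses of the random filters."""
    maps = [np.maximum(convolve2d(img, f, mode="valid"), 0) for f in filters]
    return np.array([m.mean() for m in maps])  # crude average pooling

# Synthetic stand-ins for real/fake fingerprint images:
X = np.array([conv_features(rng.normal(size=(64, 64))) for _ in range(40)])
y = np.tile([0, 1], 20)  # alternating live/spoof placeholder labels

model = make_pipeline(PCA(n_components=5), SVC())
model.fit(X, y)
```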
