81 |
Ensemble classification and signal image processing for genus Gyrodactylus (Monogenea)
Ali, Rozniza January 2014
This thesis presents an investigation into Gyrodactylus species recognition, making use of machine learning classification and feature selection techniques, and explores image feature extraction to demonstrate proof of concept for an envisaged rapid, consistent and secure initial identification of pathogens by field workers and non-expert users. The proposed cognitively inspired framework is designed to discriminate a pathogenic species confidently from its non-pathogenic congeners, which is sought in order to assist diagnostics during periods of a suspected outbreak. Accurate identification of pathogens is key to their control in an aquaculture context, and the monogenean worm genus Gyrodactylus provides an ideal test-bed for the selected techniques. In the proposed algorithm, the concept of classification using a single model is extended to include more than one model. To classify multiple species of Gyrodactylus, experiments were performed using 557 specimens of nine different species, two classifiers and three feature sets. To combine these models, an ensemble-based majority voting approach has been adopted. Experimental results with a database of Gyrodactylus species show the superior performance of the ensemble system. Comparison with single classification approaches indicates that the proposed framework produces a marked improvement in classification performance. The second contribution of this thesis is the exploration of image processing techniques. Active Shape Model (ASM) and Complex Network methods are applied to images of the attachment hooks of several species of Gyrodactylus to classify each specimen according to its true species. ASM is used to provide landmark points to segment the contour of the image, while the Complex Network model is used to extract information from that contour.
The current system aims to classify specimens, including those of the species that is a notifiable pathogen of Atlantic salmon, to their true class with a high degree of accuracy. Finally, some concluding remarks are made along with proposals for future work.
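The majority voting step mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the tie-breaking rule and the placeholder labels (species_A, species_B, species_C) are assumptions made for illustration only.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted most often for one specimen.

    Ties are broken in favour of the label that appears first in the
    list (an assumed rule; the abstract does not specify one).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def ensemble_predict(per_model_predictions):
    """Combine several models' outputs, aligned by specimen index."""
    return [majority_vote(list(votes)) for votes in zip(*per_model_predictions)]


# Hypothetical outputs of three base models for three specimens.
model_outputs = [
    ["species_A", "species_B", "species_C"],
    ["species_A", "species_A", "species_C"],
    ["species_B", "species_B", "species_A"],
]
combined = ensemble_predict(model_outputs)
print(combined)
```

Each base model could be any classifier and feature set pairing; the vote only needs their predicted labels in the same specimen order.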
|
82 |
Discovering Compact and Informative Structures through Data Partitioning
Fiterau, Madalina 01 September 2015
In many practical scenarios, prediction for high-dimensional observations can be accurately performed using only a fraction of the existing features. However, the set of relevant predictive features, known as the sparsity pattern, varies across data. For instance, features that are informative for a subset of observations might be useless for the rest. In fact, in such cases, the dataset can be seen as an aggregation of samples belonging to several low-dimensional sub-models, potentially due to different generative processes. My thesis introduces several techniques for identifying sparse predictive structures and the areas of the feature space where these structures are effective. This information allows the training of models which perform better than those obtained through traditional feature selection. We formalize Informative Projection Recovery, the problem of extracting a set of low-dimensional projections of data which jointly form an accurate solution to a given learning task. Our solution to this problem is a regression-based algorithm that identifies informative projections by optimizing over a matrix of point-wise loss estimators. It generalizes to a number of machine learning problems, offering solutions to classification, clustering and regression tasks. Experiments show that our method can discover and leverage low-dimensional structure, yielding accurate and compact models. Our method is particularly useful in applications involving multivariate numeric data in which expert assessment of the results is of the essence. Additionally, we developed an active learning framework which works with the obtained compact models in finding unlabeled data deemed to be worth expert evaluation. For this purpose, we enhance standard active selection criteria using the information encapsulated by the trained model. The advantage of our approach is that the labeling effort is expended mainly on samples which benefit models from the hypothesis class we are considering. 
Additionally, the domain experts benefit from the availability of informative axis-aligned projections at the time of labeling. Experiments show that this results in an improved learning rate over standard selection criteria, both for synthetic data and real-world data from the clinical domain, while the comprehensible view of the data supports the labeling process and helps preempt labeling errors.
|
83 |
An exploration of BMSF algorithm in genome-wide association mapping
Jiang, Dayou January 1900
Master of Science / Department of Statistics / Haiyan Wang / Motivation: Genome-wide association studies (GWAS) provide an important avenue for investigating many common genetic variants in different individuals to see if any variant is associated with a trait. GWAS is a powerful tool for identifying genetic factors that influence health and disease. However, the high dimensionality of the gene expression dataset makes GWAS challenging. Although many promising machine learning methods, such as the Support Vector Machine (SVM), have been investigated in GWAS, the question of how to improve the accuracy of the result has drawn increasing attention from researchers. Many studies did not apply feature selection to select a parsimonious set of relevant genes, and those that did perform gene selection often failed to consider possible interactions among genes. Here we modify BMSF, a gene selection algorithm originally developed by Zhang et al. (2012) for improving the accuracy of cancer classification with binary responses. A continuous response version of the BMSF algorithm is provided in this report so that it can be applied to gene selection for continuous gene expression datasets. The algorithm dramatically reduces the dimension of the gene markers under consideration and thus increases the efficiency and accuracy of GWAS.
Results: We applied the continuous response version of BMSF to the wheat phenotypes dataset to predict two quantitative traits based on the genotype marker data. This wheat dataset was previously studied in Long et al. (2009) for the same purpose, but only through direct application of SVM regression methods. By applying our gene selection method, we filtered out a large portion of the less relevant genes and achieved a better prediction result on the test data by building an SVM regression model using only the selected genes on the training data. We also applied our algorithm to simulated datasets generated following the setting of an example in Fan et al. (2011). The continuous response version of BMSF showed good ability to identify active variables hidden among high-dimensional irrelevant variables. In comparison to the smoothing-based methods in Fan et al. (2011), our method has the advantage of avoiding the ambiguity caused by different choices of the smoothing parameter.
|
84 |
Planetary navigation activity recognition using wearable accelerometer data
Song, Wen January 1900
Master of Science / Department of Electrical & Computer Engineering / Steve Warren / Activity recognition can be an important part of human health awareness. Many benefits can be generated from the recognition results, including knowledge of activity intensity as it relates to wellness over time. Various activity-recognition techniques have been presented in the literature, though most address simple activity-data collection and off-line analysis. More sophisticated real-time identification is less often addressed. Therefore, it is promising to consider the combination of current off-line, activity-detection methods with wearable, embedded tools in order to create a real-time wireless human activity recognition system with improved accuracy.
Different from previous work on activity recognition, the goal of this effort is to focus on specific activities that an astronaut may encounter during a mission. Planetary navigation field test (PNFT) tasks are designed to meet this need. The approach used by the KSU team is to pre-record data on the ground in normal earth gravity and seek signal features that can be used to identify, and even predict, fatigue associated with these activities. The eventual goal is to then assess/predict the condition of an astronaut in a reduced-gravity environment using these predetermined rules.
Several classic machine learning algorithms, including the k-Nearest Neighbor, Naïve Bayes, C4.5 Decision Tree, and Support Vector Machine approaches, were applied to these data to identify recognition algorithms suitable for real-time application. Graphical user interfaces (GUIs) were designed for both MATLAB and LabVIEW environments to facilitate recording and data analysis. Training data for the machine learning algorithms were recorded while subjects performed each activity, and then these identification approaches were applied to new data sets with an identification accuracy of around 86%. Early results indicate that a single three-axis accelerometer is sufficient to identify the occurrence of a given PNFT activity.
A custom, embedded acceleration monitoring system employing ZigBee transmission is under development for future real-time activity recognition studies. A different GUI has been implemented for this system, which uses an on-line algorithm that will seek to identify activity at a refresh rate of 1 Hz.
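As a rough illustration of how one of the listed classifiers could be applied to windowed three-axis accelerometer data, the sketch below computes two simple features per window and classifies with k-Nearest Neighbor. The feature choice (mean and standard deviation of acceleration magnitude), the activity labels and all numeric values are assumptions for illustration; the abstract does not state which features or PNFT activities were used.

```python
import math
from collections import Counter


def window_features(samples):
    """Mean and standard deviation of acceleration magnitude for one window.

    samples: list of (ax, ay, az) tuples, here assumed to be in units of g.
    """
    mags = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]
    mean = sum(mags) / len(mags)
    var = sum((m - mean) ** 2 for m in mags) / len(mags)
    return (mean, math.sqrt(var))


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(train, query, k=3):
    """Label the query by majority vote among its k nearest training windows."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical training windows: (feature vector, activity label).
train = [
    ((1.00, 0.05), "standing"),
    ((1.02, 0.04), "standing"),
    ((1.30, 0.50), "walking"),
    ((1.40, 0.60), "walking"),
    ((1.35, 0.55), "walking"),
]

# A hypothetical new window recorded while the sensor is at rest.
new_window = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
predicted = knn_predict(train, window_features(new_window), k=3)
print(predicted)
```

An on-line variant would call window_features on each newly completed window, for example once per second to match a 1 Hz refresh rate.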
|
85 |
A Content-Based Image Retrieval System for Fish Taxonomy
Teng, Fei 22 May 2006
It is estimated that less than ten percent of the world's species have been discovered and described. The main reason for the slow pace of new species description is that the science of taxonomy, as traditionally practiced, can be very laborious: taxonomists have to manually gather and analyze data from large numbers of specimens and identify the smallest subset of external body characters that uniquely diagnoses the new species as distinct from all its known relatives. The pace of data gathering and analysis can be greatly increased by information technology. In this paper, we propose a content-based image retrieval system for taxonomic research. The system can identify representative body shape characters of known species based on digitized landmarks and provide statistical clues to assist taxonomists in identifying new species or subspecies. Experiments on a taxonomic problem involving species of suckers in the genus Carpiodes demonstrate promising results.
|
86 |
Distributed Feature Selection in Large n and Large p Regression Problems
Wang, Xiangyu January 2016
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge arises in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator (message) algorithm for solving these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or a Bayesian variable selection method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors.
While sample space partitioning is useful in handling datasets with large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In the thesis, I propose a new embarrassingly parallel framework named DECO for distributed variable selection and parameter estimation. In DECO, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does not depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.
For datasets with both large sample sizes and high dimensionality, I propose a new "divide-and-conquer" framework, DEME (DECO-message), which leverages both the DECO and the message algorithms. The new framework first partitions the dataset in the sample space into row cubes using message and then partitions the feature space of the cubes using DECO. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted on a computer in parallel. The results are then synthesized via the DECO and message algorithms in reverse order to produce the final output. The whole framework is extremely scalable. / Dissertation
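The aggregation steps of the message algorithm described above (median feature inclusion, then averaging of per-subset estimates) can be sketched as follows. This is a simplified illustration, not the dissertation's code: the per-subset selection and refitting are assumed to have been done elsewhere, the function names are hypothetical, and the rule for a median of exactly 0.5 (possible with an even number of subsets) is an assumption.

```python
from statistics import median


def median_inclusion(inclusion_vectors):
    """Return indices of features whose median inclusion indicator is >= 0.5.

    inclusion_vectors: one 0/1 list per data subset, where 1 means the
    per-subset selector (e.g. a regularized regression) kept the feature.
    """
    return [
        j
        for j, column in enumerate(zip(*inclusion_vectors))
        if median(column) >= 0.5  # tie rule at 0.5 is an assumption
    ]


def average_estimates(per_subset_coefs):
    """Average coefficient estimates (feature index -> value) across subsets."""
    m = len(per_subset_coefs)
    features = per_subset_coefs[0].keys()
    return {j: sum(coefs[j] for coefs in per_subset_coefs) / m for j in features}


# Hypothetical selection results from three subsets over four features.
inclusion = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]
selected = median_inclusion(inclusion)

# Hypothetical coefficients obtained by refitting each subset on `selected`.
refits = [
    {0: 2.0, 2: -1.0},
    {0: 2.5, 2: -0.5},
    {0: 1.5, 2: -1.5},
]
averaged = average_estimates(refits)
print(selected, averaged)
```

Only the 0/1 inclusion vectors and the short coefficient vectors need to be exchanged between workers, which is consistent with the low communication cost the abstract emphasizes.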
|
87 |
Predicting rifle shooting accuracy from context and sensor data: A study of how to perform data mining and knowledge discovery in the target shooting domain / Prediktering av skytteträffsäkerhet baserat på kontext och sensordata
Pettersson, Max, Jansson, Viktor January 2019
The purpose of this thesis is to develop an interpretable model that predicts which factors impacted a shooter's results. Experiment is our chosen research method. Our three independent variables are weapon movement, trigger pull force and heart rate. Our dependent variable is shooting accuracy. A random forest regression model is trained with the experiment data to produce predictions of shooting accuracy and to show the correlation between the independent and dependent variables. Our method shows that increases in weapon movement, trigger pull force and heart rate decrease the predicted accuracy score. Weapon movement impacted shooting results the most, at 53.61%, while trigger pull force and heart rate impacted shooting results by 22.20% and 24.18% respectively. We have also shown that LIME can be a viable method for explaining how the measured factors impacted shooting results. The results from this thesis lay the groundwork for better training tools for target shooting using explainable prediction models with sensors.
|
88 |
Reconnaissance automatique des dimensions affectives dans l'interaction orale homme-machine pour des personnes dépendantes / Automatic Recognition of Affective Dimensions in the Oral Human-Machine Interaction for Dependent People
Chastagnol, Clément 04 October 2013
Most affective state recognition systems are trained on artificial data, without any realistic application context. Moreover, the evaluations are done with pre-recorded data of the same quality. This thesis seeks to tackle the various challenges resulting from the confrontation of these systems with real situations and users.
In order to obtain close-to-reality spontaneous emotional data, a data-collection system simulating a natural interaction was developed. It uses an expressive virtual character to conduct the interaction. Two emotional corpora were gathered with this system, with the participation of almost 80 patients from medical centers in the Montpellier region of France. This work was carried out as part of the French ANR ARMEN collaborative project.
These data were used to explore approaches to the problem of generalizing the performance of emotion detection systems to other data. Most of the work in this part deals with cross-corpus strategies and automatic selection of the best features. A hybrid algorithm combining floating selection techniques with similarity measures and multi-scale heuristics was proposed and used in the context of the InterSpeech 2012 Emotion Challenge. The results and insights gained with the help of this algorithm suggest ways of distinguishing between emotional corpora using their most relevant features.
A prototype of the complete dialog system, including the emotion detection module and the virtual agent, was also implemented.
|
89 |
Seleção supervisionada de características por ranking para processar consultas por similaridade em imagens médicas / Supervised feature selection by ranking to process similarity queries in medical images
Mamani, Gabriel Efrain Humpire 05 December 2012
Obtaining a representative and succinct description of medical images is a challenge that has been pursued by researchers in the area of medical image processing to support Computer-Aided Diagnosis (CAD). CAD systems use feature extraction algorithms to represent images, so different extractors can be evaluated. However, medical images contain internal structures that are important for identifying tissues, organs, malformations and diseases. It is usual for a large number of features to be extracted from the images. What appears to be beneficial can actually impair the process of indexing and retrieving images, through problems such as the curse of dimensionality. Thus, it is necessary to select the most relevant features to make the process more efficient and effective. This dissertation developed a supervised feature selection method called FSCoMS (Feature Selection based on Compactness Measure from Scatterplots) in order to obtain a ranking of features suited to the type of medical images under analysis. FSCoMS produced shorter and more efficient feature vectors for answering similarity queries. Additionally, the k-Gabor feature extractor was developed, which extracts features by gray levels, highlighting internal structures of medical images. The experiments were performed on four real-world medical image datasets. Results show that k-Gabor boosts similarity retrieval performance, whereas FSCoMS reduces feature redundancy to produce a feature vector that is more compact than those of conventional feature selection methods while achieving higher image retrieval performance.
|
90 |
Definição automática da quantidade de atributos selecionados em tarefas de agrupamento de dados / Automatic feature quantification in data clustering tasks
Andrade Filho, José Augusto 17 September 2013
Real-world datasets commonly present a large number of predictive or input features, which leads to an increased amount of information. However, this does not always imply an improvement in the performance of clustering techniques. Furthermore, some features may be correlated or add noise, reducing the quality of the data clustering. This problem motivated the development of feature selection techniques, which attempt to find the subset of features most relevant for clustering the data. In this work, we focus on the problem of unsupervised feature selection. This is a difficult problem, since there is no class label information. Therefore, there is no guide for measuring the quality of the feature subset. The main goal of this work is to define a method to identify the number of features to select (after sorting them based on some criterion). This task is carried out by means of the False Nearest Neighbors technique, which has its roots in chaos theory. Experimental results show that this technique gives a good approximate number of features to select. Compared to other techniques, in most of the analyzed cases it selects fewer features while maintaining the quality of the data partition.
|