301

Data Mining Algorithms for Classification of Complex Biomedical Data

Lan, Liang January 2012 (has links)
In my dissertation, I present research that contributes to solving three open problems in biomedical informatics: (1) multi-task approaches for microarray classification; (2) multi-label classification for gene and protein function prediction from multi-source biological data; (3) spatial scan for movement data. In microarray classification, samples belong to several predefined categories (e.g., cancer vs. control tissues) and the goal is to build a predictor that classifies a new tissue sample based on its microarray measurements. When faced with small-sample, high-dimensional microarray data, most machine learning algorithms produce an overly complicated model that performs well on training data but poorly on new data. To reduce the risk of over-fitting, feature selection becomes an essential technique in microarray classification. However, standard feature selection algorithms are bound to underperform when the size of the microarray dataset is particularly small. The best remedy is to borrow strength from external microarray datasets. In this dissertation, I present two new multi-task feature filter methods that improve classification performance by utilizing external microarray data. The first method aggregates the feature selection results from multiple microarray classification tasks; the resulting multi-task feature selection can be shown to improve the quality of the selected features and lead to higher classification accuracy. The second method jointly selects a small gene set with maximal discriminative power and minimal redundancy across multiple classification tasks by solving an objective function with integer constraints. In the protein function prediction problem, gene functions are predicted from a predefined set of possible functions (e.g., the functions defined in the Gene Ontology). 
Gene function prediction is a complex classification problem characterized by the following aspects: (1) a single gene may have multiple functions; (2) the functions are organized in a hierarchy; (3) the training data for each function are unbalanced (far fewer positive than negative examples); (4) class labels may be missing; (5) multiple biological data sources are available, such as microarray data, genome sequences and protein-protein interactions. As participants in the 2011 Critical Assessment of Function Annotation (CAFA) challenge, our team achieved the highest AUC accuracy among 45 groups. In the competition, we gained by focusing on the fifth aspect of the problem. Thus, in this dissertation, I discuss several schemes to integrate the prediction scores from multiple data sources and show their results. Interestingly, the experimental results show that a simple averaging integration method is competitive with other state-of-the-art data integration methods. The original spatial scan algorithm is used for the detection of spatial overdensities: the discovery of spatial subregions with significantly higher scores according to some density measure. This algorithm is widely used in identifying clusters of disease cases (e.g., identifying environmental risk factors for child leukemia). However, the original spatial scan algorithm only works on static spatial data. In this dissertation, I propose one possible solution for spatial scan on movement data. / Computer and Information Science
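The first multi-task filter idea above, aggregating feature selection results across tasks, can be sketched as a rank-averaging procedure. The toy data, the t-like score function and the task sizes below are illustrative assumptions for the sketch, not the dissertation's actual algorithm:

```python
import numpy as np

def task_scores(X, y):
    """Absolute two-sample t-like score for each gene in one task."""
    pos, neg = X[y == 1], X[y == 0]
    gap = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    return gap / (pos.std(axis=0) + neg.std(axis=0) + 1e-12)

def multitask_select(tasks, k):
    """Average each gene's rank across tasks; lower mean rank = better."""
    all_ranks = []
    for X, y in tasks:
        order = np.argsort(-task_scores(X, y))   # best gene first
        ranks = np.empty(X.shape[1], dtype=int)
        ranks[order] = np.arange(X.shape[1])
        all_ranks.append(ranks)
    return np.argsort(np.mean(all_ranks, axis=0))[:k]

# toy "microarrays": gene 0 is informative in both tasks, the rest are noise
rng = np.random.default_rng(0)
def make_task(n=40, d=6):
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, d))
    X[:, 0] += 3.0 * y                           # signal shared across tasks
    return X, y

selected = multitask_select([make_task(), make_task()], k=2)
```

Averaging ranks rather than raw scores keeps tasks with different score scales from dominating the aggregate.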
302

Data-Driven Supervised Classifiers in High-Dimensional Spaces: Application on Gene Expression Data

Efrem, Nabiel H. January 2024 (has links)
Several ready-to-use supervised classifiers perform well predictively in large-sample cases, but the same cannot generally be expected when transitioning to high-dimensional settings. This can be explained by the fact that classical supervised theory was not developed for high-dimensional spaces, leaving several classifiers poorly equipped against the curse of dimensionality. A rise in parsimonious classification procedures, particularly techniques incorporating feature selectors, can be observed. These can be interpreted as a two-step procedure: an arbitrary selector obtains a feature subset independently of a ready-to-use model, and unlabelled instances are subsequently classified within the selected subset. Modeling the two-step procedure is often heavy in motivation, while theoretical and algorithmic descriptions are frequently overlooked. In this thesis, we aim to describe the theoretical and algorithmic framework for employing a feature selector as a pre-processing step for the Support Vector Machine, and to assess its validity in high-dimensional settings. The validity of the proposed classifier is evaluated based on predictive performance through a comparative study with a state-of-the-art algorithm designed for advanced learning tasks. The chosen algorithm effectively employs feature relevance during training, making it suitable for high-dimensional settings. The results suggest that the proposed classifier performs predictively better than the Support Vector Machine in lower input dimensions; however, its performance tends to converge rapidly towards that of the Support Vector Machine for input dimensions beyond a certain threshold. Additionally, the thesis could not conclude that either the chosen state-of-the-art algorithm or the proposed classifier is strictly superior. Nonetheless, the state-of-the-art algorithm exhibits a more balanced performance across both labels.
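The two-step procedure, a feature selector feeding a Support Vector Machine, can be sketched with off-the-shelf components. The synthetic high-dimensional data, the univariate F-score selector and the choice of k = 20 features are placeholder assumptions for the sketch, not the thesis' exact setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# small-sample, high-dimensional stand-in for gene expression data
X, y = make_classification(n_samples=100, n_features=500, n_informative=10,
                           random_state=0)

clf = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),  # step 1: filter selector
    ("svm", SVC(kernel="linear")),             # step 2: SVM on the subset
])
score = cross_val_score(clf, X, y, cv=5).mean()
```

Wrapping both steps in a single pipeline ensures the selector is refit inside each cross-validation fold, avoiding selection bias in the estimate.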
303

Web genre classification using feature selection and semi-supervised learning

Chetry, Roshan January 1900 (has links)
Master of Science / Department of Computing and Information Sciences / Doina Caragea / As web pages continuously change and their number grows exponentially, the need for genre classification of web pages also increases. One simple reason for this is the need to group web pages into various genre categories in order to reduce the complexity of various web tasks (e.g., search). Experts unanimously agree on the huge potential of genre classification of web pages. However, while everybody agrees that genre classification of web pages is necessary, researchers face problems in finding enough labeled data to perform supervised classification of web pages into various genres. The high cost of skilled manual labor, the rapidly changing nature of the web and the never-ending growth in the number of web pages are the main reasons for the limited amount of labeled data. In contrast, unlabeled data can be acquired relatively inexpensively. This suggests the use of semi-supervised learning approaches for genre classification, instead of supervised approaches. Semi-supervised learning makes use of both labeled and unlabeled data for training - typically a small amount of labeled data and a large amount of unlabeled data. Semi-supervised learning has been extensively used in text classification problems. Given the link structure of the web, for web-page classification one can use link features in addition to the content features used for general text classification. Hence, the feature set corresponding to web pages can be easily divided into two views, namely content-based and link-based feature views. Intuitively, the two feature views are conditionally independent given the genre category and each has the ability to predict the class on its own. 
The scarcity of labeled data, the availability of large amounts of unlabeled data, and a richer set of features compared to conventional text classification tasks (specifically, complementary and sufficient feature views) have encouraged us to use co-training as a tool to perform semi-supervised learning. During co-training, labeled examples represented using the two views are used to learn distinct classifiers, which keep improving at each iteration by sharing their most confident predictions on the unlabeled data. In this work, we classify web pages of the .eu domain, consisting of 1,232 labeled hosts and 20,000 unlabeled hosts (provided by the European Archive Foundation [Benczur et al., 2010]), into six different genres using co-training. We compare our results with the results produced by standard supervised methods. We find that co-training can be an effective and cheap alternative to costly supervised learning. This is mainly due to the two independent and complementary feature sets of the web: content-based features and link-based features.
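The co-training loop described above can be sketched in a few lines. The Gaussian naive Bayes base learners, the two-class toy views and the promotion schedule below are illustrative assumptions; the thesis' actual classifiers and the .eu dataset are not reproduced here:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(Xc, Xl, y_seed, labeled, rounds=8, per_round=4):
    """Co-training sketch: one classifier per view; each round, the most
    confident predictions on unlabeled hosts are promoted to labeled."""
    labeled = labeled.copy()
    y_work = np.where(labeled, y_seed, -1)      # -1 marks unlabeled hosts
    for _ in range(rounds):
        for X in (Xc, Xl):                      # content view, then link view
            clf = GaussianNB().fit(X[labeled], y_work[labeled])
            pool = np.where(~labeled)[0]
            if len(pool) == 0:
                return y_work
            proba = clf.predict_proba(X[pool])
            top = np.argsort(-proba.max(axis=1))[:per_round]
            y_work[pool[top]] = clf.classes_[proba[top].argmax(axis=1)]
            labeled[pool[top]] = True
    return y_work

# toy two-view data: two genres, each view is informative on its own
rng = np.random.default_rng(1)
n = 60
y = rng.integers(0, 2, n)
Xc = rng.normal(size=(n, 2)) + 2.0 * y[:, None]   # "content" features
Xl = rng.normal(size=(n, 2)) - 2.0 * y[:, None]   # "link" features
labeled = np.zeros(n, dtype=bool)
labeled[np.where(y == 0)[0][:4]] = True           # 8 labeled seed hosts
labeled[np.where(y == 1)[0][:4]] = True
pred = co_train(Xc, Xl, y, labeled)
accuracy = (pred == y).mean()
```

Starting from only eight labeled hosts, the two view-specific classifiers bootstrap each other until every host carries a label.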
304

Interactive Object Retrieval using Interpretable Visual Models / Recherche Interactive d'Objets à l'Aide de Modèles Visuels Interprétables

Rebai, Ahmed 18 May 2011 (has links)
This thesis is an attempt to improve visual object retrieval by allowing users to interact with the system. Our solution lies in constructing an interactive system that allows users to define their own visual concept from a concise set of visual patches given as input. These patches, which represent the most informative clues of a given visual category, are trained beforehand with a supervised learning algorithm in a discriminative manner. Then, in order to specialize their models, users have the possibility to send feedback on the model itself by choosing and weighting the patches they are confident of. The real challenge consists in how to generate concise and visually interpretable models. Our contribution relies on two points. First, in contrast to the state-of-the-art approaches that use bags of words, we propose embedding local visual features without any quantization, which means that each component of the high-dimensional feature vectors used to describe an image is associated with a unique and precisely localized image patch. Second, we suggest using regularization constraints in the loss function of our classifier to favor sparsity in the models produced. Sparsity is indeed preferable for concision (a reduced number of patches in the model) as well as for decreasing prediction time. To meet these objectives, we developed a multiple-instance learning scheme using a modified version of the BLasso algorithm. BLasso is a boosting-like procedure that behaves in the same way as Lasso (Least Absolute Shrinkage and Selection Operator). It efficiently regularizes the loss function with an additive L1-constraint by alternating between forward and backward steps at each iteration. The method we propose here is generic in the sense that it can be used with any local features or feature sets representing the content of an image region.
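The effect of the L1 constraint on model concision can be illustrated with an ordinary Lasso fit; this is a sketch of why the regularization yields sparse models, using scikit-learn's Lasso rather than the modified BLasso of the thesis, on hypothetical synthetic data:

```python
import numpy as np
from sklearn.linear_model import Lasso

# synthetic stand-in: 30 candidate "patches", only 3 truly discriminative
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 30))
true_w = np.zeros(30)
true_w[:3] = [2.0, -1.5, 1.0]
y = X @ true_w + 0.1 * rng.normal(size=80)

model = Lasso(alpha=0.1).fit(X, y)               # L1-regularized fit
n_active = int(np.sum(model.coef_ != 0))         # patches kept in the model
```

The L1 penalty drives most coefficients exactly to zero, so only a handful of patches remain in the final model, which is what makes it concise and fast at prediction time.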
305

Object detection, recognition and re-identification in video footage

Irhebhude, Martins January 2015 (has links)
There have been a significant number of security concerns in recent times; as a result, security cameras have been installed to monitor activities and to prevent crimes in most public places. The resulting footage is analysed either through video analytics or through forensic operations based on human observation. To this end, within the research context of this thesis, a proactive machine-vision-based military recognition system has been developed to help monitor activities in the military environment. The proposed object detection, recognition and re-identification systems are presented in this thesis. A novel technique for military personnel recognition is presented. Initially, the detected camouflaged personnel are segmented using the GrabCut segmentation algorithm. Since a camouflaged person's uniform generally appears similar at both the top and the bottom of the body, an image patch is first extracted from the segmented foreground image and used as the region of interest. Subsequently, colour and texture features are extracted from each patch and used for classification. A second approach for personnel recognition is proposed through the recognition of the badge on the cap of a military person. A feature matching metric based on Speeded-Up Robust Features (SURF) extracted from the badge on a person's cap enables the recognition of that person's arm of service. A state-of-the-art technique for recognising vehicle types irrespective of their view angle is also presented. Vehicles are initially detected and segmented using a Gaussian Mixture Model (GMM) based foreground/background segmentation algorithm. A Canny Edge Detection (CED) stage, followed by morphological operations, is used as a pre-processing stage to enhance foreground vehicular object detection and segmentation. 
Subsequently, Region, Histogram of Oriented Gradients (HOG) and Local Binary Pattern (LBP) features are extracted from the refined foreground vehicle object and used for vehicle type recognition. Two different datasets with varying front/rear and angled views are used and combined for testing the proposed technique. For night-time video analytics and forensics, the thesis presents a novel approach to pedestrian detection and vehicle type recognition. A novel feature acquisition technique, named CENTROG, is proposed for pedestrian detection and vehicle type recognition. Thermal images containing pedestrians and vehicular objects are used to analyse the performance of the proposed algorithms. The video is initially segmented using a GMM-based foreground object segmentation algorithm. A CED-based pre-processing step is used to enhance segmentation accuracy prior to applying Census Transforms for initial feature extraction. HOG features are then extracted from the Census-transformed images and used for the detection and recognition of human and vehicular objects in thermal images, respectively. Finally, a novel technique for people re-identification is proposed based on low-level colour features and mid-level attributes. The low-level colour histogram bin values were normalised to the range 0 to 1. A publicly available dataset (VIPeR) and a self-constructed dataset were used in the experiments, conducted with 7 clothing attributes and low-level colour histogram features. These 7 attributes are detected using features extracted from 5 different regions of a detected human object using an SVM classifier. The low-level colour features were extracted from the same regions. These 5 regions are obtained by human object segmentation and subsequent body-part subdivision. People are re-identified by computing the Euclidean distance between a probe and the gallery image sets. 
The experiments conducted using the SVM classifier and Euclidean distance have shown that the proposed techniques attain all of the aforementioned goals. The colour and texture features proposed for camouflaged military personnel recognition surpass the state-of-the-art methods. Similarly, experiments show that combined features perform best when recognising vehicles in different views after initial training based on multiple views. In the same vein, the proposed CENTROG technique performed better than the state-of-the-art CENTRIST technique for both pedestrian detection and vehicle type recognition at night-time using thermal images. Finally, we show that the proposed 7 mid-level attributes and the low-level features result in improved accuracy for people re-identification.
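The re-identification step, normalised colour histograms compared by Euclidean distance, can be sketched as follows; the tiny random "person" images and the 8-bin per-channel histogram are illustrative assumptions, not the thesis' region-based descriptor:

```python
import numpy as np

def color_hist(image, bins=8):
    """Per-channel colour histogram, normalised so the bins sum to 1."""
    h = np.concatenate([np.histogram(image[..., c], bins=bins,
                                     range=(0, 256))[0]
                        for c in range(3)]).astype(float)
    return h / (h.sum() + 1e-12)

def reidentify(probe, gallery):
    """Rank gallery images by Euclidean distance to the probe descriptor."""
    d = [np.linalg.norm(color_hist(probe) - color_hist(g)) for g in gallery]
    return int(np.argmin(d))                    # index of the best match

# two hypothetical people: one wearing mostly red, one mostly blue
rng = np.random.default_rng(0)
red = rng.integers(0, 60, (32, 16, 3));  red[..., 0] += 180
blue = rng.integers(0, 60, (32, 16, 3)); blue[..., 2] += 180
probe = red + rng.integers(-5, 5, red.shape)    # same person, new view
match = reidentify(probe, [blue, red])
```

Because the histograms are normalised, the distance is insensitive to image size, and the probe is matched to the gallery entry with the most similar colour distribution.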
306

Development of a system for analysing sensor data and selection of the optimal sensor grouping for monitoring food cooking in an oven / Développement du système d'analyse des données recueillies par les capteurs et choix du groupement de capteurs optimal pour le suivi de la cuisson des aliments dans un four

Monrousseau, Thomas 22 November 2016 (has links)
In a world where all personal devices are becoming smart and connected, a group of French manufacturers launched a project to make ovens able to detect the cooking state of fish and meat without a contact sensor. This thesis takes place in this context and is divided into two major parts. The first is a feature selection phase, making it possible to classify food into three states: underbaked, well baked and overbaked. The point of this selection method, based on fuzzy logic, is to strongly reduce the number of features obtained from specific laboratory sensors. 
The second part concerns on-line monitoring of the food cooking state by several methods. These techniques are: a classification algorithm over ten core cooking states, the use of a discretized version of the heat equation, and the development of a soft sensor based on an artificial neural network model built from cooking experiments to infer the temperature inside the food from available on-line measurements. These algorithms have been implemented on a microcontroller equipping a prototype version of a new oven, in order to be tested and validated on real use cases.
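The soft-sensor idea, a neural network trained on cooking experiments to infer the core temperature from on-line measurements, can be sketched as below. The first-order lag law generating the synthetic "experiments", the input scaling and the network size are all assumptions for the sketch, not the thesis' model:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# hypothetical "cooking experiments": the core temperature approaches a
# level set by the oven temperature with a first-order time lag
rng = np.random.default_rng(0)
t = rng.uniform(0, 60, 500)                        # minutes in the oven
oven = rng.uniform(160, 220, 500)                  # oven air temperature (C)
core = 20 + (0.4 * oven - 8) * (1 - np.exp(-t / 25))
X = np.column_stack([t / 60, (oven - 160) / 60])   # scaled on-line inputs
y = core + rng.normal(0, 0.5, 500)                 # noisy core-probe readings

soft_sensor = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=5000,
                           random_state=0).fit(X, y)
mae = np.abs(soft_sensor.predict(X) - core).mean() # mean error in deg C
```

Once trained, the network replaces the contact probe: at run time only the elapsed time and oven temperature are needed to estimate the temperature inside the food.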
307

LearnInPlanner: uma abordagem de aprendizado supervisionado com redes neurais para solução de problemas de planejamento clássico / LearnInPlanner : a supervised learning approach with neural networks to solve problems of classical planning

Santos, Rosiane Correia 19 November 2013 (has links)
The forward state-space search is one of the most popular approaches in Automated Planning. The performance of forward search algorithms is affected by the domain-independent heuristic being used. In this context, this work investigates supervised machine learning techniques that make it possible to augment the relaxed plan heuristic, commonly used in current planning approaches, with information about the domain that can be useful to the search algorithm. This information is represented through a feature space of the planning problem, and an MLP neural network is applied to estimate a new heuristic function to guide the search, through a non-linear regression process. Since the set of features available for the construction of the new heuristic function is large, it was necessary to define a feature selection process capable of determining which set of neural network input features would result in the best performance for the regression model. Therefore, a genetic algorithm approach was applied for feature selection. As the main result, a comparative performance analysis is presented between using the heuristic proposed in this work and using the relaxed plan heuristic to guide the search algorithm in the planning task. For the empirical analysis, domains of different complexities provided by the International Planning Competitions were used. In addition to the empirical results and comparative analyses, the contributions of this work include the development of a new domain-independent planner, named LearnInPlanner. This planner uses the new heuristic function estimated by the learning process, together with Greedy Best-First Search, to solve planning problems.
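The search side of the planner can be sketched generically: greedy best-first search always expands the frontier node with the smallest heuristic value, and the heuristic is just a function argument, so a hand-coded estimate and a learned regression model are interchangeable. The 5x5 grid and Manhattan-distance heuristic below are toy stand-ins for a planning state space:

```python
import heapq

def greedy_best_first(start, goal, neighbors, h):
    """Greedy best-first search: expand the node with the smallest h value.
    h may be a hand-coded heuristic or one estimated by a learned model."""
    frontier = [(h(start), start)]
    parent = {start: None}
    while frontier:
        _, node = heapq.heappop(frontier)
        if node == goal:                      # reconstruct the found path
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in neighbors(node):
            if nxt not in parent:
                parent[nxt] = node
                heapq.heappush(frontier, (h(nxt), nxt))
    return None

# toy state space: a 5x5 grid with 4-connected moves
def neighbors(p):
    x, y = p
    return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < 5 and 0 <= y + dy < 5]

h = lambda p: abs(p[0] - 4) + abs(p[1] - 4)   # Manhattan distance to goal
path = greedy_best_first((0, 0), (4, 4), neighbors, h)
```

A learned heuristic plugs in by wrapping the regression model, e.g. `h = lambda s: model.predict([features(s)])[0]`, with `model` and `features` supplied by the learning phase.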
308

Seleção de atributos para aprendizagem multirrótulo / Feature selection for multi-label learning

Spolaôr, Newton 24 September 2014 (has links)
Irrelevant and/or redundant features in data can deteriorate the performance of the classifiers built from that data by machine learning algorithms. The aim of feature selection algorithms is to identify these features and remove them from the data before constructing classifiers. Feature selection in single-label data, in which each instance in the training set is associated with only one label, has been widely studied in the literature. However, this is not the case for multi-label data, in which each instance is associated with a set of labels. Moreover, as multi-label data usually exhibit relationships among the labels in the label set, machine learning algorithms should take these relationships into account. Therefore, label dependence should also be explored by multi-label feature selection algorithms. The filter approach is one of the most common approaches in feature selection algorithms, as it has potentially lower computational cost than other approaches and uses general properties of the data to calculate feature importance measures, such as the feature-class correlation. The hypothesis of this work is that feature selection algorithms that consider label dependence will perform better than the ones that disregard it. To this end, this work proposes and develops filter-approach multi-label feature selection algorithms that take relations among labels into account. In particular, we propose two methods that do so by performing label construction and by a novel adaptation of the single-label feature selection algorithm ReliefF. These methods were experimentally evaluated, showing good performance in terms of feature reduction and the predictive quality of the classifiers built using the selected features.
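One hedged reading of adapting ReliefF to multi-label data is sketched below: an instance's "near hit" shares its full label set and a "near miss" differs in at least one label, with the label Hamming distance standing in for the single-label class test. This is an illustrative simplification, not the thesis' actual adaptation:

```python
import numpy as np

def ml_relieff(X, Y, n_neighbors=3):
    """Simplified multi-label ReliefF sketch: hits share the whole label
    set, misses differ (Hamming distance on the label vectors)."""
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                              # skip the instance itself
        order = np.argsort(dist)
        label_diff = (Y[order] != Y[i]).mean(axis=1)  # label Hamming distance
        hits = order[label_diff == 0][:n_neighbors]
        misses = order[label_diff > 0][:n_neighbors]
        if len(hits) == 0 or len(misses) == 0:
            continue
        # reward features that separate misses, penalize those separating hits
        w += np.abs(X[misses] - X[i]).mean(axis=0)
        w -= np.abs(X[hits] - X[i]).mean(axis=0)
    return w

# toy data: feature 0 drives both (correlated) labels, feature 1 is noise
rng = np.random.default_rng(2)
z = rng.integers(0, 2, 100)
X = np.column_stack([z + 0.1 * rng.normal(size=100), rng.normal(size=100)])
Y = np.column_stack([z, z])                           # dependent labels
weights = ml_relieff(X, Y)
```

Because the hit/miss test uses the whole label vector, a feature only scores well if it separates instances across the full label set, which is where label dependence enters the filter.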
309

Effective and unsupervised fractal-based feature selection for very large datasets: removing linear and non-linear attribute correlations / Seleção de atributos efetiva e não-supervisionada em grandes bases de dados: aplicando a Teoria de Fractais para remover correlações lineares e não-lineares

Fraideinberze, Antonio Canabrava 04 September 2017 (has links)
Given a very large dataset of moderate-to-high dimensionality, how can one mine useful patterns from it? In such cases, dimensionality reduction is essential to overcome the well-known curse of dimensionality. Although there exist algorithms to reduce the dimensionality of Big Data, unfortunately, they all fail to identify or eliminate non-linear correlations that may occur between the attributes. This MSc work tackles the problem by exploring concepts of Fractal Theory and massive parallel processing to present Curl-Remover, a novel dimensionality reduction technique for very large datasets. Our contributions are: (a) Curl-Remover eliminates linear and non-linear attribute correlations as well as irrelevant attributes; (b) it is unsupervised and suits analytical tasks in general, not only classification; (c) it presents linear scale-up in both the data size and the number of machines used; (d) it does not require the user to guess the number of attributes to be removed; and (e) it preserves the attributes' semantics by performing feature selection, not feature extraction. We executed experiments on synthetic and real data spanning up to 1.1 billion points, and report that our proposed Curl-Remover outperformed two state-of-the-art PCA-based algorithms, being on average up to 8% more accurate.
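The fractal intuition behind such techniques is that correlated attributes do not add intrinsic dimension: a point set lying on y = x in 2-D still has fractal dimension close to 1. A minimal box-counting sketch of this idea (box counting rather than the correlation fractal dimension, and toy data rather than Curl-Remover itself) is:

```python
import numpy as np

def box_counting_dim(X, sizes=(0.5, 0.25, 0.125, 0.0625)):
    """Box-counting estimate of the intrinsic (fractal) dimension of a
    point set, after scaling it into the unit hypercube."""
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    logs = []
    for r in sizes:
        cells = np.floor(X / r).astype(int)
        n_boxes = len({tuple(c) for c in cells})   # occupied grid cells
        logs.append((np.log(1 / r), np.log(n_boxes)))
    xs, ys = zip(*logs)
    return np.polyfit(xs, ys, 1)[0]                # slope = dimension

rng = np.random.default_rng(0)
t = rng.uniform(0, 1, 4000)
line2d = np.column_stack([t, t])             # y = x: correlated attributes
plane2d = rng.uniform(0, 1, (4000, 2))       # independent attributes

d_line = box_counting_dim(line2d)            # close to 1
d_plane = box_counting_dim(plane2d)          # close to 2
```

A fractal-based selector can exploit this: if dropping an attribute leaves the estimated intrinsic dimension unchanged, the attribute was (linearly or non-linearly) determined by the others.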
310

Feature selection for biometric recognition based on electrocardiogram signals

Teodoro, Felipe Gustavo Silva 22 June 2016 (has links)
The field of biometrics encompasses a variety of technologies used to identify and verify a person's identity by measuring and analyzing various physical and/or behavioral aspects of the human being. Several biometric modalities have been proposed for the recognition of people, such as fingerprints, iris, face and speech. These biometric modalities have distinct characteristics in terms of performance, measurability and acceptability. One issue to be considered in real-world applications of biometric systems is their robustness to attacks by circumvention, spoofing and obfuscation. These attacks are becoming more frequent, and questions are being raised about the levels of security that this technology can offer. Recently, biomedical signals such as the electrocardiogram (ECG), electroencephalogram (EEG) and electromyogram (EMG) have been studied for use in biometric recognition problems. The formation of the ECG signal is a function of the structural and functional anatomy of the heart and its surrounding tissues. 
Therefore, the ECG of an individual exhibits a unique cardiac pattern that cannot be easily forged or duplicated, which has motivated its use in identification systems. However, the number of features that can be extracted from these signals is very large. Feature selection has become the focus of much research in areas where databases formed by tens or hundreds of thousands of features are available. Feature selection helps in understanding the data, reducing the computational cost, mitigating the effect of the curse of dimensionality and improving predictor performance. The goal of feature selection is to choose a subset of features from the input that can efficiently describe the input data while reducing the effects of noise or irrelevant features and still providing good prediction results. The aim of this dissertation is to analyze the impact of feature selection techniques such as greedy search, backward selection, the Genetic Algorithm, the Memetic Algorithm and Particle Swarm Optimization on the performance achieved by ECG-based biometric systems. The classifiers used were k-Nearest Neighbors, Support Vector Machines, Optimum-Path Forest and a minimum-distance classifier. The results demonstrate that there is a subset of features extracted from the ECG signal capable of providing high recognition rates.
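As a rough illustration of the wrapper-style search strategies the abstract names (the dissertation's actual feature sets, classifiers and protocol are not reproduced here), the sketch below runs backward feature selection around a leave-one-out k-nearest-neighbors classifier. The "ECG features" are synthetic stand-ins: two informative columns that separate two hypothetical subjects, plus pure-noise columns.

```python
# Illustrative backward feature selection wrapped around k-NN (synthetic data).
import numpy as np

rng = np.random.default_rng(42)
n_per_class, n_noise = 60, 6
# The two classes (hypothetical subjects) differ only in the first two features.
informative = np.vstack([
    rng.normal(0.0, 1.0, (n_per_class, 2)),
    rng.normal(3.0, 1.0, (n_per_class, 2)),
])
noise = rng.normal(0.0, 1.0, (2 * n_per_class, n_noise))
X = np.hstack([informative, noise])
y = np.repeat([0, 1], n_per_class)

def knn_accuracy(X, y, k=3):
    """Leave-one-out accuracy of a k-NN classifier (Euclidean distance)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # exclude each sample itself
    neighbours = np.argsort(dist, axis=1)[:, :k]
    pred = (y[neighbours].mean(axis=1) > 0.5).astype(int)
    return (pred == y).mean()

def backward_selection(X, y):
    """Drop one feature at a time as long as accuracy does not decrease."""
    features = list(range(X.shape[1]))
    best = knn_accuracy(X[:, features], y)
    improved = True
    while improved and len(features) > 1:
        improved = False
        for j in list(features):
            trial = [f for f in features if f != j]
            acc = knn_accuracy(X[:, trial], y)
            if acc >= best:                   # removing j does not hurt
                best, features, improved = acc, trial, True
                break
    return features, best

selected, acc = backward_selection(X, y)
print(selected, acc)  # a small feature subset that keeps accuracy high
```

The genetic, memetic and particle-swarm variants differ only in how the subset space is searched; the wrapped classifier and the accuracy-driven objective stay the same.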
