abstract: Modern machine learning systems leverage data and features from multiple modalities to gain predictive power. In most scenarios, the modalities are vastly different and the acquired data are heterogeneous in nature. Consequently, building highly effective fusion algorithms is central to achieving improved model robustness and inference performance. This dissertation focuses on representation learning approaches as the fusion strategy. Specifically, the objective is to learn a shared latent representation that jointly exploits the structural information encoded in all modalities, such that a straightforward learning model can be adopted to obtain the prediction. We first consider sensor fusion, a typical multimodal fusion problem critical to building a pervasive computing platform. A systematic fusion technique is described that supports both multiple sensors and multiple descriptors for activity recognition. Designed to learn the optimal combination of kernels, Multiple Kernel Learning (MKL) algorithms have been successfully applied to numerous fusion problems in computer vision and related fields. Utilizing the MKL formulation, we next describe an auto-context algorithm for learning image context via fusion with low-level descriptors. Furthermore, a principled fusion algorithm that uses deep learning to optimize kernel machines is developed. By bridging deep architectures with kernel optimization, this approach leverages the benefits of both paradigms and is applied to a wide variety of fusion problems. In many real-world applications, the modalities exhibit highly specific data structures, such as time sequences and graphs, and consequently special designs of the learning architecture are needed. To improve temporal modeling for multivariate sequences, we develop two architectures centered around attention models. A novel clinical time series analysis model is proposed for several critical problems in healthcare.
Another model, coupled with a triplet ranking loss in a metric learning framework, is described to better solve speaker diarization. Compared to state-of-the-art recurrent networks, these attention-based multivariate analysis tools achieve improved performance with lower computational complexity. Finally, to perform community detection on multilayer graphs, a fusion algorithm is described that derives node embeddings from word embedding techniques and exploits the complementary relational information contained in each layer of the graph. / Dissertation/Thesis / Doctoral Dissertation Electrical Engineering 2018
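The kernel-combination idea at the heart of MKL can be illustrated with a small sketch; the base kernels and weights below are invented for the example, and an actual MKL solver would learn the weights from data rather than fixing them by hand:

```python
def combine_kernels(kernels, weights, tol=1e-9):
    """Convex combination K = sum_m w_m * K_m of base kernel matrices.

    kernels: list of M square matrices (lists of lists), all the same size.
    weights: list of M non-negative weights summing to 1 -- the quantities
    an MKL solver would learn; fixed by hand in this sketch.
    """
    if abs(sum(weights) - 1.0) > tol or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for w, K_m in zip(weights, kernels):
        for i in range(n):
            for j in range(n):
                combined[i][j] += w * K_m[i][j]
    return combined

# Two hand-made base kernels (e.g., one per sensor or descriptor).
K_linear = [[1.0, 0.5], [0.5, 1.0]]
K_rbf = [[1.0, 0.8], [0.8, 1.0]]
K = combine_kernels([K_linear, K_rbf], [0.25, 0.75])
```

The combined matrix remains a valid kernel because a convex combination of positive semidefinite matrices is positive semidefinite.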
Burrell, Lauren S.
17 November 2008
This research focused on the development of a methodology for analyzing functional magnetic resonance imaging (fMRI) data collected from patients with epilepsy in order to map epileptic networks. Epilepsy, a chronic neurological disorder characterized by recurrent, unprovoked seizures, affects up to 1% of the world's population. Antiepileptic drug therapies either do not successfully control seizures or have unacceptable side effects in over 30% of patients. Approximately one-third of patients whose seizures cannot be controlled by medication are candidates for surgical removal of the affected area of the brain, potentially rendering them seizure free. Accurate localization of the epileptogenic focus, i.e., the area of seizure onset, is critical for the best surgical outcome. The main objective of the research was to develop a set of fMRI data features that could be used to distinguish between normal brain tissue and the epileptic focus. To determine the best combination of features from various domains for mapping the focus, genetic programming and several feature selection methods were employed. These composite features and feature sets were subsequently used to train a classifier capable of discriminating between the two classes of voxels. The classifier was then applied to a separate testing set in order to generate maps showing brain voxels labeled as either normal or epileptogenic based on the best feature or set of features. It should be noted that although this work focuses on the application of fMRI analysis to epilepsy data, similar techniques could be used when studying brain activations due to other sources. In addition to investigating in vivo data collected from temporal lobe epilepsy patients with uncertain epileptic foci, phantom (simulated) data were created and processed to provide quantitative measures of the efficacy of the techniques.
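As an illustration of the kind of univariate feature screening that can precede classifier training on voxel features, the Fisher score below is a common generic criterion; it is not necessarily one of the selection methods (genetic programming and others) employed in this research:

```python
def fisher_score(feature_vals, labels):
    """Fisher discriminant score of one feature over two classes (0/1).

    Higher scores indicate features whose class means are well separated
    relative to within-class variance.
    """
    a = [x for x, y in zip(feature_vals, labels) if y == 0]
    b = [x for x, y in zip(feature_vals, labels) if y == 1]
    mean = lambda v: sum(v) / len(v)
    var = lambda v, m: sum((x - m) ** 2 for x in v) / len(v)
    ma, mb = mean(a), mean(b)
    denom = var(a, ma) + var(b, mb)
    return float("inf") if denom == 0 else (ma - mb) ** 2 / denom

def rank_features(X, labels, k):
    """Return indices of the k best features (columns of X) by Fisher score."""
    scores = [fisher_score([row[j] for row in X], labels)
              for j in range(len(X[0]))]
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    return order[:k]

# Toy voxel-feature matrix: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 5.1], [1.1, 0.9]]
labels = [0, 0, 1, 1]
best = rank_features(X, labels, 1)
```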
25 April 2016
Pedestrian detection is a canonical instance of object detection that remains a popular topic of research and a key problem in computer vision due to its diverse applications, which have the potential to improve quality of life. In recent years, the number of approaches to detecting pedestrians in monocular and binocular images has grown steadily; however, the use of multispectral imaging is still uncommon. This thesis presents a novel approach to data and feature fusion in a multispectral imaging system for pedestrian detection. It also covers the design and construction of a test rig that allows for quick collection of real-world driving data. The mathematical theory of the trifocal tensor is applied to post-process these data, enabling pixel-level data fusion across a multispectral data set. Performance results based on commonly used SVM classification architectures are evaluated against the collected data set. Lastly, a novel cascaded SVM architecture used in both classification and detection is discussed, and performance improvements through the use of feature fusion are demonstrated.
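The early-rejection behaviour that makes cascaded classifiers attractive for detection can be sketched as follows; the stages and thresholds here are illustrative stand-ins, not the trained SVM stages of the thesis:

```python
def cascade_classify(x, stages):
    """Evaluate a detection cascade: each stage is a (score_fn, threshold)
    pair.  A candidate window is rejected at the first stage whose score
    falls below its threshold; only windows passing every stage are
    accepted, so cheap stages filter out easy negatives before costly
    stages run.
    """
    for score_fn, threshold in stages:
        if score_fn(x) < threshold:
            return False  # rejected early: later stages never execute
    return True

# Hypothetical stages: a cheap intensity test, then a costlier "SVM-like" score.
stages = [
    (lambda x: x["mean_intensity"], 0.2),  # stage 1: quick reject
    (lambda x: x["svm_score"], 0.0),       # stage 2: stronger classifier
]
```

In a real detector the early stages are tuned for very high recall, since a window rejected early can never be recovered downstream.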
24 August 2015
No description available.
Automatic Building Change Detection Through Linear Feature Fusion and Difference of Gaussian Classification
Prince, Daniel Paul
January 2016
No description available.
Almeida, Adolfo Ricardo Lopes De
Since the popularisation of media streaming, video streaming services have been continually buying new video content to increase their potential profit. As such, newly added content has to be handled appropriately so that it can be recommended to suitable users. In this dissertation, the new-item cold-start problem is addressed by exploring the potential of various deep learning features to provide video recommendations. The deep learning features investigated include features that capture visual appearance as well as audio and motion information from video content. Different fusion methods are also explored to evaluate how well these feature modalities can be combined to fully exploit the complementary information captured by them. Experiments on a real-world video dataset for movie recommendations show that deep learning features outperform hand-crafted features. In particular, it is found that recommendations generated with deep learning audio features and action-centric deep learning features are superior to Mel-frequency cepstral coefficients (MFCC) and state-of-the-art improved dense trajectory (iDT) features. It is also found that combining the various deep learning features with textual metadata and hand-crafted features provides a significant improvement in recommendations, compared to combining only deep learning and hand-crafted features. / Dissertation (MEng (Computer Engineering))--University of Pretoria, 2021. / The MultiChoice Research Chair of Machine Learning at the University of Pretoria / UP Postgraduate Masters Research bursary / Electrical, Electronic and Computer Engineering / MEng (Computer Engineering) / Unrestricted
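A minimal sketch of late fusion for content-based item similarity, assuming per-modality feature vectors and hand-set weights; the modality names and weights are hypothetical, not those evaluated in the dissertation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 0.0 if den == 0 else num / den

def fused_similarity(item_a, item_b, weights):
    """Late fusion: similarity is computed per modality (e.g. 'visual',
    'audio' feature vectors) and combined as a weighted sum.
    """
    return sum(w * cosine(item_a[m], item_b[m]) for m, w in weights.items())

# Two toy items: identical visual features, orthogonal audio features.
item_a = {"visual": [1.0, 0.0], "audio": [1.0, 0.0]}
item_b = {"visual": [1.0, 0.0], "audio": [0.0, 1.0]}
sim = fused_similarity(item_a, item_b, {"visual": 0.5, "audio": 0.5})
```

For a cold-start item, ranking catalogue items by such a fused similarity lets it be recommended before any interaction data exist.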
This work focuses on analysing and improving feature detection and matching. After establishing an initial framework of study, four main areas of work are researched; these areas make up the main chapters of this thesis and centre on the Scale Invariant Feature Transform (SIFT). The preliminary analysis of the SIFT investigates how the algorithm functions, including an analysis of the SIFT feature descriptor space and an investigation into the noise properties of the SIFT. It introduces a novel use of the a contrario methodology and shows its success as a way of discriminating images that are likely to contain corresponding regions from images that do not. The parameter analysis of the SIFT uses both parameter sweeps and genetic algorithms as an intelligent means of setting the SIFT parameters for different image types, utilising a GPGPU implementation of SIFT. The results demonstrate which parameters are more important when optimising the algorithm and which areas of the parameter space to focus on when tuning the values. A multi-exposure, High Dynamic Range (HDR) feature-fusion process has been developed in which SIFT image features are matched within high-contrast scenes. Bracketed-exposure images are analysed, and features are extracted and combined from different images to create a set of features describing a larger dynamic range. These features are shown to reduce the effects of the noise and artefacts that are introduced when extracting features from HDR images directly, and to have superior image-matching performance. The final area is the development of a novel, 3D-based SIFT weighting technique that utilises the 3D data from a pair of stereo images to cluster and classify matched SIFT features. Weightings are applied to the matches based on the 3D properties of the features and how they cluster, in order to discriminate between correct and incorrect matches using the a contrario methodology.
The results show that the technique provides a method for discriminating between correct and incorrect matches, and that the a contrario methodology has potential for future investigation as a method for correct feature-match prediction.
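For context, the standard nearest-neighbour ratio test for matching SIFT-style descriptors, the common baseline against which more discriminative criteria such as the a contrario approach are compared, can be sketched as follows (the 0.8 threshold is the conventional default, not a value tuned in the thesis):

```python
import math

def euclidean(d1, d2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Lowe's nearest-neighbour ratio test: keep a match only if the
    closest descriptor in desc_b is sufficiently closer than the second
    closest.  Ambiguous matches, where two candidates are nearly
    equidistant, are discarded.
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = sorted((euclidean(d, e), j) for j, e in enumerate(desc_b))
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches
```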
Reconnaissance des émotions par traitement d’images / Emotion recognition based on image processing
Gharsalli, Sonia
12 July 2016
Emotion recognition is one of the most complex scientific domains. In recent years, a growing number of applications have attempted to automate it, in fields such as assistance for autistic children, video games and human-machine interaction. Emotions are conveyed through several channels; this research addresses facial emotional expressions, focusing on the six basic emotions: happiness, anger, fear, disgust, sadness and surprise. A comparative study of two emotion recognition methods, one based on geometric features and the other on appearance features, is carried out on the CK+ database of posed emotions and the FEEDTUM database of spontaneous emotions. Constraints such as changes in image resolution, the limited number of labelled images in emotion databases, and the recognition of new subjects not included in the training set are also taken into account. Various fusion schemes are then evaluated on new subjects not included in the training set. The results are promising for posed emotions (recognition rates exceed 86%) but remain insufficient for spontaneous emotions. A study of local facial regions made it possible to develop hybrid per-region methods, which improve the recognition rates for spontaneous emotions. Finally, an appearance-feature selection method based on importance scores is developed and compared with other selection methods; it improves the recognition rate relative to two methods taken from the literature.
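Score-level fusion of several classifiers, for example per-region emotion models, can be as simple as a weighted sum rule over posterior probabilities; this generic sketch does not claim to reproduce the fusion schemes actually evaluated in the thesis:

```python
def sum_rule_fusion(prob_lists, weights=None):
    """Weighted sum-rule fusion: average per-classifier posterior
    probabilities and predict the arg-max class.

    prob_lists: one probability vector (over the emotion classes) per
    classifier.  weights: optional per-classifier reliabilities, uniform
    by default.  Returns (predicted_class, fused_probabilities).
    """
    n = len(prob_lists)
    weights = weights or [1.0 / n] * n
    n_classes = len(prob_lists[0])
    fused = [sum(w * p[c] for w, p in zip(weights, prob_lists))
             for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: fused[c]), fused

# Two hypothetical per-region classifiers over three emotion classes.
label, fused = sum_rule_fusion([[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]])
```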
Projeto e desenvolvimento de técnicas forenses para identificação de imagens sintéticas / Design and development of forensic techniques for synthetic image identification
Tokuda, Eric Keiji, 1984-
21 August 2018
Orientadores: Hélio Pedrini, Anderson de Rezende Rocha / Dissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Computação / Previous issue date: 2012
The development of powerful and low-cost hardware devices, allied with great advances in content editing and authoring tools, has pushed the creation of computer generated images (CGI) to a degree of unrivaled realism. Differentiating a photorealistic computer generated image from a real photograph can be a difficult task for the naked eye, and digital forensics techniques can play a significant role in this task. Several methods for classifying photographs versus computer generated images exist in the literature, all of which focus on spotting differences between the two classes. At the current stage of Computer Graphics, however, no single image characterization completely solves the problem. This work presents a comparative study of ways of combining descriptors to address it: a large and heterogeneous test environment with diversity of content and quality was created; thirteen representative methods from the literature were implemented; four data-fusion approaches were designed and implemented; and the results of the isolated methods were compared with those of the same methods combined. The validation dataset comprised approximately 5,000 photographs and 5,000 computer generated images. Individually, the methods achieved accuracies of up to 93%; combined through the proposed fusion schemes, they reached 97% (a 57% reduction in the error of the best method in isolation). / Mestrado / Ciência da Computação / Mestre em Ciência da Computação
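One of the simplest fusion schemes one might try when combining binary CGI-vs-photograph detectors is a majority vote; this sketch is generic and does not reproduce any of the four fusion approaches implemented in the dissertation:

```python
def majority_vote(predictions):
    """Combine binary CGI-vs-photo decisions (1 = computer generated,
    0 = photograph) from several detectors by simple majority.  An odd
    detector count is required so that ties cannot occur.
    """
    if len(predictions) % 2 == 0:
        raise ValueError("use an odd number of detectors to avoid ties")
    return int(sum(predictions) > len(predictions) // 2)
```

Majority voting can outperform the best single detector when the detectors make sufficiently independent errors, which is one motivation for combining heterogeneous methods.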
A multimodal framework for geocoding digital objects / Um arcabouço multimodal para geocodificação de objetos digitais
Lin, Tzy Li, 1972-
24 August 2018
Orientador: Ricardo da Silva Torres / Tese (doutorado) - Universidade Estadual de Campinas, Instituto de Computação / Previous issue date: 2014
Geographical information is often enclosed in digital objects (such as documents, images, and videos), and its use to support the implementation of different services is of great interest. For example, map-based browsing services and geographic searches may take advantage of the geographic locations associated with digital objects, but implementing such services demands geocoded data collections. This work investigates the combination of textual and visual content to geocode digital objects and proposes a rank-aggregation framework for multimodal geocoding. Textual and visual information associated with videos and images is used to define ranked lists; these lists are then combined, and the resulting ranked list is used to define appropriate locations. An architecture implementing the framework is designed so that modality-specific modules (e.g., textual and visual) can be developed and evolved independently, with a data-fusion module responsible for seamlessly combining the ranked lists defined for each modality. Another contribution of this work is a new effectiveness evaluation measure named Weighted Average Score (WAS), based on distance scores that are combined to assess how effective a designed/tested approach is over its overall geocoding results for a given test dataset. The framework is validated in two contexts: the MediaEval 2012 Placing Task, whose objective is to automatically assign geographical coordinates to videos; and the task of geocoding photos of buildings from Virginia Tech (VT), USA. In the Placing Task, the results show how the multimodal approach improves geocoding compared with methods that rely on a single modality (either textual or visual descriptors), and that it yields results comparable to the best 2012 submissions that used no additional information beyond the available development/training data. In the VT building-photo task, experiments demonstrate that some of the evaluated local descriptors yield effective results, and that the descriptor selection criteria and their combination improve the results when the knowledge base has the same characteristics as the test set. / Doutorado / Ciência da Computação / Doutora em Ciência da Computação
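Rank aggregation over per-modality candidate lists can be illustrated with a Borda count, one classic aggregation rule; the candidate locations below are invented, and the dissertation's fusion module is not claimed to use Borda specifically:

```python
def borda_aggregate(ranked_lists):
    """Aggregate ranked candidate lists (one per modality) with a Borda
    count: each list awards a candidate (len - position) points, and
    candidates are re-ranked by total points (ties broken
    alphabetically for determinism).
    """
    scores = {}
    for ranking in ranked_lists:
        n = len(ranking)
        for pos, candidate in enumerate(ranking):
            scores[candidate] = scores.get(candidate, 0) + (n - pos)
    return sorted(scores, key=lambda c: (-scores[c], c))

# Hypothetical per-modality rankings of candidate locations for one video.
textual = ["paris", "lyon", "nice"]
visual = ["paris", "marseille", "lyon"]
fused = borda_aggregate([textual, visual])
```

Here the top fused candidate would then be reported as the object's geographic location.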