1 |
CONTEXT AWARE PRIVACY PRESERVING CLUSTERING AND CLASSIFICATIONThapa, Nirmal 01 January 2013 (has links)
Data are valuable assets to any organizations or individuals. Data are sources of useful information which is a big part of decision making. All sectors have potential to benefit from having information. Commerce, health, and research are some of the fields that have benefited from data. On the other hand, the availability of the data makes it easy for anyone to exploit the data, which in many cases are private confidential data. It is necessary to preserve the confidentiality of the data. We study two categories of privacy: Data Value Hiding and Data Pattern Hiding. Privacy is a huge concern but equally important is the concern of data utility. Data should avoid privacy breach yet be usable. Although these two objectives are contradictory and achieving both at the same time is challenging, having knowledge of the purpose and the manner in which it will be utilized helps. In this research, we focus on some particular situations for clustering and classification problems and strive to balance the utility and privacy of the data.
In the first part of this dissertation, we propose Nonnegative Matrix Factorization (NMF) based techniques that accommodate constraints defined explicitly into the update rules. These constraints determine how the factorization takes place leading to the favorable results. These methods are designed to make alterations on the matrices such that user-specified cluster properties are introduced. These methods can be used to preserve data value as well as data pattern. As NMF and K-means are proven to be equivalent, NMF is an ideal choice for pattern hiding for clustering problems. In addition to the NMF based methods, we propose methods that take into account the data structures and the attribute properties for the classification problems. We separate the work into two different parts: linear classifiers and nonlinear classifiers. We propose two different solutions based on the classifiers. We study the effect of distortion on the utility of data.
We propose three distortion measurement metrics which demonstrate better characteristics than the traditional metrics. The effectiveness of the measures is examined on different benchmark datasets. The result shows that the methods have the desirable properties such as invariance to translation, rotation, and scaling.
|
2 |
Topic Analysis of Hidden Trends in Patented Features Using Nonnegative Matrix FactorizationLin, Yicong 01 January 2016 (has links)
Intellectual property has gained more attention in recent decades because innovations have become one of the most important resources. This paper implements a probabilistic topic model using nonnegative matrix factorization (NMF) to discover some of the key elements in computer patent, as the industry grew from 1990 to 2009. This paper proposes a new “shrinking model” based on NMF and also performs a close examination of some variations of the base model. Note that rather than studying the strategy to pick the optimized number of topics (“rank”), this paper is particularly interested in which factorization (including different kinds of initiation) methods are able to construct “topics” with the best quality given the predetermined rank. Performing NMF to the description text of patent features, we observe key topics emerge such as “platform” and “display” with strong presence across all years but we also see other short-lived significant topics such as “power” and “heat” which signify the saturation of the industry.
|
3 |
Examination of Initialization Techniques for Nonnegative Matrix FactorizationFrederic, John 21 November 2008 (has links)
While much research has been done regarding different Nonnegative Matrix Factorization (NMF) algorithms, less time has been spent looking at initialization techniques. In this thesis, four different initializations are considered. After a brief discussion of NMF, the four initializations are described and each one is independently examined, followed by a comparison of the techniques. Next, each initialization's performance is investigated with respect to the changes in the size of the data set. Finally, a method by which smaller data sets may be used to determine how to treat larger data sets is examined.
|
4 |
Nonnegative matrix factorization for clusteringKuang, Da 27 August 2014 (has links)
This dissertation shows that nonnegative matrix factorization (NMF) can be extended to a general and efficient clustering method. Clustering is one of the fundamental tasks in machine learning. It is useful for unsupervised knowledge discovery in a variety of applications such as text mining and genomic analysis. NMF is a dimension reduction method that approximates a nonnegative matrix by the product of two lower rank nonnegative matrices, and has shown great promise as a clustering method when a data set is represented as a nonnegative data matrix. However, challenges in the widespread use of NMF as a clustering method lie in its correctness and efficiency: First, we need to know why and when NMF could detect the true clusters and guarantee to deliver good clustering quality; second, existing algorithms for computing NMF are expensive and often take longer time than other clustering methods. We show that the original NMF can be improved from both aspects in the context of clustering. Our new NMF-based clustering methods can achieve better clustering quality and run orders of magnitude faster than the original NMF and other clustering methods.
Like other clustering methods, NMF places an implicit assumption on the cluster structure. Thus, the success of NMF as a clustering method depends on whether the representation of data in a vector space satisfies that assumption. Our approach to extending the original NMF to a general clustering method is to switch from the vector space representation of data points to a graph representation. The new formulation, called Symmetric NMF, takes a pairwise similarity matrix as an input and can be viewed as a graph clustering method. We evaluate this method on document clustering and image segmentation problems and find that it achieves better clustering accuracy. In addition, for the original NMF, it is difficult but important to choose the right number of clusters. We show that the widely-used consensus NMF in genomic analysis for choosing the number of clusters have critical flaws and can produce misleading results. We propose a variation of the prediction strength measure arising from statistical inference to evaluate the stability of clusters and select the right number of clusters. Our measure shows promising performances in artificial simulation experiments.
Large-scale applications bring substantial efficiency challenges to existing algorithms for computing NMF. An important example is topic modeling where users want to uncover the major themes in a large text collection. Our strategy of accelerating NMF-based clustering is to design algorithms that better suit the computer architecture as well as exploit the computing power of parallel platforms such as the graphic processing units (GPUs). A key observation is that applying rank-2 NMF that partitions a data set into two clusters in a recursive manner is much faster than applying the original NMF to obtain a flat clustering. We take advantage of a special property of rank-2 NMF and design an algorithm that runs faster than existing algorithms due to continuous memory access. Combined with a criterion to stop the recursion, our hierarchical clustering algorithm runs significantly faster and achieves even better clustering quality than existing methods. Another bottleneck of NMF algorithms, which is also a common bottleneck in many other machine learning applications, is to multiply a large sparse data matrix with a tall-and-skinny dense matrix. We use the GPUs to accelerate this routine for sparse matrices with an irregular sparsity structure. Overall, our algorithm shows significant improvement over popular topic modeling methods such as latent Dirichlet allocation, and runs more than 100 times faster on data sets with millions of documents.
|
5 |
Détection de changements en imagerie hyperspectrale : une approche directionnelle / Change detection in hyperspectral imagery : a directional approachBrisebarre, Godefroy 24 November 2014 (has links)
L’imagerie hyperspectrale est un type d’imagerie émergent qui connaît un essor important depuis le début des années 2000. Grâce à une structure spectrale très fine qui produit un volume de donnée très important, elle apporte, par rapport à l’imagerie visible classique, un supplément d’information pouvant être mis à profit dans de nombreux domaines d’exploitation. Nous nous intéressons spécifiquement à la détection et l’analyse de changements entre deux images de la même scène, pour des applications orientées vers la défense.Au sein de ce manuscrit, nous commençons par présenter l’imagerie hyperspectrale et les contraintes associées à son utilisation pour des problématiques de défense. Nous présentons ensuite une méthode de détection et de classification de changements basée sur la recherche de directions spécifiques dans l’espace généré par le couple d’images, puis sur la fusion des directions proches. Nous cherchons ensuite à exploiter l’information obtenue sur les changements en nous intéressant aux possibilités de dé-mélange de séries temporelles d’images d’une même scène. Enfin, nous présentons un certain nombre d’extensions qui pourront être réalisées afin de généraliser ou améliorer les travaux présentés et nous concluons. / Hyperspectral imagery is an emerging imagery technology which has known a growing interest since the 2000’s. This technology allows an impressive growth of the data registered from a specific scene compared to classical RGB imagery. Indeed, although the spatial resolution is significantly lower, the spectral resolution is very small and the covered spectral area is very wide. We focus on change detection between two images of a given scene for defense oriented purposes.In the following, we start by introducing hyperspectral imagery and the specificity of its exploitation for defence purposes. We then present a change detection and analysis method based on the search for specifical directions in the space generated by the image couple, followed by a merging of the nearby directions. We then exploit this information focusing on theunmixing capabilities of multitemporal hyperspectral data. Finally, we will present a range of further works that could be done in relation with our work and conclude about it.
|
6 |
Dictionary learning methods for single-channel source separationLefèvre, Augustin 03 October 2012 (has links) (PDF)
In this thesis we provide three main contributions to blind source separation methods based on NMF. Our first contribution is a group-sparsity inducing penalty specifically tailored for Itakura-Saito NMF. In many music tracks, there are whole intervals where only one source is active at the same time. The group-sparsity penalty we propose allows to blindly indentify these intervals and learn source specific dictionaries. As a consequence, those learned dictionaries can be used to do source separation in other parts of the track were several sources are active. These two tasks of identification and separation are performed simultaneously in one run of group-sparsity Itakura-Saito NMF. Our second contribution is an online algorithm for Itakura-Saito NMF that allows to learn dictionaries on very large audio tracks. Indeed, the memory complexity of a batch implementation NMF grows linearly with the length of the recordings and becomes prohibitive for signals longer than an hour. In contrast, our online algorithm is able to learn NMF on arbitrarily long signals with limited memory usage. Our third contribution deals user informed NMF. In short mixed signals, blind learning becomes very hard and sparsity do not retrieve interpretable dictionaries. Our contribution is very similar in spirit to inpainting. It relies on the empirical fact that, when observing the spectrogram of a mixture signal, an overwhelming proportion of it consists in regions where only one source is active. We describe an extension of NMF to take into account time-frequency localized information on the absence/presence of each source. We also investigate inferring this information with tools from machine learning.
|
7 |
Blind inverse imaging with positivity constraints / Inversion aveugle d'images avec contraintes de positivitéLecharlier, Loïc 09 September 2014 (has links)
Dans les problèmes inverses en imagerie, on suppose généralement connu l’opérateur ou matrice décrivant le système de formation de l’image. De façon équivalente pour un système linéaire, on suppose connue sa réponse impulsionnelle. Toutefois, ceci n’est pas une hypothèse réaliste pour de nombreuses applications pratiques pour lesquelles cet opérateur n’est en fait pas connu (ou n’est connu qu’approximativement). On a alors affaire à un problème d’inversion dite “aveugle”. Dans le cas de systèmes invariants par translation, on parle de “déconvolution aveugle” car à la fois l’image ou objet de départ et la réponse impulsionnelle doivent être estimées à partir de la seule image observée qui résulte d’une convolution et est affectée d’erreurs de mesure. Ce problème est notoirement difficile et pour pallier les ambiguïtés et les instabilités numériques inhérentes à ce type d’inversions, il faut recourir à des informations ou contraintes supplémentaires, telles que la positivité qui s’est avérée un levier de stabilisation puissant dans les problèmes d’imagerie non aveugle. La thèse propose de nouveaux algorithmes d’inversion aveugle dans un cadre discret ou discrétisé, en supposant que l’image inconnue, la matrice à inverser et les données sont positives. Le problème est formulé comme un problème d’optimisation (non convexe) où le terme d’attache aux données à minimiser, modélisant soit le cas de données de type Poisson (divergence de Kullback-Leibler) ou affectées de bruit gaussien (moindres carrés), est augmenté par des termes de pénalité sur les inconnues du problème. La stratégie d’optimisation consiste en des ajustements alternés de l’image à reconstruire et de la matrice à inverser qui sont de type multiplicatif et résultent de la minimisation de fonctions coût “surrogées” valables dans le cas positif. Le cadre assez général permet d’utiliser plusieurs types de pénalités, y compris sur la variation totale (lissée) de l’image. Une normalisation éventuelle de la réponse impulsionnelle ou de la matrice est également prévue à chaque itération. Des résultats de convergence pour ces algorithmes sont établis dans la thèse, tant en ce qui concerne la décroissance des fonctions coût que la convergence de la suite des itérés vers un point stationnaire. La méthodologie proposée est validée avec succès par des simulations numériques relatives à différentes applications telle que la déconvolution aveugle d'images en astronomie, la factorisation en matrices positives pour l’imagerie hyperspectrale et la déconvolution de densités en statistique. / Doctorat en Sciences / info:eu-repo/semantics/nonPublished
|
8 |
Topic Analysis of Tweets on the European Refugee Crisis Using Non-negative Matrix FactorizationShen, Chong 01 January 2016 (has links)
The ongoing European Refugee Crisis has been one of the most popular trending topics on Twitter for the past 8 months. This paper applies topic modeling on bulks of tweets to discover the hidden patterns within these social media discussions. In particular, we perform topic analysis through solving Non-negative Matrix Factorization (NMF) as an Inexact Alternating Least Squares problem. We accelerate the computation using techniques including tweet sampling and augmented NMF, compare NMF results with different ranks and visualize the outputs through topic representation and frequency plots. We observe that supportive sentiments maintained a strong presence while negative sentiments such as safety concerns have emerged over time.
|
9 |
Apprentissage avec la parcimonie et sur des données incertaines par la programmation DC et DCA / Learning with sparsity and uncertainty by Difference of Convex functions optimizationVo, Xuan Thanh 15 October 2015 (has links)
Dans cette thèse, nous nous concentrons sur le développement des méthodes d'optimisation pour résoudre certaines classes de problèmes d'apprentissage avec la parcimonie et/ou avec l'incertitude des données. Nos méthodes sont basées sur la programmation DC (Difference of Convex functions) et DCA (DC Algorithms) étant reconnues comme des outils puissants d'optimisation. La thèse se compose de deux parties : La première partie concerne la parcimonie tandis que la deuxième partie traite l'incertitude des données. Dans la première partie, une étude approfondie pour la minimisation de la norme zéro a été réalisée tant sur le plan théorique qu'algorithmique. Nous considérons une approximation DC commune de la norme zéro et développons quatre algorithmes basées sur la programmation DC et DCA pour résoudre le problème approché. Nous prouvons que nos algorithmes couvrent tous les algorithmes standards existants dans le domaine. Ensuite, nous étudions le problème de la factorisation en matrices non-négatives (NMF) et fournissons des algorithmes appropriés basés sur la programmation DC et DCA. Nous étudions également le problème de NMF parcimonieuse. Poursuivant cette étude, nous étudions le problème d'apprentissage de dictionnaire où la représentation parcimonieuse joue un rôle crucial. Dans la deuxième partie, nous exploitons la technique d'optimisation robuste pour traiter l'incertitude des données pour les deux problèmes importants dans l'apprentissage : la sélection de variables dans SVM (Support Vector Machines) et le clustering. Différents modèles d'incertitude sont étudiés. Les algorithmes basés sur DCA sont développés pour résoudre ces problèmes. / In this thesis, we focus on developing optimization approaches for solving some classes of optimization problems in sparsity and robust optimization for data uncertainty. Our methods are based on DC (Difference of Convex functions) programming and DCA (DC Algorithms) which are well-known as powerful tools in optimization. This thesis is composed of two parts: the first part concerns with sparsity while the second part deals with uncertainty. In the first part, a unified DC approximation approach to optimization problem involving the zero-norm in objective is thoroughly studied on both theoretical and computational aspects. We consider a common DC approximation of zero-norm that includes all standard sparse inducing penalty functions, and develop general DCA schemes that cover all standard algorithms in the field. Next, the thesis turns to the nonnegative matrix factorization (NMF) problem. We investigate the structure of the considered problem and provide appropriate DCA based algorithms. To enhance the performance of NMF, the sparse NMF formulations are proposed. Continuing this topic, we study the dictionary learning problem where sparse representation plays a crucial role. In the second part, we exploit robust optimization technique to deal with data uncertainty for two important problems in machine learning: feature selection in linear Support Vector Machines and clustering. In this context, individual data point is uncertain but varies in a bounded uncertainty set. Different models (box/spherical/ellipsoidal) related to uncertain data are studied. DCA based algorithms are developed to solve the robust problems
|
10 |
Simultaneous control of coupled actuators using singular value decomposition and semi-nonnegative matrix factorizationWinck, Ryder Christian 08 November 2012 (has links)
This thesis considers the application of singular value decomposition (SVD) and semi-nonnegative matrix factorization (SNMF) within feedback control systems, called the SVD System and SNMF System, to control numerous subsystems with a reduced number of control inputs. The subsystems are coupled using a row-column structure to allow mn subsystems to be controlled using m+n inputs. Past techniques for controlling systems in this row-column structure have focused on scheduling procedures that offer limited performance. The SVD and SNMF Systems permit simultaneous control of every subsystem, which increases the convergence rate by an order of magnitude compared with previous methods. In addition to closed loop control, open loop procedures using the SVD and SNMF are compared with previous scheduling procedures, demonstrating significant performance improvements. This thesis presents theoretical results for the controllability of systems using the row-column structure and for the stability and performance of the SVD and SNMF Systems. Practical challenges to the implementation of the SVD and SNMF Systems are also examined. Numerous simulation examples are provided, in particular, a dynamic simulation of a pin array device, called Digital Clay, and two physical demonstrations are used to assess the feasibility of the SVD and SNMF Systems for specific applications.
|
Page generated in 0.1669 seconds