Spelling suggestions: "subject:"mmf"" "subject:"fmf""
31 |
Cluster Identification : Topic Models, Matrix Factorization And Concept Association NetworksArun, R 07 1900 (has links) (PDF)
The problem of identifying clusters arising in the context of topic models and related approaches is important in the area of machine learning. The problem concerning traversals on Concept Association Networks is of great interest in the area of cognitive modelling. Cluster identification is the problem of finding the right number of clusters in a given set of points(or a dataset) in different settings including topic models and matrix factorization algorithms. Traversals in Concept Association Networks provide useful insights into cognitive modelling and performance. First, We consider the problem of authorship attribution of stylometry and the problem of cluster identification for topic models. For the problem of authorship attribution we show empirically that by using stop-words as stylistic features of an author, vectors obtained from the Latent Dirichlet Allocation (LDA) , outperforms other classifiers. Topics obtained by this method are generally abstract and it may not be possible to identify the cohesiveness of words falling in the same topic by mere manual inspection. Hence it is difficult to determine if the chosen number of topics is optimal. We next address this issue. We propose a new measure for topics arising out of LDA based on the divergence between the singular value distribution and the L1 norm distribution of the document-topic and topic-word matrices, respectively. It is shown that under certain assumptions, this measure can be used to find the right number of topics. Next we consider the Non-negative Matrix Factorization(NMF) approach for clustering documents. We propose entropy based regularization for a variant of the NMF with row-stochastic constraints on the component matrices. It is shown that when topic-splitting occurs, (i.e when an extra topic is required) an existing topic vector splits into two and the divergence term in the cost function decreases whereas the entropy term increases leading to a regularization.
Next we consider the problem of clustering in Concept Association Networks(CAN). The CAN are generic graph models of relationships between abstract concepts. We propose a simple clustering algorithm which takes into account the complex network properties of CAN. The performance of the algorithm is compared with that of the graph-cut based spectral clustering algorithm. In addition, we study the properties of traversals by human participants on CAN. We obtain experimental results contrasting these traversals with those obtained from (i) random walk simulations and (ii) shortest path algorithms.
|
32 |
Extending the explanatory power of factor pricing models using topic modeling / Högre förklaringsgrad hos faktorprismodeller genom topic modelingEverling, Nils January 2017 (has links)
Factor models attribute stock returns to a linear combination of factors. A model with great explanatory power (R2) can be used to estimate the systematic risk of an investment. One of the most important factors is the industry which the company of the stock operates in. In commercial risk models this factor is often determined with a manually constructed stock classification scheme such as GICS. We present Natural Language Industry Scheme (NLIS), an automatic and multivalued classification scheme based on topic modeling. The topic modeling is performed on transcripts of company earnings calls and identifies a number of topics analogous to industries. We use non-negative matrix factorization (NMF) on a term-document matrix of the transcripts to perform the topic modeling. When set to explain returns of the MSCI USA index we find that NLIS consistently outperforms GICS, often by several hundred basis points. We attribute this to NLIS’ ability to assign a stock to multiple industries. We also suggest that the proportions of industry assignments for a given stock could correspond to expected future revenue sources rather than current revenue sources. This property could explain some of NLIS’ success since it closely relates to theoretical stock pricing. / Faktormodeller förklarar aktieprisrörelser med en linjär kombination av faktorer. En modell med hög förklaringsgrad (R2) kan användas föratt skatta en investerings systematiska risk. En av de viktigaste faktorerna är aktiebolagets industritillhörighet. I kommersiella risksystem bestäms industri oftast med ett aktieklassifikationsschema som GICS, publicerat av ett finansiellt institut. Vi presenterar Natural Language Industry Scheme (NLIS), ett automatiskt klassifikationsschema baserat på topic modeling. Vi utför topic modeling på transkript av aktiebolags investerarsamtal. Detta identifierar ämnen, eller topics, som är jämförbara med industrier. Topic modeling sker genom icke-negativmatrisfaktorisering (NMF) på en ord-dokumentmatris av transkripten. När NLIS används för att förklara prisrörelser hos MSCI USA-indexet finner vi att NLIS överträffar GICS, ofta med 2-3 procent. Detta tillskriver vi NLIS förmåga att ge flera industritillhörigheter åt samma aktie. Vi föreslår också att proportionerna hos industritillhörigheterna för en aktie kan motsvara förväntade inkomstkällor snarare än nuvarande inkomstkällor. Denna egenskap kan också vara en anledning till NLIS framgång då den nära relaterar till teoretisk aktieprissättning.
|
33 |
Help Document Recommendation SystemVijay Kumar, Keerthi, Mary Stanly, Pinky January 2023 (has links)
Help documents are important in an organization to use the technology applications licensed from a vendor. Customers and internal employees frequently use and interact with the help documents section to use the applications and know about the new features and developments in them. Help documents consist of various knowledge base materials, question and answer documents and help content. In day- to-day life, customers go through these documents to set up, install or use the product. Recommending similar documents to the customers can increase customer engagement in the product and can also help them proceed without any hurdles. The main aim of this study is to build a recommendation system by exploring different machine-learning techniques to recommend the most relevant and similar help document to the user. To achieve this, in this study a hybrid-based recommendation system for help documents is proposed where the documents are recommended based on similarity of the content using content-based filtering and similarity between the users using collaborative filtering. Finally, the recommendations from content-based filtering and collaborative filtering are combined and ranked to form a comprehensive list of recommendations. The proposed approach is evaluated by the internal employees of the company and by external users. Our experimental results demonstrate that the proposed approach is feasible and provides an effective way to recommend help documents.
|
34 |
Sentiment-Driven Topic Analysis Of Song LyricsSharma, Govind 08 1900 (has links) (PDF)
Sentiment Analysis is an area of Computer Science that deals with the impact a document makes on a user. The very field is further sub-divided into Opinion Mining and Emotion Analysis, the latter of which is the basis for the present work. Work on songs is aimed at building affective interactive applications such as music recommendation engines. Using song lyrics, we are interested in both supervised and unsupervised analyses, each of which has its own pros and cons.
For an unsupervised analysis (clustering), we use a standard probabilistic topic model called Latent Dirichlet Allocation (LDA). It mines topics from songs, which are nothing but probability distributions over the vocabulary of words. Some of the topics seem sentiment-based, motivating us to continue with this approach. We evaluate our clusters using a gold dataset collected from an apt website and get positive results. This approach would be useful in the absence of a supervisor dataset.
In another part of our work, we argue the inescapable existence of supervision in terms of having to manually analyse the topics returned. Further, we have also used explicit supervision in terms of a training dataset for a classifier to learn sentiment specific classes. This analysis helps reduce dimensionality and improve classification accuracy. We get excellent dimensionality reduction using Support Vector Machines (SVM) for feature selection. For re-classification, we use the Naive Bayes Classifier (NBC) and SVM, both of which perform well. We also use Non-negative Matrix Factorization (NMF) for classification, but observe that the results coincide with those of NBC, with no exceptions. This drives us towards establishing a theoretical equivalence between the two.
|
35 |
Programmation DC et DCA pour l'optimisation non convexe/optimisation globale en variables mixtes entières : Codes et Applications / DC programming and DCA for nonconvex optimization/ global optimization in mixed integer programming : Codes and applicationsPham, Viet Nga 18 April 2013 (has links)
Basés sur les outils théoriques et algorithmiques de la programmation DC et DCA, les travaux de recherche dans cette thèse portent sur les approches locales et globales pour l'optimisation non convexe et l'optimisation globale en variables mixtes entières. La thèse comporte 5 chapitres. Le premier chapitre présente les fondements de la programmation DC et DCA, et techniques de Séparation et Evaluation (B&B) (utilisant la technique de relaxation DC pour le calcul des bornes inférieures de la valeur optimale) pour l'optimisation globale. Y figure aussi des résultats concernant la pénalisation exacte pour la programmation en variables mixtes entières. Le deuxième chapitre est consacré au développement d'une méthode DCA pour la résolution d'une classe NP-difficile des programmes non convexes non linéaires en variables mixtes entières. Ces problèmes d'optimisation non convexe sont tout d'abord reformulées comme des programmes DC via les techniques de pénalisation en programmation DC de manière que les programmes DC résultants soient efficacement résolus par DCA et B&B bien adaptés. Comme première application en optimisation financière, nous avons modélisé le problème de gestion de portefeuille sous le coût de transaction concave et appliqué DCA et B&B à sa résolution. Dans le chapitre suivant nous étudions la modélisation du problème de minimisation du coût de transaction non convexe discontinu en gestion de portefeuille sous deux formes : la première est un programme DC obtenu en approximant la fonction objectif du problème original par une fonction DC polyèdrale et la deuxième est un programme DC mixte 0-1 équivalent. Et nous présentons DCA, B&B, et l'algorithme combiné DCA-B&B pour leur résolution. Le chapitre 4 étudie la résolution exacte du problème multi-objectif en variables mixtes binaires et présente deux applications concrètes de la méthode proposée. Nous nous intéressons dans le dernier chapitre à ces deux problématiques challenging : le problème de moindres carrés linéaires en variables entières bornées et celui de factorisation en matrices non négatives (Nonnegative Matrix Factorization (NMF)). La méthode NMF est particulièrement importante de par ses nombreuses et diverses applications tandis que les applications importantes du premier se trouvent en télécommunication. Les simulations numériques montrent la robustesse, rapidité (donc scalabilité), performance et la globalité de DCA par rapport aux méthodes existantes. / Based on theoretical and algorithmic tools of DC programming and DCA, the research in this thesis focus on the local and global approaches for non convex optimization and global mixed integer optimization. The thesis consists of 5 chapters. The first chapter presents fundamentals of DC programming and DCA, and techniques of Branch and Bound method (B&B) for global optimization (using the DC relaxation technique for calculating lower bounds of the optimal value). It shall include results concerning the exact penalty technique in mixed integer programming. The second chapter is devoted of a DCA method for solving a class of NP-hard nonconvex nonlinear mixed integer programs. These nonconvex problems are firstly reformulated as DC programs via penalty techniques in DC programming so that the resulting DC programs are effectively solved by DCA and B&B well adapted. As a first application in financial optimization, we modeled the problem pf portfolio selection under concave transaction costs and applied DCA and B&B to its solutions. In the next chapter we study the modeling of the problem of minimization of nonconvex discontinuous transaction costs in portfolio selection in two forms: the first is a DC program obtained by approximating the objective function of the original problem by a DC polyhedral function and the second is an equivalent mixed 0-1 DC program. And we present DCA, B&B algorithm, and a combined DCA-B&B algorithm for their solutions. Chapter 4 studied the exact solution for the multi-objective mixed zero-one linear programming problem and presents two practical applications of proposed method. We are interested int the last chapter two challenging problems: the linear integer least squares problem and the Nonnegative Mattrix Factorization problem (NMF). The NMF method is particularly important because of its many various applications of the first are in telecommunications. The numerical simulations show the robustness, speed (thus scalability), performance, and the globality of DCA in comparison to existent methods.
|
36 |
Fusion pour la séparation de sources audio / Fusion for audio source separationJaureguiberry, Xabier 16 June 2015 (has links)
La séparation aveugle de sources audio dans le cas sous-déterminé est un problème mathématique complexe dont il est aujourd'hui possible d'obtenir une solution satisfaisante, à condition de sélectionner la méthode la plus adaptée au problème posé et de savoir paramétrer celle-ci soigneusement. Afin d'automatiser cette étape de sélection déterminante, nous proposons dans cette thèse de recourir au principe de fusion. L'idée est simple : il s'agit, pour un problème donné, de sélectionner plusieurs méthodes de résolution plutôt qu'une seule et de les combiner afin d'en améliorer la solution. Pour cela, nous introduisons un cadre général de fusion qui consiste à formuler l'estimée d'une source comme la combinaison de plusieurs estimées de cette même source données par différents algorithmes de séparation, chaque estimée étant pondérée par un coefficient de fusion. Ces coefficients peuvent notamment être appris sur un ensemble d'apprentissage représentatif du problème posé par minimisation d'une fonction de coût liée à l'objectif de séparation. Pour aller plus loin, nous proposons également deux approches permettant d'adapter les coefficients de fusion au signal à séparer. La première formule la fusion dans un cadre bayésien, à la manière du moyennage bayésien de modèles. La deuxième exploite les réseaux de neurones profonds afin de déterminer des coefficients de fusion variant en temps. Toutes ces approches ont été évaluées sur deux corpus distincts : l'un dédié au rehaussement de la parole, l'autre dédié à l'extraction de voix chantée. Quelle que soit l'approche considérée, nos résultats montrent l'intérêt systématique de la fusion par rapport à la simple sélection, la fusion adaptative par réseau de neurones se révélant être la plus performante. / Underdetermined blind source separation is a complex mathematical problem that can be satisfyingly resolved for some practical applications, providing that the right separation method has been selected and carefully tuned. In order to automate this selection process, we propose in this thesis to resort to the principle of fusion which has been widely used in the related field of classification yet is still marginally exploited in source separation. Fusion consists in combining several methods to solve a given problem instead of selecting a unique one. To do so, we introduce a general fusion framework in which a source estimate is expressed as a linear combination of estimates of this same source given by different separation algorithms, each source estimate being weighted by a fusion coefficient. For a given task, fusion coefficients can then be learned on a representative training dataset by minimizing a cost function related to the separation objective. To go further, we also propose two ways to adapt the fusion coefficients to the mixture to be separated. The first one expresses the fusion of several non-negative matrix factorization (NMF) models in a Bayesian fashion similar to Bayesian model averaging. The second one aims at learning time-varying fusion coefficients thanks to deep neural networks. All proposed methods have been evaluated on two distinct corpora. The first one is dedicated to speech enhancement while the other deals with singing voice extraction. Experimental results show that fusion always outperform simple selection in all considered cases, best results being obtained by adaptive time-varying fusion with neural networks.
|
Page generated in 0.0431 seconds