41 |
Pretopology and Topic Modeling for Complex Systems Analysis: Application on Document Classification and Complex Network Analysis / Prétopologie et modélisation de sujets pour l'analyse de systèmes complexes : application à la classification de documents et à l'analyse de réseaux complexes
Bui, Quang Vu 27 September 2018 (has links)
This thesis develops algorithms for document classification on the one hand and complex network analysis on the other, based on pretopology, a theory that models the concept of proximity. The first work develops a framework for document clustering that combines topic modeling and pretopology. Our contribution proposes using topic distributions extracted from topic modeling as input to classification methods. In this approach, we investigated two aspects: determining an appropriate distance between documents by studying the relevance of probabilistic and vector-based measures, and carrying out groupings according to several criteria using a pseudo-distance defined from pretopology. The second work introduces a general framework for modeling complex networks by developing a reformulation of stochastic pretopology, and proposes the Pretopology Cascade Model as a general model of information diffusion. In addition, we proposed an agent-based model, Textual-ABM, to analyze dynamic complex networks associated with textual information using the author-topic model, and introduced Textual-Homo-IC, an independent cascade model based on resemblance, in which homophily is measured on textual content obtained by topic modeling.
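Below is a minimal, illustrative sketch (not the thesis's actual pipeline) of the first idea: documents are represented by LDA topic distributions and grouped with a probabilistic distance. A standard Hellinger distance and plain hierarchical clustering stand in for the pretopological pseudo-distance and pseudo-closure described above; the toy documents and number of topics are assumptions.

# Minimal sketch: topic distributions as clustering input with a probabilistic distance.
# The pretopological pseudo-closure step of the thesis is NOT reproduced here; plain
# hierarchical clustering over Hellinger distances stands in for it.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

docs = ["pretopology models proximity", "topic models describe documents",
        "networks diffuse information", "cascades spread over networks"]

X = CountVectorizer().fit_transform(docs)
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)  # doc-topic distributions

def hellinger(p, q):
    # Probabilistic distance between two topic distributions.
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

n = theta.shape[0]
D = np.array([[hellinger(theta[i], theta[j]) for j in range(n)] for i in range(n)])
labels = fcluster(linkage(squareform(D, checks=False), method="average"), t=2, criterion="maxclust")
print(labels)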
|
42 |
Near-Duplicate Detection Using Instance Level Constraints
Patel, Vishal 08 1900 (has links) (PDF)
For the task of near-duplicate document detection, comparison approaches based on bag-of-words used in the information retrieval community are not sufficiently accurate. This work presents a novel approach for the setting in which instance-level constraints are given for documents and near-duplicates need to be retrieved for a new query document. The framework incorporates the instance-level constraints and clusters documents into groups using a novel clustering approach, Grouped Latent Dirichlet Allocation (gLDA). A distance metric is then learned for each cluster using the large margin nearest neighbor (LMNN) algorithm, and finally documents are ranked for a given new, unknown document using the learnt distance metrics. Experimental results on various datasets demonstrate that our clustering method (gLDA with side constraints) performs better than other clustering methods and that the overall approach outperforms other near-duplicate detection algorithms.
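The sketch below illustrates, under stated assumptions, the per-cluster metric-learning and ranking stages described above. It is not the thesis's implementation: K-Means stands in for gLDA, scikit-learn's NeighborhoodComponentsAnalysis stands in for LMNN, and the features and side labels are synthetic placeholders.

# Minimal sketch of the ranking stage: cluster documents on topic-like features, learn a
# per-cluster metric, then rank candidates for a query by distance in the learned space.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NeighborhoodComponentsAnalysis

rng = np.random.default_rng(0)
X = rng.random((60, 10))                            # stand-in doc-topic features
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

metrics = {}
for c in np.unique(clusters):
    Xc = X[clusters == c]
    yc = (Xc[:, 0] > Xc[:, 0].mean()).astype(int)   # placeholder side labels per cluster
    metrics[c] = NeighborhoodComponentsAnalysis(random_state=0).fit(Xc, yc)

def rank_near_duplicates(query, cluster_id, top_k=5):
    # Project the query and the cluster members with the cluster's learned metric, rank by distance.
    nca = metrics[cluster_id]
    members = X[clusters == cluster_id]
    d = np.linalg.norm(nca.transform(members) - nca.transform(query[None, :]), axis=1)
    return np.argsort(d)[:top_k]

print(rank_near_duplicates(X[0], clusters[0]))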
|
43 |
Représentations robustes de documents bruités dans des espaces homogènes / Robust representation of noisy documents in homogeneous spaces
Morchid, Mohamed 25 November 2014 (has links)
In the Information Retrieval field, documents are usually considered as "bags-of-words". This model does not take into account the temporal structure of the document and is sensitive to noises which can alter its lexical form. These noises can be produced by different sources: uncontrolled form of documents in micro-blogging platforms, automatic transcription of speech documents which is error-prone, lexical and grammatical variabilities in Web forums... The work presented in this thesis addresses issues related to document representations from noisy sources. The thesis consists of three parts in which different representations of content are proposed. The first one compares a classical representation based on term frequency to a higher-level representation based on a topic space. The abstraction of the document content allows us to limit the alteration of the noisy document by representing its content with a set of high-level features. Our experiments confirm that mapping a noisy document into a topic space allows us to improve the results obtained during different information retrieval tasks compared to a classical approach based on term frequency. The major problem with such a high-level representation is that it is based on a topic space whose parameters are chosen empirically. The second part presents a novel representation based on multiple topic spaces that allows us to solve three main problems: the closeness of the subjects discussed in the document, the tricky choice of the "right" values of the topic space parameters, and the robustness of the topic-based representation. Based on the idea that a single representation of the contents cannot capture all the relevant information, we propose to increase the number of views on a single document. This multiplication of views generates "artificial" observations that contain fragments of useful information. A first experiment validated this multi-view approach to represent noisy texts. However, it has the disadvantage of being very large and redundant and of containing additional variability associated with the diversity of views. In a second step, we propose a method based on factor analysis to compact the different views and to obtain a new robust representation of low dimension which contains only the informative part of the document while the noisy variabilities are compensated. During a dialogue classification task, the compression process confirmed that this compact representation allows us to improve the robustness of noisy document representation. Nonetheless, during the learning process of topic spaces, the document is considered as a "bag-of-words" while many studies have shown that the word position in a document is useful. A representation which takes into account the temporal structure of the document is proposed in the third part. This representation is based on the hyper-complex numbers of dimension four named quaternions. Our experiments on a classification task have shown the effectiveness of the proposed approach compared to a conventional "bag-of-words" representation.
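A minimal sketch of the multi-view idea from the second part follows: several topic spaces of different sizes give several views of each document, and factor analysis compacts the concatenated views into a low-dimensional representation. The topic-space sizes, dimensions, and toy documents are illustrative assumptions, not the thesis's settings.

# Minimal sketch: multiple topic spaces as "views", compacted by factor analysis.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, FactorAnalysis

docs = ["call me maybe l8r", "pls check the url http bit ly x", "meeting moved to 3pm",
        "gr8 thx see u", "server down again restart it", "lunch at noon anyone"]

X = CountVectorizer().fit_transform(docs)

# Build multiple views: one doc-topic distribution per topic-space size.
views = [LatentDirichletAllocation(n_components=k, random_state=k).fit_transform(X)
         for k in (2, 3, 4)]
multi_view = np.hstack(views)                  # large, redundant representation

# Factor analysis keeps a compact latent representation of the concatenated views.
compact = FactorAnalysis(n_components=2, random_state=0).fit_transform(multi_view)
print(multi_view.shape, "->", compact.shape)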
|
44 |
Text mining Twitter social media for Covid-19: Comparing latent semantic analysis and latent Dirichlet allocation
Sheikha, Hassan January 2020 (has links)
In this thesis, Twitter social media data is mined for information about the Covid-19 outbreak during the month of March, starting from the 3rd and ending on the 31st. 100,000 tweets were collected from Harvard's open-source data and recreated using Hydrate. This data is analyzed further using different Natural Language Processing (NLP) methodologies, such as term frequency-inverse document frequency (TF-IDF), lemmatizing, tokenizing, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Furthermore, the results of the LSA and LDA algorithms are dimensionally reduced data that will be clustered using the clustering algorithms HDBSCAN and K-Means for later comparison. Different methodologies are used to determine the optimal parameters for the algorithms. This is all done in the Python programming language, as there are libraries supporting this research, the most important being scikit-learn. The frequent words of each cluster will then be displayed and compared with factual data regarding the outbreak to discover if there are any correlations. The factual data is collected by the World Health Organization (WHO) and is visualized in graphs on ourworldindata.org. Correlations with the results are also looked for in news articles to find any significant moments that could have affected the top words in the clustered data. The news articles with good timelines used for correlating incidents are those of NBC News and The New York Times. The results show no direct correlations with the data reported by WHO; however, looking into the timelines reported by news sources, some correlation can be seen with the clustered data. Also, the combination of LDA and HDBSCAN yielded the most desirable results in comparison to the other combinations of dimension reduction and clustering. This was largely due to the use of GridSearchCV on LDA to determine the ideal parameters for the LDA models on each dataset, as well as how well HDBSCAN clusters its data in comparison to K-Means.
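The following sketch assembles the pipeline described above with scikit-learn and the hdbscan package: TF-IDF features, LSA and LDA for dimensionality reduction (with GridSearchCV tuning the LDA), then K-Means and HDBSCAN clustering. The placeholder tweets and parameter grids are assumptions for illustration only.

# Minimal sketch of the tweet-clustering pipeline: TF-IDF, LSA/LDA reduction, clustering.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans
import hdbscan

tweets = ["stay home stay safe", "new covid cases reported today", "wash your hands",
          "lockdown extended again", "vaccine trials started", "hospitals overwhelmed"] * 20

# LSA branch: TF-IDF then truncated SVD.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(tweets)
lsa_vectors = TruncatedSVD(n_components=5, random_state=0).fit_transform(tfidf)

# LDA branch: raw counts, GridSearchCV picks parameters by LDA's own likelihood score.
counts = CountVectorizer(stop_words="english").fit_transform(tweets)
search = GridSearchCV(LatentDirichletAllocation(random_state=0),
                      {"n_components": [3, 5], "learning_decay": [0.5, 0.7]}, cv=3)
lda_vectors = search.fit(counts).best_estimator_.transform(counts)

print("KMeans on LSA:", KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(lsa_vectors)[:6])
print("HDBSCAN on LDA:", hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(lda_vectors)[:6])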
|
45 |
Dynamic stochastic block models, clustering and segmentation in dynamic graphs / Modèles à bloques stochastiques dynamiques pour la classification et la segmentation des graphes dynamiques
Corneli, Marco 17 November 2017 (has links)
This thesis focuses on the statistical analysis of dynamic graphs, both defined in discrete or continuous time. We introduce a new extension of the stochastic block model (SBM) for dynamic graphs. The proposed approach, called dSBM, adopts non-homogeneous Poisson processes to model the interaction times between pairs of nodes in dynamic graphs, either in discrete or continuous time. The intensity functions of the processes only depend on the node clusters, in a block modelling perspective. Moreover, all the intensity functions share some regularity properties on hidden time intervals that need to be estimated. A recent estimation algorithm for SBM, based on the greedy maximization of an exact criterion (exact ICL), is adopted for inference and model selection in dSBM. Moreover, an exact algorithm for change point detection in time series, the "pruned exact linear time" (PELT) method, is extended to deal with dynamic graph data modelled via dSBM. The approach we propose can be used for change point analysis in graph data. Finally, a further extension of dSBM is developed to analyse dynamic networks with textual edges (like social networks, for instance). In this context, the graph edges are associated with documents exchanged between the corresponding vertices. The textual content of the documents can provide additional information about the dynamic graph topological structure. The new model we propose is called the "dynamic stochastic topic block model" (dSTBM). Graphs are mathematical structures very suitable to model interactions between objects or actors of interest. Several real networks such as communication networks, financial transaction networks, mobile telephone networks and social networks (Facebook, Linkedin, etc.) can be modelled via graphs. When observing a network, the time variable comes into play in two different ways: we can study the time dates at which the interactions occur and/or the interaction time spans.
This thesis only focuses on the first time dimension and each interaction is assumed to be instantaneous, for simplicity. Hence, the network evolution is given by the interaction time dates only. In this framework, graphs can be used in two different ways to model networks. Discrete time […] Continuous time […]. In this thesis both these perspectives are adopted, alternatively. We consider new unsupervised methods to cluster the vertices of a graph into groups of homogeneous connection profiles. In this manuscript, the node groups are assumed to be time invariant to avoid possible identifiability issues. Moreover, the approaches that we propose aim to detect structural changes in the way the node clusters interact with each other. The building block of this thesis is the stochastic block model (SBM), a probabilistic approach initially used in social sciences. The standard SBM assumes that the nodes of a graph belong to hidden (disjoint) clusters and that the probability of observing an edge between two nodes only depends on their clusters. Since no further assumption is made on the connection probabilities, SBM is a very flexible model able to detect different network topologies (hubs, stars, communities, etc.).
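A minimal sketch of the change-point idea behind dSBM, under simplifying assumptions: interactions between two node clusters follow a Poisson process whose intensity jumps at a hidden time, and PELT (here via the ruptures package) recovers the segmentation from binned interaction counts. Intensities, bin counts, and the penalty are illustrative choices, not values from the thesis.

# Minimal sketch: piecewise-constant Poisson interaction counts segmented with PELT.
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(0)
# Hourly interaction counts between cluster A and cluster B: intensity 2, then 8 after t=100.
counts = np.concatenate([rng.poisson(2.0, 100), rng.poisson(8.0, 100)]).astype(float)

algo = rpt.Pelt(model="l2", min_size=10).fit(counts)
breakpoints = algo.predict(pen=20)   # last element is the series length by convention
print(breakpoints)                   # expected to include a break near index 100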
|
46 |
Neural probabilistic topic modeling of short and messy text / Neuronprobabilistisk ämnesmodellering av kort och stökig text
Harrysson, Mattias January 2016 (has links)
Exploring massive amounts of user-generated data with topics posits a new way to find useful information. The topics are assumed to be "hidden" and must be "uncovered" by statistical methods such as topic modeling. However, user-generated data is typically short and messy, e.g. informal chat conversations, heavy use of slang words, and "noise" which could be URLs or other forms of pseudo-text. This type of data is difficult to process for most natural language processing methods, including topic modeling. This thesis attempts to find the approach that objectively gives the better topics from short and messy text in a comparative study. The compared approaches are latent Dirichlet allocation (LDA), Re-organized LDA (RO-LDA), Gaussian Mixture Model (GMM) with distributed representation of words, and a new approach based on previous work named Neural Probabilistic Topic Modeling (NPTM). It could only be concluded that NPTM has a tendency to achieve better topics on short and messy text than LDA and RO-LDA. GMM, on the other hand, could not produce any meaningful results at all. The results are less conclusive since NPTM suffers from long running times, which prevented enough samples from being obtained for a statistical test.
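The sketch below illustrates one of the compared baselines, "GMM with distributed representation of words", under stated assumptions: word vectors trained on a tiny placeholder corpus are averaged per document and clustered with a Gaussian mixture. Parameter names assume gensim 4.x; none of this reproduces the thesis's actual setup.

# Minimal sketch: averaged word embeddings per short document, then a Gaussian mixture.
import numpy as np
from gensim.models import Word2Vec
from sklearn.mixture import GaussianMixture

docs = ["gr8 party l8r??", "omg the server is down", "lol see u at lunch",
        "restart the api plz", "brb grabbing coffee", "deploy failed again :("]
tokenized = [d.lower().split() for d in docs]

w2v = Word2Vec(sentences=tokenized, vector_size=10, window=3, min_count=1, seed=0)
doc_vectors = np.array([np.mean([w2v.wv[t] for t in toks], axis=0) for toks in tokenized])

gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(doc_vectors)
print(gmm.predict(doc_vectors))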
|
47 |
Parallel Algorithms for Machine Learning
Moon, Gordon Euhyun 02 October 2019 (has links)
No description available.
|
48 |
Dynamic Probabilistic Risk Assessment of Nuclear Power Generation Stations
Elsefy, Mohamed HM January 2021 (has links)
Risk assessment is essential for nuclear power plants (NPPs) due to the complex dynamic nature of such systems-of-systems, as well as the devastating impacts of nuclear accidents on the environment, public health, and economy. Lessons learned from the Fukushima nuclear accident demonstrated the importance of enhancing current risk assessment methodologies and developing efficient early warning decision support tools. Static probabilistic risk assessment (PRA) techniques (e.g., event and fault tree analysis) have been extensively adopted in nuclear applications to ensure NPPs comply with safety regulations. However, numerous studies have highlighted the limitations of static PRA methods, such as the lack of consideration of the dynamic hardware/software/operator interactions inside the NPP and of the timing/sequence of events. In response, several dynamic probabilistic risk assessment (DPRA) methodologies have been developed and continuously evolved over the past four decades to overcome the limitations of static PRA methods. DPRA presents a comprehensive approach to assess the risks associated with complex, dynamic systems. However, current DPRA approaches are faced with challenges associated with the intra/interdependence within/between different NPP complex systems and the massive amount of data that needs to be analyzed and rapidly acted upon. In response to these limitations of previous work, the main objective of this dissertation is to develop a physics-based DPRA platform and an intelligent data-driven prediction tool for NPP safety enhancement under normal and abnormal operating conditions. The results of this dissertation demonstrate that the developed DPRA platform is capable of simulating the dynamic interaction between different NPP systems and estimating the temporal probability of core damage under different transients, with significant analysis advantages from both the computational time and data storage perspectives. The developed platform can also explicitly account for the effects of uncertainties in the NPP's physical parameters and operating conditions on the plant's response and the probability of its core damage. Furthermore, an intelligent decision support tool, developed based on artificial neural networks (ANN), can significantly improve the safety of NPPs by providing the plant operators with fast and accurate predictions that are specific to the NPP in question. Such rapid prediction will minimize the need to resort to idealized physics-based simulators to predict the underlying complex physical interactions. Moving forward, the developed ANN model can be trained on plant operational data, plant operating experience databases, and data from rare-event simulations to account for, for example, plant ageing with time, operational transients, and rare events in predicting the plant behavior. Such an intelligent tool can be key for NPP operators and managers to take rapid and reliable actions under abnormal conditions. / Thesis / Doctor of Philosophy (PhD)
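As a rough illustration of the ANN surrogate idea (not the dissertation's model or data), the sketch below trains a small neural network to map a few hypothetical, normalized plant parameters to a core-damage probability produced by a placeholder function standing in for slow physics-based DPRA runs.

# Minimal sketch: a fast ANN surrogate for a slow physics-based DPRA calculation.
# All features, ranges, and the target function are synthetic placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([rng.uniform(0, 1, n),      # normalized break size (assumed feature)
                     rng.uniform(0, 1, n),      # normalized coolant temperature (assumed)
                     rng.uniform(0, 1, n)])     # normalized time since scram (assumed)
# Placeholder "DPRA output": core-damage probability rising with all three factors.
y = 1.0 / (1.0 + np.exp(-(4 * X[:, 0] + 2 * X[:, 1] + 3 * X[:, 2] - 5)))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
surrogate = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
surrogate.fit(X_train, y_train)
print("R^2 on held-out simulations:", round(surrogate.score(X_test, y_test), 3))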
|
49 |
A Classification Tool for Predictive Data Analysis in Healthcare
Victors, Mason Lemoyne 07 March 2013 (links) (PDF)
Hidden Markov Models (HMMs) have seen widespread use in a variety of applications ranging from speech recognition to gene prediction. While developed over forty years ago, they remain a standard tool for sequential data analysis. More recently, Latent Dirichlet Allocation (LDA) was developed and soon gained widespread popularity as a powerful topic analysis tool for text corpora. We thoroughly develop LDA and a generalization of HMMs and demonstrate the conjunctive use of both methods in predictive data analysis for health care problems. While these two tools (LDA and HMM) have been used in conjunction previously, we use LDA in a new way to reduce the dimensionality involved in the training of HMMs. With both LDA and our extension of HMM, we train classifiers to predict development of Chronic Kidney Disease (CKD) in the near future.
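A minimal sketch of the LDA-then-HMM idea, with invented placeholder data: each patient visit is a "document" of medical codes, LDA reduces it to a topic mixture, and a Gaussian HMM (via hmmlearn) is fitted over the visit sequences. Codes, patients, and model sizes are assumptions for illustration.

# Minimal sketch: LDA topic mixtures per visit as low-dimensional inputs to an HMM.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from hmmlearn import hmm

# Each visit is a "document" of medical codes; two synthetic patients, four visits each.
visits = ["dx250 rx01 lab7", "dx250 lab7 lab8", "dx585 rx02 lab9", "dx585 lab9 lab9",
          "dx401 rx03 lab1", "dx401 lab1 lab2", "dx401 rx03 lab2", "dx250 rx01 lab7"]
lengths = [4, 4]                                   # visits per patient, in order

counts = CountVectorizer().fit_transform(visits)
topics = LatentDirichletAllocation(n_components=3, random_state=0).fit_transform(counts)

# HMM over the low-dimensional topic trajectories instead of raw code counts.
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100, random_state=0)
model.fit(topics, lengths)
print("Hidden-state path for patient 1:", model.predict(topics[:4]))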
|
50 |
Cliff Walls: Threats to Validity in Empirical Studies of Open Source Forges
Pratt, Landon James 27 February 2013 (links) (PDF)
Artifact-based research provides a mechanism whereby researchers may study the creation of software yet avoid many of the difficulties of direct observation and experimentation. Open source software forges are of great value to the software researcher, because they expose many of the artifacts of software development. However, many challenges affect the quality of artifact-based studies, especially those studies examining software evolution. This thesis addresses one of these threats: the presence of very large commits, which we refer to as "Cliff Walls." Cliff walls are a threat to studies of software evolution because they do not appear to represent incremental development. In this thesis we demonstrate the existence of cliff walls in open source software projects and discuss the threats they present. We also seek to identify key causes of these monolithic commits, and begin to explore ways that researchers can mitigate the threats of cliff walls.
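A small sketch of how one might flag cliff walls in practice, assuming a local git repository: sum the lines added per commit from git log --numstat and report commits above a size threshold. The threshold and repository path are illustrative, not values taken from the thesis.

# Minimal sketch: flag very large ("cliff wall") commits by lines added per commit.
import subprocess
from collections import defaultdict

def find_cliff_walls(repo_path=".", threshold=10000):
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--pretty=format:@%H"],
        capture_output=True, text=True, check=True).stdout
    added = defaultdict(int)
    current = None
    for line in out.splitlines():
        if line.startswith("@"):
            current = line[1:]                       # new commit hash
        elif line.strip() and current:
            add, _deleted, _path = line.split("\t", 2)
            if add.isdigit():                        # binary files show "-"
                added[current] += int(add)
    return {sha: n for sha, n in added.items() if n >= threshold}

print(find_cliff_walls())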
|