21

Ukhetho : A Text Mining Study Of The South African General Elections

Moodley, Avashlin January 2019 (has links)
The elections in South Africa are contested by multiple political parties appealing to a diverse population from a variety of socioeconomic backgrounds. As a result, a rich source of discourse is created to inform voters about election-related content. Two common sources of information that help voters with their decision are news articles and tweets; this study aims to understand the discourse in these two sources using natural language processing. Topic modelling techniques, Latent Dirichlet Allocation and Non-negative Matrix Factorization, are applied to digest the breadth of information collected about the elections into topics. The topics produced are subjected to further analysis that uncovers similarities between topics, links topics to dates and events, and provides a summary of the discourse that existed prior to the South African general elections. The primary focus is on the 2019 elections; however, election-related articles from 2014 and 2019 were also compared to understand how the discourse has changed. / Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2019. / Computer Science / MIT (Big Data Science) / Unrestricted
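As a rough illustration of the two techniques this study applies, the following sketch fits both LDA and NMF with scikit-learn on a toy stand-in corpus; the documents, parameter choices, and helper function are illustrative assumptions, not the author's pipeline.

```python
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-in corpus; the study used scraped election news articles and tweets.
docs = [
    "party manifesto promises jobs and economic growth",
    "voters queue at polling stations across the country",
    "load shedding and electricity supply dominate the debate",
    "coalition talks follow the announcement of election results",
]

counts = CountVectorizer(stop_words="english")
tfidf = TfidfVectorizer(stop_words="english")
X_counts = counts.fit_transform(docs)
X_tfidf = tfidf.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)  # LDA on raw counts
nmf = NMF(n_components=2, init="nndsvd", random_state=0)         # NMF on TF-IDF weights
lda_doc_topics = lda.fit_transform(X_counts)
nmf_doc_topics = nmf.fit_transform(X_tfidf)

def top_words(model, names, n=5):
    # n highest-weighted terms for each topic
    return [[names[i] for i in comp.argsort()[::-1][:n]] for comp in model.components_]

print(top_words(lda, counts.get_feature_names_out()))
print(top_words(nmf, tfidf.get_feature_names_out()))
```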
22

OLLDA: Dynamic and Scalable Topic Modelling for Twitter : AN ONLINE SUPERVISED LATENT DIRICHLET ALLOCATION ALGORITHM

Jaradat, Shatha January 2015 (has links)
Providing high-quality topic inference in today's large and dynamic corpora, such as Twitter, is a challenging task. It is especially challenging given that content in this environment consists of short texts with many abbreviations. This project proposes an improvement to a popular online topic modelling algorithm for Latent Dirichlet Allocation (LDA) by incorporating supervision to make it suitable for the Twitter context. The improvement is motivated by the need for a single algorithm that achieves both objectives: analyzing huge numbers of documents, including new documents arriving in a stream, while at the same time achieving high-quality topic detection in special-case environments such as Twitter. The proposed algorithm is a combination of an online algorithm for LDA and a supervised variant of LDA, labeled LDA. The performance and quality of the proposed algorithm are compared with those of the two constituent algorithms. The results demonstrate that the proposed algorithm achieves better performance and quality than the supervised variant of LDA, and better quality than the online algorithm. These improvements make the algorithm an attractive option for dynamic environments like Twitter. An environment for analyzing and labelling data was designed to prepare the dataset before running the experiments. Possible application areas for the proposed algorithm are tweet recommendation and trend detection.
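The online half of the proposed combination can be sketched with Gensim's stock online LDA, which folds new mini-batches into an already-fitted model; the labeled-LDA supervision has no off-the-shelf Gensim equivalent, so only the streaming update is shown here, on invented toy batches.

```python
from gensim import corpora
from gensim.models import LdaModel

# Two mini-batches standing in for a tweet stream; real input would be tokenized tweets.
batch1 = [["rate", "hike", "inflation"], ["match", "goal", "league"]]
batch2 = [["inflation", "prices", "economy"], ["league", "cup", "final"]]

dictionary = corpora.Dictionary(batch1 + batch2)
lda = LdaModel(corpus=[dictionary.doc2bow(d) for d in batch1],
               id2word=dictionary, num_topics=2,
               update_every=1, chunksize=2, passes=1, random_state=0)

# Online step: fold the next mini-batch into the existing model without retraining.
lda.update([dictionary.doc2bow(d) for d in batch2])
print(lda.print_topics())
```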
23

Topic propagation over time in internet security conferences : Topic modeling as a tool to investigate trends for future research / Ämnesspridning över tid inom säkerhetskonferenser med hjälp av topic modeling

Johansson, Richard, Engström Heino, Otto January 2021 (has links)
When conducting research, it is valuable to find high-ranked papers closely related to the specific research area without spending too much time reading insignificant papers. To make this process more efficient, an automated way of extracting topics from documents would be useful, and this is possible using topic modeling. Topic modeling can also be used to trace topic trends: where a topic is first mentioned and who the original author was. In this paper, over 5000 articles are scraped from four different top-ranked internet security conferences, using a web scraper built in Python. From the articles, fourteen topics are extracted using the topic modeling library Gensim and LDA Mallet, and the topics are visualized in graphs to show which topics are emerging and which are fading away over twenty years. The result of this research is that topic modeling is a powerful tool for extracting topics, and when put into a time perspective it is possible to identify topic trends, which can be explained when put into a bigger context.
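A minimal sketch of the trend analysis described here, using Gensim's native LdaModel (the thesis used the LDA Mallet wrapper available in Gensim 3.x, which later Gensim versions dropped); the papers, years, and parameters below are invented placeholders.

```python
from collections import Counter
from gensim import corpora
from gensim.models import LdaModel

# (year, tokens) pairs standing in for the scraped conference papers.
papers = [(2003, ["worm", "propagation", "network"]),
          (2012, ["android", "malware", "detection"]),
          (2020, ["phishing", "detection", "learning"]),
          (2020, ["malware", "classification", "learning"])]

dictionary = corpora.Dictionary(tokens for _, tokens in papers)
bows = [dictionary.doc2bow(tokens) for _, tokens in papers]
lda = LdaModel(bows, num_topics=2, id2word=dictionary, passes=10, random_state=0)

# Dominant topic per paper, counted per year, gives a plottable trend series.
trend = Counter()
for (year, _), bow in zip(papers, bows):
    topic = max(lda.get_document_topics(bow), key=lambda t: t[1])[0]
    trend[(year, topic)] += 1
print(sorted(trend.items()))
```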
24

Topic Modeling for Customer Insights : A Comparative Analysis of LDA and BERTopic in Categorizing Customer Calls

Axelborn, Henrik, Berggren, John January 2023 (has links)
Customer calls serve as a valuable source of feedback for financial service providers, potentially containing a wealth of unexplored insights into customer questions and concerns. However, these call data are typically unstructured and challenging to analyze effectively. This thesis project focuses on leveraging Topic Modeling techniques, a sub-field of Natural Language Processing, to extract meaningful customer insights from recorded customer calls to a European financial service provider. The objective of the study is to compare two widely used Topic Modeling algorithms, Latent Dirichlet Allocation (LDA) and BERTopic, in order to categorize and analyze the content of the calls. By leveraging the power of these algorithms, the thesis aims to provide the company with a comprehensive understanding of customer needs, preferences, and concerns, ultimately facilitating more effective decision-making. Following a literature review and dataset preparation, i.e., pre-processing to ensure data quality and consistency, the two algorithms, LDA and BERTopic, are applied to extract latent topics. Their performance is then evaluated using quantitative and qualitative measures: perplexity and coherence scores, as well as the interpretability and usefulness of the topics produced. The findings contribute to knowledge on Topic Modeling for customer insights and enable the company to improve customer engagement and satisfaction and to tailor its customer strategies. The results show that LDA outperforms BERTopic in terms of topic quality and business value. Although BERTopic demonstrates slightly better quantitative performance, LDA aligns much better with human interpretation, indicating a stronger ability to capture meaningful and coherent topics within the company's customer call data.
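A hedged sketch of the two sides of such a comparison: Gensim LDA scored with c_v coherence, and BERTopic fitted on the raw documents; the toy call transcripts and parameter choices are assumptions, not the thesis's data or settings.

```python
from bertopic import BERTopic
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Toy stand-ins for transcribed customer calls; BERTopic's embedding/UMAP
# stage needs more than a handful of documents, hence the repetition.
base = ["how do i reset my card pin", "card blocked after travelling abroad",
        "question about my mortgage interest rate", "mortgage amortization schedule help"]
docs = base * 50
tokens = [d.split() for d in base]

# LDA side: fit on bag-of-words, then score topic coherence (c_v).
dictionary = Dictionary(tokens)
lda = LdaModel([dictionary.doc2bow(t) for t in tokens], num_topics=2,
               id2word=dictionary, passes=10, random_state=0)
coherence = CoherenceModel(model=lda, texts=tokens, dictionary=dictionary,
                           coherence="c_v").get_coherence()

# BERTopic side: embeds documents, clusters the embeddings, extracts topic words.
topic_model = BERTopic(min_topic_size=10)
topics, probs = topic_model.fit_transform(docs)
print(coherence, topic_model.get_topic_info())
```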
25

Views or news? : Exploring the interplay of production and consumption of political news content on YouTube

Darin, Jasper January 2023 (has links)
YouTube is the second largest social media platform in the world, with a multitude of popular channels which combine politicised commentary with news reporting. The platform provides direct accessibility to data, which makes it possible for commentators to adjust their content to reach wider audiences; taken to an extreme, this could mean that creators pick the topics which are the most financially beneficial or lead to fame. If this were the case, it would highlight populist newsmaking and the mechanisms behind it. To investigate the production-consumption interaction, data from the 10 most popular channels for 2021 was collected. Using latent Dirichlet allocation and preferential attachment analysis, the effect of cumulative advantage, and whether topic choice was driven by views, were measured. A positive feedback loop, where prevalent topics become more prevalent, was found in all but two channels, but picking topics which generated more views was only present for one channel. The findings imply that the top political news commentators over a year have a set of topics which they return to to a high degree, but choosing the topics which simply are the most popular at the time is not a general feature.
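One hedged way to operationalize the cumulative-advantage check described here is to ask how often a channel's next upload reuses its most-used topic so far; the topic sequence below is invented, and the thesis's preferential attachment analysis may differ in detail.

```python
from collections import Counter

# Dominant LDA topic of each successive upload on one channel (illustrative).
uploads = [0, 1, 0, 0, 2, 0, 1, 0, 0, 0]

# Crude cumulative-advantage check: how often does the next upload reuse the
# channel's most-used topic so far? Rates well above chance suggest a
# positive feedback loop where prevalent topics become more prevalent.
counts, hits, trials = Counter(), 0, 0
for topic in uploads:
    if counts:
        hits += topic == counts.most_common(1)[0][0]
        trials += 1
    counts[topic] += 1
print(hits / trials)
```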
26

A Latent Dirichlet Allocation/N-gram Composite Language Model

Kulhanek, Raymond Daniel 08 November 2013 (has links)
No description available.
27

Latent Dirichlet Allocation for the Detection of Multi-Stage Attacks

Lefoane, Moemedi, Ghafir, Ibrahim, Kabir, Sohag, Awan, Irfan U. 19 December 2023 (has links)
The rapid shift toward, and increase in, remote access to organisation resources have led to a significant increase in the number of attack vectors and attack surfaces, which in turn has motivated the development of newer and more sophisticated cyber-attacks, including Multi-Stage Attacks (MSAs). In MSAs, the attack is executed through several stages, and classifying malicious traffic into stages to obtain more information about the attack life-cycle becomes a challenge. This paper proposes a malicious traffic clustering approach based on Latent Dirichlet Allocation (LDA), a topic modelling approach used in natural language processing to address similar problems. The proposed approach is unsupervised and will therefore be beneficial in scenarios where traffic data are unlabeled and analysis needs to be performed. It uncovers intrinsic contexts that relate to different categories of attack stages in MSAs. These are vital insights needed across different cybersecurity teams, such as Incident Response (IR) within the Security Operations Center (SOC), and they could have a positive impact in ensuring that attacks are detected at early stages of MSAs. For IR in particular, these insights help in understanding attack behavioural patterns and lead to reduced recovery time following an incident. The proposed approach is evaluated on a publicly available MSA dataset. The performance results are promising, as evidenced by over 99% accuracy in the identified malicious traffic clusters.
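A minimal sketch of the clustering idea, assuming flows are discretized into token "documents" so that a topic model can be applied; the feature encoding below is an invented illustration, not the paper's feature engineering.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Each flow becomes a "document" of discretized feature tokens
# (illustrative encoding; the paper's feature engineering differs).
flows = ["dst_port_80 small_payload many_conns",        # scanning-like behaviour
         "dst_port_80 small_payload many_conns",
         "dst_port_443 beacon_interval_60s low_volume",  # command-and-control-like
         "dst_port_443 beacon_interval_60s low_volume"]

X = CountVectorizer().fit_transform(flows)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

# The dominant topic per flow acts as its attack-stage cluster label.
stages = doc_topics.argmax(axis=1)
print(stages)
```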
28

Measuring the information content of Riksbank meeting minutes

Fröjd, Sofia January 2019 (has links)
As the amount of information available on the internet has increased sharply in recent years, methods for measuring and comparing text-based information are gaining popularity on financial markets. Text mining and natural language processing have become important tools for classifying large collections of texts or documents. One field of application is topic modelling of the minutes from central banks' monetary policy meetings, which tend to be about topics such as "inflation", "economic growth" and "rates". Sweden's central bank, the Riksbank, holds six monetary policy meetings a year at which the members of the Executive Board decide on the new repo rate. Two weeks later, the minutes of the meeting are published, giving the market information regarding future monetary policy in the form of text. This information has been unknown to the market before release and is therefore potentially market-sensitive. Using Latent Dirichlet Allocation (LDA), an algorithm for uncovering latent topics in documents, the topics in the meeting minutes should be possible to identify and quantify. In this project, eight topics were found, concerning, among other things, inflation, rates, household debt and economic development. An important factor in the analysis of central bank communication is the underlying tone of the discussions. It is common to classify central bankers as hawkish or dovish: hawkish members of the board tend to favour tightening monetary policy and rate hikes, while more dovish members advocate a more expansive monetary policy and rate cuts. Thus, analysing the tone of the minutes can give an indication of future moves in the monetary policy rate. The purpose of this project is to provide a fast method for analysing the minutes of the Riksbank's monetary policy meetings. The project is divided into two parts. First, an LDA model was trained to identify the topics in the minutes, which was then used to compare the content of two consecutive sets of meeting minutes. Next, the sentiment was measured as a degree of hawkishness or dovishness. This was done by categorising each sentence in terms of its content and then counting words with hawkish or dovish sentiment. The resulting net score assigns larger values to more hawkish minutes and was shown to follow the repo rate path well. At the time of the minutes' release the new repo rate is already known, but the net score does give an indication of the stance of the board.
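The net-score construction described here can be sketched as a simple dictionary count per sentence; the hawkish/dovish wordlists below are invented placeholders, not the thesis's dictionaries.

```python
# Placeholder wordlists; the thesis's actual hawkish/dovish dictionaries differ.
HAWKISH = {"tighten", "hike", "raise", "inflationary", "overheating"}
DOVISH = {"cut", "ease", "stimulus", "accommodative", "weak"}

def net_score(minutes_text):
    # Count hawkish vs dovish terms sentence by sentence; positive = hawkish tone.
    hawk = dove = 0
    for sentence in minutes_text.split("."):
        words = sentence.lower().split()
        hawk += sum(w in HAWKISH for w in words)
        dove += sum(w in DOVISH for w in words)
    total = hawk + dove
    return (hawk - dove) / total if total else 0.0

print(net_score("Several members argued to raise the rate and tighten policy. "
                "One member favoured a cut."))
```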
29

Visualização em multirresolução do fluxo de tópicos em coleções de texto / Multiresolution visualization of topic flow in text collections

Schneider, Bruno 21 March 2014 (has links)
The combined use of algorithms for topic discovery in document collections with topic-flow visualization techniques allows the exploration of thematic patterns in large corpora, revealing those patterns through compact visual representations. This research investigated the requirements for viewing data about the thematic composition of documents obtained through topic modeling, which is sparse and multi-attribute, at different levels of detail, comparing a purpose-built visualization technique with an open-source data visualization library. Concerning the topic-flow visualization problem studied, conflicting display requirements were observed at different data resolutions, which led to a detailed investigation of ways of manipulating and displaying the data. The hypothesis put forward was that the integrated use of more than one visualization technique, chosen according to the resolution of the data, expands the possibilities for exploring the object under study beyond what would be obtained using a single technique. Demonstrating the limits of these techniques as a function of the resolution of data exploration is the main contribution of this work, providing groundwork for the development of new applications.
30

Characterisation of a developer’s experience fields using topic modelling

Déhaye, Vincent January 2020 (has links)
Finding the most relevant candidate for a position is a ubiquitous challenge for organisations. It can also be arduous for a candidate to convey on a concise resume everything they have experience with. Because the candidate usually has to select which experiences to expose and filter out others, experiences they do in fact have might not be detected by the person carrying out the search. In the field of software engineering, developing one's experience usually leaves traces behind: the code one has produced. This project explores approaches to tackling the screening challenge with an automated way of extracting experience directly from code, by defining common lexical patterns in code for different experience fields using topic modeling. Two different techniques were compared. On one hand, Latent Dirichlet Allocation (LDA) is a generative statistical model which has proven to yield good results in topic modeling. On the other hand, Non-Negative Matrix Factorization (NMF) is simply a factorization of a matrix representing the code corpus as word counts per piece of code. The code gathered consisted of 30 random repositories from the collaborators of the open-source Ruby-on-Rails project on GitHub, to which common natural language processing transformation steps were then applied. The results of both techniques were compared using perplexity for LDA, reconstruction error for NMF, and topic coherence for both. The first two represent how well the data can be represented by the topics produced, while the latter estimates how well the elements of a topic hang and fit together, and can reflect human understandability and interpretability. Given that we did not have any similar work to benchmark against, the performance of the values obtained is hard to assess scientifically. However, the method seems promising, as we would have been rather confident in assigning labels to 10 of the generated topics. The results imply that one could probably use natural language processing methods directly on code production in order to extend the detected fields of experience of a developer, with a finer granularity than traditional resumes and with field definitions evolving dynamically with the technology.
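A brief sketch of the evaluation contrast described here, assuming scikit-learn implementations: LDA reports perplexity on the count matrix, while NMF exposes its reconstruction error after fitting; the toy code-token documents are invented stand-ins for the preprocessed repositories.

```python
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Code tokens as documents; in the thesis these were preprocessed Ruby files.
snippets = ["def render view template partial layout",
            "has_many belongs_to validates presence model",
            "get post routes controller render json",
            "migration add_column create_table index schema"]

counts = CountVectorizer().fit_transform(snippets)
tfidf = TfidfVectorizer().fit_transform(snippets)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
nmf = NMF(n_components=2, init="nndsvd", random_state=0).fit(tfidf)

print("LDA perplexity:", lda.perplexity(counts))              # lower is better
print("NMF reconstruction error:", nmf.reconstruction_err_)   # lower is better
```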
