1

The statistics of topic modelling.

Abey, Rebecca January 2015 (has links)
This research project aims to provide a clear and concise guide to latent Dirichlet allocation, which is a form of topic modelling. The aim is to help researchers who do not have a strong background in mathematics or statistics to feel comfortable with using topic modelling in their work. In order to achieve this, the thesis provides a step-by-step explanation of how topic modelling works. A range of tools that can be used to perform a topic model analysis are also described. The first chapter explains how topic modelling, and more specifically latent Dirichlet allocation, works; it offers a very basic explanation and then provides an easy-to-follow mathematical explanation. The second chapter explains how to perform a topic model analysis; this is done through an explanation of each step used to run a topic model analysis, starting from the type of dataset through to the software packages available to use. The third chapter provides an example topic model analysis, based on the PhilPapers dataset. The final chapter provides a discussion of the highlights of each chapter and areas for further research.
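
As a minimal illustration of the kind of analysis such a guide covers, the sketch below fits an LDA model with the gensim library; gensim is only one of several packages that could be used, and the toy documents and parameter choices here are illustrative assumptions, not drawn from the thesis.

```python
# Toy LDA run with gensim: build a dictionary, a bag-of-words corpus,
# fit a two-topic model, and print the top words per topic.
from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["topic", "model", "latent", "dirichlet", "allocation"],
    ["word", "distribution", "topic", "probability"],
    ["philosophy", "ethics", "mind", "language"],
    ["epistemology", "knowledge", "belief", "mind"],
]

dictionary = corpora.Dictionary(texts)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in texts]  # bag-of-words counts per document

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=2, passes=50, random_state=0)

for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```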
2

A Gamma-Poisson topic model for short text

Mazarura, Jocelyn Rangarirai January 2020 (has links)
Most topic models are constructed under the assumption that documents follow a multinomial distribution. The Poisson distribution is an alternative distribution to describe the probability of count data. For topic modelling, the Poisson distribution describes the number of occurrences of a word in documents of fixed length. The Poisson distribution has been successfully applied in text classification, but its application to topic modelling is not well documented, specifically in the context of a generative probabilistic model. Furthermore, the few Poisson topic models in the literature are admixture models, making the assumption that a document is generated from a mixture of topics. In this study, we focus on short text. Many studies have shown that the simpler assumption of a mixture model fits short text better. With mixture models, as opposed to admixture models, the generative assumption is that a document is generated from a single topic. One topic model which makes this one-topic-per-document assumption is the Dirichlet-multinomial mixture model. The main contributions of this work are a new Gamma-Poisson mixture model (GPM), as well as a collapsed Gibbs sampler for the model. The benefit of the collapsed Gibbs sampler derivation is that the model is able to automatically select the number of topics contained in the corpus. The results show that the Gamma-Poisson mixture model performs better than the Dirichlet-multinomial mixture model at selecting the number of topics in labelled corpora. Furthermore, the Gamma-Poisson mixture produces better topic coherence scores than the Dirichlet-multinomial mixture model, thus making it a viable option for the challenging task of topic modelling of short text. The application of the GPM was then extended to a further real-world task: that of distinguishing between semantically similar and dissimilar texts. The objective was to determine whether the GPM could produce semantic representations that allow the user to determine the relevance of new, unseen documents to a corpus of interest. The challenge of addressing this problem in short text from small corpora was of key interest. Corpora of small size are not uncommon: for example, at the start of the Coronavirus pandemic, limited research was available on the topic. Handling short text is challenging not only because such text is sparse, but also because some corpora, such as chats between people, tend to be noisy. The performance of the GPM was compared to that of word2vec under these challenging conditions on labelled corpora. It was found that the GPM was able to produce better results based on accuracy, precision and recall in most cases. In addition, unlike word2vec, the GPM was shown to be applicable to unlabelled datasets, and a methodology for this was also presented. Finally, a relevance index metric was introduced. This relevance index translates the similarity distance between a corpus of interest and a test document into the probability that the test document is semantically similar to the corpus of interest. / Thesis (PhD (Mathematical Statistics))--University of Pretoria, 2020. / Statistics / PhD (Mathematical Statistics) / Unrestricted
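
To make the one-topic-per-document (mixture) assumption concrete, the toy simulation below samples documents the way a Gamma-Poisson mixture assumes they arise: each document picks a single topic, and its word counts are Poisson draws whose rates are Gamma-distributed per topic. The hyperparameters and corpus sizes are assumptions for illustration only; this is not the thesis's model or its collapsed Gibbs sampler.

```python
# Toy generative simulation of a Gamma-Poisson mixture:
# one topic per document, Poisson word counts with topic-specific Gamma rates.
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, n_docs = 3, 20, 10

# Topic-specific Poisson rates for every word, drawn from a Gamma prior.
rates = rng.gamma(shape=0.5, scale=2.0, size=(n_topics, vocab_size))
topic_weights = rng.dirichlet(np.ones(n_topics))  # mixture weights over topics

docs, labels = [], []
for _ in range(n_docs):
    z = rng.choice(n_topics, p=topic_weights)     # a single topic for this document
    counts = rng.poisson(rates[z])                # word counts for the whole document
    docs.append(counts)
    labels.append(z)

print(np.array(docs).shape, labels)
```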
3

Dynamic topic adaptation for improved contextual modelling in statistical machine translation

Hasler, Eva Cornelia January 2015 (has links)
In recent years there has been an increased interest in domain adaptation techniques for statistical machine translation (SMT) to deal with the growing amount of data from different sources. Topic modelling techniques applied to SMT are closely related to the field of domain adaptation but more flexible in dealing with unstructured text. Topic models can capture latent structure in texts and are therefore particularly suitable for modelling structure in between and beyond corpus boundaries, which are often arbitrary. In this thesis, the main focus is on dynamic translation model adaptation to texts of unknown origin, which is a typical scenario for an online MT engine translating web documents. We introduce a new bilingual topic model for SMT that takes the entire document context into account and for the first time directly estimates topic-dependent phrase translation probabilities in a Bayesian fashion. We demonstrate our model’s ability to improve over several domain adaptation baselines and further provide evidence for the advantages of bilingual topic modelling for SMT over the more common monolingual topic modelling. We also show improved performance when deriving further adapted translation features from the same model which measure different aspects of topical relatedness. We introduce another new topic model for SMT which exploits the distributional nature of phrase pair meaning by modelling topic distributions over phrase pairs using their distributional profiles. Using this model, we explore combinations of local and global contextual information and demonstrate the usefulness of different levels of contextual information, which had not been previously examined for SMT. We also show that combining this model with a topic model trained at the document-level further improves performance. Our dynamic topic adaptation approach performs competitively in comparison with two supervised domain-adapted systems. Finally, we shed light on the relationship between domain adaptation and topic adaptation and propose to combine multi-domain adaptation and topic adaptation in a framework that entails automatic prediction of domain labels at the document level. We show that while each technique provides complementary benefits to the overall performance, there is an amount of overlap between domain and topic adaptation. This can be exploited to build systems that require less adaptation effort at runtime.
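
One common way to write a topic-dependent phrase translation probability, given here only as a hedged illustration of the idea (the notation is ours and the thesis's Bayesian estimator may differ), is to mix phrase translation distributions over the document's topic distribution:

```latex
% Illustrative decomposition, not necessarily the thesis's exact model:
% \bar{e}, \bar{f} are target/source phrases, d the document, z one of K topics.
p(\bar{e} \mid \bar{f}, d) \;=\; \sum_{z=1}^{K} p(\bar{e} \mid \bar{f}, z)\, p(z \mid d)
```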
4

Topical Structure in Long Informal Documents

Kazantseva, Anna January 2014 (has links)
This dissertation describes a research project concerned with establishing the topical structure of long informal documents. In this research, we place special emphasis on literary data, but also work with speech transcripts and several other types of data. It has long been acknowledged that discourse is more than a sequence of sentences but, for the purposes of many Natural Language Processing tasks, it is often modelled exactly in that way. In this dissertation, we propose a practical approach to modelling discourse structure, with an emphasis on it being computationally feasible and easily applicable. Instead of following one of the many linguistic theories of discourse structure, we attempt to model the structure of a document as a tree of topical segments. Each segment encapsulates a span that concentrates on a particular topic at a certain level of granularity. Each span can be further sub-segmented based on finer fluctuations of topic. The lowest (most refined) level of segmentation is individual paragraphs. In our model, each topical segment is described by a segment centre -- a sentence or a paragraph that best captures the contents of the segment. In this manner, the segmenter effectively builds an extractive hierarchical outline of the document. In order to achieve these goals, we use the framework of factor graphs and modify a recent clustering algorithm, Affinity Propagation, to perform hierarchical segmentation instead of clustering. While it is far from being a solved problem, topical text segmentation is not uncharted territory. The methods developed so far, however, perform least well where they are most needed: on documents that lack rigid formal structure, such as speech transcripts, personal correspondence or literature. The model described in this dissertation is geared towards dealing with just such types of documents. In order to study how people create similar models of literary data, we built two corpora of topical segmentations, one flat and one hierarchical. Each document in these corpora is annotated for topical structure by 3-6 people. The corpora, the model of hierarchical segmentation and software for segmentation are the main contributions of this work.
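
As a rough illustration of the exemplar-based idea, the sketch below runs standard (flat) Affinity Propagation over paragraph similarities so that exemplar paragraphs play the role of segment centres; it is not the modified hierarchical segmenter developed in the dissertation, and the toy paragraphs and parameters are assumptions.

```python
# Flat Affinity Propagation over TF-IDF paragraph similarities:
# exemplar paragraphs stand in for topical segment centres.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AffinityPropagation

paragraphs = [
    "The ship left the harbour at dawn.",
    "By noon the crew had sighted land.",
    "The coastline was rocky and unfamiliar.",
    "Dinner that evening was a quiet affair.",
    "The captain retired early, troubled by the charts.",
    "He dreamt of storms and broken masts.",
]

tfidf = TfidfVectorizer().fit_transform(paragraphs)
similarity = cosine_similarity(tfidf)          # paragraph-by-paragraph affinities

ap = AffinityPropagation(affinity="precomputed", damping=0.9, random_state=0)
labels = ap.fit_predict(similarity)

for idx in ap.cluster_centers_indices_:
    print("centre:", paragraphs[idx])
print("assignments:", labels)
```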
5

Rekonstrukce identit ve fake news: Srovnání dvou webových stránek s obsahem fake news / Reconstructing Identities in Fake News: Comparing two Fake News Websites

Ely, Nicole January 2020 (has links)
Since the 2016 US presidential campaign of Donald Trump, the term "fake news" has permeated mainstream discourse. The proliferation of disinformation and false narratives on social media platforms has caused concern in security circles in both the United States and the European Union. Combining latent Dirichlet allocation, a machine learning method for text mining, with themes on topical analysis, ideology and social identity drawn from Critical Discourse theory, this thesis examines the elaborate fake news environments of two well-known English-language websites: InfoWars and Sputnik News. Through the exploration of the ideologies and social representations at play in the larger thematic structure of these websites, a picture of two very different platforms emerges: one a white-dominant, somewhat isolationist counterculture mindset that promotes a racist and bigoted view of the world; the other a more subtle world order-making perspective intent on reaching people in the realm of the mundane. Keywords: fake news, Sputnik, InfoWars, topical analysis, latent Dirichlet allocation
6

Exploring Design Discussions With Semi-Supervised Topic Modelling

Lasrado, Roshan N. 11 August 2022 (has links)
Stack Overflow is a rich source of questions and answers—discussions—about software development. One topic of discussion is software design, such as the correct use of design patterns or best practices in data access. Since design is a more abstract topic in software engineering, researchers have long sought to characterize and model design knowledge. However, these approaches typically require significant expert input to contextualize the abstract design information. In this study, we explore how combining expert input with Stack Overflow might serve as an effective way to identify design topics. Being able to identify and classify this design knowledge would enable the discovery and sharing of this knowledge, enabling developers to better leverage Stack Overflow for crowd-sourcing their design decisions. We first perform inductive coding of design-tagged Stack Overflow questions and answers to identify the design concepts that developers discuss. We report on areas where inter-rater agreement was a challenge, including abstraction levels. Since inductive coding is expensive, we apply a semi-supervised (Anchored CorEx) approach. We find that it outperforms LDA and offers superior interpretability and the ability to incorporate expert domain knowledge. We leverage Anchored CorEx to identify how design is discussed on Stack Overflow and used in GitHub projects. We conclude by describing how our experience using the semi-supervised CorEx approach leads us to believe that approaches like Anchored CorEx that combine domain knowledge and scalability are key for analyzing large SE text repositories. / Graduate
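
A minimal sketch of anchored topic extraction with the corextopic package is given below, assuming the package is installed; the toy posts, anchor terms and parameters are illustrative, and the API details should be checked against the library's documentation.

```python
# Anchored CorEx: seed selected topics with domain-knowledge anchor words.
from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

posts = [
    "which design pattern fits a repository over an ORM",
    "best practice for a data access layer with caching",
    "singleton versus dependency injection for configuration",
    "how to normalise a database schema for reporting",
]

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(posts)                 # binary document-word matrix
vocab = list(vectorizer.get_feature_names_out())

model = ct.Corex(n_hidden=3, seed=0)
model.fit(X, words=vocab,
          anchors=[["design", "pattern"], ["database", "schema"]],
          anchor_strength=3)                        # bias two topics toward the anchors

for i, topic in enumerate(model.get_topics()):
    print(i, [w for w, *rest in topic])             # top words per learned topic
```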
7

Customer Experience and its Implication for Value Creation within the Night-Time Economy / Kundupplevelse och dess innebörd för värdeskapande inom nattlivet

Lewerentz, Eric January 2021 (has links)
Consumer behaviour is changing within industries due to new technologies such as smartphones. As consumer behaviour changes, companies adapt the way they engage and interact with their customers. This provides potential to innovate new service offerings. Successfully launching new services that provide value for the customer nevertheless carries a risk of failure. To mitigate this risk, a clear understanding of the customer can help clarify what value a service offering should provide in order to be successfully adopted by the market. Because customer experience is unique to each individual, personalization is a technique that could be used within software to improve the customer experience. Challenges can arise from scarcity of data, which can negatively impact the performance of a data-driven algorithm; veracity is another aspect of data associated with the potential to improve performance. Based on these two issues, this thesis conducted a sequential mixed-methods study consisting of a netnographic study on Instagram to better understand the customer experience within nightlife. The netnographic study also enabled the construction of a gold standard, which was used in a GSDMM topic modelling experiment whose purpose was to evaluate which topics required further pre-processing due to high ambiguity of the text content. Findings from the netnographic study and their implications for customer experience are discussed from the point of view of a software service offering. This study suggests that software offerings within nightlife can improve the customer experience during the pre-purchase phase by considering aspects related to age, interest in atmosphere, type of activity, preferred music genres, spending time with friends, or facilitating escapism. The discussed service has negligible control during the post-purchase stage, suggesting that the firm could innovate controlled touchpoints; such experiences can be related to anticipation, joy, celebration, social adventures, memories of previous nights out (stories), current music preferences, or new desires occurring spontaneously. Adopting a service-dominant logic, this study suggests that software services can facilitate the customer experience within nightlife through co-creation, since with the proper usage of data, network effects could occur between the customer and an organizer or venue within nightlife, as well as between customers. A future study is proposed to investigate how this coordination could be conducted through crowd-sourced interactions in which the software functions as an overseer of a multi-actor setting, to provide further insight into how such coordination impacts the co-creation of value.
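
A small sketch of the GSDMM step is given below, using the commonly used gsdmm Python implementation; the package is assumed to be installed, the hyperparameters and toy chat snippets are illustrative, and the exact attribute names may differ between forks of the library.

```python
# GSDMM (Dirichlet-multinomial mixture for short text): one topic per document,
# with K acting as an upper bound on the number of clusters actually used.
from gsdmm import MovieGroupProcess

docs = [
    ["great", "dj", "crowd", "dancing"],
    ["long", "queue", "expensive", "drinks"],
    ["friends", "birthday", "celebration"],
    ["music", "too", "loud", "bad", "sound"],
]
vocab = {word for doc in docs for word in doc}

mgp = MovieGroupProcess(K=8, alpha=0.1, beta=0.1, n_iters=30)
labels = mgp.fit(docs, len(vocab))   # one cluster label per short document

print("labels:", labels)
print("non-empty clusters:", sum(1 for c in mgp.cluster_doc_count if c > 0))
```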
8

Three Essays on Phishing Attacks, Individual Susceptibility, and Detection Accuracy

Bera, Debalina 08 1900 (has links)
Phishing is a social engineering attack to deceive and persuade people to divulge private information like usernames and passwords, account details (including bank account details), and social security numbers. Phishers typically utilize e-mail, chat, text messages, or social media. Despite the presence of automatic anti-phishing filters, phishing messages reach online users' inboxes. Understanding the influence of phishing techniques and individual differences on susceptibility and detection accuracy is an important step toward creating comprehensive behavioral and organizational anti-phishing awareness programs. This dissertation seeks to achieve a dual purpose in a series of three essays. Essay 1 explores the nature of phishing threats, including identifying the attack intentions and the psychological and design techniques of phishing attacks. Essay 2 seeks to understand the relative influence of attack techniques and individual phishing experiential traits on people's phishing susceptibility. Essay 3 seeks to understand the cognitive and affective differences that distinguish individuals' phishing detection accuracy.
9

A text-mining based approach to capturing the NHS patient experience

Bahja, Mohammed January 2017 (has links)
An important issue for healthcare service providers is to achieve high levels of patient satisfaction. Collecting patient feedback about their experience in hospital enables providers to analyse their performance in terms of the levels of satisfaction and to identify the strengths and limitations of their service delivery. A common method of collecting patient feedback is via online portals and the forums of the service provider, where patients can rate and comment on the service received. A challenge in analysing patient experience collected via online portals is that the amount of data can be huge and hence prohibitive to analyse manually. In this thesis, an automated approach to patient experience analysis via Sentiment Analysis, Topic Modelling, and Dependency Parsing methods is presented. The patient experience data collected from the National Health Service (NHS) online portal in the United Kingdom is analysed in the study to understand this experience. The study was carried out in three iterations: (1) In the first, the Sentiment Analysis method was applied, which identified whether a given patient feedback item was positive or negative. (2) The second iteration involved applying Topic Modelling methods to automatically identify themes and topics from the patient feedback. Further, the outcomes of the Sentiment Analysis study from the first iteration were utilised to identify the patient sentiment regarding the topic being discussed in a given comment. (3) In the third iteration of the study, Dependency Parsing methods were employed for each patient feedback item and the topics identified, and a method was devised to summarise the reason for a particular sentiment about each of the identified topics. The outcomes of the study demonstrate that text-mining methods can be effectively utilised to identify patients' sentiment in their feedback as well as to identify the themes and topics discussed in it. The approach presented in the study proved capable of automatically and effectively analysing the NHS patient feedback database. Specifically, it can provide an overview of the positive and negative sentiment rate, identify the frequently discussed topics and summarise individual patient feedback items. Moreover, an API visualisation tool is introduced to make the outcomes more accessible to healthcare providers.
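
As a simplified sketch of the dependency-parsing step, the snippet below uses spaCy (which may differ from the parser used in the thesis, and assumes the small English model has been downloaded) to pull out the adjectives attached to topic words, which is one way to summarise why a comment reads as positive or negative.

```python
# Extract (noun, adjective) pairs from patient comments via dependency relations:
# "amod" covers "excellent nurses"; "nsubj" + "acomp" covers "the staff were friendly".
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes: python -m spacy download en_core_web_sm

comments = [
    "The staff were friendly but the waiting time was unacceptable.",
    "Clean ward and excellent nurses.",
]

for text in comments:
    doc = nlp(text)
    pairs = []
    for token in doc:
        if token.dep_ == "amod" and token.head.pos_ == "NOUN":
            pairs.append((token.head.text, token.text))       # e.g. ("nurses", "excellent")
        elif token.dep_ == "acomp":
            subjects = [t for t in token.head.children if t.dep_ == "nsubj"]
            if subjects:
                pairs.append((subjects[0].text, token.text))  # e.g. ("staff", "friendly")
    print(text, "->", pairs)
```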
10

When Does it Mean? Detecting Semantic Change in Historical Texts

Hengchen, Simon 06 December 2017 (has links)
Contrary to what has been done to date in the hybrid field of natural language processing (NLP), this doctoral thesis holds that the new approach developed here makes it possible to semi-automatically detect semantic changes in digitised, OCRed, historical corpora. We define the term semi-automatic as "making use of an advanced tool whilst remaining in control of key decisions regarding the processing of the corpus". Although the tool utilised – topic modelling, and more precisely Latent Dirichlet Allocation (LDA) – is not unknown in NLP or computational historical semantics, where it is already mobilised to follow a priori selected words and try to detect when these words change meaning, it has never been used, to the best of our knowledge, to detect which words change in a humanistically relevant way. In other terms, our method does not study a word in context to gather information on this specific word, but the whole context – which we consider a witness to a potential evolution of reality – to gather more contextual information on one or several particular semantic shift candidates. In order to detect these semantic changes, we use the algorithm to create lexical fields: groups of words that together define a subject to which they all relate. By comparing lexical fields over different time periods of the same corpus (that is, by mobilising a diachronic approach), we try to determine whether new words appear over time. We argue that if a word starts to be used in a certain context at a certain time, it is a likely candidate for semantic change. Of course, the method developed here and illustrated by a case study applies to a certain context: that of digitised, OCRed, historical archives in Dutch. Nevertheless, this doctoral work also describes the advantages and disadvantages of the algorithm and postulates, on the basis of this evaluation, that the method is applicable to other fields, under other conditions. By carrying out a critical evaluation of the tools available and used, this doctoral thesis invites the community to reproduce the method, whilst pointing out obvious limitations of the approach and proposing ways to address them. / Doctorat en Information et communication / info:eu-repo/semantics/nonPublished
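
A toy sketch of the diachronic comparison is given below as our own simplification with scikit-learn, not the exact pipeline of the thesis: fit a topic model per time slice, treat the top words per topic as lexical fields, and flag words that only enter the fields in the later period as candidates for semantic change.

```python
# Compare per-period LDA "lexical fields" and flag newly appearing words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lexical_fields(docs, n_topics=2, n_top=5):
    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)
    vocab = vec.get_feature_names_out()
    fields = set()
    for topic in lda.components_:                 # one row of word weights per topic
        fields.update(vocab[i] for i in topic.argsort()[-n_top:])
    return fields

period_a = ["the harbour received tall sailing ships",
            "sailing ships carried grain and timber"]
period_b = ["the harbour received steam ships and coal",
            "steam engines replaced sails on the ships"]

candidates = lexical_fields(period_b) - lexical_fields(period_a)
print("semantic change candidates:", candidates)
```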
