21 |
Views or news? Exploring the interplay of production and consumption of political news content on YouTube
Darin, Jasper. January 2023
YouTube is the second largest social media platform in the world, with a multitude of popular channels which combine politicised commentary with news reporting. The platform provides direct access to data, which makes it possible for commentators to adjust their content to reach wider audiences. Taken to an extreme, however, this could mean that creators pick the topics that are the most financially beneficial or most likely to lead to fame. If this were the case, it would highlight populist newsmaking and the mechanisms behind it. To investigate the production-consumption interaction, data from the 10 most popular channels for 2021 were collected. Using latent Dirichlet allocation and preferential attachment analysis, the effect of cumulative advantage and whether topic choice was driven by views were measured. A positive feedback loop, where prevalent topics become more prevalent, was found in all but two channels, but picking topics which generated more views was only present for one channel. The findings imply that, over a year, the top political news commentators have a set of topics to which they return to a high degree, but that choosing whichever topics are most popular at the time is not a general feature.
|
22 |
A Latent Dirichlet Allocation/N-gram Composite Language Model
Kulhanek, Raymond Daniel. 08 November 2013
No description available.
|
23 |
Latent Dirichlet Allocation for the Detection of Multi-Stage Attacks
Lefoane, Moemedi; Ghafir, Ibrahim; Kabir, Sohag; Awan, Irfan U. 19 December 2023
The rapid shift to, and increase in, remote access to organisation resources have led to a significant increase in the number of attack vectors and attack surfaces, which in turn has motivated the development of newer and more sophisticated cyber-attacks. Such attacks include Multi-Stage Attacks (MSAs), in which the attack is executed through several stages. Classifying malicious traffic into stages to get more information about the attack life-cycle is a challenge. This paper proposes a malicious traffic clustering approach based on Latent Dirichlet Allocation (LDA), a topic modelling approach used in natural language processing to address similar problems. The proposed approach is unsupervised and will therefore be beneficial in scenarios where traffic data is not labeled and analysis needs to be performed. The proposed approach uncovers intrinsic contexts that relate to different categories of attack stages in MSAs. These are vital insights for different areas of cybersecurity teams, such as Incident Response (IR) within the Security Operations Center (SOC), and could have a positive impact in ensuring that attacks are detected at early stages in MSAs. For IR, these insights also help to understand attack behavioural patterns and lead to reduced recovery time following an incident. The proposed approach is evaluated on a publicly available MSA dataset. The performance results are promising, as evidenced by over 99% accuracy in identified malicious traffic clusters.
|
24 |
Visualização em multirresolução do fluxo de tópicos em coleções de texto [Multiresolution visualization of topic flow in text collections]
Schneider, Bruno. 21 March 2014
The combined use of topic discovery algorithms for document collections with techniques for visualizing how those topics evolve over time allows thematic patterns in large corpora to be explored through compact visual representations. This research investigated the requirements for visualizing data on the thematic composition of documents obtained through topic modeling, data which are sparse and multi-attribute, at different levels of detail, by comparing a purpose-built visualization technique with an open source data visualization library. For the problem of topic flow visualization, we observed conflicting display requirements at different data resolutions, which led to a detailed investigation of ways of manipulating and displaying these data. The hypothesis put forward was that the integrated use of more than one visualization technique, chosen according to the resolution of the data, expands the possibilities for exploring the object under study compared with what a single technique would provide. The main contribution of this work is to show the limits on the use of these techniques according to the resolution of data exploration, in order to support the development of new applications.
|
25 |
Characterisation of a developer’s experience fields using topic modelling
Déhaye, Vincent. January 2020
Finding the most relevant candidate for a position is a ubiquitous challenge for organisations. It can also be arduous for candidates to explain in a concise resume what they have experience with. Because candidates usually have to select which experience to present and leave out the rest, they may go undetected by the person carrying out the search even though they do have the desired experience. In the field of software engineering, developing one's experience usually leaves traces behind: the code one has produced. This project explores approaches to tackle the screening challenge with an automated way of extracting experience directly from code, by defining common lexical patterns in code for different experience fields using topic modeling. Two different techniques were compared. On the one hand, Latent Dirichlet Allocation (LDA) is a generative statistical model which has proven to yield good results in topic modeling. On the other hand, Non-Negative Matrix Factorization (NMF) factorizes a matrix representing the code corpus as word counts per piece of code into non-negative factors. The code gathered consisted of 30 random repositories from all the collaborators of the open-source Ruby-on-Rails project on GitHub, to which common natural language processing transformation steps were then applied. The results of the two techniques were compared using perplexity for LDA, reconstruction error for NMF, and topic coherence for both. The first two represent how well the data could be represented by the topics produced, while the latter estimates how well the elements of a topic hang and fit together, and can reflect human understandability and interpretability. Given that we did not have any similar work to benchmark against, the values obtained are hard to assess scientifically. However, the method seems promising, as we would have been rather confident in assigning labels to 10 of the topics generated.
The results imply that one could probably use natural language processing methods directly on produced code in order to extend the detected fields of experience of a developer, with a finer granularity than traditional resumes and with field definitions evolving dynamically with the technology.
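The comparison described in this abstract can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the thesis's own pipeline: the toy corpus, the function name `fit_topic_models`, and all parameter values are assumptions made for the example, and topic coherence is left out.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: each string stands in for the tokens of one piece of code.
docs = [
    "render view template layout partial helper",
    "view template render html layout css",
    "query record database migration schema index",
    "database schema migration record table query",
    "test assert spec mock fixture expect",
    "spec test expect assert mock stub",
]


def fit_topic_models(texts, n_topics=3, seed=0):
    """Fit LDA and NMF on the same word-count matrix and return one fit score each."""
    counts = CountVectorizer().fit_transform(texts)  # documents x vocabulary counts

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    lda.fit(counts)

    nmf = NMF(n_components=n_topics, init="nndsvda", random_state=seed, max_iter=500)
    nmf.fit(counts)

    return {
        # Lower perplexity means the LDA topics describe the counts better.
        "lda_perplexity": lda.perplexity(counts),
        # Frobenius-norm distance between the counts and their NMF reconstruction.
        "nmf_reconstruction_error": nmf.reconstruction_err_,
    }


scores = fit_topic_models(docs)
print(scores)
```

The two scores are on different scales, so they are only meaningful for comparing runs of the same model (for example across different numbers of topics), not for ranking LDA against NMF directly.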
|
26 |
Topic modeling on a classical Swedish text corpus of prose fiction: Hyperparameters’ effect on theme composition and identification of writing style
Apelthun, Catharina. January 2021
A topic modeling method, smoothed Latent Dirichlet Allocation (LDA), is applied to a corpus of classical Swedish prose fiction. The thesis consists of two parts. In the first part, a smoothed LDA model is applied to the corpus, investigating how changes in hyperparameter values affect the topics in terms of the distribution of words within topics and of topics within novels. In the second part, two smoothed LDA models are applied to a reduced corpus consisting only of adjectives. The generated topics are examined to see whether they are more likely to occur in a text by a particular author and whether the model could be used for identification of writing style. With this new approach, the ability of the smoothed LDA model to act as a writing style identifier is explored. Although the texts analyzed in this thesis are unusually long, as the prose fiction is not segmented, the effect of the hyperparameters on model performance was found to be similar to that found in previous research. For the adjectives corpus, the models did succeed in generating topics with a higher probability of occurring in novels by the same author. The smoothed LDA was shown to be a good model for identification of writing style. Keywords: Topic modeling, Smoothed Latent Dirichlet Allocation, Gibbs sampling, MCMC, Bayesian statistics, Swedish prose fiction.
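For orientation, the smoothed LDA model and the role of its two hyperparameters can be written out as below. The notation is an assumption (it follows common textbook usage with symmetric priors, and the thesis may use different symbols or a different sampler variant): K topics, D documents, V vocabulary size, α the document-topic prior and η the topic-word prior.

```latex
% Generative process of smoothed LDA (symmetric priors assumed)
\begin{align*}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
\end{align*}

% Collapsed Gibbs update for one token; counts exclude the token being resampled
\[
p\bigl(z_{d,n} = k \mid z_{-(d,n)}, w\bigr) \;\propto\;
\bigl(n_{d,k}^{-(d,n)} + \alpha\bigr)\,
\frac{n_{k,w_{d,n}}^{-(d,n)} + \eta}{n_{k,\cdot}^{-(d,n)} + V\eta}
\]
```

Here n_{d,k} counts the tokens in document d assigned to topic k, and n_{k,w} counts the assignments of word w to topic k. A small α concentrates each document on a few topics, and a small η concentrates each topic on a few words, which is the kind of effect on "topics within novels" and "words within topics" that the abstract describes.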
|
27 |
News media attention in Climate Action: Latent topics and open access
Karlsson, Kalle. January 2020
The purpose of the thesis is i) to discover the latent topics of SDG13 and their coverage in news media, ii) to investigate the share of OA and non-OA articles and reviews in each topic, and iii) to compare the share of different OA types (Green, Gold, Hybrid and Bronze) in each topic. It adopts a heuristic perspective and an exploratory approach in reviewing the three concepts of open access, altmetrics and climate action (SDG13). Data were collected from SciVal, Unpaywall, Altmetric.com and Scopus, yielding a dataset of 70,206 articles and reviews published between 2014 and 2018. The documents retrieved were analyzed with descriptive statistics and with topic modeling using scikit-learn's implementation of latent Dirichlet allocation (LDA) in Python. The findings show an altmetric advantage for OA in the case of news media and SDG13, which fluctuates over topics. News media are shown to focus on subjects with “visible” effects, in concordance with previous research on media coverage. Examples of this were topics concerning emissions of greenhouse gases and melting glaciers. Gold OA is the type most commonly mentioned in news outlets. It also generates the highest number of news mentions, while the average sum of news mentions was highest for documents published as Bronze. Moreover, the thesis is largely driven by the methods used, most notably the programming language Python. As such, it outlines future paths for research into the three concepts reviewed as well as into the methods used for topic modeling and programming.
|
28 |
Natural Language Processing on the Balance of the Swedish Software Industry and Higher Vocational Education
Bäckstrand, Emil; Djupedal, Rasmus. January 2023
The Swedish software industry is fast-growing and in need of competent personnel; the education system is on the front line of producing qualified graduates to meet the job market demand. Reports and studies show that there is a gap between industry needs and what is taught in higher education, and that there is an undefined skills shortage leading to recruitment failures. This study explored the industry-education gap with a focus on higher vocational education (HVE) through the use of natural language processing (NLP) to ascertain the demands of the industry and what is taught in HVE. Using the authors' custom-made tool Vocational Education and Labour Market Analyser (VELMA), job ads and HVE curricula were collected from the Internet. These were then analysed with the topic modelling process latent Dirichlet allocation (LDA) to classify lower-level keywords into cohesive categories for document frequency analysis. Findings show that a large number of HVE programmes collaborate with the industry via indirect financing and that job ads written in Swedish consist, in larger part, of inconsequential words compared to ads written in English. Moreover, an industry demand for cloud and embedded technologies, security engineers and software architects can be observed, whereas the findings from HVE curricula point to a focus on educating web developers and on general object-oriented programming languages. While there are limitations in the topic modelling process, the authors conclude that there is a mismatch between what is taught in HVE programmes and industry demand. The skills identified to be lacking in HVE were associated with cloud-, embedded-, and security-related technologies together with architectural disciplines. The authors recommend future work with a focus on improving the topic modelling process and including curricula from general higher education.
|
29 |
Generating Thematic Maps from Hyperspectral Imagery Using a Bag-of-Materials Model
Park, Kyoung Jin. 25 July 2013
No description available.
|
30 |
Anemone: a Visual Semantic Graph
Ficapal Vila, Joan. January 2019
Semantic graphs have been used for optimizing various natural language processing tasks as well as for augmenting search and information retrieval tasks. In most cases these semantic graphs have been constructed through supervised machine learning methodologies that depend on manually curated ontologies such as Wikipedia or similar. In the first part of this two-part thesis, we explore the possibility of automatically populating a semantic graph from an ad hoc data set of 50 000 newspaper articles in a completely unsupervised manner. The utility of the visual representation of the resulting graph is tested on 14 human subjects performing basic information retrieval tasks on a subset of the articles. Our study shows that, for entity finding and document similarity, our feature engineering is viable and the visual map produced by our artifact is useful. In the second part, we explore the possibility of identifying entity relationships in an unsupervised fashion by employing abstractive deep learning methods for sentence reformulation. The reformulated sentence structures are qualitatively assessed with respect to grammatical correctness and meaningfulness as perceived by 14 test subjects. We evaluate the outcomes of this second part negatively: they were not good enough to support any definitive conclusion, but they have opened new directions to explore.
|