31 |
Topic modeling on a classical Swedish text corpus of prose fiction : Hyperparameters’ effect on theme composition and identification of writing style
Apelthun, Catharina January 2021
A topic modeling method, smoothed Latent Dirichlet Allocation (LDA), is applied to a text corpus of classical Swedish prose fiction. The thesis consists of two parts. In the first part, a smoothed LDA model is applied to the corpus, investigating how changes in hyperparameter values affect the topics in terms of the distribution of words within topics and of topics within novels. In the second part, two smoothed LDA models are applied to a reduced corpus consisting only of adjectives. The generated topics are examined to see whether they are more likely to occur in texts by a particular author and whether the model could be used to identify writing style. With this new approach, the ability of the smoothed LDA model as a writing style identifier is explored. While the texts analyzed in this thesis are unusually long, as they are unsegmented prose fiction, the effect of the hyperparameters on model performance was found to be similar to that found in previous research. For the adjectives corpus, the models did succeed in generating topics with a higher probability of occurring in novels by the same author. The smoothed LDA was shown to be a good model for identification of writing style. Keywords: Topic modeling, Smoothed Latent Dirichlet Allocation, Gibbs sampling, MCMC, Bayesian statistics, Swedish prose fiction.
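The abstract names smoothed LDA, Gibbs sampling and hyperparameters without showing how they fit together. The following is a minimal sketch, not the thesis's implementation: a collapsed Gibbs sampler in plain Python, where `alpha` (document-topic prior) and `beta` (topic-word prior) are the two hyperparameters of the smoothed model. The function name `lda_gibbs`, the toy documents and all parameter values are assumptions made for illustration.

```python
import random


def lda_gibbs(docs, n_topics, alpha, beta, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for smoothed LDA (illustrative sketch).

    docs  : list of token lists
    alpha : symmetric Dirichlet prior on topics within a document
    beta  : symmetric Dirichlet prior on words within a topic
    Returns (theta, phi, vocab): per-document topic distributions,
    per-topic word distributions, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        row = []
        for w in doc:
            k = rng.randrange(n_topics)
            row.append(k)
            doc_topic[d][k] += 1
            topic_word[k][index[w]] += 1
            topic_total[k] += 1
        assignments.append(row)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = index[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Full conditional, up to a constant:
                # (n_dk + alpha) * (n_kv + beta) / (n_k + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Posterior mean estimates from the final counts.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
         for v in range(n_words)]
        for t in range(n_topics)
    ]
    return theta, phi, vocab


if __name__ == "__main__":
    # Hypothetical toy corpus, for illustration only.
    toy_docs = [
        ["sea", "ship", "storm", "sea"],
        ["love", "letter", "heart"],
        ["ship", "storm", "sea"],
        ["heart", "love", "love"],
    ]
    theta, phi, vocab = lda_gibbs(toy_docs, n_topics=2, alpha=0.1, beta=0.01)
    for row in theta:
        print([round(p, 2) for p in row])
```

In this sketch, a smaller `alpha` pushes each document toward fewer topics and a smaller `beta` pushes each topic toward fewer words, which is the kind of hyperparameter effect the first part of the thesis investigates.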
|
32 |
News media attention in Climate Action: Latent topics and open access
Karlsson, Kalle January 2020
The purpose of the thesis is i) to discover the latent topics of SDG13 and their coverage in news media, ii) to investigate the share of OA and non-OA articles and reviews in each topic, and iii) to compare the share of different OA types (Green, Gold, Hybrid and Bronze) in each topic. It adopts a heuristic perspective and an explorative approach in reviewing the three concepts open access, altmetrics and climate action (SDG13). Data is collected from SciVal, Unpaywall, Altmetric.com and Scopus, rendering a dataset of 70,206 articles and reviews published between 2014 and 2018. The documents retrieved are analyzed with descriptive statistics and topic modeling using Sklearn’s package for LDA (Latent Dirichlet Allocation) in Python. The findings show an altmetric advantage for OA in the case of news media and SDG13, which fluctuates over topics. News media is shown to focus on subjects with “visible” effects, in concordance with previous research on media coverage. Examples of this were topics concerning emissions of greenhouse gases and melting glaciers. Gold OA is the most common type mentioned in news outlets. It also generates the highest number of news mentions, while the average sum of news mentions was highest for documents published as Bronze. Moreover, the thesis is largely driven by the methods used, most notably the programming language Python. As such it outlines future paths for research into the three concepts reviewed as well as the methods used for topic modeling and programming.
|
33 |
Natural Language Processing on the Balance of the Swedish Software Industry and Higher Vocational Education
Bäckstrand, Emil, Djupedal, Rasmus January 2023
The Swedish software industry is fast-growing and in need of competent personnel, and the education system is on the front line of producing qualified graduates to meet the job market demand. Reports and studies show there exists a gap between industry needs and what is taught in higher education, and that there is an undefined skills shortage leading to recruitment failures. This study explored the industry-education gap with a focus on higher vocational education (HVE) through the use of natural language processing (NLP) to ascertain the demands of the industry and what is taught in HVE. Using the authors' custom-made tool Vocational Education and Labour Market Analyser (VELMA), job ads and HVE curricula were collected from the Internet. They were then analysed through the topic modelling process latent Dirichlet allocation (LDA) to classify lower level keywords into cohesive categories for document frequency analysis. Findings show that a large number of HVE programmes collaborate with the industry via indirect financing and that job ads written in Swedish consist, in larger part, of inconsequential words compared to ads written in English. Moreover, an industry demand within cloud and embedded technologies, security engineers and software architects can be observed, whereas the findings from HVE curricula point to a focus on educating web developers and general object-oriented programming languages. While there are limitations in the topic modelling process, the authors conclude that there is a mismatch between what is taught in HVE programmes and industry demand. The skills identified to be lacking in HVE were associated with cloud-, embedded-, and security-related technologies together with architectural disciplines. The authors recommend future work with a focus on improving the topic modelling process and including curricula from general higher education.
|
34 |
Generating Thematic Maps from Hyperspectral Imagery Using a Bag-of-Materials Model
Park, Kyoung Jin 25 July 2013
No description available.
|
35 |
Anemone: a Visual Semantic Graph
Ficapal Vila, Joan January 2019
Semantic graphs have been used for optimizing various natural language processing tasks as well as augmenting search and information retrieval tasks. In most cases these semantic graphs have been constructed through supervised machine learning methodologies that depend on manually curated ontologies such as Wikipedia or similar. In this thesis, which consists of two parts, we explore in the first part the possibility of automatically populating a semantic graph from an ad hoc data set of 50 000 newspaper articles in a completely unsupervised manner. The utility of the visual representation of the resulting graph is tested on 14 human subjects performing basic information retrieval tasks on a subset of the articles. Our study shows that, for entity finding and document similarity, our feature engineering is viable and the visual map produced by our artifact is visually useful. In the second part, we explore the possibility of identifying entity relationships in an unsupervised fashion by employing abstractive deep learning methods for sentence reformulation. The reformulated sentence structures are qualitatively assessed with respect to grammatical correctness and meaningfulness as perceived by 14 test subjects. We evaluate the outcomes of this second part negatively, as they were not good enough to support any definitive conclusion, but they have instead opened new doors to explore. / Swedish abstract, translated: Semantic graphs have been used to optimise various natural language processing tasks and to improve search and information retrieval tasks. In most cases such semantic graphs have been constructed through supervised machine learning methods that rely on manually curated ontologies such as Wikipedia or similar. In this thesis, which consists of two parts, we investigate in the first part the possibility of automatically generating a semantic graph from an ad hoc data set of 50 000 newspaper articles in a completely unsupervised manner.
The usability of the visual representation of the resulting graph is tested on 14 subjects performing basic information retrieval tasks on a subset of the articles. Our study shows that our feature engineering is viable for entity finding and document similarity, and that the visual map produced by our artifact is visually useful. In the second part we explore the possibility of identifying entity relationships in an unsupervised manner by using abstractive deep learning methods for sentence reformulation. The reformulated sentences are evaluated qualitatively with respect to grammatical correctness and meaningfulness as perceived by 14 test subjects. We evaluate the results of this second part negatively, since they were not good enough to support any definitive conclusion, but they have instead opened new doors to explore.
|
36 |
Topic classification of Monetary Policy Minutes from the Swedish Central Bank / Ämnesklassificering av Riksbankens penningpolitiska mötesprotokoll
Cedervall, Andreas, Jansson, Daniel January 2018
Over the last couple of years, Machine Learning has seen a very high increase in usage. Many previously manual tasks are becoming automated and it stands to reason that this development will continue at an incredible pace. This paper builds on the work in Topic Classification and attempts to provide a baseline on how to analyse the Swedish Central Bank Minutes and gather information using both Latent Dirichlet Allocation and a simple Neural Network. Topic Classification is done on Monetary Policy Minutes from 2004 to 2018 to find how the distributions of topics change over time. The results are compared to empirical evidence that would confirm trends. Finally, a business perspective of the work is analysed to reveal what the benefits of implementing this type of technique could be. The results of the two methods are compared and found to differ. Specifically, the Neural Network shows larger changes in topic distributions than the Latent Dirichlet Allocation. The Neural Network also yielded more trends that correlated with other observations, such as the start of bond purchasing by the Swedish Central Bank. Thus, our results indicate that a Neural Network would perform better than the Latent Dirichlet Allocation when analyzing Swedish Monetary Policy Minutes. / Swedish abstract, translated: In recent years, artificial intelligence and machine learning have received a great deal of attention and grown enormously. Previously manual tasks are now being automated, and much suggests that this development will continue at a high pace. This work builds on earlier work in topic modeling (topic classification) and applies it to a previously unexplored area, central bank minutes. Latent Dirichlet Allocation and a Neural Network are used to investigate whether the distribution of discussion points (topics) changes over time. Finally, a theoretical discussion of the potential business value of implementing a similar method is presented.
The results of the different models show large differences over time. While Latent Dirichlet Allocation finds no major trends in the discussion points, the Neural Network shows larger changes over time. The latter also agree well with other observations such as the start of bond purchases. The results therefore indicate that a Neural Network is a more suitable method for analysing the Riksbank's minutes.
|
37 |
Overcoming The New Item Problem In Recommender Systems : A Method For Predicting User Preferences Of New Items
Jonason, Alice January 2023
This thesis addresses the new item problem in recommender systems, which pertains to the challenges of providing personalized recommendations for items which have limited user interaction history. The study proposes and evaluates a method for generating personalized recommendations for movies, shows, and series on one of Sweden’s largest streaming platforms. By treating these items as documents of the attributes which characterize them and utilizing item similarity through the k-nearest neighbor algorithm, user preferences for new items are predicted based on users’ past preferences for similar items. Two models for feature representation, namely the Vector Space Model (VSM) and a Latent Dirichlet Allocation (LDA) topic model, are considered and compared. The k-nearest neighbor algorithm is utilized to identify similar items for each type of representation, with cosine distance for VSM and Kullback-Leibler divergence for LDA. Furthermore, three different ways of predicting user preferences based on the preferences for the neighbors are presented and compared. The performances of the models in terms of predicting preferences for new items are evaluated with historical streaming data. The results indicate the potential of leveraging item similarity and previous streaming history to predict preferences of new items. The VSM representation proved more successful; using this representation, 77 percent of actual positive instances were correctly classified as positive. For both types of representations, giving higher weight to preferences for more similar items when predicting preferences yielded higher F2 scores, and optimizing for the F2 score implied that recommendations should be made when there is the slightest indication of preference for the neighboring items. The results indicate that the neighbors identified through the VSM representation were more representative of user preferences for new items, compared to those identified through the LDA representation.
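The quantities named in the abstract above (cosine distance for VSM vectors, Kullback-Leibler divergence for LDA topic distributions, similarity-weighted neighbor preferences, and the F2 score) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the thesis's code: the function names are hypothetical, and the inverse-distance weighting in `weighted_preference` is only one possible way of giving more similar items a higher weight.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero attribute vectors (VSM)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def weighted_preference(neighbors, eps=1e-9):
    """Predicted preference score in [0, 1] for a new item.

    neighbors: list of (distance, liked) pairs, liked being 0 or 1.
    Inverse-distance weights are an assumption made for this sketch.
    """
    weights = [1.0 / (dist + eps) for dist, _ in neighbors]
    return sum(w * liked for w, (_, liked) in zip(weights, neighbors)) / sum(weights)


def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from confusion counts; beta=2 weights recall above precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(cosine_distance([1, 0, 1], [1, 1, 0]))
    print(kl_divergence([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))
    print(weighted_preference([(0.1, 1), (0.4, 0), (0.8, 1)]))
    print(f_beta(tp=8, fp=4, fn=2))
```

Because the F2 score penalises missed positives more than false alarms, lowering the threshold on a score such as `weighted_preference` tends to raise it, which is consistent with the abstract's remark about recommending on the slightest indication of preference.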
|
38 |
A Novel Approach For Cancer Characterization Using Latent Dirichlet Allocation and Disease-Specific Genomic Analysis
Yalamanchili, Hima Bindu 05 June 2018
No description available.
|
39 |
A framework for exploiting electronic documentation in support of innovation processes
Uys, J. W. 03 1900
Thesis (PhD (Industrial Engineering))--University of Stellenbosch, 2010. / ENGLISH ABSTRACT: The crucial role of innovation in creating sustainable competitive advantage is widely recognised in industry today. Likewise, the importance of having the required information accessible to the right employees at the right time is well-appreciated. More specifically, the dependency of effective, efficient innovation processes on the availability of information has been pointed out in literature.
A great challenge is countering the effects of the information overload phenomenon in organisations in order for employees to find the information appropriate to their needs without having to wade through excessively large quantities of information to do so. The initial stages of the innovation process, which are characterised by free association, semi-formal activities, conceptualisation, and experimentation, have already been identified as a key focus area for improving the effectiveness of the entire innovation process. The dependency on information during these early stages of the innovation process is especially high.
Any organisation requires a strategy for innovation, a number of well-defined, implemented processes and measures to be able to innovate in an effective and efficient manner and to drive its innovation endeavours. In addition, the organisation requires certain enablers to support its innovation efforts which include certain core competencies, technologies and knowledge. Most importantly for this research, enablers are required to more effectively manage and utilise innovation-related information. Information residing inside and outside the boundaries of the organisation is required to feed the innovation process. The specific sources of such information are numerous. Such information may further be structured or unstructured in nature. However, an ever-increasing ratio of available innovation-related information is of the unstructured type. Examples include the textual content of reports, books, e-mail messages and web pages. This research explores the innovation landscape and typical sources of innovation-related information. In addition, it explores the landscape of text analytical approaches and techniques in search of ways to more effectively and efficiently deal with unstructured, textual information.
A framework that can be used to provide a unified, dynamic view of an organisation's innovation-related information, both structured and unstructured, is presented. Once implemented, this framework will constitute an innovation-focused knowledge base that will organise and make accessible such innovation-related information to the stakeholders of the innovation process. Two novel, complementary text analytical techniques, Latent Dirichlet Allocation and the Concept-Topic Model, were identified for application with the framework. The potential value of these techniques as part of the information systems that would embody the framework is illustrated. The resulting knowledge base would cause a quantum leap in the accessibility of information and may significantly improve the way innovation is done and managed in the target organisation. / AFRIKAANS ABSTRACT (translated): The importance of innovation for establishing a sustainable competitive advantage is currently widely recognised in many sectors of industry. The importance of making relevant information accessible to employees at the appropriate time is likewise well appreciated today. The dependence of effective, efficient innovation processes on the availability of information is continually emphasised in the research literature.
A major challenge at present is to combat the causes and impact of the information overload phenomenon in organisations, so as to enable employees to find information that meets their needs without having to sift through excessively large quantities of information in the process. The initial steps of the innovation process, characterised by free association, semi-formal activities, conceptualisation and experimentation, have already been identified as key areas for improving the effectiveness of the innovation process as a whole. The dependence of this part of the innovation process on information is particularly high.
To innovate in an effective and optimal manner, every organisation needs a strategy for innovation as well as a number of well-defined, deployed processes and measurement criteria to drive the organisation's innovation activities. In addition, organisations need certain innovation support mechanisms, which include particular core competencies, technologies and knowledge. Central to this research, organisations also need support mechanisms that enable them to manage and use innovation-related information more effectively. Information, housed both inside and outside the boundaries of the organisation, is needed to feed the innovation process. The sources of such information are numerous, and this information may be structured or unstructured in nature. An increasing percentage of innovation-related information is, however, of the unstructured type, for example the information contained in the textual content of reports, books, e-mail messages and web pages. This research explores the innovation landscape as well as typical sources of innovation-related information. Furthermore, the landscape of text analytical approaches and techniques is investigated in order to find ways of dealing more effectively and optimally with unstructured, textual information. A framework that can be used to create a unified, dynamic representation of an organisation's innovation-related information, both structured and unstructured, is proposed. Once implemented, this framework will organise the organisation's innovation-related information and make it more accessible to the participants in the innovation process. The application of two novel, complementary text analytical techniques to supplement the framework is reported. The potential value of these techniques as part of the information systems that realise the framework is further pointed out and illustrated.
|
40 |
Model trees with topic model preprocessing: an approach for data journalism illustrated with the WikiLeaks Afghanistan war logs
Rusch, Thomas, Hofmarcher, Paul, Hatzinger, Reinhold, Hornik, Kurt 06 1900
The WikiLeaks Afghanistan war logs contain nearly 77,000 reports of
incidents in the US-led Afghanistan war, covering the period from January
2004 to December 2009. The recent growth of data on complex social systems
and the potential to derive stories from them has shifted the focus of
journalistic and scientific attention increasingly toward data-driven journalism
and computational social science. In this paper we advocate the usage
of modern statistical methods for problems of data journalism and beyond,
which may help journalistic and scientific work and lead to additional insight.
Using the WikiLeaks Afghanistan war logs for illustration, we present an approach
that builds intelligible statistical models for interpretable segments in
the data, in this case to explore the fatality rates associated with different circumstances
in the Afghanistan war. Our approach combines preprocessing by
Latent Dirichlet Allocation (LDA) with model trees. LDA is used to process
the natural language information contained in each report summary by estimating
latent topics and assigning each report to one of them. Together with
other variables these topic assignments serve as splitting variables for finding
segments in the data to which local statistical models for the reported number
of fatalities are fitted. Segmentation and fitting is carried out with recursive
partitioning of negative binomial distributions. We identify segments with
different fatality rates that correspond to a small number of topics and other
variables as well as their interactions. Furthermore, we carve out the similarities
between segments and connect them to stories that have been covered in
the media. This gives an unprecedented description of the war in Afghanistan
and serves as an example of how data journalism, computational social science
and other areas with interest in database data can benefit from modern
statistical techniques. (authors' abstract)
|