21 |
A Geometric Framework for Transfer Learning Using Manifold Alignment. Wang, Chang, 01 September 2010.
Many machine learning problems involve dealing with a large amount of high-dimensional data across diverse domains. In addition, annotating or labeling the data is expensive as it involves significant human effort. This dissertation explores a joint solution to both these problems by exploiting the property that high-dimensional data in real-world application domains often lies on a lower-dimensional structure, whose geometry can be modeled as a graph or manifold. In particular, we propose a set of novel manifold-alignment based approaches for transfer learning. The proposed approaches transfer knowledge across different domains by finding low-dimensional embeddings of the datasets to a common latent space, which simultaneously match corresponding instances while preserving local or global geometry of each input dataset. We develop a novel two-step transfer learning method called Procrustes alignment. Procrustes alignment first maps the datasets to low-dimensional latent spaces reflecting their intrinsic geometries and then removes the translational, rotational and scaling components from one set so that the optimal alignment between the two sets can be achieved. This approach can preserve either global geometry or local geometry depending on the dimensionality reduction approach used in the first step. We propose a general one-step manifold alignment framework called manifold projections that can find alignments, both across instances as well as across features, while preserving local domain geometry. We develop and mathematically analyze several extensions of this framework to more challenging situations, including (1) when no correspondences across domains are given; (2) when the global geometry of each input domain needs to be respected; (3) when label information rather than correspondence information is available. A final contribution of this thesis is the study of multiscale methods for manifold alignment. Multiscale alignment automatically generates alignment results at different levels by discovering the shared intrinsic multilevel structures of the given datasets, providing a common representation across all input datasets.
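To make the two-step Procrustes idea concrete, here is a minimal sketch, assuming PCA as a stand-in for the first dimensionality-reduction step (any graph-based or manifold embedding could be substituted) and SciPy's orthogonal Procrustes solver for the second step; it is an illustration of the general technique, not the dissertation's exact implementation or data.

```python
# A minimal sketch of two-step Procrustes-style alignment; PCA and the toy data
# are assumptions, not the dissertation's embeddings or datasets.
import numpy as np
from sklearn.decomposition import PCA
from scipy.linalg import orthogonal_procrustes

def procrustes_align(X_src, X_tgt, n_dims=2):
    """Embed two datasets and align the source embedding onto the target."""
    # Step 1: map each dataset to a low-dimensional latent space.
    A = PCA(n_components=n_dims).fit_transform(X_src)
    B = PCA(n_components=n_dims).fit_transform(X_tgt)
    # Step 2: remove translation, then solve for rotation and scale.
    A_c, B_c = A - A.mean(0), B - B.mean(0)
    R, scale = orthogonal_procrustes(A_c, B_c)
    A_aligned = (A_c @ R) * (scale / (A_c ** 2).sum())
    return A_aligned + B.mean(0), B

# Toy usage: two random "domains" assumed to share corresponding rows.
X1, X2 = np.random.rand(100, 10), np.random.rand(100, 8)
aligned_src, tgt = procrustes_align(X1, X2)
```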
22 |
MULTI-ATTRIBUTE AND TEMPORAL ANALYSIS OF PRODUCT REVIEWS USING TOPIC MODELLING AND SENTIMENT ANALYSIS. Meet Tusharbhai Suthar (14232623), 08 December 2022.
Online reviews, along with photographs and one-to-five-star ratings, are frequently used to determine a product's quality before purchase. This research addresses two distinct problems observed in review systems.
First, because a single product can have thousands of reviews, the different characteristics of customer evaluations, such as consumer sentiment, cannot be understood by manually reading only a few of them. Second, it is extremely hard to understand from these reviews how sentiment and other important product aspects change over the years (temporal analysis). To address these problems, the study focused on two main research parts.
Part one of the research focused on how topic modelling and sentiment analysis can work together to give a deeper, attribute-based understanding of product reviews. The second part compared different topic modelling approaches to evaluate the performance and advantages of emerging NLP models. For this purpose, a dataset consisting of 469 publicly accessible Amazon reviews of the Kindle E-reader and 15,000 reviews of iPhone products was used for sentiment analysis and topic modelling. The Latent Dirichlet Allocation (LDA) and BERTopic topic models were used to perform topic modelling and to acquire the diverse topics of concern. Sentiment analysis was carried out to better understand each topic's positive and negative tone. Topic analysis of Kindle user reviews revealed the following major themes: (a) leisure consumption, (b) utility as a gift, (c) pricing, (d) parental control, (e) reliability and durability, and (f) charging. While the main themes that emerged from the analysis of iPhone reviews depended on the model and year of the device, some themes were consistent across all iPhone models, including (a) Apple vs. Android, (b) utility as a gift, and (c) service. The study's approach can be used to analyze customer reviews for any product, and the results provide a deeper understanding of a product's strengths and weaknesses based on a comprehensive analysis of user feedback, which is useful for product makers, retailers, e-commerce platforms, and consumers.
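As a rough illustration of combining topic modelling with per-topic sentiment, the sketch below assumes scikit-learn's LDA and NLTK's VADER analyzer as stand-ins for the study's pipeline (which also used BERTopic); the reviews shown are made-up examples, not the Kindle or iPhone data.

```python
# A minimal sketch: assign each review to its dominant topic, then average the
# sentiment of the reviews under each topic. All inputs are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")

reviews = [
    "Battery lasts for weeks and it is perfect for reading at night.",
    "The screen froze after a month and customer support was useless.",
    "Bought it as a gift for my mom and she loves it.",
    "The charging cable stopped working, very disappointed.",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(reviews)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)        # one topic-weight row per review
dominant = doc_topics.argmax(axis=1)     # assign each review to its top topic

sia = SentimentIntensityAnalyzer()
terms = vec.get_feature_names_out()
for k in range(3):
    topic_reviews = [r for r, t in zip(reviews, dominant) if t == k]
    if not topic_reviews:
        continue
    mean_sent = sum(sia.polarity_scores(r)["compound"] for r in topic_reviews) / len(topic_reviews)
    top_words = [terms[i] for i in lda.components_[k].argsort()[::-1][:5]]
    print(f"topic {k}: sentiment={mean_sent:.2f}, words={top_words}")
```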
23 |
Product Defect Discovery and Summarization from Online User Reviews. Zhang, Xuan, 29 October 2018.
Product defects concern various groups of people, such as customers, manufacturers, and government officials. Thus, defect-related knowledge and information are essential. In keeping with the growth of social media, online forums, and Internet commerce, people post a vast amount of feedback on products, which forms a good source for the automatic acquisition of knowledge about defects. However, considering the vast volume of online reviews, how to automatically identify critical product defects and summarize the related information from the huge number of user reviews is challenging, even when we target only the negative reviews. Existing defect discovery methods, a kind of opinion mining research, mainly focus on classifying the type of product issue, which is not enough for users. People expect to see defect information in multiple facets, such as product model, component, and symptom, which are necessary to understand the defects and quantify their influence. In addition, people are eager to seek problem resolutions once they spot defects. These challenges cannot be solved by existing aspect-oriented opinion mining models, which seldom consider the defect entities mentioned above. Furthermore, users also want systems that better capture the semantics of review text and summarize product defects more accurately in the form of natural language sentences. However, existing text summarization models, including neural networks, can hardly generalize to user review summarization due to the lack of labeled data.
In this research, we explore topic models and neural network models for product defect discovery and summarization from user reviews. Firstly, a generative Probabilistic Defect Model (PDM) is proposed, which models the generation process of user reviews from key defect entities including product Model, Component, Symptom, and Incident Date. Using the joint topics in these aspects, which are produced by PDM, people can discover defects which are represented by those entities. Secondly, we devise a Product Defect Latent Dirichlet Allocation (PDLDA) model, which describes how negative reviews are generated from defect elements like Component, Symptom, and Resolution. The interdependency between these entities is modeled by PDLDA as well. PDLDA answers not only what the defects look like, but also how to address them using the crowd wisdom hidden in user reviews. Finally, the problem of how to summarize user reviews more accurately, and better capture the semantics in them, is studied using deep neural networks, especially Hierarchical Encoder-Decoder Models.
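The PDM and PDLDA models above are custom graphical models and are not reproduced here; as a hedged sketch of just the hierarchical encoding idea behind the neural summarization part, the following PyTorch skeleton encodes a review word by word and sentence by sentence. The decoder, attention, and training loop are omitted, and all layer sizes are illustrative assumptions.

```python
# A minimal hierarchical encoder skeleton, assuming PyTorch; not the
# dissertation's exact encoder-decoder architecture or training setup.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Encode a review as sentences of words: word-level GRU, then sentence-level GRU."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.word_gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.sent_gru = nn.GRU(hid_dim, hid_dim, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, n_sentences, n_words) of word ids
        b, s, w = tokens.shape
        emb = self.embed(tokens.view(b * s, w))          # (b*s, w, emb_dim)
        _, word_h = self.word_gru(emb)                   # (1, b*s, hid_dim)
        sent_inputs = word_h.squeeze(0).view(b, s, -1)   # one vector per sentence
        sent_out, review_h = self.sent_gru(sent_inputs)  # sentence states + review state
        return sent_out, review_h.squeeze(0)             # these would feed a decoder

# Toy usage: a batch of 2 reviews, 3 sentences each, 5 words per sentence.
enc = HierarchicalEncoder(vocab_size=1000)
sent_states, review_vec = enc(torch.randint(0, 1000, (2, 3, 5)))
```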
For each of the research topics, comprehensive evaluations are conducted on heterogeneous datasets to justify the effectiveness and accuracy of the proposed models. Further, on the theoretical side, this research contributes to the research stream on product defect discovery, opinion mining, probabilistic graphical models, and deep neural network models. Regarding impact, these techniques will benefit related users such as customers, manufacturers, and government officials. / Ph. D. / Product defects concern various groups of people, such as customers, manufacturers, and government officials. Thus, defect-related knowledge and information are essential. In keeping with the growth of social media, online forums, and Internet commerce, people post a vast amount of feedback on products, which forms a good source for the automatic acquisition of knowledge about defects. However, considering the vast volume of online reviews, how to automatically identify critical product defects and summarize the related information from the huge number of user reviews is challenging, even when we target only the negative reviews. People expect to see defect information in multiple facets, such as product model, component, and symptom, which are necessary to understand the defects and quantify their influence. In addition, people are eager to seek problem resolutions once they spot defects. Furthermore, users also want product defects summarized more accurately, in the form of natural language sentences. These requirements cannot be satisfied by existing methods, which seldom consider the defect entities mentioned above, or hardly generalize to user review summarization. In this research, we develop novel Machine Learning (ML) algorithms for product defect discovery and summarization. Firstly, we study how to identify product defects and their related attributes, such as Product Model, Component, Symptom, and Incident Date. Secondly, we devise a novel algorithm that can discover product defects and the related Component, Symptom, and Resolution from online user reviews. This method tells not only what the defects look like, but also how to address them using the crowd wisdom hidden in user reviews. Finally, we address the problem of how to summarize user reviews in the form of natural language sentences using a paraphrase-style method. On the theoretical side, this research contributes to multiple research areas in Natural Language Processing (NLP), Information Retrieval (IR), and Machine Learning. Regarding impact, these techniques will benefit related users such as customers, manufacturers, and government officials.
24 |
Topic Model-based Mass Spectrometric Data Analysis in Cancer Biomarker Discovery Studies. Wang, Minkun, 14 June 2017.
Identification of disease-related alterations in molecular and cellular mechanisms may reveal useful biomarkers for human diseases including cancers. High-throughput omic technologies for identifying and quantifying multi-level biological molecules (e.g., proteins, glycans, and metabolites) have facilitated advances in biological research in recent years. Liquid (or gas) chromatography coupled with mass spectrometry (LC/GC-MS) has become an essential tool in such large-scale omic studies. Appropriate LC/GC-MS data preprocessing pipelines are needed to detect true differences between biological groups. Challenges exist in several aspects of MS data analysis. Specifically for biomarker discovery, one fundamental challenge in the quantitation of biomolecules arises from the heterogeneous nature of human biospecimens. Although this issue has been a subject of discussion in cancer genomic studies, it has not yet been rigorously investigated in mass spectrometry-based omic studies. Purification of mass spectrometric data is highly desired prior to subsequent differential analysis.
In this dissertation, we mainly address the purification problem through probabilistic modeling. We propose an intensity-level purification model (IPM) to computationally purify LC/GC-MS based cancerous data in biomarker discovery studies. We further extend IPM to a scan-level purification model (SPM) by considering information from the extracted ion chromatogram (EIC, a scan-level feature). Both IPM and SPM belong to the category of topic modeling approaches, which aim to identify the underlying "topics" (sources) and their mixture proportions in composing the heterogeneous data. Additionally, a denoise deconvolution model (DDM) is proposed to capture the noise signals in samples based on purified profiles. Variational expectation-maximization (VEM) and Markov chain Monte Carlo (MCMC) methods are used to draw inference on the latent variables and estimate the model parameters. Beyond purification, other research topics related to mass spectrometric data analysis for cancer biomarker discovery are also investigated in this dissertation.
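The IPM, SPM, and DDM models above use LDA-style generative assumptions with VEM and MCMC inference, which are not reproduced here. Purely as a hedged analogy for the generic idea of decomposing heterogeneous intensity profiles into shared sources and per-sample mixture proportions, the sketch below uses non-negative matrix factorization on synthetic data; it is not the dissertation's method.

```python
# A hedged analogy only: NMF decomposition of synthetic "heterogeneous" intensity
# profiles into two sources and per-sample mixture proportions.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
pure_cancer = rng.gamma(2.0, 1.0, size=200)     # hypothetical pure cancerous source
contaminant = rng.gamma(2.0, 1.0, size=200)     # hypothetical contaminant source
props = rng.uniform(0.3, 0.9, size=20)          # true tumor fraction per sample
X = np.outer(props, pure_cancer) + np.outer(1 - props, contaminant)

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)                       # per-sample source weights
H = model.components_                            # estimated source profiles
mixture_proportions = W / W.sum(axis=1, keepdims=True)
print(np.round(mixture_proportions[:5], 2))      # compare against `props`
```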
Chapter 3 discusses the methods developed for differential analysis of LC/GC-MS based omic data, specifically the preprocessing of LC-MS profiled glycan data. Chapter 4 presents the assumptions and inference details of IPM, SPM, and DDM. A latent Dirichlet allocation (LDA) core is used to model the heterogeneous cancerous data as mixtures of topics consisting of a sample-specific pure cancerous source and non-cancerous contaminants. We evaluated the capability of the proposed models in capturing mixture proportions of contaminants and cancer profiles on LC-MS based serum and tissue proteomic and GC-MS based tissue metabolomic datasets acquired from patients with hepatocellular carcinoma (HCC) and liver cirrhosis. Chapter 5 elaborates on these applications in cancer biomarker discovery, covering both typical single-omic studies and integrative multi-omic analysis. / Ph. D. / This dissertation documents the methodology and outputs for computational deconvolution of heterogeneous omics data generated from biospecimens of interest. These omics data convey qualitative and quantitative information about biomolecules (e.g., glycans, proteins, and metabolites) profiled by liquid (or gas) chromatography coupled with mass spectrometry (LC/GC-MS). In biomarker discovery, we aim to find significant differences in biomolecule intensities between two phenotype groups so that the biomarkers can be used as clinical indicators for early-stage diagnosis. However, the purity of collected samples constitutes a fundamental challenge to differential analysis. Instead of costly and time-consuming experimental methods, we treat the purification task as a topic modeling procedure, where we assume each observed biomolecular profile is a mixture of a hidden pure source and unwanted contaminants.
The developed models output the estimated mixture proportions as well as the underlying "topics". With purification applied at different levels, improved discrimination power of candidate biomarkers and more biologically meaningful pathways were discovered in LC/GC-MS based multi-omic studies of liver cancer. This work belongs to the broader scope of probabilistic generative modeling, where rational assumptions are made to characterize the generation process of the observations. Therefore, the developed models have great potential in applications beyond the heterogeneous data purification discussed in this dissertation. A good example is uncovering the relationship between the human gut microbiome and host phenotypes of interest (e.g., diseases such as type-II diabetes), where similar challenges exist in inferring the underlying intestinal flora distribution and estimating mixture proportions.
This dissertation also covers related data preprocessing and integration topics, with the consistent goal of improving the performance of biomarker discovery. In summary, the research helps address the sample heterogeneity issue observed in LC/GC-MS based cancer biomarker discovery studies and sheds light on computational deconvolution of the mixtures, which can be generalized to other domains of interest.
25 |
A Study of Opinion Mining and Topic Model Analysis on Food Diaries (基於意見探勘與主題模型之部落格食記剖析研究). Lai, Po Fan, date unknown.
With the rise of Web 2.0, social networking sites account for a large share of how information is shared and obtained. In the food domain, it has become increasingly common for people to read dining reviews before visiting a restaurant, and because blog posts combine text and photographs, consumers often treat them as a source for reference and comparison. Although such food diaries are more complete than short restaurant reviews, the opinions are scattered throughout the article and usually come without ratings, so readers cannot grasp the overall shape of the review at a glance and must spend considerable effort reading before they can evaluate the restaurant as a whole.
This study proposes a method for analyzing food diaries based on opinion mining and topic modeling: the amount of sentiment in blog posts about each restaurant is used to reflect positive and negative evaluations, the opinions mentioned are grouped into three rating aspects ("food", "service", and "environment"), and an overall recommendation score for the restaurant is then derived for readers to consult quickly. The experimental corpus was selected from food-category posts on PIXNET (痞客邦) and covers four restaurants, 添好運台灣 (Tim Ho Wan Taiwan, Taipei Station branch), 京星港式飲茶 PART2, 金泰日式料理 (Neihu branch), and 喀佈貍 (first branch), a casual Japanese-style skewer izakaya, for a total of 200 articles.
An LDA topic model is used to cluster the sentences of the food diaries by topic, so that sentences with similar topical concepts fall into the same group and are assigned to the rating aspects; for example, the corpus for 喀佈貍 (first branch) can be divided into 10 topical sentence groups, 6 under the food aspect and 2 each under the service and environment aspects. In addition, to more effectively distinguish the positive and negative sentiment contained in food diaries, this study uses the semantic orientation method (SO-PMI) to compute the polarity of sentiment words that frequently appear in food diaries, thereby building an opinion-word lexicon for this domain.
In terms of experimental results, the online restaurant review site iPeen (愛評網) was used for validation; the average sentiment levels of the corpora are similar, tending to agree with public perception and ratings, and compared with general review sites this study can examine restaurants from finer-grained aspects and use sentiment levels to reflect real restaurant evaluations. Finally, directions for future investigation and improvement are proposed as a reference for subsequent research. / As Web 2.0 has risen, social media platforms play a crucial role in transferring and receiving information. More and more people are used to reading related posts before having a meal. Because of their rich content and accompanying photographs, blog posts are the reference consumers use most frequently. Although blog posts are more complete in content than other short reviews, the actual opinions are scattered among sentences that are simply descriptions, and there is no grading scale to use as a reference. Together, these make it hard for readers to efficiently form an overview of the review and, therefore, to decide whether they should go to the restaurant.
Our study offers a method for analyzing food diaries based on opinion mining and topic modeling. The amount of sentiment in a blog post about a restaurant is used to reflect whether its reviews are positive or negative. The comments are categorized into food, service, and environment, and the restaurant is graded on these three aspects to provide the user with an overall recommendation score.
We collected a total of 200 articles about 4 restaurants on PIXNET, then grouped their contents by theme using a Latent Dirichlet Allocation (LDA) model. Sentences with similar themes are placed in the same group and then assigned to the three aspects mentioned earlier. In addition, to better distinguish whether the sentiment in a given food diary is positive or negative, our study calculated the polarity of common opinion words in food diaries using semantic orientation (SO-PMI) and built an opinion lexicon specifically for food diaries.
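As a rough illustration of the SO-PMI step, the sketch below scores candidate words by their pointwise mutual information with small positive and negative seed sets; the toy documents, English seed words, and smoothing constant are illustrative assumptions, not the study's Chinese corpus or lexicon.

```python
# A minimal SO-PMI sketch over document-level co-occurrence counts.
import math
from collections import Counter
from itertools import combinations

docs = [
    ["food", "delicious", "service", "friendly"],
    ["wait", "long", "food", "cold"],
    ["atmosphere", "cozy", "delicious", "dessert"],
]
pos_seeds, neg_seeds = {"delicious", "friendly"}, {"cold", "long"}

word_count = Counter(w for d in docs for w in set(d))
pair_count = Counter(frozenset(p) for d in docs for p in combinations(set(d), 2))
N = len(docs)

def pmi(w1, w2, eps=0.01):
    # PMI with a small smoothing constant so unseen pairs do not blow up.
    joint = pair_count[frozenset((w1, w2))] + eps
    return math.log2(joint * N / (word_count[w1] * word_count[w2] + eps))

def so_pmi(word):
    # Positive score: leans toward the positive seeds; negative: the opposite.
    return sum(pmi(word, p) for p in pos_seeds) - sum(pmi(word, n) for n in neg_seeds)

print({w: round(so_pmi(w), 2) for w in ["food", "dessert", "wait"]})
```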
In terms of results, using the restaurant rating website iPeen as a reference, the average sentiment levels obtained by our method are close to iPeen's ratings, and thus close to public opinion and reviews. Furthermore, compared with common rating websites, our study examines even fine-grained aspects and uses cumulative sentiment to reflect the blog authors' true evaluation of each restaurant. Lastly, we point out what we intend to discuss and improve in the future as a reference for upcoming research.
26 |
AppReco: Behavior-aware Recommendation for iOS Mobile Applications (AppReco: 基於行為識別的行動應用服務推薦系統). Fang, Zih Ruei, date unknown.
In today's society, mobile applications have been widely accepted and used. However, most existing mobile app recommendation systems rely mainly on users' actual usage and feedback; if malicious software steals user data behind the user interface, these recommendation systems can hardly detect such behavior. We therefore propose AppReco, a recommendation system that can systematically recommend iOS apps without requiring users to actually operate or execute the apps.
The analysis pipeline consists of three steps: (1) build a topic model with the unsupervised Latent Dirichlet Allocation (LDA) method, then cluster with a Growing Hierarchical Self-Organizing Map (GHSOM); (2) statically analyze the code to find the behaviors each application performs; and (3) score these apps with our scoring formula.
For clustering apps, AppReco uses the applications' official descriptions, so that mobile apps with similar characteristics are grouped together. For examining apps, AppReco statically analyzes each app's code to count how many such behaviors it uses. For recommending apps, AppReco analyzes apps with similar characteristics together with the behaviors they perform, and finally recommends to users the apps with fewer sensitive behaviors (such as using advertisements, accessing personal data, or using social media SDKs).
This study uses thousands of apps from the Apple App Store, covering the top two hundred apps in each category, as our experimental dataset. / Mobile applications have been widely used in daily life and have become the dominant software applications nowadays. However, there is a lack of systematic recommendation systems that can be leveraged in advance, without users' evaluations. We present AppReco, a systematic recommendation system for iOS mobile applications that can evaluate apps without executing them.
AppReco evaluates apps that have similar interests with static binary analysis, revealing their behaviors according to the embedded functions in the executable. The analysis consists of three stages: (1) unsupervised learning on app descriptions with Latent Dirichlet Allocation for topic discovery and Growing Hierarchical Self-organizing Maps for hierarchical clustering, (2) static binary analysis on executables to discover embedded system calls and (3) ranking common-topic applications from their matched behavior patterns.
To find apps that have similar interests, AppReco discovers (unsupervised) topics in official descriptions and clusters apps that have common topics as similar-interest apps. To evaluate apps, AppReco adopts static binary analysis on their executables to count invoked system calls and reveal embedded functions. To recommend apps, AppReco analyzes similar-interest apps together with the behaviors of their executables, and recommends to users the apps that have fewer sensitive behaviors, such as commercial advertisements, privacy-information access, and internet connections.
We report our analysis against thousands of iOS apps in the Apple app store including most of the listed top 200 applications in each category.
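As a minimal sketch of the final ranking step only, assuming hypothetical per-app counts of sensitive behaviors already extracted by static analysis; the behavior categories, weights, app names, and numbers below are illustrative, not AppReco's actual scoring formula or data.

```python
# Rank similar-interest apps by a weighted count of sensitive behaviors
# (lower score = fewer sensitive behaviors = recommended first).
behavior_counts = {
    "NoteTaker A": {"ads": 1, "privacy": 0, "network": 3},
    "NoteTaker B": {"ads": 6, "privacy": 2, "network": 9},
    "NoteTaker C": {"ads": 0, "privacy": 1, "network": 2},
}
weights = {"ads": 1.0, "privacy": 2.0, "network": 0.5}   # assumed penalty weights

def sensitivity_score(counts):
    """Weighted sum of sensitive-behavior occurrences; lower is better."""
    return sum(weights[k] * v for k, v in counts.items())

ranked = sorted(behavior_counts, key=lambda app: sensitivity_score(behavior_counts[app]))
for app in ranked:
    print(app, sensitivity_score(behavior_counts[app]))
```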
27 |
Deep generative models for natural language processing. Miao, Yishu, January 2017.
Deep generative models are essential to Natural Language Processing (NLP) due to their outstanding ability to use unlabelled data, to incorporate abundant linguistic features, and to learn interpretable dependencies among data. As the structure becomes deeper and more complex, having an effective and efficient inference method becomes increasingly important. In this thesis, neural variational inference is applied to carry out inference for deep generative models. While traditional variational methods derive an analytic approximation for the intractable distributions over latent variables, here we construct an inference network conditioned on the discrete text input to provide the variational distribution. The powerful neural networks are able to approximate complicated non-linear distributions and grant the possibilities for more interesting and complicated generative models. Therefore, we develop the potential of neural variational inference and apply it to a variety of models for NLP with continuous or discrete latent variables. This thesis is divided into three parts. Part I introduces a generic variational inference framework for generative and conditional models of text. For continuous or discrete latent variables, we apply a continuous reparameterisation trick or the REINFORCE algorithm to build low-variance gradient estimators. To further explore Bayesian non-parametrics in deep neural networks, we propose a family of neural networks that parameterise categorical distributions with continuous latent variables. Using the stick-breaking construction, an unbounded categorical distribution is incorporated into our deep generative models which can be optimised by stochastic gradient back-propagation with a continuous reparameterisation. Part II explores continuous latent variable models for NLP. Chapter 3 discusses the Neural Variational Document Model (NVDM): an unsupervised generative model of text which aims to extract a continuous semantic latent variable for each document. In Chapter 4, the neural topic models modify the neural document models by parameterising categorical distributions with continuous latent variables, where the topics are explicitly modelled by discrete latent variables. The models are further extended to neural unbounded topic models with the help of stick-breaking construction, and a truncation-free variational inference method is proposed based on a Recurrent Stick-breaking construction (RSB). Chapter 5 describes the Neural Answer Selection Model (NASM) for learning a latent stochastic attention mechanism to model the semantics of question-answer pairs and predict their relatedness. Part III discusses discrete latent variable models. Chapter 6 introduces latent sentence compression models. The Auto-encoding Sentence Compression Model (ASC), as a discrete variational auto-encoder, generates a sentence by a sequence of discrete latent variables representing explicit words. The Forced Attention Sentence Compression Model (FSC) incorporates a combined pointer network biased towards the usage of words from source sentence, which significantly improves the performance when jointly trained with the ASC model in a semi-supervised learning fashion. Chapter 7 describes the Latent Intention Dialogue Models (LIDM) that employ a discrete latent variable to learn underlying dialogue intentions.
Additionally, the latent intentions can be interpreted as actions guiding the generation of machine responses, which could be further refined autonomously by reinforcement learning. Finally, Chapter 8 summarizes our findings and directions for future work.
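To make the continuous reparameterisation trick concrete, here is a minimal sketch of an NVDM-style bag-of-words model in PyTorch; the layer sizes, vocabulary, and objective are illustrative assumptions, and the thesis's full constructions (unbounded topics, stick-breaking, REINFORCE for discrete variables) are deliberately omitted.

```python
# A minimal NVDM-style sketch: a Gaussian latent variable per document, sampled
# with the reparameterisation trick, trained by minimising the negative ELBO.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniNVDM(nn.Module):
    def __init__(self, vocab_size, latent_dim=50, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Linear(latent_dim, vocab_size)

    def forward(self, bow):
        h = self.enc(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation
        logits = self.dec(z)
        # Reconstruction: log-likelihood of the observed word counts under softmax.
        recon = -(bow * F.log_softmax(logits, dim=-1)).sum(-1)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
        return (recon + kl).mean()                                # negative ELBO

# Toy usage: a batch of 4 documents over a 2000-word vocabulary.
model = MiniNVDM(vocab_size=2000)
loss = model(torch.randint(0, 3, (4, 2000)).float())
loss.backward()
```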
28 |
Application of topic mining and tag clustering for tag recommendation (應用主題探勘與標籤聚合於標籤推薦之研究). Kao, Ting Kuei, date unknown.
Social tagging has been a popular way for users to interpret and share information since Web 2.0. As an alternative to traditional classification methods, its convenience and flexibility let users easily label content according to what it contains. It also has drawbacks, however: besides a considerable amount of untagged content, there are many vague and imprecise tags, which reduce the system's ability to organize and classify content by tag. To address these two problems, this study proposes an automated tag recommendation method that combines topic mining and tag clustering, aiming to build an automated tag recommendation rule, free of manual steps, that recommends suitable tags to users.
This study collected 2,500 popular Chinese articles with more than 5,000 views from PIXNET (痞客邦) blogs; after preprocessing, 1,939 of them were used to train the model and 400 served as the test corpus for validating the method. For topic mining, this study uses an LDA topic model to compute the topical semantics of different articles and associate them with existing tags, so that the topic of a new article can be predicted and topic-related tags recommended for it. In particular, perplexity, a measure of model performance, is used to help select the number of LDA topics, mitigating the problem that LDA otherwise requires a subjective manual choice of topic count. For tag clustering, this study uses hierarchical clustering to group tags that have co-occurred, in order to find tags with similar semantic concepts. The clustering stop condition is set to a minimum co-occurrence count of 1, which removes the need to pre-specify the number of clusters and lets the method find a suitable number of clusters automatically.
The experimental results show that recommending tags according to an article's topical semantics is feasible to a reasonable degree, and that the topic count selected with the help of perplexity gives more consistent results. Within the tag groups produced by hierarchical clustering, tags in the same group indeed share similar conceptual semantics. Finally, the combined topic mining and tag clustering method improves the Top-1 to Top-5 accuracy by an average of 14.1%, with Top-1 accuracy reaching 72.25%. This indicates that our approach, which starts from how articles are written and tagged, does help improve tag recommendation accuracy, and that this study has indeed built an automated tag recommendation rule that can suggest suitable tags, helping users label their articles more conveniently and accurately after writing. / Social tags are a popular way for users to interpret and share information, and as a substitute for traditional classification methods, their convenience and flexibility make them easy to use. But tagging also has disadvantages: in addition to a considerable amount of untagged content, there are many fuzzy and inaccurate tags. To solve these two problems, this study proposes a tag recommendation method that combines topic mining and tag clustering.
In this study, we collected a total of 2,500 articles from PIXNET as a corpus. In the topic mining part, we use an LDA model to calculate the topical semantics of different articles and associate them with existing tags, so that we can predict topics for new articles and recommend topic-related tags to them. The number of LDA topics is selected with the help of perplexity. In the tag clustering part, we use hierarchical clustering to group tags that have appeared together, in order to find tags with similar semantic concepts. The stop condition is set to a minimum co-occurrence count of 1, which solves the problem that the clustering method otherwise needs a pre-specified number of groups.
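A minimal sketch of the perplexity-guided choice of the LDA topic count is shown below, assuming gensim's LdaModel; the tiny English token lists and candidate range are illustrative, and in practice the score would be computed on a held-out split (such as the 400 test articles) rather than on the training corpus.

```python
# Choose the LDA topic count by comparing perplexity across candidate values.
from gensim import corpora, models

texts = [doc.split() for doc in [
    "great noodles friendly staff", "long wait cold soup",
    "cozy cafe sweet dessert", "noisy room slow service",
]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

best_k, best_perp = None, float("inf")
for k in range(2, 6):
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=k,
                          passes=10, random_state=0)
    perp = 2 ** (-lda.log_perplexity(corpus))   # lower perplexity is better
    if perp < best_perp:
        best_k, best_perp = k, perp
print(best_k, round(best_perp, 2))
```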
First, the topic mining results show that it is feasible to recommend tags according to an article's topical semantics, and the experiments show that the number of topics chosen according to perplexity outperforms other topic counts. Second, the tag clustering results show that tags in the same group do share similar conceptual semantics. Last, the experiments show that combining the two methods increases the Top-1 to Top-5 accuracy by an average of 14.1%, with a Top-1 accuracy of 72.25%, indicating that our tag recommendation method can recommend appropriate tags for users.
29 |
Topic Modeling and Spam Detection for Short Text Segments in Web Forums. Sun, Yingcheng, 28 January 2020.
No description available.
30 |
Automatic Identification of Duplicates in Literature in Multiple Languages. Klasson Svensson, Emil, January 2018.
As the number of books available online grows, these collections are growing larger at the same pace and increasingly span multiple languages. Many of these corpora contain duplicates in the form of various editions or translations of books. The task of finding these duplicates is usually done manually, but the growing sizes make it time consuming and demanding. This thesis set out to find a method in the field of Text Mining and Natural Language Processing that can automate the process of manually identifying these duplicates in a corpus, provided by Storytel, consisting mainly of fiction in multiple languages. The problem was approached using three different methods to compute distance measures between books. The first approach compared the titles of the books using the Levenshtein distance. The second approach extracted entities from each book using Named Entity Recognition, represented them using tf-idf, and computed distances with cosine dissimilarity. The third approach used a Polylingual Topic Model to estimate each book's distribution of topics and compared them using the Jensen-Shannon distance. In order to estimate the parameters of the Polylingual Topic Model, 8,000 books were translated from Swedish to English using Apache Joshua, a statistical machine translation system. For each method, every pair of books written by an author was tested using a hypothesis test where the null hypothesis was that the two books compared are not editions or translations of each other. Since there is no known distribution to assume as the null distribution for each book, a null distribution was estimated using distance measures to books not written by the author. The methods were evaluated on two different sets of manually labeled data created by the author of the thesis: one randomly sampled using one-stage cluster sampling, and one consisting of books from authors that the corpus provider, prior to the thesis, considered more difficult to label using automated techniques. Of the three methods, title matching performed best in terms of accuracy and precision on the sampled data. The entity matching approach had the lowest accuracy and precision, but an almost constant recall of around 50%. It was concluded that there seems to be a set of duplicates that are clearly distinguished from the estimated null distributions; with a higher significance level, better precision and accuracy could have been achieved with similar recall for this method. For topic matching, the results were worse than for title matching, and on inspection the estimated model was not able to create quality topics, due to multiple factors. It was concluded that further research is needed for the topic matching approach. None of the three methods was deemed a complete solution for automating the detection of book duplicates.
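As a hedged illustration of two of the distance measures described above, the sketch below uses difflib's similarity ratio as a rough stand-in for a normalised Levenshtein distance on titles, and SciPy's Jensen-Shannon distance for topic distributions; the example titles and three-topic vectors are made up, and the thesis's estimated null distributions and hypothesis test are not reproduced.

```python
# Two simple book-to-book distances: edit-based title distance and
# Jensen-Shannon distance between topic distributions.
from difflib import SequenceMatcher
import numpy as np
from scipy.spatial.distance import jensenshannon

def title_distance(a, b):
    """1 minus the similarity ratio, a rough stand-in for normalised Levenshtein distance."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

def topic_distance(p, q):
    """Jensen-Shannon distance between two books' topic distributions."""
    return jensenshannon(np.asarray(p), np.asarray(q))

print(title_distance("The Girl with the Dragon Tattoo", "Man som hatar kvinnor"))
print(title_distance("The Hobbit", "The Hobbit, or There and Back Again"))
print(topic_distance([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```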