201 |
Investigations of Term Expansion on Text Mining Techniques
Yang, Chin-Sheng, 02 August 2002
Recent advances in computer and network technologies have contributed significantly to global connectivity and have made the volume of online textual documents grow extremely rapidly. The rapid accumulation of textual documents on the Web or within an organization requires effective document management techniques, ranging from information retrieval and information filtering to text mining. Word mismatch represents a challenging issue for document management research. It has been extensively investigated in information retrieval (IR) research through term expansion (specifically, query expansion), but a review of the text mining literature suggests that the problem has seldom been addressed by text mining techniques. This thesis therefore investigates the use of term expansion in three text mining techniques: text categorization, document clustering, and event detection. Accordingly, we developed term expansion extensions to these three techniques. The empirical evaluation showed that term expansion increased categorization effectiveness when correlation-coefficient feature selection was employed. With respect to document clustering, techniques extended with term expansion achieved clustering effectiveness comparable to existing techniques and were superior on the clustering specificity measure. Finally, term expansion degraded event detection effectiveness compared with the traditional event detection technique.
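As a rough illustration of the idea (the abstract does not spell out the expansion procedure, so the co-occurrence scheme, weights, and toy corpus below are invented), term expansion can be sketched in R as follows:

library(tm)

docs <- c("stock market prices fall sharply",
          "share prices decline in the stock market",
          "heavy rain expected over the weekend")
corpus <- VCorpus(VectorSource(docs))
dtm <- as.matrix(DocumentTermMatrix(corpus))

# term-term co-occurrence counts; large off-diagonal entries link related terms
cooc <- t(dtm) %*% dtm
diag(cooc) <- 0

# add the strongest co-occurring neighbor of each document term, down-weighted
expand <- function(doc_vec, cooc, top = 1, weight = 0.5) {
  for (term in names(doc_vec)[doc_vec > 0]) {
    related <- names(sort(cooc[term, ], decreasing = TRUE))[seq_len(top)]
    doc_vec[related] <- doc_vec[related] + weight
  }
  doc_vec
}

expanded <- expand(dtm[1, ], cooc)
expanded[expanded > 0]   # original terms plus their expansion terms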
|
202 |
Graph Similarity, Parallel Texts, and Automatic Bilingual Lexicon Acquisition
Törnfeldt, Tobias, January 2008
In this master's thesis we present a graph-theoretical method for automatic bilingual lexicon acquisition from parallel texts. We analyze the concept of graph similarity and give an interpretation of the parallel texts connected to the vector space model. We represent the parallel texts by a directed, tripartite graph and use the corresponding adjacency matrix A to compute the similarity of the graph. By solving the eigenvalue problem ρS = A S A^T + A^T S A we obtain the self-similarity matrix S and the Perron root ρ. A rank-k approximation of the self-similarity matrix is computed with implementations of the singular value decomposition and the non-negative matrix factorization algorithm GD-CLS. We construct an algorithm to extract the bilingual lexicon from the self-similarity matrix and apply a statistical model to estimate the precision (correctness) of the translations in the bilingual lexicon. The best result, achieved with an application of the vector space model, is a precision of about 80%, which compares well with the roughly 60% precision reported in the literature.
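The sketch below illustrates, under simplifying assumptions, how such a self-similarity matrix can be computed in R: a small random adjacency matrix stands in for the tripartite text graph, the normalized fixed-point iteration approximates S with the normalizing constant approximating ρ, and the rank-k step uses the SVD (the GD-CLS alternative is omitted):

set.seed(1)
n <- 6
A <- matrix(rbinom(n * n, 1, 0.3), n, n)   # toy adjacency matrix

S <- matrix(1, n, n)                       # initial similarity guess
for (i in 1:100) {
  S_new <- A %*% S %*% t(A) + t(A) %*% S %*% A
  rho   <- norm(S_new, type = "F")         # normalizing constant, estimates the Perron root
  S     <- S_new / rho
}

# rank-k approximation of the self-similarity matrix via the SVD
k   <- 2
sv  <- svd(S)
S_k <- sv$u[, 1:k] %*% diag(sv$d[1:k]) %*% t(sv$v[, 1:k])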
|
203 |
News Classification as a Component in WEBIS
Krellner, Björn, 29 September 2006
This diploma thesis describes the further development of a prototype for news classification and its integration into the existing web-oriented information system WEBIS.
Classifications performed with the resulting software are presented and compared with previous findings.
|
204 |
Use of Text Mining to Forecast Short-Term Trends in Stock Prices after the Publication of Company News
Mittermayer, Marc-André, January 2006
Doctoral dissertation, University of Bern, 2005.
|
205 |
A text mining framework in R and its applications
Feinerer, Ingo, 08 1900
Text mining has become an established discipline both in research and in business intelligence. However, many existing text mining toolkits lack easy extensibility and provide only poor support for interacting with statistical computing environments. We therefore propose a text mining framework for the statistical computing environment R which provides intelligent methods for corpora handling, metadata management, preprocessing, operations on documents, and data export. We present how well-established text mining techniques can be applied in our framework and show how common text mining tasks can be performed using our infrastructure. The second part of this thesis is dedicated to a set of realistic applications built on our framework. The first application implements a sophisticated mailing list analysis, whereas the second identifies the potential of text mining methods for business-to-consumer electronic commerce. The third application shows the benefits of text mining for legal documents. Finally, we present an application dealing with authorship attribution on the famous Wizard of Oz book series. (author's abstract)
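A minimal sketch of the kind of pipeline the framework provides, using the tm API as released (the exact calls are an assumption for illustration, not code taken from the thesis):

library(tm)

# build a corpus from plain character vectors
docs <- c("Text mining has become an established discipline.",
          "The tm package provides text mining infrastructure for R.")
corpus <- VCorpus(VectorSource(docs))

# standard preprocessing operations on documents
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# export to a document-term matrix for downstream statistical analysis
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)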
|
206 |
Latent Dirichlet Allocation in R
Ponweiser, Martin, 05 1900
Topic models are a young research field within the computer science areas of information retrieval and text mining. They are generative probabilistic models of text corpora inferred by machine learning, and they can be used for retrieval and text mining tasks. The most prominent topic model is latent Dirichlet allocation (LDA), which was introduced in 2003 by Blei et al. and has since sparked off the development of other topic models for domain-specific purposes.
This thesis focuses on LDA's practical application. Its main goal is the replication of the data analyses from the 2004 LDA paper "Finding scientific topics" by Thomas Griffiths and Mark Steyvers within the framework of the R statistical programming language and the R package topicmodels by Bettina Grün and Kurt Hornik. The complete process, from extraction of a text corpus from the PNAS journal's website through data preprocessing, transformation into a document-term matrix, model selection, and model estimation to presentation of the results, is fully documented and commented. The outcome closely matches the analyses of the original paper, so the research by Griffiths and Steyvers can be reproduced. Furthermore, this thesis demonstrates the suitability of the R environment for text mining with LDA. (author's abstract) / Series: Theses / Institute for Statistics and Mathematics
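As a hedged illustration of this workflow, fitting LDA with the topicmodels package might look like the sketch below; it uses tm's bundled crude corpus rather than the PNAS data, and the number of topics, iteration count, and seed are arbitrary choices:

library(tm)
library(topicmodels)

data("crude")   # small Reuters corpus shipped with tm, standing in for PNAS
dtm <- DocumentTermMatrix(crude, control = list(removePunctuation = TRUE,
                                                stopwords = TRUE))

# fit LDA by Gibbs sampling, as in Griffiths and Steyvers
lda <- LDA(dtm, k = 5, method = "Gibbs",
           control = list(seed = 1, iter = 1000))

terms(lda, 5)     # top 5 terms per topic
topics(lda)[1:5]  # most likely topic for the first 5 documents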
|
207 |
Learning and Relevance in Information Retrieval: A Study in the Application of Exploration and User Knowledge to Enhance Performance
Hyman, Harvey Stuart, 01 January 2012
This dissertation examines the impact of exploration and learning upon eDiscovery information retrieval; it is written in three parts. Part I contains foundational concepts and background on the topics of information retrieval and eDiscovery. This part informs the reader about the research frameworks, methodologies, data collection, and instruments that guide this dissertation.
Part II contains the foundation, development and detailed findings of Study One, "The Relationship of Exploration with Knowledge Acquisition." This part of the dissertation reports on experiments designed to measure user exploration of a randomly selected subset of a corpus and its relationship with performance in the information retrieval (IR) result. The IR results are evaluated against a set of scales designed to measure behavioral IR factors and individual innovativeness. The findings reported in Study One suggest a new explanation for the relationship between recall and precision, and provide insight into behavioral measures that can be used to predict user IR performance.
Part II also reports on a secondary set of experiments performed on a technique for filtering IR results by using "elimination terms." These experiments have been designed to develop and evaluate the elimination term method as a means to improve precision without loss of recall in the IR result.
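A hypothetical sketch of the elimination-term idea in R (the documents and term lists are invented; the dissertation's actual procedure may differ): documents retrieved for a query are discarded when they contain a term the user has marked as an eliminator, so precision can rise while relevant documents, which avoid those terms, remain in the result.

retrieved <- c(doc1 = "merger agreement between the two firms",
               doc2 = "football match results and league tables",
               doc3 = "quarterly earnings and merger talks")

elimination_terms <- c("football", "league")

# keep only documents containing none of the elimination terms
keep <- !vapply(retrieved, function(d) {
  any(vapply(elimination_terms, grepl, logical(1), x = d, fixed = TRUE))
}, logical(1))

retrieved[keep]   # doc2 is filtered out; doc1 and doc3 are retained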
Part III contains the foundation and development of Study Three, "A New System for eDiscovery IR Based on Context Learning and Relevance." This section reports on a set of experiments performed on an IT artifact, Legal Intelligence®, developed during this dissertation.
The artifact developed for Study Three uses a learning tool for context and relevance to improve the IR extraction process by allowing the user to adjust the IR search structure based on iterative document extraction samples. The artifact has been developed based on the needs of the business community of practitioners in the domain of eDiscovery; it has been instantiated and tested during Study Three and has produced significant results supporting its feasibility for use. Part III contains conclusions and steps for future research extending beyond this dissertation.
|
208 |
Evidence of Things Not Seen: A Semi-Automated Descriptive Phrase and Frame Analysis of Texts about the Herbicide Agent Orange
Hopton, Sarah Beth, 01 January 2015
From 1961 to 1971 the United States and the Republic of South Vietnam used chemicals to defoliate the coastal and upland forest areas of Viet Nam. The most notorious of these chemicals was Agent Orange, a weaponized herbicide made up of two chemicals that, when combined, produced a toxic byproduct called TCDD-dioxin. Studies suggest that TCDD-dioxin causes significant human health problems in exposed American and Vietnamese veterans, and possibly their children (U.S. Environmental Protection Agency, 2011). In the years since the end of the Vietnam War, volumes of discourse about Agent Orange have been generated, much of it now digitally archived and machine-readable, providing rich sites of study ideal for “big data” text mining, extraction, and computation. This study uses a combination of tools and text mining scripts developed in Python to study the descriptive phrases four discourse communities used across 45 years of discourse to talk about key issues in the debates over Agent Orange. Findings suggest these stakeholders describe and frame the issues in significantly different ways: Congress focused on taking action, the New York Times article and editorial corpus on controversy, and the Vietnamese News Agency on victimization. Findings also suggest that while new tools and methods make lighter work of mining large corpora, a mixed-methods approach yields the most reliable insights. Though fully automated text analysis is still a distant reality, this method was designed to study potential effects of rhetoric on public policy and advocacy initiatives across large corpora of texts and spans of time.
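A toy sketch of this kind of frame-frequency comparison (the thesis used Python scripts; the subcorpora and frame terms below are invented for illustration):

corpora <- list(
  congress = "we must act and take action on compensation",
  nyt      = "the controversy over the evidence continues",
  vna      = "victims of the spraying still suffer"
)
frames <- c("action", "controversy", "victims")

# relative frequency of each frame term in each community's subcorpus
freq <- sapply(corpora, function(txt) {
  words <- strsplit(tolower(txt), "\\W+")[[1]]
  sapply(frames, function(f) sum(words == f) / length(words))
})
round(freq, 3)   # rows = frame terms, columns = discourse communities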
|
209 |
探討美國上市公司MD&A揭露與財務表現一致性之決定因素 / Explore the Determinants of the Consistency between US Listed Companies’ MD&A Disclosure and Financial Performance
Lee, Chen-Hsin (李宸昕), Unknown Date
This study analyzes the MD&A disclosures of US listed companies from 2004 to 2014 via text mining techniques such as the Loughran and McDonald positive/negative word lists and TF-IDF weighting. A company's MD&A tone is then cross-compared with its financial information using K-Means clustering to establish an index that captures the consistency between the two types of information. Finally, an empirical model for the consistency index is developed with explanatory variables such as earnings volatility, company scale, and company age.
According to the empirical results, company scale, company operating risk, analyst coverage, and company age are significantly related to MD&A tone consistency. Three robustness checks demonstrate similar results. The findings suggest to investors an additional way of using the MD&A beyond merely reading it: when relying on MD&A disclosures, information users should consider whether the disclosure is overly optimistic and overstated or overly pessimistic, and adjust their economic decisions accordingly.
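A hedged sketch of the consistency idea: score MD&A tone against a small positive/negative word list (standing in for the Loughran and McDonald lexicon) and cluster tone together with realized performance using k-means; the word lists, texts, and performance numbers are all invented.

positive <- c("growth", "improve", "strong")
negative <- c("decline", "loss", "weak")

# net tone: (positive hits - negative hits) / document length
tone <- function(text) {
  words <- strsplit(tolower(text), "\\W+")[[1]]
  (sum(words %in% positive) - sum(words %in% negative)) / length(words)
}

mdna <- c("strong growth and improved margins",
          "loss and weak demand persist",
          "strong growth despite decline in one segment")
perf <- c(0.12, -0.08, -0.05)   # invented performance measure, e.g. ROA

feat <- scale(cbind(tone = sapply(mdna, tone, USE.NAMES = FALSE), perf = perf))
km   <- kmeans(feat, centers = 2, nstart = 10)
km$cluster   # in this toy setup, tone/performance mismatches separate out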
|
210 |
以文字探勘為基礎之財務風險分析方法研究 / Exploring Financial Risk via Text Mining Approaches
劉澤, Unknown Date
In recent years, some studies have used machine learning techniques to predict stock tendency and investment risk in finance. There have also been applications that analyze the textual information in financial reports, financial news, or even tweets on social networks to provide useful information for stock investors. In this thesis we focus on using the textual information in companies' financial reports, together with numerical information, to predict financial risk: we use the textual information in a company's financial report to predict its risk in the following year, measuring financial risk by stock return volatility. In the first part of the thesis, we use a finance-specific sentiment lexicon to improve prediction models trained only on the textual information of financial reports, and provide a sentiment analysis of the results. In the second part, we combine the textual information with numerical information, such as stock returns, to further improve the prediction models: each company instance, together with its textual information, is weighted by its stock return using cost-sensitive learning techniques applied to support vector machines. Our experimental results show that models built on the finance-specific sentiment lexicon perform comparably to those built on the original texts, which confirms the importance of financial sentiment words for risk prediction. More importantly, the learned models suggest strong correlations between financial sentiment words and company risk, and our cost-sensitive results significantly improve on the cost-insensitive ones. These findings identify the impact of sentiment words in financial reports and show that numerical information can be used as cost weights in learning techniques.
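A hedged sketch of the cost-weighting idea with invented data: the thesis weights company instances inside support vector machines, while the sketch below swaps in a simpler weighted linear model that predicts log volatility from sentiment word frequencies, weighting each firm by the magnitude of its stock return.

set.seed(1)
n <- 50
neg_freq <- runif(n, 0, 0.05)   # frequency of negative financial sentiment words
pos_freq <- runif(n, 0, 0.05)   # frequency of positive financial sentiment words
log_vol  <- 1.5 * neg_freq - 0.8 * pos_freq + rnorm(n, sd = 0.01)
returns  <- rnorm(n, sd = 0.2)  # stock returns used as per-firm cost weights

# weighted least squares: firms with larger absolute returns count more
fit <- lm(log_vol ~ neg_freq + pos_freq, weights = abs(returns))
summary(fit)$coefficients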
|