About
The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
181

Learning and Relevance in Information Retrieval: A Study in the Application of Exploration and User Knowledge to Enhance Performance

Hyman, Harvey Stuart 01 January 2012 (has links)
This dissertation examines the impact of exploration and learning on eDiscovery information retrieval; it is written in three parts. Part I contains foundational concepts and background on the topics of information retrieval and eDiscovery. This part informs the reader about the research frameworks, methodologies, data collection, and instruments that guide this dissertation. Part II contains the foundation, development, and detailed findings of Study One, "The Relationship of Exploration with Knowledge Acquisition." This part of the dissertation reports on experiments designed to measure user exploration of a randomly selected subset of a corpus and its relationship with performance in the information retrieval (IR) result. The IR results are evaluated against a set of scales designed to measure behavioral IR factors and individual innovativeness. The findings reported in Study One suggest a new explanation for the relationship between recall and precision, and provide insight into behavioral measures that can be used to predict user IR performance. Part II also reports on a secondary set of experiments on a technique for filtering IR results by using "elimination terms." These experiments were designed to develop and evaluate the elimination-term method as a means to improve precision without loss of recall in the IR result. Part III contains the foundation and development of Study Three, "A New System for eDiscovery IR Based on Context Learning and Relevance." This section reports on a set of experiments performed on an IT artifact, Legal Intelligence®, developed during this dissertation. The artifact developed for Study Three uses a learning tool for context and relevance to improve the IR extraction process by allowing the user to adjust the IR search structure based on iterative document extraction samples. The artifact has been developed based on the needs of the business community of practitioners in the domain of eDiscovery; it was instantiated and tested during Study Three and produced significant results supporting its feasibility for use. Part III also contains conclusions and steps for future research extending beyond this dissertation.
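
The abstract does not spell out the mechanics of the elimination-term technique, but a minimal sketch of the general idea — retrieve broadly for recall, then discard hits containing analyst-chosen terms that signal irrelevance — might look like the following; the term lists and documents are hypothetical illustrations, not the dissertation's data.

```python
def retrieve(docs, query_terms):
    """Broad match on query terms: keeps recall high."""
    return [d for d in docs if any(t in d.lower() for t in query_terms)]

def apply_elimination_terms(hits, elimination_terms):
    """Drop hits containing any elimination term to raise precision."""
    return [d for d in hits if not any(t in d.lower() for t in elimination_terms)]

docs = [
    "Q3 revenue forecast discussed at the board meeting",
    "fantasy football forecast for the office pool",
    "revenue forecast revised after the internal audit",
]
hits = retrieve(docs, query_terms=["forecast"])
print(apply_elimination_terms(hits, elimination_terms=["football"]))
# the two business documents remain; the noise document is eliminated
```

Because the filter only removes documents an analyst has judged irrelevant, recall over the relevant set is preserved while precision rises — the trade-off the experiments evaluate.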
182

Evidence of Things Not Seen: A Semi-Automated Descriptive Phrase and Frame Analysis of Texts about the Herbicide Agent Orange

Hopton, Sarah Beth 01 January 2015 (has links)
From 1961 to 1971 the United States and the Republic of South Vietnam used chemicals to defoliate the coastal and upland forest areas of Viet Nam. The most notorious of these chemicals was named Agent Orange, a weaponized herbicide made up of two chemicals that, when combined, produced a toxic byproduct called TCDD-dioxin. Studies suggest that TCDD-dioxin causes significant human health problems in exposed American and Vietnamese veterans, and possibly their children (U.S. Environmental Protection Agency, 2011). In the years since the end of the Vietnam War, volumes of discourse about Agent Orange have been generated, much of which is now digitally archived and machine-readable, providing rich sites of study ideal for "big data" text mining, extraction, and computation. This study uses a combination of tools and text mining scripts developed in Python to study the descriptive phrases four discourse communities used across 45 years of discourse to talk about key issues in the debates over Agent Orange. Findings suggest these stakeholders describe and frame in significantly different ways, with Congress focused on taking action, the New York Times article and editorial corpus focused on controversy, and the Vietnamese News Agency focused on victimization. Findings also suggest that while new tools and methods make lighter work of mining large sets of corpora, a mixed-methods approach yields the most reliable insights. Though fully automated text analysis is still a distant reality, this method was designed to study potential effects of rhetoric on public policy and advocacy initiatives across large corpora of texts and spans of time.
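
As a rough illustration of the kind of semi-automated phrase analysis described (not the author's actual scripts), one might POS-tag sentences with NLTK and collect adjective-noun bigrams as candidate descriptive phrases; the sample text is a toy stand-in, and newer NLTK versions may require the `*_eng` resource names.

```python
import nltk
from collections import Counter

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def descriptive_phrases(text):
    """Count adjective+noun bigrams as a rough proxy for descriptive phrases."""
    phrases = Counter()
    for sent in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
            if t1.startswith("JJ") and t2.startswith("NN"):
                phrases[f"{w1.lower()} {w2.lower()}"] += 1
    return phrases

sample = ("The toxic herbicide devastated upland forests. "
          "Exposed veterans reported serious health problems.")
print(descriptive_phrases(sample).most_common(3))
```

Comparing such phrase counts across the four discourse communities' corpora is one way the framing differences the study reports could be surfaced.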
183

Exploring the Determinants of the Consistency between US Listed Companies' MD&A Disclosure and Financial Performance

李宸昕, Lee, Chen Hsin Unknown Date (has links)
This study analyzes the MD&A disclosures of US listed companies from 2004 to 2014 via text mining techniques, including the Loughran and McDonald positive/negative word lists and TF-IDF, and cross-compares each company's MD&A with its financial information using K-Means to build an index of the consistency between the two types of information. An empirical model is then developed for the consistency index with explanatory variables such as earnings volatility, company size, and company age, to identify what drives overstated or understated MD&A disclosure. The results show that company size, operating risk, analyst coverage, and company age are all significantly related to MD&A tone consistency, and three robustness checks yield similar results. The findings suggest that investors should not merely read the MD&A but consider whether it is overly optimistic and exaggerated or overly pessimistic, and adjust accordingly when making economic decisions.
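
A minimal sketch of the tone measurement underlying such a consistency index, assuming the Loughran-McDonald word lists are available as plain Python sets (the real lexicon is far larger, and the example text is a toy stand-in):

```python
import re

LM_POSITIVE = {"improve", "gain", "strong", "achieve"}    # toy subset
LM_NEGATIVE = {"loss", "decline", "impairment", "adverse"}

def tone(mdna_text):
    """Net tone in [-1, 1]: (pos - neg) / (pos + neg) over lexicon hits."""
    words = re.findall(r"[a-z]+", mdna_text.lower())
    pos = sum(w in LM_POSITIVE for w in words)
    neg = sum(w in LM_NEGATIVE for w in words)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0

mdna = "Operating results continued to improve, reflecting strong demand."
print(tone(mdna))  # 1.0 here: uniformly optimistic wording in this toy text
```

The sign and magnitude of such a tone score can then be set against realized financial performance to judge whether a disclosure reads more optimistic or pessimistic than the numbers warrant.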
184

Exploring Financial Risk via Text Mining Approaches

劉澤 Unknown Date (has links)
In recent years, studies have applied machine learning to predict stock price movements and investment risk from stock prices, the textual content of financial reports, financial news, or even Twitter posts. This thesis focuses on the textual information in financial reports and uses it to predict the financial risk of listed companies in the following year, with stock return volatility as the measure of financial risk. In the first part of the thesis, a finance-specific sentiment lexicon is used to improve prediction models trained only on the text of financial reports; sentiment analysis research indicates that sentiment words convey opinions and reactions to events more efficiently, reducing noise in the textual features and improving prediction accuracy, and a sentiment analysis of the results is provided. In the second part, the thesis combines textual information with numerical information such as stock returns to further improve the prediction models: each company's financial report text is weighted by the company's stock returns using cost-sensitive learning techniques built on support vector machines. Experimental results show that models restricted to the financial sentiment lexicon perform comparably to those trained on the original texts, confirming the importance of financial sentiment words for risk prediction, and the learned models suggest strong correlations between financial sentiment words and company risk. Moreover, the cost-sensitive results significantly improve on the cost-insensitive ones. These findings identify the impact of sentiment words in financial reports and show that numerical information can be utilized as the cost weights of learning techniques.
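
A hedged sketch of the cost-sensitive idea in the second part: restrict TF-IDF features to a financial sentiment vocabulary and weight each training report by a simple function of the firm's stock return. The reports, returns, volatilities, and weighting rule below are toy assumptions, not the thesis's actual setup.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVR

sentiment_vocab = ["loss", "litigation", "growth", "impairment", "strong"]
reports = [
    "strong growth across segments",
    "impairment charges and litigation risk",
    "loss widened amid litigation",
]
returns = np.array([0.12, -0.08, -0.20])    # annual stock returns (toy)
volatility = np.array([0.18, 0.35, 0.42])   # next-year realized volatility (toy)

# TF-IDF features restricted to the sentiment lexicon.
X = TfidfVectorizer(vocabulary=sentiment_vocab).fit_transform(reports)

# Cost-sensitive learning: weight each instance by its return magnitude.
weights = 1.0 + np.abs(returns)
model = SVR(kernel="linear").fit(X, volatility, sample_weight=weights)
print(model.predict(X))
```

Passing the return-derived weights as `sample_weight` is one standard way to make an SVM-family learner cost-sensitive without changing its loss function by hand.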
185

Statistical Text Analysis for Social Science

O'Connor, Brendan T. 01 August 2014 (has links)
What can text corpora tell us about society? How can automatic text analysis algorithms efficiently and reliably analyze the social processes revealed in language production? This work develops statistical text analyses of dynamic social and news media datasets to extract indicators of underlying social phenomena, and to reveal how social factors guide linguistic production. This is illustrated through three case studies: first, examining whether sentiment expressed in social media can track opinion polls on economic and political topics (Chapter 3); second, analyzing how novel online slang terms can be very specific to geographic and demographic communities, and how these social factors affect their transmission over time (Chapters 4 and 5); and third, automatically extracting political events from news articles, to assist analyses of the interactions of international actors over time (Chapter 6). We demonstrate a variety of computational, linguistic, and statistical tools that are employed for these analyses, and also contribute MiTextExplorer, an interactive system for exploratory analysis of text data against document covariates, whose design was informed by the experience of researching these and other similar works (Chapter 2). These case studies illustrate recurring themes toward developing text analysis as a social science methodology: computational and statistical complexity, and domain knowledge and linguistic assumptions.
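
For the first case study, a minimal sketch of the sentiment-versus-polls idea — a smoothed daily ratio of positive to negative messages correlated against a poll series at several lags — could look like the following; all series are synthetic toys, not the thesis's data.

```python
import numpy as np

pos_counts = np.array([120, 135, 150, 160, 155, 170, 180], dtype=float)
neg_counts = np.array([100,  98,  95,  90,  92,  85,  80], dtype=float)
polls      = np.array([48.0, 48.5, 49.0, 49.6, 49.4, 50.1, 50.8])

ratio = pos_counts / neg_counts            # daily sentiment score
k = 3                                      # moving-average window in days
smooth = np.convolve(ratio, np.ones(k) / k, mode="valid")

for lag in range(3):                       # sentiment leading polls by `lag` days
    a = smooth if lag == 0 else smooth[:-lag]
    b = polls[k - 1 + lag:]
    print(f"lag {lag}: r = {np.corrcoef(a, b)[0, 1]:.2f}")
```

Scanning correlations across lags is what lets such an analysis ask not only whether text tracks the polls but whether it leads them.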
186

Understanding the Form and Function of Neuronal Physiological Diversity

Tripathy, Shreejoy J. 31 October 2013 (has links)
For decades electrophysiologists have recorded and characterized the biophysical properties of a rich diversity of neuron types. This diversity of neuron types is critical for generating functionally important patterns of brain activity and implementing neural computations. In this thesis, I developed computational methods towards quantifying neuron diversity and applied these methods for understanding the functional implications of within-type neuron variability and across-type neuron diversity. First, I developed a means for defining the functional role of differences among neurons of the same type. Namely, I adapted statistical neuron models, termed generalized linear models, to precisely capture how the membranes of individual olfactory bulb mitral cells transform afferent stimuli to spiking responses. I then used computational simulations to construct virtual populations of biophysically variable mitral cells to study the functional implications of within-type neuron variability. I demonstrate that an intermediate amount of intrinsic variability enhances coding of noisy afferent stimuli by groups of biophysically variable mitral cells. These results suggest that within-type neuron variability, long considered to be a disadvantageous consequence of biological imprecision, may serve a functional role in the brain. Second, I developed a methodology for quantifying the rich electrophysiological diversity across the majority of the neuron types throughout the mammalian brain. Using semi-automated text-mining, I built a database, NeuroElectro, of neuron type specific biophysical properties extracted from the primary research literature. This data is available at http://neuroelectro.org, which provides a publicly accessible interface where this information can be viewed. Though the extracted physiological data is highly variable across studies, I demonstrate that knowledge of article-specific experimental conditions can significantly explain the observed variance. By applying simple analyses to the dataset, I find that there exist 5-7 major neuron super-classes which segregate on the basis of known functional roles. Moreover, by integrating the NeuroElectro dataset with brain-wide gene expression data from the Allen Brain Atlas, I show that biophysically-based neuron classes correlate highly with patterns of gene expression among voltage-gated ion channels and neurotransmitters. Furthermore, this work lays the conceptual and methodological foundations for substantially enhanced data sharing in neurophysiological investigations in the future.
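
A hedged sketch of the first method, under strong simplifying assumptions: binned spike counts modeled as Poisson with a rate driven by a short stimulus-history filter, fit here with scikit-learn's PoissonRegressor on simulated data rather than real mitral-cell recordings.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
T, lags = 2000, 5
stimulus = rng.normal(size=T)

# Design matrix: each row holds the last `lags` stimulus samples.
X = np.stack([stimulus[i - lags:i] for i in range(lags, T)])
true_filter = np.array([0.1, 0.2, 0.5, 0.3, 0.1])   # hypothetical filter
rate = np.exp(X @ true_filter - 1.0)                # exponential nonlinearity
spikes = rng.poisson(rate)                          # binned spike counts

glm = PoissonRegressor(alpha=1e-3).fit(X, spikes)
print(np.round(glm.coef_, 2))  # should roughly recover true_filter
```

Fitting one such GLM per recorded cell and then resampling the fitted parameters is one way virtual populations of "biophysically variable" model neurons, as described above, can be constructed.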
187

Language Engineering for Information Extraction

Schierle, Martin 10 January 2012 (has links) (PDF)
Accompanied by the cultural development into an information society and knowledge economy, and driven by the rapid growth of the World Wide Web and decreasing prices for technology and disk space, the world's knowledge is evolving fast, and humans are challenged with keeping up. Despite all efforts at data structuring, a large part of this human knowledge is still hidden behind the ambiguities and fuzziness of natural language. Domain language in particular poses new challenges through its specific syntax, terminology, and morphology. Companies willing to exploit the information contained in such corpora are often required to build specialized systems instead of being able to rely on off-the-shelf software libraries and data resources. The engineering of language processing systems is, however, cumbersome, and the creation of language resources, annotation of training data, and composition of modules is often more an art than a science. The scientific field of Language Engineering aims at providing reliable information, approaches, and guidelines for how to design, implement, test, and evaluate language processing systems. Language engineering architectures have been a subject of scientific work for the last two decades and aim at building universal systems of easily reusable components. Although current systems offer comprehensive features and rest on an architecturally sound basis, there is still little documentation about how to actually build an information extraction application. Selecting modules, methods, and resources for a distinct use case requires a detailed understanding of state-of-the-art technology, application demands, and characteristics of the input text. The main assumption underlying this work is the thesis that a new application can only occasionally be created by reusing standard components from different repositories. This work recapitulates existing literature about language resources, processing resources, and language engineering architectures to derive a theory of how to engineer a new system for information extraction from a (domain) corpus. This thesis was initiated by the Daimler AG to prepare and analyze unstructured information as a basis for corporate quality analysis. It is therefore concerned with language engineering in the area of Information Extraction, which targets the detection and extraction of specific facts from textual data. While other work in the field of information extraction is mainly concerned with the extraction of location or person names, this work deals with automotive components, failure symptoms, corrective measures, and their relations of arbitrary arity. The ideas presented in this work are applied, evaluated, and demonstrated on a real-world application dealing with quality analysis of automotive domain language. To achieve this goal, the underlying corpus is examined and scientifically characterized, algorithms are chosen with respect to the derived requirements and evaluated where necessary. The system comprises language identification, tokenization, spelling correction, part-of-speech tagging, syntax parsing, and a final relation extraction step. The extracted information is used as input to data mining methods such as an early-warning system and a graph-based visualization for interactive root cause analysis. It is finally investigated how the unstructured data facilitates those quality analysis methods in comparison to structured data. The acceptance of these text-based methods in the company's processes further proves the usefulness of the created information extraction system.
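
The thesis describes a multi-stage pipeline ending in relation extraction. A deliberately simplified sketch of that final step follows, with hypothetical gazetteers of components and symptoms standing in for trained models, and clause-level co-occurrence standing in for the syntax-based extraction the thesis actually develops.

```python
import re

COMPONENTS = {"fuel pump", "brake line", "alternator"}   # hypothetical gazetteer
SYMPTOMS = {"leaking", "noise", "failure", "corrosion"}  # hypothetical gazetteer

def extract_relations(text):
    """Pair components with symptoms that co-occur in the same clause."""
    relations = []
    for clause in re.split(r"[.;]", text.lower()):
        comps = [c for c in COMPONENTS if c in clause]
        symps = [s for s in SYMPTOMS if s in clause]
        relations += [(c, s) for c in comps for s in symps]
    return relations

report = "Customer reports noise from the alternator; fuel pump leaking."
print(extract_relations(report))
# [('alternator', 'noise'), ('fuel pump', 'leaking')]
```

In the full system, upstream stages (language identification, spelling correction, POS tagging, parsing) normalize the noisy workshop language before any such pairing is attempted, which is what makes the extraction reliable on real repair texts.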
188

Analyzing Smart Handheld Device App Clusters Using kNN Text Mining

曾國傑, Tseng, Kuo Chieh Unknown Date (has links)
As smart handheld devices become increasingly popular, user demand for Apps keeps growing, and enterprises have opened a new channel of interactive marketing around them. The commercial opportunity of App downloads has also drawn many developers into App development, causing explosive growth in the number of Apps and leaving users unable to choose efficiently among so many options. This study therefore applies text mining and kNN cluster analysis to App recommendation articles posted by netizens in order to group Apps, tuning the parameters to obtain the best cluster quality as a reference for users selecting Apps. To address this information overload, 439 recommendation articles on games from the App Store were collected and consolidated into 357 articles by merging those recommending the same Apps. The articles were converted via text mining into a vector space model and clustered with kNN analysis, adjusting k and the document-similarity threshold to optimize quality, which was measured with indicators such as mean intra-cluster similarity; to further improve quality, a multi-stage procedure re-clusters or merges clusters according to the number of articles in each. The results show that the first stage performs best with k = 10 and a similarity threshold of 0.025; in later stages, as clusters shrink, k is lowered and the threshold gradually raised. After the second stage, keyword extraction on clusters meeting the stopping condition yields six App types such as "baseball/shooting" and "throwing/flight," and the same rules produce fourteen more types, such as "tower defense," in subsequent stages. In total, 36 clusters and 20 App types were obtained. During clustering, mean intra-cluster similarity rose while mean inter-cluster similarity fell, and the cluster-quality indicator improved from 12.65% after the first stage to 75.81% at the end of the fifth. Similar Apps thus gradually gather into clusters whose names can guide users' App choices, while developers can learn from each cluster's keywords which game elements users value and tailor their Apps' content accordingly. Building a specialized App term lexicon to improve cluster quality, applying document summarization to help users understand each cluster, and building an App recommendation system on these results are directions for future research.
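
A hedged sketch of kNN-style clustering on TF-IDF vectors in the spirit of the study: link each article to its k most similar articles when cosine similarity clears a threshold, then read clusters off the resulting graph as connected components. The reviews and the small k/threshold values are toy stand-ins (the thesis's first stage used k = 10 and a 0.025 threshold).

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reviews = [
    "fun baseball game with great pitching",
    "realistic baseball batting and pitching",
    "tower defense strategy with many levels",
    "defend the castle in this tower defense title",
]
X = TfidfVectorizer().fit_transform(reviews)
sim = cosine_similarity(X)
np.fill_diagonal(sim, 0.0)               # ignore self-similarity

k, threshold = 2, 0.15
adj = np.zeros_like(sim)
for i, row in enumerate(sim):
    for j in row.argsort()[-k:]:         # k most similar articles
        if row[j] >= threshold:
            adj[i, j] = adj[j, i] = 1.0  # undirected similarity link

n_clusters, labels = connected_components(csr_matrix(adj), directed=False)
print(n_clusters, labels)                # expect baseball vs tower-defense
```

Raising the threshold or lowering k in later passes, as the study does, splits loose clusters apart, which is what drives the intra-cluster similarity up across stages.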
189

Vom WWW zur Kollokation: praxisorientiertes Verfahren zur Kollokations- und Terminologieakquisition für Übersetzer und Dolmetscher / From the WWW to Collocations: A Practice-Oriented Method for Collocation and Terminology Acquisition for Translators and Interpreters

Dörr, Simone January 2005 (has links)
Diploma thesis, University of Heidelberg, 2005; title taken from the accompanying insert.
190

Text Mining im Customer Relationship Management / Text Mining in Customer Relationship Management

Rentzmann, René. January 2008 (has links)
Doctoral dissertation, Catholic University of Eichstätt-Ingolstadt, 2007.
