  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
171

Classifying textual fast food restaurant reviews quantitatively using text mining and supervised machine learning algorithms

Wright, Lindsey 01 May 2018 (has links)
Companies continually seek to improve their business model through feedback and customer satisfaction surveys. Social media provides additional opportunities for this advanced exploration into the mind of the customer. By extracting customer feedback from social media platforms, companies may increase the sample size of their feedback and remove bias often found in questionnaires, resulting in better informed decision making. However, relying on personnel to analyze the thousands of relevant social media posts is financially expensive and time consuming. Thus, our study aims to establish a method to extract business intelligence from social media content by structuring opinionated textual data using text mining and classifying these reviews by the degree of customer satisfaction. By quantifying textual reviews, companies may perform statistical analysis to extract insight from the data as well as effectively address concerns. Specifically, we analyze a subset of 56,000 Yelp reviews of fast food restaurants and attempt to predict a quantitative value reflecting the overall opinion of each review. We compare the use of two different predictive modeling techniques, bagged Decision Trees and Random Forest Classifiers. In order to simplify the problem, we train our model to accurately classify strongly negative and strongly positive (1-star and 5-star) reviews. In addition, we identify drivers behind strongly positive or negative reviews, allowing businesses to understand their strengths and weaknesses. This approach gives companies an efficient and cost-effective way to process and understand customer satisfaction as it is discussed on social media.
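A minimal sketch of the 1-star vs. 5-star classification setup described above, using Python and scikit-learn; the file name and the "text"/"stars" column names are illustrative assumptions rather than the study's actual data layout, and the two models mirror the bagged decision trees vs. random forest comparison.

```python
import pandas as pd
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# hypothetical extract of fast food reviews with "text" and "stars" fields
reviews = pd.read_json("yelp_fast_food_reviews.json", lines=True)
reviews = reviews[reviews["stars"].isin([1, 5])]          # keep only strong opinions

X = TfidfVectorizer(stop_words="english", max_features=20000).fit_transform(reviews["text"])
y = (reviews["stars"] == 5).astype(int)                   # 1 = strongly positive
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```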
172

Modeling words for online sexual behavior surveillance and clinical text information extraction

Fries, Jason Alan 01 July 2015 (has links)
How do we model the meaning of words? In domains like information retrieval, words have classically been modeled as discrete entities using 1-of-n encoding, a representation that elides most of a word's syntactic and semantic structure. Recent research, however, has begun exploring more robust representations called word embeddings. Embeddings model words as a parameterized function mapping into an n-dimensional continuous space and implicitly encode a number of interesting semantic and syntactic properties. This dissertation examines two application areas where existing, state-of-the-art terminology modeling improves the task of information extraction (IE) -- the process of transforming unstructured data into structured form. We show that a large amount of word meaning can be learned directly from very large document collections. First, we explore the feasibility of mining sexual health behavior data directly from the unstructured text of online "hookup" requests. The Internet has fundamentally changed how individuals locate sexual partners. The rise of dating websites, location-aware smartphone apps like Grindr and Tinder that facilitate casual sexual encounters ("hookups"), as well as changing trends in sexual health practices all speak to the shifting cultural dynamics surrounding sex in the digital age. These shifts also coincide with an increase in the incidence rate of sexually transmitted infections (STIs) in subpopulations such as young adults, racial and ethnic minorities, and men who have sex with men (MSM). The reasons for these increases and their possible connections to Internet cultural dynamics are not completely understood. What is apparent, however, is that sexual encounters negotiated online complicate many traditional public health intervention strategies such as contact tracing and partner notification. These circumstances underline the need to examine online sexual communities using computational tools and techniques -- as is done with other social networks -- to provide new insight and direction for public health surveillance and intervention programs. One of the central challenges in this task is constructing lexical resources that reflect how people actually discuss and negotiate sex online. Using a 2.5-year collection of over 130 million Craigslist ads (a large venue for MSM casual sexual encounters), we discuss computational methods for automatically learning terminology characterizing risk behaviors in the MSM community. These approaches range from keyword-based dictionaries and topic modeling to semi-supervised methods using word embeddings for query expansion and sequence labeling. These methods allow us to gather information similar (in part) to the types of questions asked in public health risk assessment surveys, but automatically aggregated directly from communities of interest, in near real time, and at high geographic resolution. We then address the methodological limitations of this work, as well as the fundamental validation challenges posed by the lack of large-scale sexual behavior survey data and the limited availability of STI surveillance data. Finally, leveraging work on terminology modeling in Craigslist, we present new research exploring representation learning using 7 years of University of Iowa Hospitals and Clinics (UIHC) clinical notes.
Using medication names as an example, we show that modeling a low-dimensional representation of a medication's neighboring words, i.e., a word embedding, encodes a large amount of non-obvious semantic information. Embeddings, for example, implicitly capture a large degree of the hierarchical structure of drug families as well as encode relational attributes of words, such as generic and brand names of medications. These representations -- learned in a completely unsupervised fashion -- can then be used as features in other machine learning tasks. We show that incorporating clinical word embeddings in a benchmark classification task of medication labeling leads to a 5.4% increase in F1-score over a baseline of random initialization and a 1.9% increase over just using non-UIHC training data. This research suggests clinical word embeddings could be shared for use in other institutions and other IE tasks.
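A hedged sketch of learning word embeddings from a corpus of clinical notes and inspecting a medication's nearest neighbors, in the spirit of the representation learning described above; the corpus file, tokenization, and word2vec settings are assumptions (gensim 4.x API), since the UIHC notes themselves are not public.

```python
from gensim.models import Word2Vec  # gensim 4.x API assumed

# one pre-tokenized, de-identified note per line (hypothetical file)
with open("clinical_notes_tokenized.txt") as f:
    sentences = [line.lower().split() for line in f]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=10, workers=4)

# Nearby vectors tend to mix drug-family members with brand/generic variants,
# which is the kind of non-obvious structure the dissertation exploits as features.
print(model.wv.most_similar("warfarin", topn=10))

# Downstream, each token can be represented by model.wv[token] and fed to a
# sequence labeler (e.g., a CRF or neural tagger) for medication extraction.
```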
173

Computational methods for mining health communications in web 2.0

Bhattacharya, Sanmitra 01 May 2014 (has links)
Data from social media platforms are being actively mined for trends and patterns of interest. Problems such as sentiment analysis and prediction of election outcomes have become tremendously popular due to the unprecedented availability of social interactivity data of different types. In this thesis we address two problems that have been relatively unexplored. The first problem relates to mining beliefs, in particular health beliefs, and their surveillance using social media. The second problem relates to investigation of factors associated with engagement of U.S. Federal Health Agencies via Twitter and Facebook. In addressing the first problem we propose a novel computational framework for belief surveillance. This framework can be used for 1) surveillance of any given belief in the form of a probe, and 2) automatically harvesting health-related probes. We present our estimates of support, opposition and doubt for these probes, some of which represent true information, in the sense that they are supported by scientific evidence, while others represent false information and the remaining represent debatable propositions. We show, for example, that the levels of support in false and debatable probes are surprisingly high. We also study the scientific novelty of these probes and find that some of the harvested probes with sparse scientific evidence may indicate novel hypotheses. We also show the suitability of off-the-shelf classifiers for belief surveillance. We find these classifiers are quite generalizable and can be used for classifying newly harvested probes. Finally, we show that probes can be harvested and tracked over time. Although our work is focused on health care, the approach is broadly applicable to other domains as well. For the second problem, our specific goals are to study factors associated with the amount and duration of engagement of organizations. We use negative binomial hurdle regression models and Cox proportional hazards survival models for these analyses. For Twitter, the hurdle analysis shows that the presence of a user mention is positively associated with the amount of engagement while negative sentiment has an inverse association. The content of tweets is equally important for engagement. The survival analyses indicate that engagement duration is positively associated with follower count. For Facebook, both hurdle and survival analyses show that the number of page likes and positive sentiment are correlated with higher and prolonged engagement while a few content types are negatively correlated with engagement. We also find patterns of engagement that are consistent across Twitter and Facebook.
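As one illustration of the survival modeling mentioned above, the sketch below fits a Cox proportional hazards model to hypothetical engagement-duration data using the lifelines library; the file name, column names, and feature set are invented stand-ins, not the study's actual variables.

```python
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("tweet_engagement.csv")  # hypothetical extract
# expected columns (illustrative): duration_hours (time until engagement stopped),
# observed (1 if engagement had ended by collection time, else 0 = censored),
# follower_count, has_user_mention, negative_sentiment

cph = CoxPHFitter()
cph.fit(
    df[["duration_hours", "observed", "follower_count",
        "has_user_mention", "negative_sentiment"]],
    duration_col="duration_hours",
    event_col="observed",
)
# hazard ratios below 1 for follower_count would indicate longer-lasting engagement
cph.print_summary()
```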
174

Mining for evidence in enterprise corpora

Almquist, Brian Alan 01 May 2011 (has links)
The primary research aim of this dissertation is to identify the strategies that best meet the information retrieval needs expressed in the "e-discovery" scenario. This task calls for a high-recall system that, in response to a request for all available relevant documents to a legal complaint, effectively prioritizes documents from an enterprise document collection in order of likelihood of relevance. High-recall information retrieval strategies, such as those employed for e-discovery and patent or medical literature searches, incur high costs when relevant documents are missed, but they also carry high document review costs. Our approaches parallel the evaluation opportunities afforded by the TREC Legal Track. Within the ad hoc framework, we propose an approach that includes query field selection, techniques for mitigating OCR error, term weighting strategies, query language reduction, pseudo-relevance feedback using document metadata and terms extracted from documents, merging result sets, and biasing results to favor documents responsive to lawyer-negotiated queries. We conduct several experiments to identify effective parameters for each of these strategies. Within the relevance feedback framework, we use an active learning approach informed by signals from collected prior relevance judgments and ranking data. We train a classifier to prioritize the unjudged documents retrieved using different ad hoc information retrieval techniques applied to the same topic. We demonstrate significant improvements over heuristic rank aggregation strategies when choosing from a relatively small pool of documents. With a larger pool of documents, we validate the effectiveness of the merging strategy as a means to increase recall, but find that the sparseness of judgment data prevents effective ranking by the classifier-based ranker. We conclude our research by optimizing the classifier-based ranker and applying it to other high-recall datasets. Our concluding experiments consider the potential benefits of modifying the merged runs using methods derived from social choice models. We find that this technique, Local Kemenization, is hampered by the large number of documents and the small number of result sets contributing to the ranked list. This two-stage approach to high-recall information retrieval tasks continues to offer a rich set of questions for future research.
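For readers unfamiliar with merging ranked result sets, the sketch below shows one common heuristic, reciprocal rank fusion; it is offered as a generic illustration of rank aggregation, not as the specific merging strategy or the Local Kemenization method used in the thesis.

```python
from collections import defaultdict

def reciprocal_rank_fusion(runs, k=60):
    """runs: list of ranked lists of document ids (best first)."""
    scores = defaultdict(float)
    for run in runs:
        for rank, doc_id in enumerate(run, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

run_a = ["d3", "d1", "d7", "d2"]   # e.g., a query-field-selection run
run_b = ["d1", "d5", "d3", "d9"]   # e.g., a pseudo-relevance-feedback run
print(reciprocal_rank_fusion([run_a, run_b]))  # documents ranked highly in both rise to the top
```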
175

Improving the performance of Hierarchical Hidden Markov Models on Information Extraction tasks

Chou, Lin-Yi January 2006 (has links)
This thesis presents novel methods for creating and improving hierarchical hidden Markov models. The work centers around transforming a traditional tree-structured hierarchical hidden Markov model (HHMM) into an equivalent model that reuses repeated sub-trees. This process temporarily breaks the tree structure constraint in order to leverage the benefits of combining repeated sub-trees. These benefits include a lower cost of testing and increased accuracy of the final model, and thus greater overall performance. The result is called a merged and simplified hierarchical hidden Markov model (MSHHMM). The thesis goes on to detail four techniques for improving the performance of MSHHMMs when applied to information extraction tasks, in terms of accuracy and computational cost. Briefly, these techniques are: a new formula for calculating the approximate probability of previously unseen events; pattern generalisation to transform observations, thus increasing testing speed and prediction accuracy; restructuring states to focus on state transitions; and an automated flattening technique for reducing the complexity of HHMMs. The basic model and the four improvements are evaluated by applying them to the well-known information extraction tasks of Reference Tagging and Text Chunking. In both tasks, MSHHMMs show consistently good performance across varying sizes of training data. In the case of Reference Tagging, the accuracy of the MSHHMM is comparable to other methods. However, when the volume of training data is limited, MSHHMMs maintain high accuracy whereas other methods show a significant decrease. These accuracy gains were achieved without any significant increase in processing time. For the Text Chunking task the accuracy of the MSHHMM was again comparable to other methods. However, the other methods incurred much higher processing delays compared to the MSHHMM. The results of these practical experiments demonstrate the benefits of the new method: increased accuracy, lower computational cost, and better overall performance.
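As background for the hierarchical models discussed above, the sketch below runs Viterbi decoding in a flat HMM, the building block that HHMMs and MSHHMMs extend with shared sub-trees; the tiny reference-tagging state space, probabilities, and emission rule are invented for illustration and are not the thesis's model.

```python
import numpy as np

states = ["AUTHOR", "TITLE", "YEAR"]
start = np.log([0.6, 0.3, 0.1])
trans = np.log([[0.5, 0.4, 0.1],   # P(next state | current state)
                [0.2, 0.6, 0.2],
                [0.3, 0.3, 0.4]])

def emission_logprob(state, token):
    # stand-in for the emission model; the thesis's unseen-event formula and
    # pattern generalisation would replace this naive rule
    if state == "YEAR":
        return np.log(0.9) if token.isdigit() else np.log(0.1)
    return np.log(0.5)

def viterbi(tokens):
    n, k = len(tokens), len(states)
    dp = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    for s in range(k):
        dp[0, s] = start[s] + emission_logprob(states[s], tokens[0])
    for t in range(1, n):
        for s in range(k):
            cand = dp[t - 1] + trans[:, s]          # best previous state for s
            back[t, s] = int(np.argmax(cand))
            dp[t, s] = cand[back[t, s]] + emission_logprob(states[s], tokens[t])
    path = [int(np.argmax(dp[-1]))]                  # backtrace the best path
    for t in range(n - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [states[s] for s in reversed(path)]

print(viterbi("Chou Lin-Yi 2006".split()))
```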
176

Improving scalability and accuracy of text mining in grid environment

Zhai, Yuzheng January 2009 (has links)
The advance in technologies such as massive storage devices and high-speed internet has led to an enormous increase in the volume of available documents in electronic form. These documents represent information in a complex and rich manner that cannot be analysed using conventional statistical data mining methods. Consequently, text mining has developed as a growing new technology for discovering knowledge from textual data and managing textual information. Processing and analysing textual information can yield valuable and important insights, yet these tasks also require an enormous amount of computational resources due to the sheer size of the data available. Therefore, it is important to enhance the existing methodologies to achieve better scalability, efficiency and accuracy. The emerging Grid technology shows promising results in solving the problem of scalability by splitting the work of text clustering algorithms into a number of jobs, each executed separately and simultaneously on different computing resources. This allows for a substantial decrease in processing time while maintaining a similar level of quality. To improve the quality of the text clustering results, a new document encoding method is introduced that takes into consideration the semantic similarities of the words. In this way, documents that are similar in content are more likely to be grouped together. One of the ultimate goals of text mining is to help us gain insight into a problem and to assist in the decision-making process together with other sources of information. Hence we tested the effectiveness of incorporating text mining in the context of stock market prediction. This is achieved by integrating the outcomes obtained from text mining with those from data mining, which results in a more accurate forecast than using either method alone.
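A hedged sketch of one way to encode documents with semantic awareness before clustering: TF-IDF followed by a latent semantic projection, so that documents using related (not only identical) words land near each other. This is a stand-in for the thesis's encoding method, whose details are not given in the abstract, and the toy documents are invented.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = [
    "stock prices rallied after the earnings report",
    "shares climbed on strong quarterly earnings",
    "the recipe calls for flour, sugar and butter",
    "mix the butter and sugar before adding flour",
]

encoder = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2),          # latent "semantic" dimensions
)
X = encoder.fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # the two finance documents and the two recipe documents should pair up
```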
177

Ett verktyg för konstruktion av ontologier från text / A Tool for Facilitating Ontology Construction from Texts

Chétrit, Héloèise January 2004 (has links)
With the growth of information stored over the Internet, especially in the biological field, and with discoveries being made daily in this domain, scientists are faced with an overwhelming number of articles. Reading all published articles is a tedious and time-consuming process. Therefore a way to summarise the information in the articles is needed. A solution is the derivation of an ontology representing the knowledge enclosed in the set of articles and allowing the user to browse through them. In this thesis we present the tool Ontolo, which allows an initial ontology of a domain to be built by inserting a set of articles related to that domain into the system. The quality of the ontology construction has been tested by comparing our ontology results for keywords to the ones provided by the Gene Ontology for the same keywords. The obtained results are quite promising for a first prototype of the system, as it finds many common terms in both ontologies from just a few hundred inserted articles.
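As a generic illustration of seeding an ontology from article text, the sketch below harvests is-a candidates with a classic Hearst pattern ("X such as Y"); Ontolo's actual method is not described in the abstract, so this should not be read as its implementation, and the sample sentences are invented.

```python
import re

pattern = re.compile(r"(\w+(?: \w+)?) such as ((?:\w+, )*\w+(?: and \w+)?)", re.I)

text = ("Enzymes such as kinases and phosphatases regulate signalling. "
        "Model organisms such as yeast, zebrafish and mice are widely studied.")

relations = []
for parent, children in pattern.findall(text):
    for child in re.split(r",\s*|\s+and\s+", children):
        if child:
            relations.append((child.strip(), "is-a", parent.strip()))

print(relations)
# e.g. [('kinases', 'is-a', 'Enzymes'), ...] -- candidate edges for a draft ontology
```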
178

應用資料探勘技術於食譜分享社群網站進行內容分群之研究 / A user-based content clustering system using data mining techniques on a recipe sharing website

林宜儒 Unknown Date (has links)
This study takes a recipe-sharing community website as its subject, builds an automatic clustering mechanism based on the kNN clustering algorithm for the recipes provided on the site, and uses the site's user behavior as a reference for characterizing the resulting clusters. The study builds an information system for automatic clustering in the recipe domain in three stages. The first stage is data processing: although the recipe data obtained from the website already has a relatively structured format that could be clustered directly, the user-entered content still contains typos, redundant words, and material only loosely related to the recipes themselves, and therefore must be cleaned. The second stage is clustering: text mining is used to extract content features, and data mining techniques are then applied to cluster the recipes, with within-cluster characteristics and between-cluster similarity serving as the main indicators of clustering quality. The third stage is cluster characterization: the behavior of users who bookmark and categorize recipes on the site is analyzed statistically to identify likely category names for each cluster. A clustering experiment was conducted on 500 recipes; in the best clustering result, 10 recipe clusters were obtained with an average within-cluster similarity of 0.4482, each cluster exhibited clearly similar characteristics, and users' bookmarking behavior on the site could be used to label the clusters with categories such as soups, desserts, bread, and Chinese cuisine. Because the website marks up the content fields of every recipe according to the recipe format standard provided by schema.org, the recipe clustering mechanism implemented in this study can also be applied in the future to other websites of the same type that adopt the schema.org standard.
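A hedged sketch of the evaluation and labeling steps described above: computing average within-cluster cosine similarity and naming each cluster by the category users most often assign when bookmarking its recipes. The data layout, variable names, and toy vectors are illustrative, not the study's implementation.

```python
from collections import Counter
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def avg_within_cluster_similarity(X, labels):
    """X: (n_recipes, n_features) recipe vectors; labels: array of cluster ids."""
    sims = []
    for c in np.unique(labels):
        members = X[labels == c]
        if len(members) < 2:
            continue
        s = cosine_similarity(members)
        # average over off-diagonal pairs only
        sims.append((s.sum() - len(members)) / (len(members) * (len(members) - 1)))
    return float(np.mean(sims))

def label_clusters(labels, user_categories):
    """user_categories: the category a user filed each recipe under when bookmarking."""
    names = {}
    for c in np.unique(labels):
        cats = [user_categories[i] for i in np.where(labels == c)[0]]
        names[c] = Counter(cats).most_common(1)[0][0]   # e.g. "soup", "dessert", "bread"
    return names

# toy usage on random vectors standing in for TF-IDF recipe features
rng = np.random.default_rng(0)
X = rng.random((6, 8))
labels = np.array([0, 0, 0, 1, 1, 1])
cats = ["soup", "soup", "stew", "dessert", "dessert", "bread"]
print(avg_within_cluster_similarity(X, labels), label_clusters(labels, cats))
```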
179

Cross-language Ontology Learning : Incorporating and Exploiting Cross-language Data in the Ontology Learning Process

Hjelm, Hans January 2009 (has links)
An ontology is a knowledge-representation structure, where words, terms or concepts are defined by their mutual hierarchical relations. Ontologies are becoming ever more prevalent in the world of natural language processing, where we currently see a tendency towards using semantics for solving a variety of tasks, particularly tasks related to information access. Ontologies, taxonomies and thesauri (all related notions) are also used in various forms by humans, to standardize business transactions or for finding conceptual relations between terms in, e.g., the medical domain. The acquisition of machine-readable, domain-specific semantic knowledge is time consuming and prone to inconsistencies. The field of ontology learning therefore provides tools for automating the construction of domain ontologies (ontologies describing the entities and relations within a particular field of interest), by analyzing large quantities of domain-specific texts. This thesis studies three main topics within the field of ontology learning. First, we examine which sources of information are useful within an ontology learning system and how the information sources can be combined effectively. Secondly, we do this with a special focus on cross-language text collections, to see if we can learn more from studying several languages at once than we can from a single-language text collection. Finally, we investigate new approaches to formal and automatic evaluation of the quality of a learned ontology. We demonstrate how to combine information sources from different languages and use them to train automatic classifiers to recognize lexico-semantic relations. The cross-language data is shown to have a positive effect on the quality of the learned ontologies. We also give theoretical and experimental results, showing that our ontology evaluation method is a good complement to, and in some aspects improves on, the evaluation measures in use today.
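A simplified sketch of the core idea of combining cross-language information sources: features for a term pair are built from two languages and fed to a classifier that recognizes a lexico-semantic relation (here, hypernymy vs. not). The feature construction, the toy data, and the logistic regression model are placeholders rather than the thesis's richer setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(vec_a, vec_b):
    """Features for a candidate (term_a, term_b) pair in one language."""
    return np.concatenate([vec_a, vec_b, vec_a - vec_b])

def cross_language_features(en_vecs, sv_vecs, pair):
    """Concatenate the English and Swedish views of the same term pair."""
    a, b = pair
    return np.concatenate([pair_features(en_vecs[a], en_vecs[b]),
                           pair_features(sv_vecs[a], sv_vecs[b])])

# hypothetical 50-dim distributional vectors per concept, keyed by an interlingual id
rng = np.random.default_rng(1)
en_vecs = {i: rng.normal(size=50) for i in range(100)}
sv_vecs = {i: rng.normal(size=50) for i in range(100)}

pairs = [(i, (i + 1) % 100) for i in range(100)]
y = rng.integers(0, 2, size=100)                 # 1 = hypernym pair (toy labels)
X = np.vstack([cross_language_features(en_vecs, sv_vecs, p) for p in pairs])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))   # with real data, held-out evaluation would be used instead
```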
180

Enhanced Web Search Engines with Query-Concept Bipartite Graphs

Chen, Yan 16 August 2010 (has links)
With the rapid growth of information on the Web, Web search engines have gained great momentum for exploiting valuable Web resources. Although keyword-based Web search engines provide relevant search results in response to users' queries, further enhancement is still needed. Three important issues include (1) search results can be diverse because ambiguous keywords in queries can be interpreted with different meanings; (2) identifying keywords in long queries is difficult for search engines; and (3) generating query-specific Web page summaries is desirable for previewing Web search results. Based on clickthrough data, this thesis proposes a query-concept bipartite graph for representing relations among queries, and applies these relations to applications such as (1) personalized query suggestions, (2) Web searches for long queries and (3) query-specific Web page summarization. Experimental results show that query-concept bipartite graphs yield performance improvements for all three applications.
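A hedged sketch of building a query-concept bipartite graph from clickthrough data and using shared concepts to suggest related queries; the tiny click log and concept labels below are invented for illustration and do not reflect the thesis's data.

```python
from collections import defaultdict

click_log = [
    ("jaguar speed", "animal:jaguar"),
    ("jaguar price", "car:jaguar"),
    ("fast cats", "animal:jaguar"),
    ("luxury cars", "car:jaguar"),
    ("luxury cars", "car:bmw"),
]

query_to_concepts = defaultdict(set)    # one side of the bipartite graph
concept_to_queries = defaultdict(set)   # the other side
for query, concept in click_log:
    query_to_concepts[query].add(concept)
    concept_to_queries[concept].add(query)

def related_queries(query):
    """Rank other queries by Jaccard overlap of the concepts they connect to."""
    mine = query_to_concepts[query]
    scores = {}
    for concept in mine:
        for other in concept_to_queries[concept]:
            if other != query:
                theirs = query_to_concepts[other]
                scores[other] = len(mine & theirs) / len(mine | theirs)
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(related_queries("jaguar speed"))   # suggests "fast cats", not the car queries
```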
