1. A Text Mining Framework Linking Technical Intelligence from Publication Databases to Strategic Technology Decisions. Courseault, Cherie Renee, 12 April 2004.
This research developed a comprehensive methodology for quickly monitoring key technical intelligence areas, together with a method that cleanses and consolidates information into a concise, understandable picture of topics of interest, thus bridging technology management and text mining. It evaluated and adapted existing analysis methods and developed an overall framework for answering technical intelligence questions. A six-step approach worked through the stages of the intelligence and text data mining processes to address issues that have hindered the use of text data mining in the intelligence cycle and the use of the resulting intelligence in technology decisions. A questionnaire given to 34 respondents from four industries identified the information most important to decision-makers as well as clusters of common interests. A bibliometric/text mining tool applied to journal publication databases profiled technology trends and presented that information in the context of the needs stated in the questionnaire.
In addition to identifying the information that is important to decision-makers, this research improved the methods for analyzing it. An algorithm was developed that removes common non-technical terms and achieves at least 89% precision in identifying synonymous terms. Such identification is important for improving accuracy when mining free text, enabling the more specific information desired by decision-makers to be delivered. This level of precision was consistent across five technology areas and three databases. The result is the ability to use abstract phrases in analysis, which allows the more detailed nature of abstracts to be captured in clustering while still portraying broad relationships.
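The dissertation does not reproduce the algorithm itself; the idea of consolidating synonymous term variants before analysis can be sketched as follows (a minimal illustration with an assumed stop-list and a crude plural-stripping rule, not Courseault's actual method):

```python
# Assumed mini stop-list of non-technical words; a real list would be far larger.
NON_TECHNICAL = {"the", "of", "and", "for", "new", "novel", "study"}

def normalize(term: str) -> str:
    """Map spelling variants to one key: lowercase, drop hyphens and
    non-technical words, strip a trailing 's' (a crude stand-in for stemming)."""
    words = term.lower().replace("-", "").split()
    words = [w.rstrip("s") if len(w) > 3 else w
             for w in words if w not in NON_TECHNICAL]
    return " ".join(words)

def consolidate(terms):
    """Group raw phrases whose normalized forms coincide."""
    groups = {}
    for t in terms:
        groups.setdefault(normalize(t), []).append(t)
    return groups

terms = ["carbon nanotubes", "carbon nano-tubes", "Carbon Nanotube",
         "fuel cells", "fuel cell", "text mining"]
groups = consolidate(terms)
print(groups)
```

In this toy run, the three "carbon nanotube" spellings and the two "fuel cell" forms collapse into single keys, which is the kind of consolidation that makes abstract phrases usable as clustering features.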
2. Incorporating semantic and syntactic information into document representation for document clustering. Wang, Yong, 06 August 2005.
Document clustering is a widely used strategy for information retrieval and text data mining. In traditional document clustering systems, documents are represented as a bag of independent words. In this project, we propose to enrich the representation of a document by incorporating semantic and syntactic information, identified by performing semantic and syntactic analysis on the raw text. A detailed survey of current research in natural language processing, syntactic analysis, and semantic analysis is provided. Our experimental results demonstrate that incorporating semantic and syntactic information can improve the performance of our document clustering system for most of our data sets, with a statistically significant improvement when both are combined. Experiments with compound words show that using compound words alone does not improve clustering performance for our data sets; when compound words are combined with the original single words, the combined feature set performs slightly better for most data sets, but the improvement is not statistically significant. To select the best clustering algorithm for our document clustering system, several widely used clustering algorithms are compared. Although the bisecting K-means method has advantages when working with large data sets, a traditional hierarchical clustering algorithm still achieves the best performance on our small data sets.
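A toy illustration of why semantic enrichment can help: mapping surface words to shared concepts (here via a tiny hand-made synonym table standing in for a real semantic resource) raises the measured similarity of related documents, which is what clustering depends on. This is a sketch of the intuition, not Wang's actual pipeline:

```python
import math
from collections import Counter

# Hand-made synonym table standing in for a semantic resource such as WordNet.
SYNONYMS = {"car": "vehicle", "automobile": "vehicle"}

def bow(text, enrich=False):
    """Bag-of-words vector; optionally map words to shared concepts first."""
    words = text.lower().split()
    if enrich:
        words = [SYNONYMS.get(w, w) for w in words]
    return Counter(words)

def cosine(a, b):
    dot = sum(v * b[w] for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

d1, d2 = "the car engine failed", "the automobile engine stalled"
plain = cosine(bow(d1), bow(d2))
enriched = cosine(bow(d1, enrich=True), bow(d2, enrich=True))
print(plain, enriched)
```

With the plain bag of words the two sentences overlap only on "the" and "engine"; after enrichment "car" and "automobile" map to the same concept and the similarity rises.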
3. Empirical studies of financial and labor economics. Li, Mengmeng, 12 August 2016.
This dissertation consists of three essays in financial and labor economics. It provides empirical evidence for testing the efficient market hypothesis in some financial markets and for analyzing the trends of power couples’ concentration in large metropolitan areas.
The first chapter investigates the Bitcoin market's efficiency by examining the correlation between social media information and future Bitcoin returns. First, I extract sentiment information from a text analysis of more than 130,000 Bitcoin-related tweets. Granger causality tests confirm that market sentiment affects Bitcoin returns in the short run. Moreover, I find that time series models incorporating sentiment information produce better forecasts of future Bitcoin prices. Based on the predicted prices, I also implement an investment strategy that yields a sizeable return for investors.
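The abstract does not give Li's exact specification; the Granger-style logic it describes can be sketched on synthetic data as follows. A restricted model (returns explained by their own mean) is compared against an unrestricted one adding lagged sentiment, via an F-test on residual sums of squares; the data, coefficients, and lag length here are invented for illustration:

```python
import random

random.seed(0)
n = 200
# Synthetic data in which sentiment genuinely leads returns by one period.
sentiment = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 0.5) for _ in range(n)]
returns = [0.0] + [0.5 * sentiment[t - 1] + noise[t] for t in range(1, n)]

y = returns[1:]        # response: today's return
x = sentiment[:-1]     # candidate cause: yesterday's sentiment
m = len(y)

# Restricted model: returns explained by their own mean only.
mean_y = sum(y) / m
sse_r = sum((v - mean_y) ** 2 for v in y)

# Unrestricted model: add lagged sentiment via simple OLS.
mean_x = sum(x) / m
beta = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        / sum((a - mean_x) ** 2 for a in x))
alpha = mean_y - beta * mean_x
sse_u = sum((b - (alpha + beta * a)) ** 2 for a, b in zip(x, y))

# F-statistic for the single added regressor; large values reject
# "sentiment does not Granger-cause returns".
f_stat = (sse_r - sse_u) / (sse_u / (m - 2))
print(f_stat)
```

A full Granger test would also condition on lagged returns and test several lags jointly; this one-regressor version only shows the shape of the comparison.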
The second chapter examines episodes of exuberance and collapse in the Chinese stock market and the second-board market using a series of extended right-tailed augmented Dickey-Fuller tests. The empirical results suggest that multiple “bubbles” occurred in the Chinese stock market, although insufficient evidence is found to claim the same for the second-board market.
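The right-tailed unit-root idea behind those tests can be illustrated with a minimal Dickey-Fuller regression on synthetic data. This is a one-shot statistic without the recursive windows or calibrated critical values of the actual SADF/GSADF procedures, so it is only a sketch of the intuition that explosive ("bubble") dynamics push the statistic into the right tail:

```python
import random

def df_stat(series):
    """t-statistic of rho in the regression dy_t = rho * y_{t-1} + e_t
    (no constant); large positive values signal explosive behaviour."""
    x = series[:-1]
    dy = [series[t] - series[t - 1] for t in range(1, len(series))]
    sxx = sum(v * v for v in x)
    rho = sum(a * b for a, b in zip(x, dy)) / sxx
    resid = [b - rho * a for a, b in zip(x, dy)]
    s2 = sum(e * e for e in resid) / (len(dy) - 1)
    return rho / (s2 / sxx) ** 0.5

random.seed(1)
walk, bubble = [1.0], [1.0]
for _ in range(300):
    walk.append(walk[-1] + random.gauss(0, 1))          # random walk (no bubble)
    bubble.append(1.02 * bubble[-1] + random.gauss(0, 1))  # mildly explosive

print(df_stat(walk), df_stat(bubble))
```

On the random walk the statistic stays small, while the mildly explosive series produces a large positive value; the published procedures run this comparison over expanding and rolling windows to date-stamp episodes of exuberance.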
The third chapter analyzes the trends of power couples’ concentration in large metropolitan areas of the United States between 1940 and 2010. The urbanization of college-educated couples between 1940 and 1990 was primarily due to the growth of dual-career households and the resulting severity of the co-location problem (Costa and Kahn, 2000). However, the concentration of college-educated couples in large metropolitan areas stopped increasing between 1990 and 2010. According to the results of a multinomial logit model and a triple-difference model, this is because the co-location effect faded away after 1990.
4. Cross-Lingual and Low-Resource Sentiment Analysis. Farra, Noura, January 2019.
Identifying sentiment in a low-resource language is essential for understanding opinions internationally and for responding to the urgent needs of locals affected by disaster incidents in different world regions. While tools and resources for recognizing sentiment in high-resource languages are plentiful, determining the most effective methods for achieving this task in a low-resource language which lacks annotated data is still an open research question. Most existing approaches for cross-lingual sentiment analysis to date have relied on high-resource machine translation systems, large amounts of parallel data, or resources only available for Indo-European languages.
This work presents methods, resources, and strategies for identifying sentiment cross-lingually in a low-resource language. We introduce a cross-lingual sentiment model which can be trained on a high-resource language and applied directly to a low-resource language. The model offers the feature of lexicalizing the training data using a bilingual dictionary, but can perform well without any translation into the target language.
Through an extensive experimental analysis, evaluated on 17 target languages, we show that the model performs well with bilingual word vectors pre-trained on an appropriate translation corpus. We compare in-genre and in-domain parallel corpora, out-of-domain parallel corpora, in-domain comparable corpora, and monolingual corpora, and show that a relatively small in-domain parallel corpus works best as a transfer medium when available. We describe the conditions under which other resources and embedding generation methods are successful, including our strategies for leveraging in-domain comparable corpora for cross-lingual sentiment analysis.
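A minimal sketch of the transfer idea: train sentiment polarity on source-language words and predict directly on target-language words through a shared bilingual vector space. The 2-d "embeddings" below are hand-made toys, not real pre-trained vectors, and the nearest-centroid rule stands in for the actual trained model:

```python
# Toy shared bilingual embedding space (invented 2-d vectors).
EMB = {
    "good":  (0.9, 0.8), "great": (0.95, 0.7),
    "bad":  (-0.8, -0.9), "awful": (-0.9, -0.7),
    "bueno": (0.85, 0.75), "malo": (-0.85, -0.8),  # target-language words
}
# Labeled training data exists only in the source language (1=pos, 0=neg).
train = [("good", 1), ("great", 1), ("bad", 0), ("awful", 0)]

def centroid(label):
    vs = [EMB[w] for w, y in train if y == label]
    return tuple(sum(c) / len(vs) for c in zip(*vs))

POS, NEG = centroid(1), centroid(0)

def predict(word):
    """Classify any word in the shared space by its nearest class centroid."""
    v = EMB[word]
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(v, c))
    return 1 if dist(POS) < dist(NEG) else 0

print(predict("bueno"), predict("malo"))
```

Because "bueno" and "malo" live in the same space as the English training words, no translation into the target language is needed at prediction time, which is the property the cross-lingual model exploits.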
To enhance the ability of the cross-lingual model to identify sentiment in the target language, we present new feature representations for sentiment analysis that are incorporated in the cross-lingual model: bilingual sentiment embeddings that are used to create bilingual sentiment scores, and a method for updating the sentiment embeddings during training by lexicalization of the target language. This feature configuration works best for the largest number of target languages in both untargeted and targeted cross-lingual sentiment experiments.
The cross-lingual model is studied further by evaluating the role of the source language, which has traditionally been assumed to be English. We build cross-lingual models using 15 source languages, including two non-European and non-Indo-European source languages: Arabic and Chinese. We show that language families play an important role in the performance of the model, as does the morphological complexity of the source language.
In the last part of the work, we focus on sentiment analysis towards targets. We study Arabic as a representative morphologically complex language and develop models and morphological representation features for identifying entity targets and sentiment expressed towards them in Arabic open-domain text. Finally, we adapt our cross-lingual sentiment models for the detection of sentiment towards targets. Through cross-lingual experiments on Arabic and English, we demonstrate that our findings regarding resources, features, and language also hold true for the transfer of targeted sentiment.
5. Exploring Problems in Water and Health by Text Mining of Online Information. Zhang, Yiding, 30 September 2019.
No description available.
6. Clinician Decision Support Dashboard: Extracting value from Electronic Medical Records. Sethi, Iccha, 07 May 2012.
Paper medical records are rapidly being digitized into electronic medical records (EMRs). Although EMRs improve administration, billing, and logistics, how doctors can leverage them to enhance patient care remains an open research problem. This thesis describes a system that analyzes a patient's evolving EMR in the context of available biomedical knowledge and the accumulated experience recorded in various text sources, including the EMRs of other patients. The aim of the Clinician Decision Support (CDS) Dashboard is to provide interactive, automated, actionable EMR text-mining tools that improve both the patient and clinical care staff experience. Within a secure network, the CDS Dashboard helps physicians find de-identified electronic medical records similar to their patient's record, thereby aiding diagnosis, treatment, prognosis, and outcomes. It is of particular value in cases involving complex disorders, and it also allows physicians to explore relevant medical literature, recent research findings, clinical trials, and medical cases. In a pilot study with medical students at the Virginia Tech Carilion School of Medicine and Research Institute (VTC), 89% found the CDS Dashboard useful for aiding patient care, 81% found it useful pedagogically for medical students, and over 81% found the tool user friendly. The CDS Dashboard is constructed using a multidisciplinary approach spanning computer science, medicine, biomedical research, and human-machine interfacing. This multidisciplinary approach, combined with the high usability scores obtained at VTC, indicates that the CDS Dashboard has high potential value to clinicians and medical students. / Master of Science
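The abstract does not specify the retrieval method. One common way to find similar de-identified records, sketched here under that assumption, is TF-IDF weighting with cosine similarity; the three mini-records below are invented:

```python
import math
from collections import Counter

# Invented, de-identified mini-records for illustration.
records = {
    "r1": "patient presents with chest pain and shortness of breath",
    "r2": "chronic chest pain with elevated troponin and breath difficulty",
    "r3": "fractured wrist after fall no cardiac history",
}

def tfidf_vectors(docs):
    """Term frequency weighted by inverse document frequency."""
    n = len(docs)
    df = Counter(w for text in docs.values() for w in set(text.split()))
    return {k: {w: c * math.log(n / df[w]) for w, c in Counter(t.split()).items()}
            for k, t in docs.items()}

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = tfidf_vectors(records)
query = vecs["r1"]
best = max((k for k in vecs if k != "r1"), key=lambda k: cosine(query, vecs[k]))
print(best)
```

For the cardiac-sounding query record, the other cardiac record ranks above the orthopedic one, which is the behaviour a similar-records dashboard needs before layering on clinical context.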
7. Tracking domain knowledge based on segmented textual sources. Kalledat, Tobias, 11 May 2009.
This research aims to gain insight into how pre-processing influences the results of knowledge generation, and to give concrete recommendations for the appropriate pre-processing of text corpora in text data mining (TDM) projects. The focus is on extracting and tracking concepts within particular knowledge domains, using a methodological approach based on the horizontal (timeline) and vertical (term persistence) segmentation of corpora. The result is a set of time-segmented subcorpora that reflect the persistence properties of the terms they contain. Within each time-segmented subcorpus, two clusters of terms can be formed: one containing the terms that are not persistent with respect to the whole corpus, the other containing those that occur in all time segments. Using simple frequency measures, it can be shown that the statistical quality of a single corpus by itself allows the pre-processing quality to be measured; comparison corpora are not necessary. The time series of the frequency measures show significant negative correlations between the cluster of permanently occurring terms and the cluster of terms that are not persistent across all time segments of the corpus. This holds only for the optimally pre-processed corpus and is not found in the other test sets, whose pre-processing quality was low. When the most frequent terms are grouped into concepts using domain-specific taxonomies, a significant negative correlation appears between the number of distinct terms per time segment and the terms assigned to a taxonomy; again, this holds only for the corpus with high pre-processing quality. A semantic analysis of data prepared with a threshold-based TDM method yielded significantly different amounts of generated knowledge, depending on the quality of data pre-processing. With the methods and measures presented in this research, both the quality of the source corpora used and the quality of the applied taxonomies can be measured. Based on these findings, indicators for measuring and assessing corpora and taxonomies are developed, and recommendations are given for pre-processing that is adequate to the goal of the subsequent analysis process.
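The core mechanics of the approach, segmenting a corpus along the timeline, splitting terms into a persistent cluster (present in every segment) and a transient one, and correlating the two clusters' frequency series, can be sketched on a toy corpus (the yearly texts below are invented, and Pearson correlation stands in for the thesis's measures):

```python
from collections import Counter

# Invented yearly corpus segments (horizontal, timeline segmentation).
yearly = {
    1998: "markets markets trade web portal",
    1999: "markets trade trade dotcom dotcom web",
    2000: "markets trade crash crash dotcom",
    2001: "markets markets trade recovery",
}
counts = {y: Counter(t.split()) for y, t in yearly.items()}

# Vertical segmentation: terms occurring in every time segment are persistent.
persistent = set.intersection(*(set(c) for c in counts.values()))

p_series, t_series = [], []
for y in sorted(yearly):
    c = counts[y]
    p_series.append(sum(v for w, v in c.items() if w in persistent))
    t_series.append(sum(v for w, v in c.items() if w not in persistent))

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

print(persistent, pearson(p_series, t_series))
```

In this constructed example the two frequency series move against each other, mirroring the negative correlation the thesis reports for well pre-processed corpora; on a poorly cleaned corpus, noise terms would blur the persistent/transient split.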
8. The development of accented English synthetic voices. Malatji, Promise Tshepiso, January 2019.
Thesis (M. Sc. (Computer Science)) --University of Limpopo, 2019 / A text-to-speech (TTS) synthesis system is a software system that receives text as input and produces speech as output. TTS synthesis can be used, among other things, for language learning and for reading text aloud to people living with different disabilities (e.g., physically challenged or visually impaired), whether they are native or non-native speakers of the target language. Most people relate easily to a second language spoken by a non-native speaker with whom they share a native language, yet most online English TTS systems are built from recordings of native English speakers. This research study focuses on developing accented English synthetic voices as spoken by non-native speakers in the Limpopo province of South Africa. The MARY (Modular Architecture for Research on speech sYnthesis) TTS engine is used to develop the synthetic voices, and the hidden Markov model (HMM) method is used to train them. A secondary text corpus is used to build the training speech corpus by recording six speakers reading the text. The quality of the developed voices is measured in terms of intelligibility, similarity, and naturalness using a listening test, with results broken down by evaluators' occupation and gender as well as overall. The subjective listening test indicates that the developed synthetic voices are highly acceptable in terms of similarity and intelligibility. Speech analysis software is used to compare the synthesized speech with the human recordings; there is no significant difference in voice pitch between the speakers and the synthetic voices, except for one synthetic voice.
9. Towards the French Biomedical Ontology Enrichment. Lossio-Ventura, Juan Antonio, 09 November 2015.
Big Data in the biomedical domain poses the problem of analyzing large volumes of heterogeneous data (e.g., video, audio, text, images). Biomedical ontologies, conceptual models of reality, can play a crucial role in automating data processing, querying, and the matching of heterogeneous data. Various resources exist for English, but considerably fewer are available for French, and a strong lack of related tools and services compounds this gap. Initially, ontologies were built manually; in recent years, a few semi-automatic methodologies have been proposed. Semi-automatic ontology construction and enrichment is mostly induced from texts using natural language processing (NLP) techniques. NLP methods must take into account the lexical and semantic complexity of biomedical data: (1) lexical, referring to the complex biomedical phrases to be considered, and (2) semantic, referring to the induction of the concept and context of the terminology. In this thesis, we propose methodologies for the enrichment and construction of biomedical ontologies based on two main contributions that address these challenges. The first contribution concerns the automatic extraction of specialized biomedical terms (lexical complexity) from corpora: new measures for extracting and ranking single- and multi-word terms have been proposed and evaluated, and the BioTex application implements them. The second contribution concerns concept extraction and the semantic linking of the extracted terminology (semantic complexity): this work seeks to induce concepts for new candidate terms and to determine their semantic links, i.e., the most relevant positions within an existing biomedical ontology. We propose a concept-extraction approach that integrates new terms into the MeSH ontology. Quantitative and qualitative evaluations, conducted by experts and non-experts on real data, highlight the relevance of these contributions.
10. Developing an XML-based, exploitable linguistic database of the Hebrew text of Gen. 1:1-2:3. Kroeze, J.H. (Jan Hendrik), 28 July 2008.
The thesis discusses a series of related techniques that prepare and transform raw linguistic data for advanced processing in order to unveil hidden grammatical patterns. A three-dimensional array is identified as a suitable data structure for building a data cube that captures multidimensional linguistic data in a computer's temporary storage. It also enables online analytical processing, such as slicing, to be executed on this data cube in order to reveal various subsets and presentations of the data. XML is investigated as a suitable mark-up language for permanently storing such an exploitable databank of Biblical Hebrew linguistic data. The concept is illustrated by tagging a phonetic transcription of Genesis 1:1-2:3 on various linguistic levels and manipulating this databank. Transferring the data set between an XML file and a three-dimensional array creates a stable environment that allows editing and advanced processing of the data in order to confirm existing knowledge or to mine for new, yet undiscovered, linguistic features. Two experiments are executed to demonstrate possible text-mining procedures. Finally, visualisation is discussed as a technique that enhances interaction between the human researcher and the computerised technologies supporting the process of knowledge creation. Although the data set is very small, there are exciting indications that the compilation and analysis of aggregate linguistic data may assist linguists in performing rigorous research, for example regarding the definitions of semantic functions and the mapping of these functions onto the syntactic module. / Thesis (PhD (Information Technology))--University of Pretoria, 2008. / Information Science / unrestricted
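The data-cube idea can be sketched with a plain nested list indexed by clause, phrase, and linguistic level; fixing the level axis is the OLAP-style "slice" the thesis describes. The level labels and tags below are assumed stand-ins, not Kroeze's actual tag set:

```python
# Axes of the toy cube: axis 0 = clause, axis 1 = phrase, axis 2 = level.
LEVELS = ["transcription", "pos", "semantic_function"]

# One clause of Gen. 1:1 with assumed illustrative tags per phrase.
cube = [
    [["bere'shit", "prep+noun", "time"],
     ["bara",      "verb",      "action"],
     ["elohim",    "noun",      "agent"]],
]

def slice_level(cube, level):
    """OLAP-style slice: fix the linguistic-level axis, vary clause and phrase."""
    i = LEVELS.index(level)
    return [[phrase[i] for phrase in clause] for clause in cube]

print(slice_level(cube, "semantic_function"))
```

Serializing such a cube to XML and reading it back into the array gives the round-trip between permanent storage and in-memory analysis that the thesis demonstrates.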