1 |
A Text Mining Framework Linking Technical Intelligence from Publication Databases to Strategic Technology Decisions
Courseault, Cherie Renee, 12 April 2004
This research developed a comprehensive methodology to quickly monitor key technical intelligence areas and provided a method that cleanses and consolidates information into an understandable, concise picture of topics of interest, thus bridging issues of managing technology and text mining. This research evaluated and altered some existing analysis methods, and developed an overall framework for answering technical intelligence questions. A six-step approach worked through the various stages of the Intelligence and Text Data Mining Processes to address issues that hindered the use of Text Data Mining in the Intelligence Cycle and the actual use of that intelligence in making technology decisions. A questionnaire given to 34 respondents from four different industries identified the information most important to decision-makers as well as clusters of common interests. A bibliometric/text mining tool applied to journal publication databases profiled technology trends and presented that information in the context of the stated needs from the questionnaire.
In addition to identifying the information that is important to decision-makers, this research improved the methods for analyzing information. An algorithm was developed that removed common non-technical terms and delivered at least an 89% precision rate in identifying synonymous terms. Such identifications are important to improving accuracy when mining free text, thus enabling the provision of the more specific information desired by the decision-makers. This level of precision was consistent across five different technology areas and three different databases. The result is the ability to use abstract phrases in analysis, which allows the more detailed nature of abstracts to be captured in clustering, while portraying the broad relationships as well.
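The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea: drop common non-technical words from phrases, group phrases that reduce to the same core term, and measure precision as the share of proposed synonym pairs that a reviewer accepts. The stop list, the function names, and the sample phrases are all hypothetical and are not taken from the dissertation.

```python
# Hypothetical stop list of common non-technical terms (an assumption, not the author's list).
COMMON_TERMS = {"a", "an", "the", "of", "for", "new", "novel", "study", "approach"}


def strip_common_terms(phrase):
    """Lower-case a phrase and drop common non-technical words."""
    return " ".join(w for w in phrase.lower().split() if w not in COMMON_TERMS)


def group_by_core(phrases):
    """Group phrases that reduce to the same core term after filtering."""
    groups = {}
    for phrase in phrases:
        groups.setdefault(strip_common_terms(phrase), []).append(phrase)
    return groups


def precision(proposed_pairs, correct_pairs):
    """Fraction of proposed synonym pairs that were judged correct."""
    if not proposed_pairs:
        return 0.0
    hits = sum(1 for pair in proposed_pairs if pair in correct_pairs)
    return hits / len(proposed_pairs)


# Invented sample data, for illustration only.
phrases = ["A novel fuel cell", "the fuel cell", "Fuel cell approach", "solar cell"]
groups = group_by_core(phrases)
```

Here the three "fuel cell" variants fall into one group and "solar cell" stays separate. A real system would need a far richer notion of synonymy than exact matching after filtering.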
|
2 |
Incorporating semantic and syntactic information into document representation for document clustering
Wang, Yong, 06 August 2005
Document clustering is a widely used strategy for information retrieval and text data mining. In traditional document clustering systems, documents are represented as a bag of independent words. In this project, we propose to enrich the representation of a document by incorporating semantic information and syntactic information. Semantic analysis and syntactic analysis are performed on the raw text to identify this information. A detailed survey of current research in natural language processing, syntactic analysis, and semantic analysis is provided. Our experimental results demonstrate that incorporating semantic information and syntactic information can improve the performance of our document clustering system for most of our data sets. A statistically significant improvement can be achieved when we combine both syntactic and semantic information. Our experimental results using compound words show that using only compound words does not improve the clustering performance for our data sets. When the compound words are combined with original single words, the combined feature set gets slightly better performance for most data sets. But this improvement is not statistically significant. In order to select the best clustering algorithm for our document clustering system, a comparison of several widely used clustering algorithms is performed. Although the bisecting K-means method has advantages when working with large datasets, a traditional hierarchical clustering algorithm still achieves the best performance for our small datasets.
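As a rough illustration of the representations compared above, the sketch below builds a bag-of-words vector and optionally adds compound-word features to it, then compares documents by cosine similarity. It assumes adjacent word pairs as a crude stand-in for compound words; the thesis identifies its features through semantic and syntactic analysis, which this sketch does not attempt. All function names are hypothetical.

```python
import math
from collections import Counter


def single_words(text):
    """Split text into lower-cased single words (the classic bag of words)."""
    return text.lower().split()


def compound_words(text):
    """Adjacent word pairs, an assumed stand-in for real compound-word detection."""
    words = single_words(text)
    return [f"{a}_{b}" for a, b in zip(words, words[1:])]


def features(text, use_compounds=True):
    """Count single words and, optionally, compound words in one feature set."""
    counts = Counter(single_words(text))
    if use_compounds:
        counts.update(compound_words(text))
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


doc_a = features("text data mining")
doc_b = features("mining text data")
similarity = cosine(doc_a, doc_b)
```

With single words only, the two sample documents are identical; adding the pair features lowers their similarity because word order now matters. This is one way a combined feature set can separate documents that a plain bag of words cannot.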
|
3 |
Empirical studies of financial and labor economics
Li, Mengmeng, 12 August 2016
This dissertation consists of three essays in financial and labor economics. It provides empirical evidence for testing the efficient market hypothesis in some financial markets and for analyzing the trends of power couples’ concentration in large metropolitan areas.
The first chapter investigates the Bitcoin market’s efficiency by examining the correlation between social media information and Bitcoin future returns. First, I extract Twitter sentiment information from the text analysis of more than 130,000 Bitcoin-related tweets. Granger causality tests confirm that market sentiment information affects Bitcoin returns in the short run. Moreover, I find that time series models that incorporate sentiment information better forecast Bitcoin future prices. Based on the predicted prices, I also implement an investment strategy that yields a sizeable return for investors.
The second chapter examines episodes of exuberance and collapse in the Chinese stock market and the second-board market using a series of extended right-tailed augmented Dickey-Fuller tests. The empirical results suggest that multiple “bubbles” occurred in the Chinese stock market, although insufficient evidence is found to claim the same for the second-board market.
The third chapter analyzes the trends of power couples’ concentration in large metropolitan areas of the United States between 1940 and 2010. The urbanization of college-educated couples between 1940 and 1990 was primarily due to the growth of dual-career households and the resulting severity of the co-location problem (Costa and Kahn, 2000). However, the concentration of college-educated couples in large metropolitan areas stopped increasing between 1990 and 2010. According to the results of a multinomial logit model and a triple difference-in-difference model, this is because the co-location effect faded away after 1990.
|
4 |
Text analytics in business environments: a managerial and methodological approach
Marcolin, Carla Bonato, January 2018
The decision-making process, in different management environments, faces a moment of change in the organizational context. In this sense, Business Analytics can be seen as an area that leverages the value of data, containing important tools for the decision-making process. However, the presence of data in different formats poses a challenge. In this context of variability, text data has attracted the attention of organizations, as thousands of people express themselves daily in this format in the many applications and tools available. Although several techniques have been developed by the computer science community, there is ample scope to improve the organizational use of such text data, especially when it comes to decision-making support. Despite the importance and availability of textual data to support decisions, its use is not common because of the analysis and interpretation challenges that the volume and the unstructured format of text data present. Thus, the aim of this dissertation is to develop and evaluate a framework for the use of text data in decision-making processes, based on several natural language processing (NLP) techniques. The results show the validity of the framework, using the tourism sector, through the TripAdvisor platform, as a demonstration of its applicability, along with the internal validation of performance and the acceptance by the managers consulted.
|
5 |
An investigation into fuzzy clustering quality and speed: fuzzy C-means with effective seeding
Stetco, Adrian, January 2017
Cluster analysis, the automatic procedure by which large data sets can be split into similar groups of objects (clusters), has innumerable applications in a wide range of problem domains. Improvements in clustering quality (as captured by internal validation indexes) and speed (number of iterations until cost function convergence), the main focus of this work, have many desirable consequences. They can result, for example, in faster and more precise detection of illness onset based on symptoms, or provide investors with rapid detection and visualization of patterns in financial time series. Partitional clustering, one of the most popular ways of doing cluster analysis, can be classified into two main categories: hard (where the clusters discovered are disjoint) and soft (also known as fuzzy; clusters are non-disjoint, or overlapping). In this work we consider how improvements in the speed and solution quality of the soft partitional clustering algorithm Fuzzy C-means (FCM) can be achieved through more careful and informed initialization based on data content. The resulting FCM++ approach samples the starting cluster centers during the initialization phase in a way that disperses them through the data space. Because the cluster centers are well spread in the input space, the approach yields both faster convergence and higher quality solutions. Moreover, we allow the user to specify a parameter indicating how far apart the cluster centers should be picked in the data space at the beginning of the clustering procedure. We show FCM++'s superior behaviour in both convergence times and quality compared with existing methods, on a wide range of artificially generated and real data sets. We consider a case study where we propose a methodology based on FCM++ for pattern discovery on synthetic and real world time series data.
We discuss a method that uses both Pearson correlation and Multi-Dimensional Scaling to reduce data dimensionality, remove noise and make the dataset easier to interpret and analyse. We show that FCM++ has a positive impact on quality (the Xie-Beni index is lower in nine out of ten cases for FCM++) and speed (6.3 iterations on average, compared with 22.6 iterations) when clustering these lower dimensional, noise reduced representations of the time series. This methodology provides a clearer picture of the cluster analysis results and helps in detecting similarly behaving time series which could otherwise come from any domain. Further, we investigate the use of Spherical Fuzzy C-Means (SFCM) with the seeding mechanism used for FCM++ on news text data retrieved from a popular British newspaper. The methodology allows us to visualize and group hundreds of news articles based on the topics discussed within. The positive impact made by SFCM++ translates into a faster process (12.2 iterations on average, compared with the 16.8 needed by the standard SFCM) and a higher quality solution (the Xie-Beni index is lower for SFCM++ in seven out of every ten runs).
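The seeding idea described above can be sketched as follows. This is not the thesis's exact rule, which the abstract does not state. It is a k-means++-style sampler under the assumption that each point is drawn with probability proportional to its squared distance from the nearest already chosen center, raised to a user-set exponent `p` that stands in for the spread parameter. The names `spread_seeds`, `sq_dist`, and `p` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def spread_seeds(points, k, p=1.0, rng=None):
    """Pick k initial centers; a larger p > 0 favours points far from chosen centers.

    Assumption: sampling weight = (squared distance to nearest center) ** p.
    """
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(pt, c) for c in centers) ** p for pt in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen center; fall back to a uniform pick.
            centers.append(rng.choice(points))
            continue
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Invented toy data: two tight groups far apart.
points = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
seeds = spread_seeds(points, k=2, p=2.0)
```

The seeds would then be handed to a standard fuzzy C-means loop as its starting centers. Since an already chosen point has weight zero, it cannot be drawn again while any other point lies at a nonzero distance.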
|
6 |
Cross-Lingual and Low-Resource Sentiment Analysis
Farra, Noura, January 2019
Identifying sentiment in a low-resource language is essential for understanding opinions internationally and for responding to the urgent needs of locals affected by disaster incidents in different world regions. While tools and resources for recognizing sentiment in high-resource languages are plentiful, determining the most effective methods for achieving this task in a low-resource language which lacks annotated data is still an open research question. Most existing approaches for cross-lingual sentiment analysis to date have relied on high-resource machine translation systems, large amounts of parallel data, or resources only available for Indo-European languages.
This work presents methods, resources, and strategies for identifying sentiment cross-lingually in a low-resource language. We introduce a cross-lingual sentiment model which can be trained on a high-resource language and applied directly to a low-resource language. The model offers the feature of lexicalizing the training data using a bilingual dictionary, but can perform well without any translation into the target language.
Through an extensive experimental analysis, evaluated on 17 target languages, we show that the model performs well with bilingual word vectors pre-trained on an appropriate translation corpus. We compare in-genre and in-domain parallel corpora, out-of-domain parallel corpora, in-domain comparable corpora, and monolingual corpora, and show that a relatively small, in-domain parallel corpus works best as a transfer medium if it is available. We describe the conditions under which other resources and embedding generation methods are successful, and these include our strategies for leveraging in-domain comparable corpora for cross-lingual sentiment analysis.
To enhance the ability of the cross-lingual model to identify sentiment in the target language, we present new feature representations for sentiment analysis that are incorporated in the cross-lingual model: bilingual sentiment embeddings that are used to create bilingual sentiment scores, and a method for updating the sentiment embeddings during training by lexicalization of the target language. This feature configuration works best for the largest number of target languages in both untargeted and targeted cross-lingual sentiment experiments.
The cross-lingual model is studied further by evaluating the role of the source language, which has traditionally been assumed to be English. We build cross-lingual models using 15 source languages, including two non-European and non-Indo-European source languages: Arabic and Chinese. We show that language families play an important role in the performance of the model, as does the morphological complexity of the source language.
In the last part of the work, we focus on sentiment analysis towards targets. We study Arabic as a representative morphologically complex language and develop models and morphological representation features for identifying entity targets and sentiment expressed towards them in Arabic open-domain text. Finally, we adapt our cross-lingual sentiment models for the detection of sentiment towards targets. Through cross-lingual experiments on Arabic and English, we demonstrate that our findings regarding resources, features, and language also hold true for the transfer of targeted sentiment.
|
9 |
Exploring Problems in Water and Health by Text Mining of Online Information
Zhang, Yiding, 30 September 2019
No description available.
|
10 |
Clinician Decision Support Dashboard: Extracting value from Electronic Medical Records
Sethi, Iccha, 07 May 2012
Medical records are rapidly being digitized to electronic medical records. Although Electronic Medical Records (EMRs) improve administration, billing, and logistics, an open research problem remains as to how doctors can leverage EMRs to enhance patient care. This thesis describes a system that analyzes a patient's evolving EMR in context with available biomedical knowledge and the accumulated experience recorded in various text sources including the EMRs of other patients. The aim of the Clinician Decision Support (CDS) Dashboard is to provide interactive, automated, actionable EMR text-mining tools that help improve both the patient and clinical care staff experience. The CDS Dashboard, in a secure network, helps physicians find de-identified electronic medical records similar to their patient's medical record thereby aiding them in diagnosis, treatment, prognosis and outcomes. It is of particular value in cases involving complex disorders, and also allows physicians to explore relevant medical literature, recent research findings, clinical trials and medical cases. A pilot study done with medical students at the Virginia Tech Carilion School of Medicine and Research Institute (VTC) showed that 89% of them found the CDS Dashboard to be useful in aiding patient care for doctors and 81% of them found it useful for aiding medical students pedagogically. Additionally, over 81% of the medical students found the tool user friendly. The CDS Dashboard is constructed using a multidisciplinary approach including: computer science, medicine, biomedical research, and human-machine interfacing. Our multidisciplinary approach combined with the high usability scores obtained from VTC indicated the CDS Dashboard has a high potential value to clinicians and medical students. / Master of Science
|