  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
51

Dolování dat z příchozích zpráv elektronické pošty / Data mining from incoming e-mail messages

Šebesta, Jan January 2009
In the present work we study the possibilities for automatically sorting incoming e-mail. Our primary goal is to identify information about upcoming workshops and conferences, job offers, and published books. We are developing a tool to mine this information from professional mailing lists. Offers in the mailing lists arrive in HTML, RTF, or plain-text format, but the information in them is written in ordinary natural language. We are designing the system to use text mining methods to extract the information and save it in a structured form, so that we can then work with it. We examine how users handle their mail and apply that knowledge in the development. We address the problems of retrieving the messages, detecting their language and encoding, and estimating the type of each message. Once the relevant information has been recognized, we are able to mine the data. Finally, we save the mined information to a database, which allows us to display it in a well-arranged way and to sort and search it according to the user's needs.
52

Dolování dat z příchozích zpráv elektronické pošty / Data mining from incoming e-mail messages

Šebesta, Jan January 2011
We study the possibilities for automatically sorting incoming e-mails. Our primary goal is to identify information about upcoming workshops and conferences, job offers, and published books. We are developing a mining tool that extracts this information from profession-specific mailing lists. Offers in the mailing lists arrive in HTML, RTF, or plain-text format, and the messages are written in ordinary natural language. We have designed the system to use text mining methods to extract the information and save it in a structured form, so that we can then work with it. We examine how users handle their mail and apply that knowledge in the development. We address the problems of retrieving the messages, detecting their language and encoding, and estimating the type of each message. Once the information carried by a message has been recognized, we are able to mine the data. Finally, we save the mined information to a database, which allows us to display it in a well-arranged way and to sort and search it according to the user's needs.
53

Text Mining of Supreme Administrative Court Jurisdictions

Feinerer, Ingo, Hornik, Kurt January 2007
Within the last decade text mining, i.e., extracting sensitive information from text corpora, has become a major factor in business intelligence. The automated textual analysis of law corpora is highly valuable because of its impact on a company's legal options and the raw amount of available jurisdiction. The study of supreme court jurisdiction and international law corpora is equally important due to its effects on business sectors. In this paper we use text mining methods to investigate Austrian supreme administrative court jurisdictions concerning dues and taxes. We analyze the law corpora using R with the new text mining package tm. Applications include clustering the jurisdiction documents into groups modeling tax classes (like income or value-added tax) and identifying jurisdiction properties. The findings are compared to results obtained by law experts. / Series: Research Report Series / Department of Statistics and Mathematics
54

The Role of Work Experiences in College Student Leadership Development: Evidence From a National Dataset and a Text Mining Approach to Examining Beliefs About Leadership

Lewis, Jonathan Scott January 2017
Thesis advisor: Heather Rowan-Kenyon / Paid employment is one of the most common extracurricular activities among full-time undergraduates, and an array of studies has attempted to measure its impact. Methodological concerns with the extant literature, however, make it difficult to draw reliable conclusions. Furthermore, the research on working college students has little to say about relationships between employment and leadership development, a key student learning outcome. This study addressed these gaps in two ways, using a national sample of 77,489 students from the 2015 Multi-Institutional Study of Leadership. First, it employed quasi-experimental methods and hierarchical linear modeling (HLM) to investigate relationships between work variables (i.e., working status, work location, and hours worked) and both capacity and self-efficacy for leadership. Work location for students employed on-campus was disaggregated into 14 functional departments to allow for more nuanced analysis. Second, this study used text mining methods to examine the language that participants used to define leadership, which enabled a rich comparison between students’ conceptualizations and contemporary leadership theory. Results from HLM analysis suggested that working for pay is associated with lower self-reported leadership capacity, as defined by the social change model of leadership development, and that this relationship varies by workplace location and across institutional characteristics. The association between working status and self-efficacy for leadership was found to be practically non-significant, and hours worked per week were unrelated to either outcome. Results from text mining analysis suggested that most students conceptualize leadership using language that resonates with the industrial paradigm of leadership theory— leadership resides in a person with authority, who enacts specific behaviors and directs a group toward a goal. 
Disaggregated findings suggested that students who work off-campus consider leadership differently, using language consonant with contemporary, post-industrial scholarship—leadership is a dynamic, relational, non-coercive process that results in personal growth and positive change. In sum, the findings both echo and challenge aspects of existing research on leadership and working college students. Future research should explore off-campus work environments in greater detail, while practitioners and scholars who supervise students should aim to infuse post-industrial conceptualizations into on-campus work environments. / Thesis (PhD) — Boston College, 2017. / Submitted to: Boston College. Lynch School of Education. / Discipline: Educational Leadership and Higher Education.
55

Généralisation de données textuelles adaptée à la classification automatique / Toward new features for text mining

Tisserant, Guillaume 14 April 2015
La classification de documents textuels est une tâche relativement ancienne. Très tôt, de nombreux documents de différentes natures ont été regroupés dans le but de centraliser la connaissance. Des systèmes de classement et d'indexation ont alors été créés. Ils permettent de trouver facilement des documents en fonction des besoins des lecteurs. Avec la multiplication du nombre de documents et l'apparition de l'informatique puis d'internet, la mise en œuvre de systèmes de classement des textes devient un enjeu crucial. Or, les données textuelles, de nature complexe et riche, sont difficiles à traiter de manière automatique. Dans un tel contexte, cette thèse propose une méthodologie originale pour organiser l'information textuelle de façon à faciliter son accès. Nos approches de classification automatique de textes mais aussi d'extraction d'informations sémantiques permettent de retrouver rapidement et avec pertinence une information recherchée.De manière plus précise, ce manuscrit présente de nouvelles formes de représentation des textes facilitant leur traitement pour des tâches de classification automatique. Une méthode de généralisation partielle des données textuelles (approche GenDesc) s'appuyant sur des critères statistiques et morpho-syntaxiques est proposée. Par ailleurs, cette thèse s'intéresse à la construction de syntagmes et à l'utilisation d'informations sémantiques pour améliorer la représentation des documents. Nous démontrerons à travers de nombreuses expérimentations la pertinence et la généricité de nos propositions qui permettent une amélioration des résultats de classification. Enfin, dans le contexte des réseaux sociaux en fort développement, une méthode de génération automatique de HashTags porteurs de sémantique est proposée. Notre approche s'appuie sur des mesures statistiques, des ressources sémantiques et l'utilisation d'informations syntaxiques. 
Les HashTags proposés peuvent alors être exploités pour des tâches de recherche d'information à partir de gros volumes de données. / Classifying text documents is a relatively old task. Early on, many documents of different types were grouped in order to centralize knowledge. Classification and indexing systems were then created. They make it easy to find documents based on readers' needs. With the increasing number of documents and the appearance of computers and the internet, the implementation of text classification systems has become a critical issue. However, textual data, complex and rich in nature, are difficult to process automatically. In this context, this thesis proposes an original methodology to organize textual information and facilitate access to it. Our approaches to automatic text classification and to semantic information extraction make it possible to find relevant information quickly. Specifically, this manuscript presents new forms of text representation that facilitate processing for automatic classification. A partial generalization of textual data (the GenDesc approach) based on statistical and morphosyntactic criteria is proposed. Moreover, this thesis focuses on the construction of phrases and on the use of semantic information to improve the representation of documents. We demonstrate through numerous experiments the relevance and genericity of our proposals, which improve classification results. Finally, as social networks are developing strongly, a method for the automatic generation of semantically meaningful hashtags is proposed. Our approach is based on statistical measures, semantic resources and the use of syntactic information. The generated hashtags can then be exploited for information retrieval tasks on large volumes of data.
56

Machine Learning Algorithms for the Analysis of Social Media and Detection of Malicious User Generated Content

Unknown Date
One of the defining characteristics of the modern Internet is its massive connectedness, with information and human connection simply a few clicks away. Social media and online retailers have revolutionized how we communicate and purchase goods or services. User generated content on the web, through social media, plays a large role in modern society; Twitter has been at the forefront of political discourse, with politicians choosing it as their platform for disseminating information, while websites like Amazon and Yelp allow users to share their opinions on products via online reviews. The information available through these platforms can provide insight into a host of relevant topics through the process of machine learning. Specifically, this process involves text mining for sentiment analysis, which is an application domain of machine learning involving the extraction of emotion from text. Unfortunately, there are still those with malicious intent, and with the changes to how we communicate and conduct business come changes to their malicious practices. Social bots and fake reviews plague the web, providing incorrect information and swaying the opinion of unaware readers. The detection of these false users or posts from reading the text is difficult, if not impossible, for humans. Fortunately, text mining provides us with methods for the detection of harmful user generated content. This dissertation expands the current research in sentiment analysis, fake online review detection and election prediction. We examine cross-domain sentiment analysis using tweets and reviews. Novel techniques combining ensemble and feature selection methods are proposed for the domain of online spam review detection. We investigate the ability of the Twitter platform to predict the United States 2016 presidential election. In addition, we determine how social bots influence this prediction. / Includes bibliography. / Dissertation (Ph.D.)--Florida Atlantic University, 2018. / FAU Electronic Theses and Dissertations Collection
57

Enhancement of Deep Neural Networks and Their Application to Text Mining

Unknown Date
Many current application domains of machine learning and artificial intelligence involve knowledge discovery from text, such as sentiment analysis, document ontology, and spam detection. Humans have years of experience and training with language, enabling them to understand complicated, nuanced text passages with relative ease. A text classifier attempts to emulate or replicate this knowledge so that computers can discriminate between concepts encountered in text; however, learning high-level concepts from text, such as those found in many applications of text classification, is a challenging task because of the many difficulties associated with text mining and classification. Recently, classifiers trained using artificial neural networks have been shown to be effective for a variety of text mining tasks. Convolutional neural networks have been trained to classify text from character-level input, automatically learning high-level abstract representations and avoiding the need for human engineered features. This dissertation proposes two new techniques for character-level learning, log(m) character embedding and convolutional window classification. Log(m) embedding is a new character-vector representation for text data that is more compact and memory efficient than previous embedding vectors. Convolutional window classification is a technique for classifying long documents, i.e. documents with lengths exceeding the input dimension of the neural network. Additionally, we investigate the performance of convolutional neural networks combined with long short-term memory networks, explore how document length impacts classification performance and compare performance of neural networks against non-neural network-based learners in text classification tasks. / Includes bibliography. / Dissertation (Ph.D.)--Florida Atlantic University, 2018. / FAU Electronic Theses and Dissertations Collection
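The two techniques named in this abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: it assumes that a log(m) embedding means a fixed-width binary code of roughly log2(m) bits per character for an alphabet of m symbols, and that window classification means labelling fixed-size windows of a long document and taking a majority vote. The function names and the `classify_window` callback are hypothetical.

```python
from collections import Counter

def logm_embed(text, alphabet):
    """Encode each character as a short binary code rather than a one-hot vector.

    Assumption: codes 1..m are assigned to the m alphabet symbols and code 0
    is reserved for unknown characters, so m.bit_length() bits are enough.
    """
    width = max(1, len(alphabet).bit_length())
    index = {ch: i + 1 for i, ch in enumerate(alphabet)}
    return [
        [(index.get(ch, 0) >> bit) & 1 for bit in reversed(range(width))]
        for ch in text
    ]

def split_windows(text, size, stride):
    """Cut a document into windows of at most `size` characters."""
    if len(text) <= size:
        return [text]
    starts = list(range(0, len(text) - size + 1, stride))
    # Add a final window so the end of the document is always covered.
    if starts[-1] != len(text) - size:
        starts.append(len(text) - size)
    return [text[s:s + size] for s in starts]

def classify_long_document(text, size, stride, classify_window):
    """Label every window with `classify_window` and return the majority label."""
    votes = Counter(classify_window(w) for w in split_windows(text, size, stride))
    return votes.most_common(1)[0][0]
```

For a three-symbol alphabet the code width is 2 bits per character, against 3 for a one-hot vector; the saving grows with alphabet size. In practice `classify_window` would wrap a trained network, which this sketch leaves out.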
58

A semi-automated framework for the analytical use of gene-centric data with biological ontologies

He, Xin January 2017
Motivation: Translational bioinformatics (TBI) has been defined as ‘the development and application of informatics methods that connect molecular entities to clinical entities’ [1]. It has emerged as a systems theory approach to translating the huge wealth of biomedical data into clinical actions, using a combination of innovations and resources across the entire spectrum of biomedical informatics approaches [2]. The challenge for TBI is the availability of both comprehensive knowledge based on genes and the corresponding tools that allow their analysis and exploitation. Traditionally, biological researchers studied one or only a few genes at a time, but in recent years high throughput technologies such as gene expression microarrays, protein mass-spectrometry and next-generation DNA and RNA sequencing have emerged that allow the simultaneous measurement of changes on a genome-wide scale. These technologies usually result in large lists of interesting genes, but meaningful biological interpretation remains a major challenge. Over the last decade, enrichment analysis has become standard practice in the analysis of such gene lists, enabling systematic assessment of the likelihood of differential representation of defined groups of genes compared to suitably annotated background knowledge. The success of such analyses is highly dependent on the availability and quality of the gene annotation data. For many years, genes were annotated by different experts using inconsistent, non-standard terminologies. Large amounts of variation and duplication in these unstructured annotation sets made them unsuitable for principled quantitative analysis. More recently, a lot of effort has been put into the development and use of structured, domain specific vocabularies to annotate genes. The Gene Ontology is one of the most successful examples of this, where genes are annotated with terms from three main clades: biological process, molecular function and cellular component. 
However, there are many other established and emerging ontologies that could aid biological data interpretation but are rarely used. For the same reason, many bioinformatic tools only support analysis using the Gene Ontology. The lack of annotation coverage for these ontologies, and of support for them in existing analytical tools, has become a major limitation to their utility and uptake. Thus, automatic approaches are needed to facilitate the transformation of unstructured data to unlock the potential of all ontologies, with corresponding bioinformatics tools to support their interpretation. Approaches: In this thesis, firstly, similar to the approach in [3,4], I propose a series of computational approaches implemented in a new tool, OntoSuite-Miner, to address the ontology based gene association data integration challenge. This approach uses NLP based text mining methods for ontology based biomedical text mining. What differentiates my approach from other approaches is that I integrate two of the most widely used NLP modules into the framework, not only increasing the confidence of the text mining results, but also providing an annotation score for each mapping, based on the number of pieces of evidence in the literature and the number of NLP modules that agreed with the mapping. Since heterogeneous data is important in understanding human disease, the approach was designed to be generic, so the ontology based annotation generation can be applied to different sources and can be repeated with different ontologies. Secondly, with respect to the second challenge proposed by TBI, to increase the statistical power of the annotation enrichment analysis, I propose OntoSuite-Analytics, which integrates a collection of enrichment analysis methods into a unified open-source software package named topOnto, in the statistical programming language R. 
The package supports enrichment analysis across multiple ontologies with a set of implemented statistical/topological algorithms, allowing the comparison of enrichment results across multiple ontologies and between different algorithms. Results: The methodologies described above were implemented and a Human Disease Ontology (HDO) based gene annotation database was generated by mining three publicly available databases: OMIM, GeneRIF and Ensembl variation. With the availability of the HDO annotation and the corresponding ontology enrichment analysis tools in topOnto, I profiled 277 gene classes with human diseases and generated ‘disease environments’ for 1310 human diseases. The exploration of the disease profiles and disease environments provides an overview of known disease knowledge and provides new insights into disease mechanisms. The integration of multiple ontologies into a disease context demonstrates how ‘orthogonal’ ontologies can lead to biological insight that would have been missed by more traditional single ontology analysis.
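Enrichment analysis, as described in this abstract, asks whether an ontology term is over-represented in a gene list relative to the annotated background. The sketch below uses the standard one-sided hypergeometric test as an illustration. The abstract does not say which statistical or topological algorithms topOnto implements, so this should not be read as that package's method, and the function and parameter names are hypothetical.

```python
from math import comb

def enrichment_p_value(population, annotated, sample, overlap):
    """One-sided hypergeometric p-value for term over-representation.

    population: number of genes in the annotated background
    annotated:  background genes annotated with the term
    sample:     number of genes in the list of interest
    overlap:    genes in the list that carry the term
    Returns P(X >= overlap) when `sample` genes are drawn without replacement.
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total
```

A small p-value suggests the term appears in the gene list more often than chance would explain. When many terms are tested, the p-values would also need a multiple-testing correction, which this sketch omits.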
59

Avaliação das capacidades dinâmicas através de técnicas de business analytcs / Evaluation of dynamic capabilities through business analytics techniques

Scherer, Jonatas Ost January 2017
O desenvolvimento das capacidades dinâmicas habilita a empresa à inovar de forma mais eficiente, e por conseguinte, melhorar seu desempenho. Esta tese apresenta um framework para mensuração do grau de desenvolvimento das capacidades dinâmicas da empresa. Através de técnicas de text mining uma bag of words específica para as capacidades dinâmicas é proposta, bem como, baseado na literatura é proposto um conjunto de rotinas para avaliar a operacionalização e desenvolvimento das capacidades dinâmicas. Para avaliação das capacidades dinâmicas, foram aplicadas técnicas de text mining utilizando como fonte de dados os relatórios anuais de catorze empresas aéreas. Através da aplicação piloto foi possível realizar um diagnóstico das empresas aéreas e do setor. O trabalho aborda uma lacuna da literatura das capacidades dinâmicas, ao propor um método quantitativo para sua mensuração, assim como, a proposição de uma bag of words específica para as capacidades dinâmicas. Em termos práticos, a proposição pode contribuir para a tomada de decisões estratégicas embasada em dados, possibilitando assim inovar com mais eficiência e melhorar desempenho da firma. / The development of dynamic capabilities enables the company to innovate more efficiently and therefore improves its performance. This thesis presents a framework for measuring the dynamic capabilities development. Text mining techniques were used to propose a specific bag of words for dynamic capabilities. Furthermore, based on the literature, a group of routines is proposed to evaluate the operationalization and development of dynamic capabilities. In order to evaluate the dynamic capabilities, text mining techniques were applied using the annual reports of fourteen airlines as the data source. Through this pilot application it was possible to carry out a diagnosis of the airlines and the sector as well. 
The thesis addresses a gap in the dynamic capabilities literature by proposing a quantitative method for measuring them, together with a specific bag of words for dynamic capabilities. In practical terms, the proposal can contribute to data-based strategic decision making, allowing firms to innovate more efficiently and improve performance.
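As an illustration of how a domain-specific bag of words can be applied to documents such as annual reports, the sketch below counts lexicon terms in a text and reports their share of all tokens. The tokenizer, the scoring rule, and the sample lexicon are assumptions made for this example; the thesis's actual word list and measurement routines are not given in the abstract.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def lexicon_score(text, lexicon):
    """Count lexicon hits in a document and their share of all tokens."""
    tokens = tokenize(text)
    counts = Counter(t for t in tokens if t in lexicon)
    share = sum(counts.values()) / len(tokens) if tokens else 0.0
    return counts, share

# Hypothetical lexicon, for illustration only.
EXAMPLE_LEXICON = {"innovate", "adapt", "reconfigure"}
```

Calling `lexicon_score(report_text, EXAMPLE_LEXICON)` on each report would give one comparable score per company. Real reports would likely need stemming and multi-word terms, which this sketch leaves out.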
60

Data-driven temporal information extraction with applications in general and clinical domains

Filannino, Michele January 2016
The automatic extraction of temporal information from written texts is pivotal for many Natural Language Processing applications such as question answering, text summarisation and information retrieval. However, Temporal Information Extraction (TIE) is a challenging task because of the number of types of expressions (durations, frequencies, times, dates) and their high morphological variability and ambiguity. Most existing approaches are rule-based, while data-driven ones are under-explored. This thesis introduces a novel domain-independent data-driven TIE strategy. The identification strategy is based on machine learning sequence labelling classifiers on features selected through an extensive exploration. Results are further optimised using an a posteriori label-adjustment pipeline. The normalisation strategy is rule-based and builds on a pre-existing system. The methodology has been applied to both a specific (clinical) domain and the general domain, and has been officially benchmarked at the i2b2/2012 and TempEval-3 challenges, ranking 3rd and 1st respectively. The results show the TIE task to be more challenging in the clinical domain (overall accuracy 63%) than in the general domain (overall accuracy 69%). Finally, this thesis also presents two applications of TIE. One of them introduces the concept of the temporal footprint of a Wikipedia article, and uses it to mine the life span of persons. In the other, TIE techniques are used to improve pre-existing information retrieval systems by filtering out temporally irrelevant results.
