41

Securing Cyberspace: Analyzing Cybercriminal Communities through Web and Text Mining Perspectives

Benjamin, Victor January 2016 (has links)
Cybersecurity has become one of the most pressing issues facing society today. In particular, cybercriminals often congregate within online communities to exchange knowledge and assets. As a result, there has been strong interest in recent years in developing a deeper understanding of cybercriminal behaviors, the global cybercriminal supply chain, emerging threats, and various other cybersecurity-related activities. However, few recent works have focused on identifying, collecting, and analyzing cybercriminal content. Despite the high societal impact of cybercriminal community research, only a few studies have leveraged these rich data sources in their totality, and those that do often resort to manual data collection and analysis techniques. In this dissertation, I address two broad research questions: 1) In what ways can I advance cybersecurity as a science by scrutinizing the contents of online cybercriminal communities? and 2) How can I make use of computational methodologies to identify, collect, and analyze cybercriminal communities in an automated and scalable manner? To these ends, the dissertation comprises four essays. The first essay introduces a set of computational methodologies and research guidelines for conducting cybercriminal community research; to this point, there has been no literature establishing a clear route for non-technical and non-security researchers to begin studying such communities. The second essay examines possible motives for prolonged participation by individuals within cybercriminal communities. The third essay develops new neural network language model (NNLM) capabilities and applies them to cybercriminal community data in order to understand hacker-specific language evolution and to identify emerging threats. The last essay focuses on developing an NNLM-based framework for identifying information dissemination among varying international cybercriminal populations by examining multilingual cybercriminal forums. Together, these essays help further establish cybersecurity as a science.
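The record includes no code, but the third essay's idea of tracking hacker-specific language evolution with embedding-style NNLMs can be sketched as follows: train a separate word-embedding model per time slice of forum posts and compare the nearest neighbours of a probe term across slices. The sketch below uses gensim's Word2Vec on invented toy corpora; the probe term "exploit" and both corpora are hypothetical, and this illustrates the general technique rather than the dissertation's actual pipeline.

# Sketch: semantic drift of a hacker term across two time slices.
# posts_2012 and posts_2015 are invented stand-ins for tokenized forum posts.
from gensim.models import Word2Vec

posts_2012 = [["new", "exploit", "released", "for", "java"],
              ["java", "exploit", "source", "shared", "free"]] * 10
posts_2015 = [["selling", "exploit", "kit", "monthly", "subscription"],
              ["exploit", "kit", "rental", "with", "support"]] * 10

def train(corpus):
    # One model per time slice; comparing neighbour lists avoids having
    # to align the two vector spaces.
    return Word2Vec(sentences=corpus, vector_size=50, window=3,
                    min_count=2, epochs=30, seed=42)

for label, model in (("2012", train(posts_2012)), ("2015", train(posts_2015))):
    print(label, model.wv.most_similar("exploit", topn=3))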
42

Metodología para estimar el impacto que generan las llamadas realizadas en un call center en la fuga de los clientes utilizando técnicas de text mining / Methodology for estimating the impact of call-center calls on customer churn using text mining techniques

Sepúlveda Jullian, Catalina January 2015 (has links)
Industrial Civil Engineer / The telecommunications industry is growing constantly, driven by technological development and people's increasing need to stay connected. It is therefore highly competitive, and customers are free to choose whichever option suits them best and meets their expectations. Churn prediction, and with it customer retention, is thus a fundamental factor in a company's success. However, given the high degree of competition among firms, churn models must be innovated with new sources of information, such as calls to the call center. The overall objective of this work is therefore to measure the impact of call-center calls on the prediction of customer churn. To this end, we use information on customers' interactions with the call center, specifically the text of each call. To extract information about the calls' content, a topic-detection model was applied to the text in order to identify the subjects discussed and use this information in the churn models. The results obtained from several logit churn-prediction models show that a model using both call information and customer information (demographic and transactional) is 8.7% more accurate than one that does not use this new source of information. In addition, the model with both types of variables has a type I error 25% lower than a model that does not include call content. From these analyses it can be concluded that call-center calls are indeed relevant and helpful for predicting customer churn, since they increase the model's predictive power and fit. They also provide new information about customer behavior and make it possible to detect topics that may be associated with churn, enabling corrective action to be taken.
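The thesis does not specify its implementation, but the pipeline it describes (topic detection over call transcripts, then a logit churn model combining topic proportions with demographic and transactional variables) can be sketched roughly as below. The toy transcripts, labels, customer features, and the number of topics are all invented for illustration.

# Sketch: call-transcript topics as extra features in a logit churn model.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

calls = ["quiero cancelar mi plan, muy caro",        # toy transcripts
         "problema con la señal, sin servicio",
         "consulta sobre mi factura del mes",
         "cambiar de plan a uno mas barato"] * 25
churned = np.tile([1, 1, 0, 0], 25)                  # toy churn labels
demo = np.random.RandomState(0).rand(len(calls), 3)  # toy demographic/transactional features

X_counts = CountVectorizer().fit_transform(calls)
topics = LatentDirichletAllocation(n_components=3, random_state=0) \
             .fit_transform(X_counts)                # per-call topic proportions

X = np.hstack([demo, topics])                        # customer + call-topic features
model = LogisticRegression().fit(X, churned)
print("train accuracy:", model.score(X, churned))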
43

Health Data Analytics: Data and Text Mining Approaches for Pharmacovigilance

Liu, Xiao January 2016 (has links)
Pharmacovigilance is defined as the science and activities relating to the detection, assessment, understanding, and prevention of adverse drug events (WHO 2004). Post-approval adverse drug events are a major health concern: they account for about 700,000 emergency department visits, 120,000 hospitalizations, and $75 billion in medical costs annually (Yang et al. 2014). However, certain adverse drug events are preventable if detected early, making timely and accurate pharmacovigilance in the post-approval period an urgent goal of the public health system. The availability of various sources of healthcare data for analysis in recent years opens new opportunities for data-driven pharmacovigilance research. In attempting to leverage this emerging healthcare big data, pharmacovigilance research faces a few challenges. Most studies in pharmacovigilance focus on structured and coded data, and therefore miss important textual data from patient social media and clinical documents in EHRs. Most prior studies develop drug safety surveillance systems using a single data source with only one data mining algorithm; the performance of such systems is hampered by bias in the data and the pitfalls of the data mining algorithms adopted. In my dissertation, I address two broad research questions: 1) How do we extract rich adverse-drug-event-related information from textual data for active drug safety surveillance? 2) How do we design an integrated pharmacovigilance system to improve the decision-making process for drug safety regulatory intervention? To these ends, the dissertation comprises three essays. The first essay examines how to develop a high-performance information extraction framework for patient reports of adverse drug events in health social media; I found that medical entity extraction, drug-event relation extraction, and report source classification are necessary components for this task. In the second essay, I address the scalability issue of using social media for pharmacovigilance by proposing a distant supervision approach for information extraction. In the last essay, I develop a MetaAlert framework for pharmacovigilance with advanced text mining and data mining techniques to provide timely and accurate detection of adverse drug reactions. The models, frameworks, and design principles proposed in these essays advance not only pharmacovigilance research but also, more broadly, health IT, business analytics, and design science research.
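A minimal sketch of the distant supervision idea from the second essay, under my own simplifying assumptions: lexicons of known drugs and known adverse reactions are matched against posts, and sentence-level co-occurrence is taken as a weak positive label for training a downstream relation classifier. The lexicons and posts below are toy placeholders, not the dissertation's actual resources.

# Sketch: distant supervision for adverse-drug-event extraction.
# Co-occurrence of a known drug and a known reaction in one sentence
# yields a weakly labelled positive example; no hand annotation needed.
import re

DRUGS = {"ibuprofen", "warfarin"}                 # toy drug lexicon
REACTIONS = {"nausea", "bleeding", "headache"}    # toy reaction lexicon

posts = [
    "Started warfarin last month and noticed some bleeding.",
    "Ibuprofen gives me a headache, oddly enough.",
    "Loving this weather today!",
]

weak_examples = []
for post in posts:
    for sentence in re.split(r"[.!?]", post):
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        for drug in DRUGS & tokens:
            for reaction in REACTIONS & tokens:
                weak_examples.append((drug, reaction, sentence.strip()))

print(weak_examples)   # candidate training data for a relation classifier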
44

Dolování dat z příchozích zpráv elektronické pošty / Data mining from incoming e-mail messages

Šebesta, Jan January 2009 (has links)
In the present work we study possibilities for the automatic sorting of incoming e-mail communication. Our primary goal is to identify information about upcoming workshops and conferences, job offers, and published books. We are trying to develop a tool that mines this information from professional mailing lists. Offers on the mailing lists arrive in HTML, RTF, or plain-text format, but the information in them is written in ordinary natural language. We are developing a system that uses text mining methods to extract the information and save it in structured form, so that we can then work with it. We examine how users handle such mail and apply this knowledge in the development. We solve the problems of obtaining the messages, detecting their language and encoding, and estimating the type of each message. After recognizing the relevant information we are able to mine the data. Finally, we save the mined information to a database, which allows us to display it in a well-arranged way and to sort and search it according to the user's needs.
45

Dolování dat z příchozích zpráv elektronické pošty / Data mining from incoming e-mail messages

Šebesta, Jan January 2011 (has links)
We study possibilities for the automatic sorting of incoming e-mail. Our primary goal is to identify information about upcoming workshops and conferences, job offers, and published books. We are developing a mining tool for extracting this information from profession-specific mailing lists. Offers on the mailing lists arrive in HTML, RTF, or plain-text format; the messages themselves are written in ordinary natural language. We have developed a system that uses text mining methods to extract the information and save it in structured form, so that it can then be worked with. We examine how users handle their mail and apply this knowledge in the development. We solve the problems of obtaining the messages, detecting their language and encoding, and estimating the type of each message. After recognizing the relevant information we mine the data. Finally, we save the mined information to a database, which allows us to display it in a well-arranged way and to sort and search it according to the user's needs.
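Neither record includes code, but the message-type estimation step that both abstracts describe (distinguishing conference announcements, job offers, and book announcements) has the shape of a standard supervised text classifier. The sketch below is one minimal way to do it; the tiny training set and the label names are invented.

# Sketch: estimating the type of a mailing-list message.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["CFP: workshop on text mining, submission deadline June 1",
          "Second call for papers, international conference on NLP",
          "Job opening: postdoc position in computational linguistics",
          "We are hiring a research engineer, apply by email",
          "New book published: Foundations of Statistical NLP",
          "Announcing the second edition of our data mining textbook"]
labels = ["cfp", "cfp", "job", "job", "book", "book"]   # invented labels

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(emails, labels)
print(clf.predict(["Call for papers: conference on machine learning"]))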
46

Text Mining of Supreme Administrative Court Jurisdictions

Feinerer, Ingo, Hornik, Kurt January 2007 (has links) (PDF)
Within the last decade text mining, i.e., extracting sensitive information from text corpora, has become a major factor in business intelligence. The automated textual analysis of law corpora is highly valuable because of its impact on a company's legal options and the sheer amount of available jurisdiction. The study of supreme court jurisdiction and international law corpora is equally important due to its effects on business sectors. In this paper we use text mining methods to investigate Austrian supreme administrative court jurisdictions concerning dues and taxes. We analyze the law corpora using R with the new text mining package tm. Applications include clustering the jurisdiction documents into groups modeling tax classes (such as income or value-added tax) and identifying jurisdiction properties. The findings are compared to results obtained by law experts. / Series: Research Report Series / Department of Statistics and Mathematics
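The paper's analysis is done in R with the tm package; purely as an illustration of the clustering application it describes (grouping jurisdiction documents into tax classes), here is an equivalent sketch in Python. The toy decision texts and the choice of two clusters are assumptions.

# Sketch of the clustering workflow (the paper itself uses R's tm package):
# TF-IDF vectors of decision texts, k-means into presumed tax classes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

decisions = ["appeal concerning income tax assessment for 2004",
             "value added tax on cross border services dispute",
             "income tax deduction for commuting expenses denied",
             "input value added tax refund claim rejected"]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(decisions)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for i, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:3]        # strongest terms per cluster
    print(f"cluster {i}:", [terms[j] for j in top])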
47

The Role of Work Experiences in College Student Leadership Development: Evidence From a National Dataset and a Text Mining Approach to Examining Beliefs About Leadership

Lewis, Jonathan Scott January 2017 (has links)
Thesis advisor: Heather Rowan-Kenyon / Paid employment is one of the most common extracurricular activities among full-time undergraduates, and an array of studies has attempted to measure its impact. Methodological concerns with the extant literature, however, make it difficult to draw reliable conclusions. Furthermore, the research on working college students has little to say about relationships between employment and leadership development, a key student learning outcome. This study addressed these gaps in two ways, using a national sample of 77,489 students from the 2015 Multi-Institutional Study of Leadership. First, it employed quasi-experimental methods and hierarchical linear modeling (HLM) to investigate relationships between work variables (i.e., working status, work location, and hours worked) and both capacity and self-efficacy for leadership. Work location for students employed on-campus was disaggregated into 14 functional departments to allow for more nuanced analysis. Second, this study used text mining methods to examine the language that participants used to define leadership, which enabled a rich comparison between students’ conceptualizations and contemporary leadership theory. Results from HLM analysis suggested that working for pay is associated with lower self-reported leadership capacity, as defined by the social change model of leadership development, and that this relationship varies by workplace location and across institutional characteristics. The association between working status and self-efficacy for leadership was found to be practically non-significant, and hours worked per week were unrelated to either outcome. Results from text mining analysis suggested that most students conceptualize leadership using language that resonates with the industrial paradigm of leadership theory: leadership resides in a person with authority, who enacts specific behaviors and directs a group toward a goal. Disaggregated findings suggested that students who work off-campus consider leadership differently, using language consonant with contemporary, post-industrial scholarship: leadership is a dynamic, relational, non-coercive process that results in personal growth and positive change. In sum, the findings both echo and challenge aspects of existing research on leadership and working college students. Future research should explore off-campus work environments in greater detail, while practitioners and scholars who supervise students should aim to infuse post-industrial conceptualizations into on-campus work environments. / Thesis (PhD) — Boston College, 2017. / Submitted to: Boston College. Lynch School of Education. / Discipline: Educational Leadership and Higher Education.
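The study's text mining procedure is not given at the code level in this record; a minimal stand-in for the kind of group comparison it reports (how the leadership language of off-campus workers differs from that of other students) is sketched below, contrasting relative word frequencies between two groups of invented definitions.

# Sketch: contrasting the language of two groups' leadership definitions.
from collections import Counter

on_campus = ["a leader directs the group toward a goal",
             "leadership is a person in authority giving direction"]
off_campus = ["leadership is a relational process of positive change",
              "a dynamic non-coercive process of growth"]

def freqs(texts):
    counts = Counter(w for t in texts for w in t.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

f_on, f_off = freqs(on_campus), freqs(off_campus)
# Words comparatively over-used by the off-campus group:
ratio = {w: f_off[w] / f_on.get(w, 1e-4) for w in f_off}
print(sorted(ratio, key=ratio.get, reverse=True)[:5])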
48

Généralisation de données textuelles adaptée à la classification automatique / Toward new features for text mining

Tisserant, Guillaume 14 April 2015 (has links)
The classification of text documents is a long-standing task. Very early on, documents of many kinds were grouped together in order to centralize knowledge, and classification and indexing systems were created to make it easy to find documents according to readers' needs. With the multiplication of documents and the advent of computing, and later of the internet, implementing text classification systems has become a crucial challenge; yet textual data, complex and rich in nature, are difficult to process automatically. In this context, this thesis proposes an original methodology for organizing textual information so as to make it easier to access. Our approaches to automatic text classification and to the extraction of semantic information make it possible to find relevant information quickly. More precisely, this manuscript presents new forms of text representation that facilitate processing for automatic classification tasks. A method for the partial generalization of textual data (the GenDesc approach), based on statistical and morpho-syntactic criteria, is proposed. The thesis also addresses the construction of phrases and the use of semantic information to improve the representation of documents. Numerous experiments demonstrate the relevance and genericity of these proposals, which improve classification results. Finally, in the context of rapidly growing social networks, a method for the automatic generation of semantically meaningful hashtags is proposed, based on statistical measures, semantic resources, and the use of syntactic information. The generated hashtags can then be exploited for information retrieval tasks over large volumes of data.
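GenDesc is described here only at a high level. One plausible reading of "partial generalization based on statistical and morpho-syntactic criteria", sketched below, keeps frequent words and replaces rare ones with their part-of-speech tags; the document-frequency threshold and the use of NLTK's tagger are my assumptions, not the thesis's implementation.

# Sketch of a GenDesc-style partial generalization (one possible reading):
# frequent words are kept, infrequent words become their POS tag.
import nltk
from collections import Counter

# Resource names vary across NLTK versions; downloading both is harmless.
for res in ("punkt", "punkt_tab",
            "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(res, quiet=True)

docs = ["the cat sat on the mat",
        "the dog sat on the rug",
        "the cat chased the dog"]

df = Counter(w for d in docs for w in set(d.split()))  # document frequency
THRESHOLD = 2                                          # assumed cut-off

def generalize(doc):
    tagged = nltk.pos_tag(nltk.word_tokenize(doc))
    return " ".join(w if df[w] >= THRESHOLD else tag for w, tag in tagged)

for d in docs:
    print(generalize(d))   # e.g. "the cat sat on the NN"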
49

Machine Learning Algorithms for the Analysis of Social Media and Detection of Malicious User Generated Content

Unknown Date (has links)
One of the defining characteristics of the modern Internet is its massive connectedness, with information and human connection simply a few clicks away. Social media and online retailers have revolutionized how we communicate and purchase goods or services. User generated content on the web, through social media, plays a large role in modern society; Twitter has been at the forefront of political discourse, with politicians choosing it as their platform for disseminating information, while websites like Amazon and Yelp allow users to share their opinions on products via online reviews. The information available through these platforms can provide insight into a host of relevant topics through the process of machine learning. Specifically, this process involves text mining for sentiment analysis, an application domain of machine learning involving the extraction of emotion from text. Unfortunately, there are still those with malicious intent, and with the changes to how we communicate and conduct business come changes to their malicious practices. Social bots and fake reviews plague the web, providing incorrect information and swaying the opinion of unaware readers. The detection of these false users or posts from reading the text is difficult, if not impossible, for humans. Fortunately, text mining provides us with methods for the detection of harmful user generated content. This dissertation expands the current research in sentiment analysis, fake online review detection, and election prediction. We examine cross-domain sentiment analysis using tweets and reviews. Novel techniques combining ensemble and feature selection methods are proposed for the domain of online spam review detection. We investigate the ability of the Twitter platform to predict the United States 2016 presidential election. In addition, we determine how social bots influence this prediction. / Includes bibliography. / Dissertation (Ph.D.)--Florida Atlantic University, 2018. / FAU Electronic Theses and Dissertations Collection
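The record does not detail the "novel techniques combining ensemble and feature selection methods"; the sketch below shows only the generic shape such a pipeline can take (chi-squared feature selection feeding a random-forest ensemble) on invented review data, and should not be read as the dissertation's actual method.

# Sketch: feature selection + ensemble classification for fake-review detection.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline

reviews = ["best product ever buy now amazing deal",    # toy reviews
           "great value, works exactly as described",
           "five stars! best best best, visit my page",
           "solid build quality, shipping was slow"] * 10
labels = [1, 0, 1, 0] * 10                               # 1 = fake (invented)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    SelectKBest(chi2, k=20),             # keep strongest features
                    RandomForestClassifier(n_estimators=100, random_state=0))
clf.fit(reviews, labels)
print(clf.predict(["unbelievable deal, best product, buy today"]))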
50

A semi-automated framework for the analytical use of gene-centric data with biological ontologies

He, Xin January 2017 (has links)
Motivation: Translational bioinformatics (TBI) has been defined as 'the development and application of informatics methods that connect molecular entities to clinical entities' [1]; it has emerged as a systems-theory approach that bridges the huge wealth of biomedical data to clinical action using a combination of innovations and resources from across the entire spectrum of biomedical informatics [2]. The challenge for TBI is the availability of both comprehensive gene-centric knowledge and the corresponding tools that allow its analysis and exploitation. Traditionally, biological researchers usually study one or only a few genes at a time, but in recent years high-throughput technologies such as gene expression microarrays, protein mass spectrometry, and next-generation DNA and RNA sequencing have emerged that allow the simultaneous measurement of changes on a genome-wide scale. These technologies usually result in large lists of interesting genes, but meaningful biological interpretation remains a major challenge. Over the last decade, enrichment analysis has become standard practice in the analysis of such gene lists, enabling systematic assessment of the likelihood of differential representation of defined groups of genes compared to suitably annotated background knowledge. The success of such analyses is highly dependent on the availability and quality of the gene annotation data. For many years, genes were annotated by different experts using inconsistent, non-standard terminologies; large amounts of variation and duplication in these unstructured annotation sets made them unsuitable for principled quantitative analysis. More recently, much effort has been put into the development and use of structured, domain-specific vocabularies to annotate genes. The Gene Ontology is one of the most successful examples of this, with genes annotated using terms from three main clades: biological process, molecular function, and cellular component. However, many other established and emerging ontologies exist to aid biological data interpretation, yet they are rarely used; for the same reason, many bioinformatics tools only support analysis using the Gene Ontology. The lack of annotation coverage, and of support in existing analytical tools, has become a major limitation to the utility and uptake of these ontologies. Thus, automatic approaches are needed to facilitate the transformation of unstructured data to unlock the potential of all ontologies, with corresponding bioinformatics tools to support their interpretation.
Approaches: In this thesis, firstly, similar to the approach in [3,4], I propose a series of computational approaches, implemented in a new tool, OntoSuite-Miner, to address the challenge of integrating ontology-based gene association data. This approach uses NLP-based text mining methods for ontology-based biomedical text mining. What differentiates my approach from others is that I integrate two of the most widely used NLP modules into the framework, not only increasing the confidence of the text mining results but also providing an annotation score for each mapping, based on the number of pieces of evidence in the literature and the number of NLP modules that agree with the mapping. Since heterogeneous data are important in understanding human disease, the approach was designed to be generic, so that ontology-based annotation generation can be applied to different sources and repeated with different ontologies.
Secondly, with respect to the second challenge posed by TBI, to increase the statistical power of annotation enrichment analysis, I propose OntoSuite-Analytics, which integrates a collection of enrichment analysis methods into a unified open-source software package named topOnto, written in the statistical programming language R. The package supports enrichment analysis across multiple ontologies with a set of implemented statistical/topological algorithms, allowing the comparison of enrichment results across multiple ontologies and between different algorithms.
Results: The methodologies described above were implemented, and a Human Disease Ontology (HDO)-based gene annotation database was generated by mining three publicly available databases: OMIM, GeneRIF, and Ensembl Variation. With the availability of the HDO annotation and the corresponding ontology enrichment analysis tools in topOnto, I profiled 277 gene classes against human diseases and generated 'disease environments' for 1310 human diseases. The exploration of the disease profiles and disease environments provides an overview of known disease knowledge and new insights into disease mechanisms. The integration of multiple ontologies into a disease context demonstrates how 'orthogonal' ontologies can lead to biological insight that would have been missed by a more traditional single-ontology analysis.
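At its core, the kind of ontology enrichment analysis topOnto implements is an over-representation test; a minimal, ontology-agnostic version is the hypergeometric test sketched below. The gene counts are invented, and topOnto itself is an R package implementing several algorithms beyond this basic one.

# Sketch: the basic over-representation test behind ontology enrichment.
# P(X >= k) that a study list of N genes hits k or more of the n genes
# annotated to a term, out of M genes in the annotated background.
from scipy.stats import hypergeom

M = 20000   # background genes (invented)
n = 150     # genes annotated with the term, e.g. an HDO disease (invented)
N = 300     # genes in the study list (invented)
k = 12      # study genes carrying the annotation (invented)

p_value = hypergeom.sf(k - 1, M, n, N)   # survival function gives P(X >= k)
print(f"enrichment p-value: {p_value:.3g}")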
