41 |
Using text mining to identify crime patterns from Arabic crime news report corpus / Alruily, Meshrif. January 2012 (has links)
Most text mining techniques have been proposed only for English text, and even there, most research has been conducted on specific texts related to special contexts within the English language, such as politics, medicine and crime. In contrast, although Arabic is a widely spoken language, few mining tools have been developed to process Arabic text, and some Arabic domains have not been studied at all. In fact, Arabic is a language with a very complex morphology because it is highly inflectional, and therefore dealing with texts written in Arabic is highly complicated. This research studies the crime domain in the Arabic language, exploiting unstructured text using text mining techniques. Developing a system for extracting important information from crime reports would be useful for police investigators, for accelerating the investigative process (instead of reading entire reports) as well as for conducting further or wider analyses. We propose the Crime Profiling System (CPS) to extract crime-related information (crime type, crime location and nationality of persons involved in the event), automatically construct dictionaries for the existing information, cluster crime documents based on certain attributes and utilize visualisation techniques to assist in crime data analysis. The proposed information extraction approach is novel, and it relies on computational linguistic techniques to identify the abovementioned information, i.e. without using predefined dictionaries (e.g. lists of location names) or an annotated corpus. The language used in crime reporting is studied to identify patterns of interest using a corpus-based approach. Frequency analysis, collocation analysis and concordance analysis are used to perform the syntactic analysis in order to discover the local grammar. Moreover, the Self Organising Map (SOM) approach is adopted in order to perform the clustering and visualisation tasks for crime documents based on crime type, location or nationality.
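Collocation analysis of the kind described above is commonly scored with pointwise mutual information (PMI). The abstract does not give the thesis's actual scoring scheme, so the following is a generic, minimal sketch on a toy English corpus (the Arabic data are not available here); the sample sentences and the `min_count` threshold are invented for illustration.

```python
import math
from collections import Counter

def pmi_collocations(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information:
    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:
            continue  # ignore rare pairs, which inflate PMI
        p_xy = c / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Invented toy "crime report" corpus for illustration only
corpus = ("police arrested the suspect in riyadh . "
          "the suspect was arrested near the market . "
          "police arrested two men in jeddah .").split()
for pair, score in pmi_collocations(corpus)[:3]:
    print(pair, round(score, 2))
```

Pairs such as ("police", "arrested") surface as collocations because they co-occur far more often than their individual frequencies predict; a local grammar can then be induced around such patterns.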
This clustering technique is improved because only refined data containing meaningful keywords extracted through the information extraction process are fed into it; that is, the data are cleaned by removing noise. As a result, the quantity of data fed into the SOM is greatly reduced, saving memory, data loading time and the execution time needed to perform the clustering, and thereby accelerating the computation of the SOM. Finally, the quantization error is reduced, which leads to higher quality clustering. The outcome of the clustering stage is also visualised, and the system is able to provide statistical information, in the form of graphs and tables, about crimes committed within certain periods of time and within a particular area.
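As a rough illustration of the Self Organising Map idea used above, here is a minimal pure-Python SOM that clusters toy keyword-count vectors on a small grid. This is not the thesis's implementation: the grid size, learning-rate schedule and the invented crime-keyword vectors are all assumptions made purely for the sketch.

```python
import math, random

def train_som(data, grid_w=2, grid_h=2, epochs=50, lr0=0.5, seed=0):
    """Train a tiny Self-Organising Map: each grid node holds a weight
    vector; the best-matching node and its grid neighbours are pulled
    towards each input, so similar documents map to nearby nodes."""
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = [[[rng.random() for _ in range(dim)], (x, y)]
             for x in range(grid_w) for y in range(grid_h)]
    sigma0 = max(grid_w, grid_h) / 2.0
    for t in range(epochs):
        frac = t / epochs
        lr = lr0 * (1 - frac)                 # decaying learning rate
        sigma = sigma0 * (1 - frac) + 0.1     # shrinking neighbourhood
        for v in data:
            # best-matching unit = node with closest weight vector
            bmu = min(nodes, key=lambda n: sum((a - b) ** 2
                                               for a, b in zip(n[0], v)))
            for w, pos in nodes:
                d2 = (pos[0] - bmu[1][0]) ** 2 + (pos[1] - bmu[1][1]) ** 2
                h = math.exp(-d2 / (2 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (v[i] - w[i])
    return nodes

def bmu_index(nodes, v):
    return min(range(len(nodes)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(nodes[i][0], v)))

# Invented keyword-count vectors: [theft terms, assault terms, drug terms]
docs = [[5, 0, 0], [4, 1, 0], [0, 5, 0], [0, 4, 1], [0, 0, 5], [1, 0, 4]]
nodes = train_som(docs)
clusters = [bmu_index(nodes, d) for d in docs]
print(clusters)  # documents with similar keyword profiles tend to share a node
```

Feeding the SOM only the extracted keyword counts, rather than full documents, is what the abstract describes: the input vectors are short, so training and quantization are cheap.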
|
42 |
Securing Cyberspace: Analyzing Cybercriminal Communities through Web and Text Mining Perspectives / Benjamin, Victor. January 2016 (has links)
Cybersecurity has become one of the most pressing issues facing society today. In particular, cybercriminals often congregate within online communities to exchange knowledge and assets. As a result, there has been a strong interest in recent years in developing a deeper understanding of cybercriminal behaviors, the global cybercriminal supply chain, emerging threats, and various other cybersecurity-related activities. However, few works in recent years have focused on identifying, collecting, and analyzing cybercriminal contents. Despite the high societal impact of cybercriminal community research, only a few studies have leveraged these rich data sources in their totality, and those that do often resort to manual data collection and analysis techniques. In this dissertation, I address two broad research questions: 1) In what ways can I advance cybersecurity as a science by scrutinizing the contents of online cybercriminal communities? and 2) How can I make use of computational methodologies to identify, collect, and analyze cybercriminal communities in an automated and scalable manner? To these ends, the dissertation comprises four essays. The first essay introduces a set of computational methodologies and research guidelines for conducting cybercriminal community research. To this point, there has been no literature establishing a clear route for non-technical and non-security researchers to begin studying such communities. The second essay examines possible motives for prolonged participation by individuals within cybercriminal communities. The third essay develops new neural network language model (NNLM) capabilities and applies them to cybercriminal community data in order to understand hacker-specific language evolution and to identify emerging threats. The last essay focuses on developing a NNLM-based framework for identifying information dissemination among varying international cybercriminal populations by examining multilingual cybercriminal forums.
These essays help further establish cybersecurity as a science.
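The NNLM-based language-evolution analysis described above is far richer than anything that fits here, but its underlying distributional idea (a term whose typical contexts change between time periods has drifted in meaning) can be sketched with plain co-occurrence vectors and cosine similarity. The window size and the toy forum snippets below are invented for illustration; they are not the dissertation's data or model.

```python
from collections import Counter

def context_vector(corpus_tokens, target, window=2):
    """Distributional representation of `target`: counts of words
    co-occurring with it within a +/- `window` token span."""
    vec = Counter()
    for i, tok in enumerate(corpus_tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(corpus_tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[corpus_tokens[j]] += 1
    return vec

def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u[k] * v[k] for k in keys)
    nu = sum(x * x for x in u.values()) ** 0.5
    nv = sum(x * x for x in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

# Invented forum snippets from two "time periods"
old = "the exploit kit targets browser flaws , the exploit kit is sold".split()
new = "zero day exploit traded fast , fresh exploit traded on forum".split()
drift = 1 - cosine(context_vector(old, "exploit"),
                   context_vector(new, "exploit"))
print(round(drift, 3))  # closer to 1 => the term's usage has shifted
```

A real NNLM would learn dense embeddings per period and compare them, but the drift signal is conceptually the same: low similarity between a term's period-specific representations flags emerging usage worth investigating.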
|
43 |
Metodología para estimar el impacto que generan las llamadas realizadas en un call center en la fuga de los clientes utilizando técnicas de text mining / Methodology for estimating the impact of call-center calls on customer churn using text mining techniques / Sepúlveda Jullian, Catalina. January 2015 (has links)
Industrial Civil Engineer / The telecommunications industry is in constant growth owing to the development of technology and people's growing need to be connected. For the same reason, it is highly competitive, and customers are free to choose the option that suits them best and meets their expectations.
Churn prediction, and with it customer retention, is thus a fundamental factor in a company's success. However, given the intense competition among firms, it becomes necessary to innovate in churn models by using new sources of information, such as calls to the call center. The general objective of this work is therefore to measure the impact of calls made to the call center on the prediction of customer churn.
To achieve this, data on customers' interactions with the call center are available, specifically the text of each call. To extract information about the content of the calls, a topic detection model was applied to the text in order to identify the subjects discussed and to use this information in the churn models.
The results obtained after building several logit churn-prediction models show that a model using both the call information and the customer information (demographic and transactional) is 8.7% higher in accuracy than one that does not use this new source of information. In addition, the model with both types of variables has a type I error 25% lower than a model that does not include the content of the calls.
From these analyses it can be concluded that call-center calls are indeed relevant and helpful for predicting customer churn, since they increase the model's predictive power and fit. They also provide new information about customer behaviour and make it possible to detect topics that may be associated with churn, enabling corrective actions to be taken.
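The comparison of logit models described in this abstract can be sketched generically: fit one logistic regression on customer attributes alone and one that adds a call-topic feature, then compare accuracy. Everything below (feature names, synthetic data, learning rate) is an invented toy setup, not the thesis's data or model.

```python
import math, random

def train_logit(X, y, epochs=400, lr=0.1):
    """Plain logistic regression fitted by per-sample gradient descent."""
    w = [0.0] * (len(X[0]) + 1)          # last weight is the intercept
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[-1] + sum(a * b for a, b in zip(w, xi))
            p = 1 / (1 + math.exp(-z))
            err = p - yi
            for j in range(len(xi)):
                w[j] -= lr * err * xi[j]
            w[-1] -= lr * err
    return w

def accuracy(w, X, y):
    hits = 0
    for xi, yi in zip(X, y):
        z = w[-1] + sum(a * b for a, b in zip(w, xi))
        hits += ((1 / (1 + math.exp(-z))) >= 0.5) == bool(yi)
    return hits / len(y)

# Synthetic customers: [tenure_years, monthly_spend] (+ complaint-topic share)
# Churners (y=1) tend to have calls dominated by the complaint topic.
random.seed(1)
X_demo, X_topic, y = [], [], []
for _ in range(200):
    churn = random.random() < 0.5
    tenure = random.gauss(2 if churn else 4, 1.5)
    spend = random.gauss(3, 1)
    p_complaint = random.gauss(0.8 if churn else 0.2, 0.15)
    X_demo.append([tenure, spend])
    X_topic.append([tenure, spend, p_complaint])
    y.append(1 if churn else 0)

acc_base = accuracy(train_logit(X_demo, y), X_demo, y)
acc_full = accuracy(train_logit(X_topic, y), X_topic, y)
print(acc_base, acc_full)  # the topic feature should lift accuracy
```

The design mirrors the thesis's evaluation logic: the lift from `acc_base` to `acc_full` is the measured contribution of the call-content (topic) information.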
|
44 |
Health Data Analytics: Data and Text Mining Approaches for Pharmacovigilance / Liu, Xiao. January 2016 (has links)
Pharmacovigilance is defined as the science and activities relating to the detection, assessment, understanding, and prevention of adverse drug events (WHO 2004). Post-approval adverse drug events are a major health concern. They account for about 700,000 emergency department visits, 120,000 hospitalizations, and $75 billion in medical costs annually (Yang et al. 2014). However, certain adverse drug events are preventable if detected early. Timely and accurate pharmacovigilance in the post-approval period is an urgent goal of the public health system. The availability of various sources of healthcare data for analysis in recent years opens new opportunities for data-driven pharmacovigilance research. In an attempt to leverage the emerging healthcare big data, pharmacovigilance research faces a few challenges. Most studies in pharmacovigilance focus on structured and coded data, and therefore miss important textual data from patient social media and clinical documents in EHRs. Most prior studies develop drug safety surveillance systems using a single data source with only one data mining algorithm. The performance of such systems is hampered by bias in the data and the pitfalls of the data mining algorithms adopted. In my dissertation, I address two broad research questions: 1) How do we extract rich adverse drug event related information from textual data for active drug safety surveillance? 2) How do we design an integrated pharmacovigilance system to improve the decision-making process for drug safety regulatory intervention? To these ends, the dissertation comprises three essays. The first essay examines how to develop a high-performance information extraction framework for patient reports of adverse drug events in health social media. I found that medical entity extraction, drug-event relation extraction, and report source classification are necessary components for this task.
In the second essay, I address the scalability issue of using social media for pharmacovigilance by proposing a distant supervision approach for information extraction. In the last essay, I develop a MetaAlert framework for pharmacovigilance with advanced text mining and data mining techniques to provide timely and accurate detection of adverse drug reactions. Models, frameworks, and design principles proposed in these essays advance not only pharmacovigilance research, but also more broadly contribute to health IT, business analytics, and design science research.
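The dissertation's extractors for drug-event relations are learned from data; as a much simpler stand-in, a rule-based sketch shows the shape of the output such a component produces: (drug, adverse event) pairs from free text. The lexicons and trigger phrases below are hard-coded toy assumptions, not the dissertation's resources.

```python
import re

# Toy lexicons: the dissertation learns these from data; here they are
# hard-coded purely for illustration.
DRUGS = {"lipitor", "zoloft"}
EVENTS = {"nausea", "muscle pain", "insomnia"}

def extract_ade(sentence):
    """Emit (drug, event) pairs when a causal trigger phrase links a
    known drug mention to a known adverse-event mention."""
    s = sentence.lower()
    pairs = []
    for drug in DRUGS:
        for event in EVENTS:
            # pattern: drug ... caused / gave me / led to ... event
            pattern = rf"{drug}\b.*\b(caused|gave me|led to)\b.*\b{event}\b"
            if re.search(pattern, s):
                pairs.append((drug, event))
    return pairs

post = "Started Lipitor last month and it gave me terrible muscle pain."
print(extract_ade(post))
```

A learned extractor replaces both the fixed lexicons (medical entity extraction) and the trigger patterns (relation extraction), which is exactly the decomposition the first essay identifies.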
|
45 |
Dolování dat z příchozích zpráv elektronické pošty / Data mining from incoming e-mail messages / Šebesta, Jan. January 2009 (has links)
In the present work we study possibilities of automatic sorting of incoming e-mail communication. Our primary goal is to distinguish information about upcoming workshops and conferences, job offers and published books. We aim to develop a tool that mines this information from data in professional mailing lists. Offers in the mailing lists come in HTML, RTF or plain-text format, but the information in them is written in ordinary natural language. We are developing the system to use text mining methods to extract the information and save it in structured form, so that we can then work with it. We examine how users handle such mail and apply this knowledge in the development. We solve the problems of obtaining the messages, detecting their language and encoding, and estimating the type of each message. After recognising the information a message carries, we are able to mine the data. Finally, we save the mined information to a database, which allows us to display it in a well-arranged way and to sort and search it according to the user's needs.
|
46 |
Dolování dat z příchozích zpráv elektronické pošty / Data mining from incoming e-mail messages / Šebesta, Jan. January 2011 (has links)
We study possibilities of automatic sorting of incoming e-mails. Our primary goal is to distinguish information about upcoming workshops and conferences, job offers and published books. We develop a mining tool for extracting this information from data originating in profession-specific mailing lists. Offers in the mailing lists come in HTML, RTF or plain-text format, and the messages are written in ordinary natural language. We have developed the system to use text mining methods to extract the information and save it in structured form, so that we can then work with it. We examine how users handle such mail and apply this knowledge in the development. We solve the problems of obtaining the messages, detecting their language and encoding, and estimating the type of each message. After recognising the information a message carries, we are able to mine the data. Finally, we save the mined information to a database, which allows us to display it in a well-arranged way and to sort and search it according to the user's needs.
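A minimal sketch of the message-type estimation step (conference vs. job offer vs. book announcement) might score each category by cue-word overlap. The cue-word lists below are invented for illustration and are not the thesis's actual method, which would also handle language detection and encoding first.

```python
CATEGORY_KEYWORDS = {
    "conference": {"workshop", "conference", "deadline", "submission", "cfp"},
    "job":        {"position", "vacancy", "salary", "phd", "postdoc"},
    "book":       {"book", "published", "isbn", "chapter", "edition"},
}

def classify_message(text):
    """Score each category by how many of its cue words appear in the
    message; return the best category, or 'other' when nothing matches."""
    words = set(text.lower().split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

mail = "CFP: submission deadline for the text mining workshop is May 1"
print(classify_message(mail))
```

After a message's type is estimated this way, a type-specific extractor can pull out the structured fields (dates, venues, ISBNs) that get saved to the database.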
|
47 |
Text Mining of Supreme Administrative Court Jurisdictions / Feinerer, Ingo; Hornik, Kurt. January 2007 (has links) (PDF)
Within the last decade text mining, i.e., extracting sensitive information from text corpora, has become a major factor in business intelligence. The automated textual analysis of law corpora is highly valuable because of its impact on a company's legal options and the sheer volume of available jurisdictions. The study of supreme court jurisdiction and international law corpora is equally important due to its effects on business sectors. In this paper we use text mining methods to investigate Austrian supreme administrative court jurisdictions concerning dues and taxes. We analyze the law corpora using R with the new text mining package tm. Applications include clustering the jurisdiction documents into groups modeling tax classes (like income or value-added tax) and identifying jurisdiction properties. The findings are compared to results obtained by law experts. / Series: Research Report Series / Department of Statistics and Mathematics
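The paper's analysis uses R and the tm package; as a language-neutral sketch of the underlying representation, the following computes tf-idf vectors and cosine similarities for invented jurisdiction snippets, showing two income-tax documents grouping together against a value-added-tax one. The snippets and the exact weighting are assumptions made for illustration, not the paper's pipeline.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by tf * idf, with idf = log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))          # document frequency per term
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented jurisdiction snippets: two income-tax rulings and one VAT ruling
docs = [
    "ruling on income tax deduction for commuting expenses",
    "income tax ruling on deduction of home office expenses",
    "value added tax rate applied to imported goods",
]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]), cosine(v[0], v[2]))
```

Note how the shared but uninformative term "tax" gets zero idf weight here (it appears in every document), so the similarity structure is driven by the tax-class-specific vocabulary, which is what makes clustering into tax classes work.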
|
48 |
The Role of Work Experiences in College Student Leadership Development: Evidence From a National Dataset and a Text Mining Approach to Examining Beliefs About Leadership / Lewis, Jonathan Scott. January 2017 (has links)
Thesis advisor: Heather Rowan-Kenyon / Paid employment is one of the most common extracurricular activities among full-time undergraduates, and an array of studies has attempted to measure its impact. Methodological concerns with the extant literature, however, make it difficult to draw reliable conclusions. Furthermore, the research on working college students has little to say about relationships between employment and leadership development, a key student learning outcome. This study addressed these gaps in two ways, using a national sample of 77,489 students from the 2015 Multi-Institutional Study of Leadership. First, it employed quasi-experimental methods and hierarchical linear modeling (HLM) to investigate relationships between work variables (i.e., working status, work location, and hours worked) and both capacity and self-efficacy for leadership. Work location for students employed on-campus was disaggregated into 14 functional departments to allow for more nuanced analysis. Second, this study used text mining methods to examine the language that participants used to define leadership, which enabled a rich comparison between students' conceptualizations and contemporary leadership theory. Results from HLM analysis suggested that working for pay is associated with lower self-reported leadership capacity, as defined by the social change model of leadership development, and that this relationship varies by workplace location and across institutional characteristics. The association between working status and self-efficacy for leadership was found to be practically non-significant, and hours worked per week were unrelated to either outcome. Results from text mining analysis suggested that most students conceptualize leadership using language that resonates with the industrial paradigm of leadership theory: leadership resides in a person with authority, who enacts specific behaviors and directs a group toward a goal.
Disaggregated findings suggested that students who work off-campus consider leadership differently, using language consonant with contemporary, post-industrial scholarship: leadership is a dynamic, relational, non-coercive process that results in personal growth and positive change. In sum, the findings both echo and challenge aspects of existing research on leadership and working college students. Future research should explore off-campus work environments in greater detail, while practitioners and scholars who supervise students should aim to infuse post-industrial conceptualizations into on-campus work environments. / Thesis (PhD) — Boston College, 2017. / Submitted to: Boston College. Lynch School of Education. / Discipline: Educational Leadership and Higher Education.
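One simple way to compare the definitional language of two student groups, in the spirit of the text mining analysis above, is to rank terms by their smoothed relative-frequency ratio between groups. The token lists below are invented caricatures of "industrial" and "post-industrial" leadership vocabulary, not the study's data or its actual method.

```python
from collections import Counter

def distinctive_terms(group_a, group_b, top=3):
    """Rank terms by the ratio of their relative frequency in group A
    versus group B, with add-one smoothing so unseen terms don't
    divide by zero."""
    ca, cb = Counter(group_a), Counter(group_b)
    na, nb = sum(ca.values()), sum(cb.values())
    vocab = set(ca) | set(cb)
    ratio = {t: ((ca[t] + 1) / (na + len(vocab))) /
                ((cb[t] + 1) / (nb + len(vocab))) for t in vocab}
    return sorted(ratio, key=ratio.get, reverse=True)[:top]

# Invented token samples from two groups' leadership definitions
industrial = ["leader", "authority", "directs", "group", "goal",
              "authority", "leader"]
postindustrial = ["leadership", "relational", "process", "change",
                  "growth", "change"]
print(distinctive_terms(industrial, postindustrial, top=2))
```

Terms that surface with high ratios for one group (here, "leader" and "authority") are the vocabulary that most distinguishes its conceptualization of leadership from the other group's.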
|
49 |
Généralisation de données textuelles adaptée à la classification automatique / Toward new features for text mining / Tisserant, Guillaume. 14 April 2015 (has links)
The classification of textual documents is a relatively old task. Very early on, numerous documents of various kinds were grouped together in order to centralise knowledge, and classification and indexing systems were created to make it easy to find documents according to readers' needs. With the multiplication of documents and the advent of computing, then of the internet, building text classification systems has become a crucial challenge. Yet textual data, by nature complex and rich, are difficult to process automatically. In this context, this thesis proposes an original methodology for organising textual information so as to ease access to it. Our approaches to automatic text classification and to semantic information extraction make it possible to retrieve sought information quickly and accurately. More precisely, this manuscript presents new forms of text representation that facilitate processing for automatic classification tasks. A method for the partial generalisation of textual data (the GenDesc approach), based on statistical and morpho-syntactic criteria, is proposed. The thesis also addresses the construction of phrases and the use of semantic information to improve document representation. We demonstrate through numerous experiments the relevance and genericity of our proposals, which improve classification results. Finally, in the context of rapidly growing social networks, a method for the automatic generation of semantically meaningful hashtags is proposed. Our approach relies on statistical measures, semantic resources and the use of syntactic information.
The proposed hashtags can then be exploited for information retrieval tasks over large volumes of data.
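The hashtag-generation method described above combines statistical measures with semantic and syntactic resources; a drastically simplified, purely statistical sketch picks the word in a post that is most surprising relative to a background frequency table. The stopword list, background counts and example post are all invented for illustration and do not reflect the thesis's actual algorithm.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "on", "of", "and", "to", "in", "for"}

def suggest_hashtag(text, background_freq):
    """Pick the in-message word that is most over-represented relative
    to a background frequency table, and format it as a hashtag."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    candidates = [w for w in words if w and w not in STOPWORDS]
    scores = Counter()
    for w in candidates:
        # local count divided by (smoothed) background count
        scores[w] = candidates.count(w) / (background_freq.get(w, 0) + 1)
    best = max(scores, key=scores.get)
    return "#" + best.capitalize()

# Invented background counts from a hypothetical tweet collection
background = {"new": 50, "today": 40, "trying": 30, "my": 25,
              "amazing": 20, "roof": 5, "solar": 3, "panel": 3}
post = "Trying the new perovskite solar panel on my roof today, amazing!"
print(suggest_hashtag(post, background))
```

Common words score low because they are frequent everywhere; the rare, topical term wins, which is the statistical core that the thesis enriches with semantic resources and syntactic filtering.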
|
50 |
Machine Learning Algorithms for the Analysis of Social Media and Detection of Malicious User Generated Content / Unknown Date (has links)
One of the defining characteristics of the modern Internet is its massive connectedness, with information and human connection simply a few clicks away. Social media and online retailers have revolutionized how we communicate and purchase goods or services. User generated content on the web, through social media, plays a large role in modern society; Twitter has been at the forefront of political discourse, with politicians choosing it as their platform for disseminating information, while websites like Amazon and Yelp allow users to share their opinions on products via online reviews. The information available through these platforms can provide insight into a host of relevant topics through the process of machine learning. Specifically, this process involves text mining for sentiment analysis, which is an application domain of machine learning involving the extraction of emotion from text.
Unfortunately, there are still those with malicious intent, and with the changes to how we communicate and conduct business come changes to their malicious practices. Social bots and fake reviews plague the web, providing incorrect information and swaying the opinion of unaware readers. The detection of these false users or posts from reading the text is difficult, if not impossible, for humans. Fortunately, text mining provides us with methods for the detection of harmful user generated content.
This dissertation expands the current research in sentiment analysis, fake online review detection and election prediction. We examine cross-domain sentiment analysis using tweets and reviews. Novel techniques combining ensemble and feature selection methods are proposed for the domain of online spam review detection. We investigate the ability of the Twitter platform to predict the United States 2016 presidential election. In addition, we determine how social bots influence this prediction. / Includes bibliography. / Dissertation (Ph.D.)--Florida Atlantic University, 2018. / FAU Electronic Theses and Dissertations Collection
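The ensemble idea mentioned above for spam-review detection can be sketched as a majority vote over weak detectors. The three heuristics below are invented toys chosen only to make the voting mechanics concrete; the dissertation combines learned classifiers with feature selection, not hand-written rules.

```python
def vote_exclamations(review):
    """Weak detector: flag reviews with heavy exclamation use."""
    return review.count("!") >= 3

def vote_superlatives(review):
    """Weak detector: flag reviews stuffed with hype words."""
    hype = {"best", "amazing", "perfect", "incredible", "worst"}
    words = review.lower().split()
    return sum(w.strip(".,!") in hype for w in words) >= 2

def vote_short_and_extreme(review):
    """Weak detector: flag very short, excited reviews."""
    return len(review.split()) < 8 and "!" in review

def ensemble_is_fake(review, voters=(vote_exclamations, vote_superlatives,
                                     vote_short_and_extreme)):
    """Majority vote across weak detectors: the core mechanic behind
    ensemble methods for spam-review detection."""
    votes = sum(v(review) for v in voters)
    return votes * 2 > len(voters)

print(ensemble_is_fake("Best product ever!!! Amazing, perfect, buy now!!!"))
print(ensemble_is_fake("Solid blender; crushed ice well, though the lid is a bit loose."))
```

The appeal of the ensemble is that each voter can be individually unreliable; combining them (and selecting which features each sees) reduces the variance of the final decision.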
|