101

Cluster Analysis of Discussions on Internet Forums / Klusteranalys av Diskussioner på Internetforum

Holm, Rasmus January 2016 (has links)
The growth of textual content on internet forums over the last decade has been immense, leaving users struggling to find relevant information in a convenient and quick way. The activity of finding information in large data collections is known as information retrieval, and many tools and techniques have been developed to tackle common problems. Cluster analysis is a technique for grouping similar objects into smaller groups (clusters) such that the objects within a cluster are more similar to each other than to objects in other clusters. We have investigated two clustering algorithms, Graclus and Non-Exhaustive Overlapping k-means (NEO-k-means), on textual data taken from Reddit, a social network service. One difficulty with both algorithms is that they take an input parameter controlling how many clusters to find. We have used a greedy modularity maximization algorithm to estimate the number of clusters present in discussion threads. We have shown that it is possible to find subtopics within discussions and that, in terms of execution time, Graclus has a clear advantage over NEO-k-means.
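To make the parameter-estimation step concrete, here is a minimal sketch (not the thesis code) of estimating the number of clusters in a comment-similarity graph via greedy modularity maximization with networkx; the toy graph and its weights are hypothetical.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical similarity graph: nodes are forum comments, weighted edges
# connect comments whose textual similarity exceeds some threshold.
G = nx.Graph()
G.add_weighted_edges_from([
    (0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.7),    # one discussion subtopic
    (3, 4, 0.85), (4, 5, 0.75), (3, 5, 0.6),  # another subtopic
    (2, 3, 0.1),                              # weak cross-topic link
])

# Greedy modularity maximization groups the graph into communities;
# the community count serves as the k fed to a k-requiring algorithm.
communities = greedy_modularity_communities(G, weight="weight")
k_estimate = len(communities)
print(f"Estimated number of clusters: {k_estimate}")
```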
102

Towards Secure and Trustworthy Cyberspace: Social Media Analytics on Hacker Communities

Li, Weifeng January 2017 (has links)
Social media analytics is a critical research area spawned by the increasing availability of rich and abundant online user-generated content. So far, social media analytics has had a profound impact on organizational decision making in many respects, including product and service design, market segmentation, customer relationship management, and more. However, the cybersecurity sector lags behind other sectors in benefiting from the business intelligence offered by social media analytics. Given the role of hacker communities in cybercrime and their prevalence as venues for exchanging hacking knowledge and tools, there is an urgent need for hacker social media analytics capable of gathering cyber threat intelligence from these communities. My dissertation addresses two broad research questions: (1) How do we help organizations gain cyber threat intelligence through social media analytics on hacker communities? And (2) how do we advance social media analytics research by developing innovative algorithms and models for hacker communities? Using cyber threat intelligence as a guiding principle, emphasis is placed on the two major components of hacker communities: threat actors and their cybercriminal assets. To these ends, the dissertation is arranged in two parts. The first part focuses on gathering cyber threat intelligence on threat actors. In the first essay, I identify and profile two types of key sellers in hacker communities: malware sellers and stolen data sellers, both of which are responsible for data breach incidents. In the second essay, I develop a method for recovering social interaction networks, which can be further used for detecting major hacker groups and identifying their specialties and key members. The second part seeks to develop cyber threat intelligence on cybercriminal assets. In the third essay, a novel supervised topic model is proposed to address the language complexities in hacker communities. In the fourth essay, I propose an innovative emerging topic detection model. The models, frameworks, and design principles developed in this dissertation not only advance social media analytics research, but also contribute broadly to IS security applications and design science research.
103

Probabilistic Models of Topics and Social Events

Wei, Wei 01 December 2016 (has links)
Structured probabilistic inference has been shown to be useful in modeling complex latent structures of data. One successful application of this technique is the discovery of latent topical structures in text data, usually referred to as topic modeling. With the recent popularity of mobile devices and social networking, we can now easily acquire text data attached to meta information, such as geo-spatial coordinates and time stamps. This metadata can provide rich and accurate information that is helpful in answering many research questions related to spatial and temporal reasoning. However, such data must be treated differently from text data. For example, spatial data is usually organized over a two-dimensional region, while temporal information can exhibit periodicities. While some existing work in the topic modeling community utilizes this meta information, those models largely focus on incorporating metadata into text analysis rather than providing models that make full use of the joint distribution of meta information and text. In this thesis, I propose the event detection problem, a multidimensional latent clustering problem on spatial, temporal and topical data. I start with a simple parametric model to discover independent events using geo-tagged Twitter data. The model is then improved in two directions. First, I augment the model with the Recurrent Chinese Restaurant Process (RCRP) to discover events that are dynamic in nature. Second, I study a model that can detect events using data from multiple media sources, characterizing how different media differ in reported event times and linguistic patterns. The approaches studied in this thesis are largely based on Bayesian nonparametric methods in order to deal with streaming data and an unpredictable number of clusters. The research not only serves the event detection problem itself but also sheds light on the more general problem of structured clustering in spatial, temporal and textual data.
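As a minimal illustration of the Bayesian nonparametric idea the thesis builds on (not the dissertation's RCRP model), the sketch below samples cluster assignments from a plain Chinese Restaurant Process prior, where the number of clusters is not fixed in advance; the concentration parameter alpha is an assumed value.

```python
import random

def crp_assignments(n_items, alpha=1.0, seed=0):
    """Sample cluster assignments from a Chinese Restaurant Process prior."""
    random.seed(seed)
    counts = []        # counts[c] = number of items already in cluster c
    assignments = []
    for i in range(n_items):
        # Probability of joining existing cluster c is counts[c] / (i + alpha);
        # probability of opening a new cluster is alpha / (i + alpha).
        weights = counts + [alpha]
        c = random.choices(range(len(weights)), weights=weights)[0]
        if c == len(counts):
            counts.append(1)   # a new cluster (event) is created
        else:
            counts[c] += 1
        assignments.append(c)
    return assignments

print(crp_assignments(20))  # number of distinct clusters emerges from the data
```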
104

Personalized Medicine through Automatic Extraction of Information from Medical Texts

Frunza, Oana Magdalena January 2012 (has links)
The wealth of medical-related information available today gives rise to a multidimensional source of knowledge. Research discoveries published in prestigious venues, electronic health record data, discharge summaries, clinical notes, etc., all represent important medical information that can assist in the medical decision-making process. The challenge of accessing and using such vast and diverse sources of data lies in the ability to distil and extract reliable and relevant information. Computer-based tools that use natural language processing and machine learning techniques have proven to help address such challenges. This work proposes automatic, reliable solutions for tasks that can help achieve personalized medicine, a medical practice that brings together general medical knowledge and case-specific medical information. Phenotypic medical observations, along with data coming from test results, are not enough when assessing and treating a medical case. Genetic, lifestyle, background and environmental data also need to be taken into account in the medical decision process. This thesis aims to demonstrate that natural language processing and machine learning techniques represent reliable solutions for solving important medical-related problems. From the numerous research problems that need to be answered when implementing personalized medicine, the scope of this thesis is restricted to four: 1. Automatic identification of obesity-related diseases using only textual clinical data; 2. Automatic identification of relevant abstracts of published research to be used for building systematic reviews; 3. Automatic identification of gene functions based on textual data of published medical abstracts; 4. Automatic identification and classification of important relations between medical concepts in clinical and technical data. This investigation into automatic solutions for achieving personalized medicine through information identification and extraction focused on individual problems that can later be linked together in a puzzle-building manner. A diverse representation technique following a divide-and-conquer methodology proves to be the most reliable way to build automatic models that solve the above tasks. The methodologies that I propose are supported by in-depth research experiments and thorough discussions and conclusions.
105

Monitoring Tweets for Depression to Detect At-Risk Users

Jamil, Zunaira January 2017 (has links)
According to the World Health Organization, mental health is an integral part of health and well-being. Mental illness can affect anyone, rich or poor, male or female. One such example of mental illness is depression. In Canada, 5.3% of the population has presented a depressive episode in the past 12 months. Depression is difficult to diagnose, resulting in high under-diagnosis. Diagnosing depression is often based on self-reported experiences, behaviors reported by relatives, and a mental status examination. Currently, authorities use surveys and questionnaires to identify individuals who may be at risk of depression. This process is time-consuming and costly. We propose an automated system that can identify at-risk users from their public social media activity, specifically on Twitter. To achieve this goal we trained a user-level classifier using a Support Vector Machine (SVM) that can detect at-risk users with a recall of 0.8750 and a precision of 0.7778. We also trained a tweet-level classifier that predicts whether a tweet indicates distress. This task was much more difficult due to the imbalanced data: in the dataset that we labeled, 5% of tweets were distress tweets and 95% were non-distress tweets. To handle this class imbalance, we used undersampling methods. The resulting SVM classifier performs with a recall of 0.8020 and a precision of 0.1237. Our system can be used by authorities to find a focused group of at-risk users. It is not a platform for labeling an individual as a patient with depression, but only a platform for raising an alarm so that the relevant authorities can further analyze the predicted user to confirm his or her state of mental health. We respect the ethical boundaries relating to the use of social media data and therefore do not use any user identification information in our research.
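A minimal sketch of the described setup, assuming scikit-learn and toy data (this is not the thesis pipeline): the majority non-distress class is randomly undersampled, then a linear SVM is trained over TF-IDF features and scored with precision and recall.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score

# Hypothetical labeled tweets: 1 = distress, 0 = non-distress (majority class).
tweets = ["feeling hopeless again", "great day at the park",
          "can't get out of bed today", "loving this new song",
          "coffee with friends later", "excited about the weekend"]
labels = np.array([1, 0, 1, 0, 0, 0])

X = TfidfVectorizer().fit_transform(tweets)

# Undersample the majority class so both classes are the same size.
rng = np.random.default_rng(0)
pos = np.where(labels == 1)[0]
neg = rng.choice(np.where(labels == 0)[0], size=len(pos), replace=False)
idx = np.concatenate([pos, neg])

clf = SVC(kernel="linear").fit(X[idx], labels[idx])
pred = clf.predict(X)
print("recall:", recall_score(labels, pred),
      "precision:", precision_score(labels, pred, zero_division=0))
```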
106

Extraction and representation of key characteristics from epidemiological literature

Karystianis, George January 2014 (has links)
Epidemiological studies are rich in information that could improve the understanding of the conceptual complexity of a health problem, and they are important sources for evidence-based medicine. However, epidemiologists experience difficulties in recognising and aggregating key characteristics in related research due to the increasing number of published articles. The main aim of this dissertation is to explore how text mining techniques can assist epidemiologists in identifying important pieces of information and in detecting and integrating key knowledge for further research and exploration via concept maps. Concept maps are widely used in medicine for exploration and representation as a relatively formal knowledge representation model that is easy to design and understand. To support this aim, we have developed a methodology for the extraction of key epidemiological characteristics from all types of epidemiological research articles in order to visualise, explore and aggregate concepts related to a health care problem. A generic rule-based approach was designed and implemented for the identification of mentions of six key characteristics: study design, population, exposure, outcome, covariate and effect size. The system also relies on automatic term recognition and biomedical dictionaries to identify concepts of interest. In order to facilitate knowledge integration and aggregation, extracted characteristics are further normalised and mapped to existing resources. Study design mentions are mapped to an expanded version of the Ontology of Clinical Research (OCRe), whereas exposure, outcome and covariate mentions are mapped to Unified Medical Language System (UMLS) semantic groups and categories. Population mentions are mapped to age groups, gender and nationality/ethnicity, and effect size mentions are normalised with regard to the metric used, the confidence interval and the related concept. The evaluation has shown reliable results, with an average micro F-score of 87% for recognition of epidemiological mentions and 91% for normalisation. Normalised concepts are further organised in an automatically generated concept map, which has three sections for exposures, outcomes and covariates. To demonstrate the potential of the developed methodology, it was applied to a large-scale corpus of epidemiological research abstracts related to obesity. Obesity was chosen as a case study since it has emerged as one of the most important global health problems of the 21st century. Using the concepts extracted from the corpus, we have built a searchable database of key epidemiological characteristics explored in obesity research and an automatically generated concept map representing the normalised exposures, outcomes and covariates. An epidemiological workbench (EpiTeM) was designed to enable further exploration and inspection of the normalised extracted data, with direct links to the literature. The generated results also allow exploration of trends in obesity research and can facilitate understanding of its conceptual complexity. For example, we have noted the most frequent concepts and the most common pairs of characteristics studied in obesity epidemiology. Finally, this thesis also discusses a number of challenges for text mining of epidemiological literature and suggests various opportunities for future work.
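As an illustration of what a rule-based extractor for one characteristic might look like (an assumption, not the thesis system), the sketch below matches study-design mentions in an abstract with simple lexical patterns; the pattern list and example abstract are illustrative only.

```python
import re

# Hypothetical lexical patterns for one characteristic: study design.
STUDY_DESIGN_PATTERNS = [
    r"\bcohort study\b",
    r"\bcase-control study\b",
    r"\bcross-sectional (study|survey)\b",
    r"\brandomi[sz]ed controlled trial\b",
]

def extract_study_design(text):
    """Return study-design mentions matched by the rule patterns."""
    mentions = []
    for pattern in STUDY_DESIGN_PATTERNS:
        mentions += [m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE)]
    return mentions

abstract = ("We conducted a cross-sectional survey of 2,000 adults to examine "
            "the association between sugary drink intake and obesity.")
print(extract_study_design(abstract))  # ['cross-sectional survey']
```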
107

Automatic structure and keyphrase analysis of scientific publications

Constantin, Alexandru January 2014 (has links)
Purpose. This work addresses an escalating problem within the realm of scientific publishing that stems from accelerated publication rates of article formats that are difficult to process automatically. The amount of manual labour required to organise a comprehensive corpus of relevant literature has long been impractical. This has, in effect, reduced research efficiency and delayed scientific advancement. Two complementary approaches meant to alleviate this problem are detailed and improved beyond the current state of the art, namely logical structure recovery of articles and keyphrase extraction. Methodology. The first approach targets the issue of flat-format publishing. It performs a structural analysis of the camera-ready PDF article and recognises its fine-grained organisation into logical units. The second approach is a keyphrase extraction algorithm that relies on rhetorical information from the recovered structure to better contour an article's true points of focus. An account of the scientific article's function, content and structure is provided, along with insights into how different logical components such as section headings or the bibliography can be automatically identified and utilised for higher-quality keyphrase extraction. Findings. Structure recovery can be carried out independently of an article's formatting specifics by exploiting conventional dependencies between logical components. In addition, access to an article's logical structure is beneficial across term extraction approaches, reducing input noise and facilitating the emphasis of regions of interest. Value. The first part of this work details a novel method for recovering the rhetorical structure of scientific articles that is competitive with state-of-the-art machine learning techniques, yet requires no layout-specific tuning or prior training. The second part showcases a keyphrase extraction algorithm that outperforms other solutions on an established benchmark, yet does not rely on collection statistics or external knowledge sources in order to be proficient.
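A minimal sketch of how recovered logical structure could feed keyphrase extraction (an illustration, not the thesis algorithm): candidate phrases are scored by frequency with a boost when they occur in rhetorically salient sections such as the title or headings; the section weights and document are assumed values.

```python
from collections import Counter

# Hypothetical weights reflecting the rhetorical salience of each section type.
SECTION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "heading": 2.0, "body": 1.0}

def score_keyphrases(sections):
    """sections: list of (section_type, list_of_candidate_phrases)."""
    scores = Counter()
    for section_type, phrases in sections:
        for phrase in phrases:
            scores[phrase] += SECTION_WEIGHTS.get(section_type, 1.0)
    return scores.most_common()

doc = [
    ("title",   ["keyphrase extraction"]),
    ("heading", ["logical structure", "keyphrase extraction"]),
    ("body",    ["logical structure", "pdf parsing", "keyphrase extraction"]),
]
print(score_keyphrases(doc))
```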
108

Improving an evaluation and co-creation process based on text-analytics techniques

Rojas Valenzuela, Manuel Humberto January 2016 (has links)
Magíster en Ingeniería de Negocios con Tecnologías de Información / Entrepreneurship worldwide, and particularly in Chile, faces a constant problem: beyond the inherent difficulties of starting a business, entrepreneurs need access to venture capital to carry a business idea forward. Considering that roughly 96% of formally registered companies in Chile are micro and small enterprises (Mypes), which because of their size lack direct access to traditional sources of financing, competing for and winning one of the public seed-capital funds (CORFO, Indap, Capital Abeja or Crece) often makes the difference between keeping a venture alive and abandoning it for lack of resources. CSASESORES is an organization created in 2011 in response to this need, with the goal of contributing to the growth of start-ups and Mypes in Chile. In its short history, the organization has itself won one of the entrepreneurship seed-capital funds in the Metropolitan Region and has actively contributed to the development of more than 50 business ideas that went on to win SERCOTEC seed-capital funding. To support this initiative, a project was designed to lay the foundations for the growing organization's business-idea management processes and to implement technological solutions that automate one of the most time-consuming processes: the evaluation, understanding and improvement of ventures that are ultimately submitted to seed-capital fund competitions. The preliminary results are encouraging: applying text mining and Latent Semantic Analysis identified about ten thematic clusters during the evaluation of the strengths and weaknesses of seed-capital initiatives. In addition, a set of close semantic relationships was uncovered, both among the strengths and among the weaknesses of the evaluated initiatives; these relationships are visible and documented thanks to the use of Latent Semantic Analysis. / 14/7/2021
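A minimal sketch of the described Latent Semantic Analysis step, assuming scikit-learn and hypothetical evaluator comments (not the project's implementation): TF-IDF features are reduced with truncated SVD and grouped with k-means into thematic clusters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Hypothetical free-text evaluation comments on strengths and weaknesses.
comments = [
    "clear value proposition and target market",
    "strong founding team with sales experience",
    "weak financial projections and cash flow plan",
    "no evidence of customer validation",
    "well defined distribution channels",
    "costs are underestimated in the budget",
]

# TF-IDF -> truncated SVD (the LSA step) -> k-means clustering of comments.
lsa = make_pipeline(TfidfVectorizer(),
                    TruncatedSVD(n_components=2),
                    KMeans(n_clusters=2, n_init=10))
print(lsa.fit_predict(comments))  # cluster label per comment
```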
109

Unsupervised Method for Disease Named Entity Recognition

Almutairi, Abeer N. 06 November 2019 (has links)
Diseases take a central role in biomedical research; many studies aim to enable access to disease information by designing named entity recognition models that make use of the available information. Disease recognition is a problem that has been tackled by various approaches, of which the best known are the lexical and supervised approaches. However, these approaches have drawbacks: their performance is affected by the amount of human-annotated data available, and lexical approaches cannot distinguish between real mentions of diseases and mentions of other entities that share the same name or acronym. The challenge of this project is to design a named entity recognizer that combines the strengths of the lexical and supervised approaches. We demonstrate that our model can accurately identify disease name mentions in text by using word embeddings to capture the context of each mention, which enables the model to distinguish whether it is a real disease mention or not. We evaluate our model on a gold standard data set, on which it showed a high precision of 84% and an accuracy of 96%. Finally, we compare the performance of our model to different statistical named entity recognition models, and the results show that our model outperforms the unsupervised lexical approaches.
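A minimal sketch of the two-step idea described above (assumed, not the thesis system): a dictionary lookup proposes candidate disease mentions, and a classifier over averaged context word embeddings decides whether each candidate is a genuine disease mention; the embeddings, dictionary and training examples are toy values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 2-d "embeddings": first axis ~ clinical context, second axis ~ sports context.
emb = {
    "patient":   np.array([1.0, 0.0]), "diagnosed": np.array([1.0, 0.1]),
    "symptoms":  np.array([0.9, 0.0]), "with":      np.array([0.2, 0.2]),
    "football":  np.array([0.0, 1.0]), "match":     np.array([0.0, 0.9]),
    "scored":    np.array([0.1, 1.0]), "points":    np.array([0.1, 0.9]),
}
disease_dict = {"als", "ms"}  # toy dictionary of disease names/acronyms

def context_vector(tokens, i, window=2):
    """Average the embeddings of words surrounding the candidate at position i."""
    ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    vecs = [emb[w] for w in ctx if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

# Toy training data: 1 = genuine disease mention, 0 = other use of the acronym.
train = [(["patient", "diagnosed", "with", "als"], 3, 1),
         (["patient", "with", "ms", "symptoms"], 2, 1),
         (["ms", "scored", "points"], 0, 0),
         (["als", "football", "match"], 0, 0)]
X = np.array([context_vector(t, i) for t, i, _ in train])
y = np.array([label for _, _, label in train])
clf = LogisticRegression().fit(X, y)

tokens = ["patient", "diagnosed", "with", "ms"]
for i, word in enumerate(tokens):
    if word in disease_dict:  # dictionary lookup proposes a candidate mention
        real = clf.predict([context_vector(tokens, i)])[0]
        print(word, "-> real disease mention" if real else "-> not a disease mention")
```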
110

Automatic Protein Function Annotation Through Text Mining

Toonsi, Sumyyah 25 August 2019 (has links)
The knowledge of a protein's function is essential to many studies in molecular biology, genetic experiments and protein-protein interactions. The Gene Ontology (GO) captures gene products' functions in classes and establishes relationships between them. Manually annotating proteins with GO functions from the biomedical literature is a tedious process which calls for automation. We develop a novel, dictionary-based method to annotate proteins with functions from text. We extract text-based features from words matched against a dictionary of GO class descriptions. Since classes are included upon any word match with their class description, negative samples far outnumber positive ones. To mitigate this imbalance, we apply strict rules before weakly labeling the dataset according to the curated annotations. Furthermore, we discard samples of low statistical evidence and train a logistic regression classifier. The results of a 5-fold cross-validation show a high precision of 91% and 96% accuracy in the best performing fold. The worst fold showed a precision of 80% and an accuracy of 95%. We conclude by explaining how this method can be used for similar annotation problems.
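A minimal sketch of the evaluation step described above, assuming scikit-learn and synthetic data (not the thesis features): a logistic regression classifier is scored with 5-fold cross-validation on precision and accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-ins for the dictionary-match features and weak labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

scores = cross_validate(LogisticRegression(), X, y, cv=5,
                        scoring=["precision", "accuracy"])
print("precision per fold:", scores["test_precision"])
print("accuracy per fold: ", scores["test_accuracy"])
```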
