271 |
Development of a Hepatitis C Virus knowledgebase with computational prediction of functional hypothesis of therapeutic relevance
Kojo, Kwofie Samuel. January 2011 (has links)
Philosophiae Doctor - PhD / To ameliorate Hepatitis C Virus (HCV) therapeutic and diagnostic challenges requires robust intervention strategies, including approaches that leverage the plethora of rich data published in the biomedical literature to gain greater understanding of HCV pathobiological mechanisms. The multitude of metadata originating from HCV clinical trials, as well as from low- and high-throughput experiments embedded in text corpora, can be mined as data sources for the implementation of HCV-specific resources. HCV-customized resources may support the generation of worthy and testable hypotheses and reveal potential research clues to augment the pursuit of efficient diagnostic biomarkers and therapeutic targets. This thesis reports the development of two freely available HCV-specific web-based resources: (i) the Dragon Exploratory System on Hepatitis C Virus (DESHCV), accessible via http://apps.sanbi.ac.za/DESHCV/ or http://cbrc.kaust.edu.sa/deshcv/, and (ii) the Hepatitis C Virus Protein Interaction Database (HCVpro), accessible via http://apps.sanbi.ac.za/hcvpro/ or http://cbrc.kaust.edu.sa/hcvpro/.
DESHCV is a text mining system implemented using named concept recognition and co-occurrence-based approaches to computationally analyze about 32,000 HCV-related abstracts obtained from PubMed. As part of DESHCV development, the pre-constructed dictionaries of the Dragon Exploratory System (DES) were enriched with HCV biomedical concepts, including HCV proteins, name variants and symbols, to enable HCV-specific knowledge exploration. DESHCV query inputs consist of user-defined keywords, phrases and concepts. DESHCV is therefore an information extraction tool that enables users to computationally generate associations between concepts and supports the prediction of potential hypotheses of diagnostic and therapeutic relevance. Additionally, users can retrieve a list of abstracts containing tagged concepts, which can be used to ease the herculean task of manual biocuration. DESHCV has been used to reproduce the previously reported thalidomide-chronic hepatitis C hypothesis and to model a potentially novel thalidomide-amantadine hypothesis.
HCVpro is a relational knowledgebase dedicated to housing experimentally detected HCV-HCV and HCV-human protein interaction information obtained from other databases and curated from biomedical journal articles. Additionally, the database contains consolidated biological information consisting of hepatocellular carcinoma (HCC) related genes, comprehensive reviews on HCV biology and drug development, functional genomics and molecular biology data, and cross-referenced links to canonical pathways and other essential biomedical databases. Users can retrieve enriched information, including interaction metadata, from HCVpro by using protein identifiers, gene chromosomal locations, experiment types used in detecting the interactions, PubMed IDs of journal articles reporting the interactions, annotated protein interaction IDs from external databases, and via "string searches". The utility of HCVpro has been demonstrated by harnessing integrated data to suggest putative baseline clues that seem to support current diagnostic exploratory efforts directed towards vimentin. Furthermore, eight genes, comprising ACLY, AZGP1, DDX3X, FGG, H19, SIAH1, SERPING1 and THBS1, have been recommended for investigation to evaluate their diagnostic potential. The data archived in HCVpro can be utilized to support protein-protein interaction network-based candidate HCC gene prioritization for validation by experimental biologists. / South Africa
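As an illustration of the named concept recognition and co-occurrence approach described above, the minimal Python sketch below tags abstracts against a small concept dictionary and counts concept pairs that co-occur in the same abstract. The dictionary entries and abstracts are invented placeholders, not DESHCV content.

    # Minimal sketch of dictionary-based concept tagging and co-occurrence
    # counting. The concept dictionary and abstracts are invented placeholders.
    from collections import Counter
    from itertools import combinations

    CONCEPTS = {                      # surface form -> canonical concept
        "ns5b": "NS5B polymerase",
        "thalidomide": "thalidomide",
        "amantadine": "amantadine",
        "hepatitis c": "hepatitis C",
    }

    def tag_concepts(abstract):
        """Return the canonical concepts whose surface forms occur in the text."""
        text = abstract.lower()
        return {canon for surface, canon in CONCEPTS.items() if surface in text}

    def cooccurrence_counts(abstracts):
        """Count how often each concept pair is mentioned in the same abstract."""
        pairs = Counter()
        for abstract in abstracts:
            pairs.update(combinations(sorted(tag_concepts(abstract)), 2))
        return pairs

    abstracts = [
        "Thalidomide was evaluated in chronic hepatitis C patients.",
        "Amantadine combined with thalidomide showed modulatory effects.",
        "NS5B polymerase is the target of nucleoside inhibitors in hepatitis C.",
    ]
    for (a, b), n in cooccurrence_counts(abstracts).most_common():
        print(a, "<->", b, ":", n)

Concept pairs that co-occur unusually often across a corpus become candidate associations, from which hypotheses such as the thalidomide-amantadine link can be proposed.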
|
272 |
Information extraction from pharmaceutical literature
Batista-Navarro, Riza Theresa Bautista. January 2014 (has links)
With the constantly growing amount of biomedical literature, methods for automatically distilling information from unstructured data, collectively known as information extraction, have become indispensable. Whilst most biomedical information extraction efforts in the last decade have focussed on the identification of gene products and interactions between them, the biomedical text mining community has recently extended its scope to capture associations between biomedical and chemical entities, with the aim of supporting applications in drug discovery. This thesis is the first comprehensive study focussing on information extraction from pharmaceutical chemistry literature. In this research, we describe our work on (1) recognising names of chemical compounds and drugs, facilitated by the incorporation of domain knowledge; (2) exploring different coreference resolution paradigms in order to recognise co-referring expressions given a full-text article; and (3) defining drug-target interactions as events and distilling them from pharmaceutical chemistry literature using event extraction methods.
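A hedged sketch of the first task, dictionary-assisted recognition of chemical and drug names, is given below. Real systems typically combine statistical taggers with curated lexicons; the lexicon and the suffix heuristic here are invented examples, not the resources used in the thesis.

    # Toy sketch of dictionary-assisted chemical named entity recognition.
    # The lexicon and suffix rule are illustrative placeholders.
    import re

    CHEMICAL_LEXICON = {"aspirin", "ibuprofen", "acetylsalicylic acid", "imatinib"}
    # Suffixes common in systematic drug names; a crude morphological heuristic.
    CHEMICAL_SUFFIX = re.compile(r"\b\w+(?:ib|mab|azole|mycin)\b", re.IGNORECASE)

    def recognise_chemicals(sentence):
        """Return mentions matched by the lexicon or by the suffix rule."""
        lowered = sentence.lower()
        found = {entry for entry in CHEMICAL_LEXICON if entry in lowered}
        found.update(m.group(0).lower() for m in CHEMICAL_SUFFIX.finditer(sentence))
        return sorted(found)

    print(recognise_chemicals("Imatinib and aspirin were compared with vemurafenib."))
    # -> ['aspirin', 'imatinib', 'vemurafenib']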
|
273 |
Text mining molecular interactions and their context for studying disease
Jamieson, Daniel. January 2014 (has links)
Molecular interactions enable us to understand the complexity of the human living system and how it can be exploited or malfunction to cause disease. The biomedical literature presents detailed knowledge of molecular functions and therefore represents a valuable reservoir of data for studying disease. However, extracting these data efficiently is difficult, as they are spread over millions of publications in text that is not machine-readable. In this thesis we investigate how text mining can be used to automatically extract data on molecular interactions and their disease-relevant context. We focus on two globally relevant classes of disease that manifest through contrasting mechanisms: pain-related diseases and diseases caused by pathogenic organisms. Using HIV-1 as a case study, we first show that text mining can be used to partially recreate a large, manually curated database of HIV-1-human molecular interactions derived from the literature. We highlight weaknesses in the quality of the data produced by the text-mining approach, as well as its strengths: rapid extraction, identification of instances missed during manual curation, and potential as a curation support tool. We then expand on this approach by showing how an entirely new database of protein interactions relevant to pain can be created efficiently and accurately, using text mining to generate the data and manual curation to validate its quality. The following chapter presents an analysis of 1,002 unique pain-related protein-protein interactions derived from this database, showing that it is of greater relevance to pain research than databases of pain interactions created from other common starting points. We highlight its value by, for example, identifying new drug repurposing opportunities and exploring differences between specific pain diseases using the contextual detail afforded by the text mining. Finally, we expand further on our approach by showing how interactions between human proteins and pathogens can be curated across pathogenic organisms. We demonstrate how these techniques can be used to expand the human-pathogen interaction data already stored in public databases, identifying 42 new HIV-1-human molecular interactions, 108 new interactions between pathogen species and human proteins, and 33 new human proteins found to interact with pathogens. Together, the results show that contextualised text mining, when supported by manual curation, can be used to extract molecular interactions for contrasting disease types in an efficient and accurate manner.
|
274 |
Unsupervised discovery of relations for analysis of textual data in digital forensics
Louis, Anita Lily. 23 August 2010
This dissertation addresses the problem of analysing digital data in digital forensics. It will be shown that text mining methods can be adapted and applied to digital forensics to help analysts analyse data more quickly, efficiently and accurately, revealing truly useful information. Investigators who wish to utilise digital evidence must examine and organise the data to piece together the events and facts of a crime. The difficulty with finding relevant information quickly using current tools and methods is that these tools rely very heavily on background knowledge for query terms and do not fully utilise the content of the data. A novel framework in which to perform evidence discovery is proposed in order to reduce the quantity of data to be analysed, aid the analysts' exploration of the data and enhance the intelligibility of the presentation of the data. The framework combines information extraction techniques with visual exploration techniques to provide a novel approach to performing evidence discovery, in the form of an evidence discovery system. By utilising unrestricted, unsupervised information extraction techniques, the investigator does not require input queries or keywords for searching, and can therefore analyse portions of the data that would not have been identified by keyword searches. The evidence discovery system produces text graphs of the most important concepts and associations extracted from the full text, establishing ties between the concepts and providing an overview and general representation of the text. Through an interactive visual interface the investigator can explore the data to identify suspects, events and the relations between suspects. Two models are proposed for performing the relation extraction process of the evidence discovery framework. The first model takes a statistical approach, discovering relations based on co-occurrences of complex concepts. The second model takes a linguistic approach, using named entity extraction and information extraction patterns. A preliminary study was performed to assess the usefulness of a text mining approach to digital forensics compared with the traditional information retrieval approach. It was concluded that the novel approach to text analysis for evidence discovery presented in this dissertation is viable and promising. The preliminary experiment showed that the results obtained from the evidence discovery system, using either of the relation extraction models, are sensible and useful. The approach advocated in this dissertation can therefore be successfully applied to the analysis of textual data for digital forensics. / Copyright / Dissertation (MSc)--University of Pretoria, 2010. / Computer Science / unrestricted
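The second, linguistic model can be sketched as a pipeline in which named entities are recognised first and surface patterns between entity mentions then yield labelled relations. The sketch below is illustrative only: the entity pattern stands in for a proper named entity recogniser, and the names and patterns are invented.

    # Illustrative sketch of pattern-based relation extraction over named
    # entities. The person pattern, relation patterns and text are invented.
    import re

    PERSON = r"(?P<{}>[A-Z][a-z]+ [A-Z][a-z]+)"  # stand-in for a real NER step
    PATTERNS = [
        (re.compile(PERSON.format("a") + r" (?:met|called|emailed) " + PERSON.format("b")),
         "contacted"),
        (re.compile(PERSON.format("a") + r" transferred .+ to " + PERSON.format("b")),
         "transferred_to"),
    ]

    def extract_relations(text):
        """Return (entity, relation, entity) triples matched by any pattern."""
        triples = []
        for pattern, label in PATTERNS:
            for m in pattern.finditer(text):
                triples.append((m.group("a"), label, m.group("b")))
        return triples

    evidence = "John Smith emailed Mary Jones. John Smith transferred funds to Alan Poe."
    for triple in extract_relations(evidence):
        print(triple)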
|
275 |
Text Analytics of Social Media: Sentiment Analysis, Event Detection and Summarization
Shen, Chao. 31 October 2014
In the last decade, large numbers of social media services have emerged and become widely used in people's daily lives as important tools for sharing and acquiring information. With a substantial amount of user-contributed text data on social media, it has become necessary to develop methods and tools for analysing this emerging type of text, in order to better utilize it to deliver meaningful information to users.
Previous work on text analytics over the last several decades has mainly focused on traditional types of text such as emails, news and academic literature, and several issues critical to text data on social media have not been well explored: 1) how to detect sentiment in text on social media; 2) how to make use of social media's real-time nature; 3) how to address information overload for flexible information needs.
In this dissertation, we focus on these three problems. First, to detect the sentiment of text on social media, we propose a non-negative matrix tri-factorization (tri-NMF) based dual active supervision method that minimizes human labeling effort for this new type of data. Second, to make use of social media's real-time nature, we propose approaches to detect events from text streams on social media. Third, to address information overload for flexible information needs, we propose two summarization frameworks: a dominating-set based summarization framework and a learning-to-rank based summarization framework. The dominating-set based framework can be applied to different types of summarization problems, while the learning-to-rank based framework utilizes existing training data to guide new summarization tasks. In addition, we integrate these techniques in an application study of event summarization for sports games, as an example of how to better utilize social media data.
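The dominating-set based framework can be illustrated with a short sketch: build a sentence similarity graph, then greedily select sentences until every sentence is either selected or adjacent to a selected one. The overlap measure and threshold below are invented simplifications of whatever similarity function an actual system would use.

    # Minimal sketch of dominating-set based summarization over a sentence
    # similarity graph. The similarity measure and threshold are invented.
    def word_overlap(s1, s2):
        a, b = set(s1.lower().split()), set(s2.lower().split())
        return len(a & b) / max(1, min(len(a), len(b)))

    def greedy_dominating_summary(sentences, threshold=0.3):
        n = len(sentences)
        neighbours = {i: {j for j in range(n)
                          if i != j and word_overlap(sentences[i], sentences[j]) >= threshold}
                      for i in range(n)}
        uncovered, summary = set(range(n)), []
        while uncovered:
            # Pick the sentence covering the most still-uncovered sentences.
            best = max(uncovered, key=lambda i: len((neighbours[i] | {i}) & uncovered))
            summary.append(sentences[best])
            uncovered -= neighbours[best] | {best}
        return summary

    sents = ["The home team scored early in the first half.",
             "An early goal put the home team ahead.",
             "Fans celebrated the victory downtown."]
    print(greedy_dominating_summary(sents))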
|
276 |
A strategy for a systematic approach to biomarker discovery validation: a study on lung cancer microarray data set
Dol, Zulkifli. January 2015 (has links)
Cancer is a serious threat to human health and is now one of the major causes of death worldwide. However, the complexity of cancer makes the development of new and specific diagnostic tools particularly challenging. A number of different strategies have been developed for biomarker discovery in cancer using microarray data. The problem that typically needs to be addressed is the scale of the data sets: we simply do not have (nor are we likely to obtain) sufficient data for classical machine learning approaches to biomarker discovery to be properly validated. Obtaining a biomarker that is specific to a particular cancer is also very challenging. The initial promise held out by gene microarray work for the development of cancer biomarkers has not yet yielded the hoped-for breakthroughs. This work discusses the construction of a strategy for a systematic approach to biomarker discovery validation, using gene expression microarray data on non-small cell lung cancer in patients who either remained disease-free after surgery (within a five-year window) or in whom the disease progressed and recurred. To assist validation, we have therefore investigated new methodologies for using existing biological knowledge to support machine learning biomarker discovery techniques. We employ a text mining strategy, using previously published literature to correlate biological concepts with a given phenotype. Pathway-driven approaches, through the use of Web Services and workflows, enabled the large-scale data set to be analysed systematically. The results showed that it was possible, at least using this specific data set, to clearly differentiate between progressive-disease and disease-free patients using a set of biomarkers implicated in neuroendocrine signaling. Validation of the identified biomarkers was attempted in three separately published data sets. This analysis showed that although there was support for some of our findings in one of these data sets, this appeared to be a function of the close similarity in the experimental design followed rather than of the specifics of the analysis method developed.
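The validation loop described here, selecting candidate genes on a discovery cohort and testing the fitted model on an independent data set, can be sketched as follows. The sketch uses scikit-learn on synthetic arrays; the cohort sizes, gene counts and feature selector are invented stand-ins, not the actual pipeline developed in the thesis.

    # Minimal sketch of a discovery/validation biomarker loop on synthetic
    # data; all sizes and parameter choices are illustrative assumptions.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X_discovery = rng.normal(size=(60, 500))    # 60 patients x 500 gene probes
    y_discovery = rng.integers(0, 2, size=60)   # disease-free vs progressive
    X_validation = rng.normal(size=(30, 500))   # an independently published set
    y_validation = rng.integers(0, 2, size=30)

    model = make_pipeline(
        SelectKBest(f_classif, k=20),       # keep the 20 most discriminative genes
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_discovery, y_discovery)
    print("validation accuracy:", model.score(X_validation, y_validation))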
|
277 |
Information extraction from text (Extrakce informací z textu)
Michalko, Boris. January 2008 (has links)
The aim of this work is to survey the available systems for information extraction and the possibilities of using them in the MedIEQ project. The theoretical part contains an introduction to the field of information extraction. I describe its purpose, the needs it addresses, its applications, and its relation to other natural language processing tasks. I cover its history, recent developments, performance measurement and criticism thereof. I also describe the general architecture of an IE system and the basic tasks it is meant to solve, with an emphasis on entity extraction. The practical part contains an overview of the algorithms used in information extraction systems. I describe both types of algorithms, rule-based and statistical. The next chapter gives a list and brief description of existing free systems. Finally, I carry out my own experiment with two systems, LingPipe and GATE, on selected corpora, measuring various performance statistics. I also created a small dictionary and a regular expression for email addresses to demonstrate rule-based extraction of specific kinds of information.
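The rule-based demonstration mentioned at the end, a small dictionary plus a regular expression for email addresses, might look like the sketch below; the dictionary entries and the pattern are illustrative, not the ones built for the thesis.

    # Illustrative dictionary lookup plus email regular expression.
    import re

    DICTIONARY = {"LingPipe", "GATE"}                    # tiny gazetteer
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # permissive pattern

    def extract(text):
        return {"systems": sorted(t for t in DICTIONARY if t in text),
                "emails": EMAIL.findall(text)}

    print(extract("The GATE corpus was sent to evaluator@example.org."))
    # -> {'systems': ['GATE'], 'emails': ['evaluator@example.org']}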
|
278 |
Discourse causality recognition in the biomedical domain
Mihaila, Claudiu. January 2014 (has links)
With the advent of online publishing of scientific research came an avalanche of electronic resources and repositories containing knowledge encoded in some form or another. In the domain of biomedical sciences, research is now being published at a faster-than-ever pace, with several thousand articles per day. It is impossible for any human being to process that amount of information in due time, let alone apply it to their own needs. Thus appeared the necessity of being able to automatically retrieve relevant documents and extract useful information from text. Although it is now possible to distil essential factual knowledge from text, it is difficult to interpret the connections between the extracted facts. These connections, also known as discourse relations, make the text coherent and cohesive, and their automatic discovery can lead to a better understanding of the conveyed knowledge. One fundamental discourse relation is causality, as it is the one which explains reasons and allows for inferences to be made. This thesis is the first comprehensive study which focusses on recognising discourse causality in biomedical scientific literature. We first construct a manually annotated corpus of discourse causality and analyse its characteristics. Then, a methodology for automatically recognising causal relations using text mining and natural language processing techniques is presented. Furthermore, we investigate the automatic identification of additional information about the polarity, certainty, knowledge type and source of causal relations. The entire methodology is evaluated by empirical experiments, whose results show that it is possible to successfully extract causal relations from biomedical literature. Finally, we provide an example of a direct application of our research and offer ideas for further research directions and possible improvements to our methodology.
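As a hedged illustration of the first step of such a methodology, the sketch below flags sentences containing explicit causal connectives as candidate causal relations; the trigger list is a small invented sample, and a real recogniser would add learned features plus polarity, certainty and source classification on top.

    # Minimal trigger-based detector of candidate causal discourse relations.
    # The connective list is an invented sample, not the thesis's lexicon.
    CAUSAL_TRIGGERS = ("therefore", "as a result", "because",
                       "consequently", "thus", "due to")

    def find_causal_candidates(sentences):
        """Return (sentence index, trigger) pairs for sentences with a causal cue."""
        hits = []
        for i, sentence in enumerate(sentences):
            lowered = sentence.lower()
            hits.extend((i, t) for t in CAUSAL_TRIGGERS if t in lowered)
        return hits

    text = ["The kinase was inhibited.",
            "As a result, downstream phosphorylation decreased."]
    print(find_causal_candidates(text))  # [(1, 'as a result')]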
|
279 |
Essays on Data Driven Insights from Crowd Sourcing, Social Media and Social Networks
Velichety, Srikar. January 2016 (has links)
The beginning of this decade has seen a phenomenal rise in the amount of data generated in the world. While this increase provides us with opportunities to understand various aspects of human behavior and the mechanisms behind new phenomena, the technologies, statistical techniques and theories required to gain an in-depth and comprehensive understanding have not progressed at an equal pace. As little as five years ago, we used to deal with problems where there was insufficient prior social science or economic theory and the interest was only in predicting the outcome, or where there was an appropriate social science or economic theory and the interest was in explaining a given phenomenon. Today, we deal with problems where there is insufficient social science or economic theory but the interest is in explaining a given phenomenon. This creates a big challenge whose solution is of equal interest to academics and practitioners. In my research, I contribute towards addressing these challenges by building exploratory frameworks that leverage a variety of techniques, including social network analysis, text and data mining, econometrics, statistical computing and visualization. My three-essay dissertation focuses on understanding the antecedents of the quality of user-generated content and on users' subscription and unsubscription behavior for lists on social media. Using a data science approach on population-sized samples from Wikipedia and Twitter, I demonstrate the power of customized exploratory analyses in uncovering facts that social science or economic theory does not dictate, and show how metrics from these analyses can be used to build prediction models with higher accuracy. I also demonstrate a method for combining exploration, prediction and explanatory modeling, and propose to extend this methodology to provide causal inference. This dissertation has general implications for building better predictive and explanatory models and for mining text efficiently on social media.
|
280 |
Organização flexível de documentos / Flexible organization of documents
Tatiane Nogueira Rios. 25 March 2013
Several methods have been developed to organize the growing quantity of textual documents. Such methods frequently use clustering algorithms to organize documents that refer to the same topic into the same cluster, under the assumption that the contents of documents in a cluster are similar. However, there are situations in which documents belonging to different clusters also share similar characteristics. To overcome this drawback, it is necessary to develop methods that permit a soft (flexible) organization of documents, i.e., that allow documents to be assigned to different clusters with different degrees of compatibility. Fuzzy clustering of textual documents is a suitable technique for this kind of organization, since fuzzy clustering algorithms allow the same document to be compatible with more than one cluster. Although fuzzy clustering algorithms enabling the soft organization of documents have been developed, such organization has mostly been evaluated in terms of clustering performance. However, clusters of documents should also have descriptors that adequately identify the topics they represent; in general, cluster descriptors have been extracted using some heuristic over a small set of documents, giving only a shallow evaluation of the meaning of the extracted clusters. Appropriate extraction and evaluation of cluster descriptors is important because descriptors are terms representing the collection that identify the topics addressed in the documents. Therefore, in applications where fuzzy clustering is used for the soft organization of documents, an appropriate description of the obtained clusters is as important as a good clustering, since in this kind of clustering the same descriptor may indicate the content of more than one cluster. This need motivated this thesis, whose goal was to investigate and develop methods for extracting descriptors from fuzzy clusters for the soft organization of documents. To this end, the following methods were developed: i) SoftO-FDCL (Soft Organization - Fuzzy Description Comes Last), in which descriptors of flat fuzzy clusters are extracted after the fuzzy clustering process, identifying topics of the soft organization of documents regardless of the fuzzy clustering algorithm used; ii) SoftO-wFDCL (Soft Organization - weighted Fuzzy Description Comes Last), in which descriptors of flat fuzzy clusters are also extracted after the fuzzy clustering process, using the membership degrees of the documents in each cluster, obtained from the fuzzy clustering, as a weighting factor for candidate descriptor terms; iii) HSoftO-FDCL (Hierarchical Soft Organization - Fuzzy Description Comes Last), in which descriptors of hierarchical fuzzy clusters are extracted after the hierarchical fuzzy clustering process, identifying topics of the hierarchical soft organization of documents. Additionally, this thesis presents an application of the SoftO-FDCL method in the context of the Canadian continuing medical education program, reinforcing the utility and applicability of the soft organization of documents in a real-world scenario.
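The membership-weighted idea behind SoftO-wFDCL can be sketched briefly: score each candidate term for a cluster by summing its frequency in every document, weighted by that document's fuzzy membership in the cluster. The documents and the membership matrix below are invented placeholders; in practice the memberships come from a fuzzy clustering algorithm such as fuzzy c-means.

    # Sketch of membership-weighted cluster descriptor extraction.
    # Documents and membership degrees are invented placeholders.
    from collections import Counter

    docs = ["fuzzy clustering of text documents",
            "clustering algorithms for document organization",
            "fuzzy membership degrees in soft clustering"]
    # memberships[d][c]: degree to which document d belongs to cluster c.
    memberships = [[0.9, 0.1],
                   [0.4, 0.6],
                   [0.8, 0.2]]

    def cluster_descriptors(cluster, top_n=3):
        scores = Counter()
        for doc, mu in zip(docs, memberships):
            for term, freq in Counter(doc.split()).items():
                scores[term] += mu[cluster] * freq  # weight term by membership
        return [term for term, _ in scores.most_common(top_n)]

    for c in range(2):
        print("cluster", c, ":", cluster_descriptors(c))

Note that the same term can rank highly for more than one cluster, which is exactly why descriptor extraction matters in the soft setting.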
|