271 |
中國古典詩歌對應探勘及詞彙分析工具 / Tools for Pattern Comparison and Word Analysis of Chinese Classical Poetry
黃植琨, Unknown Date
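This study builds on digitized texts of the Shijing (《詩經》), the Chuci (《楚辭》), the Quan Tang Shi (《全唐詩》), the Quan Song Shi (《全宋詩》) and the Quan Song Ci (《全宋詞》), and applies information technology to construct tools for analyzing borrowing between literary works. The tools compare strings or words; through configurable settings, users can filter out likely correspondences, in particular literal similarities among the Quan Tang Shi, the Quan Song Shi and the Quan Song Ci. We draw on research from the humanities to evaluate the effectiveness of the tools. From an information-science perspective, we also compile statistics on correspondences such as those between Tang poetry and the poetry and lyrics of the Song dynasty, and we compare each collection with itself (Shijing with Shijing, Chuci with Chuci, Quan Tang Shi with Quan Tang Shi, Quan Song Ci with Quan Song Ci, and Quan Song Shi with Quan Song Shi) to uncover correspondences among works by writers of the same era. In addition, the study attempts word segmentation of Chinese classical poetry and analyzes the semantics of the words in the poems; in the future we hope to compare poems by meaning as well. Although this study does not go as deep as humanities research using traditional methods, it offers services such as distilling the essentials from large corpora and compiling statistics, which saves the time humanities researchers spend analyzing and organizing texts and brings digital means to the support of related research in the humanities.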
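The string-comparison idea described above can be illustrated with a minimal sketch. It assumes that "correspondence" means sharing at least one contiguous character n-gram; the function names, the default n-gram length and the `min_shared` threshold are hypothetical choices for illustration, not the thesis's actual tool or settings.

```python
def char_ngrams(text, n):
    """Return the set of all contiguous character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def shared_ngrams(line_a, line_b, n=2):
    """Return the character n-grams that occur in both lines, sorted."""
    return sorted(char_ngrams(line_a, n) & char_ngrams(line_b, n))


def find_correspondences(corpus_a, corpus_b, n=2, min_shared=1):
    """Pair up lines from two corpora that share at least min_shared n-grams."""
    matches = []
    for a in corpus_a:
        for b in corpus_b:
            common = shared_ngrams(a, b, n)
            if len(common) >= min_shared:
                matches.append((a, b, common))
    return matches
```

For example, comparing the lines 床前明月光 and 明月幾時有 with n=2 reports the shared bigram 明月. Raising `n` or `min_shared` plays the role of the user-configurable filter mentioned in the abstract; the pairwise loop is quadratic and would need indexing to scale to full collections.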
|
272 |
Development of a Hepatitis C Virus knowledgebase with computational prediction of functional hypothesis of therapeutic relevance
Kojo, Kwofie Samuel, January 2011
Philosophiae Doctor - PhD / Ameliorating the therapeutic and diagnostic challenges of Hepatitis C Virus (HCV) requires robust intervention strategies, including approaches that leverage the plethora of rich data published in biomedical literature to gain greater understanding of HCV pathobiological mechanisms. The multitudes of metadata originating from HCV clinical trials as well as low- and high-throughput experiments embedded in text corpora can be mined as data sources for the implementation of HCV-specific resources. HCV-customized resources may support the generation of worthy and testable hypotheses and reveal potential research clues to augment the pursuit of efficient diagnostic biomarkers and therapeutic targets. This thesis reports the development of two freely available HCV-specific web-based resources: (i) Dragon Exploratory System on Hepatitis C Virus (DESHCV), accessible via http://apps.sanbi.ac.za/DESHCV/ or http://cbrc.kaust.edu.sa/deshcv/, and (ii) Hepatitis C Virus Protein Interaction Database (HCVpro), accessible via http://apps.sanbi.ac.za/hcvpro/ or http://cbrc.kaust.edu.sa/hcvpro/. DESHCV is a text mining system implemented using named concept recognition and co-occurrence-based approaches to computationally analyze about 32,000 HCV-related abstracts obtained from PubMed. As part of DESHCV development, the pre-constructed dictionaries of the Dragon Exploratory System (DES) were enriched with HCV biomedical concepts, including HCV proteins, name variants and symbols, to enable HCV-specific knowledge exploration. The DESHCV query inputs consist of user-defined keywords, phrases and concepts. DESHCV is therefore an information extraction tool that enables users to computationally generate associations between concepts and supports the prediction of potential hypotheses with diagnostic and therapeutic relevance.
Additionally, users can retrieve a list of abstracts containing tagged concepts that can be used to overcome the herculean task of manual biocuration. DESHCV has been used to simulate a previously reported thalidomide-chronic hepatitis C hypothesis and also to model a potentially novel thalidomide-amantadine hypothesis. HCVpro is a relational knowledgebase dedicated to housing experimentally detected HCV-HCV and HCV-human protein interaction information obtained from other databases and curated from biomedical journal articles. Additionally, the database contains consolidated biological information consisting of hepatocellular carcinoma (HCC) related genes, comprehensive reviews on HCV biology and drug development, functional genomics and molecular biology data, and cross-referenced links to canonical pathways and other essential biomedical databases. Users can retrieve enriched information including interaction metadata from HCVpro by using protein identifiers, gene chromosomal locations, experiment types used in detecting the interactions, PubMed IDs of journal articles reporting the interactions, annotated protein interaction IDs from external databases, and via “string searches”. The utility of HCVpro has been demonstrated by harnessing integrated data to suggest putative baseline clues that seem to support current diagnostic exploratory efforts directed towards vimentin. Furthermore, eight genes, namely ACLY, AZGP1, DDX3X, FGG, H19, SIAH1, SERPING1 and THBS1, have been recommended for possible investigation to evaluate their diagnostic potential. The data archived in HCVpro can be utilized to support protein-protein interaction network-based candidate HCC gene prioritization for possible validation by experimental biologists. / South Africa
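The dictionary-based concept tagging and co-occurrence analysis described for DESHCV can be sketched roughly as follows. This is a generic illustration, not the DES or DESHCV implementation: the function names are hypothetical, and concept recognition is reduced here to case-insensitive substring matching, which is an assumption made to keep the example short.

```python
from collections import Counter
from itertools import combinations


def tag_concepts(abstract, dictionary):
    """Return the dictionary concepts mentioned in an abstract.

    Matching is a case-insensitive substring test (a simplifying assumption).
    """
    text = abstract.lower()
    return {concept for concept in dictionary if concept.lower() in text}


def cooccurrence_counts(abstracts, dictionary):
    """Count, for every pair of concepts, how many abstracts mention both."""
    counts = Counter()
    for abstract in abstracts:
        found = sorted(tag_concepts(abstract, dictionary))
        # Sorting makes each unordered pair appear under a single key.
        counts.update(combinations(found, 2))
    return counts
```

Pairs with high counts are candidate associations for a curator to inspect; the abstracts in which a pair was tagged can be kept alongside the counts to support manual review.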
|
273 |
Information extraction from pharmaceutical literature
Batista-Navarro, Riza Theresa Bautista, January 2014
With the constantly growing amount of biomedical literature, methods for automatically distilling information from unstructured data, collectively known as information extraction, have become indispensable. Whilst most biomedical information extraction efforts in the last decade have focussed on the identification of gene products and interactions between them, the biomedical text mining community has recently extended their scope to capture associations between biomedical and chemical entities with the aim of supporting applications in drug discovery. This thesis is the first comprehensive study focussing on information extraction from pharmaceutical chemistry literature. In this research, we describe our work on (1) recognising names of chemical compounds and drugs, facilitated by the incorporation of domain knowledge; (2) exploring different coreference resolution paradigms in order to recognise co-referring expressions given a full-text article; and (3) defining drug-target interactions as events and distilling them from pharmaceutical chemistry literature using event extraction methods.
|
274 |
Text mining molecular interactions and their context for studying disease
Jamieson, Daniel, January 2014
Molecular interactions enable us to understand the complexity of the human living system and how it can be exploited or malfunction to cause disease. The biomedical literature presents detailed knowledge of molecular functions and therefore represents a valuable reservoir of data for studying disease. However, extracting this data efficiently is difficult as it is spread over millions of publications in text that is not machine-readable. In this thesis we investigate how text mining can be used to automatically extract data for molecular interactions and their context relevant to disease. We focus on two globally relevant classes of diseases that arise from contrasting mechanisms: pain-related diseases and diseases caused by pathogenic organisms. Using HIV-1 as a case study, we first show that text mining can be used to partially recreate a large, manually curated database of HIV-1-human molecular interactions derived from the literature. We highlight both weaknesses in the quality of the data produced by the text-mining approach and its strengths: it extracts this data rapidly, identifies instances missed in the manual curation, and has potential as a support tool. We then expand on this approach by showing how an entirely new database of protein interactions relevant to pain can be created efficiently and accurately using text mining to generate the data and manual curation to validate the data quality. The following chapter then presents an analysis of 1,002 unique pain-related protein-protein interactions derived from this database, showing that it is of greater relevance to pain research than databases of pain interactions created from other common starting points. We highlight its value by, for example, identifying new drug repurposing opportunities and exploring differences in specific pain diseases using the contextual detail afforded by the text mining.
Finally, we expand further on our approach to extracting molecular interactions from the literature, by showing how interactions between human proteins and pathogens can be curated across pathogenic organisms. We demonstrate how these techniques can be used to expand our knowledge of human-pathogen interaction data already stored in public databases, by identifying 42 new HIV-1-human molecular interactions, 108 new interactions between pathogen species and human proteins and 33 new human proteins that were found to interact with pathogens. Together, the results show that contextualised text mining, when supported by manual curation, can be used to extract molecular interactions for contrasting disease types in an efficient and accurate manner.
|
275 |
Unsupervised discovery of relations for analysis of textual data in digital forensics
Louis, Anita Lily, 23 August 2010
This dissertation addresses the problem of analysing digital data in digital forensics. It will be shown that text mining methods can be adapted and applied to digital forensics to aid analysts to more quickly, efficiently and accurately analyse data to reveal truly useful information. Investigators who wish to utilise digital evidence must examine and organise the data to piece together events and facts of a crime. The difficulty with finding relevant information quickly using the current tools and methods is that these tools rely very heavily on background knowledge for query terms and do not fully utilise the content of the data. A novel framework in which to perform evidence discovery is proposed in order to reduce the quantity of data to be analysed, aid the analysts' exploration of the data and enhance the intelligibility of the presentation of the data. The framework combines information extraction techniques with visual exploration techniques to provide a novel approach to performing evidence discovery, in the form of an evidence discovery system. By utilising unrestricted, unsupervised information extraction techniques, the investigator does not require input queries or keywords for searching, thus enabling the investigator to analyse portions of the data that may not have been identified by keyword searches. The evidence discovery system produces text graphs of the most important concepts and associations extracted from the full text to establish ties between the concepts and provide an overview and general representation of the text. Through an interactive visual interface the investigator can explore the data to identify suspects, events and the relations between suspects. Two models are proposed for performing the relation extraction process of the evidence discovery framework. The first model takes a statistical approach to discovering relations based on co-occurrences of complex concepts. 
The second model utilises a linguistic approach using named entity extraction and information extraction patterns. A preliminary study was performed to assess the usefulness of a text mining approach to digital forensics as against the traditional information retrieval approach. It was concluded that the novel approach to text analysis for evidence discovery presented in this dissertation is a viable and promising approach. The preliminary experiment showed that the results obtained from the evidence discovery system, using either of the relation extraction models, are sensible and useful. The approach advocated in this dissertation can therefore be successfully applied to the analysis of textual data for digital forensics. / Dissertation (MSc)--University of Pretoria, 2010. / Computer Science / unrestricted
|
276 |
Text Analytics of Social Media: Sentiment Analysis, Event Detection and Summarization
Shen, Chao, 31 October 2014
In the last decade, large numbers of social media services have emerged and been widely used in people's daily life as important information sharing and acquisition tools. With a substantial amount of user-contributed text data on social media, it becomes a necessity to develop methods and tools for text analysis for this emerging data, in order to better utilize it to deliver meaningful information to users.
Previous work on text analytics in the last several decades has mainly focused on traditional types of text such as emails, news and academic literature, and several critical issues for text data on social media have not been well explored: 1) how to detect sentiment from text on social media; 2) how to make use of social media's real-time nature; 3) how to address information overload for flexible information needs.
In this dissertation, we focus on these three problems. First, to detect sentiment of text on social media, we propose a non-negative matrix tri-factorization (tri-NMF) based dual active supervision method to minimize human labeling efforts for the new type of data. Second, to make use of social media's real-time nature, we propose approaches to detect events from text streams on social media. Third, to address information overload for flexible information needs, we propose two summarization frameworks: a dominating set based summarization framework and a learning-to-rank based summarization framework. The dominating set based summarization framework can be applied to different types of summarization problems, while the learning-to-rank based summarization framework helps utilize the existing training data to guide the new summarization tasks. In addition, we integrate these techniques in an application study of event summarization for sports games as an example of how to better utilize social media data.
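To make the dominating-set idea concrete, the sketch below builds a sentence graph and selects a summary with the standard greedy approximation for a minimum dominating set. It is an illustration under stated assumptions, not the dissertation's framework: the Jaccard word-overlap similarity, the 0.3 threshold and all function names are hypothetical choices.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of tokens."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_graph(sentences, threshold=0.3):
    """Link sentences whose word-set Jaccard similarity reaches the threshold."""
    tokens = [set(s.lower().split()) for s in sentences]
    neighbours = {i: set() for i in range(len(sentences))}
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if jaccard(tokens[i], tokens[j]) >= threshold:
                neighbours[i].add(j)
                neighbours[j].add(i)
    return neighbours


def greedy_dominating_set(neighbours):
    """Pick vertices until every vertex is picked or adjacent to a pick."""
    uncovered = set(neighbours)
    chosen = []
    while uncovered:
        # Take the vertex covering the most still-uncovered vertices;
        # ties go to the lowest index because max() keeps the first maximum.
        best = max(sorted(neighbours),
                   key=lambda v: len(({v} | neighbours[v]) & uncovered))
        chosen.append(best)
        uncovered -= {best} | neighbours[best]
    return chosen


def summarize(sentences, threshold=0.3):
    """Return the selected sentences in their original order."""
    picked = greedy_dominating_set(build_graph(sentences, threshold))
    return [sentences[i] for i in sorted(picked)]
```

Every sentence left out of the summary is similar to at least one sentence kept in it, which is the property that makes a dominating set a natural model of a summary. The greedy rule does not guarantee a minimum-size set; it is a common approximation.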
|
277 |
A strategy for a systematic approach to biomarker discovery validation : a study on lung cancer microarray data set
Dol, Zulkifli, January 2015
Cancer is a serious threat to human health and is now one of the major causes of death worldwide. However, the complexity of cancer makes the development of new and specific diagnostic tools particularly challenging. A number of different strategies have been developed for biomarker discovery in cancer using microarray data. The problem that typically needs to be addressed is the scale of the data sets; we simply do not have (nor are we likely to obtain) sufficient data for classical machine learning approaches for biomarker discovery to be properly validated. Obtaining a biomarker that is specific to a particular cancer is also very challenging. The initial promise that was held out for gene microarray work for the development of cancer biomarkers has not yet yielded the hoped-for breakthroughs. This work discusses the construction of a strategy for a systematic approach to biomarker discovery validation using lung cancer gene expression microarray data, based around non-small cell cancer and patients who either stayed disease free after surgery (a five-year window) or in whom the disease progressed and recurred. As a means of assisting validation, we have therefore looked at new methodologies for using existing biological knowledge to support machine learning biomarker discovery techniques. We employ a text mining strategy using previously published literature for correlating biological concepts to a given phenotype. Pathway-driven approaches, through the use of Web Services and workflows, enabled the large-scale dataset to be analysed systematically. The results showed that it was possible, at least using this specific data set, to clearly differentiate between progressive disease and disease-free patients using a set of biomarkers implicated in neuroendocrine signaling. A validation of the biomarkers identified was attempted in three separately published data sets.
This analysis showed that although there was support for some of our findings in one of these data sets, this appeared to be a function of the close similarity in the experimental design followed rather than of the specifics of the analysis method developed.
|
278 |
Extrakce informací z textu / Information Extraction from Text
Michalko, Boris, January 2008
The aim of this thesis is to examine the available information extraction systems and the possibilities for using them in the MedIEQ project. The theoretical part contains an introduction to the field of information extraction. I describe its purpose, needs and uses, and its relation to other natural language processing tasks. I go through its history, recent developments, performance measurement and the criticism of that measurement. I also describe the general architecture of an IE system and the basic tasks it has to solve, with an emphasis on entity extraction. The practical part gives an overview of the algorithms used in information extraction systems. I describe both types of algorithms: rule-based and statistical. The next chapter lists and briefly describes existing free systems. Finally, I carry out my own experiment with two systems, LingPipe and GATE, on selected corpora, measuring various performance statistics. I also created a small dictionary and a regular expression for email addresses in order to demonstrate rules for extracting certain specific information.
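The rule-based side of the experiment, a small dictionary plus a regular expression for email addresses, can be sketched as below. This is a plain Python illustration rather than the LingPipe or GATE configuration used in the thesis; the pattern and the function names are assumptions chosen for brevity.

```python
import re

# Deliberately simple pattern: it does not cover every address that
# RFC 5322 allows, only the common local@domain.tld shape.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text):
    """Return all substrings of text that match the simple email pattern."""
    return EMAIL_PATTERN.findall(text)


def extract_dictionary_terms(text, dictionary):
    """Return the dictionary terms found in text as whole words, ignoring case."""
    found = []
    for term in dictionary:
        if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE):
            found.append(term)
    return found
```

Rules of this kind are precise for well-formed targets such as email addresses but need manual upkeep, which is the usual trade-off against the statistical algorithms the abstract mentions.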
|
279 |
Discourse causality recognition in the biomedical domain
Mihaila, Claudiu, January 2014
With the advent of online publishing of scientific research came an avalanche of electronic resources and repositories containing knowledge encoded in some form or another. In the domain of biomedical sciences, research is now being published at a faster-than-ever pace, with several thousand articles per day. It is impossible for any human being to process that amount of information in due time, let alone apply it to their own needs. Thus appeared the necessity of being able to automatically retrieve relevant documents and extract useful information from text. Although it is now possible to distil essential factual knowledge from text, it is difficult to interpret the connections between the extracted facts. These connections, also known as discourse relations, make the text coherent and cohesive, and their automatic discovery can lead to a better understanding of the conveyed knowledge. One fundamental discourse relation is causality, as it is the one which explains reasons and allows for inferences to be made. This thesis is the first comprehensive study which focusses on recognising discourse causality in biomedical scientific literature. We first construct a manually annotated corpus of discourse causality and analyse its characteristics. Then, a methodology for automatically recognising causal relations using text mining and natural language processing techniques is presented. Furthermore, we investigate the automatic identification of additional information about the polarity, certainty, knowledge type and source of causal relations. The entire methodology is evaluated by empirical experiments, whose results show that it is possible to successfully extract causal relations from biomedical literature. Finally, we provide an example of a direct application of our research and offer ideas for further research directions and possible improvements to our methodology.
|
280 |
Essays on Data Driven Insights from Crowd Sourcing, Social Media and Social Networks
Velichety, Srikar, January 2016
The beginning of this decade has seen a phenomenal rise in the amount of data generated in the world. While this increase provides us with opportunities to understand various aspects of human behavior and the mechanisms behind new phenomena, the technologies, statistical techniques and theories required to gain an in-depth and comprehensive understanding haven't progressed at an equal pace. As recently as five years ago, we dealt with problems where either there was insufficient prior social science or economic theory and the interest was only in predicting the outcome, or there was an appropriate social science or economic theory and the interest was in explaining a given phenomenon. Today, we deal with problems where there is insufficient social science or economic theory but the interest is in explaining a given phenomenon. This creates a big challenge whose solution is of equal interest to academics and practitioners. In my research, I contribute towards addressing these challenges by building exploratory frameworks that leverage a variety of techniques including social network analysis, text and data mining, econometrics, statistical computing and visualization. My three-essay dissertation focuses on understanding the antecedents to the quality of user-generated content and on the subscription and un-subscription behavior of users from lists on Social Media. Using a data science approach on population-sized samples from Wikipedia and Twitter, I demonstrate the power of customized exploratory analyses in uncovering facts that social science or economic theory doesn't dictate and show how metrics from these analyses can be used to build prediction models with higher accuracy. I also demonstrate a method for combining exploration, prediction and explanatory modeling and propose to extend this methodology to provide causal inference.
This dissertation has general implications for building better predictive and explanatory models and for mining text efficiently in Social Media.
|