51

A semi-automated framework for the analytical use of gene-centric data with biological ontologies

He, Xin January 2017 (has links)
Motivation: Translational bioinformatics (TBI) has been defined as 'the development and application of informatics methods that connect molecular entities to clinical entities' [1]. It has emerged as a systems approach for bridging the huge wealth of biomedical data into clinical action, drawing on innovations and resources across the entire spectrum of biomedical informatics [2]. The challenge for TBI is the availability of both comprehensive gene-centric knowledge and the corresponding tools that allow its analysis and exploitation. Traditionally, biological researchers studied one or only a few genes at a time, but in recent years high-throughput technologies such as gene expression microarrays, protein mass spectrometry and next-generation DNA and RNA sequencing have emerged that allow the simultaneous measurement of changes on a genome-wide scale. These technologies usually yield large lists of interesting genes, whose meaningful biological interpretation remains a major challenge. Over the last decade, enrichment analysis has become standard practice in the analysis of such gene lists, enabling systematic assessment of whether defined groups of genes are differentially represented relative to suitably annotated background knowledge. The success of such analyses is highly dependent on the availability and quality of the gene annotation data. For many years genes were annotated by different experts using inconsistent, non-standard terminologies, and the large amount of variation and duplication in these unstructured annotation sets made them unsuitable for principled quantitative analysis. More recently, much effort has been put into developing and using structured, domain-specific vocabularies to annotate genes. The Gene Ontology is one of the most successful examples, annotating genes with terms from its three branches: biological process, molecular function and cellular component. However, many other established and emerging ontologies could aid biological data interpretation, yet they are rarely used, and for the same reason many bioinformatics tools support analysis only with the Gene Ontology. The lack of annotation coverage, and of support in existing analytical tools, has become a major limitation on the utility and uptake of these ontologies. Automatic approaches are therefore needed to transform unstructured data and unlock the potential of all ontologies, together with corresponding bioinformatics tools to support their interpretation. Approaches: In this thesis, firstly, similar to the approach in [3,4], I propose a series of computational approaches, implemented in a new tool called OntoSuite-Miner, to address the challenge of integrating ontology-based gene association data. The approach uses NLP-based text mining methods for ontology-based biomedical text mining. What differentiates it from other approaches is the integration of two of the most widely used NLP modules into one framework, which not only increases confidence in the text mining results but also provides an annotation score for each mapping, based on the number of pieces of evidence in the literature and the number of NLP modules that agree with the mapping. Since heterogeneous data are important in understanding human disease, the approach was designed to be generic: ontology-based annotation generation can be applied to different sources and repeated with different ontologies.
Secondly, addressing the second TBI challenge of increasing the statistical power of annotation enrichment analysis, I propose OntoSuite-Analytics, which integrates a collection of enrichment analysis methods into a unified open-source software package named topOnto, written in the statistical programming language R. The package supports enrichment analysis across multiple ontologies with a set of implemented statistical/topological algorithms, allowing the comparison of enrichment results across ontologies and between algorithms. Results: The methodologies described above were implemented, and a Human Disease Ontology (HDO) based gene annotation database was generated by mining three publicly available databases: OMIM, GeneRIF and Ensembl variation. With the HDO annotations and the corresponding ontology enrichment analysis tools in topOnto, I profiled 277 gene classes against human diseases and generated 'disease environments' for 1310 human diseases. Exploration of the disease profiles and disease environments gives an overview of known disease knowledge and new insights into disease mechanisms. The integration of multiple ontologies into a disease context demonstrates how 'orthogonal' ontologies can lead to biological insight that would have been missed by a more traditional single-ontology analysis.
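Entry 51 turns on ontology-based enrichment analysis over gene lists. As a hedged illustration, the Python sketch below shows the classic over-representation test (a one-sided hypergeometric test) that enrichment packages of this kind commonly implement; it is not necessarily the algorithm used in topOnto, and all gene counts are invented for illustration.

```python
from scipy.stats import hypergeom

def term_enrichment_pvalue(study_genes, term_genes, population_size):
    """One-sided hypergeometric test for over-representation of an
    ontology term's annotated genes within a study gene list."""
    study, term = set(study_genes), set(term_genes)
    k = len(study & term)  # study genes annotated with the term
    # P(X >= k) when drawing len(study) genes from a population in which
    # len(term) genes carry the annotation
    return hypergeom.sf(k - 1, population_size, len(term), len(study))

# Illustrative numbers: 12 of 200 study genes carry a term annotated to
# 300 of 20,000 background genes (expected overlap is only 3)
p = term_enrichment_pvalue(range(200), range(188, 488), 20_000)
print(f"p = {p:.2e}")
```

Ranking ontology terms by such p-values (with multiple-testing correction) is what makes enrichment results comparable across ontologies and algorithms, as the abstract describes.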
52

Avaliação das capacidades dinâmicas através de técnicas de business analytics / Evaluation of dynamic capabilities using business analytics techniques

Scherer, Jonatas Ost January 2017 (has links)
The development of dynamic capabilities enables a company to innovate more efficiently and therefore improve its performance. This thesis presents a framework for measuring the degree of development of a firm's dynamic capabilities. Text mining techniques were used to propose a bag of words specific to dynamic capabilities, and, based on the literature, a set of routines is proposed to evaluate the operationalization and development of dynamic capabilities. To evaluate the dynamic capabilities, text mining techniques were applied using the annual reports of fourteen airlines as the data source; through this pilot application it was possible to carry out a diagnosis of the airlines and of the sector as a whole. The thesis addresses a gap in the dynamic capabilities literature by proposing a quantitative method for their measurement, together with a bag of words specific to dynamic capabilities. In practical terms, the proposal can support data-driven strategic decision making, allowing firms to innovate more efficiently and improve performance.
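To make the bag-of-words measurement concrete, here is a minimal Python sketch of scoring an annual report against a term list; the terms and the scoring rule are illustrative stand-ins, not the bag of words actually derived in the thesis.

```python
import re
from collections import Counter

# Hypothetical bag of words for dynamic capabilities; the thesis
# derives its own list via text mining of the literature.
DC_TERMS = {"sensing", "seizing", "reconfiguring", "innovation",
            "alliance", "learning", "adaptation"}

def dc_score(report_text: str) -> float:
    """Share of tokens in an annual report that match the bag of words,
    a crude proxy for a firm's emphasis on dynamic capabilities."""
    tokens = re.findall(r"[a-z]+", report_text.lower())
    counts = Counter(tokens)
    hits = sum(counts[t] for t in DC_TERMS)
    return hits / max(len(tokens), 1)

# Comparing dc_score across fourteen airlines' reports would support the
# kind of firm- and sector-level diagnosis the abstract describes.
```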
53

Data-driven temporal information extraction with applications in general and clinical domains

Filannino, Michele January 2016 (has links)
The automatic extraction of temporal information from written texts is pivotal for many Natural Language Processing applications such as question answering, text summarisation and information retrieval. However, Temporal Information Extraction (TIE) is a challenging task because of the variety of expression types (durations, frequencies, times, dates) and their high morphological variability and ambiguity. Most existing approaches are rule-based, while data-driven ones remain under-explored. This thesis introduces a novel domain-independent data-driven TIE strategy. The identification stage is based on machine-learned sequence labelling classifiers over features selected through an extensive exploration, with results further optimised by an a posteriori label-adjustment pipeline. The normalisation stage is rule-based and builds on a pre-existing system. The methodology has been applied to both a specific (clinical) and a generic domain, and has been officially benchmarked in the i2b2/2012 and TempEval-3 challenges, ranking 3rd and 1st respectively. The results show the TIE task to be more challenging in the clinical domain (overall accuracy 63%) than in the general domain (overall accuracy 69%). Finally, this thesis also presents two applications of TIE. One introduces the concept of the temporal footprint of a Wikipedia article and uses it to mine the life spans of persons. In the other, TIE techniques are used to improve pre-existing information retrieval systems by filtering out temporally irrelevant results.
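As an illustration of the data-driven identification stage, the sketch below shows the kind of morphological and contextual features a sequence-labelling classifier for temporal expressions might use; this feature set is hypothetical, as the thesis selected its own through extensive exploration.

```python
def token_features(tokens, i):
    """Features for token i; tokens carrying BIO labels such as
    B-DATE / I-DATE / O would be fed to a sequence classifier."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),                    # e.g. "2010"
        "shape": "".join("d" if c.isdigit() else "x" for c in tok),
        "suffix3": tok[-3:],                          # e.g. "-day", "-ary"
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

print(token_features(["Admitted", "on", "12", "March", "2010"], 2))
```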
54

Evaluierung von Motivationsschreiben als Instrument in universitären Aufnahmeverfahren / Evaluation of letters of motivation as an instrument in university admission procedures

Zeeh, Julia, Ledermüller, Karl, Kobler-Weiß, Michaela January 2018 (has links) (PDF)
While university admission tests are usually evaluated, corresponding procedures for evaluating other steps of the application process, such as the submission of letters of motivation, are not yet established. To close this gap, this paper presents a multi-method approach to evaluating letters of motivation that combines text mining techniques with elements of content analysis. It shows how different "signals" sent by students correlate with academic success, and demonstrates that sociodemographic effects would need to be taken into account when assessing letters of motivation.
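As a hedged illustration of correlating a text-mined "signal" with academic success, the Python sketch below applies a point-biserial correlation to invented data; the paper's actual signals, outcome measures and statistical methods may differ.

```python
from scipy.stats import pointbiserialr

# Hypothetical data: whether a letter mentions a concrete career goal
# (a binary text-mined signal) and the student's later grade average
# (Austrian scale, 1 = best).
signal    = [1,   0,   1,   1,   0,   0,   1,   0]
grade_avg = [1.7, 2.9, 2.1, 1.3, 3.0, 2.6, 2.0, 2.4]

r, p = pointbiserialr(signal, grade_avg)
print(f"r = {r:.2f}, p = {p:.3f}")  # negative r: the signal tracks better grades
```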
55

Analyzing collaboration with large-scale scholarly data

Zuo, Zhiya 01 August 2019 (has links)
We have never stopped in our pursuit of science. Standing on the shoulders of giants, we gradually make our path toward a systematic and testable body of knowledge that explains and predicts the universe. Emerging from researchers' interactions and self-organizing behaviors, scientific communities feature intensive collaborative practice. Indeed, the era of the lone genius is long gone: teams now dominate the production and diffusion of scientific ideas. In order to understand how collaboration shapes and evolves organizations as well as individuals' careers, this dissertation conducts analyses at both macroscopic and microscopic levels utilizing large-scale scholarly data. As self-organizing behaviors, collaborations boil down to interactions among researchers, so understanding collaboration at the individual level is a preliminary and crucial step toward understanding collective outcomes at the group and organization levels. To start, I investigate the role of research collaboration in researchers' careers by leveraging person-organization fit theory. Specifically, I propose prospective social ties based on faculty candidates' potential for future collaboration with future colleagues, which manifest diminishing returns on placement quality. Moving forward, I address how individual success can be better understood and more accurately predicted using collaboration experience data. Findings reveal potential regularities in career trajectories for early-stage, mid-career, and senior researchers, highlighting the importance of various aspects of social capital. With large-scale scholarly data, I propose a data-driven analytics approach that leads to a deeper understanding of collaboration for both organizations and individuals. Managerial and policy implications are discussed, for organizations to stimulate interdisciplinary research and for individuals to achieve better placement as well as short- and long-term scientific impact. Additionally, while the analyses are set in the context of academia, the proposed methods and implications generalize to knowledge-intensive industries, where collaboration is a key factor in performance outcomes such as innovation and creativity.
56

Semiautomatische Metadaten-Extraktion und Qualitätsmanagement in Workflow-Systemen zur Digitalisierung historischer Dokumente / Semi-automated Metadata Extraction and Quality Management in Workflow Systems for Digitizations of Early Documents

Schöneberg, Hendrik January 2014 (has links) (PDF)
Performing Named Entity Recognition on ancient documents is a time-consuming, complex and error-prone manual task, yet it is a prerequisite for identifying related documents and correlating named entities across distinct sources, helping to precisely reconstruct historic events. To reduce the manual effort, automated classification approaches can be leveraged. Classifying terms in ancient documents automatically is difficult, however, because of the sources' challenging syntax, high spelling variance (orthography was often agreed only by convention) and poor states of conservation. This thesis introduces and evaluates approaches that cope with complex syntactic environments by using statistical information derived from a term's context and combining it with domain-specific heuristic knowledge to perform classification. Furthermore, this thesis demonstrates how the metadata generated by these approaches can be used as error heuristics to greatly improve the performance of workflow systems for the digitization of early documents.
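The sketch below illustrates the general pattern the abstract describes, combining statistics derived from a term's context with domain-specific heuristic rules to classify it; the classes, statistics and rules here are hypothetical stand-ins for the thesis's actual models.

```python
from collections import defaultdict

def classify_term(term, context_words, context_stats, heuristic_rules):
    """Score candidate entity classes by summing context-word evidence
    (estimated class probabilities per context word) plus rule weights."""
    scores = defaultdict(float)
    for word in context_words:
        for cls, p in context_stats.get(word, {}).items():
            scores[cls] += p
    for rule in heuristic_rules:      # each returns (class, weight) or (None, 0)
        cls, weight = rule(term)
        if cls is not None:
            scores[cls] += weight
    return max(scores, key=scores.get, default="unknown")

# Illustrative use: a Latin -us ending hints at a personal name
rules = [lambda t: ("person", 1.0) if t.endswith("us") else (None, 0.0)]
stats = {"sanctus": {"person": 0.8}, "anno": {"other": 0.6}}
print(classify_term("Albertus", ["sanctus", "anno"], stats, rules))  # person
```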
57

Understanding the hormonal regulation of mouse lactogenesis by transcriptomics and literature analysis

Ling, Maurice Han Tong January 2009 (has links)
The mammary explant culture model has been a major experimental tool for studying the hormonal requirements for milk protein gene expression as a marker of secretory differentiation. Experiments with mammary explants from pregnant animals of many species have established that insulin, prolactin and glucocorticoid are the minimal set of hormones required for the induction of maximal milk protein gene expression. However, the extent to which mammary explants mimic the response of the mammary gland in vivo is not clear. Recent studies have used microarray technology to study the transcriptome of the mouse lactation cycle. It was demonstrated that each phase of mouse lactation has a distinct transcriptional profile, but making sense of microarray results requires analysis of large amounts of biological information, which is increasingly difficult to access as the amount of literature grows. / The first objective is to examine the possibility of combining literature and genomic analysis to elucidate potentially novel hypotheses for further research into lactation biology. The second objective is to evaluate the strengths and limitations of the murine mammary explant culture for the study and understanding of murine lactogenesis. The underlying question for this objective is whether the mouse mammary explant culture is a good model of mouse lactogenesis. / The exponential increase in the publication rate of new articles is limiting researchers' access to relevant literature. This has prompted the use of text mining tools to extract key biological information. Previous studies have reported extensive modification of existing generic text processors to process biological text, but this requirement for modification had not been examined. We constructed Muscorian from MontyLingua, a generic text processor. It uses a previously proposed two-layered generalization-specialization paradigm, in which text is generically processed to a suitable intermediate format before domain-specific data extraction techniques are applied at the specialization layer. Evaluation using a corpus and experts indicated 86-90% precision and approximately 30% recall in extracting protein-protein interactions, comparable to previous studies using either specialized biological text processing tools or modified existing tools. This study also demonstrated the flexibility of the two-layered generalization-specialization paradigm by using the same generalization layer for two specialized information extraction tasks. / The performance of Muscorian was unexpected, since potential errors from a series of text analysis steps would be expected to adversely affect the outcome of the entire process. Most biomedical entity relationship extraction tools have used biomedical-specific part-of-speech (POS) taggers, as errors in POS tagging are likely to affect subsequent semantic analysis of the text, such as shallow parsing. A comparative study between MontyTagger, a generic POS tagger, and MedPost, a tagger trained on biomedical text, was carried out. Our results demonstrated that MontyTagger, Muscorian's POS tagger, has a POS tagging accuracy of 83.1% when tested on biomedical text. Replacing MontyTagger with MedPost did not result in a significant improvement in entity relationship extraction from text: precision of 55.6% from MontyTagger versus 56.8% from MedPost on directional relationships, and 86.1% from MontyTagger compared to 81.8% from MedPost on un-directional relationships.
This was unexpected, as poor POS tagging by MontyTagger would be likely to affect the outcome of information extraction. An analysis of POS tagging errors demonstrated that 78.5% of tagging errors are compensated for by shallow parsing. Thus, despite 83.1% tagging accuracy, MontyTagger has a functional tagging accuracy of 94.6%. This suggests that POS tagging errors do not adversely affect the information extraction task if they are resolved in shallow parsing through alternative POS tag use. / Microarrays have been used to examine the transcriptome of mouse lactation, and a simple method for microarray analysis is correlation studies, in which functionally related genes exhibit similar expression profiles. However, there has been no study to date using text mining to sieve microarray analysis and generate new hypotheses for further research in the field of lactational biology. Our results demonstrated that a previously reported protein name co-occurrence method (5-mention PubGene), which was not based on a hypothesis-testing framework, is generally more stringent than the 99th-percentile Poisson distribution method of calculating co-occurrence. It agrees with previous methods using natural language processing to extract protein-protein interactions from text, as more than 96% of the interactions found by natural language processing methods coincide with the results of the 5-mention PubGene method. However, less than 2% of the gene co-expressions found by microarray analysis were supported by direct co-occurrence or interaction information extracted from the literature. At the same time, by combining microarray and literature analyses, we derived a novel set of 7 potential functional protein-protein interactions that had not previously been described in the literature. We conclude that the 5-mention PubGene method is more stringent than the 99th-percentile Poisson distribution method for extracting protein-protein interactions by co-occurrence of entity names, and that literature analysis may be a useful filter for microarray analysis to isolate potentially novel hypotheses for further research. / The availability of transcriptomics data from time-course experiments on mouse mammary glands during the lactation cycle and on hormone-induced lactogenesis in mammary explants has permitted an assessment of the similarity of gene expression at the transcriptional level. Global transcriptome analysis, using the exact Wilcoxon signed-rank test with continuity correction and hierarchical clustering of Spearman coefficients, demonstrated that hormone-induced mammary explants behave differently from mammary glands at secretory differentiation. Our results demonstrated that the mammary explant culture model mimics in vivo glands in immediate responses, such as hormone-responsive gene transcription, but generally does not mimic responses to prolonged hormonal stimulus, such as the extensive development of secretory pathways and immune responses normally associated with lactating mammary tissue. Hence, although the explant model is useful for studying the immediate effects of stimulating secretory differentiation in mammary glands, it is unlikely to be suitable for the study of secretory activation.
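The two co-occurrence criteria compared above can be stated compactly. The sketch below implements the 99th-percentile Poisson test under the usual independence null; the 5-mention PubGene criterion, by contrast, simply requires at least five co-mentions. All counts are illustrative.

```python
from scipy.stats import poisson

def cooccurs_significantly(n_a, n_b, n_ab, n_docs):
    """True if the observed co-mention count n_ab exceeds the 99th
    percentile of a Poisson null whose mean is the number of co-mentions
    expected if the two names appeared independently."""
    expected = n_a * n_b / n_docs   # n_docs * (n_a/n_docs) * (n_b/n_docs)
    return n_ab > poisson.ppf(0.99, expected)

# Name A in 400 documents, B in 300, co-mentioned 12 times in 100,000
print(cooccurs_significantly(400, 300, 12, 100_000))  # expected ~1.2 -> True
```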
58

A Lexicon for Gene Normalization / Ett lexicon för gennormalisering

Lingemark, Maria January 2009 (has links)
Researchers tend to use their own or favourite gene names in scientific literature, even though official names exist, and some names may be used for more than one gene. This leads to ambiguity problems when automatically mining biological literature. To disambiguate gene names, gene normalization is used. In this thesis, we examine an existing gene normalization system and develop a new method to find gene candidates for ambiguous mentions. For the new method a lexicon is created, using information about gene names, symbols and synonyms from three different databases. A gene mention found in the scientific literature is used as input for a search in this lexicon, and all genes in the lexicon that match the mention are returned as gene candidates for it. These candidates are then used in the system's disambiguation step. Results show that the new method improves the system's overall result, with an increase in precision and a small decrease in recall.
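A minimal sketch of such a lexicon lookup is shown below, assuming a simple normalization of mentions (lower-casing and stripping non-alphanumerics) and invented identifiers; the thesis's actual matching rules may differ.

```python
from collections import defaultdict

class GeneLexicon:
    """Maps normalized gene names, symbols and synonyms (merged from
    several databases) to sets of gene identifiers."""
    def __init__(self):
        self.index = defaultdict(set)

    @staticmethod
    def normalize(mention: str) -> str:
        return "".join(c for c in mention.lower() if c.isalnum())

    def add(self, gene_id: str, *names: str):
        for name in names:
            self.index[self.normalize(name)].add(gene_id)

    def candidates(self, mention: str) -> set:
        return self.index.get(self.normalize(mention), set())

lex = GeneLexicon()
lex.add("HGNC:11998", "TP53", "p53", "tumor protein p53")  # invented entry
print(lex.candidates("P-53"))  # normalization lets variant spellings match
```

Ambiguous mentions may return several identifiers; the system's disambiguation step then chooses among the candidates.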
59

Mining of Text in the Product Development Process

Loh, Han Tong, Menon, Rakesh, Leong, Christopher K. 01 1900 (has links)
In the prevailing world economy, competition is keen and firms need an edge over their competitors for profitability and sometimes even for the survival of the business itself. One way to achieve this is the capability for rapid product development on a continual basis. However, this rapidity must be accomplished without compromising the vital information and feedback that are necessary; compromising such information and feedback for the sake of speed may produce counter-productive outcomes, offsetting or even negating whatever profits could have been derived. New ways, tools and techniques must be found to deliver such information. The widespread availability of databases within the Product Development Process (PDP) facilitates the use of data mining as one of these tools. Thus far, most studies on data mining within the PDP have emphasised numerical databases; studies focusing on textual databases in this context have been relatively few. The research direction is to study real-life cases where textual databases can be mined to obtain valuable information for the PDP. One suitable candidate identified for this is "voice of the customer" databases. / Singapore-MIT Alliance (SMA)
60

Extracting Structured Knowledge from Textual Data in Software Repositories

Hasan, Maryam 06 1900 (has links)
Software team members, as they communicate and coordinate their work with others throughout the life-cycle of their projects, generate different kinds of textual artifacts. Despite the variety of work in the area of mining software artifacts, relatively little research has focused on communication artifacts. Software communication artifacts, in addition to source code artifacts, contain useful semantic information that is not fully exploited by existing approaches. This thesis presents the development of a text analysis method and tool to extract and represent useful pieces of information from a wide range of textual data sources associated with software projects. Our text analysis system integrates Natural Language Processing techniques and statistical text analysis methods with software domain knowledge. The extracted information is represented as RDF-style triples that capture relations between developers and software products. We applied the system to five kinds of textual data: source code commits, bug reports, email messages, chat logs, and wiki pages. In our evaluation, the system achieved a precision of 82%, a recall of 58%, and an F-measure of 68%.
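To make the triple representation and the reported scores concrete, here is a minimal sketch with invented triples; precision, recall and F-measure are computed set-wise against a gold standard, consistent with how such figures are usually derived (the data below is purely illustrative, not output of the tool).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str    # e.g. a developer
    predicate: str  # relation mined from commits, bugs, email, chat, wikis
    obj: str        # e.g. a software artifact

extracted = {Triple("alice", "fixes", "bug#1042"),
             Triple("alice", "commits", "parser.c"),
             Triple("bob", "discusses", "bug#1042")}
gold = {Triple("alice", "fixes", "bug#1042"),
        Triple("bob", "reviews", "parser.c")}

tp = len(extracted & gold)                      # correctly extracted triples
precision, recall = tp / len(extracted), tp / len(gold)
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
```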
