61 |
Data-driven temporal information extraction with applications in general and clinical domains. Filannino, Michele. January 2016 (has links)
The automatic extraction of temporal information from written texts is pivotal for many Natural Language Processing applications such as question answering, text summarisation and information retrieval. However, Temporal Information Extraction (TIE) is a challenging task because of the variety of expression types (durations, frequencies, times, dates) and their high morphological variability and ambiguity. Most existing approaches are rule-based, while data-driven ones remain under-explored. This thesis introduces a novel domain-independent data-driven TIE strategy. The identification strategy is based on machine learning sequence labelling classifiers trained on features selected through an extensive exploration. Results are further optimised using an a posteriori label-adjustment pipeline. The normalisation strategy is rule-based and builds on a pre-existing system. The methodology has been applied to both the clinical and the general domain, and has been officially benchmarked at the i2b2/2012 and TempEval-3 challenges, ranking 3rd and 1st respectively. The results show the TIE task to be more challenging in the clinical domain (overall accuracy 63%) than in the general domain (overall accuracy 69%). Finally, this thesis also presents two applications of TIE. One of them introduces the concept of the temporal footprint of a Wikipedia article and uses it to mine the life spans of persons. In the other, TIE techniques are used to improve pre-existing information retrieval systems by filtering out temporally irrelevant results.
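To make the identification step concrete, the following is a minimal sketch of BIO-style sequence labelling for temporal expressions, assuming the sklearn-crfsuite package; the feature set, toy sentences and tag inventory are illustrative only and are not the features actually selected in the thesis. An a posteriori label-adjustment pass would then post-process the predicted tag sequence (for example, repairing an I- tag that follows an O).

```python
# Minimal BIO sequence-labelling sketch for temporal expression identification.
# Assumes the sklearn-crfsuite package; features and data are illustrative only.
import sklearn_crfsuite

def token_features(tokens, i):
    """Simple lexical/orthographic features for token i (illustrative subset)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_title": tok.istitle(),
        "suffix3": tok[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# One toy training sentence: "Admitted on 12 March 2010 for three weeks"
sentences = [["Admitted", "on", "12", "March", "2010", "for", "three", "weeks"]]
labels    = [["O", "O", "B-DATE", "I-DATE", "I-DATE", "O", "B-DURATION", "I-DURATION"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, labels)

test = ["Discharged", "on", "15", "April", "2010"]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))
```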
62 |
Evaluierung von Motivationsschreiben als Instrument in universitären Aufnahmeverfahren. Zeeh, Julia; Ledermüller, Karl; Kobler-Weiß, Michaela. January 2018 (has links) (PDF)
While university admission tests are usually evaluated, corresponding procedures for evaluating other steps of application processes, such as the submission of motivation letters, are not yet established. To close this gap, this paper presents a multi-method approach for evaluating motivation letters, in which text-mining techniques are combined with elements of content analysis. It shows how different "signals" sent by students correlate with study success, and that socio-demographic effects would need to be taken into account when assessing motivation letters.
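As a rough illustration (not taken from the paper) of how text-mining signals from motivation letters could be related to study success while keeping a socio-demographic covariate in view, the sketch below uses TF-IDF features and logistic regression; the letters, the outcome variable and the covariate are all hypothetical.

```python
# Illustrative sketch only: relate text features of motivation letters to study success.
# The data, feature set and model are hypothetical, not the paper's actual method.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

letters = ["I am motivated by economics and statistics ...",
           "My goal is to study business administration ...",
           "I want to work in finance after graduation ..."]
success = np.array([1, 0, 1])           # e.g. degree completed within standard time
first_generation = np.array([0, 1, 1])  # hypothetical socio-demographic covariate

tfidf = TfidfVectorizer(min_df=1, stop_words="english")
X_text = tfidf.fit_transform(letters).toarray()
X = np.hstack([X_text, first_generation.reshape(-1, 1)])

model = LogisticRegression(max_iter=1000).fit(X, success)
print(dict(zip(list(tfidf.get_feature_names_out()) + ["first_generation"],
               model.coef_[0].round(2))))
```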
63 |
Analyzing collaboration with large-scale scholarly data. Zuo, Zhiya. 01 August 2019 (has links)
We have never stopped pursuing science. Standing on the shoulders of giants, we gradually make our way in building a systematic and testable body of knowledge to explain and predict the universe. Emerging from researchers’ interactions and self-organizing behaviors, scientific communities feature intensive collaborative practice. Indeed, the era of the lone genius is long gone; teams now dominate the production and diffusion of scientific ideas. In order to understand how collaboration shapes organizations as well as individuals’ careers, this dissertation conducts analyses at both the macroscopic and the microscopic level utilizing large-scale scholarly data.
As self-organizing behaviors, collaborations boil down to interactions among researchers. Understanding collaboration at the individual level is therefore a preliminary and crucial step toward understanding collective outcomes at the group and organization levels. To start, I investigate the role of research collaboration in researchers’ careers by leveraging person-organization fit theory. Specifically, I propose prospective social ties based on faculty candidates’ future collaboration potential with future colleagues, which manifest diminishing returns with respect to placement quality. Moving forward, I address the question of how individual success can be better understood and more accurately predicted using researchers’ collaboration experience data. Findings reveal potential regularities in career trajectories for early-stage, mid-career, and senior researchers, highlighting the importance of various aspects of social capital.
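A minimal sketch of the prospective-social-tie idea follows, assuming a co-authorship graph built with networkx; the graph, the names and the tie score (here, the number of shared prior collaborators between a candidate and each prospective colleague) are invented for illustration and are not the dissertation's exact operationalisation.

```python
# Illustrative sketch: prospective ties between a faculty candidate and a hiring department.
# Graph, names and scoring are hypothetical, not the dissertation's exact operationalisation.
import networkx as nx

# Co-authorship graph: an edge means two researchers have published together.
G = nx.Graph()
G.add_edges_from([
    ("candidate", "alice"), ("candidate", "bob"),
    ("alice", "carol"), ("bob", "dave"), ("carol", "dave"),
    ("eve", "carol"), ("eve", "dave"),
])

department = ["carol", "dave", "eve"]  # prospective future colleagues

def prospective_tie(graph, candidate, colleague):
    """Collaboration-potential proxy: number of shared prior collaborators."""
    return len(list(nx.common_neighbors(graph, candidate, colleague)))

scores = {c: prospective_tie(G, "candidate", c) for c in department}
print(scores)                # {'carol': 1, 'dave': 1, 'eve': 0}
print(sum(scores.values()))  # aggregate fit of the candidate with the department
```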
With large-scale scholarly data, I propose a data-driven analytics approach that leads to a deeper understanding of collaboration for both organizations and individuals. Managerial and policy implications are discussed for organizations to stimulate interdisciplinary research and for individuals to achieve better placement as well as short- and long-term scientific impact. Additionally, while the analyses are conducted in the context of academia, the proposed methods and implications can be generalized to knowledge-intensive industries, where collaboration is a key factor in performance outcomes such as innovation and creativity.
64 |
Profiling topics on the Web for knowledge discovery. Sehgal, Aditya Kumar. 01 January 2007 (has links)
The availability of large-scale data on the Web motivates the development of automatic algorithms to analyze topics and to identify relationships between topics. Various approaches have been proposed in the literature. Most focus on specific topics, mainly those representing people, with little attention to topics of other kinds. They are also less flexible in how they represent topics.
In this thesis we study existing methods and describe a different approach, based on profiles, for representing topics. A Topic Profile is analogous to a synopsis of a topic and consists of different types of features. Profiles are flexible, allowing different combinations of features to be emphasized, and extensible, allowing new features to be incorporated without changing the underlying logic.
More generally, topic profiles provide an abstract framework that can be used to create different types of concrete representations for topics. Different options regarding the number of documents considered for a topic or types of features extracted can be decided based on requirements of the problem as well as the characteristics of the data. Topic profiles also provide a framework to explore relationships between topics.
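As one possible concrete instantiation of the framework, the sketch below represents a topic profile as a typed, weighted feature bag and scores the relationship between two topics by cosine similarity; the feature types, weights and topics are invented for illustration.

```python
# Illustrative sketch of topic profiles as typed, weighted feature bags.
# Feature types, weights and topics are invented for illustration.
import math

def cosine(p, q):
    """Cosine similarity between two sparse feature->weight dictionaries."""
    shared = set(p) & set(q)
    dot = sum(p[f] * q[f] for f in shared)
    norm = (math.sqrt(sum(w * w for w in p.values()))
            * math.sqrt(sum(w * w for w in q.values())))
    return dot / norm if norm else 0.0

# A profile mixes feature types (terms, linked entities, ...), each prefixed by its type.
profile_a = {"term:healthcare": 0.9, "term:insurance": 0.7, "entity:Senate": 0.4}
profile_b = {"term:insurance": 0.8, "term:medicare": 0.6, "entity:Senate": 0.5}

print(round(cosine(profile_a, profile_b), 3))  # relationship strength between the two topics
```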
We compare different methods for building profiles and evaluate them in terms of their information content and their ability to predict relationships between topics. We contribute new methods for term weighting and for identifying relevant text segments in web documents.
In this thesis, we present an application of our profile-based approach to exploring social networks of US senators generated from web data, and we compare them with networks generated from voting data. We consider both general networks and issue-specific networks. We also apply topic profiles to identify and rank experts for given topics of interest, as part of the 2007 TREC Expert Search task.
Overall, our results show that topic profiles provide a strong foundation for exploring different topics and for mining relationships between topics using web data. Our approach can be applied to a wide range of web knowledge discovery problems, in contrast to existing approaches that are mostly designed for specific problems.
65 |
Semiautomatische Metadaten-Extraktion und Qualitätsmanagement in Workflow-Systemen zur Digitalisierung historischer Dokumente / Semi-automated Metadata Extraction and Quality Management in Workflow Systems for Digitizations of Early Documents. Schöneberg, Hendrik. January 2014 (has links) (PDF)
Performing Named Entity Recognition on ancient documents is a time-consuming, complex and error-prone manual task. It is, however, a prerequisite for identifying related documents and correlating named entities across distinct sources, helping to precisely recreate historic events. In order to reduce the manual effort, automated classification approaches could be leveraged. Classifying terms in ancient documents in an automated manner poses a difficult task due to the sources’ challenging syntax and poor conservation states. This thesis introduces and evaluates approaches that can cope with complex syntactic environments by using statistical information derived from a term’s context and combining it with domain-specific heuristic knowledge to perform a classification. Furthermore, this thesis demonstrates how metadata generated by these approaches can be used as error heuristics to greatly improve the performance of workflow systems for digitizations of early documents. / Extracting metadata from historical documents is a time-consuming, complex and highly error-prone activity that usually has to be carried out by a human expert. It is nevertheless necessary in order to establish relations between documents, to answer search queries about historical events correctly, and to build semantic links. To reduce the manual effort of this task, Named Entity Recognition techniques are to be applied. Classifying terms in historical manuscripts is a major challenge, however, because the domain exhibits a high variance in spelling caused, among other things, by orthography that was agreed upon only by convention. This work presents approaches that can operate even in complex syntactic environments by drawing on information from the context of the terms to be classified and combining it with domain-specific heuristics. It further evaluates how the metadata obtained in this way can be used in workflow systems for the digitization of historical manuscripts to add value through heuristics for detecting production errors.
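A toy sketch of the idea of combining context statistics with domain heuristics is given below; the context counts, the heuristic lexicon and the weighting are invented and are much simpler than the approaches evaluated in the thesis.

```python
# Illustrative sketch: classify a term in an early document by combining simple
# context statistics with domain heuristics. Counts, lexicons and weights are invented.
from collections import Counter

# Context-word counts per class, e.g. gathered from already-annotated manuscript pages.
context_counts = {
    "PERSON": Counter({"herr": 30, "witwe": 12, "sohn": 9}),
    "PLACE":  Counter({"in": 40, "bei": 15, "dorf": 8}),
}
heuristic_lexicon = {"PERSON": {"hans", "anna"}, "PLACE": {"wuerzburg"}}

def classify(term, context, alpha=0.7):
    """Score = alpha * normalised context evidence + (1 - alpha) * heuristic evidence."""
    scores = {}
    for label, counts in context_counts.items():
        total = sum(counts.values())
        stat = sum(counts[w] for w in context) / total if total else 0.0
        heur = 1.0 if term.lower() in heuristic_lexicon[label] else 0.0
        scores[label] = alpha * stat + (1 - alpha) * heur
    return max(scores, key=scores.get), scores

print(classify("Hans", context=["herr", "sohn"]))   # expected: PERSON
```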
66 |
Understanding the hormonal regulation of mouse lactogenesis by transcriptomics and literature analysis. Ling, Maurice Han Tong. January 2009 (has links)
The mammary explant culture model has been a major experimental tool for studying hormonal requirements for milk protein gene expression as markers of secretory differentiation. Experiments with mammary explants from pregnant animals of many species have established that insulin, prolactin, and glucocorticoid are the minimal set of hormones required for the induction of maximal milk protein gene expression. However, the extent to which mammary explants mimic the response of the mammary gland in vivo is not clear. Recent studies have used microarray technology to study the transcriptome of the mouse lactation cycle. It was demonstrated that each phase of mouse lactation has a distinct transcriptional profile, but making sense of microarray results requires analysis of large amounts of biological information, which is increasingly difficult to access as the amount of literature grows. / The first objective is to examine the possibility of combining literature and genomic analysis to elucidate potentially novel hypotheses for further research into lactation biology. The second objective is to evaluate the strengths and limitations of the murine mammary explant culture for the study and understanding of murine lactogenesis. The underlying question of this objective is whether the mouse mammary explant culture is a good model for studying mouse lactogenesis. / The exponential increase in the publication rate of new articles is limiting researchers' access to relevant literature. This has prompted the use of text mining tools to extract key biological information. Previous studies have reported extensive modification of existing generic text processors to process biological text; however, this requirement for modification had not been examined. We have constructed Muscorian, using MontyLingua, a generic text processor. It uses a previously proposed two-layered generalization-specialization paradigm, in which text is generically processed to a suitable intermediate format before domain-specific data extraction techniques are applied at the specialization layer. Evaluation using a corpus and experts indicated 86-90% precision and approximately 30% recall in extracting protein-protein interactions, which was comparable to previous studies using either specialized biological text processing tools or modified existing tools. This study also demonstrated the flexibility of the two-layered generalization-specialization paradigm by using the same generalization layer for two specialized information extraction tasks. / The performance of Muscorian was unexpected, since potential errors from a series of text analysis processes are likely to adversely affect the outcome of the entire process. Most biomedical entity relationship extraction tools have used a biomedicine-specific part-of-speech (POS) tagger, as errors in POS tagging are likely to affect subsequent semantic analysis of the text, such as shallow parsing. A comparative study between MontyTagger, a generic POS tagger, and MedPost, a tagger trained on biomedical text, was carried out. Our results demonstrated that MontyTagger, Muscorian's POS tagger, has a POS tagging accuracy of 83.1% when tested on biomedical text. Replacing MontyTagger with MedPost did not result in a significant improvement in entity relationship extraction from text: precision of 55.6% from MontyTagger versus 56.8% from MedPost on directional relationships, and 86.1% from MontyTagger compared to 81.8% from MedPost on un-directional relationships.
This is unexpected, as poor POS tagging by MontyTagger would be expected to adversely affect the outcome of the information extraction. An analysis of POS tagging errors demonstrated that 78.5% of tagging errors are compensated for by shallow parsing. Thus, despite 83.1% tagging accuracy, MontyTagger has a functional tagging accuracy of 94.6%. This suggests that POS tagging errors do not adversely affect the information extraction task if the errors are resolved during shallow parsing through alternative POS tag use. / Microarrays have been used to examine the transcriptome of mouse lactation, and a simple method for microarray analysis is correlation analysis, in which functionally related genes exhibit similar expression profiles. However, there has been no study to date using text mining to sieve microarray results to generate new hypotheses for further research in the field of lactational biology. Our results demonstrated that a previously reported protein-name co-occurrence method (5-mention PubGene), which was not based on a hypothesis-testing framework, is generally more stringent than the 99th-percentile Poisson distribution-based method of calculating co-occurrence. It agrees with previous methods using natural language processing to extract protein-protein interactions from text, as more than 96% of the interactions found by natural language processing methods coincide with the results from the 5-mention PubGene method. However, less than 2% of the gene co-expressions analyzed by microarray were found through direct co-occurrence or interaction information extraction from the literature. At the same time, combining microarray and literature analyses, we derive a novel set of 7 potential functional protein-protein interactions that had not been previously described in the literature. We conclude that the 5-mention PubGene method is more stringent than the 99th-percentile Poisson distribution method for extracting protein-protein interactions by co-occurrence of entity names, and that literature analysis may be a potential filter for microarray analysis to isolate potentially novel hypotheses for further research. / The availability of transcriptomics data from time-course experiments on mouse mammary glands examined during the lactation cycle and on hormone-induced lactogenesis in mammary explants has permitted an assessment of the similarity of gene expression at the transcriptional level. Global transcriptome analysis using the exact Wilcoxon signed-rank test with continuity correction and hierarchical clustering of Spearman correlation coefficients demonstrated that hormone-induced mammary explants behave differently from mammary glands at secretory differentiation. Our results demonstrated that the mammary explant culture model mimics in vivo glands in immediate responses, such as hormone-responsive gene transcription, but generally does not mimic responses to prolonged hormonal stimulus, such as the extensive development of secretory pathways and immune responses normally associated with lactating mammary tissue. Hence, although the explant model is useful for studying the immediate effects of stimulating secretory differentiation in mammary glands, it is unlikely to be suitable for the study of secretory activation.
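The Poisson-based co-occurrence criterion referred to above can be sketched as follows, assuming SciPy; the counts are invented, and the expected-count model under independence is an assumption that may differ in detail from the thesis.

```python
# Illustrative sketch of the Poisson 99th-percentile co-occurrence test; counts are invented
# and the exact expectation model may differ from the thesis.
from scipy.stats import poisson

def cooccurrence_significant(n_a, n_b, n_ab, n_docs, percentile=0.99):
    """Flag a gene/protein pair whose observed co-mentions exceed the Poisson threshold.

    Expected co-mentions under independence: n_a * n_b / n_docs.
    (The 5-mention PubGene criterion would, roughly, require at least 5 co-mentions instead.)
    """
    expected = n_a * n_b / n_docs
    threshold = poisson.ppf(percentile, expected)
    return n_ab > threshold, expected, threshold

# Pair mentioned in 120 and 80 abstracts out of 100,000, co-mentioned in 5.
print(cooccurrence_significant(n_a=120, n_b=80, n_ab=5, n_docs=100_000))
```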
67 |
A Lexicon for Gene Normalization / Ett lexicon för gennormalisering. Lingemark, Maria. January 2009 (has links)
Researchers tend to use their own or favourite gene names in scientific literature, even though there are official names. Some names may even be used for more than one gene. This leads to problems with ambiguity when automatically mining biological literature. To disambiguate the gene names, gene normalization is used. In this thesis, we examine an existing gene normalization system and develop a new method to find gene candidates for ambiguous gene mentions. For the new method, a lexicon is created using gene names, symbols and synonyms from three different databases. The gene mention found in the scientific literature is used as input for a search in this lexicon, and all genes in the lexicon that match the mention are returned as gene candidates for that mention. These candidates are then used in the system's disambiguation step. Results show that the new method gives a better overall result for the system, with an increase in precision and a small decrease in recall.
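A minimal sketch of the candidate-lookup step is shown below: a lexicon maps normalised names, symbols and synonyms to gene identifiers, and a mention retrieves every matching gene as a candidate for the subsequent disambiguation step. The records, identifiers and normalisation rule are invented; the real lexicon merges three databases rather than the toy list here.

```python
# Illustrative sketch of lexicon-based gene candidate lookup; records and IDs are invented.
import re
from collections import defaultdict

def normalise(name):
    """Lower-case and strip punctuation/whitespace so spelling variants collide."""
    return re.sub(r"[\s\-_./]+", "", name.lower())

# Merged records in the spirit of combining several gene databases.
records = [
    ("GENE:0001", ["TP53", "p53", "tumor protein p53"]),
    ("GENE:0002", ["TP63", "p53-related protein"]),
    ("GENE:0003", ["P53", "pfam p53 domain gene"]),  # artificial clash to show ambiguity
]

lexicon = defaultdict(set)
for gene_id, names in records:
    for name in names:
        lexicon[normalise(name)].add(gene_id)

def candidates(mention):
    return sorted(lexicon.get(normalise(mention), set()))

print(candidates("p-53"))   # both GENE:0001 and GENE:0003 -> passed to disambiguation
```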
68 |
Mining Of Text In The Product Development Process. Loh, Han Tong; Menon, Rakesh; Leong, Christopher K. 01 1900 (has links)
In the prevailing world economy, competition is keen and firms need an edge over their competitors for profitability and sometimes even for the survival of the business itself. One way to help achieve this is the capability for rapid product development on a continual basis. However, this rapidity must be accomplished without compromising the vital information and feedback that are necessary. Compromising such information and feedback for the sake of speed may result in counter-productive outcomes, offsetting or even negating whatever profits could have been derived. New ways, tools and techniques must be found to deliver such information. The widespread availability of databases within the Product Development Process (PDP) facilitates the use of data mining as one of these tools. Thus far, most studies on data mining within the PDP have emphasised numerical databases; studies focusing on textual databases in this context have been relatively few. The research direction is to study real-life cases where textual databases can be mined to obtain valuable information for the PDP. One suitable candidate identified for this is “voice of the customer” databases. / Singapore-MIT Alliance (SMA)
69 |
Extracting Structured Knowledge from Textual Data in Software Repositories. Hasan, Maryam. 06 1900 (has links)
Software team members, as they communicate and coordinate their work with others throughout the life-cycle of their projects, generate different kinds of textual artifacts. Despite the variety of works in the area of mining software artifacts, relatively little research has focused on communication artifacts. Software communication artifacts, in addition to source code artifacts, contain useful semantic information that is not fully explored by existing approaches.
This thesis presents the development of a text analysis method and tool to extract and represent useful pieces of information from a wide range of textual data sources associated with software projects. Our text analysis system integrates Natural Language Processing techniques and statistical text analysis methods with software domain knowledge. The extracted information is represented as RDF-style triples that constitute interesting relations between developers and software products. We applied the developed system to analyze five different types of textual data, i.e., source code commits, bug reports, email messages, chat logs, and wiki pages. In the evaluation of our system, we found its precision to be 82%, its recall 58%, and its F-measure 68%.
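The RDF-style triples might look like the sketch below, which uses the rdflib package with an invented namespace and invented relation names; the actual relation vocabulary of the thesis is not reproduced here.

```python
# Illustrative sketch of representing extracted developer/product relations as RDF triples.
# The namespace, relation names and facts are invented.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/software/")
g = Graph()
g.bind("ex", EX)

# Triples a text-analysis pipeline might emit from commits, bug reports and chat logs.
g.add((EX.alice, EX.committedTo, EX.parserModule))
g.add((EX.alice, EX.fixed, EX.bug142))
g.add((EX.bug142, EX.reportedBy, EX.bob))
g.add((EX.parserModule, EX.partOf, Literal("release 2.1")))

print(g.serialize(format="turtle"))
```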
70 |
A clustering scheme for large high-dimensional document datasets. Chen, Jing-wen. 09 August 2007 (has links)
People pay more and more attention to document clustering methods. Because of the high dimensionality and the large amount of data, clustering methods usually need a lot of computation time. We propose a scheme to make the clustering algorithm much faster than the original one. We partition the whole dataset into several parts. First, we cluster one of these parts. Then, according to the cluster labels, we reduce the number of features by a certain ratio. We add another part of the data, convert it to the lower-dimensional space, and cluster again, repeating this until all partitions are used. According to the experimental results, this scheme may run about twice as fast as the original clustering method.
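A rough sketch of the scheme follows, under the assumptions that k-means is the base clustering algorithm and that the feature reduction is performed once from the first partition's cluster labels; the data, partition sizes and reduction ratio are placeholders.

```python
# Rough sketch of incremental clustering with label-driven feature reduction.
# The data, the use of k-means and a single reduction step are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.random((1000, 2000))     # stand-in for a non-negative document-term matrix
parts = np.array_split(X, 5)     # partition the whole dataset
k, keep_ratio = 10, 0.2

# 1) Cluster the first partition in the full feature space.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(parts[0])

# 2) Use those cluster labels to keep only the most label-discriminative features.
selector = SelectKBest(chi2, k=int(X.shape[1] * keep_ratio)).fit(parts[0], labels)

# 3) Add each remaining partition in the reduced space and re-cluster.
seen = selector.transform(parts[0])
for part in parts[1:]:
    seen = np.vstack([seen, selector.transform(part)])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(seen)

print(seen.shape, np.bincount(labels))
```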