11

Knowledge derivation and data mining strategies for probabilistic functional integrated networks

James, Katherine January 2012 (has links)
One of the fundamental goals of systems biology is the experimental verification of the interactome: the entire complement of molecular interactions occurring in the cell. Vast amounts of high-throughput data have been produced to aid this effort. However, these data are incomplete and contain high levels of both false positives and false negatives. In order to combat these limitations in data quality, computational techniques have been developed to evaluate the datasets and integrate them in a systematic fashion using graph theory. The result is an integrated network which can be analysed using a variety of network analysis techniques to draw new inferences about biological questions and to guide laboratory experiments. Individual research groups are interested in specific biological problems and, consequently, network analyses are normally performed with regard to a specific question. However, the majority of existing data integration techniques are global and do not focus on specific areas of biology. Currently, this issue is addressed by using known annotation data (such as that from the Gene Ontology) to produce process-specific subnetworks. However, this approach discards useful information and is of limited use in poorly annotated areas of the interactome. Therefore, there is a need for network integration techniques that produce process-specific networks without loss of data. The work described here addresses this requirement by extending one of the most powerful integration techniques, probabilistic functional integrated networks (PFINs), to incorporate a concept of biological relevance. Initially, the available functional data for the baker’s yeast Saccharomyces cerevisiae were evaluated to identify areas of bias and specificity which could be exploited during network integration. This information was used to develop an integration technique which emphasises interactions relevant to specific biological questions, using yeast ageing as an exemplar. The integration method improves performance during network-based protein functional prediction in relation to this process. Further, the process-relevant networks complement classical network integration techniques and significantly improve network analysis in a wide range of biological processes. The method developed has been used to produce novel predictions for 505 Gene Ontology biological processes. Of these predictions, 41,610 are consistent with existing computational annotations, and 906 are consistent with known expert-curated annotations. The approach significantly reduces the hypothesis space for experimental validation of genes hypothesised to be involved in the oxidative stress response. Therefore, incorporation of biological relevance into network integration can significantly improve network analysis with regard to individual biological questions.
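As a rough illustration of the kind of evidence scoring that commonly underlies probabilistic functional integrated networks, the sketch below combines several interaction datasets into a single confidence-weighted network using log-likelihood scores against a gold standard, with an optional per-dataset weight standing in for the idea of biological relevance. It is a minimal sketch, not the thesis's actual pipeline; the gene identifiers, dataset names and the `relevance` parameter are illustrative assumptions.

```python
import math
from collections import defaultdict

def log_likelihood_score(dataset_pairs, gold_positive, gold_negative):
    """Score one evidence dataset against a gold standard of known true and
    false functional links (a minimal sketch; fuller PFIN pipelines typically
    use smoothed, confidence-binned scores)."""
    pos_hits = sum(1 for pair in dataset_pairs if pair in gold_positive)
    neg_hits = sum(1 for pair in dataset_pairs if pair in gold_negative)
    if pos_hits == 0 or neg_hits == 0:
        return 0.0  # too little overlap with the gold standard to score
    prior_odds = len(gold_positive) / len(gold_negative)
    data_odds = pos_hits / neg_hits
    return math.log(data_odds / prior_odds)

def integrate(datasets, gold_positive, gold_negative, relevance=None):
    """Naive-Bayes-style integration: a gene pair's confidence is the sum of
    the scores of the datasets reporting it, optionally multiplied by a
    per-dataset relevance weight (a hypothetical stand-in for process-specific
    relevance, not the thesis's actual weighting scheme)."""
    relevance = relevance or {}
    network = defaultdict(float)
    for name, pairs in datasets.items():
        score = log_likelihood_score(pairs, gold_positive, gold_negative)
        weight = relevance.get(name, 1.0)
        for pair in pairs:
            network[pair] += weight * score
    return dict(network)

# Toy data: two evidence sets scored against a small curated gold standard.
gold_pos = {("YFG1", "YFG2"), ("YFG2", "YFG3")}
gold_neg = {("YFG1", "YFG4"), ("YFG3", "YFG4"), ("YFG2", "YFG4")}
evidence = {
    "two_hybrid": {("YFG1", "YFG2"), ("YFG1", "YFG4")},
    "coexpression": {("YFG2", "YFG3"), ("YFG1", "YFG2"), ("YFG3", "YFG4")},
}
print(integrate(evidence, gold_pos, gold_neg, relevance={"coexpression": 1.5}))
```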
12

Developing tools and models for evaluating geospatial data integration of official and VGI data sources

Al-Bakri, Maythm M. Sharky January 2012 (has links)
In recent years, systems have been developed which enable users to produce, share and update information on the web effectively and freely as User Generated Content (UGC) data (including Volunteered Geographic Information (VGI)). Data quality assessment is a major concern for supporting the accurate and efficient spatial data integration required if VGI is to be used alongside official, formal, usually governmental datasets. This thesis aims to develop tools and models for the purpose of assessing such integration possibilities. Initially, in order to undertake this task, geometrical similarity of formal and informal data was examined. Geometrical analyses were performed by developing specific programme interfaces to assess the positional, linear and polygon shape similarity among reference field survey data (FS); official datasets such as data from the Ordnance Survey (OS), UK, and the General Directorate for Survey (GDS), Iraq; and VGI information such as OpenStreetMap (OSM) datasets. A discussion of the design and implementation of these tools and interfaces is presented. A methodology has been developed to assess such positional and shape similarity by applying different metrics and standard indices such as the National Standard for Spatial Data Accuracy (NSSDA) for positional quality; techniques such as buffering overlays for linear similarity; and application of moment invariants for polygon shape similarity evaluations. The results suggested that difficulties exist for any geometrical integration of OSM data with both benchmark FS and formal datasets, but that formal data is very close to reference datasets. An investigation was carried out into contributing factors such as data sources, feature types and number of data collectors that may affect the geometrical quality of OSM data and consequently affect the integration process of OSM datasets with FS, OS and GDS. Factorial designs were undertaken in this study in order to develop and implement an experiment to discover the effect of each factor individually and the interactions between them. The analysis found that data source is the most significant factor affecting the geometrical quality of OSM datasets, and that there are interactions among all these factors at different levels of interaction. This work also investigated the possibility of integrating the feature classifications of official datasets, such as data from the OS and GDS geospatial data agencies, and informal datasets, such as OSM information. In this context, two different models were developed. The first set of analyses evaluated the semantic integration of corresponding feature classifications of the compared datasets. The second model was concerned with assessing XML schema matching of the feature classifications of the tested datasets. This initially involved a tokenization process to split classifications composed of multiple words into single words. Subsequently, the feature classifications were encoded as XML schema trees. The semantic similarity, data type similarity and structural similarity were measured between the nodes of the compared schema trees. Once these three similarities had been computed, a weighted combination technique was adopted to obtain the overall similarity. The findings of both sets of analyses were not encouraging regarding the possibility of effectively integrating the feature classifications of VGI datasets, such as OSM, with those of formal datasets, such as OS and GDS.
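The NSSDA positional-quality index mentioned above has a simple closed form: horizontal accuracy at the 95% confidence level is 1.7308 times the radial RMSE of the test points against higher-accuracy reference points, assuming similar error magnitudes in x and y and, ideally, twenty or more well-distributed check points. The sketch below is a minimal illustration of that statistic, not the thesis's implementation; the coordinate values are invented.

```python
import math

def nssda_horizontal_accuracy(test_points, reference_points):
    """NSSDA horizontal accuracy at the 95% confidence level, using the
    standard 1.7308 * RMSE_r statistic (valid when the x and y error
    distributions are similar; NSSDA recommends >= 20 check points)."""
    assert len(test_points) == len(reference_points)
    squared_errors = [
        (tx - rx) ** 2 + (ty - ry) ** 2
        for (tx, ty), (rx, ry) in zip(test_points, reference_points)
    ]
    rmse_r = math.sqrt(sum(squared_errors) / len(squared_errors))
    return 1.7308 * rmse_r

# Invented coordinates: OSM-digitised points checked against field survey.
osm = [(100.0, 200.0), (150.5, 249.0), (300.2, 401.5)]
survey = [(101.0, 199.0), (150.0, 250.0), (300.0, 400.0)]
print(f"NSSDA horizontal accuracy: {nssda_horizontal_accuracy(osm, survey):.2f} m")
```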
13

Learning for text mining : tackling the cost of feature and knowledge engineering

Iria, José January 2013 (has links)
Over the last decade, the state-of-the-art in text mining has moved towards the adoption of machine learning as the main paradigm at the heart of approaches. Despite significant advances, machine learning based text mining solutions remain costly to design, develop and maintain for real world problems. An important component of such cost (feature engineering) concerns the effort required to understand which features or characteristics of the data can be successfully exploited in inducing a predictive model of the data. Another important component of the cost (knowledge engineering) has to do with the effort in creating labelled data, and in eliciting knowledge about the mining systems and the data itself. I present a series of approaches, methods and findings aimed at reducing the cost of creating and maintaining document classification and information extraction systems. They address the following questions: Which classes of features lead to an improved classification accuracy in the document classification and entity extraction tasks? How to reduce the amount of labelled examples needed to train machine learning based document classification and information extraction systems, so as to relieve domain experts from this costly task? How to effectively represent knowledge about these systems and the data that they manipulate, in order to make systems interoperable and results replicable? I provide the reader with the background information necessary to understand the above questions and the contributions to the state-of-the-art contained herein. The contributions include: the identification of novel classes of features for the document classification task which exploit the multimedia nature of documents and lead to improved classification accuracy; a novel approach to domain adaptation for text categorization which outperforms standard supervised and semi-supervised methods while requiring considerably less supervision; and a well-founded formalism for declaratively specifying text and multimedia mining systems.
14

Το έξυπνο διαδίκτυο / Web intelligence

Κατσής, Μάριος Γ. 06 September 2007 (has links)
This thesis examines Web Intelligence (WI), a new research field within computer science. It analyses WI's goals, the areas it covers, the principles on which it is based, and the benefits and challenges it will face. Chapter 2 presents Web Usage Mining as a part of WI and the various problems it raises, and also touches on personal data protection issues. Chapter 3 presents the common methods for detecting Web robots and compares them with a new method that promises better results. Chapter 4 deals with artificial intelligence and, in particular, neural networks. Although these chapters may appear to have little in common, their principles and results are used within the WI framework to build a system that provides better-quality information about the traffic on a site. Such a system was implemented; it aims to identify the sessions on a site that originate from Web robots and to find the most popular user paths through the site, and it is presented in Chapter 5. The resulting findings can be used to improve the site and the services it provides, as well as for security purposes.
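As a hedged illustration of the session-level cues commonly used to separate Web-robot traffic from human traffic (the kind of methods Chapter 3 surveys), the sketch below computes a few standard features from a session's requests and applies hand-set rules in place of a trained classifier such as the neural network of Chapter 4. The feature set, thresholds and request fields are illustrative assumptions, not the thesis's system.

```python
def session_features(requests):
    """A few session-level features often used for Web-robot detection;
    the exact feature set and the request fields are illustrative only."""
    total = len(requests)
    return {
        "fetched_robots_txt": any(r["path"].endswith("robots.txt") for r in requests),
        "image_fraction": sum(r["path"].endswith((".png", ".jpg", ".gif"))
                              for r in requests) / total,
        "head_fraction": sum(r["method"] == "HEAD" for r in requests) / total,
        "distinct_pages": len({r["path"] for r in requests}),
    }

def looks_like_robot(features):
    # Hand-set rules standing in for a trained classifier (e.g. a neural
    # network); the thresholds are assumptions, not measured values.
    return (features["fetched_robots_txt"]
            or features["head_fraction"] > 0.5
            or (features["image_fraction"] < 0.05 and features["distinct_pages"] > 50))

session = [
    {"method": "GET", "path": "/robots.txt"},
    {"method": "GET", "path": "/index.html"},
    {"method": "GET", "path": "/products.html"},
]
print(looks_like_robot(session_features(session)))  # True: the session fetched robots.txt
```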
15

Mining sequential patterns from probabilistic data

Muzammal, Muhammad January 2012 (has links)
Sequential Pattern Mining (SPM) is an important data mining problem. Although it is assumed in classical SPM that the data to be mined is deterministic, it is now recognized that data obtained from a wide variety of data sources is inherently noisy or uncertain, such as data from sensors or data being collected from the web from different (potentially conflicting) data sources. Probabilistic databases are a popular framework for modelling uncertainty. Recently, several data mining and ranking problems have been studied in probabilistic databases. To the best of our knowledge, this is the first systematic study of mining sequential patterns from probabilistic databases. In this work, we consider the kind of uncertainties that could arise in SPM. We propose four novel uncertainty models for SPM, namely tuple-level uncertainty, event-level uncertainty, source-level uncertainty and source-level uncertainty in deduplication, all of which fit into the probabilistic databases framework, and motivate them using potential real-life scenarios. We then define the interestingness predicate for two measures of interestingness, namely expected support and probabilistic frequentness. Next, we consider the computational complexity of evaluating the interestingness predicate, for various combinations of uncertainty models and interestingness measures, and show that different combinations have very different outcomes from a complexity theoretic viewpoint: whilst some cases are computationally tractable, we show other cases to be computationally intractable. We give a dynamic programming algorithm to compute the source support probability and hence the expected support of a sequence in a source-level uncertain database. We then propose optimizations to speed up the support computation task. Next, we propose probabilistic SPM algorithms based on the candidate generation and pattern growth frameworks for the source-level uncertainty model and the expected support measure. We implement these algorithms and give an empirical evaluation of the probabilistic SPM algorithms and show the scalability of these algorithms under different parameter settings using both real and synthetic datasets. Finally, we demonstrate the effectiveness of the probabilistic SPM framework at extracting meaningful patterns in the presence of noise.
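To make the expected-support measure concrete, the sketch below computes it for the simplest setting: event-level uncertainty with independent event probabilities and single-item pattern elements. A small dynamic program gives the probability that each probabilistic sequence contains the pattern as a subsequence, and expected support follows by linearity of expectation. This is an assumed simplification for illustration; the thesis's own algorithms target the source-level model and include further optimisations.

```python
def containment_probability(probabilistic_sequence, pattern):
    """Probability that a probabilistic sequence contains `pattern` as a
    subsequence, assuming each event (item, prob) occurs independently and
    pattern elements are single items. f[j] holds P(events seen so far
    contain pattern[:j])."""
    m = len(pattern)
    f = [1.0] + [0.0] * m            # the empty prefix is always contained
    for item, prob in probabilistic_sequence:
        # Sweep right-to-left so f[j - 1] still holds the previous row's value.
        for j in range(m, 0, -1):
            if item == pattern[j - 1]:
                f[j] = prob * f[j - 1] + (1.0 - prob) * f[j]
    return f[m]

def expected_support(database, pattern):
    """Expected support = sum over sequences of their containment
    probabilities (linearity of expectation)."""
    return sum(containment_probability(seq, pattern) for seq in database)

db = [
    [("a", 0.9), ("b", 0.8), ("c", 0.5)],
    [("a", 0.4), ("c", 1.0)],
]
print(expected_support(db, ["a", "c"]))   # 0.9 * 0.5 + 0.4 * 1.0 = 0.85
```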
16

Data mining in text streams using suffix trees

Snowsill, Tristan January 2012 (has links)
Data mining in text streams, or text stream mining, is an increasingly important topic for a number of reasons, including the recent explosion in the availability of textual data and an increasing need for people and organisations to process and understand as much of that information as possible, from single users to multinational corporations and governments. In this thesis we present a data structure based on a generalised suffix tree which is capable of solving a number of text stream mining tasks. It can be used to detect changes in the text stream, detect when chunks of text are reused and detect events through identifying when the frequencies of phrases change in a statistically significant way. Suffix trees have been used for many years in the areas of combinatorial pattern matching and computational genomics. In this thesis we demonstrate how the suffix tree can become more widely applicable by making it possible to use suffix trees to analyse streams of data rather than static data sets, opening up a number of future avenues for research. The algorithms which we present are designed to be efficient in an on-line setting by having time complexity independent of the total amount of text seen and polynomial in the rate at which text is seen. We demonstrate the effectiveness of our methods on a large text stream comprising thousands of documents every day. This text stream is the stream of text news coming from over 600 online news outlets and the results obtained are of interest to news consumers, journalists and social scientists.
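One way to flag a statistically significant change in a phrase's frequency, of the kind the event-detection task above relies on, is a two-proportion test between the current window of the stream and a historical window. The sketch below shows that test in isolation; it assumes the phrase counts are already available (in the thesis they come from the suffix-tree structure, which is not reproduced here) and the numbers are invented.

```python
import math

def phrase_burst_zscore(count_now, docs_now, count_past, docs_past):
    """Two-proportion z-test comparing how often a phrase appears in the
    current window of the stream against a historical window; a large
    positive z suggests a significant surge in usage."""
    p_now = count_now / docs_now
    p_past = count_past / docs_past
    pooled = (count_now + count_past) / (docs_now + docs_past)
    se = math.sqrt(pooled * (1 - pooled) * (1 / docs_now + 1 / docs_past))
    return 0.0 if se == 0 else (p_now - p_past) / se

# Invented counts: a phrase appears in 120 of 5,000 documents today
# versus 15 of 6,000 documents in the previous window.
z = phrase_burst_zscore(120, 5000, 15, 6000)
print(f"z = {z:.1f}")   # roughly 10 -- far beyond the ~1.96 cut-off for p < 0.05
```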
17

Maximum entropy modelling for quantifying unexpectedness of data mining results

Kontonasios, Kleanthis-Nikolaos January 2013 (has links)
This thesis is concerned with the problem of finding subjectively interesting patterns in data. The focus is restricted to the most prominent notion of subjective interestingness, namely the unexpectedness of a pattern. A pattern is considered unexpected if it contradicts the user's prior knowledge or beliefs about the data. Recently, a general information-theoretic framework for data mining that naturally incorporates unexpectedness was devised. The proposed approach relies on: 1. the Maximum Entropy principle for encoding the user's prior knowledge about the data or the patterns, 2. the InfRatio measure, an information-theoretic measure for evaluating the unexpectedness of a pattern, and 3. a set covering algorithm for finding the most interesting set of patterns. However, this framework is intentionally phrased in abstract terms and formally applied only to limited types of data mining tasks. This thesis is meant to fill this gap, as its main contribution is the formalization of this general framework for specific data mining tasks in order to demonstrate the wide applicability of the framework in practice. In particular, we instantiate the three main components of the framework in order to evaluate frequent itemsets, clusterings and patterns found in real-valued data such as biclusters and subgroups. Additionally, we provide the first literature review of interestingness measures based on unexpectedness and propose a novel classification of the methods into two classes, namely the "syntactical" and "probabilistic" approaches. We show that exploiting the framework for finding subjectively interesting sets of patterns in data is a highly efficient practice in theoretical, algorithmic and computational terms.
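As a minimal, assumed instance of this kind of framework (far simpler than the instantiations developed in the thesis), the sketch below builds the maximum entropy model of a 0/1 transaction database when the prior beliefs are just the individual item frequencies, which makes the items independent Bernoulli variables, and scores how unexpected an itemset's observed support is by its self-information under that model.

```python
import math

def maxent_item_model(db):
    """Maximum entropy model of a 0/1 transaction database when the only
    prior beliefs are the individual item frequencies: under that constraint
    the MaxEnt distribution treats items as independent Bernoulli variables."""
    n = len(db)
    items = {item for row in db for item in row}
    return n, {item: sum(item in row for row in db) / n for item in items}

def surprisal_of_support(db, itemset, observed_support):
    """Self-information, in bits, of seeing at least `observed_support` rows
    containing `itemset` under the MaxEnt (independence) model -- one simple
    way to quantify how unexpected a frequent itemset is."""
    n, freq = maxent_item_model(db)
    p_row = math.prod(freq[item] for item in itemset)   # P(a row contains the itemset)
    tail = sum(math.comb(n, k) * p_row ** k * (1 - p_row) ** (n - k)
               for k in range(observed_support, n + 1))
    return -math.log2(tail)

# Toy transaction database; {"a", "b"} occurs in 3 of the 5 rows.
db = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"c"}, {"a"}]
bits = surprisal_of_support(db, {"a", "b"}, observed_support=3)
print(f"{bits:.2f} bits of surprise")   # about 1.1: only mildly unexpected
```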
18

Data mining and visualization

García-Osorio, César January 2005 (has links)
No description available.
19

Exploiting markup structure for intelligent search

Kruschwitz, Udo January 2004 (has links)
No description available.
20

Data mining ensemble hierarchy, diversity and accuracy

Bian, Shun January 2006 (has links)
No description available.
