511

Data analysis in proteomics: novel computational strategies for modeling and interpreting complex mass spectrometry data

Sniatynski, Matthew John 11 1900 (has links)
Contemporary proteomics studies require computational approaches to deal both with the complexity of the data generated and with the volume of data produced. The amalgamation of mass spectrometry -- the analytical tool of choice in proteomics -- with the computational and statistical sciences is still recent, and several avenues of exploratory data analysis and statistical methodology remain relatively unexplored. The current study focuses on three broad analytical domains, and develops novel exploratory approaches and practical tools in each. Data transform approaches are explored first. These methods re-frame data, allowing for the visualization and exploitation of features and trends that are not immediately evident. An exploratory approach making use of the correlation transform is developed and used to identify mass-shift signals in mass spectra. This approach is applied to identify and map post-translational modifications on individual peptides, and to identify SILAC modification-containing spectra in a full-scale proteomic analysis. Secondly, matrix decomposition and projection approaches are explored; these use an eigen-decomposition to extract general trends from groups of related spectra. A data visualization approach is demonstrated using these techniques, capable of visualizing trends in large numbers of complex spectra, and a data compression and feature extraction technique is developed suitable for use in spectral modeling. Finally, a general machine learning approach is developed based on conditional random fields (CRFs). These models are capable of dealing with arbitrary sequence modeling tasks, similar to hidden Markov models (HMMs), but are far more robust to interdependent observational features, and do not require limiting independence assumptions to remain tractable. The theory behind this approach is developed, and a simple machine learning fragmentation model is built to test the hypothesis that reproducible sequence-specific intensity ratios are present within the distribution of fragment ions originating from a common peptide bond breakage. After training, the model shows very good performance associating peptide sequences and fragment ion intensity information, lending strong support to the hypothesis. / Medicine, Faculty of / Medicine, Department of / Experimental Medicine, Division of / Graduate
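The correlation-transform idea described in this abstract can be illustrated with a short sketch (not the thesis code): correlate a binned spectrum with shifted copies of itself, so that recurring mass differences, such as a fixed modification or SILAC label offset, show up as peaks in the correlation profile. The bin width, synthetic peaks, and shift range below are illustrative assumptions.

```python
# A minimal sketch (illustrative, not the thesis implementation) of using a
# correlation transform to surface recurring mass-shift signals in a binned
# mass spectrum. Bin width and peak positions are assumptions.
import numpy as np

def mass_shift_correlation(intensities, max_shift_bins):
    """Correlate the spectrum with shifted copies of itself.

    Peaks in the returned profile suggest mass differences that recur across
    the spectrum (e.g. a fixed modification or a SILAC label offset).
    """
    x = intensities - intensities.mean()
    profile = np.empty(max_shift_bins)
    for s in range(1, max_shift_bins + 1):
        a, b = x[:-s], x[s:]
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        profile[s - 1] = (a * b).sum() / denom if denom > 0 else 0.0
    return profile

# Toy spectrum: peak pairs separated by a constant 8 Da offset (0.1 Da bins).
bins = np.zeros(5000)
for mz in (1200, 2300, 3100):
    bins[mz] = 1.0
    bins[mz + 80] = 0.8          # 80 bins * 0.1 Da = 8 Da shift
profile = mass_shift_correlation(bins, max_shift_bins=200)
print("strongest shift (Da):", 0.1 * (np.argmax(profile) + 1))
```

In practice the same profile would be computed over real binned spectra, with candidate shifts restricted to chemically plausible mass differences.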
512

Computational approaches to predicting drug-induced toxicity

Marchese Robinson, Richard Liam January 2013 (has links)
Novel approaches and models for predicting drug-induced toxicity in silico are presented. Typically, these were based on Quantitative Structure-Activity Relationships (QSAR). The following endpoints were modelled: mutagenicity, carcinogenicity, inhibition of the hERG ion channel and the associated arrhythmia, Torsades de Pointes. A consensus model was developed based on Derek for Windows™ and Toxtree and used to filter compounds as part of a collaborative effort resulting in the identification of potential starting points for anti-tuberculosis drugs. Based on the careful selection of data from the literature, binary classifiers were generated for the identification of potent hERG inhibitors. These were found to perform competitively with, or better than, those computational approaches previously presented in the literature. Some of these models were generated using Winnow, in conjunction with a novel proposal for encoding molecular structures as required by this algorithm. The Winnow models were found to perform comparably to models generated using the Support Vector Machine and Random Forest algorithms. These studies also emphasised the variability in results which may be obtained when applying the same approaches to different train/test combinations. Novel approaches to combining chemical information with Ultrafast Shape Recognition (USR) descriptors are introduced: Atom Type USR (ATUSR) and a combination between a proposed Atom Type Fingerprint (ATFP) and USR (USR-ATFP). These were applied to the task of predicting protein-ligand interactions, including the prediction of hERG inhibition. Whilst, for some of the datasets considered, either ATUSR or USR-ATFP was found to perform marginally better than all other descriptor sets to which they were compared, most differences were statistically insignificant. Further work is warranted to determine the advantages which ATUSR and USR-ATFP might offer with respect to established descriptor sets. The first attempts to construct QSAR models for Torsades de Pointes using predicted cardiac ion channel inhibitory potencies as descriptors are presented, along with the first evaluation of experimentally determined inhibitory potencies as an alternative, or complement to, standard descriptors. No clear evidence was found that either the predicted or the experimentally determined 'IC-descriptors' improve performance. However, their value may lie in the greater interpretability they could confer upon the models. Building upon the work presented in the preceding chapters, this thesis ends with specific proposals for future research directions.
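As an illustration of the binary-classification setting described above, the sketch below trains a Random Forest to separate potent hERG inhibitors from inactive compounds. It is not the thesis pipeline: random bit vectors stand in for molecular fingerprints, and the synthetic labels, split, and model settings are assumptions.

```python
# A minimal QSAR-style sketch: binary "potent hERG inhibitor vs. not"
# classification from bit-vector descriptors. Surrogate fingerprints and the
# synthetic label rule are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_mols, n_bits = 400, 256
X = rng.integers(0, 2, size=(n_mols, n_bits))          # surrogate fingerprints
# Synthetic label: "potency" loosely tied to a handful of bits, plus noise.
y = ((X[:, :8].sum(axis=1) + rng.normal(0, 1, n_mols)) > 4).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("test ROC AUC:", round(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]), 3))
```

Re-running the split with different random seeds illustrates the train/test variability the abstract highlights.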
513

Bayesian methods for gravitational waves and neural networks

Graff, Philip B. January 2012 (has links)
Einstein’s general theory of relativity has withstood 100 years of testing and will soon be facing one of its toughest challenges. In a few years we expect to be entering the era of the first direct observations of gravitational waves. These are tiny perturbations of space-time that are generated by accelerating matter and affect the measured distances between two points. Observations of these waves using laser interferometers, the most sensitive length-measuring devices in the world, will allow us to test models of interactions in the strong field regime of gravity and eventually general relativity itself. I apply the tools of Bayesian inference for the examination of gravitational wave data from the LIGO and Virgo detectors. This is used for signal detection and estimation of the source parameters. I quantify the ability of a network of ground-based detectors to localise a source position on the sky for electromagnetic follow-up. Bayesian criteria are also applied to separating real signals from glitches in the detectors. These same tools and lessons can also be applied to the type of data expected from planned space-based detectors. Using simulations from the Mock LISA Data Challenges, I analyse our ability to detect and characterise both burst and continuous signals. The two seemingly different signal types will be overlapping and confused with one another for a space-based detector; my analysis shows that we will be able to separate and identify many signals present. Data sets and astrophysical models are continuously increasing in complexity. This will create an additional computational burden for performing Bayesian inference and other types of data analysis. I investigate the application of the MOPED algorithm for faster parameter estimation and data compression. I find that its shortcomings make it a less favourable candidate for further implementation. The framework of an artificial neural network is a simple model for the structure of a brain which can “learn” functional relationships between sets of inputs and outputs. I describe an algorithm developed for the training of feed-forward networks on pre-calculated data sets. The trained networks can then be used for fast prediction of outputs for new sets of inputs. After demonstrating capabilities on toy data sets, I apply the network to classifying handwritten digits from the MNIST database and to measuring ellipticities of galaxies in the Mapping Dark Matter challenge. The power of neural networks for learning and rapid prediction is also useful in Bayesian inference where the likelihood function is computationally expensive. The new BAMBI algorithm is detailed, in which our network training algorithm is combined with the nested sampling algorithm MULTINEST to provide rapid Bayesian inference. Using samples from the normal inference, a network is trained on the likelihood function and eventually used in its place. This is able to provide a significant increase in the speed of Bayesian inference while returning identical results. The trained networks can then be used for extremely rapid follow-up analyses with different priors, obtaining orders-of-magnitude speed increases. Learning how to apply the tools of Bayesian inference for the optimal recovery of gravitational wave signals will provide the most scientific information when the first detections are made.
Complementary to this, the improvement of our analysis algorithms to provide the best results in less time will make analysis of larger and more complicated models and data sets practical.
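The surrogate-likelihood idea sketched in this abstract, training a network on likelihood samples and then querying the cheap surrogate in place of the expensive function, can be illustrated roughly as below. The toy Gaussian "likelihood", the network size, and the training ranges are assumptions; this is not the BAMBI/MULTINEST implementation.

```python
# A minimal sketch of a neural-network likelihood surrogate: fit a small
# feed-forward network to samples of an expensive log-likelihood, then
# evaluate the cheap surrogate in its place. All numbers are illustrative.
import numpy as np
from sklearn.neural_network import MLPRegressor

def expensive_loglike(theta):
    # Stand-in for a costly gravitational-wave likelihood: a 2-D Gaussian bump.
    return -0.5 * np.sum((theta - np.array([1.0, -0.5])) ** 2 / 0.5 ** 2, axis=1)

rng = np.random.default_rng(1)
theta_train = rng.uniform(-2, 2, size=(2000, 2))       # samples from early sampling
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000, random_state=1)
net.fit(theta_train, expensive_loglike(theta_train))

theta_new = rng.uniform(-2, 2, size=(5, 2))
print("true    :", np.round(expensive_loglike(theta_new), 2))
print("network :", np.round(net.predict(theta_new), 2))
```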
514

A Location-Aware Social Media Monitoring System

Ji, Liu January 2014 (has links)
Social media users generate a large volume of data, which can contain meaningful and useful information. One such example is information about locations, which may be useful in applications such as marketing and security monitoring. There are two types of locations: location entities mentioned in the text of the messages and the physical locations of users. Extracting the first type of locations is not trivial because the location entities in the text are often ambiguous. In this thesis, we implement a sequential classification model with conditional random fields followed by a rule-based disambiguation model; we apply them to Twitter messages (tweets) and show that they handle the ambiguous location entities in our dataset reasonably well. Only very few users disclose their physical locations; in order to automatically detect their locations, many approaches have been proposed using various types of information, including the tweets posted by the users. It is not easy to infer the original locations from text data, because text tends to be noisy, particularly in social media. Recently, deep learning techniques have been shown to reduce the error rate of many machine learning tasks, due to their ability to learn meaningful representations of input data. We investigate the potential of building a deep-learning architecture to infer the location of Twitter users based merely on their tweets. We find that stacked denoising auto-encoders are well suited for this task, with results comparable to state-of-the-art models. Finally, we combine the two models above with a third-party sentiment analysis tool and obtain an intelligent social media monitoring system. We show a demo of the system and that it is able to predict and visualize the locations and sentiments contained in a stream of tweets related to mobile phone brands, a typical real-world e-business application.
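A rough sketch of the denoising auto-encoder step (illustrative only, not the thesis architecture): bag-of-words tweet vectors are corrupted with masking noise, a network learns to reconstruct the clean input, and its hidden activations are reused as features for a city classifier. The synthetic data, the single hidden layer, and the model settings are assumptions; the thesis stacks several such layers.

```python
# A minimal denoising-autoencoder-style sketch for user location, under the
# assumptions stated above. Toy "local vocabulary" data stands in for tweets.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_users, vocab, n_cities = 600, 100, 4
cities = rng.integers(0, n_cities, n_users)
# Each city over-uses a different slice of the vocabulary (toy "local words").
rates = 0.05 + 0.4 * (np.arange(vocab)[None, :] // (vocab // n_cities) == cities[:, None])
X = (rng.random((n_users, vocab)) < rates).astype(float)

noisy = X * (rng.random(X.shape) > 0.3)                 # masking noise
dae = MLPRegressor(hidden_layer_sizes=(32,), activation="relu",
                   max_iter=1500, random_state=2).fit(noisy, X)

# Hidden representation = first-layer activations of the trained reconstructor.
hidden = np.maximum(0.0, X @ dae.coefs_[0] + dae.intercepts_[0])
clf = LogisticRegression(max_iter=1000).fit(hidden[:500], cities[:500])
print("held-out city accuracy:", round(clf.score(hidden[500:], cities[500:]), 3))
```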
515

Learning the Sub-Conceptual Layer: A Framework for One-Class Classification

Sharma, Shiven January 2016 (has links)
In the realm of machine learning research and application, binary classification algorithms, i.e. algorithms that attempt to induce discriminant functions between two categories of data, reign supreme. Their fundamental property is the reliance on the availability of data from all known categories in order to induce functions that can offer acceptable levels of accuracy. Unfortunately, data from so-called "real-world" domains sometimes do not satisfy this property. In order to tackle this, researchers focus on methods such as sampling and cost-sensitive classification to make the data more conducive for binary classifiers. However, as this thesis shall argue, there are scenarios in which even such explicit methods to rectify distributions fail. In such cases, one-class classification algorithms become a practical alternative. Unfortunately, if the domain is inherently complex, the advantage that they offer over binary classifiers becomes diminished. The work in this thesis addresses this issue, and builds a framework that allows one-class algorithms to build efficient classifiers. In particular, this thesis introduces the notion of learning along the lines of sub-concepts in the domain; the complexity in domains arises due to the presence of sub-concepts, and by learning over them explicitly rather than over the entire domain as a whole, we can produce powerful one-class classification systems. The level of knowledge regarding these sub-concepts will naturally vary by domain, and thus we develop three distinct frameworks that take the amount of domain knowledge available into account. We demonstrate these frameworks over three real-world domains. The first domain we consider is that of biometric authentication via a user's swipe on a smartphone. We identify sub-concepts based on a user's motion, and given that modern smartphones employ sensors that can identify motion, sub-concepts can be identified explicitly during learning as well as application, and novel instances can be processed by the appropriate one-class classifier. The second domain is that of invasive isotope detection via gamma-ray spectra. The sub-concepts are based on environmental factors; however, the hardware employed cannot detect such concepts, and the precise source that creates these sub-concepts is difficult to ascertain. To remedy this, we introduce a novel framework in which we employ a sub-concept detector by means of a multi-class classifier, which pre-processes novel instances in order to send them to the correct one-class classifier. The third domain is that of compliance verification of the Comprehensive Test Ban Treaty (CTBT) through Xenon isotope measurements. This domain presents the worst case, where sub-concepts are not known. To this end, we employ a generic version of our framework in which we simply cluster the domain and build classifiers over each cluster. In all cases, we demonstrate that learning in the context of domain concepts greatly improves the performance of one-class classifiers.
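The most generic framework described above (sub-concepts unknown) can be sketched roughly as follows: cluster the single known class, fit a one-class model per cluster, and accept a new instance if any per-cluster model accepts it. The two-blob data, the choice of k-means with two clusters, and the One-Class SVM settings are illustrative assumptions, not the thesis configuration.

```python
# A minimal sketch of clustering the target class and fitting one one-class
# classifier per cluster; a new instance is accepted if any model accepts it.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
# Target class made of two sub-concepts (two well-separated blobs).
target = np.vstack([rng.normal([0, 0], 0.3, (200, 2)),
                    rng.normal([4, 4], 0.3, (200, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=3).fit_predict(target)
models = [OneClassSVM(gamma="scale", nu=0.05).fit(target[labels == k]) for k in range(2)]

def accepts(x):
    # Accept if any per-cluster model considers the point an inlier.
    return any(m.predict(x.reshape(1, -1))[0] == 1 for m in models)

print(accepts(np.array([0.1, -0.2])))   # inside one sub-concept -> True expected
print(accepts(np.array([2.0, 2.0])))    # between the sub-concepts -> likely False
```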
516

Exploring Mediatoil Imagery: A Content-Based Approach

Saroop, Sahil January 2016 (has links)
The future of Alberta’s bitumen sands, also known as “oil sands” or “tar sands,” and their place in Canada’s energy future has become a topic of much public debate. Within this debate, the print, television, and social media campaigns of those who both support and oppose developing the oil sands are particularly visible. As such, campaigns around the oil sands may be seen as influencing audience perceptions of the benefits and drawbacks of oil sands production. There is consequently a need to study the media materials of various tar sands stakeholders and explore how they differ. In this setting, it is essential to gather documents and identify content within images, which requires the use of an image retrieval technique such as a content-based image retrieval (CBIR) system. In a CBIR system, images are represented by low-level features (i.e. specific structures in the image such as points, edges, or objects), which are used to distinguish pictures from one another. The oil sands domain has to date not been mapped using CBIR systems. The research thus focuses on creating an image retrieval system, namely Mediatoil-IR, for exploring documents related to the oil sands. Our aim is to evaluate various low-level representations of the images within this context. To this end, our experimental framework employs LAB color histograms (LAB) and speeded up robust features (SURF) in order to typify the imagery. We further use machine learning techniques to improve the quality of retrieval (in terms of both accuracy and speed). To achieve this aim, the extracted features from each image are encoded in the form of vectors and used as a training set for learning classification models to organize pictures into different categories. Different algorithms were considered, such as Linear SVM, Quadratic SVM, Weighted KNN, Decision Trees, Bagging, and Boosting on trees. It was shown that a Quadratic SVM trained on SURF features is a good approach for building the CBIR system, and this combination is used in building Mediatoil-IR. Finally, with the help of the created CBIR system, we were able to retrieve similar documents and explore the different types of imagery used by different stakeholders. Our experimental evaluation shows that our Mediatoil-IR system is able to accurately explore the imagery used by different stakeholders.
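The classify-then-retrieve pattern described above can be sketched as follows (an illustration under assumptions, not the Mediatoil-IR code): a degree-2 polynomial ("quadratic") SVM assigns a query descriptor to a campaign category, and images in that category are then ranked by descriptor distance. Random vectors stand in for the SURF / LAB histogram features.

```python
# A minimal classify-then-retrieve sketch with surrogate image descriptors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
n_imgs, dim, n_classes = 300, 64, 3
centers = rng.normal(0, 1, (n_classes, dim))
labels = rng.integers(0, n_classes, n_imgs)
feats = centers[labels] + rng.normal(0, 0.4, (n_imgs, dim))   # surrogate descriptors

clf = SVC(kernel="poly", degree=2, C=1.0).fit(feats, labels)  # "Quadratic SVM"

def retrieve(query_feat, k=5):
    cat = clf.predict(query_feat.reshape(1, -1))[0]
    idx = np.flatnonzero(labels == cat)                        # images in that category
    dists = np.linalg.norm(feats[idx] - query_feat, axis=1)
    return cat, idx[np.argsort(dists)[:k]]

category, top_k = retrieve(feats[0])
print("predicted category:", category, "top matches:", top_k)
```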
517

Supervised Machine Learning on a Network Scale: Application to Seismic Event Detection and Classification

Reynen, Andrew January 2017 (has links)
A new method using a machine learning technique is applied to event classification and detection at seismic networks. This method is applicable to a variety of network sizes and settings. The algorithm makes use of a small catalogue of known observations across the entire network. Two attributes, the polarization and frequency content, are used as input to regression. These attributes are extracted at predicted arrival times for P and S waves using only an approximate velocity model, as attributes are calculated over large time spans. This method of waveform characterization is shown to be able to distinguish between blasts and earthquakes with 99 percent accuracy using a network of 13 stations located in Southern California. The combination of machine learning with generalized waveform features is further applied to event detection in Oklahoma, United States. The event detection algorithm makes use of a pair of unique seismic phases to locate events, with a precision directly related to the sampling rate of the generalized waveform features. Over a week of data from 30 stations in Oklahoma, United States is used to automatically detect 25 times more events than the catalogue of the local geological survey, with a false detection rate of less than 2 per cent. This method provides a high-confidence way of detecting and locating events. Furthermore, a large number of seismic events can be automatically detected with a low false-alarm rate, allowing for a larger automatic event catalogue with a high degree of trust.
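As a rough illustration of the classification step (not the thesis pipeline), the sketch below feeds two per-event waveform attributes, a frequency-content ratio and a polarization measure, to a logistic regression that separates blasts from earthquakes. The attribute distributions are invented for the example.

```python
# A minimal sketch of blast-vs-earthquake classification from two waveform
# attributes; the attribute statistics below are purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n_events = 500
is_blast = rng.integers(0, 2, n_events)
# Toy attributes: blasts assumed to show lower frequency content and weaker
# S-wave polarization than earthquakes (an assumption for illustration only).
freq_ratio = rng.normal(1.5 - 0.6 * is_blast, 0.3, n_events)
polarization = rng.normal(0.7 - 0.25 * is_blast, 0.1, n_events)
X = np.column_stack([freq_ratio, polarization])

scores = cross_val_score(LogisticRegression(), X, is_blast, cv=5)
print("cross-validated accuracy:", round(scores.mean(), 3))
```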
518

Learning-based procedural content generation

Roberts, Jonathan Ralph January 2014 (has links)
Procedural Content Generation (PCG) has become one of the hottest topics in Computational Intelligence and Artificial Intelligence (AI) game research in the past few years. PCG is the process of creating content for video games automatically, rather than by hand, and can offer great benefits for video games companies by helping to bring costs down and quality up. By guiding the process with AI, it can be enhanced further and even be made to personalize content for target players. Among the current research into PCG, search-based approaches overwhelmingly dominate. While search-based algorithms have shown great promise and produced several success stories, a number of open challenges remain. In this thesis, we present the Learning-Based Procedural Content Generation (LBPCG) framework, which is an alternative, novel approach designed to address some of these challenges. The major difference between the LBPCG framework and contemporary approaches is that the LBPCG is designed to learn about the problem space, freeing itself from the necessity for hard-coded information from the game developers. In this thesis we apply the LBPCG to a concrete example, the classic first-person shooter Quake, and present results showing the potential of the framework in generating quality content.
519

Characterising fitness effects of gene copy number variation in yeast

Norris, Matthew January 2014 (has links)
Diploid organisms, including yeast, most animals, and humans, typically carry two copies of each gene. Variation above or below two copies can, however, sometimes occur. When gene copy number reduction from two to one causes a disadvantage, that gene is considered haploinsufficient (HI). In the first part of my work, I identified associations between Saccharomyces cerevisiae gene properties and genome-scale HI phenotypes from earlier work. I compared HI profiles against 23 gene properties and found that genes with (i) greater numbers of protein interactions, (ii) greater numbers of genetic interactions, (iii) greater gene sequence conservation, and (iv) higher protein expression were significantly more likely to be HI. Additionally, HI showed negative relationships with (v) cell cycle regulation and (vi) promoter sequence conservation. I exploited the aforementioned associations using Linear Discriminant Analysis (LDA) to predict HI in existing data and guide experimental identification of 6 novel HI phenotypes, previously undetected in genome-scale screenings. I also found significant relationships between HI and two gene properties in Schizosaccharomyces pombe, relationships that hold despite the lack of conserved HI between S. cerevisiae and Sz. pombe orthologue gene pairs. These data suggest associations between HI and gene properties may be conserved in other organisms. The relationships and model presented here are a step towards understanding HI and its underlying mechanisms. Increases in copy number can occur through gene duplication. When duplication produces two functional gene copies, both experience relaxed selection and rapid mutation. This sometimes leads to interesting evolutionary events such as gain of novel function (neofunctionalisation). Previous work shows an ancient ancestor of S. cerevisiae underwent whole genome duplication (WGD) followed by massive redundant gene loss. Interestingly, some duplicate pairs show retention of both copies, including the pair TUB1 and TUB3. Existing sequence data shows that TUB3 has experienced a very high rate of evolution post-WGD, suggesting neofunctionalisation. To characterise TUB3, I have carried out experiments measuring fitness effects of varying TUB1, TUB2 and TUB3 copy number across many environments. In ethanol media, some TUB1 and TUB3 null mutants interestingly show severe defects. Other data suggest stress response, ethanol tolerance, protein degradation and/or regulatory roles, which may involve the regulatory Snf1p protein kinase complex.
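As a purely illustrative sketch of the LDA prediction step mentioned above (synthetic numbers, not the thesis data), two gene properties associated with HI, protein-interaction count and protein expression level, are used to predict haploinsufficiency:

```python
# A minimal sketch of predicting haploinsufficiency from gene properties with
# Linear Discriminant Analysis; the property distributions are assumptions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
n_genes = 1000
is_hi = rng.random(n_genes) < 0.2                        # ~20% haploinsufficient
ppi_count = rng.poisson(5 + 10 * is_hi)                  # more interactions if HI
expression = rng.normal(2 + 1.5 * is_hi, 1.0)            # higher expression if HI
X = np.column_stack([ppi_count, expression])

scores = cross_val_score(LinearDiscriminantAnalysis(), X, is_hi, cv=5)
print("cross-validated accuracy:", round(scores.mean(), 3))
```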
520

Detection of Long Term Vibration Deviations in Gas Turbine Monitoring Data

Hansson, Johan January 2020 (has links)
Condition-based monitoring is today essential for any machine manufacturer to be able to detect and predict faults in their machine fleet. This reduces the maintenance cost and also reduces machine downtime. In this master’s thesis, two approaches are evaluated to detect long-term vibration deviations, also called vibration anomalies, in Siemens gas turbines of type SGT-800. The first is a simple rule-based approach where a series of CUSUM tests are applied to several signals in order to check if a vibration anomaly has occurred. The second approach uses three common machine learning anomaly detection algorithms to detect these vibration anomalies. The machine learning algorithms evaluated are k-means clustering, Isolation Forest and One-class SVM. This master’s thesis concludes that these vibration anomalies can be detected with these ML models but also with the rule-based model, with different levels of success. A set of features was also obtained that was the most important for the detection of vibration anomalies. This thesis also presents which of these models is best suited for anomaly detection and would be the most appropriate for Siemens to implement.
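The rule-based approach mentioned above can be illustrated with a minimal one-sided CUSUM drift detector (illustrative thresholds and data, not Siemens' configuration): deviations of a vibration signal above a reference level are accumulated, and an alarm is raised when the cumulative sum exceeds a threshold.

```python
# A minimal one-sided CUSUM sketch for flagging a slow upward vibration drift.
# The target level, slack, threshold and synthetic signals are assumptions.
import numpy as np

def cusum_alarm(signal, target, slack=0.5, threshold=8.0):
    """Return the first index where the upper CUSUM statistic crosses threshold."""
    s = 0.0
    for i, x in enumerate(signal):
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return i
    return None

rng = np.random.default_rng(7)
baseline = rng.normal(10.0, 1.0, 300)                          # healthy vibration level
drift = rng.normal(10.0, 1.0, 300) + np.linspace(0, 3, 300)    # slow deviation
print("alarm on baseline:", cusum_alarm(baseline, target=10.0))
print("alarm on drifting:", cusum_alarm(np.concatenate([baseline, drift]), target=10.0))
```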
