61

Management, visualisation & mining of quantitative proteomics data

Ahmad, Yasmeen January 2012 (has links)
Exponential data growth in the life sciences demands cross-discipline work that brings together computing and life sciences in a usable manner that can enhance knowledge and understanding in both fields. High-throughput approaches, advances in instrumentation and the overall complexity of mass spectrometry data have made it impossible for researchers to manually analyse data using existing market tools. By applying a user-centred approach to effectively capture the domain knowledge and experience of biologists, this thesis has bridged the gap between computation and biology through the PepTracker software (http://www.peptracker.com). This software provides a framework for the systematic detection and analysis of proteins that can be correlated with biological properties to expand the functional annotation of the genome. The tools created in this study aim to place analysis capabilities back in the hands of biologists, who are experts in evaluating their data. Another major advantage of the PepTracker suite is the implementation of a data warehouse, which manages and collates highly annotated experimental data from numerous experiments carried out by many researchers. This repository captures the collective experience of a laboratory, which can be accessed via user-friendly interfaces. Rather than viewing datasets as isolated components, this thesis explores the potential gained from collating datasets in a “super-experiment” approach, leading to the formation of broad-ranging questions and promoting biology-driven lines of questioning. This has been uniquely implemented by integrating tools and techniques from the field of Business Intelligence with the Life Sciences, and has been shown to aid in the analysis of proteomic interaction experiments. Having established a means of documenting a static proteomics snapshot of cells, the proteomics field is progressing towards understanding the extremely complex nature of cell dynamics. PepTracker facilitates this by providing the means to gather and analyse many protein properties to generate new biological insight, as demonstrated by the identification of novel protein isoforms.
62

Automating an Engine to Extract Educational Priorities for Workforce City Innovation

Hobbs, Madison 01 January 2019 (has links)
This thesis is grounded in my work done through the Harvey Mudd College Clinic Program as Project Manager of the PilotCity Clinic Team. PilotCity is a startup whose mission is to transform small to mid-sized cities into centers of innovation by introducing employer partnerships and work-based learning to high school classrooms. The team was tasked with developing software and algorithms to automate PilotCity's programming and to extract educational insights from unstructured data sources like websites, syllabi, resumes, and more. The team helped engineer a web application to expand and facilitate PilotCity's usership, designed a recommender system to automate the process of matching employers to high school classrooms, and packaged a topic modeling module to extract educational priorities from more complex data such as syllabi, course handbooks, or other educational text data. Finally, the team explored automatically generating supplementary course resources using insights from topic models. This thesis will detail the team's process from beginning to final deliverables including the methods, implementation, results, challenges, future directions, and impact of the project.
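The abstract stays at the level of deliverables; as one hedged illustration of the topic-modelling step it mentions, a minimal sketch using scikit-learn's LDA over syllabus text might look like the following (the corpus, vocabulary size, and topic count are assumptions, not PilotCity's actual pipeline):

```python
# Minimal sketch: surfacing candidate "educational priorities" from syllabus
# text with LDA. The corpus, feature limits, and topic count are assumptions,
# not the PilotCity implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical corpus: one string per syllabus or course handbook.
syllabi = [
    "Students will design experiments, analyse data, and present findings ...",
    "This course covers Python programming, version control, and teamwork ...",
    "Emphasis on technical writing, resumes, and workplace communication ...",
]

# Bag-of-words representation, dropping very common English words.
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(syllabi)

# Fit a small LDA model; the number of topics would be tuned on real data.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(X)

# Report the top words per topic as rough "priority" labels.
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {', '.join(top)}")
```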
63

Data Science at the Service of Operator Networks: Proposed Use Cases, Tools, and Means of Deployment

Samba, Alassane 29 October 2018 (has links)
The evolution of telecommunications has led to a proliferation of connected devices and a massification of multimedia services. Faced with this increased demand, operators need to adapt the operation of their networks in order to continue guaranteeing a certain level of quality of experience to their users. To do this, operator networks are tending towards more cognitive, or even autonomic, operation: giving networks the means to exploit all the information and data at their disposal, helping them to make the best decisions about their services and operation, and even to manage themselves. This amounts to introducing artificial intelligence into networks. It requires putting in place the means to exploit the data and to learn generalizable models from them, providing the information needed to optimize decisions. Together, these means constitute a scientific discipline called data science. This thesis is part of a broader effort to show the value of introducing data science into different network operating processes. It includes two algorithmic contributions corresponding to use cases of data science for operator networks, and two software contributions aiming to facilitate, on the one hand, the analysis and, on the other hand, the deployment of the algorithms produced through data science. The conclusive results of these studies have demonstrated the interest and feasibility of using data science for the operation of operator networks. These results have also been used by related projects.
64

Smart Classifiers and Bayesian Inference for Evaluating River Sensitivity to Natural and Human Disturbances: A Data Science Approach

Underwood, Kristen 01 January 2018 (has links)
Excessive rates of channel adjustment and riverine sediment export represent societal challenges; impacts include: degraded water quality and ecological integrity, erosion hazards to infrastructure, and compromised public safety. The nonlinear nature of sediment erosion and deposition within a watershed and the variable patterns in riverine sediment export over a defined timeframe of interest are governed by many interrelated factors, including geology, climate and hydrology, vegetation, and land use. Human disturbances to the landscape and river networks have further altered these patterns of water and sediment routing. An enhanced understanding of river sediment sources and dynamics is important for stakeholders, and will become more critical under a nonstationary climate, as sediment yields are expected to increase in regions of the world that will experience increased frequency, persistence, and intensity of storm events. Practical tools are needed to predict sediment erosion, transport and deposition and to characterize sediment sources within a reasonable measure of uncertainty. Water resource scientists and engineers use multidimensional data sets of varying types and quality to answer management-related questions, and the temporal and spatial resolution of these data are growing exponentially with the advent of automated samplers and in situ sensors (i.e., “big data”). Data-driven statistics and classifiers have great utility for representing system complexity and can often be more readily implemented in an adaptive management context than process-based models. Parametric statistics are often of limited efficacy when applied to data of varying quality, mixed types (continuous, ordinal, nominal), censored or sparse data, or when model residuals do not conform to Gaussian distributions. Data-driven machine-learning algorithms and Bayesian statistics have advantages over Frequentist approaches for data reduction and visualization; they allow for non-normal distribution of residuals and greater robustness to outliers. This research applied machine-learning classifiers and Bayesian statistical techniques to multidimensional data sets to characterize sediment source and flux at basin, catchment, and reach scales. These data-driven tools enabled better understanding of: (1) basin-scale spatial variability in concentration-discharge patterns of instream suspended sediment and nutrients; (2) catchment-scale sourcing of suspended sediments; and (3) reach-scale sediment process domains. The developed tools have broad management application and provide insights into landscape drivers of channel dynamics and riverine solute and sediment export.
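The abstract does not include model code; purely as an illustration of the kind of outlier-robust Bayesian analysis it refers to, a minimal sketch of a log–log concentration–discharge regression with a Student-t likelihood is given below (the synthetic data, priors, and PyMC usage are assumptions, not the models developed in the thesis):

```python
# Minimal sketch: Bayesian power-law concentration-discharge model
# (log C = a + b * log Q) with a Student-t likelihood for robustness to
# outliers and non-Gaussian residuals. Data and priors are assumptions.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
log_q = rng.normal(0.0, 1.0, size=200)                              # log discharge
log_c = 0.5 + 0.8 * log_q + 0.3 * rng.standard_t(df=4, size=200)    # log concentration

with pm.Model():
    a = pm.Normal("a", mu=0.0, sigma=2.0)      # intercept (log a in C = a * Q^b)
    b = pm.Normal("b", mu=0.0, sigma=2.0)      # C-Q slope: >0 mobilisation, <0 dilution
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    nu = pm.Exponential("nu", lam=1 / 30)      # Student-t degrees of freedom
    mu = a + b * log_q
    pm.StudentT("obs", nu=nu, mu=mu, sigma=sigma, observed=log_c)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

print(float(idata.posterior["b"].mean()))      # posterior mean of the C-Q slope
```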
65

Nonparametric Inference for High Dimensional Data

Mukhopadhyay, Subhadeep 03 October 2013 (has links)
Learning from data, especially ‘Big Data’, is becoming increasingly popular under names such as Data Mining, Data Science, Machine Learning, Statistical Learning and High Dimensional Data Analysis. In this dissertation we propose a new related field, which we call ‘United Nonparametric Data Science’ - applied statistics with “just in time” theory. It integrates the practice of traditional and novel statistical methods for nonparametric exploratory data modeling, and it is applicable to teaching introductory statistics courses that are closer to the modern frontiers of scientific research. Our framework includes small data analysis (combining traditional and modern nonparametric statistical inference) and big and high dimensional data analysis (by statistical modeling methods that extend our unified framework for small data analysis). The first part of the dissertation (Chapters 2 and 3) has been oriented by the goal of developing a new theoretical foundation to unify many cultures of statistical science and statistical learning methods using the mid-distribution function, custom-made orthonormal score functions, comparison density, copula density, LP moments and comoments. It is also examined how this elegant theory yields solutions to many important applied problems. In the second part (Chapter 4) we extend the traditional empirical likelihood (EL), a versatile tool for nonparametric inference, to the high dimensional context. We introduce a modified version of the EL method that is computationally simpler and applicable to a large class of “large p small n” problems, allowing p to grow faster than n. This is an important step in generalizing the EL in high dimensions beyond the p ≤ n threshold where the standard EL and its existing variants fail. We also present a detailed theoretical study of the proposed method.
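As background only (the thesis proposes a modified version, which the abstract does not specify), the standard empirical likelihood ratio for a scalar mean μ, due to Owen, is

```latex
R(\mu) \;=\; \max\Big\{ \prod_{i=1}^{n} n w_i \;:\; w_i \ge 0,\ \sum_{i=1}^{n} w_i = 1,\ \sum_{i=1}^{n} w_i (X_i - \mu) = 0 \Big\},
```

with the Wilks-type result that −2 log R(μ₀) converges in distribution to χ²₁ at the true mean μ₀. This construction degrades as the dimension p approaches or exceeds n, which is the regime the modified EL of Chapter 4 is designed to handle.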
66

Evolutionary conservation and diversification of complex synaptic function in human proteome

Pajak, Maciej January 2018 (has links)
The evolution of synapses from early proto-synaptic protein complexes in unicellular eukaryotes to sophisticated machines comprising thousands of proteins parallels the emergence of finely tuned synaptic plasticity, a molecular correlate for memory and learning. Phenotypic change in organisms is ultimately the result of evolution of their genotype at the molecular level. Selection pressure is a measure of how changes in genome sequence that arise through naturally occurring processes in populations are fixed or eliminated in subsequent generations. Inferring phylogenetic information about proteins, such as the variation of selection pressure across coding sequences, can provide valuable information not only about the origin of proteins, but also about the contribution of specific sites within proteins to their current roles within an organism. Recent evolutionary studies of synaptic proteins have generated attractive hypotheses about the emergence of finely-tuned regulatory mechanisms in the post-synaptic proteome related to learning; however, these analyses are relatively superficial. In this thesis, I establish a scalable molecular phylogenetic modelling framework based on three new inference methodologies to investigate temporal and spatial aspects of selection pressure changes for the whole human proteome using protein orthologs from up to 68 taxa. Temporal modelling of evolutionary selection pressure reveals informative features and patterns for the entire human proteome and identifies groups of proteins that share distinct diversification timelines. Multi-ontology enrichment analysis of these gene cohorts was used to aid biological interpretation, but these approaches are statistically underpowered and do not capture a clear picture of the emergence of synaptic plasticity. Subsequent pathway-centric analysis of key synaptic pathways extends the interpretation of the temporal data and allows for revision of previous hypotheses about the evolution of complex synaptic function. I proceed to integrate the inferred selection pressure timeline information in the context of static protein-protein interaction data. A network analysis of the full human proteome reveals systematic patterns linking the temporal profile of proteins’ evolution and their topological role in the interaction graph. These graphs were used to test a mechanistic hypothesis that proposed a propagating diversification signal between interactors, using the temporal modelling data and network analysis tools. Finally, I analyse the data of amino-acid-level spatial modelling of selection pressure events in Arc, one of the master regulators of synaptic plasticity, and its interactors, for which detailed experimental data are available. I use the Arc interactome as an example to discuss episodic and localised diversifying selection pressure events in tightly coupled complexes of proteins, and showcase the potential for a similar systematic analysis of larger complexes of proteins using a pathway-centric approach. Through my work I revised our understanding of the temporal evolutionary patterns that shaped contemporary synaptic function, through profiling of the emergence and refinement of proteins in multiple pathways of the nervous system. I also uncovered systematic effects linking dependencies between proteins with their active diversification, and hypothesised about their extension to domain-level selection pressure events.
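The abstract does not define its selection-pressure measure; as general background (an assumption about the measure, not a statement from the thesis), selection pressure on coding sequences is commonly quantified by the ratio of non-synonymous to synonymous substitution rates,

```latex
\omega \;=\; \frac{d_N}{d_S}, \qquad
\omega < 1 \ \text{(purifying selection)}, \quad
\omega \approx 1 \ \text{(neutral evolution)}, \quad
\omega > 1 \ \text{(positive / diversifying selection)},
```

estimated per branch or per site of a coding-sequence alignment in temporal and spatial analyses respectively.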
67

Crossing the Chasm: Deploying Machine Learning Analytics in Dynamic Real-World Scenarios

January 2016 (has links)
abstract: The dawn of the Internet of Things (IoT) has opened the opportunity for mainstream adoption of machine learning analytics. However, most research in machine learning has focused on the discovery of new algorithms or fine-tuning the performance of existing algorithms. Little exists on the process of taking an algorithm from the lab environment into the real world, culminating in sustained value. Real-world applications are typically characterized by dynamic non-stationary systems with requirements around feasibility, stability and maintainability. Not much has been done to establish standards around the unique analytics demands of real-world scenarios. This research explores the problem of why so few published algorithms enter production and, furthermore, why fewer still end up generating sustained value. The dissertation proposes a ‘Design for Deployment’ (DFD) framework for building machine learning analytics so that they can be deployed to generate sustained value. The framework emphasizes and elaborates the often neglected but immensely important latter steps of an analytics process: ‘Evaluation’ and ‘Deployment’. A representative evaluation framework is proposed that incorporates the temporal shifts and dynamism of real-world scenarios. Additionally, the recommended infrastructure allows analytics projects to pivot rapidly when a particular venture does not materialize. Deployment needs and apprehensions of the industry are identified and gaps addressed through a 4-step process for sustainable deployment. Lastly, the need for analytics as a functional area (like finance and IT) is identified to maximize the return on machine-learning deployment. The framework and process are demonstrated in semiconductor manufacturing – a highly complex process involving hundreds of optical, electrical, chemical, mechanical, thermal, electrochemical and software processes, which makes it a highly dynamic non-stationary system. Due to the 24/7 uptime requirements in manufacturing, high reliability and fail-safe operation are a must. Moreover, the ever-growing data volumes mean that the system must be highly scalable. Lastly, due to the high cost of change, a sustained value proposition is a must for any proposed changes. Hence the context is ideal for exploring the issues involved. Enterprise use cases are used to demonstrate the robustness of the framework in addressing challenges encountered in the end-to-end process of productizing machine learning analytics in dynamic real-world scenarios. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2016
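The dissertation describes the framework at the process level; as one hedged illustration of an evaluation that incorporates temporal shifts, a time-ordered cross-validation sketch is shown below (the data, model, and split counts are assumptions, not the DFD tooling):

```python
# Minimal sketch: evaluating a classifier with time-ordered splits so that the
# training data always precedes the test data, approximating the drift a model
# sees in production. Data, model, and split counts are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                     # hypothetical time-ordered sensor features
drift = np.linspace(0.0, 1.0, 1000)                # slow non-stationary shift in the target
y = (X[:, 0] + 0.5 * drift > 0.5).astype(int)

scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])          # fit only on the past
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Per-fold accuracy over successive time windows; a downward trend is the kind
# of non-stationarity a deployment-oriented evaluation is meant to surface.
print([round(s, 3) for s in scores])
```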
68

An Exploration of Statistical Modelling Methods on Simulation Data Case Study: Biomechanical Predator–Prey Simulations

January 2018 (has links)
abstract: Modern, advanced statistical tools from data mining and machine learning have become commonplace in molecular biology in large part because of the “big data” demands of various kinds of “-omics” (e.g., genomics, transcriptomics, metabolomics, etc.). However, in other fields of biology where empirical data sets are conventionally smaller, more traditional statistical methods of inference are still very effective and widely used. Nevertheless, with the decrease in cost of high-performance computing, these fields are starting to employ simulation models to generate insights into questions that have been elusive in the laboratory and field. Although these computational models allow for exquisite control over large numbers of parameters, they also generate data at a qualitatively different scale than most experts in these fields are accustomed to. Thus, more sophisticated methods from big-data statistics have an opportunity to better facilitate the often-forgotten area of bioinformatics that might be called “in-silicomics”. As a case study, this thesis develops methods for the analysis of large amounts of data generated from a simulated ecosystem designed to understand how mammalian biomechanics interact with environmental complexity to modulate the outcomes of predator–prey interactions. These simulations investigate which biomechanical parameters relating to the agility of animals in predator–prey pairs best predict pursuit outcomes. Traditional modelling techniques such as forward, backward, and stepwise variable selection are initially used to study these data, but the number of parameters and potentially relevant interaction effects render these methods impractical. Consequently, new modelling techniques such as LASSO regularization are used and compared to the traditional techniques in terms of accuracy and computational complexity. Finally, the splitting rules and instances in the leaves of classification trees provide the basis for future simulation with an economical number of additional runs. In general, this thesis shows the increased utility of these sophisticated statistical techniques with simulated ecological data compared to the approaches traditionally used in these fields. These techniques, combined with methods from industrial Design of Experiments, will help ecologists extract novel insights from simulations that combine habitat complexity, population structure, and biomechanics. / Dissertation/Thesis / Masters Thesis Industrial Engineering 2018
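As a hedged illustration of the comparison described above (the synthetic data and scikit-learn calls are assumptions, standing in for the simulated predator–prey runs), LASSO regularisation can be contrasted with stepwise-style subset selection as follows:

```python
# Minimal sketch: LASSO regularisation selecting among many candidate
# predictors, including pairwise interaction terms, as an alternative to
# forward/backward/stepwise selection. Synthetic data stands in for the
# simulated predator-prey runs; feature names are assumptions.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
n, p = 500, 12
X = rng.normal(size=(n, p))        # e.g. masses, turning rates, habitat density, ...
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=n)

# Expand to all pairwise interactions, then standardise so a single penalty
# applies comparably across terms.
expand = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = StandardScaler().fit_transform(expand.fit_transform(X))

# Cross-validated LASSO chooses the penalty and zeroes out irrelevant terms.
lasso = LassoCV(cv=5, random_state=0).fit(X_int, y)
names = expand.get_feature_names_out([f"x{i}" for i in range(p)])
selected = [(nm, round(c, 2)) for nm, c in zip(names, lasso.coef_) if abs(c) > 1e-3]
print(selected)   # expected to recover x0 and the x1*x2 interaction
```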
69

Functional characterization of C/D snoRNA-derived microRNAs

Lemus Diaz, Gustavo Nicolas 08 December 2017 (has links)
No description available.
70

Estimation of Expected Lowest Fare in Flight Meta Search

Kristensson, Lars January 2014 (has links)
This thesis explores the possibility of estimating the outcome of a flight ticket fare comparison search, also called flight meta search, before it has been performed, as being able to do this could be highly useful in improving the flight meta search technology used today. The algorithm explored in this thesis is a distance-weighted k-nearest neighbour, where the distance metric is a linear equation over sixteen first-degree features extracted from the input of the search. It is found that while the approach may have potential, the distance metric used in this thesis is not sufficient to capture the similarities needed, and the end algorithm performs only slightly better than random. At the end of the thesis a series of possible further improvements is presented that could potentially raise the performance of the algorithm to a more useful level.
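The abstract describes the method at a high level; a minimal sketch of a distance-weighted k-nearest-neighbour estimator over sixteen search features is given below (the feature weights, data, and the Manhattan-distance emulation of a linear metric are assumptions, not the thesis's learned metric):

```python
# Minimal sketch: distance-weighted k-nearest-neighbour estimation of the
# expected lowest fare for a flight search. The linear distance metric
# sum_j w_j * |x_j - x'_j| is emulated by rescaling each feature by its weight
# and using a Manhattan (p=1) distance; weights and data are assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n_searches, n_features = 2000, 16
X = rng.normal(size=(n_searches, n_features))    # e.g. trip length, days to departure, ...
y = 100 + 40 * X[:, 0] - 25 * X[:, 1] + rng.normal(scale=10, size=n_searches)

w = np.ones(n_features)          # placeholder feature weights; the thesis fits these
Xw = X * w

knn = KNeighborsRegressor(n_neighbors=10, weights="distance", p=1)
knn.fit(Xw[:1500], y[:1500])                     # historical searches
pred = knn.predict(Xw[1500:])                    # expected lowest fare for new searches
print(float(np.mean(np.abs(pred - y[1500:]))))   # mean absolute error on held-out searches
```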
