Global ETD Search

561	Efficient Algorithms for Causal Linear Identification and Sequential Imitation Learning Daniel R Kumor (12476310) 28 April 2022 (has links) Finding cause and effect relationships is one of the quintessential questions throughout many of the empirical sciences, AI, and Machine Learning. This dissertation develops graphical conditions and efficient algorithms for two problems, linear identification and imitation learning. For the first problem, it is well-known that correlation does not imply causation, so linear regression doesn’t necessarily find causal relations even in the limit of a large sample size. Over the past century, a plethora of methods has been developed for identifying interventional distributions given a combination of assumptions about the underlying mechanisms (e.g., linear functional dependence, causal diagram) and observational data. We characterize the computational complexity of several existing graphical criteria and develop new polynomial-time algorithms that subsume existing disparate efficient approaches. The proposed methods constitute the current state of the art in terms of polynomial-time identification coverage. In other words, our methods are capable of identifying the maximal set of structural coefficients when compared to any other efficient algorithms found in the literature. The second problem studied in the dissertation is Causal Sequential Imitation Learning, which is concerned with an agent that aims to learn a policy by observing an expert acting in the environment, and mimicking this expert's observed behavior. Sometimes, the agent (imitator) does not have access to the same set of observations or sensors as the expert, which gives rise to challenges in correctly interpreting expert actions. 
We develop necessary and sufficient conditions for the imitator to obtain identical performance to the expert in sequential settings given the domain’s causal diagram, and create a polynomial-time algorithm for finding the covariates to include when generating an imitating policy. Imitation Learning Causality
562	Predicting events in metastable systems near criticality Huang, Shan 24 February 2022 (has links) Predicting events in metastable systems is an important but challenging problem. It can help society forecast, prevent, or prepare for upcoming catastrophes. However, many metastable systems in nature operate near a critical point and are empirically unpredictable. We developed machine learning predictors and applied them to the prediction of nucleation events in the metastable Ising model, near and far from the spinodal critical point. We observed decreasing predictability as the critical point is approached, and found that this unpredictability is due to the vanishing density difference between the nucleating droplet and the background. We also developed a tensor representation of Lennard-Jones configurations using the symmetry order parameters of the particles and used this representation to predict nucleation in a dense Lennard-Jones liquid. Finally, we investigated the noise-induced critical point in two variations of the OFC model - a coupled OFC model and an OFC model with multiplicative noise. In both variations, we found a critical phase boundary that separates the ergodic and non-ergodic phases, and a termination point of the phase boundary that is consistent with a higher-order phase transition. Physics Criticality Machine learning Metastable Nucleation
563	A Genetic Algorithm Model for Financial Asset Diversification Onek, Tristan 01 April 2019 (has links) Machine learning models can produce balanced financial portfolios through a variety of methods. Genetic algorithms are one such method that can optimally combine different funds that may occupy a portfolio. This study introduces a genetic algorithm model that finds optimal combinations of funds for a portfolio through a new approach to fitness formula calculation. Each fund in a given population has a base fitness score consisting of the sum of several technical analysis indicators. Each indicator chosen measures a different performance aspect of a fund, allowing for a balanced fitness score. Additionally, each fund has multiple category variables that determine diversity when combined into a portfolio. The base fitness score for each portfolio is the sum of its funds' individual fitness scores. Portfolio fitness scores adjust based on the included funds' category variable diversity. Portfolios that consist of funds with largely similar categories receive lower adjusted fitness scores and do not cross over. This process encourages strong and diversified portfolios to reproduce. This model creates diverse portfolios that outperform market benchmarks and demonstrates future potential as a diversification-aware investment strategy. computational finance genetic algorithm machine learning Computing
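The fitness scheme described in entry 563 can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything specific here is an assumption: each fund carries a single hypothetical `category` label (the thesis uses multiple category variables), and the diversity adjustment is taken to be the fraction of distinct categories in the portfolio.

```python
def base_fitness(portfolio):
    """Base portfolio fitness: the sum of its funds' individual fitness scores."""
    return sum(fund["fitness"] for fund in portfolio)


def diversity_factor(portfolio):
    """Assumed adjustment: share of distinct categories (1.0 means all funds differ)."""
    categories = [fund["category"] for fund in portfolio]
    return len(set(categories)) / len(categories)


def adjusted_fitness(portfolio):
    """Scale the base score down when the funds share similar categories."""
    return base_fitness(portfolio) * diversity_factor(portfolio)


# Hypothetical funds: the scores and category names are illustrative only.
diverse = [{"fitness": 2.0, "category": "bond"},
           {"fitness": 3.0, "category": "equity"}]
similar = [{"fitness": 2.0, "category": "bond"},
           {"fitness": 3.0, "category": "bond"}]

print(adjusted_fitness(diverse))  # higher: two distinct categories
print(adjusted_fitness(similar))  # lower: both funds share one category
```

Under this assumed penalty, two portfolios with the same base score diverge once category overlap is taken into account, which is the mechanism the abstract credits for favouring diversified portfolios during crossover.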
564	Evaluation of Representations for Atomistic Machine Learning Yu, Hao 25 November 2021 (has links) Machine learning algorithms for atomistic systems have the potential to circumvent expensive quantum mechanical calculations and enable computations for large systems which are conventionally infeasible. In this way, the complexity of solving the many-body Schrödinger equation is reduced by mapping to statistical models. The appropriate data representation is crucial in increasing the accuracy, efficiency and reliability of the model. In this thesis, we conduct an in-depth evaluation of handcrafted and neural network learned representations for molecules, inorganic crystals and adsorbate-surface systems. In addition to evaluating the atomistic machine learning models by the mean absolute error, we employ the energy within threshold metric. We see significant differences between representations from the evaluation of molecules. We propose ways to improve the performance of atomistic machine learning. Machine Learning Materials Informatics Computational Materials
565	ML-Miner: A Machine Learning Tool Used for Identification of Novel Biosynthetic Gene Clusters Wambo, Paul A. 04 April 2022 (has links) Identifying biosynthetic gene clusters from genomic data is challenging, with many in-silico tools suffering from a high rediscovery rate due to their dependence on rule-based algorithms. Next generation sequencing has provided an abundance of genomic information, and it has been hypothesized that there are many undiscovered biosynthetic gene clusters within this dataset. Here, we aim to develop a machine learning tool, ML-Miner, that infers patterns that describe a biosynthetic gene cluster in an unbiased manner and, as such, enables the identification of new biosynthetic gene clusters from genomic data. To solve this challenging problem, we define a simpler one: predicting the class of a known BGC. Specifically, ML-Miner receives as input the concatenation of sequences that are known or believed to be part of a biosynthetic gene cluster. Its task is to identify the class to which the cluster belongs, i.e., NRPS, PKS, terpene, or RiPPs. ML-Miner is a machine learning tool that uses Natural Language Processing, dimensionality reduction, and supervised learning to identify novel biosynthetic gene clusters. BioVec is a biological word embedding that we use to transform protein sequences from the highly curated MIBiG database of characterized biosynthetic gene clusters into their respective continuous distributed vector representations. Because the resulting protein vectors are of high dimensionality, a supervised Uniform Manifold Approximation and Projection (UMAP) algorithm was employed to transform the high dimensional vectors into a robust lower-dimensional representation, as evaluated by Silhouette analysis, Hopkins’ statistic, and trustworthiness analysis. 
The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm showed that the clusters identified from the low dimensional datasets mapped to biosynthetic gene cluster types, defined with high accuracy in the MIBiG database. A random forest classifier was then trained and evaluated using the low dimensional vectors. It was shown to classify each biosynthetic gene cluster from the MIBiG database with excellent performance metrics. Finally, the model's ability to generalize was evaluated using biosynthetic gene clusters from the antiSMASH dataset, an uncurated database containing uncharacterized biosynthetic gene clusters. The performance metrics were high, with a balanced accuracy of ~85%. After a hyperparameter search, the balanced accuracy rose to ~90%. This suggests that ML-Miner is a robust machine learning pipeline that can be used to identify novel biosynthetic gene clusters. Future development of a confidence score for classification and a workflow for processing bacterial genomes into gene clusters will significantly improve the utility of this tool. Machine Learning BGC Biosynthetic Gene Clusters
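Entry 565 reports its generalization results as balanced accuracy. As a reminder of what that metric measures, here is a minimal sketch that computes it as the unweighted mean of per-class recall. The labels below are hypothetical and are not data from the thesis; the function name is also an assumption chosen for illustration.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical labels: four PKS clusters (one misclassified) and two NRPS clusters.
y_true = ["PKS", "PKS", "PKS", "PKS", "NRPS", "NRPS"]
y_pred = ["PKS", "PKS", "PKS", "NRPS", "NRPS", "NRPS"]

print(balanced_accuracy(y_true, y_pred))  # recall 0.75 for PKS, 1.0 for NRPS
```

Averaging recall per class, instead of counting correct predictions overall, keeps a dominant class from masking poor performance on a rare one, which matters when cluster types are unevenly represented.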
566	Medical Diagnostics with Surface Enhanced Raman Scattering Hunter, Robert 13 May 2022 (has links) Raman spectroscopy is a powerful molecular fingerprinting method which measures the vibrational modes of molecules to identify and quantify chemical species. In biomedical spectroscopy, where samples are usually complex mixtures of many molecules, Raman spectra give a biochemical “portrait” that can be used to discriminate between distinct samples. One major technical challenge in implementing Raman spectrometer sensors is the technique’s low intrinsic signal to noise ratio. To amplify the Raman signal, a number of different approaches can be applied. In this thesis two techniques are used: surface enhanced Raman scattering (SERS) from metal nanoparticles, along with light-matter interaction enhancement from co-coupling light and sample to a liquid core waveguide. In order to process the complex spectral data arising from these sensors, a robust signal processing method is required. To this end, we have developed and validated a machine learning spectral analysis platform based on genetically optimized support vector machines (GA-SVM). This work is the subject of Chapter 3. We found that the GA-SVM significantly outperformed the standard statistical modelling approach, partial least squares, in regression tasks for several different biomedical Raman applications. Furthermore, we found that the use of more complex kernel functions in the SVM yielded superior results. The genetic optimization algorithm was necessary to use these more complex kernel functions because its computation time scales linearly with complexity, whereas the standard brute force approach scales exponentially. Chapter 4 concerns the development of a Raman sensor used to quantify and identify pathogenic bacteria. 
This device centres on a microfluidic flow cell which forces bacteria to flow through a hollow-core photonic crystal fiber (HC-PCF) to which the Raman excitation laser is also coupled. The bacteria are also mixed with silver nanoparticles to simultaneously achieve SERS and light-matter interaction enhancement in the sensor. Overall, the fiber and nanoparticles yield a bulk enhancement of 400x for the Raman spectrum. Bacteria are quantified in this system by counting the number of “spectral events” that occur as cells flow through the HC-PCF in a 15-minute window. This approach achieved very high linearity, as well as an average detection limit of 3.7 CFU/mL. In addition, bacteria are identified by using the same GA-SVM algorithm developed in the preceding chapter. These machine learning models achieved a discrimination accuracy of ~92% when comparing the spectra of the bacteria S. aureus, P. aeruginosa, and E. coli. In mixed samples of bacteria, the error of quantification increased significantly to 13.3 CFU/mL, but the output of the sensor was highly correlated with the ground-truth bacterial load. In Chapter 5 we outline the development of a diagnostic scheme for chemoresistance in ovarian cancer based on SERS measurements from cysteine-capped gold nanoparticles. Resistance to chemotherapy was determined based on three factors: the concentration of tumor derived exosomes, the chemical composition of the exosomes, and the concentration of exosome-derived cisplatin. Cisplatin is the drug of interest for this problem, as it is the most basic chemotherapy agent. The system works by first incubating the gold nanoparticles with tumor derived exosomes. The cisplatin therein causes the particles to destabilize slightly, resulting in the aggregation rate of the nanoparticles being proportional to the drug concentration. 
At steady state aggregation, the magnitude of the Raman spectrum is proportional to the exosome concentration, and the spectrum contains its chemical identity. Using in vitro cancer cell lines, we found that resistant cells tend to produce more exosomes and excrete a higher concentration of cisplatin within them. Overall, this sensor exhibited good diagnostic power for chemoresistance particularly in the most common subtype in ovarian cancer. Spectroscopy Medical diagnostics Machine Learning Nanoparticles
567	New Algorithms and Analysis Techniques for Reinforcement Learning Jia, Randy January 2020 (has links) With the advent of Big Data and Machine Learning, there is a demand for improved decision making in unknown, complex environments. Decision making under uncertainty is a common principle underlying many important decisions made by individuals, businesses, and society as a whole. These problems are typically modeled as multi-armed bandits (MAB), or, more generally, reinforcement learning (RL). In the MAB problem, an agent is faced with many options or arms, each with its own unknown reward distribution, and must determine the sequence of arms to pull, taking into account the history of rewards from past pulls. The agent must balance exploration (pulling less-explored arms to learn the model) with exploitation (pulling the arm that has maximized reward so far). RL is a generalized, more complex extension of the MAB problem in which the current state, in addition to the arm or action, impacts the obtained reward. In this thesis, we focus on designing new algorithms to better address RL problems. In particular, we design an algorithm inspired by Thompson sampling for finite communicating MDPs and an algorithm inspired by stochastic convex optimization for some fundamental problems in operations management, including a problem in inventory control. We develop intuitive algorithms and prove theoretical bounds on their regret; in doing so, we derive some theoretically interesting analytical results that may be of independent interest. Operations research Machine learning Decision making
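The exploration/exploitation trade-off described in entry 567 can be illustrated with classic Thompson sampling on a Bernoulli bandit. This is the textbook bandit version, not the thesis's algorithm for finite communicating MDPs. The arm probabilities, the function name, and the uniform Beta(1, 1) prior are all assumptions made for this sketch.

```python
import random


def thompson_sampling(true_probs, n_rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling; return how often each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible mean reward per arm from its Beta posterior
        # (uniform Beta(1, 1) prior, assumed here).
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(n_arms)]
        # Pull the arm whose sampled mean is largest.
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Simulate a Bernoulli reward from the (hidden) true probability.
        reward = 1 if rng.random() < true_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


# Hypothetical two-armed bandit: arm 1 pays off far more often than arm 0.
pulls = thompson_sampling([0.2, 0.8], n_rounds=2000)
print(pulls)
```

Arms with few observations have wide posteriors and are still sampled occasionally (exploration), while arms with strong evidence of high reward are chosen most of the time (exploitation).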
568	A Machine Learning Approach To Crime Investigation In The New York City Land Area Di Giovanni, Yani January 2020 (has links) This dissertation specifically discusses how machine learning, through some of its algorithms, is able to investigate the various kinds of crime committed in the New York City land area, with special focus on the root cause, allegedly paving the way for the violation of certain areas of the law. After covering some general background information concerning the history of this field while discussing a few examples taken from previous work, as well as the history of crime within the geographical area of interest, focus is first placed on finding ways to retrieve all the necessary numerical information dating back several years, since some of it might not be explicitly available. After fulfilling this task, the selected machine learning algorithms are implemented to gain insight into the relationship between the chosen variables. We then conclude with the direction in which future research should be heading. Machine learning Crime Engineering and Technology
569	Ontology Matching by Combining Instance-Based Concept Similarity Measures with Structure Todorov, Konstantin 12 April 2011 (has links) Ontologies describe the semantics of data and provide a uniform framework of understanding between different parties. The most common definition describes ontologies as knowledge bodies, which bring a formal representation of a shared conceptualization of a domain - the objects, concepts and other entities that are assumed to exist in a certain area of interest together with the relationships holding among them. However, in open and evolving systems with decentralized nature (as, for example, the Semantic Web), it is unlikely for different parties to adopt the same ontology. The problem of ontology matching arises from the need to align ontologies that cover the same or similar domains of knowledge. The task is to reduce ontology heterogeneity, which can occur in different forms, not in isolation from one another. Syntactically heterogeneous ontologies are expressed in different formal languages. Terminological heterogeneity stands for variations in names when referring to the same entities and concepts. Conceptual heterogeneity refers to differences in coverage, granularity or scope when modeling the same domain of interest. Finally, pragmatic heterogeneity is about mismatches in how entities are interpreted by people in a given context. The work presented in this thesis is a contribution to the problem of reducing the terminological and conceptual heterogeneity of hierarchical ontologies (defined as ontologies that contain a hierarchical body), populated with text documents. We make use of both intensional (structural) and extensional (instance-based) aspects of the input ontologies and combine them in order to establish correspondences between their elements. 
In addition, the proposed procedures yield assertions on the granularity and the extensional richness of one ontology compared to another, which is helpful in assisting a process of ontology merging. Although we put an emphasis on the application of instance-based techniques, we show that combining them with intensional approaches leads to more efficient (both conceptually and computationally) similarity judgments. The thesis is oriented towards both researchers and practitioners in the domain of ontology matching and knowledge sharing. The proposed solutions can be applied successfully to the problem of matching web-directories and facilitating the exchange of knowledge on the web-scale. Ontology Matching Machine Learning ddc:000
570	Finding Combination of Features from Promoter Regions for Ovarian Cancer-related Gene Group Classification Olayan, Rawan S. 12 1900 (has links) In classification problems, it is always important to use a suitable combination of features for the classifiers. Generating the right combination of features usually results in good classifiers. When the problem is not well understood, data items are usually described by many features in the hope that some of these may be the relevant or most relevant ones. In this study, we focus on one such problem related to genes implicated in ovarian cancer (OC). We try to recognize two important OC-related gene groups: oncogenes, which support the development and progression of OC, and oncosuppressors, which oppose such tendencies. For this, we use the properties of the promoters of these genes. We identified potential “regulatory features” that characterize the promoters of OC-related oncogenes and oncosuppressors. In our study, we used 211 oncogenes and 39 oncosuppressors. For these, we identified 538 characteristic sequence motifs from their promoters. Promoters were annotated with these motifs, and the derived feature vectors were used to develop classification models. We compared a number of classification models in their ability to distinguish oncogenes from oncosuppressors. Based on 10-fold cross-validation, the resultant model was able to separate the two classes with sensitivity of 96% and specificity of 100% with the complete set of features. Moreover, we developed another recognition model where we attempted to distinguish oncogenes and oncosuppressors as one group from other OC-related genes. That model achieved accuracy of 82%. We believe that the results of this study will help in discovering other OC-related oncogenes and oncosuppressors not identified as yet. machine learning oncogenes oncosuppressors ovarian cancer
