1 |
The Unreasonable Usefulness of Approximation by Linear Combination
Lewis, Cannada Andrew 05 July 2018
By exploiting data sparsity, a catch-all term for savings gained from a variety of approximations, it is possible to reduce the computational cost of accurate electronic structure calculations to linear; that is, the total time to solution grows at the same rate as the number of particles that are correlated. Multiple techniques for exploiting data sparsity are discussed, with a focus on those that can be systematically improved by tightening numerical parameters, such that the approximation becomes exact as the parameter approaches zero. These techniques are first applied to Hartree-Fock theory; we then attempt to design a linear-scaling, massively parallel electron correlation strategy based on second-order perturbation theory. / Ph. D. / The field of quantum chemistry depends heavily on a vast hierarchy of approximations, all carefully balanced to allow fast calculation of electronic energies and properties to an accuracy suitable for quantitative predictions. Formally, the cost of computing these energies should increase exponentially with the number of particles in the system, but approximations based on the local behavior, or nearness, of the particles reduce this scaling to low-order polynomials while maintaining acceptable accuracy. In this work, we introduce several new approximations that discard information in a specific fashion. They take advantage of the fact that the interactions between particles decay in magnitude with the distance between them (although sometimes very slowly), and they exploit the smoothness of those interactions by factorizing their numerical representation into a linear combination of simpler items. These factorizations, while technical in nature, have benefits that are hard to obtain by merely ignoring interactions between distant particles.
Through the development of new factorizations and a careful neglect of interactions between distant particles, we hope to compute properties of molecules in such a way that accuracy is maintained while the cost of the calculations grows only at the same rate as the number of particles. Very recently, circa 2015, it began to seem that this goal may soon become a reality, potentially revolutionizing the ability of quantum chemistry to make quantitative predictions for properties of large molecules.
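The idea of factorizing a smooth, distance-decaying interaction into a linear combination of simpler items can be illustrated with a toy one-dimensional sketch. This is not the method of the thesis: the 1/(x - y) kernel, the geometric-series expansion, and all names below (`direct_potential`, `moments`, `far_field_potential`, `rank`) are assumptions chosen for illustration. The expansion 1/(x - y) = sum over k of y^k / x^(k+1) holds for |y| < |x|, and the truncation length `rank` plays the role of a numerical parameter that can be tightened to improve the approximation systematically.

```python
def direct_potential(x, sources, charges):
    """Exact potential at x: one kernel evaluation per source."""
    return sum(q / (x - y) for y, q in zip(sources, charges))


def moments(sources, charges, rank):
    """Source-side factors: m_k = sum_j q_j * y_j**k, computed once for all targets."""
    return [sum(q * y ** k for y, q in zip(sources, charges)) for k in range(rank)]


def far_field_potential(x, source_moments):
    """Target-side evaluation: sum_k m_k / x**(k+1), valid when |y_j| < |x|."""
    return sum(m / x ** (k + 1) for k, m in enumerate(source_moments))


# Hypothetical well-separated geometry: sources in [-1, 1], target at x = 4.
sources = [i / 10 for i in range(-10, 11)]
charges = [1.0] * len(sources)
target = 4.0

exact = direct_potential(target, sources, charges)
errors = {
    rank: abs(exact - far_field_potential(target, moments(sources, charges, rank)))
    for rank in (2, 4, 8)
}
```

With N sources and M well-separated targets, the moments cost on the order of rank × N to build and rank × M to evaluate, instead of N × M kernel evaluations, and the error shrinks geometrically as `rank` grows.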
|
2 |
Assessing the reliability, resilience and sustainability of water resources systems in data-rich and data-sparse regions
Headley, Miguel Learie January 2018
Uncertainty about the potential impact of climate change on supply availability, the varied success of demand-side interventions such as water efficiency, and changing priorities in hydrometric data collection and ownership have created challenges for water resources system management, particularly in data-sparse regions. Consequently, the aim of this thesis is to assess the reliability, resilience and sustainability of water resources systems in both data-rich and data-sparse regions, with an emphasis on robust decision-making in data-sparse regions. To achieve this aim, new resilience indicators that capture water resources system failure duration and extent of failure (i.e. failure magnitude) from a social and environmental perspective were developed. These performance indicators enabled a comprehensive assessment of a number of performance-enhancing interventions, which resulted in the identification of a set of intervention strategies that showed potential to improve reliability, resilience and sustainability in the case studies examined. Finally, a multi-criteria decision analysis supported trade-off decision making when the reliability, resilience and sustainability indicators were considered in combination. Two case studies were considered in this research: Kingston and St. Andrew in Jamaica and Anyplace in the UK. The Kingston and St. Andrew case study represents the main data-sparse case study, where many assumptions were introduced to fill data gaps. The intervention strategy identified from the Kingston and St. Andrew water resources assessment as having great potential to improve reliability, resilience and sustainability was the ‘Site A-east’ desalination scheme. To ameliorate the uncertainty and lack of confidence associated with the results, a methodology was developed that transformed a key proportion of the Anyplace water resources system from a data-rich environment to a data-sparse environment.
The Anyplace water resources system was then assessed in a data-sparse environment and the performance trade-offs of the intervention strategies were analysed using four multi-criteria decision analysis (MCDA) weighting combinations. The MCDA facilitated a robust comparison of the interventions’ performances in the data-rich and data-sparse case studies. Comparisons showed consistency in the performances of the interventions across data-rich and data-sparse hydrological conditions and serve to demonstrate to decision makers a novel approach to addressing uncertainty when many assumptions have been introduced in the water resources management process due to data sparsity.
|
3 |
Towards Personalized Recommendation Systems: Domain-Driven Machine Learning Techniques and Frameworks
Alabdulrahman, Rabaa 16 September 2020
Recommendation systems have been widely utilized in e-commerce settings to aid users through their shopping experiences. The principal advantage of these systems is their ability to narrow down the purchase options in addition to marketing items to customers. However, a number of challenges remain, notably those related to obtaining a clearer understanding of users, their profiles, and their preferences in terms of purchased items. Specifically, recommender systems based on collaborative filtering recommend items that have been rated by other users with preferences similar to those of the targeted users. Intuitively, the more information and ratings collected about the user, the more accurate are the recommendations such systems suggest.
In a typical recommender system database, the data are sparse. Sparsity occurs when the number of ratings obtained from the users is much lower than the number required to build a prediction model. This usually occurs because of the users’ reluctance to share their reviews, either due to privacy issues or an unwillingness to make the extra effort. Grey-sheep users pose another challenge. These are users who shared their reviews and ratings yet disagree with the majority in the system. The current state of the art typically treats these users as outliers and removes them from the system. Our goal is to determine whether keeping these users in the system may benefit learning. Third, the cold-start problem, another area of active research, refers to the scenario in which a new item or user enters the system. In this case, the system has no information about the new user or item, making it problematic to find a correlation with others in the system. This thesis addresses the three above-mentioned research challenges through the development of machine learning methods for use within the recommendation system setting.
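As a rough illustration of the sparsity and cold-start notions described above, the sketch below measures the fraction of empty cells in a small user-item rating table and lists users with no ratings at all. The data, the function names (`sparsity`, `cold_start_users`), and the choice of "fraction of unrated cells" as the sparsity measure are all assumptions made for this example; the thesis may quantify sparsity differently.

```python
def sparsity(ratings, users, items):
    """Fraction of user-item cells that carry no rating."""
    total_cells = len(users) * len(items)
    return 1.0 - len(ratings) / total_cells


def cold_start_users(ratings, users):
    """Users for whom the system holds no ratings at all."""
    rated = {user for user, _item in ratings}
    return [user for user in users if user not in rated]


# Hypothetical toy data: keys are (user, item) pairs, values are ratings.
users = ["u1", "u2", "u3"]
items = ["i1", "i2", "i3", "i4"]
ratings = {("u1", "i1"): 5, ("u1", "i3"): 3, ("u2", "i2"): 4}

matrix_sparsity = sparsity(ratings, users, items)  # 3 of 12 cells are filled
new_users = cold_start_users(ratings, users)       # u3 has rated nothing
```

Real rating datasets are typically far sparser than this toy table, which is why a collaborative filter often has too few overlapping ratings to find similar users.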
First, we focus on label and data sparsity through the development of the Hybrid Cluster analysis and Classification learning (HCC-Learn) framework, which combines supervised and unsupervised learning methods. We show that combining classification algorithms, such as k-nearest neighbors and ensembles based on feature subspaces, with cluster analysis algorithms, such as expectation maximization, hierarchical clustering, canopy, k-means, and cascade k-means, generally produces high-quality results when applied to benchmark datasets. That is, cluster analysis clearly benefits the learning process, leading to high predictive accuracies for existing users.
Second, to address the cold-start problem, we present the Popular Users Personalized Predictions (PUPP-DA) framework. This framework combines cluster analysis and active learning, or so-called user-in-the-loop, to assign new customers to the most appropriate groups in our framework. Based on our findings from the HCC-Learn framework, we employ the expectation maximization soft clustering technique to create our user segmentations in the PUPP-DA framework, and we further incorporate Convolutional Neural Networks into our design. Our results show the benefits of user segmentation based on soft clustering and the use of active learning to improve predictions for new users. Furthermore, our findings show that focusing on frequent or popular users clearly improves classification accuracy. In addition, we demonstrate that deep learning outperforms machine learning techniques, notably resulting in more accurate predictions for individual users.
Third, we address the grey-sheep problem in our Grey-sheep One-class Recommendations (GSOR) framework. The existence of grey-sheep users in the system results in a class imbalance whereby the majority of users belong to one class and a small portion (the grey-sheep users) fall into the minority class. In this framework, we use one-class classification to provide a class structure for the training examples. As a pre-assessment stage, we assess the characteristics of grey-sheep users and study their impact on model accuracy. Next, as mentioned above, we utilize one-class learning, whereby we focus on the majority class to first learn the decision boundary in order to generate prediction lists for the grey-sheep users (the minority class). Our results indicate that including grey-sheep users in the training step, as opposed to treating them as outliers and removing them prior to learning, has a positive impact on the general predictive accuracy.
|
4 |
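Three essays on econometrics
Yi, Kun 23 March 2023
Kyoto University / New-system course doctorate / Doctor of Economics / Degree No. Kō 24375 / Keihaku No. 662 / 新制||経||302 (Main Library) / Graduate School of Economics, Kyoto University, Division of Economics / (Chief examiner) Professor 西山 慶彦, Professor 江上 雅彦, Lecturer 柳 貴英 / Conferred under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Economics / Kyoto University / DFAM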
|
5 |
Classification in high dimensional feature spaces / by H.O. van Dyk
Van Dyk, Hendrik Oostewald January 2009
In this dissertation we developed theoretical models to analyse Gaussian and multinomial distributions. The analysis is focused on classification in high dimensional feature spaces and provides a basis for dealing with issues such as data sparsity and feature selection (for Gaussian and multinomial distributions, two frequently used models for high dimensional applications). A Naïve Bayesian philosophy is followed to deal with issues associated with the curse of dimensionality. The core treatment on Gaussian and multinomial models consists of finding analytical expressions for classification error performances. Exact analytical expressions were found for calculating error rates of binary class systems with Gaussian features of arbitrary dimensionality and using any type of quadratic decision boundary (except for degenerate paraboloidal boundaries).
Similarly, computationally inexpensive (and approximate) analytical error rate expressions were derived for classifiers with multinomial models. Additional issues with regard to the curse of dimensionality that are specific to multinomial models (feature sparsity) were dealt with and tested on a text-based language identification problem for all eleven official languages of South Africa. / Thesis (M.Ing. (Computer Engineering))--North-West University, Potchefstroom Campus, 2009.
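For a sense of what an exact analytical error rate looks like, the sketch below covers only the simplest textbook special case, which is an assumption made here and not the general quadratic-boundary result of the dissertation: two equally likely classes with Gaussian features sharing the same isotropic covariance sigma² I, separated by the optimal linear boundary. In that case the error rate is Φ(-d / (2σ)), where d is the Euclidean distance between the class means and Φ is the standard normal CDF. The function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def bayes_error_shared_isotropic(mean_a, mean_b, sigma):
    """Error rate for two equiprobable Gaussian classes with covariance sigma**2 * I."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(mean_a, mean_b)))
    return normal_cdf(-distance / (2.0 * sigma))


# Two dimensions: means two standard deviations apart.
error_2d = bayes_error_shared_isotropic([0.0, 0.0], [2.0, 0.0], sigma=1.0)

# One hundred dimensions: a small per-feature shift of 0.2 gives the same distance.
error_100d = bayes_error_shared_isotropic([0.0] * 100, [0.2] * 100, sigma=1.0)
```

Both cases have the same distance between means (2σ), so both evaluate to Φ(-1), roughly 0.159. The expression depends on the dimensionality only through that distance, which is what makes a closed form practical in high-dimensional feature spaces.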
|
7 |
Mitigation of Data Scarcity Issues for Semantic Classification in a Virtual Patient Dialogue Agent
Stiff, Adam January 2020
No description available.
|
8 |
Enhancing Simulated Sonar Images With CycleGAN for Deep Learning in Autonomous Underwater Vehicles / Deep learning, machine learning, sonar, simulation, GAN, cycleGAN, YOLO-v4, sparse data, uncertainty analysis
Norén, Aron January 2021
This thesis addresses the issue of data sparsity in the sonar domain. A data pipeline is set up to generate and enhance sonar data. The possibilities and limitations of using cycleGAN as a tool to enhance simulated sonar images for the purpose of training neural networks for detection and classification are studied. A neural network is trained on the enhanced simulated sonar images and tested on real sonar images to evaluate the quality of these images. The novelty of this work lies in extending previous methods to a more general framework and showing that GAN-enhanced simulations work for complex tasks on field data. Using real sonar images to enhance the simulated images resulted in improved classification compared to a classifier trained solely on simulated images. / This report investigates the problem of sparse data for deep learning in the sonar domain. A data pipeline for generating simulated sonar data and raising its quality is set up in order to create a large dataset for training a neural network. The possibilities and limitations of using cycleGAN to raise the quality of simulated sonar data are studied and discussed. A neural network for detecting and classifying objects in sonar images is trained in order to evaluate the enhanced simulated sonar data. This report builds on earlier methods by generalizing them and showing that the method has potential even for complex tasks based on non-trivial data. Training a detection and classification network on simulated sonar images whose quality was raised with cycleGAN markedly improved classification results compared with training on simulated images alone.
|