Global ETD Search

1	New LSH-based Algorithm for Approximate Nearest Neighbor Andoni, Alexandr, Indyk, Piotr 04 November 2005 (has links) We present an algorithm for c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time ofO(dn^{1/c^2+o(1)}) and space O(dn + n^{1+1/c^2+o(1)}). AI locality sensitive hashing nearest neighbor high dimensions
2	The perils of particle swarm optimization in high dimensional problem spaces Oldewage, Elre Talea January 2017 (has links) Particle swarm optimisation (PSO) is a stochastic, population-based optimisation algorithm. PSO has been applied successfully to a variety of domains. This thesis examines the behaviour of PSO when applied to high dimensional optimisation problems. Empirical experiments are used to illustrate the problems exhibited by the swarm, namely that the particles are prone to leaving the search space and never returning. This thesis does not intend to develop a new version of PSO speci cally for high dimensional problems. Instead, the thesis investigates why PSO fails in high dimensional search spaces. Four di erent types of approaches are examined. The rst is the application of velocity clamping to prevent the initial velocity explosion and to keep particles inside the search space. The second approach selects values for the acceleration coe cients and inertia weights so that particle movement is restrained or so that the swarm follows particular patterns of movement. The third introduces coupling between problem variables, thereby reducing the swarm's movement freedom and forcing the swarm to focus more on certain subspaces within the search space. The nal approach examines the importance of initialisation strategies in controlling the swarm's exploration to exploitation ratio. The thesis shows that the problems exhibited by PSO in high dimensions, particularly unwanted particle roaming, can not be fully mitigated by any of the techniques examined. The thesis provides deeper insight into the reasons for PSO's poor performance by means of extensive empirical tests and theoretical reasoning. / Dissertation (MSc)--University of Pretoria, 2017. / Computer Science / MSc / Unrestricted UCTD Particle swarm optimization (PSO) High dimensions Large scale optimisation
3	Discriminant Analysis and Support Vector Regression in High Dimensions: Sharp Performance Analysis and Optimal Designs Sifaou, Houssem 04 1900 (has links) Machine learning is emerging as a powerful tool to data science and is being applied in almost all subjects. In many applications, the number of features is com- parable to the number of samples, and both grow large. This setting is usually named the high-dimensional regime. In this regime, new challenges arise when it comes to the application of machine learning. In this work, we conduct a high-dimensional performance analysis of some popular classification and regression techniques. In a first part, discriminant analysis classifiers are considered. A major challenge towards the use of these classifiers in practice is that they depend on the inverse of covariance matrices that need to be estimated from training data. Several estimators for the inverse of the covariance matrices can be used. The most common ones are estimators based on the regularization approach. In this thesis, we propose new estimators that are shown to yield better performance. The main principle of our proposed approach is the design of an optimized inverse covariance matrix estimator based on the assumption that the covariance matrix is a low-rank perturbation of a scaled identity matrix. We show that not only the proposed classifiers are easier to implement but also, outperform the classical regularization-based discriminant analysis classifiers. In a second part, we carry out a high-dimensional statistical analysis of linear support vector regression. Under some plausible assumptions on the statistical dis- tribution of the data, we characterize the feasibility condition for the hard support vector regression and, when feasible, derive an asymptotic approximation for its risk. Similarly, we study the test risk for the soft support vector regression as a function of its parameters. The analysis is then extended to the case of kernel support vector regression under generalized linear models assumption. Based on our analysis, we illustrate that adding more samples may be harmful to the test performance of these regression algorithms, while it is always beneficial when the parameters are optimally selected. Our results pave the way to understand the effect of the underlying hyper- parameters and provide insights on how to optimally choose the kernel function. High dimensions Discriminant analysis Support vector regression Asymptotic analysis
4	Um filtro iterativo utilizando árvores de decisão / An Iterative Decision Tree Threshold Filter Picchi Netto, Oscar 24 September 2013 (has links) Usar algoritmos de Aprendizado de Máquina é um dos modos ecientes de extrair as informações de grandes bases biológicas. Sabendo-se que a quantidade de dados que são coletados cresce a cada dia, o uso de alguma técnica de seleção de atributos eficiente é, em alguns casos, essencial não só para otimizar o tempo do algoritmo de Aprendizado da Máquina a ser aplicado posteriormente como também para reduzir os dados, de forma que possa ser possível testá-los, por exemplo, em uma bancada de laboratório em algumas situações específicas. O objetivo deste estudo é propor uma abordagem utilizando árvores de decisão em um filtro iterativo, visando auxiliar na extração de informação de grande bases biológicas. Pois, com uma base de menor dimensionalidade, um especialista humano pode entender melhor ou ainda utilizar um algoritmo de Aprendizado de Máquina de forma mais eficaz. O filtro proposto pode utilizar qualquer classificador com um seletor de atributos embutido e qualquer métrica pode ser utilizada para determinar se o atributo deve ser escolhido. Foi fixado, neste estudo, o algoritmo utilizado como J48 e a área embaixo da curva ROC (AUC) como métrica. Em experimentos utilizando diversas bases de dados biomédicas, o filtro proposto foi analisado e sua capacidade de compressão e desempenho foram avaliados em cinco diferentes paradigmas de aprendizado de máquina, utilizando dois limiares diferentes para a métrica escolhida. O melhor limiar obteve uma capacidade de compressão de cerca de 50% dos dados em geral e 99.4% em bases de baixa densidade, geralmente grandes bases. Os valores AUC obtidos pelo filtro quando comparados com cinco algoritmos de paradigmas de aprendizado diferentes mostraram um desempenho melhor em quatro das cinco situações avaliadas. O filtro proposto foi depois analisado e comparado com outros seletores de atributos da literatura e o indutor sozinho. Quanto ao tempo gasto pelo filtro em relação aos outros ele se apresentou no mesmo patamar de 3 dos 4 seletores testados. Quando comparado em relação ao AUC o filtro proposto se mostrou robusto nos cinco indutores analisados, não apresentando nenhuma diferença significativa em nenhum dos cenários testados. Em relação aos indutores, o filtro apresentou um desempenho melhor, mesmo que não significante, em 4 dos 5 indutores. / Using Machine Learning algorithms is an eficient way to extract information from large biological databases. But, in some cases, the amount of data is huge that using an eficient featured subset selection is, in some cases, essencial not only to optimize the learning time but also to reduce the amount of data, allowing, for example, a test in a laboratory workbench. The objective of this study is to propose an approach using decision trees in a iterative filter. The filter helps information extraction from large biological databases, since in a database with few dimensions a human specialist can understand it better or can use Machine Learning algorithms in a more efective way. The proposed lter can use any classier with embed featured subset selection and can use any performance metric to determine which attribute must be chosen. In this study, we have fixed the algorithm used within the filter as J48 and AUC was used as metric for performance evaluation. In experiments using biomedical databases, the proposed filter was analyzed and its compression capacity and performance were tested. In five diferent Machine Learning paradigms, using two diferent thresholds for the chosen metric. The best threshold was capable of reducing around 50% of the data using all databases and 99.4% on the small density bases, usually high dimensional databases. AUC values for the filter when compared with the five algorithm got a better performance in four of five tested situations. The proposed filter then was tested against others featured subset selectors from the literature, and against the inducer alone. Analyzing time the proposed lter is in the same level as 3 of 4 of the tested selectors. When tested for AUC the proposed selector shows itself robust in the five inducers tested, not showing any signicant diference in all tested scenarios. Against the inducers alone our filter showed a better performance, even not signicant, in 4 of the 5 inducers. Alta Dimensionalidade Aprendizado de Máquina Attribute Selection High Dimensions Machine Learning Seleção de Atributos
5	Um filtro iterativo utilizando árvores de decisão / An Iterative Decision Tree Threshold Filter Oscar Picchi Netto 24 September 2013 (has links) Usar algoritmos de Aprendizado de Máquina é um dos modos ecientes de extrair as informações de grandes bases biológicas. Sabendo-se que a quantidade de dados que são coletados cresce a cada dia, o uso de alguma técnica de seleção de atributos eficiente é, em alguns casos, essencial não só para otimizar o tempo do algoritmo de Aprendizado da Máquina a ser aplicado posteriormente como também para reduzir os dados, de forma que possa ser possível testá-los, por exemplo, em uma bancada de laboratório em algumas situações específicas. O objetivo deste estudo é propor uma abordagem utilizando árvores de decisão em um filtro iterativo, visando auxiliar na extração de informação de grande bases biológicas. Pois, com uma base de menor dimensionalidade, um especialista humano pode entender melhor ou ainda utilizar um algoritmo de Aprendizado de Máquina de forma mais eficaz. O filtro proposto pode utilizar qualquer classificador com um seletor de atributos embutido e qualquer métrica pode ser utilizada para determinar se o atributo deve ser escolhido. Foi fixado, neste estudo, o algoritmo utilizado como J48 e a área embaixo da curva ROC (AUC) como métrica. Em experimentos utilizando diversas bases de dados biomédicas, o filtro proposto foi analisado e sua capacidade de compressão e desempenho foram avaliados em cinco diferentes paradigmas de aprendizado de máquina, utilizando dois limiares diferentes para a métrica escolhida. O melhor limiar obteve uma capacidade de compressão de cerca de 50% dos dados em geral e 99.4% em bases de baixa densidade, geralmente grandes bases. Os valores AUC obtidos pelo filtro quando comparados com cinco algoritmos de paradigmas de aprendizado diferentes mostraram um desempenho melhor em quatro das cinco situações avaliadas. O filtro proposto foi depois analisado e comparado com outros seletores de atributos da literatura e o indutor sozinho. Quanto ao tempo gasto pelo filtro em relação aos outros ele se apresentou no mesmo patamar de 3 dos 4 seletores testados. Quando comparado em relação ao AUC o filtro proposto se mostrou robusto nos cinco indutores analisados, não apresentando nenhuma diferença significativa em nenhum dos cenários testados. Em relação aos indutores, o filtro apresentou um desempenho melhor, mesmo que não significante, em 4 dos 5 indutores. / Using Machine Learning algorithms is an eficient way to extract information from large biological databases. But, in some cases, the amount of data is huge that using an eficient featured subset selection is, in some cases, essencial not only to optimize the learning time but also to reduce the amount of data, allowing, for example, a test in a laboratory workbench. The objective of this study is to propose an approach using decision trees in a iterative filter. The filter helps information extraction from large biological databases, since in a database with few dimensions a human specialist can understand it better or can use Machine Learning algorithms in a more efective way. The proposed lter can use any classier with embed featured subset selection and can use any performance metric to determine which attribute must be chosen. In this study, we have fixed the algorithm used within the filter as J48 and AUC was used as metric for performance evaluation. In experiments using biomedical databases, the proposed filter was analyzed and its compression capacity and performance were tested. In five diferent Machine Learning paradigms, using two diferent thresholds for the chosen metric. The best threshold was capable of reducing around 50% of the data using all databases and 99.4% on the small density bases, usually high dimensional databases. AUC values for the filter when compared with the five algorithm got a better performance in four of five tested situations. The proposed filter then was tested against others featured subset selectors from the literature, and against the inducer alone. Analyzing time the proposed lter is in the same level as 3 of 4 of the tested selectors. When tested for AUC the proposed selector shows itself robust in the five inducers tested, not showing any signicant diference in all tested scenarios. Against the inducers alone our filter showed a better performance, even not signicant, in 4 of the 5 inducers. Alta Dimensionalidade Aprendizado de Máquina Seleção de Atributos Attribute Selection High Dimensions Machine Learning
6	Clustering High-dimensional Noisy Categorical and Mixed Data Zhiyi Tian (10925280) 27 July 2021 (has links) Clustering is an unsupervised learning technique widely used to group data into homogeneous clusters. For many real-world data containing categorical values, existing algorithms are often computationally costly in high dimensions, do not work well on noisy data with missing values, and rarely provide theoretical guarantees on clustering accuracy. In this thesis, we propose a general categorical data encoding method and a computationally efficient spectral based algorithm to cluster high-dimensional noisy categorical (nominal or ordinal) data. Under a statistical model for data on m attributes from n subjects in r clusters with missing probability epsilon, we show that our algorithm exactly recovers the true clusters with high probability when mn(1-epsilon) >= CMr<sup>2</sup> log<sup>3</sup>M, with M=max(n,m) and a fixed constant C. Moreover, we show that mn(1- epsilon)<sup>2</sup> >= r *delta/2 with 0< delta <1 is necessary for any algorithm to succeed with probability at least (1+delta)/2. In case, where m=n and r is fixed, for example, the sufficient condition matches with the necessary condition up to a polylog(n) factor, showing that our proposed algorithm is nearly optimal. We also show our algorithm outperforms several existing algorithms in both clustering accuracy and computational efficiency, both theoretically and numerically. In addition, we propose a spectral algorithm with standardization to cluster mixed data. This algorithm is computationally efficient and its clustering accuracy has been evaluated numerically on both real world data and synthetic data. Statistics Clustering Categorical values Noisy and sparse data High dimensions Spectral algorithm Clustering accuracy
7	Pricing and Hedging of Financial Instruments using Forward–Backward Stochastic Differential Equations : Call Spread Options with Different Interest Rates for Borrowing and Lending Berta, Abigail Hailu January 2022 (has links) In this project, we are aiming to solve option pricing and hedging problems numerically via Backward Stochastic Differential Equations (BSDEs). We use Markovian BSDEs to formulate nonlinear pricing and hedging problems of both European and American option types. This method of formulation is crucial for pricing financial instruments since it enables consideration of market imperfections and computations in high dimensions. We conduct numerical experiments of the pricing and hedging problems, where there is a higher interest rate for borrowing than lending, using the least squares Monte Carlo and deep neural network methods. Moreover, based on the experiment results, we point out which method to chooseover the other depending on the the problem at hand. Markovian BSDEs Least square Monte Carlo method Deep BSDE method Pricing and hedging in high dimensions. Computational Mathematics Beräkningsmatematik
8	Développement de nouveaux plans d'expériences uniformes adaptés à la simulation numérique en grande dimension Santiago, Jenny 04 February 2013 (has links) Cette thèse propose une méthodologie pour des études en simulation numérique en grande dimension. Elle se décline en différentes étapes : construction de plan d'expériences approprié, analyse de sensibilité et modélisation par surface de réponse. Les plans d'expériences adaptés à la simulation numérique sont les "Space Filling Designs", qui visent à répartir uniformément les points dans l'espace des variables d'entrée. Nous proposons l'algorithme WSP pour construire ces plans, rapidement, avec de bons critères d'uniformité, même en grande dimension. Ces travaux proposent la construction d'un plan polyvalent, qui sera utilisé pour les différentes étapes de l'étude : de l'analyse de sensibilité aux surfaces de réponse. L'analyse de sensibilité sera réalisée avec une approche innovante sur les points de ce plan, pour détecter le sous-ensemble de variables d'entrée réellement influentes. Basée sur le principe de la méthode de Morris, cette approche permet de hiérarchiser les variables d'entrée selon leurs effets. Le plan initial est ensuite "replié" dans le sous-espace des variables d'entrée les plus influentes, ce qui nécessite au préalable une étude pour vérifier l'uniformité de la répartition des points dans l'espace réduit et ainsi détecter d'éventuels amas et/ou lacunes. Ainsi, après réparation, ce plan est utilisé pour l'étape ultime : étude de surfaces de réponse. Nous avons alors choisi d'utiliser l'approche des Support Vector Regression, indépendante de la dimension et rapide dans sa mise en place. Obtenant des résultats comparables à l'approche classique (Krigeage), cette technique semble prometteuse pour étudier des phénomènes complexes en grande dimension. / This thesis proposes a methodology of study in numeric simulation for high dimensions. There are several steps in this methodology : setting up an experimental design, performing sensitivity analysis, then using response surface for modelling. In numeric simulation, we use a Space Filling Design that scatters the points in the entire domain. The construction of an experimental design in high dimensions must be efficient, with good uniformity properties. Moreover, this construction must be fast. We propose using the WSP algorithm to construct such an experimental design. This design is then used in all steps of the methodology, making it a versatile design, from sensitivity analysis to modelling. A sensitivity analysis allows identifying the influent factors. Adapting the Morris method principle, this approach classifies the inputs into three groups according to their effects. Then, the experimental design is folded over in the subspace of the influent inputs. This action can modify the uniformity properties of the experimental design by creating possible gaps and clusters. So, it is necessary to repair it by removing clusters and filling gaps. We propose a step-by-step approach to offer suitable repairing for each experimental design. Then, the repaired design is used for the final step: modelling from the response surface. We consider a Support Vector Machines method because dimension does not affect the construction. Easy to construct and with good results, similar to the results obtained by Kriging, the Support Vector Regression method is an alternative method for the study of complex phenomena in high dimensions. Algorithme WSP Amas Analyse de sensibilité Grande dimension Lacunes Modélisation Réparation Repliage Séparateurs à Vaste Marge Simulation numérique Clusters Gaps High dimensions Modelling Numeric simulation Response surface Sensitivity analysis Support Vector Regression WSP algorithm
9	Understanding High-Dimensional Data Using Reeb Graphs Harvey, William John 14 August 2012 (has links) No description available. Bioinformatics Computer Science computer science computational geometry computational topology geometry topology high dimensions high dimensional data Reeb graph contour tree visualization visual analytics Morse theory protein folding molecular dynamics survivin
10	n-TARP: A Random Projection based Method for Supervised and Unsupervised Machine Learning in High-dimensions with Application to Educational Data Analysis Yellamraju Tarun (6630578) 11 June 2019 (has links) Analyzing the structure of a dataset is a challenging problem in high-dimensions as the volume of the space increases at an exponential rate and typically, data becomes sparse in this high-dimensional space. This poses a significant challenge to machine learning methods which rely on exploiting structures underlying data to make meaningful inferences. This dissertation proposes the <i>n</i>-TARP method as a building block for high-dimensional data analysis, in both supervised and unsupervised scenarios.<div><br></div><div>The basic element, <i>n</i>-TARP, consists of a random projection framework to transform high-dimensional data to one-dimensional data in a manner that yields point separations in the projected space. The point separation can be tuned to reflect classes in supervised scenarios and clusters in unsupervised scenarios. The <i>n</i>-TARP method finds linear separations in high-dimensional data. This basic unit can be used repeatedly to find a variety of structures. It can be arranged in a hierarchical structure like a tree, which increases the model complexity, flexibility and discriminating power. Feature space extensions combined with <i>n</i>-TARP can also be used to investigate non-linear separations in high-dimensional data.<br></div><div><br></div><div>The application of <i>n</i>-TARP to both supervised and unsupervised problems is investigated in this dissertation. In the supervised scenario, a sequence of <i>n</i>-TARP based classifiers with increasing complexity is considered. The point separations are measured by classification metrics like accuracy, Gini impurity or entropy. The performance of these classifiers on image classification tasks is studied. This study provides an interesting insight into the working of classification methods. The sequence of <i>n</i>-TARP classifiers yields benchmark curves that put in context the accuracy and complexity of other classification methods for a given dataset. The benchmark curves are parameterized by classification error and computational cost to define a benchmarking plane. This framework splits this plane into regions of "positive-gain" and "negative-gain" which provide context for the performance and effectiveness of other classification methods. The asymptotes of benchmark curves are shown to be optimal (i.e. at Bayes Error) in some cases (Theorem 2.5.2).<br></div><div><br></div><div>In the unsupervised scenario, the <i>n</i>-TARP method highlights the existence of many different clustering structures in a dataset. However, not all structures present are statistically meaningful. This issue is amplified when the dataset is small, as random events may yield sample sets that exhibit separations that are not present in the distribution of the data. Thus, statistical validation is an important step in data analysis, especially in high-dimensions. However, in order to statistically validate results, often an exponentially increasing number of data samples are required as the dimensions increase. The proposed <i>n</i>-TARP method circumvents this challenge by evaluating statistical significance in the one-dimensional space of data projections. The <i>n</i>-TARP framework also results in several different statistically valid instances of point separation into clusters, as opposed to a unique "best" separation, which leads to a distribution of clusters induced by the random projection process.<br></div><div><br></div><div>The distributions of clusters resulting from <i>n</i>-TARP are studied. This dissertation focuses on small sample high-dimensional problems. A large number of distinct clusters are found, which are statistically validated. The distribution of clusters is studied as the dimensionality of the problem evolves through the extension of the feature space using monomial terms of increasing degree in the original features, which corresponds to investigating non-linear point separations in the projection space.<br></div><div><br></div><div>A statistical framework is introduced to detect patterns of dependence between the clusters formed with the features (predictors) and a chosen outcome (response) in the data that is not used by the clustering method. This framework is designed to detect the existence of a relationship between the predictors and response. This framework can also serve as an alternative cluster validation tool.<br></div><div><br></div><div>The concepts and methods developed in this dissertation are applied to a real world data analysis problem in Engineering Education. Specifically, engineering students' Habits of Mind are analyzed. The data at hand is qualitative, in the form of text, equations and figures. To use the <i>n</i>-TARP based analysis method, the source data must be transformed into quantitative data (vectors). This is done by modeling it as a random process based on the theoretical framework defined by a rubric. Since the number of students is small, this problem falls into the small sample high-dimensions scenario. The <i>n</i>-TARP clustering method is used to find groups within this data in a statistically valid manner. The resulting clusters are analyzed in the context of education to determine what is represented by the identified clusters. The dependence of student performance indicators like the course grade on the clusters formed with <i>n</i>-TARP are studied in the pattern dependence framework, and the observed effect is statistically validated. The data obtained suggests the presence of a large variety of different patterns of Habits of Mind among students, many of which are associated with significant grade differences. In particular, the course grade is found to be dependent on at least two Habits of Mind: "computation and estimation" and "values and attitudes."<br></div> Statistics Education Applied Statistics Probability Theory Stochastic Analysis and Modelling Pattern Recognition and Data Mining Education Assessment and Evaluation High-dimensions n-TARP Clustering Methods Machine Learning data analysis study Educational Data Statistical test pattern analysis techniques Pattern Dependence Benchmarks

Search results