1 |
A Risk-Oriented Clustering Approach for Asset Categorization and Risk Measurement. Liu, Lu. 18 July 2019.
When faced with market risk for investments and portfolios, people often calculate a risk measure, a real number assigned to each random payoff. There are many ways to quantify potential risk, and the most important input to all of them is the distribution of future performance. Future distributions are unknown and are therefore estimated from historical Profit and Loss (P&L) distributions. However, past data may not be appropriate for estimating the future, and risk measures generated from a single historical distribution can be subject to error. One natural way to overcome these shortcomings is to identify and categorize similar assets whose P&L distributions can serve as alternative scenarios.
In practice, one of the most common and intuitive categorizations is the sector, based on industry. It is widely agreed that companies in the same sector share the same, or related, business types and operating characteristics. In the field of risk management, however, sector-based categorization does not necessarily group assets by their risk profiles, and we show that risk measures within the same sector tend to vary widely. Although improved risk measures that address distribution ambiguity have been discussed at length, we seek a more risk-oriented categorization by providing a new clustering approach. Furthermore, our method can better inform us of the potential risk and the extreme worst-case scenario within the same category.
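To make the idea concrete, the sketch below clusters toy assets by the 1-Wasserstein distance between their historical P&L samples and reports a historical expected shortfall per asset. It is a minimal sketch using standard SciPy tools; the distance, the clustering method, and all numbers are illustrative assumptions, not the thesis's actual procedure.

```python
# Sketch: cluster assets by the similarity of their historical P&L
# distributions rather than by sector. All choices here are illustrative.
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def historical_es(pnl, alpha=0.95):
    """Expected shortfall: mean loss beyond the (1 - alpha) quantile."""
    var = np.quantile(pnl, 1 - alpha)          # historical VaR threshold
    return -pnl[pnl <= var].mean()

rng = np.random.default_rng(0)
# Toy P&L histories for 6 assets (rows), 500 daily observations each.
pnl = rng.standard_t(df=4, size=(6, 500)) * rng.uniform(0.5, 2.0, (6, 1))

# Pairwise 1-Wasserstein distances between empirical P&L distributions.
n = len(pnl)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = wasserstein_distance(pnl[i], pnl[j])

# Risk-oriented categories: assets whose P&L distributions are close.
labels = fcluster(linkage(squareform(D), method="average"),
                  t=2, criterion="maxclust")
print("cluster labels:", labels)
print("ES per asset  :", [round(historical_es(p), 3) for p in pnl])
```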
|
2 |
Scalable Combinatorial Algorithms for Optimal Transport Based Similarity Metrics. Zhang, Kaiyi. 22 November 2024.
Optimal Transport (OT), also known as the Wasserstein distance, is a valuable metric for comparing probability distributions. Owing to its appealing statistical properties, researchers in fields such as machine learning use OT in a range of applications. However, computing both exact and approximate OT is computationally expensive and impractical for large datasets. Furthermore, OT is sensitive to small amounts of noise in the input distributions. In this document, we propose combinatorial methods for designing scalable and noise-resistant OT solutions.
We present four key contributions in this work. First, we introduce a novel combinatorial parallel algorithm for approximating OT, which achieves a parallel time complexity of $O(\log n/\varepsilon^2)$, where $n$ is the input size and $\varepsilon$ is the additive error. Our algorithm outperforms the state of the art in experiments. Second, we propose a new concept, the OT-profile, representing the minimum partial optimal transport cost $\sigma_\alpha$ as a function of the transported mass $\alpha$. This can be used to identify outliers in real-world data; we demonstrate the utility of the OT-profile on outlier detection and PU-learning tasks, where it outperforms the state of the art. Third, building upon the OT-profile, we propose a new OT-based metric for comparing distributions that is more robust to noise. This metric preserves desirable properties while reducing sensitivity to noise for high $p$ values, providing a robust solution for real-world datasets. Lastly, we have developed a Python library that integrates our algorithms and methods into a user-friendly framework, making it easier for practitioners to adopt our methods. Our work enhances the computational efficiency and robustness of OT, making it practical for machine learning applications. / Doctor of Philosophy / Imagine a task in which you are required to fill up a set of holes with piles of sand using the least amount of effort. The idea behind this is Optimal Transport (OT), also known as the Wasserstein distance. This distance is a popular mathematical tool that measures the similarity (or distance) between two probability distributions. It has many applications in different areas, including machine learning and data analysis. For example, it is employed in generative models to measure the difference between generated and real images, or in analyzing real-world datasets containing noise.
However, the practice of OT faces two major problems. First, the computation becomes expensive with larger datasets, making it infeasible for large-scale applications. Therefore, it is important to make the computation of OT more efficient in processing a large volume of data. Second, OT is sensitive to outliers or noise in the input distributions, which can significantly influence the results.
In this thesis, we address these challenges through four main contributions. First, we present novel parallel combinatorial algorithms that reduce the computational burden of OT. Our approach achieves a state-of-the-art time complexity and, at the same time, is easy to implement. Second, we introduce a method for computing an 'OT-profile', which enables us to identify outliers in the input distributions. This improves the robustness of OT in real-world scenarios, where data often contains noise and perturbations.
Third, building on the OT-profile concept, we propose a new robust OT-based metric for comparing distributions in the presence of noise. To broaden the impact of these advancements, we have open-sourced a Python library that implements our methods behind a user-friendly interface, making it easier for researchers and practitioners to apply our solutions to their data analysis tasks. This work not only advances the theoretical analysis of OT but also contributes to practical applications: by addressing the two challenges of scalability and robustness, we make OT more reliable, and easier to apply, on large-scale data.
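To illustrate the OT-profile idea, the sketch below computes a toy profile of minimum partial-matching cost versus transported mass for two point clouds with unit masses, using a standard dummy-padding reduction to the assignment problem. The construction and all parameters are illustrative assumptions; the thesis's own combinatorial algorithms are not reproduced here.

```python
# Illustrative "OT-profile" for equal-size point sets with unit masses:
# minimum cost of matching exactly k pairs, swept over k. Generic trick,
# not the thesis's algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def partial_matching_cost(C, k):
    """Min-cost matching of exactly k pairs, via dummy-node padding."""
    n = C.shape[0]
    big = C.max() * (n + 1) + 1.0      # forbids dummy-dummy pairings
    M = np.zeros((2 * n - k, 2 * n - k))
    M[:n, :n] = C                      # real-real pairs pay the true cost
    M[n:, n:] = big                    # dummy rows must take real columns,
                                       # leaving exactly k real-real pairs
    r, c = linear_sum_assignment(M)
    mask = (r < n) & (c < n)           # keep only the k real-real pairs
    return C[r[mask], c[mask]].sum()

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 2))
B = rng.normal(size=(30, 2))
B[-3:] += 8.0                          # three outliers far from everything

C = cdist(A, B) ** 2                   # squared-Euclidean ground cost
profile = [partial_matching_cost(C, k) for k in range(1, 31)]
# A sharp jump in the profile's slope flags where outliers start being
# matched -- the intuition behind OT-profile-based outlier detection.
print([round(v, 1) for v in profile[-5:]])
```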
|
3 |
Robust Exact Algorithms for the Euclidean Bipartite Matching Problem. Gattani, Akshaykumar Gopalkrishna. 06 July 2023.
The minimum cost bipartite matching problem is a well-studied optimization problem in computer science and operations research, with wide-ranging applications in fields such as machine learning, economics, transportation, logistics and biology. A special instance of this problem is the computation of the p-Wasserstein distance, which we define next. Given a complete bipartite graph with two disjoint sets of n points in d-dimensional Euclidean space and an integer p ≥ 1, let the cost of an edge be the p-th power of the Euclidean distance between its endpoints. The objective is to find a minimum-cost matching in this complete bipartite graph. The Hungarian algorithm is a classical method that solves this problem in O(n^3) time. Many algorithms achieve a better run time than the Hungarian algorithm when the graphs have non-negative integer edge costs bounded by C. Since the input points have real-valued coordinates and the Euclidean distances can be irrational, such algorithms only return an approximate matching. Thus, the Hungarian algorithm remains the fastest known algorithm to compute an exact matching. In this thesis, we implement a new algorithm in the divide and conquer framework that computes the exact p-Wasserstein distance and has a run time asymptotically better than the Hungarian algorithm for stochastic point sets. Inspired by the techniques used in the algorithm, we also design an alternate version of the Hungarian algorithm that uses a grid-based approach. Our experimental analysis shows that both of our algorithms significantly outperform the classical Hungarian algorithm. / Master of Science / Suppose we have two sets with an equal number of items and a list of compatible pairs of items, where a pair is considered compatible if its items belong to different sets. A perfect matching is a subset of compatible pairs in which each item is paired with exactly one other item. There may be multiple perfect matchings, and minimizing the cost of the perfect matching is often desired. This is the minimum cost bipartite matching problem, which is extensively studied due to its importance in algorithmic theory and operations research. A special instance of this problem is the calculation of the p-Wasserstein distance, which has many practical applications in fields such as machine learning, economics, transportation, logistics and biology. The Hungarian algorithm remains the fastest known algorithm that computes the exact p-Wasserstein distance. Therefore, our focus is to develop exact algorithms for this problem that perform better than the Hungarian algorithm and can scale efficiently to large datasets.
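For reference, the exact p-Wasserstein computation described above can be written directly against the assignment problem; the sketch below uses SciPy's linear_sum_assignment in place of a hand-rolled Hungarian solver. The point sets and p are illustrative, and the thesis's divide-and-conquer and grid-based accelerations are not reproduced.

```python
# Baseline exact p-Wasserstein between two equal-size point sets: build
# the n x n matrix of p-th powers of Euclidean distances and solve the
# min-cost perfect matching (assignment) problem.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def p_wasserstein(A, B, p=2):
    C = cdist(A, B) ** p                   # edge cost = ||a - b||^p
    rows, cols = linear_sum_assignment(C)  # min-cost perfect matching
    return (C[rows, cols].sum() / len(A)) ** (1.0 / p)

rng = np.random.default_rng(0)
A = rng.uniform(size=(200, 2))             # stochastic point sets, as in
B = rng.uniform(size=(200, 2))             # the experiments above
print(round(p_wasserstein(A, B, p=2), 4))
```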
|
4 |
On the calibration of Lévy driven time series with coupling distances: an application in paleoclimate. Gairing, Jan; Högele, Michael; Kosenkova, Tetiana; Kulik, Alexei. January 2014.
This article aims at the statistical assessment of time series with large fluctuations over short times, which are assumed to stem from a continuous process perturbed by a Lévy process exhibiting heavy-tail behavior. We propose an easily implementable procedure to efficiently estimate the statistical difference between the noisy behavior of the data and a given reference jump measure in terms of so-called coupling distances. After a short introduction to Lévy processes and coupling distances, we recall basic statistical approximation results and derive rates of convergence. The procedure is then elaborated in detail in an abstract setting and finally applied, in a case study, to simulated and paleoclimate data. It indicates the dominant presence of a non-stable heavy-tailed jump Lévy component with a tail index greater than 2.
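To convey the flavor of such a calibration, the sketch below extracts large increments from a toy heavy-tailed series, compares them to samples from a candidate reference jump law with a 1-D Wasserstein distance, and estimates a tail index with a Hill estimator. This is a loose stand-in under those assumptions, not the article's coupling-distance construction.

```python
# Loose illustration: compare large increments of a time series against a
# reference jump distribution, and estimate the tail index.
import numpy as np
from scipy.stats import wasserstein_distance

def hill_tail_index(x, k=200):
    """Hill estimator of the tail index from the k largest magnitudes."""
    y = np.sort(np.abs(x))
    return 1.0 / np.mean(np.log(y[-k:] / y[-k - 1]))

rng = np.random.default_rng(7)
series = np.cumsum(rng.standard_t(df=3, size=10_000))  # toy heavy-tailed path
jumps = np.diff(series)
threshold = np.quantile(np.abs(jumps), 0.99)
large = jumps[np.abs(jumps) > threshold]               # candidate "jumps"

reference = rng.standard_t(df=3, size=large.size)      # reference jump law
print("distance to reference:",
      round(wasserstein_distance(large, reference), 3))
print("estimated tail index :", round(hill_tail_index(jumps), 2))
```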
|
5 |
Wasserstein Distance on Finite Spaces: Statistical Inference and Algorithms. Sommerfeld, Max. 18 October 2017.
No description available.
|
6 |
Transportation Techniques for Geometric Clustering. January 2020.
This thesis introduces new techniques for clustering distributional data according to their geometric similarities. The work builds upon the optimal transport (OT) problem, which seeks the globally minimal cost for matching distributional data, and leverages the connection between OT and power diagrams to solve different clustering problems. The OT formulation is based on a variational principle for differentiating through hard cluster assignments, which was missing in the literature. This thesis shows multiple techniques to regularize and generalize OT to cope with various tasks, including clustering, aligning, and interpolating distributional data. It also discusses the connections of the new formulation to other OT and clustering formulations to better understand their gaps and the means to close them. Finally, this thesis demonstrates the advantages of the proposed OT techniques in solving machine learning problems and their downstream applications in computer graphics, computer vision, and image processing. / Doctoral Dissertation, Computer Engineering, 2020
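Since the abstract mentions regularizing OT, the sketch below shows the most common example of such regularization: entropic smoothing solved by Sinkhorn iterations between two discrete histograms. This is generic textbook material under illustrative parameters, not this thesis's variational formulation.

```python
# Entropic-regularized OT via Sinkhorn iterations on a 1-D grid.
import numpy as np

def sinkhorn(a, b, C, eps=0.05, iters=500):
    """Entropic-regularized OT plan between histograms a and b."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)                # alternate marginal scalings
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]      # transport plan
    return (P * C).sum(), P

x = np.linspace(0, 1, 50)
a = np.exp(-((x - 0.3) ** 2) / 0.01); a /= a.sum()
b = np.exp(-((x - 0.7) ** 2) / 0.02); b /= b.sum()
C = (x[:, None] - x[None, :]) ** 2       # squared ground cost on the grid
cost, plan = sinkhorn(a, b, C)
print(round(cost, 4))
```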
|
7 |
Empirical Optimal Transport on Discrete Spaces: Limit Theorems, Distributional Bounds and Applications. Tameling, Carla. 11 December 2018.
No description available.
|
8 |
Distances within and between Metric Spaces: Metric Geometry, Optimal Transport and Applications to Data Analysis. Wan, Zhengchao. January 2021.
No description available.
|
9 |
High dimensional Markov chain Monte Carlo methods: theory, methods and applications. Durmus, Alain. 02 December 2016.
The subject of this thesis is the analysis of Markov Chain Monte Carlo (MCMC) methods and the development of new methodologies for sampling from a high dimensional distribution. Our work is divided into three main topics. The first is the convergence of Markov chains in Wasserstein distance: geometric and sub-geometric convergence bounds with explicit constants are derived under appropriate conditions. These results are then applied to the study of MCMC algorithms. The first algorithm analyzed is an alternative scheme to the Metropolis Adjusted Langevin Algorithm (MALA), for which explicit geometric convergence bounds are established. The second is the pre-conditioned Crank-Nicolson algorithm; it is shown that, under mild assumptions, the Markov chain associated with this algorithm is sub-geometrically ergodic in an appropriate Wasserstein distance. The second topic of this thesis is the study of the Unadjusted Langevin Algorithm (ULA). We are first interested in explicit convergence bounds in total variation under different kinds of assumptions on the potential associated with the target distribution, paying particular attention to the dependence of the algorithm on the dimension of the state space.
The case of fixed step sizes as well as the case of nonincreasing sequences of step sizes are dealt with. When the target density is strongly log-concave, explicit bounds in Wasserstein distance are established; these are then used to derive new bounds in total variation distance that improve on those previously derived under weaker conditions on the target density. The last part tackles new optimal scaling results for Metropolis-Hastings type algorithms. First, we extend the pioneering result on the optimal scaling of the random walk Metropolis algorithm to target densities that are differentiable in Lp mean for p ≥ 2. Then, we derive new Metropolis-Hastings type algorithms with a better optimal scaling than the MALA algorithm. Finally, the stability and the convergence in total variation of these new algorithms are studied.
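The ULA studied above has a very short description as an algorithm: it is the Euler discretization X_{k+1} = X_k - γ ∇U(X_k) + sqrt(2γ) Z_k of the Langevin diffusion targeting π ∝ exp(-U). The sketch below implements it for a toy strongly log-concave (standard Gaussian) target; the potential, step size, and dimension are illustrative assumptions. With a fixed step size the chain targets a slightly biased distribution, which is exactly the discretization error the total-variation bounds above quantify.

```python
# Unadjusted Langevin Algorithm (ULA) on a toy Gaussian target.
import numpy as np

def ula(grad_U, x0, gamma=1e-2, n_steps=50_000, rng=None):
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        noise = rng.standard_normal(x.size)
        # Euler step of dX_t = -grad U(X_t) dt + sqrt(2) dW_t
        x = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * noise
        samples[k] = x
    return samples

d = 10                                  # dimension of the state space
grad_U = lambda x: x                    # U(x) = ||x||^2 / 2, target N(0, I)
chain = ula(grad_U, x0=np.full(d, 5.0))
# Per-coordinate variance is ~1/(1 - gamma/2), slightly above the true 1.0:
# the fixed-step bias that vanishes as the step size decreases.
print(chain[10_000:].var(axis=0).round(3)[:3])
```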
|
10 |
Semi-Supervised Semantic Segmentation for Agricultural Aerial Images. Chen-yi Lu (15383813). 01 May 2023.
Unmanned Aerial Systems (UAS) have become an essential tool for field scouting, nutrient application, and farm management. However, assessing the aerial images captured by UAS is labor-intensive, and human assessment can be misleading and introduce bias. Deep-learning-based image segmentation has been proposed to assist in segmenting different areas of interest in the field, but it usually requires a significant amount of pixel-level annotated data. To address this, we propose a semi-supervised learning algorithm, AgSemSeg, to train a robust image segmentation model with less annotated data. Semi-supervised semantic segmentation aims to predict accurate pixel-level segmentation results by incorporating unlabeled images. Existing methods rely on computing a consistency loss between the output predictions on unlabeled images and their pseudo-labels. In AgSemSeg, we exploit intermediate feature representations rather than only the output predictions to improve the overall performance of the model. Specifically, we add a projection layer on the output of the backbone encoder and impose a consistency loss between intermediate feature representations using the Sliced-Wasserstein distance. We evaluate AgSemSeg on the Agriculture-Vision dataset and outperform the supervised baseline by up to 9.71%. We also evaluate AgSemSeg on benchmark datasets such as PASCAL VOC 2012 and Cityscapes, where it outperforms supervised baselines by up to 24.6% and 7.5% mIoU, respectively. Finally, we perform extensive ablation studies to show that our proposed components are key to the performance improvements of our method.
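The Sliced-Wasserstein distance used above as a consistency loss admits a compact generic implementation: project both feature sets onto random unit directions and average the closed-form 1-D Wasserstein distances between sorted projections. The sketch below shows that generic computation on random feature matrices; the shapes, projection count, and pairing of labeled and pseudo-labeled features are illustrative assumptions, not AgSemSeg's exact loss wiring.

```python
# Generic sliced Wasserstein distance between two equal-size feature sets.
import numpy as np

def sliced_wasserstein(X, Y, n_proj=64, rng=None):
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)       # random unit direction
        x_proj = np.sort(X @ theta)          # 1-D OT = sorted matching
        y_proj = np.sort(Y @ theta)
        total += np.mean(np.abs(x_proj - y_proj))
    return total / n_proj

feats_labeled = np.random.default_rng(1).normal(size=(256, 128))
feats_pseudo = feats_labeled + 0.1 * np.random.default_rng(2).normal(size=(256, 128))
print(round(sliced_wasserstein(feats_labeled, feats_pseudo), 4))
```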
|