1

Essays in Econometrics and Finance:

Lan, Xiaoying January 2022 (has links)
Thesis advisor: Shakeeb Khan / Thesis advisor: Zhijie Xiao / Binary choice models can be easily estimated (using, e.g., maximum likelihood estimation) when the distribution of the latent error is known, as in Logit or Probit. In contrast, most estimators that leave the error distribution unknown (e.g., maximum score, maximum rank correlation, or Klein-Spady) are computationally difficult or numerically unstable, making estimation impractical with more than a few regressors. The first chapter proposes an estimator whose objective is convex at each iteration, and which is therefore numerically well behaved even with many regressors and large sample sizes. The proposed estimator, which is root-n consistent and asymptotically normal, is based on batch gradient descent, using a sieve to estimate the unknown error distribution function. Simulations show that the estimator has lower mean bias and root mean squared error than the Klein-Spady estimator, and requires less time to compute. The second chapter studies the same estimator in a high-dimensional setting. The estimator is consistent, at a rate slower than root-n, when the number of regressors grows more slowly than the number of observations, and asymptotically normal when the square of the number of regressors grows more slowly than the number of observations. Both theory and simulations show that a higher learning rate is needed as the number of regressors grows. The third chapter applies the proposed estimator to bankruptcy prediction. With more than 20 regressors, the proposed estimator outperforms logistic regression in terms of Area Under the Receiver Operating Characteristic curve (AUC) using firm data one or two years prior to bankruptcy, but underperforms it using firm data three years prior to bankruptcy. / Thesis (PhD) — Boston College, 2022. / Submitted to: Boston College. Graduate School of Arts and Sciences. / Discipline: Economics.
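The abstract leaves the estimator's exact construction to the thesis itself. Purely as a hedged illustration of the idea — a single index with an unknown error CDF approximated by a low-order sieve and fitted by batch gradient descent — a minimal sketch might look as follows; the basis choice, logistic link, learning rate, and normalization are all assumptions for illustration, not the dissertation's specification.

```python
import numpy as np

def sieve_gd_binary_choice(X, y, n_basis=3, lr=0.05, n_iter=5000):
    """Sketch: fit y = 1{X @ beta + e > 0} with unknown error CDF.
    The CDF is approximated by logistic(sum_k gamma_k * t**k), a crude
    polynomial sieve chosen here purely for illustration."""
    n, p = X.shape
    beta = np.zeros(p)
    gamma = np.zeros(n_basis)
    gamma[0] = 1.0                         # start near the logistic CDF
    for _ in range(n_iter):
        t = X @ beta                       # single index
        B = np.column_stack([t ** k for k in range(1, n_basis + 1)])
        F = 1.0 / (1.0 + np.exp(-(B @ gamma)))   # estimated P(y=1|X)
        dz = (F - y) / n                   # gradient of Bernoulli NLL
        # chain rule through the sieve: dz/dt = sum_k k * gamma_k * t^(k-1)
        dt = sum(k * gamma[k - 1] * t ** (k - 1)
                 for k in range(1, n_basis + 1))
        beta -= lr * (X.T @ (dz * dt))     # batch gradient steps
        gamma -= lr * (B.T @ dz)
    return beta / np.linalg.norm(beta)     # scale fixed for identification
```

On logit-generated data, the returned direction should align with the true coefficient vector up to the scale normalization.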
2

Variable Screening Methods in Multi-Category Problems for Ultra-High Dimensional Data

Zeng, Yue January 2017 (has links)
Variable screening techniques are fast, crude tools that scan high-dimensional data to reduce dimension before a refined variable selection method is applied. Because screening relies on marginal analysis, it remains computationally feasible for ultra-high dimensional problems. However, most existing screening methods for classification are designed only for binary problems; a comprehensive study of variable screening for multi-class classification is lacking. This research aims to fill that gap by developing variable screening for multi-class problems, to meet the needs of high-dimensional classification. The work has useful applications in cancer studies, medicine, engineering, and biology. We propose and investigate new, effective screening methods for multi-class classification problems of two types. The first conducts screening for multiple binary classification problems separately and then aggregates the selected variables. The second conducts screening for the multi-class classification problem directly. For each method we investigate important issues such as the choice of classification algorithm, variable ranking, and model size determination. We implement various selection criteria and compare their performance. Extensive simulation studies evaluate and compare the proposed screening methods with existing ones and show that the new methods are promising. Furthermore, we apply the proposed methods to four cancer studies. R code has been developed for each method.
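As a concrete instance of the second, direct type of screening, one could rank variables by a marginal one-way ANOVA F statistic across all classes and keep the highest-ranked ones. This is a hedged sketch of the general recipe, not the thesis's specific criteria (which the abstract does not name).

```python
import numpy as np
from scipy.stats import f_oneway

def anova_screen(X, y, model_size):
    """Direct multi-class marginal screening: score each variable by a
    one-way ANOVA F statistic across the classes, keep the top ones."""
    classes = np.unique(y)
    scores = np.array([f_oneway(*(X[y == c, j] for c in classes)).statistic
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:model_size]
```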
3

Model-based clustering of high-dimensional binary data

Tang, Yang 05 September 2013 (has links)
We present a mixture of latent trait models with common slope parameters (MCLT) for high-dimensional binary data, a data type for which few established methods exist. Recent work on clustering binary data based on a d-dimensional Gaussian latent variable is extended by implementing common factor analyzers. We extend the model further by incorporating random block effects: the dependencies within each block are taken into account through block-specific parameters that are treated as random variables. A variational approximation to the likelihood is exploited to derive a fast algorithm for determining the model parameters. The Bayesian information criterion is used to select the number of components, the covariance structure, and the dimension of the latent variables. Our approach is demonstrated on U.S. Congressional voting data and on a data set describing the sensory properties of orange juice. These examples show that our model performs well even when the number of observations is not very large relative to the data dimensionality, and in both cases our approach yields intuitive clustering results. Additionally, our dimensionality-reduction method allows data to be displayed in low-dimensional plots. / Early Researcher Award from the Government of Ontario (McNicholas); NSERC Discovery Grants (Browne and McNicholas).
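The model selection step described above reduces to comparing BIC values over a grid of component counts, covariance structures, and latent dimensions. The criterion itself is standard; the fitting routine (a variational EM for the MCLT model) is too long for a sketch, so `fit_mclt` below is a hypothetical placeholder.

```python
import numpy as np

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion; smaller is better in this sign
    convention."""
    return -2.0 * log_lik + n_params * np.log(n_obs)

# Hypothetical search loop, with fit_mclt a placeholder returning the
# maximized log-likelihood and parameter count of one fitted model:
#
# n = Y.shape[0]
# best = min((bic(*fit_mclt(Y, G, q), n), (G, q))
#            for G in range(1, 5) for q in range(1, 4))
```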
4

ESTIMATING R&D INTERACTION STRUCTURES AND SPILLOVER EFFECTS

Tsyawo, Emmanuel Selorm January 2020 (has links)
Firms’ research and development (R&D) efforts are known to generate spillover effects on other firms’ outcomes, e.g., innovation and productivity. Policy recommendations that ignore spillover effects may not be optimal from a social perspective, hence the importance of accounting for them. Quantifying R&D spillover effects typically requires a spatial matrix that characterises the structure of interaction between firms. In practice, the spatial matrix is often unknown, due to factors that include the multiplicity of forms of connectivity and unclear guidance from economic theory. Estimates can be biased if the spatial matrix is misspecified, and they can also be sensitive to the choice of spatial matrix. This dissertation develops robust techniques that estimate the spatial matrix alongside other parameters from data, using a two-pronged approach: (1) model elements of the spatial matrix using spatial covariates (e.g., geographic and product market proximity) and a parameter vector of finite length, and (2) estimate the spatial matrix as a set of parameters from panel data. Approaches (1) and (2) address two identification challenges in single-index models: uncertainty over the relevant forms of connectivity, and high-dimensionality of the design matrix. In this three-chapter dissertation, the first approach is applied in the first and third chapters, while the second approach is applied in the second chapter. Chapter 1 proposes a parsimonious approach to estimating the spatial matrix and parameters from panel data when the spatial matrix is partly or fully unknown. By controlling for several forms of connectivity between firms, the approach is made robust to misspecification of the spatial matrix. The flexibility of the approach also allows the data to determine the degrees of sparsity and asymmetry of the spatial matrix. The chapter establishes consistency and asymptotic normality of the MLE under conditional independence and conditional strong-mixing assumptions on the outcome variable. The empirical results confirm positive spillover and private effects of R&D on firm innovation. There is evidence of time-variation and asymmetry in the interaction structure between firms. Geographic proximity and product market proximity are confirmed as relevant forms of connectivity between firms. Moreover, connectivity between firms is not limited to often-assumed notions of proximity; it is also linked to firms’ past R&D and patenting behaviour. Single-index models suffer non-identification due to rank deficiency when the design matrix is high-dimensional. Chapter 2 proposes an estimator that projects a high-dimensional parameter vector into a reduced, consistently estimable one. This estimator generalises the sparsity assumption required by shrinkage methods such as the Lasso, and it applies even if the high-dimensional parameter vector’s support is bounded away from zero. Monte Carlo simulations demonstrate the estimator’s high approximating ability, improved precision, and reduced bias. The estimator is used to estimate the network structure between firms in order to quantify the spillover effects of R&D on productivity using panel data. The empirical results show that firms on average generate positive R&D spillovers on firm productivity. The spatial autoregressive (SAR) model has wide applicability in economics and social networks; it is used to estimate, for example, equilibrium and peer effects models. The SAR model, like other spatial econometric models, is not immune to challenges associated with misspecification of, or uncertainty over, the spatial matrix. Chapter 3 applies the approach developed in Chapter 1 to estimate the spatial matrix in the SAR model with autoregressive disturbances in a parsimonious yet flexible way using GMM. The asymptotic properties of the GMM estimator are established, and Monte Carlo simulations show good small-sample performance. / Economics
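A minimal sketch of approach (1): each off-diagonal entry of the spatial matrix is a parametric function of pairwise covariates (geographic proximity, product-market proximity, ...) and a finite parameter vector, so the matrix can be estimated jointly with the model's other parameters. The exponential link and row normalization below are assumptions for illustration, not the dissertation's exact specification.

```python
import numpy as np

def spatial_matrix(Z, theta):
    """Z has shape (N, N, k): pairwise connectivity covariates between
    the N firms; theta is the finite parameter vector of approach (1).
    Returns a row-normalized N x N interaction matrix."""
    W = np.exp(Z @ theta)            # (N, N): single index per firm pair
    np.fill_diagonal(W, 0.0)         # a firm generates no self-spillover
    return W / W.sum(axis=1, keepdims=True)
```

Because theta has fixed, finite length, this parameterization sidesteps the high-dimensionality of estimating all N(N-1) entries freely, which is what approach (2) confronts directly.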
5

Ecological monitoring of semi-natural grasslands: statistical analysis of dense satellite image time series with high spatial resolution

Lopes, Maïlys 24 November 2017 (has links)
Grasslands are a significant source of biodiversity in farmed landscapes, and it is important to monitor them. New-generation satellites such as Sentinel-2 offer new opportunities for grassland monitoring thanks to their combined high spatial and temporal resolutions. However, the data provided by these sensors raise big data and high-dimensionality issues because of the increasing number of pixels to process and the large number of spectro-temporal variables. This thesis explores the potential of new-generation satellites for monitoring biodiversity, and the factors influencing biodiversity, in semi-natural grasslands. Tools suitable for the statistical analysis of grasslands using dense satellite image time series (SITS) with high spatial resolution are provided. First, we show that the spectro-temporal response of grasslands is characterized by its variability within and among grasslands. Then, for the statistical analysis, grasslands are modeled at the object level, to be consistent with ecological models that represent grasslands at the field scale. We propose to model the distribution of pixels in a grassland by a Gaussian distribution. Following this modeling, similarity measures between two Gaussian distributions that are robust to high dimension are developed for the classification of grasslands using dense SITS: the High-Dimensional Kullback-Leibler Divergence and the α-Gaussian Mean Kernel. The latter outperforms conventional methods used with Support Vector Machines for the classification of grasslands according to their management practices and their age. Finally, indicators of grassland biodiversity derived from dense SITS are proposed, through spectro-temporal heterogeneity measures obtained from the unsupervised clustering of grasslands. Their correlation with the Shannon index is significant but low. The results suggest that the spectro-temporal variations measured from SITS at a spatial resolution of 10 meters, covering the period when the practices occur, relate more to the intensity of management practices than to species diversity. Therefore, although the spatial and spectral properties of Sentinel-2 seem too limited to assess species diversity in grasslands directly, this satellite should make possible the continuous monitoring of factors influencing biodiversity in grasslands. In this thesis, we provide methods that account for the heterogeneity within grasslands and exploit all the spectral and temporal information provided by new-generation satellites.
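The per-grassland modeling reduces each field to the Gaussian distribution of its pixels, and similarity is then measured between Gaussians. The closed-form KL divergence below is the textbook building block of such measures; the thesis's High-Dimensional Kullback-Leibler Divergence adds regularization for the d >> n regime, which this sketch omits.

```python
import numpy as np

def gaussian_kl(m0, S0, m1, S1):
    """KL( N(m0, S0) || N(m1, S1) ) between two multivariate Gaussians,
    e.g. the pixel distributions of two grasslands."""
    d = m0.size
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(S0)   # log-determinants, for stability
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff
                  - d + logdet1 - logdet0)
```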
6

IMBALANCED HIGH DIMENSIONAL CLASSIFICATION AND APPLICATIONS IN PRECISION MEDICINE

Hui Sun 14 May 2019 (has links)
Classification is an important supervised learning technique with numerous applications. This dissertation addresses two research problems in this area. The first is multicategory classification methods for high-dimensional data. To handle high dimension low sample size (HDLSS) data with uneven group sizes (i.e., imbalanced data), we develop a new classification method called angle-based multicategory distance-weighted support vector machine (MDWSVM). It is motivated by its binary counterpart and has the merits of both the support vector machine (SVM) and distance-weighted discrimination (DWD) methods, while alleviating both the data piling issue of SVM and the imbalanced data issue of DWD. Theoretical results and numerical studies demonstrate the advantages of our MDWSVM method over existing methods.

The second part of the dissertation is on the application of classification methods to precision medicine problems. Because one-stage precision medicine problems can be reformulated as weighted classification problems, the subtle differences between classification methods may lead to different performance in this setting. Among the margin-based classification methods, we propose to use the distance-weighted discrimination outcome-weighted learning (DWD-OWL) method. We also extend the model to handle negative rewards for better generality and apply the angle-based idea to handle multiple treatments. Proofs of Fisher consistency for DWD-OWL in both the binary and multicategory cases are provided. Under mild conditions, the insensitivity of DWD-OWL to imbalanced settings is also demonstrated.
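The reformulation in the second part — one-stage precision medicine as weighted classification — can be sketched with an off-the-shelf linear classifier: classify the received treatment with weights reward/propensity. The hinge loss below stands in for the DWD loss the dissertation actually uses, and the sketch assumes nonnegative rewards (the chapter's extension to negative rewards requires additionally flipping labels and weights).

```python
from sklearn.svm import LinearSVC

def owl_rule(X, treatment, reward, propensity):
    """One-stage outcome-weighted learning as weighted classification:
    classify the observed treatment with weights reward / propensity.
    clf.predict(x_new) then recommends a treatment for a new patient."""
    clf = LinearSVC(loss="hinge")
    clf.fit(X, treatment, sample_weight=reward / propensity)
    return clf
```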
7

An efficient system for detecting the occurrence of events in multimedia signals

Oliveira, Celso de 01 July 2008 (has links)
In recent years there has been an increasing need for methods that can deal with multimedia content at large scale and search such information efficiently and effectively. The objects of interest are represented by feature vectors (e.g., color, texture, geometry, timbre) extracted from the content and associated with points in a multidimensional space. A search process then aims to find data similar to a given sample, typically by measuring distances between points. This is a problem common to a wide range of applications, including sound, images, video, digital libraries, medical imagery, and security. The major challenges stem from the difficulties inherent in high-dimensional spaces, known as the curse of dimensionality, which significantly limit the application of the most common search methods. The recent literature contains a number of dimension reduction methods that are highly dependent on the type of data considered. There is also a certain lack of general analysis methods that can accurately predict the performance of the proposed algorithms. The present work contains a general analysis of the principles applicable to high-dimensional search systems. This analysis makes it possible to establish precisely the tradeoff among the system's robustness (reflected mainly in its noise immunity), the recognition error rate, and the dimension of the observation space. Furthermore, it is shown that a general mapping method can be conceived, for recognition purposes, that is independent of the specifics of the content. To improve search efficiency, a new high-dimensional search method is introduced and analyzed. Finally, a practical realization is briefly described, developed according to the principles discussed, which efficiently serves commercial applications monitoring the broadcasting of radio and TV content.
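The detection step the abstract builds on can be made concrete with the exhaustive baseline below: compare a query feature vector against the database and declare a detection when the nearest distance falls under a threshold — the threshold being precisely the robustness/error-rate tradeoff the analysis quantifies. The thesis's own accelerated search method is not described in the abstract, so this is only the naive reference point.

```python
import numpy as np

def detect(query, database, tau):
    """Exhaustive nearest-neighbor detection in feature space.
    tau trades noise immunity against false-alarm rate (illustrative)."""
    dists = np.linalg.norm(database - query, axis=1)
    best = int(np.argmin(dists))
    return best if dists[best] <= tau else None   # None: no event found
```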
8

Kriging for turbomachine design: high dimension and robust multi-objective optimization

Ribaud, Mélina 17 October 2018 (has links)
Dans le secteur de l'automobile, les turbomachines sont des machines tournantes participant au refroidissement des moteurs des voitures. Leur performance dépend de multiples paramètres géométriques qui déterminent leur forme. Cette thèse s'inscrit dans le projet ANR PEPITO réunissant industriels et académiques autour de l'optimisation de ces turbomachines. L'objectif du projet est de trouver la forme du ventilateur maximisant le rendement en certains points de fonctionnement. Dans ce but, les industriels ont développé des codes CFD (computational fluid dynamics) simulant le fonctionnement de la machine. Ces codes sont très coûteux en temps de calcul. Il est donc impossible d'utiliser directement le résultat de ces simulations pour conduire une optimisation.Par ailleurs, lors de la construction des turbomachines, on observe des perturbations sur les paramètres d'entrée. Elles sont le reflet de fluctuations des machines de production. Les écarts observés sur la forme géométrique finale de la turbomachine peuvent provoquer une perte de performance conséquente. Il est donc nécessaire de prendre en compte ces perturbations et de procéder à une optimisation robuste à ces fluctuations. Dans ce travail de thèse, nous proposons des méthodes basées sur du krigeage répondant aux deux principales problématiques liées à ce contexte de simulations coûteuses :• Comment construire une bonne surface de réponse pour le rendement lorsqu'il y a beaucoup de paramètres géométriques ?• Comment procéder à une optimisation du rendement efficace tout en prenant en compte les perturbations des entrées ?Nous répondons à la première problématique en proposant plusieurs algorithmes permettant de construire un noyau de covariance pour le krigeage adapté à la grande dimension. Ce noyau est un produit tensoriel de noyaux isotropes où chacun de ces noyaux est lié à un sous groupe de variables d'entrée. Ces algorithmes sont testés sur des cas simulés et sur une fonction réelle. Les résultats montrent que l'utilisation de ce noyau permet d'améliorer la qualité de prédiction en grande dimension. Concernant la seconde problématique, nous proposons plusieurs stratégies itératives basées sur un co-krigeage avec dérivées pour conduire l'optimisation robuste. A chaque itération, un front de Pareto est obtenu par la minimisation de deux objectifs calculés à partir des prédictions de la fonction coûteuse. Le premier objectif représente la fonction elle-même et le second la robustesse. Cette robustesse est quantifiée par un critère estimant une variance locale et basée sur le développement de Taylor. Ces stratégies sont comparées sur deux cas tests en petite et plus grande dimension. Les résultats montrent que les meilleures stratégies permettent bien de trouver l'ensemble des solutions robustes. Enfin, les méthodes proposées sont appliquées sur les cas industriels propres au projet PEPITO. / The turbomachineries are rotary machines used to cool down the automotive engines. Their efficiency is impacted by a high number of geometric parameters that describe the shape.My thesis is fully funded by the ANR project PEPITO where industrials and academics collaborate. The aim of this project is to found the turbomachineries shape that maximizes the efficiency.That is why, industrials have developed numerical CFD (Computational fluid dynamics) codes that simulate the work of turbomachineries. However, the simulations are time-consuming. 
We cannot directly use the simulations provided to perform the optimization.In addition, during the production line, the input variables are subjected to perturbations. These perturbations are due to the production machineries fluctuations. The differences observed in the final shape of the turbomachinery can provoke a loss of efficiency. These perturbations have to be taken into account to conduct an optimization robust to the fluctuations. In this thesis, since the context is time consuming simulations we propose kriging based methods that meet the requirements of industrials. The issues are: • How can we construct a good response surface for the efficiency when the number of input variables is high?• How can we lead to an efficient optimization on the efficiency that takes into account the inputs perturbations?Several algorithms are proposed to answer to the first question. They construct a covariance kernel adapted to high dimension. This kernel is a tensor product of isotropic kernels in each subspace of input variables. These algorithms are benchmarked on some simulated case and on a real function. The results show that the use of this kernel improved the prediction quality in high dimension. For the second question, seven iterative strategies based on a co-kriging model are proposed to conduct the robust optimization. In each iteration, a Pareto front is obtained by the minimization of two objective computed from the kriging predictions. The first one represents the function and the second one the robustness. A criterion based on the Taylor theorem is used to estimate the local variance. This criterion quantifies the robustness. These strategies are compared in two test cases in small and higher dimension. The results show that the best strategies have well found the set of robust solutions. Finally, the methods are applied on the industrial cases provided by the PEPITO project.
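A sketch of the high-dimensional kernel described above: a tensor product of isotropic kernels, one per subgroup of input variables. The Gaussian form of each factor is an assumption, and the grouping is what the thesis's algorithms determine from data — here it is passed in by hand.

```python
import numpy as np

def grouped_product_kernel(x1, x2, groups, lengthscales):
    """k(x1, x2) = prod_g exp(-||x1[g] - x2[g]||^2 / (2 * l_g^2)):
    an isotropic Gaussian kernel on each subgroup g of input variables,
    multiplied together into one covariance kernel."""
    k = 1.0
    for idx, ls in zip(groups, lengthscales):
        d2 = np.sum((x1[idx] - x2[idx]) ** 2)
        k *= np.exp(-d2 / (2.0 * ls ** 2))
    return k

# e.g. groups = [np.array([0, 1]), np.array([2, 3, 4])] splits a
# 5-dimensional input into two isotropic blocks (illustrative grouping).
```

With one group per variable this reduces to a fully anisotropic kernel (one lengthscale each), and with a single group to a fully isotropic one; the grouped form trades between those extremes, which is what makes it workable in high dimension.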
9

High Dimensional Multivariate Inference Under General Conditions

Kong, Xiaoli 01 January 2018 (has links)
In this dissertation, we investigate four distinct but interrelated problems in high-dimensional inference on mean vectors across multiple groups. The first problem concerns profile analysis of high-dimensional repeated measures. We introduce new test statistics and derive their asymptotic distributions under normality for both equal and unequal covariance cases. Our derivations mimic the Central Limit Theorem, with some important peculiarities addressed with sufficient rigor. We also derive consistent and unbiased estimators of the asymptotic variances for the equal and unequal covariance cases, respectively. The second problem is accurate inference for high-dimensional repeated measures in factorial designs, including any comparisons among the cell means. We derive asymptotic expansions for the null distribution and the quantiles of a suitable test statistic under normality, together with second-order consistent estimators of the parameters in the approximate distribution. The most important contribution is the high accuracy of the methods: p-values are accurate up to second order in sample size as well as in dimension. The third problem pertains to high-dimensional inference under non-normality. We relax the commonly imposed dependence conditions that have become a standard assumption in high-dimensional inference; with the relaxed conditions, the scope of applicability of the results broadens. The fourth problem is a fully nonparametric rank-based comparison of high-dimensional populations. To develop the theory in this context, we prove a novel result on the asymptotic behavior of quadratic forms in ranks. Simulation studies provide evidence that our methods perform reasonably well in high-dimensional situations. Real data from an electroencephalogram (EEG) study of alcoholic and control subjects are analyzed to illustrate the application of the results.
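As a hedged illustration of the fourth problem's main object, the sketch below computes componentwise midranks over the pooled sample and forms a simple quadratic form in the group rank means. The dissertation studies the asymptotics of such quadratic forms in general; its actual statistic and weighting matrix differ from this identity-weighted example.

```python
import numpy as np
from scipy.stats import rankdata

def rank_quadratic_form(samples):
    """samples: list of (n_g x d) arrays, one per group. Pools all
    observations, replaces each variable by its overall midranks, and
    returns sum_g n_g * ||Rbar_g - Rbar||^2 (identity weighting,
    purely for illustration)."""
    X = np.vstack(samples)
    R = np.apply_along_axis(rankdata, 0, X)   # midranks, per variable
    grand = R.mean(axis=0)
    q, start = 0.0, 0
    for S in samples:
        n_g = S.shape[0]
        group_mean = R[start:start + n_g].mean(axis=0)
        q += n_g * np.sum((group_mean - grand) ** 2)
        start += n_g
    return q
```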
