81 |
Um estudo sobre o papel de medidas de similaridade em visualização de coleções de documentos / A study on the role of similarity measures in visual text analytics. Salazar, Frizzi Alejandra San Roman, 27 September 2012.
Information visualization techniques, such as similarity-based point placement, are used to generate visual representations of data that reveal patterns. These techniques are sensitive to data quality, which in turn depends on a very influential preprocessing step. This step involves cleaning the text and, in some cases, detecting terms and their weights, as well as defining a (dis)similarity function. Few studies have examined how these (dis)similarity calculations affect the quality of visual representations of textual data. This work presents a study on the role of different (dis)similarity measures between pairs of texts in generating visual maps. We focus primarily on two types of distance functions: those computed from vector representations of the text (the Vector Space Model, VSM) and measures obtained from direct comparison of text strings. We compare their effect on visual maps produced with point placement techniques, using objective measures of visual quality such as the Neighborhood Hit (NH) and the Silhouette Coefficient (SC). We found that both approaches have strengths, but in general the VSM yielded better class discrimination. However, the conventional VSM is not incremental: new additions to the collection force recalculation of the data space and of previously computed dissimilarities. For this reason, an incremental model based on the VSM (the Incremental Vector Space Model, iVSM) was also included in our comparative studies. The iVSM showed the best quantitative and qualitative results in several of the configurations tested. The evaluation results are presented, and recommendations on applying different text similarity measures in visual analysis tasks are provided.
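The abstract contrasts dissimilarities computed in a vector space with direct string comparison, and scores the resulting maps with measures such as the Neighborhood Hit. The sketch below illustrates that pipeline under stated assumptions: TF-IDF with cosine distance stands in for the VSM, difflib's sequence matcher stands in for the string measures, and metric MDS stands in for the point placement technique; none of these are necessarily the choices made in the thesis.

```python
# Illustrative sketch (not the thesis code): compare a VSM-based dissimilarity
# with a direct string-comparison dissimilarity, then score a 2-D projection
# with the Neighborhood Hit (fraction of each point's k nearest neighbors
# sharing its class label).
import numpy as np
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances
from sklearn.manifold import MDS

docs = ["the cat sat on the mat", "a cat on a mat", "stocks fell sharply today"]
labels = np.array([0, 0, 1])

# VSM route: TF-IDF vectors + cosine dissimilarity.
tfidf = TfidfVectorizer().fit_transform(docs)
d_vsm = pairwise_distances(tfidf, metric="cosine")

# String route: 1 - normalized similarity of the raw strings.
d_str = np.array([[1 - SequenceMatcher(None, a, b).ratio() for b in docs] for a in docs])

def neighborhood_hit(dist, labels, k=1):
    order = np.argsort(dist, axis=1)[:, 1:k + 1]      # skip self at position 0
    return np.mean(labels[order] == labels[:, None])

# Point placement from each dissimilarity matrix, then compare map quality.
for name, d in [("VSM", d_vsm), ("string", d_str)]:
    xy = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(d)
    print(name, "NH:", neighborhood_hit(pairwise_distances(xy), labels, k=1))
```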
|
83 |
Numerische Methoden zur Analyse hochdimensionaler Daten / Numerical Methods for Analyzing High-Dimensional Data. Heinen, Dennis, 01 July 2014.
This dissertation addresses two of the main challenges that arise when working with large data sets: dimensionality reduction and data denoising. The first part of the dissertation provides an overview of dimensionality reduction. The goal of dimensionality reduction is a meaningful low-dimensional representation of a given high-dimensional data set. In particular, we discuss and compare established manifold learning methods. The central assumption of manifold learning is that the high-dimensional data set lies (approximately) on a low-dimensional manifold. Noise in the data set hampers all dimensionality reduction methods.
The second part of the dissertation introduces a new denoising method for high-dimensional data: a wavelet shrinkage method for smoothing noisy samples of an underlying multivariate piecewise continuous function, where the sample points may be scattered. The method generalizes and extends the "Easy Path Wavelet Transform" (EPWT), which was originally introduced for image compression. It is based on a one-dimensional wavelet transform along (adaptively) constructed paths through the sample points. Suitable adaptive path constructions are essential for the success of the method. The dissertation also contains a brief discussion of the theoretical properties of wavelets along paths as well as numerical results, and it concludes with possible modifications of the denoising method.
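The denoising idea described above applies a one-dimensional wavelet transform along paths through scattered sample points. The following sketch is a heavily simplified relative of that approach, assuming a single greedy nearest-neighbor path, a Haar wavelet and a universal soft threshold; the actual EPWT-based method uses adaptive, multilevel path constructions.

```python
# A much-simplified relative of path-based wavelet shrinkage (assumption: a
# single greedy nearest-neighbor path and a Haar transform stand in for the
# adaptive, multilevel path constructions of the EPWT-style method).
import numpy as np
import pywt

rng = np.random.default_rng(0)
X = rng.uniform(size=(256, 5))                     # scattered sample points in R^5
f = np.where(X[:, 0] + X[:, 1] > 1.0, 2.0, -1.0)   # piecewise constant function
y = f + 0.3 * rng.normal(size=len(f))              # noisy samples

def greedy_path(points):
    """Order the points by repeatedly jumping to the nearest unvisited point."""
    remaining = set(range(len(points)))
    path = [remaining.pop()]
    while remaining:
        last = points[path[-1]]
        nxt = min(remaining, key=lambda j: np.sum((points[j] - last) ** 2))
        remaining.remove(nxt)
        path.append(nxt)
    return np.array(path)

path = greedy_path(X)
coeffs = pywt.wavedec(y[path], "haar", level=4)    # 1-D transform along the path
# Soft-threshold the detail coefficients (universal-style threshold).
sigma = np.median(np.abs(coeffs[-1])) / 0.6745
thr = sigma * np.sqrt(2 * np.log(len(y)))
coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]

denoised = np.empty_like(y)
denoised[path] = pywt.waverec(coeffs, "haar")[: len(y)]
print("noisy MSE:", np.mean((y - f) ** 2), "denoised MSE:", np.mean((denoised - f) ** 2))
```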
|
84 |
Mapas auto-organizáveis com topologia variante no tempo para categorização em subespaços em dados de alta dimensionalidade e vistas múltiplas / Self-organizing maps with time-varying topology for subspace clustering of high-dimensional, multi-view data. ANTONINO, Victor Oliveira, 16 August 2016.
Unsupervised machine learning methods and algorithms have been employed on many significant problems. An explosion in the availability of data from multiple sources and modalities has accompanied advances in acquiring, compressing, storing, transferring, and processing large amounts of complex high-dimensional data, such as digital images, surveillance videos, and DNA microarrays. Clustering becomes difficult due to the increasing sparsity of such data, as well as the increasing difficulty of discriminating distances between data points. This work presents a soft subspace clustering algorithm based on a self-organizing map (SOM) with a time-variant structure, meaning that the data can be clustered without any prior knowledge such as the number of categories or the topology of the input patterns; both are determined during training. The model also assigns a different weight to each dimension, so that every dimension contributes to uncovering the clusters. To validate the model, several real-world data sets were used, covering a diverse range of contexts such as data mining, gene expression, multi-view clustering, and computer vision problems. The results are promising and show that the method can handle real-world data characterized by high dimensionality.
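To make the idea of dimension-weighted (soft subspace) clustering with a SOM concrete, the toy sketch below uses a relevance-weighted distance in the best-matching-unit search. The update rules, the fixed number of nodes and the relevance heuristic are illustrative assumptions; the thesis model grows its topology over time and determines the number of categories during training.

```python
# Illustrative sketch (an assumption, not the thesis algorithm): a tiny
# self-organizing map whose distance uses learned per-dimension relevance
# weights, so dimensions that vary little around a node count more.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(3, 1, (100, 10))])
X[:, 5:] = rng.normal(0, 1, (200, 5))              # last 5 dimensions are noise

n_nodes, n_dim, epochs, lr = 4, X.shape[1], 20, 0.1
proto = X[rng.choice(len(X), n_nodes, replace=False)].copy()
rel = np.full((n_nodes, n_dim), 1.0 / n_dim)       # per-node dimension weights

for _ in range(epochs):
    for x in X[rng.permutation(len(X))]:
        d = np.sum(rel * (proto - x) ** 2, axis=1)  # relevance-weighted distance
        b = int(np.argmin(d))                       # best-matching node
        proto[b] += lr * (x - proto[b])             # move the winner toward x
        # Shrink relevance of dimensions with large local error, then renormalize.
        err = (proto[b] - x) ** 2
        rel[b] = np.exp(-err / (err.mean() + 1e-12))
        rel[b] /= rel[b].sum()

assign = np.argmin([np.sum(rel * (proto - x) ** 2, axis=1) for x in X], axis=1)
print("cluster sizes:", np.bincount(assign, minlength=n_nodes))
```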
|
85 |
Multivariate analysis of high-throughput sequencing data / Analyses multivariées de données de séquençage à haut débit. Durif, Ghislain, 13 December 2016.
The statistical analysis of Next-Generation Sequencing data raises many computational challenges regarding modeling and inference, especially because of the high dimensionality of genomic data. The research work in this manuscript concerns hybrid dimension reduction methods that rely on both compression (representation of the data in a lower-dimensional space) and variable selection. Developments are made concerning the sparse Partial Least Squares (PLS) regression framework for supervised classification and the sparse matrix factorization framework for unsupervised exploration. In both situations, our main purpose is the reconstruction and visualization of the data. First, we present a new sparse PLS approach, based on an adaptive sparsity-inducing penalty, that is suitable for logistic regression and predicts the label of a discrete outcome. Such a method can be used, for instance, to predict the fate of patients or the type of unidentified single cells from gene expression profiles. The main issue in this framework is to account for the response when discarding irrelevant variables. We highlight the direct link between the derivation of the algorithms and the reliability of the results.
Then, motivated by questions arising in single-cell data analysis, we propose a flexible model-based approach for the factorization of count matrices that accounts for over-dispersion as well as zero-inflation (both characteristic of single-cell data), and we derive an estimation procedure based on variational inference. In this scheme, we consider probabilistic variable selection based on a spike-and-slab model suitable for count data. The interest of our procedure for data reconstruction, visualization and clustering is illustrated by simulation experiments and by preliminary results on single-cell data analysis. All proposed methods are implemented in two R packages, "plsgenomics" and "CMF", based on high-performance computing.
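The supervised part of the work builds on sparse Partial Least Squares. As a rough illustration of the underlying mechanism, the sketch below extracts PLS-like components whose weight vectors are soft-thresholded so that irrelevant variables receive exactly zero weight; the fixed threshold used here is an assumption and differs from the adaptive penalty and the logistic-regression setting developed in the thesis.

```python
# A minimal sketch of one sparse-PLS-style component (an illustration of the
# general idea: soft-thresholding the PLS weight vector so irrelevant "genes"
# get exactly zero weight; not the adaptive penalty developed in the thesis).
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_pls_components(X, y, n_comp=2, lam=0.5):
    X = X - X.mean(axis=0)
    y = y - y.mean()
    weights, scores = [], []
    for _ in range(n_comp):
        c = X.T @ y
        w = soft_threshold(c, lam * np.max(np.abs(c)))
        if not np.any(w):
            break
        w /= np.linalg.norm(w)
        t = X @ w                                   # latent component score
        X = X - np.outer(t, (X.T @ t) / (t @ t))    # deflate X
        y = y - t * ((t @ y) / (t @ t))             # deflate y
        weights.append(w)
        scores.append(t)
    return np.array(weights), np.array(scores).T

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 200))                      # 60 samples, 200 variables
y = X[:, :5] @ np.ones(5) + 0.5 * rng.normal(size=60)
W, T = sparse_pls_components(X, y, n_comp=2, lam=0.5)
print("non-zero weights per component:", (W != 0).sum(axis=1))
```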
|
86 |
Visual Analysis of High-Dimensional Point Clouds using Topological Abstraction. Oesterling, Patrick, 14 April 2016.
This thesis is about visualizing a kind of data that is trivial for computers to process but difficult for humans to imagine, because nature does not equip us with intuition for this type of information: high-dimensional data. Such data often result from representing observations of objects under various aspects or with different properties. In many applications, a typical, laborious task is to find related objects or to group those that are similar to each other. One classic solution is to imagine the data as vectors in a Euclidean space with the object variables as dimensions. Using Euclidean distance as a measure of similarity, objects with similar properties and values accumulate into groups, so-called clusters, which are exposed by cluster analysis on the high-dimensional point cloud. Because similar vectors can be thought of as objects that are alike in terms of their attributes, the point cloud's structure and individual cluster properties, like their size or compactness, summarize data categories and their relative importance.

The contribution of this thesis is a novel analysis approach for visual exploration of high-dimensional point clouds that does not suffer from structural occlusion. The work is based on two key concepts. The first idea is to discard those geometric properties that cannot be preserved and thus lead to the typical artifacts; topological concepts are used instead to shift the focus from a point-centered view of the data to a more structure-centered perspective. The advantage is that topology-driven clustering information can be extracted in the data's original domain and preserved without loss in low dimensions. The second idea is to split the analysis into a topology-based global overview and a subsequent geometric local refinement. The occlusion-free overview enables the analyst to identify features and to link them to other visualizations that permit analysis of properties not captured by the topological abstraction, e.g. cluster shape or value distributions in particular dimensions or subspaces. The advantage of separating structure from data-point analysis is that restricting local analysis to data subsets significantly reduces artifacts and the visual complexity of standard techniques. That is, the additional topological layer enables the analyst to identify structure that was hidden before and to focus on particular features by suppressing irrelevant points during local feature analysis.

This thesis addresses the topology-based visual analysis of high-dimensional point clouds for both the time-invariant and the time-varying case. Time-invariant means that the points do not change in number or position, i.e., the analyst explores the clustering of a fixed and constant set of points. The extension to the time-varying case implies the analysis of a varying clustering, where clusters appear, merge, split, or vanish. Especially for high-dimensional data, both tracking, which means relating features over time, and visualizing the changing structure are difficult problems to solve.
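As a rough, hedged illustration of what a topology-driven clustering of a point cloud can look like in the data's original domain, the sketch below merges density-based clusters whose persistence falls below a threshold. It is a generic construction for intuition only and is not the topological abstraction developed in the thesis.

```python
# Rough illustration (an assumption, not the thesis method): topology-flavored
# clustering of a high-dimensional point cloud. Points are processed from high
# to low density estimate; each point joins its densest neighbor's cluster, and
# clusters merge only if the shallower one has low persistence.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def topo_clusters(X, k=10, persistence=0.5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)   # crude density estimate
    order = np.argsort(-density)                          # high density first
    rank = np.empty(len(X), dtype=int)
    rank[order] = np.arange(len(X))

    parent = np.arange(len(X))                            # union-find forest
    birth = density.copy()                                # cluster birth level

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:
        higher = [j for j in idx[i, 1:] if rank[j] < rank[i]]
        if not higher:
            continue                                      # i starts a new cluster
        roots = {find(j) for j in higher}
        target = max(roots, key=lambda r: birth[r])
        parent[i] = target
        for r in roots - {target}:                        # merge shallow clusters
            if birth[r] - density[i] < persistence:
                parent[r] = target
    return np.array([find(i) for i in range(len(X))])

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (150, 20)), rng.normal(2, 0.3, (150, 20))])
labels = topo_clusters(X, k=10, persistence=0.5)
print("clusters found:", len(set(labels)))
```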
|
87 |
Advances on Dimension Reduction for Multivariate Linear Regression. Guo, Wenxing, January 2020.
Multivariate linear regression methods are widely used statistical tools in data analysis, developed for settings in which several response variables are studied simultaneously and the aim is to study the relationship between the predictor and response variables through the regression coefficient matrix. Rapid improvements in information technology have brought us large amounts of large-scale data, but also great challenges in data processing. When dealing with high-dimensional data, classical least squares estimation is not applicable in multivariate linear regression analysis. In recent years, several approaches have been developed to deal with high-dimensional data problems, among which dimension reduction is one of the main ones. In the literature, random projection methods have been used to reduce dimension in large data sets. In Chapter 2, a new random projection method, with low-rank matrix approximation, is proposed to reduce the dimension of the parameter space in the high-dimensional multivariate linear regression model. Some statistical properties of the proposed method are studied, and explicit expressions are then derived for the accuracy loss of the method with Gaussian random projection and orthogonal random projection. These expressions are precise rather than being bounds up to constants.
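As context for the random projection idea, the following is a generic sketch (not the low-rank estimator proposed in Chapter 2): project the predictors with a Gaussian random matrix, solve least squares in the reduced space, and map the coefficients back.

```python
# A hedged sketch of the general idea (not the estimator studied in the
# thesis): project a high-dimensional predictor matrix with a Gaussian random
# matrix, fit multivariate least squares in the reduced space, and map the
# coefficients back to the original parameter space.
import numpy as np

rng = np.random.default_rng(4)
n, p, q, k = 80, 500, 3, 40                  # samples, predictors, responses, reduced dim
X = rng.normal(size=(n, p))
B_true = np.zeros((p, q))
B_true[:10] = rng.normal(size=(10, q))       # only 10 predictors matter
Y = X @ B_true + 0.1 * rng.normal(size=(n, q))

R = rng.normal(size=(p, k)) / np.sqrt(k)     # Gaussian random projection
Z = X @ R                                    # n x k reduced design
B_reduced, *_ = np.linalg.lstsq(Z, Y, rcond=None)
B_hat = R @ B_reduced                        # back to the original p x q space

print("prediction MSE:", np.mean((X @ B_hat - X @ B_true) ** 2))
```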
In multivariate regression analysis, reduced rank regression is another dimension reduction method, which has become an important tool for achieving dimension reduction goals due to its simplicity, computational efficiency and good predictive performance. In practice, however, the performance of the reduced rank estimator is not satisfactory when the predictor variables are highly correlated or when the signal-to-noise ratio is small. To overcome this problem, in Chapter 3 we incorporate matrix projections into the reduced rank regression method and develop reduced rank regression estimators based on random projection and orthogonal projection in high-dimensional multivariate linear regression models. We also propose a consistent estimator of the rank of the coefficient matrix and derive prediction performance bounds for the proposed estimators based on mean squared errors.
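For reference, the classical reduced rank regression estimator mentioned above can be written in a few lines: take the ordinary least squares fit and project its fitted values onto their leading principal directions. This is the textbook construction, not the projection-based estimators developed in Chapter 3.

```python
# A minimal sketch of classical reduced rank regression: project the least
# squares fitted values onto their top-r right singular directions.
import numpy as np

def reduced_rank_regression(X, Y, r):
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)    # ordinary least squares
    fitted = X @ B_ols
    # Right singular vectors of the fitted values give the rank-r projection.
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    P = Vt[:r].T @ Vt[:r]                            # q x q projection matrix
    return B_ols @ P                                 # rank-r coefficient matrix

rng = np.random.default_rng(5)
n, p, q, r = 200, 10, 6, 2
A = rng.normal(size=(p, r))
C = rng.normal(size=(r, q))
X = rng.normal(size=(n, p))
Y = X @ (A @ C) + 0.1 * rng.normal(size=(n, q))      # true coefficients have rank r
B_rrr = reduced_rank_regression(X, Y, r)
print("rank of estimate:", np.linalg.matrix_rank(B_rrr))
```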
Envelope methodology is another approach that has become popular in recent years for reducing estimative and predictive variation in multivariate regression; it comprises a class of methods that improve efficiency without changing the traditional objectives. Variable selection is the process of selecting a subset of relevant feature variables for use in model construction. Its purpose is to avoid the curse of dimensionality, simplify models to make them easier to interpret, shorten training time and reduce overfitting. In Chapter 4, we combine envelope models with a group variable selection method to propose an envelope-based sparse reduced rank regression estimator in high-dimensional multivariate linear regression models, and then establish its consistency, asymptotic normality and oracle property.
Tensor data are in frequent use today in a variety of fields in science and engineering. Processing tensor data is a practical but challenging problem. Recently, the prevalence of tensor data has led to several tensor versions of envelope models. In Chapter 5, we incorporate the envelope technique into tensor regression analysis and propose a partial tensor envelope model, which leads to a parsimonious version of tensor response regression when some predictors are of special interest; consistency and asymptotic normality of the coefficient estimators are then proved. The proposed method achieves significant gains in efficiency compared to the standard tensor response regression model in terms of the estimation of the coefficients for the selected predictors.
Finally, in Chapter 6, we summarize the work carried out in the thesis, and then suggest some problems of further research interest. / Dissertation / Doctor of Philosophy (PhD)
|
88 |
Adaptive risk management. Chen, Ying, 13 February 2007.
In recent years, the study of risk management has been prompted by the Basel Committee's requirements for regular banking supervision. Many multivariate risk management methods, however, have limitations: 1) covariance estimation relies on a time-invariant form, 2) the models are based on unrealistic distributional assumptions, and 3) numerical problems appear when they are applied to high-dimensional portfolios. The primary aim of this dissertation is to propose adaptive methods that overcome these limitations and can measure the risk exposure of multivariate portfolios accurately and quickly. The basic idea is to first extract stochastically independent components (ICs) from the high-dimensional time series and then identify the distributional behavior of each resulting IC in univariate space. More specifically, two local parametric approaches, the local moving window average (MWA) method and the local exponential smoothing (ES) method, are used to estimate the volatility process of every IC under a heavy-tailed distributional assumption, namely that the ICs are generalized hyperbolic (GH) distributed. Doing so speeds up the computation of risk measures and achieves much better accuracy than many popular risk management methods.
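A hedged sketch of the overall pipeline follows: extract independent components, estimate each component's local volatility over a moving window, and aggregate into a portfolio risk measure. FastICA, the plain moving-window variance and the Gaussian Value-at-Risk quantile are stand-ins chosen for brevity; the dissertation's estimators are based on local parametric fits under generalized hyperbolic distributions.

```python
# Hedged sketch of the pipeline (independent components + local volatility +
# a portfolio risk measure). FastICA, a plain moving-window variance and a
# Gaussian quantile replace the thesis's GH-based local estimators; they are
# assumptions for illustration only.
import numpy as np
from scipy.stats import norm
from sklearn.decomposition import FastICA

rng = np.random.default_rng(6)
T, d = 1000, 5
returns = rng.standard_t(df=5, size=(T, d)) @ rng.normal(size=(d, d)) * 0.01

ica = FastICA(n_components=d, random_state=0)
ics = ica.fit_transform(returns)                 # T x d independent components
A = ica.mixing_                                  # d x d mixing matrix

window = 50
sigma2 = np.array([ics[-window:, j].var() for j in range(d)])  # local IC variances

weights = np.full(d, 1.0 / d)                    # equally weighted portfolio
b = weights @ A                                  # portfolio exposure to each IC
port_var = np.sum(b ** 2 * sigma2)               # ICs treated as uncorrelated
var_95 = -norm.ppf(0.05) * np.sqrt(port_var)     # one-day 95% Value-at-Risk
print(f"95% VaR: {var_95:.4%}")
```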
|
89 |
Understanding High-Dimensional Data Using Reeb Graphs. Harvey, William John, 14 August 2012.
No description available.
|
90 |
Inference for Generalized Multivariate Analysis of Variance (GMANOVA) Models and High-dimensional Extensions. Jana, Sayantee, 11 1900.
A Growth Curve Model (GCM) is a multivariate linear model used for analyzing longitudinal data with short to moderate time series. It is a special case of the Generalized Multivariate Analysis of Variance (GMANOVA) model. Analysis using the GCM involves comparing mean growth among different groups. The classical GCM, however, has some limitations: it relies on distributional assumptions, it assumes the same degree of polynomial for all groups, and it requires a sample size larger than the number of time points. In this thesis, we relax some of the assumptions of the traditional GCM and develop appropriate inferential tools for its analysis, with the aim of reducing bias, improving precision and gaining power, as well as overcoming the limitations of high dimensionality.
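For orientation, the classical Growth Curve Model estimator that the thesis builds on can be computed directly; the sketch below simulates a small balanced design and applies the standard maximum likelihood formula. The dimensions and simulated data are illustrative assumptions.

```python
# A brief numerical sketch of the classical Growth Curve Model (Potthoff-Roy)
# estimator that the thesis generalizes: Y = A B C + E, with A the
# within-individual (time polynomial) design and C the between-group design.
import numpy as np

rng = np.random.default_rng(7)
p, q, k, n = 4, 2, 2, 40                       # time points, poly terms, groups, subjects
times = np.linspace(0, 1, p)
A = np.vander(times, q, increasing=True)       # p x q within-individual design
C = np.zeros((k, n))
C[0, : n // 2] = 1                             # group 1 indicators
C[1, n // 2 :] = 1                             # group 2 indicators
B_true = np.array([[1.0, 2.0], [0.5, -1.0]])   # q x k growth parameters
Y = A @ B_true @ C + 0.3 * rng.normal(size=(p, n))

# Classical maximum likelihood estimator of B (Khatri, 1966):
#   S = Y (I - C'(CC')^{-1} C) Y',  B_hat = (A'S^{-1}A)^{-1} A'S^{-1} Y C'(CC')^{-1}
P_C = C.T @ np.linalg.inv(C @ C.T) @ C
S = Y @ (np.eye(n) - P_C) @ Y.T
S_inv = np.linalg.inv(S)
B_hat = np.linalg.inv(A.T @ S_inv @ A) @ A.T @ S_inv @ Y @ C.T @ np.linalg.inv(C @ C.T)
print("estimated growth parameters:\n", np.round(B_hat, 2))
```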
Existing methods for estimating the parameters of the GCM assume that the underlying distribution of the error terms is multivariate normal. In practical problems, however, we often come across skewed data, and hence estimation techniques developed under the normality assumption may not be optimal. Simulation studies conducted in this thesis show, in fact, that existing methods are sensitive to the presence of skewness in the data: the estimators suffer increased bias and mean squared error (MSE) when the normality assumption is violated. Methods appropriate for skewed distributions are therefore required. In this thesis, we relax the distributional assumption of the GCM and provide estimators for the mean and covariance matrices of the GCM under the multivariate skew normal (MSN) distribution. An estimator for the additional skewness parameter of the MSN distribution is also provided. The estimators are derived using the expectation maximization (EM) algorithm, and extensive simulations are performed to examine their performance. Comparisons with existing estimators show that our estimators perform better when the underlying distribution is multivariate skew normal. An illustration using a real data set is also provided, wherein triglyceride levels from the Framingham Heart Study are modelled over time.
The GCM assumes an equal degree of polynomial for each group. Therefore, when group means follow polynomials of different shapes, the GCM fails to accommodate this difference in one model. We consider an extension of the GCM, wherein mean responses from different groups can have different shapes, represented by polynomials of different degree. Such a model is referred to as the Extended Growth Curve Model (EGCM). We extend our work on the GCM to the EGCM and develop estimators for the mean and covariance matrices under MSN errors. We adopted the Restricted Expectation Maximization (REM) algorithm, which is based on the multivariate Newton-Raphson (NR) method and Lagrangian optimization. However, the multivariate NR method, and hence the existing REM algorithm, applies to vector parameters, whereas the parameters of interest in this study are matrices. We therefore extended the NR approach to matrix parameters, which in turn allowed us to extend the REM algorithm to matrix parameters. The performance of the proposed estimators was examined using extensive simulations, and a motivating real data example is provided to illustrate the application of the proposed estimators.
Finally, this thesis deals with high-dimensional applications of the GCM. Existing methods for the GCM are developed under the assumption of 'small p, large n' (n >> p) and are not appropriate for analyzing high-dimensional longitudinal data, due to singularity of the sample covariance matrix. In previous work, we used the Moore-Penrose generalized inverse to overcome this challenge. However, that method has some limitations near singularity, when p ~ n. In this thesis, a Bayesian framework is used to derive a test of the linear hypothesis on the mean parameter of the GCM that is applicable in high-dimensional situations. Extensive simulations are performed to investigate the performance of the test statistic and establish its optimality characteristics. Results show that this test performs well under different conditions, including the near-singularity zone. Sensitivity of the test to mis-specification of the parameters of the prior distribution is also examined empirically. A numerical example is provided to illustrate the usefulness of the proposed method in practical situations. / Thesis / Doctor of Philosophy (PhD)
|