121 |
Goodness-of-Fit and Change-Point Tests for Functional Data
Gabrys, Robertas 01 May 2010
A test for independence and identical distribution of functional observations is proposed in this thesis. To reduce dimension, curves are projected on the most important functional principal components. Then a test statistic based on lagged cross-covariances of the resulting vectors is constructed. We show that this dimension reduction step introduces asymptotically negligible terms, i.e. the projections behave asymptotically as iid vector-valued observations. A complete asymptotic theory based on correlations of random matrices, functional principal component expansions, and Hilbert space techniques is developed. The test statistic has chi-square asymptotic null distribution.
Two inferential tests for error correlation in the functional linear model are put forward. To construct them, finite dimensional residuals are computed in two different ways, and then their autocorrelations are suitably defined. From these autocorrelation matrices, two quadratic forms are constructed whose limiting distributions are chi-squared with known numbers of degrees of freedom (different for the two forms).
A test for detecting a change point in the mean of functional observations is developed. The null distribution of the test statistic is asymptotically pivotal with a well-known asymptotic distribution. A comprehensive asymptotic theory for the estimation of a change-point in the mean function of functional observations is developed.
The procedures developed in this thesis can be readily computed using the R package fda. All theoretical insights obtained in this thesis are confirmed by simulations and illustrated by real-life data examples.
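The change-point idea in the abstract above can be illustrated with a generic CUSUM scan on a scalar series, for example the scores of the curves on one principal component. This is only a minimal sketch under that assumption: the function name `cusum_change_point` is hypothetical, the statistic is the textbook CUSUM maximum rather than the thesis's own test statistic, and the variance is estimated naively from the full sample.

```python
import math

def cusum_change_point(x):
    """Generic CUSUM scan for one change in the mean of a scalar series.

    Returns (k_hat, stat): k_hat is the number of observations before the
    estimated change, stat is the maximal absolute centered partial sum
    scaled by sqrt(n * variance). Illustrative only.
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0.0:
        return 0, 0.0  # constant series: no evidence of a change
    partial = 0.0
    best_k, best_val = 0, -1.0
    for k in range(1, n):
        partial += x[k - 1] - mean  # centered partial sum up to index k
        if abs(partial) > best_val:
            best_k, best_val = k, abs(partial)
    return best_k, best_val / math.sqrt(n * var)

# Ten observations at level 0 followed by ten at level 1.
k_hat, stat = cusum_change_point([0.0] * 10 + [1.0] * 10)
print(k_hat)  # 10
```

A large value of `stat` points to a change in the mean; a critical value would have to come from the appropriate limiting distribution, which this sketch does not provide.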
|
122 |
Sheaf Theory as a Foundation for Heterogeneous Data Fusion
Mansourbeigi, Seyed M-H 01 December 2018
A major impediment to scientific progress in many fields is the inability to make sense of the huge amounts of data that have been collected via experiment or computer simulation. This dissertation provides tools to visualize, represent, and analyze the collection of sensors and data all at once in a single combinatorial geometric object. Encoding and translating heterogeneous data into a common language are modeled by supporting objects. In this methodology, the behavior of the system, based on the detection of noise in the system, possible failure in data exchange, and recognition of redundant or complementary sensors, is studied via some related geometric objects. Applications of the constructed methodology are described by two case studies: one from wildfire threat monitoring and the other from air traffic monitoring. Both cases are distributed (spatial and temporal) information systems. The systems deal with temporal and spatial fusion of heterogeneous data obtained from multiple sources, where the schema, availability and quality vary. The behavior of both systems is explained thoroughly in terms of the detection of failure in the systems and the recognition of redundant and complementary sensors. A comparison between the methodology in this dissertation and the alternative methods is described to further verify the validity of the sheaf theory method. It is seen that the method has lower computational complexity in both space and time.
|
123 |
Topological Data Analysis of Properties of Four-Regular Rigid Vertex Graphs
Conine, Grant Mcneil 24 June 2014
Homologous DNA recombination and rearrangement have been modeled with a class of four-regular rigid vertex graphs called assembly graphs, which can also be represented by double occurrence words. Various invariants have been suggested for these graphs, some based on the structure of the graphs, and some biologically motivated.
In this thesis we use a novel method of data analysis based on a technique known as partial-clustering analysis and an algorithm known as Mapper to examine the relationships between these invariants. We introduce some of the basic machinery of topological data analysis, including the construction of simplicial complexes on a data set, clustering analysis, and the workings of the Mapper algorithm. We define assembly graphs and three specific invariants of these graphs: assembly number, nesting index, and genus range. We apply Mapper to the set of all assembly graphs up to 6 vertices and compare relationships between these three properties. We make several observations based upon the results of the analysis we obtained. We conclude with some suggestions for further research based upon our findings.
|
124 |
Techniques to handle missing values in a factor analysis
Turville, Christopher, University of Western Sydney, Faculty of Informatics, Science and Technology January 2000
A factor analysis typically involves a large collection of data, and it is common for some of the data to be unrecorded. This study investigates the ability of several techniques to handle missing values in a factor analysis, including complete cases only, all available cases, imputing means, an iterative component method, singular value decomposition and the EM algorithm. A data set that is representative of that used for a factor analysis is simulated. Some of these data are then randomly removed to represent missing values, and the performance of the techniques is investigated over a wide range of conditions. Several criteria are used to investigate the abilities of the techniques to handle missing values in a factor analysis. Overall, there is no one technique that performs best for all of the conditions studied. The EM algorithm is generally the most effective technique except when there are ill-conditioned matrices present or when computing time is of concern. Some theoretical concerns are introduced regarding the effects that changes in the correlation matrix will have on the loadings of a factor analysis. A complicated expression is derived that shows that the change in factor loadings as a result of change in the elements of a correlation matrix involves components of eigenvectors and eigenvalues. / Doctor of Philosophy (PhD)
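Two of the simpler techniques named in the abstract above, complete cases only and imputing means, can be sketched in a few lines. This is a hedged illustration rather than the study's implementation: the function names are hypothetical, missing values are assumed to be encoded as `None`, and every column is assumed to have at least one observed value.

```python
def complete_cases(rows):
    """Keep only the rows that have no missing (None) entries."""
    return [r for r in rows if all(v is not None for v in r)]

def impute_means(rows):
    """Replace each missing entry with the mean of the observed
    values in the same column (assumes no column is fully missing)."""
    ncol = len(rows[0])
    means = []
    for j in range(ncol):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]

data = [[1.0, 2.0], [None, 4.0], [3.0, None]]
print(complete_cases(data))  # [[1.0, 2.0]]
print(impute_means(data))    # [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]]
```

The example shows the trade-off: complete-case analysis discards two of the three rows, while mean imputation keeps every row but shrinks the variability of the imputed columns, which in turn affects any correlation matrix computed from them.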
|
125 |
Multi-angular hyperspectral data and its influences on soil and plant property measurements: spectral mapping and functional data analysis approach
Sugianto, ., Biological, Earth & Environmental Science, UNSW January 2006
This research investigates the spectral reflectance characteristics of soil and vegetation using multi-angular and single view hyperspectral data. The question of the thesis is "How much information can be obtained from multi-angular hyperspectral remote sensing in comparison with single view angle hyperspectral remote sensing of soil and vegetation?" This question is addressed by analysing multi-angular and single view angle hyperspectral remote sensing using data from field, airborne and spaceborne hyperspectral sensors. Spectral mapping, spectral indices and Functional Data Analysis (FDA) are used to analyse the data. Spectral mapping has been successfully used to distinguish features of soil and cotton with hyperspectral data. Traditionally, spectral mapping is based on collecting endmembers of pure pixels and using these as training areas for supervised classification. There are, however, limitations in the use of these algorithms when applied to multi-angular images, as the reflectance of a single ground unit will differ at each angle. Classifications using six-class endmembers identified using single angle imagery were assessed using multi-angular Compact High Resolution Imaging Spectrometer (CHRIS) imagery, as well as a set of vegetation indices. The results showed no significant difference between the angles. Low nutrient content in the soil produced lower vegetation index values, and more nutrients increased the index values. This research introduces FDA as an image processing tool for multi-angular hyperspectral imagery of soil and cotton, using basis functions for functional principal component analysis (fPCA) and functional linear modelling. FDA has advantages over conventional statistical analysis because it does not assume the errors in the data are independent and uncorrelated.
Investigations showed that B-splines with 20 basis functions were the best fit for multi-angular soil spectra collected using the spectroradiometer and the satellite-mounted CHRIS. Cotton spectra collected from greenhouse plants using a spectroradiometer needed 30 basis functions to fit the model, while 20 basis functions were sufficient for cotton spectra extracted from CHRIS. Functional principal component analysis (fPCA) of multi-angular soil spectra shows that the first functional principal component explained a minimum of 92.5% of the variance of field soil spectra for different azimuth and zenith angles and 93.2% from CHRIS for the same target. For cotton, the first functional principal component explained more than 93.6% of the variance in the greenhouse trial and 70.6% in the CHRIS data. Conventional analysis of multi-angular hyperspectral data showed that significant differences exist between soil spectra acquired at different azimuth and zenith angles. The forward scan direction of the zenith angle provides higher spectral reflectance than the backward direction. However, most multi-angular hyperspectral data analysed as functional data show no significant difference from nadir, except for small parts of the wavelength range of cotton spectra using CHRIS. There is also no significant difference for soil spectra analysed as functional data collected from the field, although there was some difference for soil spectra extracted from CHRIS. Overall, the results indicate that multi-angular hyperspectral data provides only a very small amount of additional information when used for conventional analyses.
|
126 |
Physique statistique des réseaux de neurones et de l'optimisation combinatoire
Krauth, Werner 14 June 1989
In the first part we study learning and recall in single-layer neural networks (Hopfield model). We propose a learning algorithm that is able to optimize the 'stability', a parameter that describes the quality of the representation of a pattern in the network. For random patterns, this algorithm reaches the theoretical Gardner bound. We then study the dynamical importance of the stability and of a parameter concerning the symmetry of the coupling matrix. Next, we treat the case where the couplings can take only two values (inhibitory, excitatory). For this model we establish upper bounds on the capacity by a numerical calculation, and we propose an analytical solution. The second part of the thesis is devoted to a detailed study, from the point of view of statistical physics, of the traveling salesman problem. We study the special case of a random matrix of connections. We present the theory of this problem (following the replica method) and compare it with the results of an extensive numerical study.
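To make the notion of 'stability' in the abstract above concrete, the sketch below computes one common normalized definition, the aligned local field divided by the norm of the incoming couplings, for a small network with Hebbian couplings. Both the definition and the Hebb rule are assumptions chosen for illustration; the thesis's own learning algorithm is not reproduced here, and the function names are hypothetical.

```python
import math

def hebbian_couplings(patterns):
    """Hebb-rule couplings J[i][j] = (1/n) * sum over patterns of
    xi[i] * xi[j], with zero self-couplings (illustrative choice)."""
    n = len(patterns[0])
    J = [[0.0] * n for _ in range(n)]
    for xi in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    J[i][j] += xi[i] * xi[j] / n
    return J

def stabilities(J, xi):
    """Per-neuron stability of the +/-1 pattern xi: the local field
    aligned with xi[i], divided by the Euclidean norm of row i of J."""
    out = []
    for i, row in enumerate(J):
        field = sum(row[j] * xi[j] for j in range(len(xi)) if j != i)
        norm = math.sqrt(sum(w * w for w in row))
        out.append(xi[i] * field / norm)
    return out

pattern = [1, -1, 1, -1]
J = hebbian_couplings([pattern])
print(min(stabilities(J, pattern)) > 0)  # True
```

Under this definition, a pattern whose stabilities are all positive is a fixed point of the network dynamics, and a larger minimum stability corresponds to a more robustly stored pattern.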
|
127 |
Mining for Lung Cancer Biomarkers in Plasma Metabolomics Data / Sökande efter Biomarkörer för Lungcancer genom Analys av Metabolitdata
Johnsson, Anna January 2010
Lung cancer is the form of cancer with the highest mortality worldwide, and in addition the survival rate for lung cancer is very low. Only 15% of patients are alive five years after diagnosis. More research is needed to understand the biology of lung cancer and thus make it possible to discover the disease at an early stage. Early diagnosis leads to an increased chance of survival. In this thesis, 179 lung cancer samples and 116 control samples of blood serum were analyzed for identification of metabolomic biomarkers. The control samples were derived from patients with benign lung diseases.
Data were obtained from GC/TOF-MS analysis and analyzed with the help of the multivariate analysis methods PCA and OPLS/OPLS-DA. This thesis investigates how best to pre-treat and analyze the data in order to discover biomarkers. One part of the aim was to give directions for how to select samples from a biobank for further biological validation of suspected biomarkers. Models for different stages of lung cancer versus control samples were computed and validated. The most influential metabolites in the models were selected, and confounding with other clinical characteristics such as gender and hemoglobin levels was studied. Thirteen lung cancer biomarkers were identified and validated by raw data and by new OPLS models based solely upon the biomarkers.
In summary, the identified biomarkers separate control samples from late lung cancer fairly well, but separate early lung cancer from control samples poorly. The recommendation is to select controls and late lung cancer samples from the biobank for further confirmation of the biomarkers.
|
128 |
Exploring factors affecting math achievement using large scale assessment results in Saskatchewan
Lai, Hollis 16 September 2008
Current research suggests that a high level of confidence and a low level of anxiety are predictive of higher math achievement. Previous research has found that Saskatchewan students have a higher level of confidence and a lower level of anxiety for learning math than students from other provinces, but still tend to achieve lower math scores. The data suggest that there may be unique factors affecting math learning for students in Saskatchewan. The purpose of the study is to determine the factors that may affect Saskatchewan students' math achievement. Exploratory factor analyses and regression methods were employed to investigate possible traits that aid students in achieving higher math scores. Results from a 2007 math assessment administered to grade 5 students in Saskatchewan were used for the current study. The goal of the study was to provide a better understanding of the factors and trends unique to students' mathematics achievement in Saskatchewan.
Using results from a province-wide math assessment and an accompanying questionnaire administered to students in grade five across public schools in Saskatchewan (n=11,279), the present study found statistical significance in three factors that previous studies have reported to influence differences in math achievement, specifically (1) confidence in math, (2) parental involvement in math and (3) extracurricular participation in math. These three factors were found to be related to math achievement as predicted by the Assessment for Learning (AFL) program in Saskatchewan, although there were reservations about the findings because the regression model accounted for little of the variance (R² = .084). Furthermore, a multivariate analysis of variance indicated that gender and school location have effects on students' math achievement scores. Although a high amount of measurement error in the questionnaire (and subsequently a low variance accounted for by the regression model) limited the scope and implications of the model, future implications and improvements are discussed.
|
129 |
Tools and theory to improve data analysis
Grolemund, Garrett 24 July 2013
This thesis proposes a scientific model to explain the data analysis process. I argue that data analysis is primarily a procedure to build understanding and as such, it dovetails with the cognitive processes of the human mind. Data analysis tasks closely resemble the cognitive process known as sensemaking. I demonstrate how data analysis is a sensemaking task adapted to use quantitative data. This identification highlights a universal structure within data analysis activities and provides a foundation for a theory of data analysis. The model identifies two competing challenges within data analysis: the need to make sense of information that we cannot know and the need to make sense of information that we cannot attend to. Classical statistics provides solutions to the first challenge, but has little to say about the second. However, managing attention is the primary obstacle when analyzing big data. I introduce three tools for managing attention during data analysis. Each tool is built upon a different method for managing attention. ggsubplot creates embedded plots, which transform data into a format that can be easily processed by the human mind. lubridate helps users automate sensemaking outside of the mind by improving the way computers handle date-time data. Visual Inference Tools develop expertise in young statisticians that can later be used to efficiently direct attention. The insights of this thesis are especially helpful for consultants, applied statisticians, and teachers of data analysis.
|