Global ETD Search

1	Contributions to Data Reduction and Statistical Model of Data with Complex Structures
Wei, Yanran 30 August 2022
With advanced technology and the information explosion, the data of interest often have complex structures, with large sizes and high dimensions in the form of continuous or discrete features. There is an emerging need for data reduction, efficient modeling, and model inference. For example, data can contain millions of observations with thousands of features. Traditional methods, such as linear regression or LASSO regression, cannot effectively deal with such a large dataset directly. This dissertation aims to develop several techniques to effectively analyze large datasets with complex structures in observational, experimental and time series data.
In Chapter 2, I focus on data reduction for model estimation in sparse regression. Commonly used subdata selection methods consider either sampling or feature screening. For data with both a large number of observations and a large number of predictors, we propose a filtering approach for model estimation (FAME) that reduces both the number of data points and the number of features. The proposed algorithm can be easily extended to data with a discrete response or discrete predictors. In simulations and case studies, the proposed method performs well for parameter estimation with efficient computation.
In Chapter 3, I focus on modeling experimental data with a quantitative-sequence (QS) factor. Here the QS factor concerns both the quantities and the sequence orders of several components in the experiment. Existing methods usually focus on either the sequence orders or the quantities of the multiple components, but not both. To fill this gap, we propose a QS transformation that maps the QS factor to a generalized permutation matrix, and consequently develop a simple Gaussian process approach to model experimental data with QS factors.
In Chapter 4, I focus on forecasting multivariate time series data by leveraging autoregression and clustering. Existing time series forecasting methods treat each series independently and ignore their inherent correlation. To fill this gap, I propose a clustering method based on autoregression and control the sparsity of the transition matrix estimate with the adaptive lasso and a clustering coefficient. The clustering-based cross prediction can outperform conventional time series forecasting methods. Moreover, the clustering result can also enhance the accuracy of other forecasting methods. The proposed method can be applied to practical data, such as stock forecasting and topic trend detection.
/ Doctor of Philosophy /
This dissertation focuses on three projects related to data reduction and statistical modeling of data with complex structures. In Chapter 2, we propose a data filtering approach for parameter estimation in sparse regression. Given data with thousands of observations and predictors, or even more, large storage and computation resources are needed to handle them, and the computation is both demanding and slow. We therefore develop an algorithm (FAME) that reduces both the number of observations and the number of predictors. After data reduction, the subdata selected by FAME retain most of the information in the original dataset for parameter estimation. Compared with existing methods, the subdata generated by the proposed algorithm have a smaller dimension, while the computational time does not increase.
In Chapter 3, we use a quantitative-sequence (QS) factor to describe experimental data. One simple example of experimental data is milk tea: adding 1 cup of milk first or adding 2 cups of tea first will influence the flavor. This case can be extended to experiments in which thousands of ingredients need to be added.
Then the order and amount of the ingredients will generate different experimental results. We use the QS factor to describe this kind of order and amount. Then, by transforming the QS factor into a matrix containing continuous values and using this matrix as input, we model the experimental results with a simple Gaussian process.
In Chapter 4, we propose an autoregression-based clustering and forecasting method for multivariate time series data. Existing research often treats each time series independently. Our approach incorporates the inherent correlation of the data and clusters related series into one group. Forecasting is built on each cluster, and series within one cluster can cross-predict each other. One application of this method is topic trend detection. With thousands of topics, it is infeasible to apply one model to forecast all time series. Considering the similarity of trends among related topics, the proposed method clusters topics based on their similarity and then forecasts with an autoregression model based on historical data within each cluster.
high-dimensional data; subdata selection; filtering approach; analysis of experimental data; Gaussian process; permutation matrix; QS factor; multivariate time series; spectral clustering; autoregression; cross prediction
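The milk tea example in the abstract above can be made concrete with a small sketch. It assumes one plausible encoding of a QS factor as a generalized permutation matrix: row i is the i-th step of the sequence, column j is component j, and the single nonzero entry in each row is the quantity added at that step. The function names `qs_to_matrix` and `is_generalized_permutation` are hypothetical, and the dissertation's actual QS transformation may differ.

```python
from typing import List, Sequence, Tuple


def qs_to_matrix(
    sequence: Sequence[Tuple[str, float]], components: Sequence[str]
) -> List[List[float]]:
    """Encode an ordered list of (component, quantity) pairs as a matrix.

    Assumed encoding (not taken from the dissertation): row = step in the
    sequence, column = component, entry = quantity added at that step.
    """
    if sorted(name for name, _ in sequence) != sorted(components):
        raise ValueError("each component must appear exactly once")
    column = {name: j for j, name in enumerate(components)}
    matrix = [[0.0] * len(components) for _ in components]
    for step, (name, quantity) in enumerate(sequence):
        if quantity == 0:
            raise ValueError("quantities must be nonzero")
        matrix[step][column[name]] = float(quantity)
    return matrix


def is_generalized_permutation(matrix: List[List[float]]) -> bool:
    """Check that every row and every column has exactly one nonzero entry."""
    n = len(matrix)
    rows_ok = all(sum(1 for v in row if v != 0) == 1 for row in matrix)
    cols_ok = all(
        sum(1 for row in matrix if row[j] != 0) == 1 for j in range(n)
    )
    return rows_ok and cols_ok


components = ["milk", "tea"]
# Same quantities, different order: 1 cup of milk and 2 cups of tea.
milk_first = qs_to_matrix([("milk", 1), ("tea", 2)], components)
tea_first = qs_to_matrix([("tea", 2), ("milk", 1)], components)

print(milk_first)
print(tea_first)
```

The two orderings use the same quantities but yield different matrices, so a model that takes the matrix as input can distinguish both the order and the amount of the ingredients.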
2	Designing Genomic Solutions for Abiotic Traits in Flax (Linum usitatissimum L.)
Khan, Nadeem 15 December 2022
Flax (Linum usitatissimum L.) is a self-pollinated crop widely cultivated for fiber and oil production. Flaxseed is renowned for its health attributes, but the presence of compounds such as the heavy metal cadmium (Cd) is undesirable. Genomic studies in flax have produced large amounts of data in the last 15 years, providing useful resources to improve the genetics of this crop using genomics-based technologies and strategies. The goal of this thesis is therefore to capitalize on these advances to address the Cd problem and to propose solutions to improve breeding efficiency. To find genomics-based solutions to Cd content, to the currently low breeding efficiency and to abiotic stress resistance in flax, this study used four major strategies: (1) genomic cross prediction, (2) gene family identification, (3) genome-wide association study (GWAS) and (4) genomic selection (GS).
Characterization of the ATP-binding cassette (ABC) transporter and heavy metal associated (HMA) gene families was performed using the flax genome sequence. A total of 198 ABC transporter and 12 HMA genes were identified in the flax genome, of which nine were orthologous to Cd-associated genes in Arabidopsis, rice and maize. A transcriptomic analysis of eight tissues provided some support for the functional annotation of these genes and confirmed the expression of these ABC transporter and HMA genes in flax seeds and other tissues. A diversity panel of 168 flax accessions was grown in the field at multiple locations and in multiple years, and the seed content of 24 heavy metals (HMs) was measured. The panel was also sequenced, and a single nucleotide polymorphism (SNP) dataset of nearly 43,000 SNPs was defined.
A GWAS was conducted using these genotypic and phenotypic data, and a total of 355 non-redundant quantitative trait nucleotides (QTNs) were identified for ten of the 24 metal contents. Overall, 24 major and 331 minor effect QTNs were detected, including 11 that were pleiotropic. After allelic tests, 108 non-redundant QTNs were retained for eight of the ten metals, ranging from one for copper (Cu) to 70 for strontium (Sr). A total of 20 candidate genes for HM accumulation were identified at 12 of the 24 major QTN loci, of which five belonged to the ABC transporter family. Many of the metal contents, including Cd, appeared to be controlled by many genes of small effect; hence, GS is better suited than marker-assisted selection for application in breeding. To test this, the predictive ability of ten GS statistical models was evaluated using trait-specific QTN datasets and the random genome-wide 43K SNP dataset. Significantly higher predictive abilities were observed for the GS models built with the QTNs associated with metal contents (70-80%) than for those built with the 43K dataset (10-25%).
This study showed the feasibility of using GS to improve the predictive ability for polygenic traits such as metal content in seeds. GS can be applied in early generation selection to accelerate the improvement of abiotic stress resistance and either select low-Cd lines or discard high-Cd lines. These findings validate the use of a QTL-based strategy as a highly effective method for improving the predictive ability of GS for highly complex traits such as resistance or tolerance to HM accumulation. Identification of both large and minor effect QTNs and/or pleiotropic effects holds potential for flax breeding improvement. Candidate gene functional validation can be performed using methods such as genome editing or targeting induced local lesions in genomes (TILLING).
Flax (Linum usitatissimum L.); Cadmium (Cd) accumulation; Genome-wide association study (GWAS); Quantitative trait loci (QTLs); Quantitative trait nucleotides (QTNs); Genomic selection (GS); Genomic cross prediction; Candidate genes; ATP-binding cassette (ABC) transporters; Heavy metal associated (HMA) genes
