Return to search

Some problems in high dimensional data analysis

The bloom of economics and technology has had an enormous impact on society. Along with these developments, human activities nowadays produce massive amounts of data that can be easily collected for relatively low cost with the aid of new technologies. Many examples can be mentioned here including data from web term-document data, sensor arrays, gene expression, finance data, imaging and hyperspectral analysis. Because of the enormous amount of data from various different and new sources, more and more challenging scientific problems appear. These problems have changed the types of problems which mathematical scientists work. / In traditional statistics, the dimension of the data, p say, is low, with many observations, n say. In this case, classical rules such as the Central Limit Theorem are often applied to obtain some understanding from data. A new challenge to statisticians today is dealing with a different setting, when the data dimension is very large and the number of observations is small. The mathematical assumption now could be p > n, or even p goes to infinity and n fixed in many cases, for example, there are few patients with many genes. In these cases, classical methods fail to produce a good understanding of the nature of the problem. Hence, new methods need to be found to solve these problems. Mathematical explanations are also needed to generalize these cases. / The research preferred in this thesis includes two problems: Variable selection and Classification, in the case where the dimension is very large. The work on variable selection problems, in particular the Adaptive Lasso was completed by June 2007 and the research on classification has been carried out through out 2008 and 2009. The research on the Dantzig selector and the Lasso were finished in July 2009. Therefore, this thesis is divided into two parts. In the first part of the thesis we study the Adaptive Lasso, the Lasso and the Dantzig selector. In particular, in Chapter 2 we present some results for the Adaptive Lasso. Chapter 3 will provides two examples that show that neither the Dantzig selector or the Lasso is definitely better than the other. The second part of the thesis is organized as follows. In Chapter 5, we shall construct the model setting. In Chapter 6, we summarize the results of the scaled centroid-based classifier. We also prove some results on the scaled centroid-based classifier. Because there are similarities between the Support Vector Machine (SVM) and Distance Weighted Discrimination (DWD) classifiers, Chapter 8 introduces a class of distance-based classifiers that could be considered a generalization of the SVM and DWD classifiers. Chapters 9 and 10 are about the SVM and DWD classifiers. Chapter 11 demonstrates the performance of these classifiers on simulated data sets and some cancer data sets.

Identiferoai:union.ndltd.org:ADTP/283861
Date January 2010
CreatorsPham, Tung Huy
Source SetsAustraliasian Digital Theses Program
LanguageEnglish
Detected LanguageEnglish
RightsRestricted Access: Abstract and Citation Only Available

Page generated in 0.0015 seconds