• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 1
  • Tagged with
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Improved shrunken centroid method for better variable selection in cancer classification with high throughput molecular data

Xukun, Li January 1900 (has links)
Master of Science / Department of Statistics / Haiyan Wang / Cancer type classification with high throughput molecular data has received much attention. Many methods have been published in this area. One of them is called PAM (nearest centroid shrunken algorithm), which is simple and efficient. It can give very good prediction accuracy. A problem with PAM is that this method selects too many genes, some of which may have no influence on cancer type. A reason for this phenomenon is that PAM assumes that all genes have identical distribution and give a common threshold parameter for genes selection. This may not hold in reality since expressions from different genes could have very different distributions due to complicated biological process. We propose a new method aimed to improve the ability of PAM to select informative genes. Keeping informative genes while reducing false positive variables can lead to more accurate classification result and help to pinpoint target genes for further studies. To achieve this goal, we introduce variable specific test based on Edgeworth expansion to select informative genes. We apply this test on each gene and select some genes based on the result of the test so that a large number of genes will be excluded. Afterward, soft thresholding with cross-validation can be further applied to decide a common threshold value. Simulation and real application show that our method can reduce the irrelevant information and select the informative genes more precisely. The simulation results give us more insight about where the newly proposed procedure could improve the accuracy, especially when the data set is skewed or unbalanced. The method can be applied to broad molecular data, including, for example, lipidomic data from mass spectrum, copy number data from genomics, eQLT analysis with GWAS data, etc. We expect the proposed method will help life scientists to accelerate discoveries with highthroughput data.

Page generated in 0.0837 seconds