121

Dimensionality Reduction Using Factor Analysis

Khosla, Nitin, n/a January 2006 (has links)
In many pattern recognition applications, a large number of features are extracted in order to ensure accurate classification of unknown classes. One way to cope with the problems of high dimensionality is to first reduce the data to a manageable size, keeping as much of the original information as possible, and then feed the reduced-dimensional data into a pattern recognition system. In this situation, dimensionality reduction becomes the pre-processing stage of the pattern recognition system. In addition, probability density estimation is simpler when fewer variables are involved. Dimensionality reduction is useful in speech recognition, data compression, visualization and exploratory data analysis. Techniques that can be used for dimensionality reduction include Factor Analysis (FA), Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). Factor Analysis can be considered an extension of Principal Component Analysis. The EM (expectation-maximization) algorithm is ideally suited to problems of this sort, in that it produces maximum-likelihood (ML) estimates of parameters when there is a many-to-one mapping from an underlying distribution to the distribution governing the observations: the expectation step computes the expected complete-data log likelihood conditioned upon the observations, and the maximization step then provides a new estimate of the parameters. This research compares Factor Analysis (based on the expectation-maximization algorithm), Principal Component Analysis and Linear Discriminant Analysis for dimensionality reduction, and investigates Local Factor Analysis (EM-based) and Local Principal Component Analysis using Vector Quantization.
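As a hedged illustration of the FA-versus-PCA comparison described above (not the author's implementation or data), the sketch below reduces a synthetic high-dimensional data set with maximum-likelihood Factor Analysis and with PCA, then feeds the reduced data into a simple downstream classifier; the data set, the number of components and the classifier are assumptions made for the example.

```python
# Sketch: dimensionality reduction with Factor Analysis vs. PCA, followed by a
# simple classifier on the reduced data. Data set and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.decomposition import FactorAnalysis, PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# High-dimensional synthetic data with a few informative directions.
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           n_redundant=20, random_state=0)

reducers = [("Factor Analysis (ML fit)", FactorAnalysis(n_components=5, random_state=0)),
            ("PCA", PCA(n_components=5, random_state=0))]
for name, reducer in reducers:
    Z = reducer.fit_transform(X)              # reduced-dimensional data
    acc = cross_val_score(LinearDiscriminantAnalysis(), Z, y, cv=5).mean()
    print(f"{name}: 5-fold accuracy on 5 components = {acc:.3f}")
```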
122

Multistage Seismic Assessment Methods For Existing Reinforced Concrete Buildings And Their Applicability For Retrofitting Cost Estimation

Dogan, Onur 01 February 2013 (has links) (PDF)
When the huge building stock in Turkey is considered, it is practically impossible to carry out detailed structural analyses for all of the buildings. In order to cope with the seismic safety evaluation of a large number of existing buildings, it is necessary to use simplified techniques that can predict the seismic vulnerability of the existing buildings in a relatively short time. The comprehensive structural data compiled for 48 different reinforced concrete buildings, containing full information on their structural characteristics before and after retrofitting, are used in this study. The first basic goal of the study is to develop a procedure through which the building stock under consideration can be classified as “safe” or “unsafe” according to the current Turkish Seismic Code. The classification procedure is based on discriminant analysis. The cross-sectional area of the load-bearing members of a building and its preliminary assessment score are selected as the discriminator variables. The second and ultimate goal of the study is to propose a method through which the minimum retrofitting cost for satisfying the provisions of the Turkish Seismic Code can be estimated. A quick and inexpensive assessment of retrofitting cost based on the procedure described in this thesis provides a useful input for decisions concerning whether a seismically “unsafe” building should be rebuilt or retrofitted, saving time, labor and money in the evaluation of large building stocks and in urban transformation operations.
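As a hedged sketch of the two-variable discriminant classification described above (not the thesis procedure or its compiled building data), the example below fits a linear discriminant on two hypothetical discriminator variables, a normalized load-bearing member area and a preliminary assessment score, and labels buildings “safe” or “unsafe”; all numbers are invented for illustration.

```python
# Sketch: linear discriminant analysis on two hypothetical discriminator
# variables (load-bearing member area ratio, preliminary assessment score).
# Training values and labels are invented, not the thesis data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_train = np.array([[0.9, 72], [1.3, 85], [0.4, 35], [0.6, 48],
                    [1.1, 90], [0.5, 40], [1.4, 95], [0.3, 30]])
y_train = np.array(["safe", "safe", "unsafe", "unsafe",
                    "safe", "unsafe", "safe", "unsafe"])

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

candidate = np.array([[0.7, 55]])         # a building awaiting screening
print(lda.predict(candidate))             # predicted class label
print(lda.predict_proba(candidate))       # class membership probabilities
```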
123

Classification models for disease diagnosis and outcome analysis

Wu, Tsung-Lin 12 July 2011 (has links)
In this dissertation we study the feature selection and classification problems and apply our methods to real-world medical and biological data sets for disease diagnosis. Classification is an important problem in disease diagnosis to distinguish patients from the normal population. DAMIP (discriminant analysis -- mixed integer program) was shown to be a good classification model, which can directly handle multigroup problems, enforce misclassification limits, and provide a reserved-judgement region. However, DAMIP is NP-hard and presents computational challenges. Feature selection is important in classification to improve prediction performance, prevent over-fitting, or facilitate data understanding. However, this combinatorial problem becomes intractable when the number of features is large. In this dissertation, we propose a modified particle swarm optimization (PSO), a heuristic method, to solve the feature selection problem, and we study its parameter selection in our applications. We derive theory and exact algorithms to solve the two-group DAMIP in polynomial time. We also propose a heuristic algorithm to solve the multigroup DAMIP. Computational studies on simulated data and data from the UCI machine learning repository show that the proposed algorithm performs very well. The polynomial solution time of the heuristic method allows us to solve DAMIP repeatedly within the feature selection procedure. We apply the PSO/DAMIP classification framework to several real-life medical and biological prediction problems. (1) Alzheimer's disease: We use data from several neuropsychological tests to discriminate subjects with Alzheimer's disease, subjects with mild cognitive impairment, and control groups. (2) Cardiovascular disease: We use traditional risk factors and novel oxidative stress biomarkers to predict subjects who are at high or low risk of cardiovascular disease, where risk is measured by the thickness of the carotid intima-media and/or the flow-mediated vasodilation. (3) Sulfur amino acid (SAA) intake: We use 1H NMR spectral data of human plasma to classify plasma samples obtained with low or high SAA intake, showing that our method is useful for metabolomics studies. (4) CpG islands for lung cancer: We identify a large number of sequence patterns (on the order of millions), search for candidate patterns in DNA sequences from CpG islands, and look for patterns that can discriminate methylation-prone and methylation-resistant (or, in addition, methylation-sporadic) sequences, which relates to early lung cancer prediction.
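A minimal sketch of a binary particle swarm optimization wrapper for feature selection follows. Because DAMIP is not publicly available, cross-validated linear discriminant analysis stands in as the inner classifier, and the swarm size, iteration count and PSO coefficients are assumed values rather than those tuned in the dissertation.

```python
# Sketch: binary PSO wrapper for feature selection. Cross-validated LDA stands
# in for DAMIP as the inner classifier; PSO settings are assumed values.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_particles, n_features, n_iters = 20, X.shape[1], 30
w, c1, c2 = 0.7, 1.5, 1.5                    # inertia and acceleration weights

def fitness(bits):
    """Cross-validated accuracy of the inner classifier on the selected features."""
    mask = bits.astype(bool)
    if not mask.any():
        return 0.0
    return cross_val_score(LinearDiscriminantAnalysis(), X[:, mask], y, cv=5).mean()

pos = (rng.random((n_particles, n_features)) > 0.5).astype(float)   # 0/1 feature masks
vel = rng.normal(0.0, 1.0, (n_particles, n_features))
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iters):
    r1, r2 = rng.random((2, n_particles, n_features))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    flip_prob = 1.0 / (1.0 + np.exp(-vel))                # sigmoid transfer function
    pos = (rng.random((n_particles, n_features)) < flip_prob).astype(float)
    fit = np.array([fitness(p) for p in pos])
    better = fit > pbest_fit
    pbest[better], pbest_fit[better] = pos[better], fit[better]
    gbest = pbest[pbest_fit.argmax()].copy()

print("features selected:", int(gbest.sum()), "of", n_features,
      "| best CV accuracy:", round(float(pbest_fit.max()), 3))
```

The sigmoid transfer function converts real-valued velocities into bit probabilities, one common way to adapt PSO to a combinatorial search space such as feature selection.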
124

Statistical Learning in Drug Discovery via Clustering and Mixtures

Wang, Xu January 2007 (has links)
In drug discovery, thousands of compounds are assayed to detect activity against a biological target. The goal of drug discovery is to identify compounds that are active against the target (e.g. inhibit a virus). Statistical learning in drug discovery seeks to build a model that uses descriptors characterizing molecular structure to predict biological activity. However, the characteristics of drug discovery data can make it difficult to model the relationship between molecular descriptors and biological activity. Among these characteristics are the rarity of active compounds, the large volume of compounds tested by high-throughput screening, and the complexity of molecular structure and its relationship to activity. This thesis focuses on the design of statistical learning algorithms/models and their applications to drug discovery. The two main parts of the thesis are an algorithm-based statistical method and a more formal model-based approach. Both approaches can facilitate and accelerate the process of developing new drugs. A unifying theme is the use of unsupervised methods as components of supervised learning algorithms/models. In the first part of the thesis, we explore a sequential screening approach, Cluster Structure-Activity Relationship Analysis (CSARA). Sequential screening integrates High Throughput Screening with mathematical modeling to sequentially select the best compounds. CSARA is a cluster-based, algorithm-driven method. To gain further insight into this method, we use three carefully designed experiments to compare its predictive accuracy with Recursive Partitioning, a popular structure-activity relationship analysis method. The experiments show that CSARA outperforms Recursive Partitioning. Comparisons include problems with many descriptor sets and situations in which many descriptors are not important for activity. In the second part of the thesis, we propose and develop constrained mixture discriminant analysis (CMDA), a model-based method. The main idea of CMDA is to model the distribution of the observations given the class label (e.g. active or inactive class) as a constrained mixture distribution, and then use Bayes’ rule to predict the probability of being active for each observation in the testing set. Constraints are used to deal with the otherwise explosive growth of the number of parameters with increasing dimensionality. CMDA is designed to address several challenges in modeling drug data sets, such as multiple mechanisms, the rare target problem (i.e. imbalanced classes), and the identification of relevant subspaces of descriptors (i.e. variable selection). We focus on the CMDA1 model, in which univariate densities form the building blocks of the mixture components. Due to the unboundedness of the CMDA1 log-likelihood function, it is easy for the EM algorithm to converge to degenerate solutions. A special multi-step EM algorithm is therefore developed and explored via several experimental comparisons. Using the multi-step EM algorithm, the CMDA1 model is compared to model-based clustering discriminant analysis (MclustDA). The CMDA1 model is either superior to or competitive with the MclustDA model, depending on which model generates the data. The CMDA1 model has better performance than the MclustDA model when the data are high-dimensional and unbalanced, an essential feature of the drug discovery problem. An alternative approach to the problem of degeneracy is penalized estimation. By introducing a group of simple penalty functions, we consider penalized maximum likelihood estimation of the CMDA1 and CMDA2 models. This strategy improves the convergence of the conventional EM algorithm and helps avoid degenerate solutions. Extending techniques from Chen et al. (2007), we prove that the PMLEs of the two-dimensional CMDA1 model can be asymptotically consistent.
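A hedged sketch of the general mixture-discriminant idea the thesis builds on (class-conditional mixture densities combined through Bayes’ rule) is given below. It is not the CMDA model itself: the constraints, penalties and multi-step EM developed in the thesis are not reproduced, diagonal-covariance Gaussian mixtures from scikit-learn merely stand in for the constrained, univariate building blocks, and the imbalanced synthetic data set is an assumption.

```python
# Sketch: class-conditional Gaussian mixtures plus Bayes' rule, the generic
# mixture-discriminant idea. The CMDA constraints, penalties and multi-step EM
# are not reproduced; diagonal covariances only mimic the univariate building
# blocks, and the imbalanced synthetic data are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=6,
                           weights=[0.95, 0.05], random_state=1)   # rare "active" class
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

classes = np.unique(y_tr)
priors = np.array([(y_tr == c).mean() for c in classes])
mixtures = [GaussianMixture(n_components=3, covariance_type="diag",
                            random_state=1).fit(X_tr[y_tr == c]) for c in classes]

# Bayes' rule: pick the class maximizing log p(x | class) + log prior.
log_joint = np.column_stack([m.score_samples(X_te) + np.log(p)
                             for m, p in zip(mixtures, priors)])
pred = classes[log_joint.argmax(axis=1)]
print("test accuracy:", round(float((pred == y_te).mean()), 3))
print("rare-class recall:", round(float((pred[y_te == 1] == 1).mean()), 3))
```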
126

Application of multivariate statistical method to characterize the groundwater quality of a contaminated site

Chiou, Hsien-wei 07 February 2010 (has links)
In this study, a chlorinated-solvent contaminated groundwater site was used as the study site. Multivariate statistical analysis summarizes a large and complicated body of original data efficiently, concisely and explicitly: it either reduces the original data to a few representative factors, or clusters the observations according to their similarity and identifies the clustering outcome. The statistical software SPSS 12.0 was used to perform multivariate statistical analysis and evaluate the groundwater quality characteristics of this site. Results show that the 20 analytical items of groundwater quality at the study site are simplified through factor analysis into seven major representative factors, labelled “background”, “salt residual”, “hardness”, “ethylene chloride”, “alkalinity”, “organic pollutant” and “chloroform”. A factor score diagram was drawn from the score of each monitoring well on each factor, and these factors account for 89.6% of the variance. Cluster analysis was then carried out in two phases: the groundwater quality monitoring wells were classified into seven clusters according to the similarity of the monitored data and the differences between clusters, and the groundwater quality characteristics and pollutant distributions of each cluster at this site were evaluated. The clustering result indicates that for the sixth cluster (for which monitoring well SW-6 was the representative well), the average concentrations of chlorinated compounds such as 1,1-dichloroethylene, 1,1-dichloroethane and cis-1,2-dichloroethylene were the highest among the clusters, indicating that the groundwater of the nearby area might be polluted by chlorinated organic compounds. In addition, to evaluate whether the clusters from the cluster analysis were appropriate, discriminant analysis was used to assess clustering accuracy: seven Fisher discriminant coefficient formulas specific to this site were established, and the observed values were substituted into them. The results show that the cluster assignments obtained from discriminant analysis were identical to those of the actual cluster analysis; the accuracy was 100%. Cross-validation analysis gave an accuracy of 80%, indicating that using discriminant analysis (with its forecasting function) to verify the clustering result of the cluster analysis was highly accurate. After analyzing the pollution condition of this site using time trends and spatial distributions, it was concluded that trichloroethylene and 1,1-dichloroethylene were the major pollutants of concern; the pollutants appeared to be spreading on a large scale, so it was difficult to use the existing data to evaluate the pollution source. After assessing the environmental medium characteristics and pollutant distribution of the site, this study suggests that in-situ bioremediation, which is cost-effective, can be applied as a remedial method.
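A hedged sketch of the factor-analysis, cluster-analysis and discriminant-analysis workflow described above is shown below. The thesis performed this analysis in SPSS 12.0 on site data that are not available here, so synthetic measurements stand in, and the numbers of wells, analytical items, factors and clusters are assumed for illustration only.

```python
# Sketch: factor analysis -> hierarchical clustering -> discriminant analysis,
# on synthetic data standing in for the site's groundwater measurements.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import AgglomerativeClustering
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# 60 "monitoring wells" x 20 "analytical items", with 7 latent groups.
X, _ = make_blobs(n_samples=60, n_features=20, centers=7, random_state=7)

Xz = StandardScaler().fit_transform(X)                  # standardize the items
scores = FactorAnalysis(n_components=7, random_state=7).fit_transform(Xz)

# Hierarchical (Ward) clustering of the wells on their factor scores.
clusters = AgglomerativeClustering(n_clusters=7, linkage="ward").fit_predict(scores)

# Discriminant analysis checks how well the clusters can be reproduced,
# analogous to validating the clustering with Fisher discriminant functions.
resub = LinearDiscriminantAnalysis().fit(scores, clusters).score(scores, clusters)
cv = cross_val_score(LinearDiscriminantAnalysis(), scores, clusters, cv=3).mean()
print(f"resubstitution accuracy: {resub:.2f}, cross-validated accuracy: {cv:.2f}")
```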
127

Characterization of the toxicity of Helicobacter pylori clinical isolates and the biomarker in the stools of gastric cancer patients using MALDI-TOF/MS and multivariate analysis

Leung, Yun-Shiuan 06 August 2012 (has links)
Chapter 1. Deciphering the toxicity of Helicobacter pylori clinical isolates from gastric disease patients using MALDI-TOF/MS and multivariate analysis. Helicobacter pylori (H. pylori) infection is associated with gastric diseases such as gastric polyps, chronic gastritis, gastric ulcer and gastric cancer. In fact, most infected people do not show symptoms of gastric disease, owing to the high genetic variability of H. pylori and the specific immune responses of the hosts. In order to investigate the relationship between H. pylori and gastric diseases, clinical strains of H. pylori isolated from patients with nine gastric diseases were processed with an optimized extraction protocol and analyzed by MALDI-TOF/MS; the highly reproducible spectra were then combined with multivariate statistical analysis, including Principal Component Analysis (PCA), Hierarchical Cluster Analysis (HCA) and Discriminant Analysis (DA). PCA revealed no specific potential marker that discriminates the clinical strains among the nine gastric diseases. In HCA, strains from different gastric diseases clustered together, indicating that their protein and metabolite profiles are similar. In DA, the strains from gastric cancer and non-gastric-cancer patients were discriminated by a discriminant function composed of thirty-eight discriminant variables in the spectra. This discriminant function will be confirmed with further clinical strains isolated from gastric disease patients in the future, and should then help to predict whether the protein and metabolite profile of a strain isolated from a gastric disease patient resembles that of gastric cancer or not. Chapter 2. Biomarker discovery in the stools of gastric cancer patients using MALDI-TOF/MS. According to the Department of Health statistics for year 100 of the Republic of China (2011), cancer ranked first among the ten leading causes of death. With modern changes in eating habits, gastrointestinal cancer has increased steadily. Gastrointestinal cancer is accompanied by occult gastrointestinal bleeding, which is commonly detected by the fecal occult blood (FOB) test. FOB tests include guaiac-based fecal occult-blood tests and immunochemical tests. Guaiac-based fecal occult-blood tests make use of the pseudoperoxidase activity of heme: the reagent turns blue after oxidation by oxidants or peroxidases in the presence of an oxygen donor such as hydrogen peroxide, so the test is prone to false-positive results. Immunochemical tests use antibodies against human hemoglobin with great sensitivity, but they are limited by the loss of hemoglobin antigenicity at room temperature and require processing in a laboratory. In order to reduce the false positives from detecting heme and to lower the cost of detecting hemoglobin in stools, in this study we used distilled water to extract heme (m/z 616) and hemoglobin from stools and analyzed them with the reflectron and linear modes of MALDI-TOF/MS. First, we used simulated stomach acid to decompose hemoglobin and release heme, simulating gastrointestinal bleeding. Second, we used distilled water to extract hemoglobin from stools and detected it in the linear mode of MALDI-TOF/MS; the detection limit of MALDI-TOF/MS for hemoglobin in stool was better than that of the immunochemical tests. Third, the same strategy was applied to the stools of fifty-nine patients (including nineteen esophageal cancer patients, twenty gastric cancer patients and colorectal cancer patients) to detect heme and hemoglobin by MALDI-TOF/MS, and the results were compared with the fecal occult blood tests. In the detection of heme, there were cases in which MALDI-TOF/MS did not detect heme but the guaiac-based fecal occult-blood test was positive; this suggests that oxidants other than heme in the stools reacted with the reagent. In other cases MALDI-TOF/MS detected heme while the guaiac-based test gave no result; those cases will be followed up in the future. In the detection of hemoglobin, with the immunochemical test as the reference index, the false-negative results of MALDI-TOF/MS might come from the complicated matrix effect of stools, which prevents hemoglobin from forming good crystals with the CHCA matrix. The false-positive results of MALDI-TOF/MS might come from the criteria used to define the hemoglobin signal.
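As a hedged sketch of the multivariate workflow in Chapter 1 (not the clinical analysis itself), the example below bins invented MALDI-TOF peak lists onto a fixed m/z grid, runs hierarchical cluster analysis, and cross-validates a linear discriminant for a gastric-cancer versus non-cancer split; the peak lists, m/z range, bin width and labels are all placeholders.

```python
# Sketch: bin MALDI-TOF peak lists onto a fixed m/z grid, then run hierarchical
# cluster analysis and a cross-validated discriminant for cancer vs. non-cancer.
# Peak lists, m/z range, bin width and labels are invented placeholders.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n_strains = 40
bins = np.arange(2000, 20000, 100)                 # 100 Da bins, 2-20 kDa (assumed)
labels = np.repeat([0, 1], n_strains // 2)         # 1 = "gastric cancer" (placeholder)

def peaks_to_vector(mz, intensity):
    """Bin a (m/z, intensity) peak list onto the fixed grid, max intensity per bin."""
    vec = np.zeros(len(bins) - 1)
    for i, inten in zip(np.digitize(mz, bins) - 1, intensity):
        if 0 <= i < len(vec):
            vec[i] = max(vec[i], inten)
    return vec / (vec.max() or 1.0)                # normalize each spectrum

# Simulated peak lists stand in for the measured spectra.
X = np.vstack([peaks_to_vector(rng.uniform(2000, 20000, 60), rng.exponential(1.0, 60))
               for _ in range(n_strains)])

Z = linkage(X, method="average", metric="cosine")  # hierarchical cluster analysis
print("HCA cluster sizes:", np.bincount(fcluster(Z, t=4, criterion="maxclust"))[1:])

acc = cross_val_score(LinearDiscriminantAnalysis(), X, labels, cv=5).mean()
print("cancer vs. non-cancer CV accuracy (chance level on random spectra):",
      round(float(acc), 2))
```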
128

Evaluation of Groundwater Characteristics Using Multivariate Statistical Method: a Case Study in Kaohsiung

Wang, Mei-hsueh 24 August 2012 (has links)
It is not easy to convey the quality of groundwater bodies clearly to the public: even when a large amount of valid water quality data is available, the data are difficult to combine and summarize, and different agencies often offer their own interpretations of the test results. Multivariate statistical analysis can simplify highly complex data into a small number of representative factors that clearly explain the inter-relationships among the original variables, or cluster and identify the observations according to their similarity in order to understand the causes of certain phenomena; this study therefore uses it to explore groundwater characteristics. The monitoring data come from 48 groundwater monitoring wells in Kaohsiung City, obtained from the database of the EPA National Water Quality Monitoring Information website. The SPSS 12.0 package was applied to perform multivariate statistical analysis, including factor analysis, cluster analysis and discriminant analysis, in order to summarize, sort and classify the water quality characteristics and to evaluate the causes of pollution and the characteristics of local areas. Factor analysis yields four representative factors for groundwater quality in the Kaohsiung region: a salinization factor, an organic pollution factor, a mineral dissolution factor and an acid-base factor. These four principal factors stand in for the 17 analytical items of regional groundwater quality in Kaohsiung City and account for 78.3% of the variance. Cluster analysis divides the 48 monitoring wells in the region into four groups according to the nature and similarity of the monitoring data. The correlation between the well water quality within each cluster and the main factors was investigated, and by comparing well positions the average groundwater quality of the inland area was distinguished from that of the coastal area: seawater intrusion and salinization occur in the coastal area, and the monitoring wells located in the Cijin district are affected by the acid-base (pH) factor. Groundwater quality in the Kaohsiung region generally ranges from hard to very hard water. To understand the difference between the multivariate statistical method and conventional groundwater pollution index analysis, Piper diamond diagrams were drawn and compared with the cluster analysis results. The comparison shows that multivariate statistical analysis provides a systematic analysis of the variable data and the overall variation of water quality, together with objective clustering, whereas a conventional composite index method such as the Piper diagram identifies the type of pollution from the position of samples in the diagram but has difficulty explaining the overall pollution characteristics. Finally, this study hopes to recommend pollution control assessment and prevention strategies for groundwater in Kaohsiung City.
129

The Model of Credit Rating for Country Risk

Chen, Liang-kuang 10 June 2004 (has links)
none
130

Classification Of Remotely Sensed Data By Using 2d Local Discriminant Bases

Tekinay, Cagri 01 August 2009 (has links) (PDF)
In this thesis, the 2D Local Discriminant Bases (LDB) algorithm is applied within a 2D search structure to classify remotely sensed data. The 2D Linear Discriminant Analysis (LDA) method is converted into an M-ary classifier by combining the majority voting principle with linear distance parameters. The feature extraction algorithm selects the relevant features by removing irrelevant ones and/or combining those that do not carry supplemental information on their own. The algorithm is implemented on a remotely sensed airborne data set from Tippecanoe County, Indiana to evaluate its performance. Spectral and spatial-frequency features are extracted from the multispectral data and used to classify vegetative species such as corn, soybeans, red clover, wheat and oats in the data set.
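A hedged sketch of one piece of this pipeline, turning pairwise linear discriminants into an M-ary classifier by majority voting, is shown below. The 2D LDB feature extraction itself is not reproduced, and a generic multiclass data set stands in for the airborne multispectral features from Tippecanoe County.

```python
# Sketch: an M-ary classifier built from pairwise linear discriminants combined
# by majority voting. Iris stands in for the airborne multispectral features;
# the 2D LDB feature extraction is not reproduced.
import numpy as np
from itertools import combinations
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                  # placeholder feature vectors
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
classes = np.unique(y_tr)

# One binary linear discriminant per class pair.
pairwise = {}
for a, b in combinations(classes, 2):
    m = (y_tr == a) | (y_tr == b)
    pairwise[(a, b)] = LinearDiscriminantAnalysis().fit(X_tr[m], y_tr[m])

# Majority voting over the pairwise decisions yields the M-ary label.
votes = np.zeros((len(X_te), len(classes)), dtype=int)
for clf in pairwise.values():
    pred = clf.predict(X_te)                       # class labels double as column indices
    votes[np.arange(len(X_te)), pred] += 1
m_ary_pred = classes[votes.argmax(axis=1)]
print("M-ary accuracy:", round(float((m_ary_pred == y_te).mean()), 3))
```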
