Global ETD Search

151	Statistical Learning in Drug Discovery via Clustering and Mixtures Wang, Xu January 2007 In drug discovery, thousands of compounds are assayed to detect activity against a biological target. The goal of drug discovery is to identify compounds that are active against the target (e.g. inhibit a virus). Statistical learning in drug discovery seeks to build a model that uses descriptors characterizing molecular structure to predict biological activity. However, the characteristics of drug discovery data can make it difficult to model the relationship between molecular descriptors and biological activity. Among these characteristics are the rarity of active compounds, the large volume of compounds tested by high-throughput screening, and the complexity of molecular structure and its relationship to activity. This thesis focuses on the design of statistical learning algorithms/models and their applications to drug discovery. The two main parts of the thesis are an algorithm-based statistical method and a more formal model-based approach. Both approaches can facilitate and accelerate the process of developing new drugs. A unifying theme is the use of unsupervised methods as components of supervised learning algorithms/models. In the first part of the thesis, we explore a sequential screening approach, Cluster Structure-Activity Relationship Analysis (CSARA). Sequential screening integrates high-throughput screening with mathematical modeling to sequentially select the best compounds. CSARA is a cluster-based, algorithm-driven method. To gain further insight into this method, we use three carefully designed experiments to compare its predictive accuracy with that of Recursive Partitioning, a popular structure-activity relationship analysis method. The experiments show that CSARA outperforms Recursive Partitioning. Comparisons include problems with many descriptor sets and situations in which many descriptors are not important for activity. 
In the second part of the thesis, we propose and develop constrained mixture discriminant analysis (CMDA), a model-based method. The main idea of CMDA is to model the distribution of the observations given the class label (e.g. active or inactive class) as a constrained mixture distribution, and then use Bayes’ rule to predict the probability of being active for each observation in the testing set. Constraints are used to deal with the otherwise explosive growth of the number of parameters with increasing dimensionality. CMDA is designed to solve several challenges in modeling drug data sets, such as multiple mechanisms, the rare target problem (i.e. imbalanced classes), and the identification of relevant subspaces of descriptors (i.e. variable selection). We focus on the CMDA1 model, in which univariate densities form the building blocks of the mixture components. Because the CMDA1 log likelihood function is unbounded, the EM algorithm easily converges to degenerate solutions. A special multi-step EM algorithm is therefore developed and explored via several experimental comparisons. Using the multi-step EM algorithm, the CMDA1 model is compared to model-based clustering discriminant analysis (MclustDA). The CMDA1 model is either superior to or competitive with the MclustDA model, depending on which model generates the data. The CMDA1 model performs better than the MclustDA model when the data are high-dimensional and unbalanced, an essential feature of the drug discovery problem. An alternative approach to the problem of degeneracy is penalized estimation. By introducing a group of simple penalty functions, we consider penalized maximum likelihood estimation of the CMDA1 and CMDA2 models. This strategy improves the convergence of the conventional EM algorithm and helps avoid degenerate solutions. Extending techniques from Chen et al. (2007), we prove that the penalized maximum likelihood estimators (PMLEs) of the two-dimensional CMDA1 model can be asymptotically consistent. 
Statistics
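The prediction step described in entry 151 (class-conditional mixture densities combined through Bayes' rule) can be illustrated with a minimal sketch. It is not the thesis's CMDA1 specification: the abstract does not state the constraints, so each mixture component is assumed here to be a product of univariate Gaussian densities, and all names (`posterior_active`, `mixture_log_density`) and parameter values are hypothetical.

```python
import math

def log_normal_pdf(x, mean, sd):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_log_density(x, components):
    """Log density of a mixture; components is a list of (weight, means, sds).

    ASSUMPTION: each component is a product of univariate Gaussians,
    one per descriptor. The actual CMDA1 constraints may differ.
    """
    terms = []
    for weight, means, sds in components:
        log_comp = sum(log_normal_pdf(xj, mj, sj)
                       for xj, mj, sj in zip(x, means, sds))
        terms.append(math.log(weight) + log_comp)
    return log_sum_exp(terms)

def posterior_active(x, prior_active, active, inactive):
    """Bayes' rule: P(active | x) from the two class-conditional mixtures."""
    log_a = math.log(prior_active) + mixture_log_density(x, active)
    log_i = math.log(1.0 - prior_active) + mixture_log_density(x, inactive)
    return 1.0 / (1.0 + math.exp(log_i - log_a))

# Hypothetical parameters: two "mechanisms" for the rare active class.
active = [(0.5, [3.0, 0.0], [1.0, 1.0]), (0.5, [0.0, 3.0], [1.0, 1.0])]
inactive = [(1.0, [0.0, 0.0], [1.0, 1.0])]
p_near_mechanism = posterior_active([3.0, 0.0], 0.05, active, inactive)
p_near_inactive = posterior_active([0.0, 0.0], 0.05, active, inactive)
```

With a prior of 5% actives, a compound sitting on one of the assumed active mechanisms receives a much higher posterior than one at the centre of the inactive class, which is the behaviour the rare-target setting calls for.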
153	Application of multivariate statistical method to characterize the groundwater quality of a contaminated site Chiou, Hsien-wei 07 February 2010 In this study, a groundwater site contaminated with chlorinated solvents was used as the study site. Multivariate statistical analysis describes large and complicated original data efficiently, concisely, and explicitly; it either reduces the original data to representative factors or clusters the data by similarity and identifies the resulting clusters. The statistical software SPSS 12.0 was used to perform the multivariate statistical analysis and evaluate the groundwater quality characteristics of the site. The results show that factor analysis reduces the 20 analytical items of groundwater quality at the study site to seven major representative factors: "background", "salt residual", "hardness", "ethylene chloride", "alkalinity", "organic pollutant", and "chloroform". A factor score diagram was drawn from the score of each monitoring well on each factor, and the seven factors accounted for 89.6% of the variance. Cluster analysis was carried out in two phases, and the groundwater quality monitoring wells were classified into seven clusters according to the similarity of the monitored data and the differences between clusters. The groundwater quality characteristics and pollutant distributions of each cluster at the site were then evaluated. The clustering result indicates that in the sixth cluster (for which monitoring well SW-6 was the representative well), the average concentrations of chlorinated compounds such as 1,1-dichloroethylene, 1,1-dichloroethane, and cis-1,2-dichloroethylene were the highest among the clusters, indicating that the groundwater of the nearby area might be polluted by chlorinated organic compounds. 
In addition, discriminant analysis was used to evaluate whether the clustering was appropriate. Seven Fisher discriminant coefficient formulas specific to this location were established, and the observed values were substituted into them. The cluster assignments of the monitoring wells obtained from discriminant analysis were identical to the results of the cluster analysis, an accuracy of 100%. Under cross-validation the accuracy was 80%, indicating that discriminant analysis (which has a forecasting function) verified the clustering result with high accuracy. Analysis of the pollution at the site by time trend and spatial distribution showed that trichloroethylene and 1,1-dichloroethylene were the major pollutants of concern; the pollutants appeared to be spreading on a large scale, so it was difficult to identify the pollution source from the existing data. After assessing the environmental medium characteristics and pollutant distribution of the site, this study suggests that in situ bioremediation, which is cost-effective, can be applied as a remedial method. Factor analysis; discriminant analysis; cluster analysis; chlorinated organic compound; groundwater quality
154	Characterization of the toxicity of Helicobacter pylori clinical isolates and the biomarker in the stools of gastric cancer patients using MALDI-TOF/MS and multivariate analysis Leung, Yun-Shiuan 06 August 2012 Chapter 1. Deciphering the toxicity of Helicobacter pylori clinical isolates from gastric disease patients using MALDI-TOF/MS and multivariate analysis. Helicobacter pylori (H. pylori) infection is associated with gastric diseases such as gastric polyps, chronic gastritis, gastric ulcer, and gastric cancer. In fact, most infected people do not have symptoms of gastric disease, owing to the high genetic variability of H. pylori and the specific immune responses of the hosts. To investigate the relationship between H. pylori and gastric diseases, clinical strains of H. pylori isolated from patients with nine gastric diseases were extracted with an optimized extraction procedure and analyzed by MALDI-TOF/MS; the highly reproducible spectra were then combined with multivariate statistical analysis, including Principal Component Analysis (PCA), Hierarchical Cluster Analysis (HCA), and Discriminant Analysis (DA). PCA revealed no specific potential marker that discriminates the clinical strains among the nine gastric diseases. In HCA, strains from different gastric diseases clustered together, which indicates that their protein and metabolite profiles are similar. In DA, strains from gastric cancer and non-gastric cancer patients were discriminated by a discriminant function composed of thirty-eight discriminant variables in the spectra. This discriminant function will be validated in the future with other clinical strains isolated from gastric disease patients, and could then help predict, from the protein and metabolite profile of a patient's isolate, whether or not the isolate is associated with gastric cancer. 
Chapter 2. Biomarker discovery in the stools of gastric cancer patients using MALDI-TOF/MS. According to Department of Health statistics for 2011 (Republic of China year 100), cancer ranked first among the ten leading causes of death. With modern changes in eating habits, the incidence of gastrointestinal cancer has increased steadily. Gastrointestinal cancer is accompanied by occult gastrointestinal bleeding, which is commonly detected by the fecal occult blood test (FOB). FOB methods include the guaiac-based fecal occult blood test and immunochemical tests. Guaiac-based tests make use of the pseudoperoxidase activity of heme: the reagent turns blue after oxidation by oxidants or peroxidases in the presence of an oxygen donor such as hydrogen peroxide, so false-positive results are possible. Immunochemical tests use antibodies against human hemoglobin and are highly sensitive, but they are limited by the loss of hemoglobin antigenicity at room temperature and require processing in a laboratory. To reduce false positives in heme detection and to lower the cost of detecting hemoglobin in stools, in this study we use distilled water to extract heme (m/z 616) and hemoglobin from stools and analyze them with the reflectron and linear modes of MALDI-TOF/MS. First, we used simulated stomach acid to decompose hemoglobin and release heme, simulating gastrointestinal bleeding. Second, we used distilled water to extract hemoglobin from stools and detected it in the linear mode of MALDI-TOF/MS; the detection limit of MALDI-TOF/MS for hemoglobin in stool was better than that of the immunochemical tests. Third, the same strategy was applied to stools from fifty-nine patients (including nineteen esophageal cancer patients, twenty gastric cancer patients, and colorectal cancer patients) to detect heme and hemoglobin by MALDI-TOF/MS, and the results were compared with the fecal occult blood test. 
For heme, in some cases MALDI-TOF/MS did not detect heme but the guaiac-based fecal occult blood test was positive, probably because the stools contained oxidants other than heme that reacted with the reagent. In other cases MALDI-TOF/MS detected heme but the guaiac-based test was negative; these cases will be followed up in the future. For hemoglobin, with the immunochemical tests as the reference, the false-negative results of MALDI-TOF/MS might be caused by the complicated matrix effect of stools, which prevents hemoglobin from co-crystallizing well with the CHCA matrix. The false-positive results of MALDI-TOF/MS might result from the criteria used for the hemoglobin signal. hemoglobin; heme; fecal occult blood test; upper gastrointestinal bleeding; Discriminant Analysis; MALDI-TOF/MS; Helicobacter pylori
155	Evaluation of Groundwater Characteristics Using Multivariate Statistical Method: a Case Study in Kaohsiung Wang, Mei-hsueh 24 August 2012 The quality of groundwater bodies is not easy to explain clearly to the public. Even when a large amount of valid water quality data is available, it is hard to combine and summarize, and different agencies often put forward their own interpretations of the test results. Multivariate statistical analysis can reduce highly complex data to a small number of representative factors, clearly explain the inter-relationships among a group of original variables, or cluster and identify data according to their similarity in order to understand the causes behind certain phenomena, so this study uses it to explore groundwater characteristics. Monitoring data for 48 groundwater monitoring wells in Kaohsiung City were taken from the database of the EPA National Water Quality Monitoring Information website, and the SPSS 12.0 software package was used to perform multivariate statistical analysis, including factor analysis, cluster analysis, and discriminant analysis, in order to summarize, sort, and classify water quality characteristics and to evaluate the causes of pollution and local area characteristics. Factor analysis yielded four representative factors for groundwater quality in the Kaohsiung region: a salinization factor, an organic pollution factor, a mineral dissolution factor, and an acid-base factor. These four factors replace the 17 analytical items of regional groundwater quality in Kaohsiung City and account for 78.3% of the variance. 
Cluster analysis divided the 48 monitoring wells in the region into four groups according to the similarities and differences in their monitoring data, and the correlation between the water quality of the wells within each cluster and the main factors was investigated. Comparing wells by location distinguishes the average groundwater quality of the inland area from that of the coastal area: the coastal area shows seawater intrusion and salinization, and monitoring wells located in the Cijin district are affected by the acid-base (pH) factor. Groundwater in the Kaohsiung region is generally hard to very hard. To understand the difference between multivariate statistical analysis and general groundwater pollution index analysis, Piper water quality diamond diagrams were drawn for the clusters and compared. The results show that multivariate statistical analysis provides a systematic analysis of the variables and of the overall variation in water quality, together with objective clustering, whereas a general composite index method such as the Piper diagram identifies the type of pollution from characteristic positions but has difficulty explaining the overall pollution characteristics. Finally, this study aims to recommend pollution control assessment and prevention strategies for groundwater in Kaohsiung City. discriminant analysis; multivariate statistical; factor analysis; Piper water quality diamond diagram; cluster analysis
156	The Model of Credit Rating for Country Risk Chen, Liang-kuang 10 June 2004 none Factor analysis; Credit Ratings; Country Risk; Ordered Logit model; Multiple Discriminant Analysis
157	Analysis Of Sinusoidal And Helical Buckling Of Drill String In Horizontal Wells Using Finite Element Method Arpaci, Erdogan 01 August 2009 The number of horizontal wells is increasing rapidly all over the world as new technologies develop. Horizontal well drilling involves much more complex problems than vertical well drilling, such as reduced load transfer to the bit, tubular failure, tubular fatigue, and tubular lock-up. This makes selecting the appropriate tubular and designing the drill string correctly more important. As the total compression load on the horizontal section increases, the tubular changes from straight to sinusoidal buckling, and if the load continues to increase, the tubular changes to helical buckling. The main objective of this study is to determine critical buckling loads in horizontal wells with the finite element method (FEM). Initially, a computer program (ANSYS) that uses FEM is employed to simulate different tubular and well conditions. Four different pipe sizes, four different wellbore sizes, and three different torque values are used to model the cases. Critical buckling load values corresponding to significant variables are collected from these simulated cases. The results are classified into different buckling modes according to the applied weight-on-bit values and the main properties of the simulated model, such as modulus of elasticity, moment of inertia of the tubular cross section, weight per unit length of the tubular, and radial clearance between the wellbore and the tubular. Then, the boundary equations between the buckling modes are obtained. The equations developed in this thesis by simulating the cases for the specific tubular sizes are used to compare the critical buckling load values from the models in the literature with those of this work. 
It is observed that the results of this work agree more closely with the literature models as the tubular size increases. The influence of torque on critical buckling load values is also investigated, and torque is found to have only a slight effect. In addition, the applicability of ANSYS to buckling problems is demonstrated by comparing the ANSYS results with the results of the literature models and with an experimental study in the literature. QC General 27395
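The classification into buckling modes described in entry 157 can be sketched with closed-form boundaries of the kind found in the literature. The boundary equations derived in the thesis are not given in the abstract, so the sketch below assumes the commonly cited horizontal-well forms instead: a sinusoidal critical load of 2·sqrt(E·I·w/r) (Dawson and Paslay) and a helical critical load of 2·sqrt(2)·sqrt(E·I·w/r) (Chen, Lin, and Cheatham). Function names and the numerical values are hypothetical.

```python
import math

def critical_loads(E, I, w, r):
    """Return (sinusoidal, helical) critical axial loads for a horizontal well.

    E: modulus of elasticity (Pa)
    I: moment of inertia of the tubular cross section (m^4)
    w: weight per unit length of the tubular (N/m)
    r: radial clearance between wellbore and tubular (m)

    ASSUMPTION: literature-style boundaries, not the thesis's FEM-derived ones.
    """
    base = math.sqrt(E * I * w / r)
    f_sinusoidal = 2.0 * base
    f_helical = 2.0 * math.sqrt(2.0) * base
    return f_sinusoidal, f_helical

def buckling_mode(axial_load, E, I, w, r):
    """Classify the tubular as straight, sinusoidal, or helical."""
    f_sinusoidal, f_helical = critical_loads(E, I, w, r)
    if axial_load < f_sinusoidal:
        return "straight"
    if axial_load < f_helical:
        return "sinusoidal"
    return "helical"

# Hypothetical steel tubular; values chosen only for illustration.
params = dict(E=2.1e11, I=1.0e-5, w=300.0, r=0.05)
modes = [buckling_mode(load, **params) for load in (1.0e5, 2.5e5, 4.0e5)]
```

Torque is left out of the sketch, which is consistent with the abstract's finding that it has only a slight effect on the critical loads.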
158	Infinite dimensional discrimination and classification Shin, Hyejin 17 September 2007 Modern data collection methods now frequently return observations that should be viewed as the result of digitized recording or sampling from stochastic processes rather than as vectors of finite length. Despite great demand, only a few classification methodologies for such data have been suggested, and the supporting theory is quite limited. The focus of this dissertation is on discrimination and classification in this infinite dimensional setting. The methodology and theory we develop are based on the abstract canonical correlation concept of Eubank and Hsing (2005), and are motivated by the fact that Fisher's discriminant analysis method is intimately tied to canonical correlation analysis. Specifically, we have developed a theoretical framework for discrimination and classification of sample paths from stochastic processes through use of the Loève-Parzen isomorphism, which connects a second order process to the reproducing kernel Hilbert space generated by its covariance kernel. This approach provides a seamless transition between the finite and infinite dimensional settings and lends itself well to computation via smoothing and regularization. In addition, we have developed a new computational procedure and illustrated it with simulated data and Canadian weather data. Fisher's linear discriminant analysis; Canonical correlation analysis; Stochastic processes; Reproducing kernel Hilbert space; Functional data
159	A discriminant analysis between adolescent sexual offenders and non sexual offenders Hill, Robert A. January 1999 Thesis (Ph. D.)--University of Missouri-Columbia, 1999. / Typescript. Vita. Includes bibliographical references (leaves 36-44). Also available on the Internet.
160	Time series discrimination, signal comparison testing, and model selection in the state-space framework Bengtsson, Thomas January 2000 Thesis (Ph. D.)--University of Missouri-Columbia, 2000. / Typescript. Vita. Includes bibliographical references (leaf 104). Also available on the Internet.
