Global ETD Search

1	Robust Discriminant Analysis With Asymmetric Classes Ndwapi, Nkumbuludzi January 2018 Discriminant analysis uses labelled observations to infer the labels of unlabelled observations in a population. Despite many advances in unsupervised and, to a lesser extent, semi-supervised learning over the past decade, discriminant analysis is often employed using approaches that date back to the very well-known work of Fisher in the 1930s. One notable exception is mixture discriminant analysis, where the labels are estimated using parametric finite mixture models, commonly the Gaussian mixture model. The supposed advantage of mixture discriminant analysis is that multiple Gaussian components can be used for each class, hence providing a workaround when a class is not Gaussian. This thesis makes several contributions to "modern" discriminant analysis. Three robust discriminant analysis methods are introduced using mixtures of multivariate t-distributions, mixtures of multivariate power exponential distributions, and mixtures of contaminated Gaussian distributions, respectively. This provides an appealing framework for handling varying tail-weights and peakedness in classes that may also contain mild outliers. To facilitate the modelling of asymmetric classes, we also explore robust discriminant analysis via finite mixtures of generalized hyperbolic distributions and mixtures of multivariate skew-t distributions. These approaches are tailored towards skewed classes but also have the added advantage of modelling symmetric classes where necessary. Finally, we introduce an approach that combines support vector machines with mixture discriminant analysis. This approach defines class boundaries in the labelled observations and, in some sense, improves mixture discriminant analysis performance. Crucially, in all of our mixture modelling work, we consider the case where the number of components per class is one. 
The utility of the approaches introduced is demonstrated on simulated and real data sets. / Thesis / Doctor of Philosophy (PhD) Discriminant analysis mixture discriminant analysis
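As a rough illustration of why heavier-tailed class densities can make discriminant analysis more robust, the following Python sketch classifies a single univariate observation using one component per class and Bayes' rule. It is not the thesis's method: the univariate setting, the class parameters, the degrees of freedom, and the function names are all assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def student_t_pdf(x, mu, sigma, nu):
    """Density of a univariate location-scale Student-t with nu degrees of freedom."""
    z = (x - mu) / sigma
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(z * z / nu))

def posterior(x, priors, densities):
    """Bayes' rule: P(class g | x) is proportional to prior_g * f_g(x)."""
    weighted = [p * f(x) for p, f in zip(priors, densities)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical two-class problem: class 0 centred at 0, class 1 centred at 4.
priors = [0.5, 0.5]
gauss = [lambda x: gaussian_pdf(x, 0.0, 1.0), lambda x: gaussian_pdf(x, 4.0, 1.0)]
heavy = [lambda x: student_t_pdf(x, 0.0, 1.0, 3.0), lambda x: student_t_pdf(x, 4.0, 1.0, 3.0)]

x_outlier = 10.0  # far from both class centres
post_gauss = posterior(x_outlier, priors, gauss)
post_t = posterior(x_outlier, priors, heavy)
print("Gaussian posterior:", post_gauss)
print("Student-t posterior:", post_t)
```

Both models assign the outlying point to class 1, but the t-based posterior is far less extreme, which is one sense in which heavy tails temper the influence of mild outliers.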
2	Statistical Learning in Drug Discovery via Clustering and Mixtures Wang, Xu January 2007 In drug discovery, thousands of compounds are assayed to detect activity against a biological target. The goal of drug discovery is to identify compounds that are active against the target (e.g. inhibit a virus). Statistical learning in drug discovery seeks to build a model that uses descriptors characterizing molecular structure to predict biological activity. However, the characteristics of drug discovery data can make it difficult to model the relationship between molecular descriptors and biological activity. Among these characteristics are the rarity of active compounds, the large volume of compounds tested by high-throughput screening, and the complexity of molecular structure and its relationship to activity. This thesis focuses on the design of statistical learning algorithms/models and their applications to drug discovery. The two main parts of the thesis are an algorithm-based statistical method and a more formal model-based approach. Both approaches can facilitate and accelerate the process of developing new drugs. A unifying theme is the use of unsupervised methods as components of supervised learning algorithms/models. In the first part of the thesis, we explore a sequential screening approach, Cluster Structure-Activity Relationship Analysis (CSARA). Sequential screening integrates high-throughput screening with mathematical modeling to sequentially select the best compounds. CSARA is a cluster-based and algorithm-driven method. To gain further insight into this method, we use three carefully designed experiments to compare its predictive accuracy with that of Recursive Partitioning, a popular structure-activity relationship analysis method. The experiments show that CSARA outperforms Recursive Partitioning. Comparisons include problems with many descriptor sets and situations in which many descriptors are not important for activity. 
In the second part of the thesis, we propose and develop constrained mixture discriminant analysis (CMDA), a model-based method. The main idea of CMDA is to model the distribution of the observations given the class label (e.g. active or inactive class) as a constrained mixture distribution, and then use Bayes’ rule to predict the probability of being active for each observation in the testing set. Constraints are used to deal with the otherwise explosive growth of the number of parameters with increasing dimensionality. CMDA is designed to solve several challenges in modeling drug data sets, such as multiple mechanisms, the rare target problem (i.e. imbalanced classes), and the identification of relevant subspaces of descriptors (i.e. variable selection). We focus on the CMDA1 model, in which univariate densities form the building blocks of the mixture components. Due to the unboundedness of the CMDA1 log-likelihood function, it is easy for the EM algorithm to converge to degenerate solutions. A special multi-step EM algorithm is therefore developed and explored via several experimental comparisons. Using the multi-step EM algorithm, the CMDA1 model is compared to model-based clustering discriminant analysis (MclustDA). The CMDA1 model is either superior to or competitive with the MclustDA model, depending on which model generates the data. The CMDA1 model has better performance than the MclustDA model when the data are high-dimensional and unbalanced, an essential feature of the drug discovery problem. An alternative approach to the problem of degeneracy is penalized estimation. By introducing a group of simple penalty functions, we consider penalized maximum likelihood estimation of the CMDA1 and CMDA2 models. This strategy improves the convergence of the conventional EM algorithm and helps avoid degenerate solutions. Extending techniques from Chen et al. (2007), we prove that the penalized maximum likelihood estimators (PMLEs) of the two-dimensional CMDA1 model can be asymptotically consistent. 
Statistics
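The prediction step described in the abstract above (class-conditional mixture densities combined through Bayes' rule) can be sketched in Python as follows. The abstract does not give the exact CMDA1 form, so this sketch assumes each mixture component is a product of independent univariate normal densities; all parameter values and function names are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Univariate normal density."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def component_density(x, means, sds):
    """Assumed component form: product of univariate normals, one per descriptor."""
    return math.prod(normal_pdf(xi, m, s) for xi, m, s in zip(x, means, sds))

def mixture_density(x, weights, components):
    """Weighted sum of component densities for one class."""
    return sum(w * component_density(x, means, sds)
               for w, (means, sds) in zip(weights, components))

def prob_active(x, prior_active, active_mix, inactive_mix):
    """Bayes' rule: posterior probability that observation x is active."""
    fa = prior_active * mixture_density(x, *active_mix)
    fi = (1.0 - prior_active) * mixture_density(x, *inactive_mix)
    return fa / (fa + fi)

# Hypothetical setup: actives are rare (5%) and arise from two "mechanisms".
active_mix = ([0.5, 0.5], [([2.0, 0.0], [0.5, 1.0]), ([-2.0, 0.0], [0.5, 1.0])])
inactive_mix = ([1.0], [([0.0, 0.0], [1.0, 1.0])])

p_near = prob_active([2.0, 0.0], 0.05, active_mix, inactive_mix)  # at an active centre
p_far = prob_active([0.0, 0.0], 0.05, active_mix, inactive_mix)   # at the inactive centre
print(p_near, p_far)
```

Because the prior for the active class is small, even a point at the centre of an active component receives only a moderate posterior probability, which illustrates the rare target problem mentioned in the abstract.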
4	Random Forest Analogues for Mixture Discriminant Analysis Mallo, Muz 09 June 2022 Finite mixture modelling is a powerful and well-developed paradigm, having proven useful in unsupervised learning and, to a lesser extent, supervised learning (mixture discriminant analysis), especially in the case(s) of data with local variation and/or latent variables. It is the aim of this thesis to improve upon mixture discriminant analysis by introducing two types of random forest analogues which are called MixForests. The first MixForest is based on Gaussian mixture models from the famous family of Gaussian parsimonious clustering models and will be useful in classifying lower dimensional data. The second MixForest extends the technique to higher dimensional data via the use of mixtures of factor analyzers from the well-known family of parsimonious Gaussian mixture models. MixForests will be utilized in the analysis of real data to demonstrate potential increases in classification accuracy as well as inferential procedures such as generalization error estimation and variable importance measures. / Thesis / Doctor of Philosophy (PhD) finite mixture models Gaussian mixture models ensemble methods mixture discriminant analysis
5	Dynamical analysis of respiratory signals for diagnosis of sleep disordered breathing disorders. Suren Rathnayake Unknown Date Sleep disordered breathing (SDB) is a highly prevalent but under-diagnosed disease. Among adults aged between 30 and 60 years, 24% of males and 9% of females show conditions of SDB, while 82% of men and 93% of women with moderate to severe SDB remain undiagnosed. Polysomnography (PSG) is the reference diagnostic test for SDB. During PSG, a number of physiological signals are recorded during an overnight sleep and then manually scored for sleep/wake stages and SDB events to obtain the reference diagnosis. The manual scoring of SDB events is an extremely time-consuming and cumbersome task with high inter- and intra-rater variations. PSG is a labour-intensive, expensive and patient-inconvenient test. Further, PSG facilities are limited, leading to long waiting lists. There is an enormous clinical need for automation of PSG scoring and for an alternative automated ambulatory method suitable for screening the population. In this thesis, we focus on (1) implementing a framework that enables more reliable scoring of SDB events and also lowers manual scoring time, and (2) implementing a reliable automated screening procedure that can be used as a patient-friendly home-based study. The recordings of physiological measurements obtained during patients’ sleep often suffer from data losses, interferences and artefacts. In a typical sleep scoring session, artefact-corrupted signal segments are visually detected and removed from further consideration. We developed a novel framework for automated artefact detection and signal restoration, based on the redundancy among respiratory flow signals. The signals focused on are the airflow (thermistor sensors) and nasal pressure signals that are clinically significant in detecting respiratory disturbances. 
We treat the respiratory system as a dynamical system, and use the celebrated Takens embedding theorem as the theoretical basis for signal prediction. In this study, we categorise commonly occurring artefacts and distortions in the airflow and nasal pressure measurements into several groups and explore the efficacy of the proposed technique in detecting/recovering them. Results we obtained from a database of clinical PSG signals indicated that the proposed technique can detect artefacts/distortions with a sensitivity >88% and specificity >92%. This work has the potential to simplify the work done by sleep scoring technicians, and also to improve automated sleep scoring methods. During the next phase of the thesis we investigated the diagnostic ability of single- and dual-channel respiratory flow measuring devices. Recent studies have shown that single-channel respiratory flow measurements can be used for automated diagnosis/screening for sleep disordered breathing (SDB) diseases. Improvements for reliable home-based monitoring for SDB may be achieved with the use of predictors based on recurrence quantification analysis (RQA). RQA essentially measures the complex structures present in a time series and is relatively independent of the nonlinearities present in the respiratory measurements, such as those due to breathing nonlinearities and sensor movements. The nasal pressure, thermistor-based airflow, abdominal movement and thoracic movement measurements obtained during polysomnography were used in this study to implement an algorithm for automated screening for SDB diseases. The algorithm predicts SDB-affected measurement segments using twelve features based on RQA, body mass index (BMI) and neck circumference using mixture discriminant analysis (MDA). The rate of SDB-affected segments of data per hour of recording (RDIS) is used as a measure for the diagnosis of SDB diseases. 
The operating points to be chosen were the prior probability of SDB-affected data segments (π1) and the RDIS threshold value, above which a patient is predicted to have an SDB disease. Cross-validation with five folds, stratified based on the RDI values of the recordings, was used in estimating the operating points. Sensitivity and specificity rates for the final classifier were estimated using a two-layer assessment approach with the operating points chosen at the inner layer using five-fold cross-validation and the choice assessed at the outer layer using repeated learning-testing. The nasal pressure measurement showed higher accuracy compared to other respiratory measurements when used alone. The nasal pressure and thoracic movement measurements were identified as the best pair of measurements to be used in a dual-channel device. The estimated sensitivity and specificity (standard error) in diagnosing SDB disease (RDI ≥ 15) are 90.3 (3.1)% and 88.3 (5.5)% when nasal pressure is used alone, and 89.5 (3.7)% and 100.0 (0.0)% when it is used together with the thoracic movement. Present results suggest that RQA of a single respiratory measurement has potential to be used in an automated SDB screening device, while more reliable accuracy can be expected with a dual-channel device. Improvements may be possible by including other RQA-based features and optimisation of the parameters. nonlinear signal analysis dynamic systems modelling sleep disordered breathing neural networks recurrence quantification analysis mixture discriminant analysis cross-validation
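To make the two signal-analysis ideas in the abstract above more concrete, the following Python sketch builds delay vectors in the spirit of the Takens embedding and computes the recurrence rate, one of the simplest RQA measures. The abstract does not specify the thesis's embedding parameters or its twelve features, so the dimension, lag, radius, test signal, and function names here are assumptions for illustration only.

```python
import math

def delay_embed(series, dim, lag):
    """Return delay vectors (x[t], x[t+lag], ..., x[t+(dim-1)*lag])."""
    n = len(series) - (dim - 1) * lag
    if n < 1:
        raise ValueError("series too short for this dim and lag")
    return [tuple(series[t + k * lag] for k in range(dim)) for t in range(n)]

def recurrence_rate(vectors, radius):
    """Fraction of distinct pairs of states within `radius` (Euclidean distance).

    The main diagonal (each state compared with itself) is excluded,
    which is one common convention and an assumption of this sketch.
    """
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two state vectors")
    close = 0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(vectors[i], vectors[j]) <= radius:
                close += 1
    return close / (n * (n - 1) / 2)

# Hypothetical stand-in for a breathing signal: a sine wave with a 20-sample period.
signal = [math.sin(2.0 * math.pi * t / 20.0) for t in range(200)]
states = delay_embed(signal, dim=3, lag=5)
rr = recurrence_rate(states, radius=0.2)
print(len(states), rr)
```

A strongly periodic signal revisits the same region of the reconstructed state space on every cycle, so its recurrence rate is comparatively high; irregular or disturbed segments would be expected to recur less often.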
