41

Statistical Learning in Drug Discovery via Clustering and Mixtures

Wang, Xu January 2007 (has links)
In drug discovery, thousands of compounds are assayed to detect activity against a biological target. The goal of drug discovery is to identify compounds that are active against the target (e.g. inhibit a virus). Statistical learning in drug discovery seeks to build a model that uses descriptors characterizing molecular structure to predict biological activity. However, the characteristics of drug discovery data can make it difficult to model the relationship between molecular descriptors and biological activity. Among these characteristics are the rarity of active compounds, the large volume of compounds tested by high-throughput screening, and the complexity of molecular structure and its relationship to activity. This thesis focuses on the design of statistical learning algorithms/models and their applications to drug discovery. The two main parts of the thesis are an algorithm-based statistical method and a more formal model-based approach. Both approaches can facilitate and accelerate the process of developing new drugs. A unifying theme is the use of unsupervised methods as components of supervised learning algorithms/models.

In the first part of the thesis, we explore a sequential screening approach, Cluster Structure-Activity Relationship Analysis (CSARA). Sequential screening integrates high-throughput screening with mathematical modeling to sequentially select the best compounds. CSARA is a cluster-based, algorithm-driven method. To gain further insight into this method, we use three carefully designed experiments to compare its predictive accuracy with Recursive Partitioning, a popular structure-activity relationship analysis method. The experiments show that CSARA outperforms Recursive Partitioning. Comparisons include problems with many descriptor sets and situations in which many descriptors are not important for activity.

In the second part of the thesis, we propose and develop constrained mixture discriminant analysis (CMDA), a model-based method. The main idea of CMDA is to model the distribution of the observations given the class label (e.g. active or inactive) as a constrained mixture distribution, and then use Bayes' rule to predict the probability of being active for each observation in the test set. Constraints are used to deal with the otherwise explosive growth of the number of parameters with increasing dimensionality. CMDA is designed to address several challenges in modeling drug data sets, such as multiple mechanisms, the rare target problem (i.e. imbalanced classes), and the identification of relevant subspaces of descriptors (i.e. variable selection). We focus on the CMDA1 model, in which univariate densities form the building blocks of the mixture components. Because the CMDA1 log-likelihood function is unbounded, the EM algorithm can easily converge to degenerate solutions. A special multi-step EM algorithm is therefore developed and explored via several experimental comparisons. Using the multi-step EM algorithm, the CMDA1 model is compared to model-based clustering discriminant analysis (MclustDA). The CMDA1 model is either superior to or competitive with the MclustDA model, depending on which model generates the data, and it performs better than MclustDA when the data are high-dimensional and unbalanced, an essential feature of the drug discovery problem. An alternate approach to the problem of degeneracy is penalized estimation. By introducing a group of simple penalty functions, we consider penalized maximum likelihood estimation of the CMDA1 and CMDA2 models. This strategy improves the convergence of the conventional EM algorithm and helps avoid degenerate solutions. Extending techniques from Chen et al. (2007), we prove that the PMLEs of the two-dimensional CMDA1 model are asymptotically consistent.
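As a concrete, greatly simplified illustration of the mixture-plus-Bayes'-rule idea, the Python sketch below fits a one-dimensional two-component Gaussian mixture to each class by EM and then scores new observations with Bayes' rule. The variance floor is a simple stand-in for the penalty functions described above (not the thesis's estimator), and the data, class proportions, and names are hypothetical.

    import numpy as np

    def em_gmm_1d(x, k=2, n_iter=100, var_floor=1e-3, seed=0):
        # EM for a 1-D Gaussian mixture; the variance floor keeps the
        # likelihood bounded, avoiding the degenerate solutions noted above.
        rng = np.random.default_rng(seed)
        w, var = np.full(k, 1.0 / k), np.full(k, x.var())
        mu = rng.choice(x, k, replace=False)
        for _ in range(n_iter):
            dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
            r = w * dens                              # E-step: responsibilities
            r /= r.sum(axis=1, keepdims=True)
            n_k = r.sum(axis=0)                       # M-step
            w = n_k / len(x)
            mu = (r * x[:, None]).sum(axis=0) / n_k
            var = np.maximum((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k, var_floor)
        return w, mu, var

    def mix_pdf(x, w, mu, var):
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        return (w * dens).sum(axis=1)

    # Bayes' rule with a rare (5%) active class, mimicking the imbalance.
    rng = np.random.default_rng(1)
    x_act = np.r_[rng.normal(-2, 0.5, 30), rng.normal(2, 0.5, 20)]  # two mechanisms
    x_inact = rng.normal(0.0, 1.0, 950)
    p_act = len(x_act) / (len(x_act) + len(x_inact))
    f_act, f_inact = em_gmm_1d(x_act), em_gmm_1d(x_inact)
    x_new = np.linspace(-3, 3, 7)
    num = p_act * mix_pdf(x_new, *f_act)
    print(num / (num + (1 - p_act) * mix_pdf(x_new, *f_inact)))     # P(active | x)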
42

Network Data Streaming: Algorithms for Network Measurement and Monitoring

Kumar, Abhishek 18 November 2005 (has links)
With the emergence of computer networks as one of the primary modes of communication, and with their adoption for an increasingly wide range of applications, there is a growing need to understand and characterize the traffic they carry. The rise of large-scale network attacks adds urgency to this need. However, the large size, high speed, and increasing complexity of these networks imply that tracking and characterizing the traffic they carry is an increasingly difficult problem. Dealing with higher-level aggregates, such as flows instead of packets, does not solve the problem, because these aggregates tend to be quite numerous and exhibit dynamics of their own. In this thesis, we investigate a novel approach to dealing with the immense amounts of data associated with problems in network measurement and monitoring. Building upon the paradigm of Data Streaming, which processes a large stream of data using a small working memory to answer a class of queries, we develop an architecture for Network Data Streaming that can accommodate the additional constraints imposed in the context of network monitoring. Using this architecture, we design algorithms for monitoring properties of network traffic that have traditionally been considered too difficult to monitor at high-speed network links and routers. Our first algorithm provides the ability to accurately estimate the size of individual flows. A second algorithm, which estimates the distribution of flow sizes, enables network operators to monitor anomalies in the traffic. Incorporating the use of packet sampling, we extend the latter algorithm to estimate the flow size distribution of arbitrary subpopulations. Finally, we apply the tools of Network Data Streaming to the operation of packet sampling itself. Using the ability to efficiently estimate flow statistics, such as approximate per-flow size, we design a family of mechanisms in which the sampling decision is guided by this knowledge. The individual solutions developed in this thesis share a common architectural theme, supporting the monitoring of highly dynamic populations. Integrating this with the traditional sampling-based framework for network monitoring will enable a broad range of applications for accurate and comprehensive monitoring of network traffic.
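The thesis designs its own streaming algorithms; as background on the data-streaming paradigm it builds on, the sketch below implements a standard count-min sketch, which maintains approximate per-flow packet counts in a small fixed memory with one-sided (over-)estimation error. This is a well-known textbook structure used purely for illustration, not one of the thesis's algorithms.

    import hashlib

    class CountMinSketch:
        def __init__(self, width=1024, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _hashes(self, key):
            # depth independent hash positions for the given flow key
            for i in range(self.depth):
                h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8)
                yield int.from_bytes(h.digest(), "big") % self.width

        def update(self, flow_id, count=1):
            for row, col in enumerate(self._hashes(flow_id)):
                self.table[row][col] += count

        def estimate(self, flow_id):
            # the row minimum bounds the error from hash collisions
            return min(self.table[row][col]
                       for row, col in enumerate(self._hashes(flow_id)))

    cms = CountMinSketch()
    for flow in ["10.0.0.1->10.0.0.2"] * 500 + ["10.0.0.3->10.0.0.4"] * 3:
        cms.update(flow)
    print(cms.estimate("10.0.0.1->10.0.0.2"))  # ~500, never an underestimate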
43

EM-Based Joint Detection and Estimation for Two-Way Relay Network

Yen, Kai-wei 01 August 2012 (has links)
In this thesis, the channel estimation problem for a two-way relay network (TWRN) is considered under two different wireless channel assumptions. Previous works have proposed training-based channel estimation methods to obtain the channel state information (CSI). In practice, however, the channel changes from one data block to the next, and the resulting outdated CSI degrades performance; combating this by inserting more training signals costs bandwidth. To improve bandwidth efficiency, we propose a joint channel estimation and data detection method based on the expectation-maximization (EM) algorithm. Simulation results show that the proposed method can combat the effects of the fading channel, with MSE results very close to the Cramer-Rao Lower Bound (CRLB) in the high signal-to-noise ratio (SNR) region. Additionally, compared with the previous work, the proposed scheme achieves better detection performance for both time-varying and time-invariant channels.
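A minimal sketch of the EM iteration for joint channel estimation and symbol detection on a single flat-fading block, assuming BPSK signalling and known noise variance; the TWRN signal model in the thesis is more involved, so treat this as the idea only.

    import numpy as np

    rng = np.random.default_rng(0)
    n, h_true, snr_db = 200, 0.8 + 0.6j, 15           # one flat-fading block
    sigma2 = abs(h_true) ** 2 / 10 ** (snr_db / 10)   # noise variance (assumed known)
    x = rng.choice([-1.0, 1.0], n)                    # BPSK data symbols
    y = h_true * x + (rng.normal(0, np.sqrt(sigma2 / 2), n)
                      + 1j * rng.normal(0, np.sqrt(sigma2 / 2), n))

    h = np.mean(y[:4] * x[:4])                        # crude CSI from 4 pilot symbols
    for _ in range(20):
        # E-step: soft symbol decisions P(x_i = +1 | y_i, h)
        ll_diff = (np.abs(y + h) ** 2 - np.abs(y - h) ** 2) / sigma2
        p = 1 / (1 + np.exp(-ll_diff))
        s = 2 * p - 1                                 # E[x_i | y_i, h]
        # M-step: re-estimate the channel from the soft decisions
        h = np.mean(s * y)
    print(abs(h - h_true))                            # refined CSI without extra training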
44

Computer aided diagnosis in digital mammography: classification of mass and normal tissue

Shinde, Monika. January 2003 (has links)
Thesis (M.S.C.S.)--University of South Florida, 2003. Includes bibliographical references.

The work presented here is an important component of an ongoing project to develop an automated mass classification system for breast cancer screening and diagnosis in digital mammography. Specifically, this work investigates the task of automatically separating mass tissue from normal breast tissue, given a region of interest in a digitized mammogram. This is a crucial stage in developing a robust automated classification system, because classification depends on accurate assessment of the tumor-normal tissue border as well as information gathered from the tumor area. In this work the expectation-maximization (EM) method is developed and applied to high-resolution digitized screen-film mammograms with the aim of segmenting normal tissue from mass tissue.

Both the raw data and summary data generated by Laws' texture analysis are investigated. Since the ultimate goal is robust classification, the merits of the tissue segmentation are assessed by its impact on overall classification performance. Based on a 300-image dataset consisting of 97 malignant and 203 benign cases, a 63% sensitivity and 89% specificity was achieved. Although the segmentation requires further investigation, the development and related computer coding of the EM algorithm was successful. The method was developed to take into account the input feature correlation. This development allows other researchers at this facility to investigate various input features without needing an intricate understanding of the EM approach.
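A minimal sketch of the segmentation step: a two-component multivariate Gaussian mixture fit by EM to per-pixel feature vectors (e.g. intensity plus one Laws' texture energy), with full covariances so the input-feature correlation mentioned above is modelled. The data below are synthetic and the feature choice is an assumption, not the thesis's pipeline.

    import numpy as np

    def em_segment(X, n_iter=50):
        n, d = X.shape
        rng = np.random.default_rng(0)
        w = np.array([0.5, 0.5])
        mu = X[rng.choice(n, 2, replace=False)]
        cov = np.array([np.cov(X.T) for _ in range(2)])
        for _ in range(n_iter):
            r = np.empty((n, 2))
            for k in range(2):                         # E-step
                diff = X - mu[k]
                quad = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov[k]), diff)
                r[:, k] = w[k] * np.exp(-0.5 * quad) / np.sqrt(
                    (2 * np.pi) ** d * np.linalg.det(cov[k]))
            r /= r.sum(axis=1, keepdims=True)
            n_k = r.sum(axis=0)                        # M-step
            w, mu = n_k / n, (r.T @ X) / n_k[:, None]
            for k in range(2):
                diff = X - mu[k]
                cov[k] = (r[:, k, None] * diff).T @ diff / n_k[k] + 1e-6 * np.eye(d)
        return r.argmax(axis=1)                        # per-pixel label: mass vs. normal

    rng = np.random.default_rng(1)
    normal = rng.multivariate_normal([90, 0.2], [[144, 1.5], [1.5, 0.04]], 4000)
    mass = rng.multivariate_normal([150, 0.6], [[400, 3.0], [3.0, 0.09]], 1000)
    print(np.bincount(em_segment(np.vstack([normal, mass]))))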
45

Analysis of circular data in the dynamic model and mixture of von Mises distributions

Lan, Tian 10 December 2013 (has links)
Analysis of circular data is becoming more and more popular in many fields of study. In this report, I present two statistical analyses of circular data using von Mises distributions. First, the expectation-maximization (EM) algorithm is reviewed and used to classify and estimate circular data from a mixture of von Mises distributions. Second, the Forward Filtering Backward Smoothing method via particle filtering is reviewed and implemented for circular data arising in dynamic state-space models.
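A minimal sketch of the EM step for a mixture of von Mises distributions, using the standard approximation to the inverse of A(kappa) = I1(kappa)/I0(kappa) in the concentration update; the data are synthetic, and this illustrates the reviewed method rather than reproducing the report's code.

    import numpy as np
    from scipy.special import i0

    def vm_pdf(theta, mu, kappa):
        return np.exp(kappa * np.cos(theta - mu)) / (2 * np.pi * i0(kappa))

    def em_vonmises(theta, k=2, n_iter=100):
        rng = np.random.default_rng(0)
        w, kappa = np.full(k, 1 / k), np.ones(k)
        mu = rng.uniform(-np.pi, np.pi, k)
        for _ in range(n_iter):
            r = w * vm_pdf(theta[:, None], mu, kappa)  # E-step: responsibilities
            r /= r.sum(axis=1, keepdims=True)
            n_k = r.sum(axis=0)                        # M-step
            w = n_k / len(theta)
            c = (r * np.cos(theta[:, None])).sum(axis=0)
            s = (r * np.sin(theta[:, None])).sum(axis=0)
            mu = np.arctan2(s, c)                      # mean directions
            rbar = np.clip(np.sqrt(c ** 2 + s ** 2) / n_k, 1e-6, 1 - 1e-6)
            kappa = rbar * (2 - rbar ** 2) / (1 - rbar ** 2)  # concentration approx.
        return w, mu, kappa, r.argmax(axis=1)          # classification by posterior

    rng = np.random.default_rng(1)
    theta = np.r_[rng.vonmises(0.5, 8, 300), rng.vonmises(-2.0, 4, 200)]
    w, mu, kappa, labels = em_vonmises(theta)
    print(np.round(mu, 2), np.round(kappa, 2))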
46

Weakly supervised part-of-speech tagging for Chinese using label propagation

Ding, Weiwei 02 February 2012 (has links)
Part-of-speech (POS) tagging is one of the most fundamental and crucial tasks in Natural Language Processing. Chinese POS tagging is challenging because it also involves word segmentation. This report focuses on how to improve unsupervised POS tagging using Hidden Markov Models and the expectation-maximization parameter estimation approach (EM-HMM). The traditional EM-HMM system uses a dictionary to constrain possible tag sequences and initialize the model parameters. This is a very crude initialization: the emission parameters are set uniformly in accordance with the tag dictionary. To improve on this, word alignments can be used. Word alignments are the word-level translation correspondence pairs generated from parallel text between two languages; in this report, Chinese-English word alignment is used. Performance is expected to improve, because the two resources are complementary: the dictionary provides information on word types, while word alignments provide information on word tokens. However, word alignments are found to be of limited benefit. This report therefore proposes another method. To improve dictionary coverage and obtain better POS distributions, Modified Adsorption, a label propagation algorithm, is used. We construct a graph connecting word tokens to feature types (such as word unigrams and bigrams) and connecting those tokens to information from knowledge sources, such as a small tag dictionary, Wiktionary, and word alignments. The core idea is to use a small amount of supervision, in the form of a tag dictionary, to acquire POS distributions for each word (both known and unknown) and provide these as an improved initialization for EM learning of the HMM. We find this strategy works very well, especially when the tag dictionary is small. Label propagation provides a better initialization for the EM-HMM method because it greatly increases the coverage of the dictionary. In addition, label propagation is quite flexible and can incorporate many kinds of knowledge. However, results also show that some resources, such as word alignments, are not easily exploited with label propagation.
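The sketch below contrasts the two initializations for the EM-HMM emission matrix: the crude uniform split over dictionary-allowed tags versus soft per-word POS distributions of the kind label propagation produces. The tiny dictionary and the propagated numbers are hypothetical, chosen only to show the mechanics.

    import numpy as np

    tags = ["N", "V", "D"]
    tag_dict = {"the": {"D"}, "dog": {"N"}, "runs": {"V", "N"}}   # toy tag dictionary
    vocab = sorted(tag_dict)

    def uniform_init(allowed):
        # traditional init: emission mass spread uniformly over allowed tags
        B = np.zeros((len(tags), len(vocab)))
        for j, word in enumerate(vocab):
            for t in allowed[word]:
                B[tags.index(t), j] = 1 / len(allowed[word])
        return B / B.sum(axis=1, keepdims=True)        # normalise each tag row

    def soft_init(dists):
        # label-propagation-style init: per-word soft POS distributions
        B = np.zeros((len(tags), len(vocab)))
        for j, word in enumerate(vocab):
            for t, p in dists[word].items():
                B[tags.index(t), j] = p
        return B / B.sum(axis=1, keepdims=True)

    propagated = {"the": {"D": 0.95, "N": 0.05},       # hypothetical LP output
                  "dog": {"N": 0.90, "V": 0.10},
                  "runs": {"V": 0.80, "N": 0.20}}
    print(uniform_init(tag_dict))
    print(soft_init(propagated))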
47

Statistical Analysis of Operational Data for Manufacturing System Performance Improvement

Wang, Zhenrui January 2013 (has links)
The performance of a manufacturing system relies on four types of elements: operators, machines, the computer system, and the material handling system. To ensure the performance of these elements, operational data containing various aspects of information are collected for monitoring and analysis. This dissertation focuses on operator performance evaluation and machine failure prediction. The proposed research is motivated by the following challenges in analyzing operational data: (i) the complex relationships between the variables; (ii) the implicit information important to failure prediction; and (iii) data with outliers and missing or erroneous measurements. To overcome these challenges, the following research has been conducted. To compare operator performance, a methodology combining regression modeling and a multiple comparisons technique is proposed. The regression model quantifies and removes the complex effects of other impacting factors on operator performance. A robust zero-inflated Poisson (ZIP) model is developed to reduce the impact of excessive zeros and outliers in the performance metric, i.e. the number of defects (NoD), on the regression analysis. The model residuals are plotted in non-parametric statistical charts for performance comparison, and the estimated model coefficients are used to identify under-performing machines. To detect temporal patterns in operational data sequences, an algorithm is proposed for detecting interval-based asynchronous periodic patterns (APP); it detects patterns effectively and efficiently through modified clustering and a convolution-based template matching method. To predict machine failures from covariates with erroneous measurements, a new method is proposed for statistical inference of the proportional hazards model under a mixture of classical and Berkson errors. The method estimates the model coefficients with an expectation-maximization (EM) algorithm whose expectation step is achieved by Monte Carlo simulation. The model estimated with the proposed method improves the accuracy of inference on machine failure probability. The research presented in this dissertation provides a package of solutions to improve manufacturing system performance. The effectiveness and efficiency of the proposed methodologies have been demonstrated and justified with both numerical simulations and real-world case studies.
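To make the ZIP component concrete, the sketch below fits an intercept-only zero-inflated Poisson to synthetic defect counts by EM, with "is this observation a structural zero" as the latent indicator. The dissertation's robust ZIP regression includes covariates and outlier protection; this shows only the distributional core, on hypothetical data.

    import numpy as np

    def fit_zip(y, n_iter=100):
        pi, lam = 0.5, y[y > 0].mean()                # crude starting values
        for _ in range(n_iter):
            # E-step: P(structural zero | y); positive counts cannot be structural
            z = np.where(y == 0, pi / (pi + (1 - pi) * np.exp(-lam)), 0.0)
            # M-step: update the inflation probability and the Poisson mean
            pi = z.mean()
            lam = ((1 - z) * y).sum() / (1 - z).sum()
        return pi, lam

    rng = np.random.default_rng(0)
    y = np.where(rng.random(2000) < 0.3, 0, rng.poisson(2.5, 2000))  # 30% extra zeros
    print(fit_zip(y))   # recovers roughly (0.3, 2.5)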
48

Time series analysis of Saudi Arabia oil production data

Albarrak, Abdulmajeed Barrak 14 December 2013 (has links)
Saudi Arabia is the largest petroleum producer and exporter in the world, and its economy depends heavily on the production and export of oil; this motivates our research on Saudi Arabian oil production. The prime objective of this research is to find the most appropriate models for analyzing Saudi Arabia oil production data. Initially we consider integrated autoregressive moving average (ARIMA) models to fit the data, but most of the variables under study show some kind of volatility, so we instead consider autoregressive conditional heteroscedastic (ARCH) models; if there is no ARCH effect, the model automatically reduces to an ARIMA model. However, the existence of missing values in almost every variable complicates the analysis, since parameter estimation in an ARCH model does not converge when observations are missing. As a remedy, we first estimate the missing observations, employing the expectation-maximization (EM) algorithm. Because our data are time series, a simple EM algorithm is not appropriate, and there is also evidence of outliers in the data. We therefore employ a robust-regression, least trimmed squares (LTS) based EM algorithm to estimate the missing values. After estimating the missing values, we employ the White test to select the most appropriate ARCH models for all sixteen variables under study. A normality test on the resulting residuals is performed for each variable to check the validity of the fitted model.

Contents: ARCH/GARCH models, outliers and robustness: tests for normality and estimation of missing values in time series -- Outlier analysis and estimation of missing values by robust EM algorithm for Saudi Arabia oil production data -- Selection of ARCH models for Saudi Arabia oil production data.
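A sketch of the flavor of an LTS-based, EM-style imputation on a synthetic AR(1) series: alternately fit x_t on x_{t-1} by a crude least trimmed squares (refitting on the points with the smallest residuals, so outliers are trimmed) and refill the missing values from the fit. The thesis's algorithm for ARCH-type data is more elaborate; treat every detail here as an illustrative assumption.

    import numpy as np

    def lts_slope(x, y, trim=0.1, n_iter=20):
        # crude LTS for y = a + b*x: iteratively refit OLS on the
        # (1 - trim) fraction of points with the smallest squared residuals
        keep = np.ones(len(x), bool)
        for _ in range(n_iter):
            b, a = np.polyfit(x[keep], y[keep], 1)
            res2 = (y - (a + b * x)) ** 2
            keep = res2 <= np.quantile(res2, 1 - trim)
        return a, b

    def em_impute_ar1(series, n_iter=30):
        x, miss = series.copy(), np.isnan(series)
        x[miss] = np.nanmean(series)                   # crude start
        for _ in range(n_iter):
            a, b = lts_slope(x[:-1], x[1:])            # robust AR(1) fit
            for t in np.where(miss)[0]:                # refill from the fit
                x[t] = a + b * x[t - 1] if t > 0 else np.nanmean(series)
        return x

    rng = np.random.default_rng(0)
    z = [0.0]
    for _ in range(499):
        z.append(0.8 * z[-1] + rng.normal())
    z = np.array(z)
    z[rng.choice(500, 25, replace=False)] = np.nan     # knock out 5% of points
    print(np.isnan(em_impute_ar1(z)).sum())            # 0: all points filled in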
49

Towards Finding Optimal Mixture Of Subspaces For Data Classification

Musa, Mohamed Elhafiz Mustafa 01 October 2003 (has links)
In pattern recognition, when data has different structures in different parts of the input space, fitting one global model can be slow and inaccurate. Local learning methods can quickly learn the structure of the data in local regions, consequently offering faster and more accurate model fitting. However, breaking the training data into smaller subsets may lead to the curse of dimensionality, as a training subset may not be large enough to estimate the required parameters for the submodels, and increasing the size of the training data may not be possible in many situations. Interestingly, the data in local regions becomes more correlated; therefore, decorrelation methods can reduce the data dimensions and hence the number of parameters. In other words, we can find uncorrelated low-dimensional subspaces that capture most of the data variability. Current subspace modelling methods have shown better performance than global modelling methods for this type of training data structure. Nevertheless, these methods still need more research, as they suffer from two limitations: (i) there is no standard method to specify the optimal number of subspaces, and (ii) there is no standard method to specify the optimal dimensionality of each subspace. In current models these two parameters are determined beforehand. In this dissertation we propose and test algorithms that try to find a suboptimal number of principal subspaces and a suboptimal dimensionality for each principal subspace automatically.
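A sketch of the underlying idea under simple assumptions: partition the data into local regions (plain k-means here), then give each region its own PCA subspace, choosing the dimensionality per region as the smallest number of components explaining a set fraction of the local variance rather than fixing it beforehand. This illustrates the problem setting, not the dissertation's algorithms.

    import numpy as np

    def local_subspaces(X, k, var_kept=0.9, n_iter=50):
        rng = np.random.default_rng(0)
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(n_iter):                        # plain k-means partition
            labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
            centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        subspaces = []
        for j in range(k):                             # local PCA per region
            _, s, Vt = np.linalg.svd(X[labels == j] - centers[j], full_matrices=False)
            share = s ** 2 / (s ** 2).sum()
            d = int(np.searchsorted(np.cumsum(share), var_kept)) + 1
            subspaces.append(Vt[:d])                   # just enough components
        return labels, centers, subspaces

    rng = np.random.default_rng(1)
    A = rng.normal(size=(300, 5)) @ np.diag([3, 1, .1, .1, .1]) + 5
    B = rng.normal(size=(300, 5)) @ np.diag([.1, 2, 2, .1, .1]) - 5
    labels, centers, subs = local_subspaces(np.vstack([A, B]), k=2)
    print([S.shape[0] for S in subs])                  # per-region dimensionality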
50

Estimating parameters in Markov models for longitudinal studies with missing data or surrogate outcomes

Yeh, Hung-Wen. Chan, Wenyaw. January 2007 (has links)
Thesis (Ph.D.)--University of Texas Health Science Center at Houston, School of Public Health, 2007. Includes bibliographical references (leaves 58-59).
