41 |
Statistical Learning in Drug Discovery via Clustering and Mixtures. Wang, Xu. January 2007 (has links)
In drug discovery, thousands of compounds are assayed to detect activity against a
biological target. The goal of drug discovery is to identify compounds that are active against the target (e.g. inhibit a virus). Statistical learning in drug discovery seeks to build a model that uses descriptors characterizing molecular structure to predict biological activity. However, the characteristics of drug discovery data can make it difficult to model the relationship between molecular descriptors and biological activity. Among these characteristics are the rarity of active compounds, the large
volume of compounds tested by high-throughput screening, and the complexity of
molecular structure and its relationship to activity.
This thesis focuses on the design of statistical learning algorithms/models and
their applications to drug discovery. The two main parts of the thesis are: an
algorithm-based statistical method and a more formal model-based approach. Both
approaches can facilitate and accelerate the process of developing new drugs. A
unifying theme is the use of unsupervised methods as components of supervised
learning algorithms/models.
In the first part of the thesis, we explore a sequential screening approach, Cluster
Structure-Activity Relationship Analysis (CSARA). Sequential screening integrates
High Throughput Screening with mathematical modeling to sequentially select the
best compounds. CSARA is a cluster-based and algorithm driven method. To
gain further insight into this method, we use three carefully designed experiments
to compare predictive accuracy with Recursive Partitioning, a popular structure-activity
relationship analysis method. The experiments show that CSARA outperforms
Recursive Partitioning. Comparisons include problems with many descriptor
sets and situations in which many descriptors are not important for activity.
In the second part of the thesis, we propose and develop constrained mixture
discriminant analysis (CMDA), a model-based method. The main idea of CMDA
is to model the distribution of the observations given the class label (e.g. active
or inactive class) as a constrained mixture distribution, and then use Bayes’ rule
to predict the probability of being active for each observation in the testing set.
Constraints are used to deal with the otherwise explosive growth of the number
of parameters with increasing dimensionality. CMDA is designed to solve several
challenges in modeling drug data sets, such as multiple mechanisms, the rare target
problem (i.e. imbalanced classes), and the identification of relevant subspaces of
descriptors (i.e. variable selection).
We focus on the CMDA1 model, in which univariate densities form the building
blocks of the mixture components. Due to the unboundedness of the CMDA1 log
likelihood function, it is easy for the EM algorithm to converge to degenerate solutions.
A special multi-step EM algorithm is therefore developed and explored via
several experimental comparisons. Using the multi-step EM algorithm, the CMDA1
model is compared to model-based clustering discriminant analysis (MclustDA).
The CMDA1 model is either superior to or competitive with the MclustDA model,
depending on which model generates the data. The CMDA1 model has better
performance than the MclustDA model when the data are high-dimensional and
unbalanced, an essential feature of the drug discovery problem.
An alternate approach to the problem of degeneracy is penalized estimation. By
introducing a group of simple penalty functions, we consider penalized maximum
likelihood estimation of the CMDA1 and CMDA2 models. This strategy improves
the convergence of the conventional EM algorithm, and helps avoid degenerate
solutions. Extending techniques from Chen et al. (2007), we prove that the PMLEs
of the two-dimensional CMDA1 model are asymptotically consistent.
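As a toy sketch of the classification rule this abstract describes (hypothetical parameter values; each mixture component's density is a product of univariate normal densities, in the spirit of CMDA1, and Bayes' rule converts class-conditional densities into a posterior probability of activity):

```python
import math

def normal_pdf(x, mu, sigma):
    # Univariate normal density: the building block of the components
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(x, components):
    # components: list of (weight, [(mu, sigma) for each descriptor]).
    # Each component's density is a product of univariate densities.
    total = 0.0
    for weight, params in components:
        dens = weight
        for xi, (mu, sigma) in zip(x, params):
            dens *= normal_pdf(xi, mu, sigma)
        total += dens
    return total

def posterior_active(x, active_mix, inactive_mix, prior_active):
    # Bayes' rule: P(active | x) from the class-conditional mixtures
    num = prior_active * mixture_density(x, active_mix)
    den = num + (1.0 - prior_active) * mixture_density(x, inactive_mix)
    return num / den
```

Even with a small prior on activity (the rare-target setting), a compound lying near an active-class mode still receives a posterior close to one.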
|
42 |
Network Data Streaming: Algorithms for Network Measurement and Monitoring. Kumar, Abhishek. 18 November 2005 (has links)
With the emergence of computer networks as one of the primary modes of
communication, and with their adoption for an increasingly wide range
of applications, there is a growing need to understand and
characterize the traffic they carry. The rise of large scale
network attacks adds urgency to this need. However, the large size,
high speed and increasing complexity of these networks imply that
tracking and characterizing the traffic they carry is an increasingly
difficult problem. Dealing with higher level aggregates, such as flows
instead of packets, does not solve the problem because these
aggregates tend to be quite numerous and exhibit dynamics of their
own.
In this thesis, we investigate a novel approach to deal with the
immense amounts of data associated with problems in network
measurement and monitoring. Building upon the paradigm of Data
Streaming, which processes a large stream of data using a small
working memory to answer a class of queries, we develop an
architecture for Network Data Streaming that can accommodate
additional constraints imposed in the context of network monitoring.
Using this architecture, we design algorithms for monitoring
properties of network traffic that have traditionally been considered
too difficult to monitor at high speed network links and routers. Our
first algorithm provides the ability to accurately estimate the size
of individual flows. A second algorithm to estimate the distribution of
flow sizes enables network operators to monitor anomalies in the
traffic. Incorporating the use of packet sampling, we can extend the
latter algorithm to estimate the flow size distribution of arbitrary
subpopulations.
Finally, we apply the tools of Network Data Streaming to the operation
of packet sampling itself. Using the ability to efficiently estimate
flow-statistics such as approximate per-flow size, we design a family
of mechanisms where the sampling decision is guided by this knowledge.
The individual solutions developed in this thesis share a common
architectural theme, supporting the monitoring of highly dynamic
populations. Integrating this with the traditional sampling based
framework for network monitoring will enable a broad range of
applications for accurate and comprehensive monitoring of network
traffic.
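The thesis's specific streaming algorithms are not reproduced here, but the flavor of estimating per-flow sizes in a small working memory can be illustrated with a standard count-min sketch (a stand-in for illustration, not the algorithm developed in the thesis):

```python
import hashlib

class CountMinSketch:
    # Small-memory streaming structure for per-flow packet counts.
    # Estimates can only overestimate (hash collisions add counts),
    # never underestimate.
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, row, flow_id):
        # One independent hash per row
        h = hashlib.sha256(f"{row}:{flow_id}".encode()).hexdigest()
        return int(h, 16) % self.width

    def update(self, flow_id, count=1):
        # Per-packet update: O(depth) work, constant memory
        for row in range(self.depth):
            self.tables[row][self._index(row, flow_id)] += count

    def estimate(self, flow_id):
        # Minimum across rows limits the damage from collisions
        return min(self.tables[row][self._index(row, flow_id)]
                   for row in range(self.depth))
```

The memory footprint is fixed (width × depth counters) regardless of how many distinct flows appear in the stream, which is the essential property for monitoring at high-speed links.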
|
43 |
EM-Based Joint Detection and Estimation for Two-Way Relay Network. Yen, Kai-wei. 01 August 2012 (has links)
In this paper, the channel estimation problem for a two-way relay network (TWRN) is considered under two different wireless channel assumptions. Previous works have proposed training-based channel estimation methods to obtain the channel state information (CSI). In practice, however, the channel changes from one data block to another, which may cause performance degradation due to outdated CSI. To maintain performance, the system would have to insert more training signals. To improve the bandwidth efficiency instead, we propose a joint channel estimation and data detection method based on the expectation-maximization (EM) algorithm. Simulation results show that the proposed method can combat the effects of fading channels, with MSE results very close to the Cramer-Rao lower bound (CRLB) in the high signal-to-noise ratio (SNR) region. Additionally, compared with the previous work, the proposed scheme also achieves better detection performance for both time-varying and time-invariant channels.
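A minimal decision-directed loop conveys the EM flavor of alternating between data detection and channel re-estimation (a real-valued BPSK toy model, not the paper's TWRN formulation; all signal parameters below are illustrative):

```python
def joint_detect_estimate(y, pilots, iters=10):
    # Decision-directed joint channel estimation and detection for a
    # toy model y[t] = h * x[t] + noise with BPSK symbols x in {-1, +1};
    # the first len(pilots) transmitted symbols are known training.
    n_p = len(pilots)
    # Initial least-squares channel estimate from the pilots alone
    h = sum(yt * xt for yt, xt in zip(y[:n_p], pilots)) / n_p
    for _ in range(iters):
        # Detection step: nearest BPSK symbol under the current h
        x_hat = list(pilots) + [1.0 if yt * h >= 0 else -1.0 for yt in y[n_p:]]
        # Estimation step: refit h on pilots plus tentative decisions,
        # so the data symbols act as "virtual training"
        h = sum(yt * xt for yt, xt in zip(y, x_hat)) / len(y)
    return h, x_hat[n_p:]
```

Because the tentative decisions contribute to the channel fit, the estimate improves without inserting additional training symbols, which is the bandwidth-efficiency argument made in the abstract.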
|
44 |
Computer aided diagnosis in digital mammography [electronic resource]: classification of mass and normal tissue / by Monika Shinde. Shinde, Monika. January 2003 (has links)
Title from PDF of title page. / Document formatted into pages; contains 63 pages. / Thesis (M.S.C.S.)--University of South Florida, 2003. / Includes bibliographical references. / Text (Electronic thesis) in PDF format. / ABSTRACT: The work presented here is an important component of an ongoing project of developing an automated mass classification system for breast cancer screening and diagnosis for digital mammogram applications. Specifically, in this work the task of automatically separating mass tissue from normal breast tissue given a region of interest in a digitized mammogram is investigated. This is the crucial stage in developing a robust automated classification system because the classification depends on the accurate assessment of the tumor-normal tissue border as well as information gathered from the tumor area. In this work the Expectation Maximization (EM) method is developed and applied to high resolution digitized screen-film mammograms with the aim of segmenting normal tissue from mass tissue. / ABSTRACT: Both the raw data and summary data generated by Laws' texture analysis are investigated. Since the ultimate goal is robust classification, the merits of the tissue segmentation are assessed by their impact on the overall classification performance. Based on the 300-image dataset consisting of 97 malignant and 203 benign cases, a 63% sensitivity and an 89% specificity were achieved. Although the segmentation requires further investigation, the development and related computer coding of the EM algorithm were successful. The method was developed to take into account the input feature correlation. This development allows other researchers at this facility to investigate various input features without having an intricate understanding of the EM approach. / System requirements: World Wide Web browser and PDF reader. / Mode of access: World Wide Web.
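The EM segmentation idea — fitting a two-component mixture to intensity values so that each pixel can be assigned to "normal" or "mass" tissue — can be sketched in one dimension (a toy version with plain Gaussians, not the thesis's correlated-feature implementation):

```python
import math

def em_two_gaussians(data, iters=50):
    # EM for a 1-D two-component Gaussian mixture, e.g. "normal" vs
    # "mass" pixel intensities. Returns weights, means, std devs.
    mu = [min(data), max(data)]          # spread the initial means apart
    mean = sum(data) / len(data)
    var0 = sum((x - mean) ** 2 for x in data) / len(data)
    sigma = [math.sqrt(var0)] * 2
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w[k] * math.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
                 / (sigma[k] * math.sqrt(2 * math.pi)) for k in (0, 1)]
            tot = p[0] + p[1]
            resp.append([p[0] / tot, p[1] / tot])
        # M-step: weighted updates; a variance floor guards against the
        # degenerate (zero-variance) solutions EM can otherwise drift into
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)
    return w, mu, sigma
```

Thresholding the final responsibilities then yields the tissue labels per pixel.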
|
45 |
Analysis of circular data in the dynamic model and mixture of von Mises distributions. Lan, Tian, active 2013. 10 December 2013 (has links)
Analysis of circular data is becoming increasingly popular in many fields of study. In this report, I present two statistical analyses of circular data using von Mises distributions. First, the expectation-maximization algorithm is reviewed and used to classify and estimate circular data from a mixture of von Mises distributions. Second, the Forward Filtering Backward Smoothing method via particle filtering is reviewed and implemented when circular data appear in dynamic state-space models. / text
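The first analysis — fitting a mixture of von Mises distributions by EM — can be sketched as follows (a two-component toy fit; the series expansion for the Bessel function I0 and the piecewise approximation for the concentration update are standard devices assumed for this sketch, not taken from the report):

```python
import cmath
import math

def bessel_i0(k):
    # Series expansion of the modified Bessel function I0(k)
    s, term, m = 1.0, 1.0, 0
    while term > 1e-12 * s:
        m += 1
        term *= (k / (2 * m)) ** 2
        s += term
    return s

def vm_pdf(x, mu, kappa):
    # von Mises density on the circle
    return math.exp(kappa * math.cos(x - mu)) / (2 * math.pi * bessel_i0(kappa))

def kappa_from_r(r):
    # Piecewise approximation for the concentration given the
    # mean resultant length r (Fisher's approximation)
    if r < 0.53:
        return 2 * r + r ** 3 + 5 * r ** 5 / 6
    if r < 0.85:
        return -0.4 + 1.39 * r + 0.43 / (1 - r)
    return 1.0 / (r ** 3 - 4 * r ** 2 + 3 * r)

def em_von_mises_mixture(data, iters=100):
    # EM fit of a two-component von Mises mixture
    mu = [0.0, math.pi]
    kappa = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibilities
        resp = []
        for x in data:
            p = [w[k] * vm_pdf(x, mu[k], kappa[k]) for k in (0, 1)]
            tot = p[0] + p[1]
            resp.append([p[0] / tot, p[1] / tot])
        # M-step: weighted circular means via the resultant vector
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            z = sum(r[k] * cmath.exp(1j * x) for r, x in zip(resp, data))
            mu[k] = cmath.phase(z)
            r_bar = abs(z) / nk
            kappa[k] = min(kappa_from_r(min(r_bar, 0.9999)), 50.0)  # cap avoids overflow
    return w, mu, kappa
```

The final responsibilities classify each angle to a component, which is the clustering use described above.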
|
46 |
Weakly supervised part-of-speech tagging for Chinese using label propagation. Ding, Weiwei, 1985-. 02 February 2012 (has links)
Part-of-speech (POS) tagging is one of the most fundamental and crucial tasks in Natural Language Processing. Chinese POS tagging is challenging because it also involves word segmentation. In this report, research is focused on how to improve unsupervised POS tagging using Hidden Markov Models and the Expectation Maximization parameter estimation approach (EM-HMM). The traditional EM-HMM system uses a dictionary, which is used to constrain possible tag sequences and initialize the model parameters. This is a very crude initialization: the emission
parameters are set uniformly in accordance with the tag dictionary. To improve this, word alignments can be used. Word alignments are the word-level translation correspondence pairs generated from parallel text between two languages. In this report, Chinese-English word alignment is used. The performance is expected to be better, as these two resources are complementary: the dictionary provides information on word types, while word alignment provides information on word tokens. However, this approach is found to be of limited benefit.
In this report, another method is proposed. To improve the dictionary coverage and get a better POS distribution, Modified Adsorption, a label propagation algorithm, is used. We construct a graph connecting word tokens to feature types (such as word unigrams and bigrams) and connecting those tokens to information from knowledge sources, such as a small tag dictionary, Wiktionary, and word alignments. The core idea is to use a small amount of supervision, in the form of a tag dictionary, to acquire POS distributions for each word (both known and unknown) and provide these as an improved initialization for EM learning of the HMM. We find this strategy to work very well, especially when we have a small tag dictionary. Label propagation provides a better initialization for the EM-HMM method because it greatly increases the coverage of the dictionary. In addition, label propagation is flexible enough to incorporate many kinds of knowledge. However, results also show that some resources, such as the word alignments, are not easily exploited with label propagation. / text
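The initialization contrast described above can be sketched as follows (hypothetical words and tags; `propagated` stands in for the soft distributions that label propagation would produce):

```python
def tag_distributions(words, tag_dict, all_tags, propagated=None):
    # Per-word tag distributions used to initialize EM-HMM emissions.
    # Crude init: uniform over the dictionary's allowed tags (or over
    # all tags for unknown words). Where label propagation supplied a
    # soft distribution for a word, prefer it.
    dists = {}
    for w in words:
        if propagated and w in propagated:
            dists[w] = dict(propagated[w])            # soft labels from the graph
        elif w in tag_dict:
            tags = tag_dict[w]
            dists[w] = {t: 1.0 / len(tags) for t in tags}
        else:
            dists[w] = {t: 1.0 / len(all_tags) for t in all_tags}
    return dists
```

Unknown words fall back to a uniform distribution over all tags under the crude scheme, which is exactly the coverage gap that label propagation narrows.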
|
47 |
Statistical Analysis of Operational Data for Manufacturing System Performance Improvement. Wang, Zhenrui. January 2013 (has links)
The performance of a manufacturing system relies on four types of elements: operators, machines, the computer system and the material handling system. To ensure the performance of these elements, operational data containing various aspects of information are collected for monitoring and analysis. This dissertation focuses on operator performance evaluation and machine failure prediction. The proposed research is motivated by the following challenges in analyzing operational data: (i) the complex relationships between the variables, (ii) the implicit information important to failure prediction, and (iii) data with outliers and missing or erroneous measurements. To overcome these challenges, the following research has been conducted. To compare operator performance, a methodology combining regression modeling and a multiple comparisons technique is proposed. The regression model quantifies and removes the complex effects of other impacting factors on operator performance. A robust zero-inflated Poisson (ZIP) model is developed to reduce the impact of the excessive zeros and outliers in the performance metric, i.e. the number of defects (NoD), on the regression analysis. The model residuals are plotted in non-parametric statistical charts for performance comparison. The estimated model coefficients are also used to identify under-performing machines. To detect temporal patterns from operational data sequences, an algorithm is proposed for detecting interval-based asynchronous periodic patterns (APP). The algorithm effectively and efficiently detects patterns through a modified clustering and a convolution-based template matching method. To predict machine failures based on covariates with erroneous measurements, a new method is proposed for statistical inference of the proportional hazards model under a mixture of classical and Berkson errors.
The method estimates the model coefficients with an expectation-maximization (EM) algorithm whose expectation step is evaluated by Monte Carlo simulation. The model estimated with the proposed method improves the accuracy of inference on machine failure probability. The research work presented in this dissertation provides a package of solutions to improve manufacturing system performance. The effectiveness and efficiency of the proposed methodologies have been demonstrated and justified with both numerical simulations and real-world case studies.
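The zero-inflation idea behind the ZIP model lends itself to a compact EM illustration (a plain ZIP fit without the regression covariates or robustness weighting of the proposed model):

```python
import math

def fit_zip_em(counts, iters=200):
    # EM for a zero-inflated Poisson: a point mass at zero (prob pi_)
    # mixed with a Poisson(lam). The latent variable asks whether an
    # observed zero is "structural" or an ordinary Poisson zero.
    pi_ = 0.5
    lam = max(sum(counts) / len(counts), 1e-6)
    for _ in range(iters):
        # E-step: posterior probability that each observed zero is structural
        tau = [pi_ / (pi_ + (1.0 - pi_) * math.exp(-lam)) if y == 0 else 0.0
               for y in counts]
        # M-step: closed-form updates for both parameters
        pi_ = sum(tau) / len(counts)
        rest = sum(1.0 - t for t in tau)
        lam = sum((1.0 - t) * y for t, y in zip(tau, counts)) / rest
    return pi_, lam
```

Separating the structural zeros from the Poisson zeros is what keeps an excess of zero-defect records from biasing the rate estimate downward.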
|
48 |
Time series analysis of Saudi Arabia oil production data. Albarrak, Abdulmajeed Barrak. 14 December 2013 (has links)
Saudi Arabia is the largest petroleum producer and exporter in the world. The Saudi Arabian
economy depends heavily on the production and export of oil. This motivates our research on
Saudi Arabia's oil production. The prime objective of our research is to find the most appropriate
models for analyzing Saudi Arabia oil production data. Initially we consider
autoregressive integrated moving average (ARIMA) models to fit the data. But most of the
variables under study show some kind of volatility and for this reason we finally decide to
consider autoregressive conditional heteroscedastic (ARCH) models for them. If there is no
ARCH effect, the model automatically reduces to an ARIMA model. But the existence of missing
values for almost every variable complicates the analysis, since the estimation of
parameters in an ARCH model does not converge when observations are missing. As a remedy
to this problem we estimate missing observations first. We employ the expectation maximization
(EM) algorithm for estimating the missing values. But since our data are time series data, no
simple EM algorithm is appropriate for them. There is also evidence of the presence of
outliers in the data. Therefore we finally employ a robust least trimmed squares (LTS) regression based EM algorithm to estimate the missing values. After the estimation of missing values we
employ the White test to select the most appropriate ARCH models for all sixteen variables
under study. A normality test on the resulting residuals is performed for each variable to check
the validity of the fitted model. / ARCH/GARCH models, outliers and robustness : tests for normality and estimation of missing values in time series -- Outlier analysis and estimation of missing values by robust EM algorithm for Saudi Arabia oil production data -- Selection of ARCH models for Saudi Arabia oil production data. / Department of Mathematical Sciences
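An EM-flavored imputation loop for a time series can be sketched as follows (ordinary least squares on a toy AR(1) model; the thesis replaces the regression step with robust LTS to handle outliers):

```python
def impute_ar1_em(series, iters=50):
    # series: list of floats with None marking missing observations.
    # Alternates an E-like step (fill missing values from the current
    # AR(1) fit x_t ~ a * x_{t-1} + b) with an M-like step (refit a, b
    # by ordinary least squares on consecutive pairs).
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    xs = [v if v is not None else mean for v in series]   # crude start
    missing = [i for i, v in enumerate(series) if v is None]
    a, b = 0.0, mean
    for _ in range(iters):
        pairs = [(xs[t - 1], xs[t]) for t in range(1, len(xs))]
        n = len(pairs)
        sx = sum(p for p, _ in pairs)
        sy = sum(q for _, q in pairs)
        sxx = sum(p * p for p, _ in pairs)
        sxy = sum(p * q for p, q in pairs)
        a = (n * sxy - sx * sy) / (n * sxx - sx * sx)     # OLS slope
        b = (sy - a * sx) / n                             # OLS intercept
        for i in missing:
            if i > 0:
                xs[i] = a * xs[i - 1] + b                 # refill missing points
    return xs, a, b
```

Each pass uses the current imputations to refit the model and the refitted model to improve the imputations, which is why convergence of the surrounding ARCH estimation depends on filling the gaps first.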
|
49 |
Towards Finding Optimal Mixture Of Subspaces For Data Classification. Musa, Mohamed Elhafiz Mustafa. 01 October 2003 (has links) (PDF)
In pattern recognition, when data has different structures in different parts of the
input space, fitting one global model can be slow and inaccurate. Learning methods
can quickly learn the structure of the data in local regions, consequently, offering faster
and more accurate model fitting. Breaking the training data set into smaller subsets may
lead to a curse-of-dimensionality problem, as a training sample subset may not be enough
for estimating the required set of parameters for the submodels. Increasing the size of
the training data may not be feasible in many situations. Interestingly, the data in local
regions becomes more correlated. Therefore, by decorrelation methods we can reduce
data dimensions and hence the number of parameters. In other words, we can find
uncorrelated low dimensional subspaces that capture most of the data variability. The
current subspace modelling methods have proved better performance than the global
modelling methods for the given type of training data structure. Nevertheless, these
methods still need more research, as they suffer from two limitations:
- There is no standard method to specify the optimal number of subspaces.
- There is no standard method to specify the optimal dimensionality for each
subspace.
In the current models these two parameters are determined beforehand. In this dissertation
we propose and test algorithms that try to find a suboptimal number of
principal subspaces and a suboptimal dimensionality for each principal subspace automatically.
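One common heuristic for the second open question — choosing each subspace's dimensionality — retains the smallest number of principal directions that explain a fixed fraction of the local variance (an illustrative rule, not the selection criterion proposed in the dissertation):

```python
def choose_dimensionality(eigenvalues, threshold=0.95):
    # Smallest number of principal directions whose eigenvalues
    # account for at least `threshold` of the total local variance.
    total = sum(eigenvalues)
    accumulated = 0.0
    for dim, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        accumulated += ev
        if accumulated / total >= threshold:
            return dim
    return len(eigenvalues)
```

Applied per local region, such a rule lets highly correlated regions collapse to very low-dimensional subspaces while leaving less structured regions with more directions.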
|
50 |
Estimating parameters in Markov models for longitudinal studies with missing data or surrogate outcomes / Yeh, Hung-Wen. Chan, Wenyaw. January 2007 (has links)
Thesis (Ph. D.)--University of Texas Health Science Center at Houston, School of Public Health, 2007. / Includes bibliographical references (leaves 58-59).
|