191 |
Mixture of Factor Analyzers with Information Criteria and the Genetic Algorithm. Turan, Esra, 01 August 2010.
In this dissertation, we have developed and combined several statistical techniques in Bayesian factor analysis (BAYFA) and mixture of factor analyzers (MFA) to overcome the shortcomings of existing methods. Information criteria are brought into the context of the BAYFA model as a decision rule for choosing the number of factors m, used alongside the Press and Shigemasu method, Gibbs sampling and Iterated Conditional Modes deterministic optimization. Because BAYFA is sensitive to the prior information on the factor pattern structure, the prior factor pattern structure is learned adaptively and directly from the given sample observations using the Sparse Root algorithm.
Clustering and dimensionality reduction have long been considered two of the fundamental problems in unsupervised learning and statistical pattern recognition. In this dissertation, we introduce a novel statistical learning technique by treating MFA as a method for model-based density estimation that clusters high-dimensional data and simultaneously carries out factor analysis to reduce the curse of dimensionality in an expert data mining system. The typical EM algorithm can get trapped in one of many local maxima; it is therefore slow to converge, may never reach the global optimum, and is highly dependent on initial values. We extend the EM algorithm proposed by Ghahramani and Hinton (1997) for the MFA using intelligent initialization techniques, K-means and a regularized Mahalanobis distance, and introduce a new Genetic Expectation Maximization (GEM) algorithm for MFA in order to overcome these shortcomings. Another limitation of the EM algorithm for MFA is the assumption that the variance of the error vector and the number of factors are the same for each mixture component. We propose a Two-Stage GEM algorithm for MFA to relax this constraint and obtain different numbers of factors for each population. Our approach integrates statistical modeling procedures based on information criteria as a fitness function to determine the number of mixture clusters and, at the same time, to choose the number of factors that can be extracted from the data.
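A minimal sketch of the information-criterion-as-fitness idea is shown below: candidate mixture models are scored by BIC and the best-scoring number of components is kept. scikit-learn's GaussianMixture is used as a stand-in for a mixture-of-factor-analyzers fit, and the toy data, candidate range and use of BIC instead of the dissertation's GEM search are illustrative assumptions only.

```python
# Minimal sketch: an information criterion (BIC) as a fitness function for the
# number of mixture components. GaussianMixture stands in for an MFA fit.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy high-dimensional data: three well-separated groups in 10 dimensions
X = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(100, 10)) for c in (-3.0, 0.0, 3.0)
])

candidates = range(1, 7)
scores = {}
for k in candidates:
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(X)
    scores[k] = gmm.bic(X)          # lower BIC = better fitness

best_k = min(scores, key=scores.get)
print("BIC per k:", {k: round(v, 1) for k, v in scores.items()})
print("Selected number of components:", best_k)
```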
|
192 |
Detection of long-range dependence: applications in climatology and hydrology. Rust, Henning, January 2007.
It is desirable to reduce the potential threats that result from the
variability of nature, such as droughts or heat waves that lead to
food shortage, or the other extreme, floods that lead to severe
damage. To prevent such catastrophic events, it is necessary to
understand, and to be capable of characterising, nature's variability.
Typically one aims to describe the underlying dynamics of geophysical
records with differential equations. There are, however, situations
where this does not support the objectives, or is not feasible, e.g.,
when little is known about the system, or it is too complex for the
model parameters to be identified. In such situations it is beneficial
to regard certain influences as random, and describe them with
stochastic processes. In this thesis I focus on such a description
with linear stochastic processes of the FARIMA type and concentrate on
the detection of long-range dependence. Long-range dependent processes
show an algebraic (i.e. slow) decay of the autocorrelation
function. Detecting such dependence is important with respect to,
e.g., trend tests and uncertainty analysis.
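As a concrete illustration of algebraic ACF decay, the sketch below simulates fractionally integrated noise, i.e. a FARIMA(0,d,0) process, through a truncated MA(∞) expansion and checks the decay rate of its sample autocorrelation function; the simulation routine, truncation length and d = 0.3 are illustrative choices and not the estimation machinery used in the thesis.

```python
# Minimal sketch: simulate fractionally integrated noise FARIMA(0, d, 0) and
# inspect the slow (algebraic) decay of its sample autocorrelation function.
import numpy as np
from scipy.special import gammaln

def fi_noise(n, d, burn=2000, seed=1):
    """Truncated MA(inf) representation of (1 - B)^(-d) applied to white noise."""
    j = np.arange(n + burn)
    # psi_j = Gamma(j + d) / (Gamma(d) * Gamma(j + 1)), computed in log space
    psi = np.exp(gammaln(j + d) - gammaln(d) - gammaln(j + 1))
    eps = np.random.default_rng(seed).standard_normal(n + burn)
    return np.convolve(eps, psi)[:n + burn][burn:]

def acf(x, max_lag):
    x = x - x.mean()
    c0 = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:-k], x[k:]) / len(x) / c0
                     for k in range(1, max_lag + 1)])

x = fi_noise(n=10_000, d=0.3)
rho = acf(x, max_lag=200)
# For long-range dependence the ACF decays roughly like k^(2d - 1); on a
# log-log scale the fitted slope should be in the vicinity of 2*0.3 - 1 = -0.4.
lags = np.arange(1, 201)
slope = np.polyfit(np.log(lags[9:]), np.log(np.abs(rho[9:])), 1)[0]
print("log-log ACF slope (theory: about -0.4):", round(slope, 2))
```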
Aiming to provide a reliable and powerful strategy for the detection
of long-range dependence, I suggest a way of addressing the problem
which is somewhat different from standard approaches. Commonly used
methods are based either on investigating the asymptotic behaviour
(e.g., log-periodogram regression), or on finding a suitable
potentially long-range dependent model (e.g., FARIMA[p,d,q]) and testing
the fractional difference parameter d for compatibility with
zero. Here, I suggest rephrasing the problem as a model selection
task, i.e. comparing the most suitable long-range dependent and the
most suitable short-range dependent model. Approaching the task this
way requires a) a suitable class of long-range and short-range
dependent models along with suitable means for parameter estimation
and b) a reliable model selection strategy, capable of discriminating
also non-nested models. With the flexible FARIMA model class together
with the Whittle estimator the first requirement is
fulfilled. Standard model selection strategies, e.g., the
likelihood-ratio test, are frequently not powerful enough for a
comparison of non-nested models. Thus, I suggest extending this
strategy with a simulation-based model selection approach suitable for
such a direct comparison. The approach follows the procedure of
a statistical test, with the likelihood-ratio as the test
statistic. Its distribution is obtained via simulations using the two
models under consideration. For two simple models and different
parameter values, I investigate the reliability of p-value and power
estimates obtained from the simulated distributions. The result turned
out to be dependent on the model parameters. However, in many cases
the estimates allow an adequate model selection to be established.
An important feature of this approach is that it immediately reveals
the ability or inability to discriminate between the two models under
consideration.
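The simulation-based comparison can be sketched compactly. The code below follows the test procedure described above: the likelihood ratio of two fitted candidates serves as the test statistic, and its null distribution is obtained by refitting on data simulated from one candidate. To keep the example self-contained, it compares two simple non-nested i.i.d. models (exponential versus lognormal) rather than FARIMA short- and long-range dependent models fitted with the Whittle estimator; that substitution is purely illustrative.

```python
# Minimal sketch of the simulation-based likelihood-ratio comparison of two
# non-nested models, here i.i.d. exponential vs. lognormal fits for brevity.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def loglik_ratio(x):
    """log L(lognormal) - log L(exponential), each with ML-fitted parameters."""
    mu, sigma = np.mean(np.log(x)), np.std(np.log(x))
    ll_lognorm = stats.lognorm(s=sigma, scale=np.exp(mu)).logpdf(x).sum()
    ll_expon = stats.expon(scale=x.mean()).logpdf(x).sum()
    return ll_lognorm - ll_expon

def simulated_distribution(sampler, n, n_sim=500):
    """Distribution of the statistic when data really come from `sampler`."""
    return np.array([loglik_ratio(sampler(n)) for _ in range(n_sim)])

# Observed record (here generated from the simpler, exponential model)
x_obs = rng.exponential(scale=2.0, size=200)
t_obs = loglik_ratio(x_obs)

# Simulate the statistic under the exponential model fitted to the observation
null_sampler = lambda n: rng.exponential(scale=x_obs.mean(), size=n)
t_null = simulated_distribution(null_sampler, n=len(x_obs))

p_value = np.mean(t_null >= t_obs)
print(f"LR statistic = {t_obs:.2f}, simulated p-value = {p_value:.3f}")
```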
Two applications, a trend detection problem in temperature records and
an uncertainty analysis for flood return level estimation, accentuate the
importance of having reliable methods at hand for the detection of
long-range dependence. In the case of trend detection, falsely
concluding long-range dependence implies an underestimation of a trend
and possibly delays measures needed to counteract it. Ignoring
long-range dependence, although
present, leads to an underestimation of confidence intervals and thus
to an unjustified belief in safety, as is the case for the
return level uncertainty analysis. A reliable detection of long-range
dependence is thus highly relevant in practical applications.
Examples related to extreme value analysis are not limited to
hydrological applications. The increased uncertainty of return level
estimates is a potential problem for all records from autocorrelated
processes; an interesting example in this respect is the assessment
of the maximum strength of wind gusts, which is important for
designing wind turbines. The detection of long-range dependence is
also a relevant problem in the exploration of financial market
volatility. By rephrasing the detection problem as a model
selection task and suggesting refined methods for model comparison,
this thesis contributes to the discussion on, and development of,
methods for the detection of long-range dependence. / Reducing the potential hazards and impacts of natural climate variability is a desirable goal. Such hazards include droughts and heat waves, which lead to water shortage, or, at the other extreme, floods, which can cause considerable damage to infrastructure. To prevent such catastrophic events, it is necessary to understand and to be able to describe the dynamics of nature.
Typically, one attempts to describe the dynamics of geophysical records with systems of differential equations. There are, however, situations in which this approach is not expedient or technically not feasible, namely when little is known about the system or when it is too complex for the model parameters to be identified. In such cases it is sensible to regard some influences as random and to model them with stochastic processes.
In this thesis, such a description with linear stochastic processes of the FARIMA class is pursued, with particular focus on the detection of long-range dependence. Long-range dependent processes are those with an algebraically, i.e. slowly, decaying autocorrelation function. A reliable detection of such processes is relevant for trend detection and uncertainty analyses.
To provide a reliable strategy for the detection of long-range dependent processes, a route different from the standard one is proposed. Commonly, methods are employed that investigate the asymptotic behaviour, e.g. regression in the periodogram, or a suitable, potentially long-range dependent model is sought, e.g. from the FARIMA class, and the estimated fractional differencing parameter d is tested for compatibility with the trivial value zero. Here it is proposed to reformulate the detection of long-range dependence as a model selection problem, i.e. to compare the best short-range and the best long-range dependent model. This approach requires a) a suitable class of long-range and short-range dependent processes and b) a reliable model selection strategy, also for non-nested models. With the flexible FARIMA class and the Whittle approach to parameter estimation, the first requirement is fulfilled. Standard approaches to model selection, such as the likelihood-ratio test, are, however, often not discriminative enough for non-nested models. It is therefore proposed to complement this strategy with a simulation-based approach that is particularly suited to the direct discrimination of non-nested models. The approach follows a statistical test with the likelihood ratio as the test statistic; its distribution is obtained via simulations with the two models to be distinguished. For two simple models and various parameter values, the reliability of the p-value and power estimates is investigated. The result depends on the model parameters; in many cases, however, an adequate model selection could be established. An important property of this strategy is that it immediately reveals how well the two models under consideration can be distinguished.
Two applications, trend detection in temperature time series and the uncertainty analysis of design floods, emphasize the need for reliable methods for the detection of long-range dependence. In the case of trend detection, falsely concluding long-range dependence leads to an underestimation of a trend, which in turn may delay the initiation of countermeasures. For runoff time series, ignoring existing long-range dependence leads to an underestimation of the uncertainty of design values. A reliable detection of long-range dependent processes is therefore of great importance in practical time series analysis. Examples related to extreme events are not limited to flood analysis: increased uncertainty in the estimation of extreme events is a potential problem for all autocorrelated processes. A further interesting example is the assessment of maximum wind gust strengths, which plays a role in the design of wind turbines. By reformulating the detection problem as a model selection question and by providing suitable model selection strategies, this thesis contributes to the discussion and development of methods for the detection of long-range dependence.
|
193 |
Robust inference of gene regulatory networks: System properties, variable selection, subnetworks, and design of experiments. Nordling, Torbjörn E. M., January 2013.
In this thesis, inference of biological networks from in vivo data generated by perturbation experiments is considered, i.e. deduction of the causal interactions that exist among the observed variables. Knowledge of such regulatory influences is essential in biology. A system property, interampatteness, is introduced that explains why the variation in existing gene expression data is concentrated in a few “characteristic modes” or “eigengenes”, and why previously inferred models have a large number of false positive and false negative links. An interampatte system is characterized by strong INTERactions enabling simultaneous AMPlification and ATTEnuation of different signals, and we show that perturbation of individual state variables, e.g. genes, typically leads to ill-conditioned data with both characteristic and weak modes. The weak modes are typically dominated by measurement noise due to poor excitation, and their existence hampers network reconstruction. The excitation problem is solved by iterative design of correlated multi-gene perturbation experiments that counteract the intrinsic signal attenuation of the system. The next perturbation should be designed such that the expected response practically spans an additional dimension of the state space. The proposed design is numerically demonstrated for the Snf1 signalling pathway in S. cerevisiae. The impact of unperturbed and unobserved latent state variables, which exist in any real biological system, on the inferred network and on the required experimental set-up for network inference is analysed. Their existence implies that a subnetwork of pseudo-direct causal regulatory influences, accounting for all environmental effects, is in general inferred. In principle, the number of latent states and different paths between the nodes of the network can be estimated, but their identity cannot be determined unless they are observed or perturbed directly. Network inference is recognized as a variable/model selection problem and solved by considering all possible models of a specified class that can explain the data at a desired significance level, and by classifying only the links present in all of these models as existing. As shown, these links can be determined without any parameter estimation by reformulating the variable selection problem as a robust rank problem. Solution of the rank problem enables assignment of confidence to individual interactions, without resorting to any approximation or asymptotic results. This is demonstrated by reverse engineering of the synthetic IRMA gene regulatory network from published data. A previously unknown activation of transcription of SWI5 by CBF1 in the IRMA strain of S. cerevisiae is proven to exist, which serves to illustrate that even the accumulated knowledge of well studied genes is incomplete.
/ This thesis deals with inference of biological networks from in vivo data generated by perturbation experiments, i.e. determination of the causal couplings that exist between the observed variables. Knowledge of these regulatory influences is essential for biological understanding. A system property, interampatteness, is introduced. It explains why the variation in existing gene expression data is concentrated in a few “characteristic modes” or “eigengenes”, and why previously constructed models contain many false positive and false negative links. An interampatte system is characterized by strong couplings that enable simultaneous amplification and attenuation of different signals. We demonstrate that perturbation of individual state variables, e.g. genes, typically leads to ill-conditioned data with both characteristic and weak modes. The weak modes are typically dominated by measurement noise due to poor excitation and hamper reconstruction of the network. The excitation problem is solved by iterative design of experiments in which correlated perturbations of multiple genes counteract the system's intrinsic attenuation of signals. The next perturbation should be designed such that the expected response practically spans an additional dimension of the state space. The proposed design is demonstrated numerically for the Snf1 signalling pathway in S. cerevisiae. The impact of unperturbed and unobserved latent state variables, which exist in every real biological system, on the inferred networks and on the planning of experiments for network inference is analysed. The existence of these state variables implies that a subnetwork with pseudo-direct regulatory influences, compensating for environmental effects, is generally determined. In principle, the number of latent states and alternative paths between nodes of the network can be determined, but their identity cannot be determined unless they are observed or perturbed directly. Network inference is treated as a variable/model selection problem and solved by examining all models within a chosen class that can explain the data at the desired significance level, and by classifying only links that are present in all of these models as existing. These links can be determined without parameter estimation by rewriting the variable selection problem as a robust rank problem. Solving the rank problem makes it possible to assign statistical confidence to individual links without approximations or asymptotic considerations. This is demonstrated by reconstruction of the synthetic IRMA gene regulatory network from published data. A previously unknown activation of transcription of SWI5 by CBF1 in the IRMA strain of S. cerevisiae is proven, which illustrates that even the accumulated knowledge of well-studied genes is incomplete.
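The distinction between characteristic and weak modes can be illustrated with a singular value decomposition of a perturbation-response matrix; in the sketch below the response matrix, the noise level and the threshold separating the modes are all synthetic assumptions made for illustration.

```python
# Minimal sketch: inspect "characteristic" vs. "weak" modes of a
# perturbation-response matrix via its singular value decomposition.
import numpy as np

rng = np.random.default_rng(7)
n_genes, n_experiments = 20, 20

# Synthetic response: a few strong directions plus measurement noise
low_rank = rng.normal(size=(n_genes, 3)) @ rng.normal(size=(3, n_experiments))
noise = 0.05 * rng.normal(size=(n_genes, n_experiments))
Y = low_rank + noise

s = np.linalg.svd(Y, compute_uv=False)
noise_level = 0.05 * np.sqrt(n_genes)        # rough scale of noise singular values
n_characteristic = int(np.sum(s > 10 * noise_level))

print("condition number:", round(s[0] / s[-1], 1))
print("characteristic modes (well excited):", n_characteristic)
print("weak modes (noise dominated):", len(s) - n_characteristic)
```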
|
194 |
Bayesian Methods in Gaussian Graphical Models. Mitsakakis, Nikolaos, 31 August 2010.
This thesis contributes to the field of Gaussian Graphical Models by exploring, either numerically or theoretically, various topics of Bayesian methods in Gaussian Graphical Models and by providing a number of interesting results whose further exploration would be promising, pointing to numerous future research directions.
Gaussian Graphical Models are statistical methods for the investigation and representation of interdependencies between components of continuous random vectors. This thesis aims to investigate some issues related to the application of Bayesian methods for Gaussian Graphical Models. We adopt the popular $G$-Wishart conjugate prior $W_G(\delta,D)$ for the precision matrix. We propose an efficient sampling method for the $G$-Wishart distribution based on the Metropolis-Hastings algorithm and show its validity through a number of numerical experiments. We show that this method can easily be used to estimate the Deviance Information Criterion, providing a computationally inexpensive approach for model selection.
In addition, we look at the marginal likelihood of a graphical model given a set of data. This is proportional to the ratio of the posterior to the prior normalizing constant. We explore methods for the estimation of this ratio, focusing primarily on applying the Monte Carlo simulation method of path sampling. We also explore numerically the effect of the completion of the incomplete matrix $D^{\mathcal{V}}$, a hyperparameter of the $G$-Wishart distribution, on the estimation of the normalizing constant.
We also derive a series of exact and approximate expressions for the Bayes factor between two graphs that differ by one edge. A new theoretical result regarding the limit of the normalizing constant multiplied by the hyperparameter $\delta$ is given, and its implications for the validity of an improper prior and of the subsequent Bayes factor are discussed.
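The path-sampling idea used for ratios of normalizing constants can be illustrated on a toy problem where the answer is known. The sketch below uses a geometric path between two one-dimensional Gaussian kernels, so the exact ratio is log(sigma); the kernels, grid and sample sizes are illustrative assumptions and none of the $G$-Wishart machinery is reproduced.

```python
# Minimal sketch of path sampling (thermodynamic integration) for a ratio of
# normalizing constants, log(Z1/Z0). The toy kernels are 1-D Gaussians so the
# exact answer, log(sigma), is known.
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0        # q0(x) = exp(-x^2/2), q1(x) = exp(-x^2 / (2 sigma^2))

def log_q0(x):
    return -0.5 * x**2

def log_q1(x):
    return -0.5 * x**2 / sigma**2

def sample_p_t(t, size):
    # The geometric path q_t = q0^(1-t) * q1^t is Gaussian with this variance:
    var_t = 1.0 / ((1 - t) + t / sigma**2)
    return rng.normal(scale=np.sqrt(var_t), size=size)

# log(Z1/Z0) = integral over t of E_{p_t}[ log q1(X) - log q0(X) ]
t_grid = np.linspace(0.0, 1.0, 21)
expectations = np.array([np.mean(log_q1(x) - log_q0(x))
                         for t in t_grid
                         for x in [sample_p_t(t, 20_000)]])
log_ratio = np.sum((expectations[:-1] + expectations[1:]) / 2 * np.diff(t_grid))

print("path sampling estimate:", round(log_ratio, 3))
print("exact log(Z1/Z0) = log(sigma):", round(np.log(sigma), 3))
```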
|
196 |
Bayesian Model Selection for High-dimensional High-throughput Data. Joshi, Adarsh, May 2010.
Bayesian methods are often criticized on the grounds of subjectivity. Furthermore, misspecified
priors can have a deleterious effect on Bayesian inference. Noting that model
selection is effectively a test of many hypotheses, Dr. Valen E. Johnson sought to eliminate
the need for prior specification by computing Bayes' factors from frequentist test statistics.
In his pioneering work published in 2005, Dr. Johnson proposed
using so-called local priors for computing Bayes' factors from test statistics. Dr. Johnson
and Dr. Jianhua Hu used Bayes' factors for model selection in a linear model setting. In
an independent work, Dr. Johnson and another colleague, David Rossell, investigated two
families of non-local priors for testing the regression parameter in a linear model setting.
These non-local priors enable greater separation between the theories of null and alternative
hypotheses.
In this dissertation, I extend model selection based on Bayes' factors and use non-local
priors to define Bayes' factors based on test statistics. With these priors, I have been
able to reduce the problem of prior specification to setting just one scaling parameter.
That scaling parameter can be easily set, for example, on the basis of frequentist operating
characteristics of the corresponding Bayes' factors. Furthermore, the loss of information incurred by basing a Bayes' factor on a test statistic is minimal.
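A simplified illustration of a Bayes factor computed directly from a test statistic is sketched below. It assumes a z-statistic with a zero-mean normal prior on the effect under the alternative, which is a stand-in for the local/non-local prior constructions discussed above rather than their exact form; the prior scale tau plays the role of the single scaling parameter, and the last lines hint at setting it via a frequentist operating characteristic.

```python
# Minimal sketch: a Bayes factor computed directly from a test statistic.
# Assume z | theta ~ N(theta, 1), H0: theta = 0, and under H1 a zero-mean
# normal prior theta ~ N(0, tau^2); tau is the single scaling parameter.
import numpy as np
from scipy import stats

def bayes_factor_10(z, tau):
    # Marginal of z under H1 is N(0, 1 + tau^2); under H0 it is N(0, 1).
    m1 = stats.norm(0.0, np.sqrt(1.0 + tau**2)).pdf(z)
    m0 = stats.norm(0.0, 1.0).pdf(z)
    return m1 / m0

for z in (1.0, 2.0, 3.0):
    print(f"z = {z:.1f}: BF10 = {bayes_factor_10(z, tau=2.0):.2f}")

# One way to choose tau: fix a frequentist operating characteristic, e.g. the
# null probability that BF10 exceeds a chosen evidence threshold.
tau, threshold = 2.0, 10.0
z_grid = np.linspace(0, 6, 601)
z_cut = z_grid[np.argmax(bayes_factor_10(z_grid, tau) >= threshold)]
print("P(BF10 >= 10 | H0) is about", round(2 * stats.norm.sf(z_cut), 4))
```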
Along with Dr. Johnson and Dr. Hu, I used the Bayes' factors based on the likelihood
ratio statistic to develop a method for clustering gene expression data. This method has
performed well in both simulated examples and real datasets. An outline of that work is
also included in this dissertation. Further, I extend the clustering model to a subclass of
the decomposable graphical model class, which is more appropriate for genotype data sets,
such as single-nucleotide polymorphism (SNP) data. Efficient FORTRAN programming has
enabled me to apply the methodology to hundreds of nodes.
For problems that produce computationally harder probability landscapes, I propose a
modification of the Markov chain Monte Carlo algorithm to extract information regarding
the important network structures in the data. This modified algorithm performs well in
inferring complex network structures. I use this method to develop a prediction model for
disease based on SNP data. My method performs well in cross-validation studies.
|
197 |
Model Selection and Uniqueness Analysis for Reservoir History Matching. Rafiee, Mohammad Mohsen, 28 March 2011.
“History matching” (model calibration, parameter identification) is an established method for determining representative reservoir properties such as permeability, porosity, relative permeability and fault transmissibility from a measured production history; however, the uniqueness of the selected model is always a challenge in a successful history match.
Up to now, the uniqueness of history matching results could in practice be assessed only on the basis of individual technical experience and/or by repeating the history match with different reservoir models (different sets of parameters as the starting guess).
The present study uses, for the first time in reservoir engineering, the stochastic theory of Kullback & Leibler (K-L) and its further development by Akaike (AIC) to address the uniqueness problem. In addition, based on the AIC principle and the principle of parsimony, a penalty term for the objective function (OF) has been formulated empirically from geoscientific and technical considerations. Finally, a new formulation (Penalized Objective Function, POF) has been developed for model selection in reservoir history matching and has been tested successfully on a North German gas field. / “History matching” (model calibration, parameter identification) is a proven method for determining representative reservoir properties, such as permeability, porosity, relative permeability functions and fault transmissibilities, from a measured production history. To date, the uniqueness of the identified parameters cannot be demonstrated constructively in practice; the results of a history match can be judged for uniqueness only on the basis of individual experience and/or by repeated history-matching attempts with different reservoir models (different parameter sets as starting positions). The present study uses the stochastic theory of Kullback & Leibler (K-L) and its further development by Akaike (AIC), applied here for the first time in reservoir engineering, as the basis for assessing the uniqueness problem. Finally, the AIC principle was formulated as an empirical penalty term based on geoscientific and technical considerations. The newly formulated penalty term (Penalized Objective Function, POF) was tested successfully for the history matching of a North German gas field.
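To illustrate how an AIC-style penalized objective function ranks competing history-matched models, the sketch below scores three hypothetical models by misfit plus a parsimony penalty; the misfit values, parameter counts and Gaussian-error AIC formula are assumptions for illustration, and the thesis' POF contains additional empirically derived geoscientific and technical terms not reproduced here.

```python
# Minimal sketch: ranking competing history-matched reservoir models with an
# AIC-style penalized objective function (hypothetical numbers throughout).
import numpy as np

n_obs = 120  # number of measured production data points

# Candidates: (name, residual sum of squares of the match, no. of adjusted parameters)
candidates = [
    ("coarse model, 3 parameters", 48.0, 3),
    ("refined model, 7 parameters", 31.0, 7),
    ("very detailed model, 15 parameters", 29.5, 15),
]

def aic(rss, k, n):
    # AIC for least-squares fitting under i.i.d. Gaussian errors
    return n * np.log(rss / n) + 2 * k

scores = {name: aic(rss, k, n_obs) for name, rss, k in candidates}
best = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: AIC = {score:.1f}")
print("selected:", best)
```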
|
198 |
Portfolio management using computational intelligence approaches: forecasting and optimising the stock returns and stock volatilities with fuzzy logic, neural network and evolutionary algorithms. Skolpadungket, Prisadarng, January 2013.
Portfolio optimisation is subject to a number of constraints arising from practical matters and regulations. Closed-form mathematical solutions of portfolio optimisation problems usually cannot accommodate these constraints, and an exhaustive search for the exact solution can take a prohibitive amount of computational time. Portfolio optimisation models are also usually impaired by estimation error caused by the limited ability to predict the future accurately. A number of multi-objective genetic algorithms are proposed to solve the two-objective problem subject to cardinality, floor and round-lot constraints. Fuzzy logic is incorporated into the Vector Evaluated Genetic Algorithm (VEGA), but its solutions tend to cluster around a few points. The Strength Pareto Evolutionary Algorithm 2 (SPEA2) gives portfolio solutions that are evenly distributed along the efficient frontier, while MOGA is more time efficient. An Evolutionary Artificial Neural Network (EANN) is proposed that automatically evolves the ANN's initial values and its structure of hidden nodes and layers. The EANN gives better stock return forecasts than Ordinary Least Squares estimation and than Back Propagation and Elman Recurrent ANNs. Adaptation algorithms based on fuzzy logic-like rules are proposed to select the best pair of forecasting models for a given economic scenario; their predictive performance is better than that of the competing forecasting models. MOGA and SPEA2 are modified to include a third objective that handles model risk, and their performance is evaluated and tested. The results show that they perform better than the versions without the third objective.
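A minimal sketch of the two-objective portfolio setting is given below: random candidate portfolios satisfying a cardinality constraint are filtered to their non-dominated (Pareto) set in expected return and risk. The random candidates stand in for a genetic algorithm's population, and the asset statistics, constraint values and population size are synthetic assumptions.

```python
# Minimal sketch: non-dominated (Pareto) filtering of candidate portfolios under
# two objectives, expected return (maximise) and risk (minimise), with a simple
# cardinality constraint. All inputs are synthetic.
import numpy as np

rng = np.random.default_rng(5)
n_assets, max_holdings = 8, 4
mu = rng.uniform(0.02, 0.12, n_assets)                 # expected returns
A = rng.normal(size=(n_assets, n_assets))
cov = A @ A.T / n_assets                               # covariance matrix

def random_portfolio():
    picks = rng.choice(n_assets, size=max_holdings, replace=False)  # cardinality
    w = np.zeros(n_assets)
    w[picks] = rng.dirichlet(np.ones(max_holdings))                 # weights sum to 1
    return w

population = [random_portfolio() for _ in range(300)]
objectives = np.array([(w @ mu, np.sqrt(w @ cov @ w)) for w in population])

def non_dominated(obj):
    """Keep portfolios not dominated in (return up, risk down)."""
    keep = []
    for i, (r_i, s_i) in enumerate(obj):
        dominated = any(r_j >= r_i and s_j <= s_i and (r_j > r_i or s_j < s_i)
                        for j, (r_j, s_j) in enumerate(obj) if j != i)
        if not dominated:
            keep.append(i)
    return keep

front = non_dominated(objectives)
print(f"{len(front)} of {len(population)} candidates lie on the approximate efficient front")
```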
|
199 |
Exploring the Boundaries of Gene Regulatory Network Inference. Tjärnberg, Andreas, January 2015.
To understand how the components of a complex system like the biological cell interact and regulate each other, we need to collect data on how the components respond to system perturbations. Such data can then be used to solve the inverse problem of inferring a network that describes how the pieces influence each other. The work in this thesis deals with modelling the cell regulatory system, often represented as a network, with tools and concepts derived from systems biology. The first investigation focuses on network sparsity and algorithmic biases introduced by penalised network inference procedures. Many contemporary network inference methods rely on a sparsity parameter such as the L1 penalty term used in the LASSO. However, a poor choice of the sparsity parameter can give highly incorrect network estimates. In order to avoid such poor choices, we devised a method to optimise the sparsity parameter, which maximises the accuracy of the inferred network. We showed that it is effective on in silico data sets with a reasonable level of informativeness and demonstrated that accurate prediction of network sparsity is key to elucidating the correct network parameters. The second investigation focuses on how knowledge from association networks can be transferred to regulatory network inference procedures. It is common that the quality of expression data is inadequate for reliable gene regulatory network inference. Therefore, we constructed an algorithm to incorporate prior knowledge and demonstrated that it increases the accuracy of network inference when the quality of the data is low. The third investigation aimed to understand the influence of system and data properties on network inference accuracy. L1 regularisation methods commonly produce poor network estimates when the data used for inference is ill-conditioned, even when the signal-to-noise ratio is so high that all links in the network can be proven to exist at the given significance level. In this study we elucidated some general principles for the conditions under which we expect strongly degraded accuracy. Moreover, this allowed us to estimate the expected accuracy from properties of simulated data, which was used to predict the performance of inference algorithms on biological data. Finally, we built a software package, GeneSPIDER, for solving problems encountered during the previous investigations. The software package supports highly controllable network and data generation as well as data analysis and exploration in the context of network inference. / At the time of the doctoral defense, the following paper was unpublished and had a status as follows: Paper 4: Manuscript.
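A minimal sketch of sparsity-penalised network inference is given below: one Lasso regression per gene, with the L1 penalty chosen by cross-validation, recovers links of a small synthetic network. The steady-state perturbation model, the synthetic network and the use of cross-validated prediction error (rather than the thesis' accuracy-maximising choice of the sparsity parameter) are illustrative assumptions.

```python
# Minimal sketch: infer a small gene regulatory network by solving one sparse
# regression per gene, with the L1 penalty chosen by cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(11)
n_genes, n_experiments = 10, 40

# Ground-truth sparse interaction matrix A (diagonal = self-degradation)
A = -np.eye(n_genes)
for _ in range(15):
    i, j = rng.choice(n_genes, 2, replace=False)
    A[i, j] = rng.choice([-1.0, 1.0]) * rng.uniform(0.5, 1.0)

P = rng.normal(size=(n_genes, n_experiments))             # perturbations
# Steady state of dx/dt = A x + p  =>  Y = -A^{-1} P, plus measurement noise
Y = -np.linalg.solve(A, P) + 0.05 * rng.normal(size=(n_genes, n_experiments))

# Since A Y = -P at steady state, regress -P_i on the responses to estimate row i of A
A_hat = np.zeros_like(A)
for i in range(n_genes):
    fit = LassoCV(cv=5).fit(Y.T, -P[i])                   # alpha chosen by CV
    A_hat[i] = fit.coef_

true_links = (A != 0) & ~np.eye(n_genes, dtype=bool)
found_links = (np.abs(A_hat) > 1e-6) & ~np.eye(n_genes, dtype=bool)
tp = np.sum(true_links & found_links)
print(f"recovered {tp} of {true_links.sum()} true links, "
      f"{np.sum(found_links & ~true_links)} false positives")
```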
|
200 |
Analysis of price transmission and asymmetric adjustment using Bayesian econometric methodology. Acquah, Henry de-Graft, 31 January 2008.
No description available.
|