621
A Computational Perspective of Causal Inference and the Data Fusion Problem. Correa, Juan David (January 2021)
The ability to process and reason with causal information is fundamental to many aspects of human cognition and pervades the way we probe reality in the empirical sciences. Given the centrality of causality across human experience, we expect that the next generation of AI systems will need to represent causal knowledge, combine heterogeneous and biased datasets, and generalize across changing conditions and disparate domains to attain human-like intelligence.
This dissertation investigates a problem in causal inference known as Data Fusion, which is concerned with inferring causal and statistical relationships from a combination of heterogeneous data collections gathered in different domains, under various experimental conditions, and with nonrandom sampling (sampling selection bias). Despite the general conditions and algorithms developed so far for many facets of the fusion problem, significant aspects remain that are not well understood and have not been studied in combination, even though they appear together in many challenging real-world applications.
Specifically, this work advances our understanding of several dimensions of the data fusion problem, which include the following capabilities and research questions.
Reasoning with Soft Interventions. Under what conditions can the effect of conditional and stochastic policies be identified in a complex data fusion setting? Specifically, under what conditions can the effect of a new stochastic policy be evaluated using data from disparate sources, collected under different experimental conditions?
Deciding Statistical Transportability. Under what conditions can statistical relationships (e.g., conditional distributions, classifiers) be extrapolated across disparate domains, where the target domain is related to, but not the same as, the source domain in which the data were originally collected? How can additional data over a few variables in the target domain be leveraged to help with the generalization process?
Recovering from Selection Bias. How can we determine whether a preferentially selected sample can be recovered so as to make claims about the general underlying super-population? How can additional data over a subset of the variables, sampled at random, be used to achieve this goal?
Instead of developing conditions and algorithms for each problem independently, this thesis introduces a computational framework capable of solving these research problems when they appear together. The approach decomposes the query and the available heterogeneous distributions into factors with a canonical form. The inference process then reduces to mapping the required factors to those available from the data and evaluating the query, as a function of the input, based on that mapping.
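As a minimal illustration of this factor-mapping idea (not the general algorithm developed in the thesis), the sketch below evaluates a simple interventional query from purely observational data by decomposing it into factors the data supply, via the classical adjustment formula; all variables and parameters are synthetic assumptions.

```python
import numpy as np
import pandas as pd

# Minimal illustration of factor mapping: the interventional query
# P(y | do(x)) is decomposed as sum_z P(y | x, z) * P(z), and each factor
# is mapped to a quantity estimable from the observational data. Valid
# here because Z is the only confounder by construction; synthetic data.
rng = np.random.default_rng(0)
n = 100_000
z = rng.binomial(1, 0.4, n)                      # confounder
x = rng.binomial(1, 0.2 + 0.5 * z)               # treatment depends on Z
y = rng.binomial(1, 0.1 + 0.3 * x + 0.4 * z)     # outcome depends on X and Z
df = pd.DataFrame({"x": x, "y": y, "z": z})

def backdoor_effect(df, x_val):
    """Evaluate P(y=1 | do(X=x_val)) by mapping query factors to data factors."""
    total = 0.0
    for z_val, p_z in df["z"].value_counts(normalize=True).items():
        sub = df[(df["x"] == x_val) & (df["z"] == z_val)]
        total += sub["y"].mean() * p_z
    return total

ate = backdoor_effect(df, 1) - backdoor_effect(df, 0)
print(f"adjusted causal effect: {ate:.3f} (true value: 0.3)")
```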
The problems and methods discussed have several applications in the empirical sciences, statistics, machine learning, and artificial intelligence.
622
Me, Myself and I: time-inconsistent stochastic control, contract theory and backward stochastic Volterra integral equations. Hernandez Ramirez, Miguel Camilo (January 2021)
This thesis studies the decision-making of agents exhibiting time-inconsistent preferences and its implications in the context of contract theory. We take a probabilistic approach to continuous-time, non-Markovian, time-inconsistent stochastic control problems for sophisticated agents. By introducing a refinement of the notion of equilibrium, an extended dynamic programming principle is established. In turn, this leads us to consider an infinite family of backward stochastic differential equations (BSDEs) analogous to the classical Hamilton–Jacobi–Bellman equation. This system is fundamental in the sense that its well-posedness is both necessary and sufficient to characterise equilibria and the associated value function. In addition, under modest assumptions, the existence and uniqueness of a solution are established.
With the previous results in mind, we then study a new general class of multidimensional type-I backward stochastic Volterra integral equations (BSVIEs). Towards this goal, the well-posedness of an infinite family of standard BSDEs is established. Interestingly, its well-posedness is equivalent to that of the type-I BSVIE. This result yields a representation formula in terms of a semilinear partial differential equation of Hamilton–Jacobi–Bellman type. In perfect analogy with the theory of BSDEs, the case of Lipschitz-continuous generators is addressed first, followed by the quadratic case. In particular, our results show the equivalence of the probabilistic and analytic approaches to time-inconsistent stochastic control problems.
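For concreteness, one standard form of a type-I BSVIE is sketched below; the notation is an assumption on our part, since the thesis's exact formulation is not reproduced here.

```latex
\[
Y_t = \xi(t) + \int_t^T g\bigl(t, s, Y_s, Z_{t,s}\bigr)\,\mathrm{d}s
      - \int_t^T Z_{t,s}\,\mathrm{d}W_s, \qquad t \in [0, T],
\]
```

Here $W$ is a Brownian motion, $\xi$ the free term, $g$ the generator, and the solution is the pair $(Y, Z)$. One way to see the connection to an infinite family of BSDEs is to fix the first time argument: for each $t$, the equation in $s$ with data $(\xi(t), g(t, \cdot, \cdot, \cdot))$ is a standard BSDE, and the BSVIE solution is recovered along the diagonal.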
Finally, this thesis studies the contracting problem between a standard utility-maximiser principal and a sophisticated time-inconsistent agent. We show that the contracting problem faced by the principal can be reformulated as a novel class of control problems exposing the complications introduced by the agent's preferences. This corresponds to the control of a forward Volterra equation via constrained Volterra-type controls. The structure of this problem is inherently related to the representation of the agent's value function via extended type-I backward stochastic Volterra integral equations.
Despite the inherent challenges of this class of problems, our reformulation allows us to study the solution under different specifications of preferences for the principal and the agent. This allows us to discuss the qualitative and methodological implications of our results for contract theory: (i) from a methodological point of view, unlike in the time-consistent case, the solution to the moral hazard problem does not reduce, in general, to a standard stochastic control problem; (ii) our analysis shows that slight deviations from seminal models in contract theory challenge the virtues attributed to linear contracts, suggesting that such contracts typically cease to be optimal for time-inconsistent agents; (iii) in line with recent developments in the time-consistent literature, we find that the optimal contract in the time-inconsistent scenario is, in general, non-Markovian in the state process X.
623
Automated Valuation Models (AVMs) with applications. Šmardová, Eva (January 2019)
Predictions of market values are important for the investment decisions and risk management of banking institutions, developers and, last but not least, households. Increased access to real-estate market data and the reduction of valuation costs have been among the most important reasons for the worldwide interest in developing, and subsequently using, automated valuation models (AVMs). However, the adoption of AVMs in the Czech Republic is still minimal. The aim of the thesis is to describe, at a theoretical level, the alternative statistical methods used by AVMs, such as fuzzy logic, artificial neural networks (ANNs), spatial econometrics and hedonic models; to characterize AVMs and their use; to analyze their current application in the Czech Republic; and to outline possible further development.
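One of the AVM building blocks named above, the hedonic model, can be sketched in a few lines; the features, coefficients, and data below are hypothetical, for illustration only.

```python
import numpy as np
import statsmodels.api as sm

# Minimal hedonic pricing sketch: log price regressed on property attributes,
# whose coefficients are interpreted as implicit marginal attribute prices.
# All features and parameter values here are hypothetical.
rng = np.random.default_rng(1)
n = 500
floor_area = rng.uniform(30, 150, n)            # m^2
rooms = rng.integers(1, 6, n)
dist_center = rng.uniform(0.5, 20, n)           # km to the city center
log_price = (11.0 + 0.008 * floor_area + 0.05 * rooms
             - 0.02 * dist_center + rng.normal(0, 0.15, n))

X = sm.add_constant(np.column_stack([floor_area, rooms, dist_center]))
model = sm.OLS(log_price, X).fit()
print(model.params)                             # implicit attribute prices
```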
624
Use of Statistical Methods for Progression Evaluation of Parkinson's Disease. Pecha, Jiří (January 2015)
This master's thesis deals with the use of statistical methods for evaluating the progression of Parkinson's disease. It gives a brief description of the disease and then covers the processing and evaluation of speech parameters affected by it. The thesis models these values using classification and regression trees and evaluates the results using the mean absolute error and the estimation error. All processing and evaluation were carried out in MATLAB.
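The thesis carries out this analysis in MATLAB; an analogous sketch in Python, with synthetic stand-in data and hypothetical feature names, might look like:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Fit a regression tree to predict a progression score from speech
# parameters and report the mean absolute error. Synthetic placeholders.
rng = np.random.default_rng(2)
n, p = 300, 10
speech_params = rng.normal(size=(n, p))          # e.g., jitter, shimmer, ...
score = 20 + 3 * speech_params[:, 0] - 2 * speech_params[:, 1] + rng.normal(0, 2, n)

X_tr, X_te, y_tr, y_te = train_test_split(speech_params, score, random_state=0)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, tree.predict(X_te)))
```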
625
Statistical methods for evaluation of sensory data. Kozielová, Magda (January 2009)
The thesis deals with the statistical evaluation of data obtained by the sensory analysis of foodstuffs. It offers a selection of suitable statistical tests, a detailed analysis of these tests, and a comparison based on their power functions for given parameters. An important part of the thesis is the development of custom software for evaluating sensory data.
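A comparison of tests through their power functions can be sketched by Monte Carlo simulation; the distributions, effect sizes, and pair of tests below are illustrative assumptions, not the thesis's actual selection.

```python
import numpy as np
from scipy import stats

# Monte Carlo estimate of test power at several effect sizes, comparing a
# t-test with a rank-based alternative; all parameters are hypothetical.
rng = np.random.default_rng(3)

def power(test, delta, n=30, reps=2000, alpha=0.05):
    hits = 0
    for _ in range(reps):
        a = rng.normal(0.0, 1.0, n)          # panel scores for product A
        b = rng.normal(delta, 1.0, n)        # panel scores for product B
        hits += test(a, b).pvalue < alpha
    return hits / reps

for delta in (0.3, 0.5, 0.8):
    print(delta,
          power(stats.ttest_ind, delta),
          power(stats.mannwhitneyu, delta))
```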
626
Statistical analysis of large scale data with perturbation subsampling. Yao, Yujing (January 2022)
The past two decades have witnessed rapid growth in the amount of data available to us. Many fields, including physics, biology, and medical studies, generate enormous datasets with a large sample size, a high number of dimensions, or both. For example, some datasets in physics contain millions of records. A Statista survey forecasts that in 2022 there will be over 86 million users of health apps in the United States, generating massive mHealth data. In addition, more and more large studies have been carried out, such as the UK Biobank study. This gives us unprecedented access to data and allows us to extract and infer vital information. Meanwhile, it also poses new challenges for statistical methodologies and computational algorithms.
For increasingly large datasets, computation can be a major hurdle for valid analysis. Conventional statistical methods lack the scalability to handle such large sample sizes, and data storage and processing may exceed usual computer capacity. The UK Biobank genotype and phenotype dataset contains about 500,000 individuals with more than 800,000 genotyped single nucleotide polymorphism (SNP) measurements per person, a size that may well exceed a computer's physical memory. Further, high dimensionality combined with a large sample size can lead to heavy computational cost and algorithmic instability.
The aim of this dissertation is to provide statistical approaches that address these issues. Chapter 1 reviews the existing literature. In Chapter 2, a novel perturbation subsampling approach is developed, based on independent and identically distributed stochastic weights, for the analysis of large scale data. The method is justified for optimizing convex criterion functions by establishing asymptotic consistency and normality of the resulting estimators. It provides a consistent point estimator and a consistent variance estimator simultaneously, and it is also feasible in a distributed framework. The finite sample performance of the proposed method is examined through simulation studies and real data analysis.
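A minimal sketch of the perturbation-subsampling idea for a convex criterion (here, least squares with i.i.d. mean-one exponential weights) follows; it illustrates the ingredients only and is not the exact algorithm analyzed in the chapter.

```python
import numpy as np

# Sketch of perturbation subsampling for a convex criterion (least squares):
# each replicate draws a subsample, reweights it with i.i.d. mean-one
# exponential weights, and solves the weighted problem. Replicate averages
# give a point estimate; the replicate spread reflects sampling variability
# (a scheme-dependent rescaling relates it to the full-data standard error).
rng = np.random.default_rng(4)
n, p, m, B = 50_000, 5, 2_000, 200        # full size, dim, subsample size, replicates
X = rng.normal(size=(n, p))
beta_true = np.arange(1.0, p + 1.0)
y = X @ beta_true + rng.normal(size=n)

reps = []
for _ in range(B):
    idx = rng.choice(n, size=m, replace=False)        # subsample rows
    w = rng.exponential(1.0, size=m)                  # i.i.d. stochastic weights
    Xw = X[idx] * np.sqrt(w)[:, None]                 # weighted least squares
    yw = y[idx] * np.sqrt(w)
    reps.append(np.linalg.lstsq(Xw, yw, rcond=None)[0])

reps = np.asarray(reps)
print("point estimate:", reps.mean(axis=0).round(3))
print("replicate spread (std):", reps.std(axis=0).round(3))
```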
In Chapter 3, a repeated block perturbation subsampling method is developed for the analysis of large scale longitudinal data using the generalized estimating equations (GEE) approach, a general method for analyzing longitudinal data by fitting marginal models. The proposed method provides a consistent point estimator and a consistent variance estimator simultaneously, and the asymptotic properties of the resulting subsample estimators are studied. The finite sample performance of the proposed methods is evaluated through simulation studies and an mHealth data analysis.
With the development of technology, large scale high dimensional data are also increasingly prevalent. Conventional statistical methods for high dimensional data, such as the adaptive lasso (AL), lack the scalability to process such large sample sizes. Chapter 4 introduces the repeated perturbation subsampling adaptive lasso (RPAL), a new procedure that incorporates features of both perturbation and subsampling to yield a robust, computationally efficient estimator for variable selection, statistical inference, and finite sample false discovery control in the analysis of big data. RPAL is well suited to modern parallel and distributed computing architectures while retaining generic applicability and statistical efficiency. The theoretical properties of RPAL are studied, and simulation studies compare the proposed estimator to the full data estimator and to traditional subsampling estimators. The proposed method is also illustrated with the analysis of omics datasets.
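The ingredients of such a procedure, adaptive penalty weights, perturbed subsamples, and aggregated selection frequencies, can be sketched as follows; this illustrates the ideas only and is not the RPAL procedure from the chapter.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Sketch: (1) adaptive penalty weights from an initial ridge fit, realized
# by column rescaling, (2) repeated perturbed subsample lasso fits,
# (3) selection frequencies aggregated across replicates. Illustration only.
rng = np.random.default_rng(5)
n, p, m, B = 2_000, 20, 500, 100
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                       # sparse truth
y = X @ beta + rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
ada_w = 1.0 / (np.abs(ridge.coef_) + 1e-6)        # adaptive penalty weights

select_count = np.zeros(p)
for _ in range(B):
    idx = rng.choice(n, size=m, replace=False)    # subsample
    w = rng.exponential(1.0, size=m)              # perturbation weights
    Xs = (X[idx] / ada_w) * np.sqrt(w)[:, None]   # column rescaling = adaptive lasso
    ys = y[idx] * np.sqrt(w)
    fit = Lasso(alpha=0.1).fit(Xs, ys)
    select_count += np.abs(fit.coef_) > 1e-8      # selected in this replicate
print("selection frequencies:", select_count / B)
```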
627
Statistical Perspectives on Modern Network Embedding Methods. Davison, Andrew (January 2022)
Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction performed on diverse data sets, including protein-protein interaction networks, social networks and citation networks. A frequent approach to these tasks begins by learning a Euclidean embedding of the network, to which machine learning algorithms developed for vector-valued data are then applied.
For large networks, embeddings are learned using stochastic gradient methods in which the subsampling scheme can be freely chosen. This distinguishes the setting from that of traditional i.i.d. data, where there is essentially only one way of subsampling: selecting the data points uniformly at random without replacement. Despite the strong empirical performance of embeddings produced in this manner, they are not well understood theoretically, particularly with regard to the role of the sampling scheme.
Here, we develop a unifying framework encapsulating representation learning methods for networks that are trained by performing gradient updates obtained by subsampling the network, including random-walk-based approaches such as node2vec. In particular, we prove, under the assumption that the network has an exchangeable law, that the distribution of the learned embedding vectors asymptotically decouples. We characterize the asymptotic distribution of the learned embedding vectors and give the corresponding rates of convergence, which depend on factors such as the sampling scheme, the choice of loss function, and the choice of embedding dimension. This provides a theoretical foundation for understanding what the embedding vectors represent and how well these methods perform on downstream tasks; in particular, we apply our results to argue that the embedding vectors produced by node2vec can be used to perform weakly consistent community detection.
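To make the training setup concrete, the sketch below performs skip-gram-style gradient updates on pairs subsampled via random walks, the node2vec-flavored scheme covered by the framework; the toy graph, single embedding matrix, and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of embedding training via random-walk subsampling with
# negative sampling. Each step draws a freely chosen subsample (a walk),
# then takes logistic-loss gradient steps on positive and negative pairs.
rng = np.random.default_rng(6)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)      # adjacency of a toy graph
n, d, lr = A.shape[0], 8, 0.05
U = rng.normal(scale=0.1, size=(n, d))         # one embedding vector per node

def random_walk(start, length=5):
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(np.flatnonzero(A[walk[-1]])))
    return walk

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
for _ in range(2000):
    walk = random_walk(rng.integers(n))        # the subsample for this step
    for i, j in zip(walk[:-1], walk[1:]):      # co-occurring (positive) pairs
        k = int(rng.integers(n))               # one negative sample per pair
        for other, label in ((j, 1.0), (k, 0.0)):
            g = sigmoid(U[i] @ U[other]) - label
            gi, go = g * U[other].copy(), g * U[i].copy()
            U[i] -= lr * gi                    # logistic-loss gradient steps
            U[other] -= lr * go
print(np.round(U, 2))
```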
628
Statistical and Machine Learning Methods for Precision Medicine. Chen, Yuan (January 2021)
Heterogeneous treatment responses are commonly observed in patients with mental disorders, so a universal treatment strategy may not be adequate, and tailored treatments adapted to individual characteristics could improve responses. The theme of the dissertation is to develop statistical and machine learning methods that address patient heterogeneity and derive robust, generalizable individualized treatment strategies by integrating evidence from multi-domain data and multiple studies, with the goal of achieving precision medicine. Unique challenges arising from research on mental disorders need to be addressed in order to facilitate personalized medical decision-making in clinical practice. This dissertation contains four projects that pursue these goals while addressing the challenges: (i) a statistical method to learn dynamic treatment regimes (DTRs) by synthesizing independent trials over different stages when sequential randomization data are not available; (ii) a statistical method to learn optimal individualized treatment rules (ITRs) for mental disorders by modeling patients' latent mental states with probabilistic generative models; (iii) an integrative learning algorithm to incorporate multi-domain and multi-treatment-phase measures for optimizing individualized treatments; (iv) a statistical machine learning method to optimize ITRs that benefit subjects in a target population for mental disorders, with improved learning efficiency and generalizability.
DTRs adaptively prescribe treatments based on patients' intermediate responses and evolving health status over multiple treatment stages. Data from sequential multiple assignment randomization trials (SMARTs) are recommended for learning DTRs. However, due to the re-randomization of the same patients over multiple treatment stages and a prolonged follow-up period, SMARTs are often difficult to implement and costly to manage, and patient adherence is always a concern in practice. To lessen these practical challenges, in the first part of the dissertation we propose an alternative approach that learns optimal DTRs by synthesizing independent trials over different stages without using data from SMARTs. Specifically, at each stage, data from a single randomized trial are used, along with patients' natural medical history and health status from previous stages. We use a backward learning method to estimate the optimal treatment decision at a particular stage, where a patient's optimal future outcome increment is estimated using data observed from the independent trials that carry information on future stages. Under certain conditions, we show that the proposed method yields consistent estimation of the optimal DTRs and achieves the same learning rates as methods based on SMARTs. We conduct simulation studies to demonstrate the advantage of the proposed method. Finally, we learn DTRs for treating major depressive disorder (MDD) by stage-wise synthesis of two randomized trials. A validation study on independent subjects shows that the synthesized DTRs lead to the greatest MDD symptom reduction compared to alternative methods.
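A stripped-down sketch of backward learning across two independently collected stages (Q-learning flavor, synthetic data, an assumed status transition) is given below; it illustrates the recursion, not the thesis's exact synthesis method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two-stage backward learning sketch: the stage-2 model is fit on one
# randomized trial, each stage-1 subject's optimal future outcome is
# predicted from it, and the stage-1 model is fit on another trial.
rng = np.random.default_rng(7)

def make_trial(n):
    s = rng.normal(size=n)                    # health status at this stage
    a = rng.choice([-1, 1], size=n)           # randomized treatment
    y = 1 + 0.5 * s + a * (0.8 * s) + rng.normal(0, 0.5, n)
    return s, a, y

# Stage 2: learn the Q-function from trial 2.
s2, a2, y2 = make_trial(1_000)
q2 = LinearRegression().fit(np.column_stack([s2, a2, s2 * a2]), y2)

def v2(s):
    """Optimal stage-2 value: maximize the fitted Q-function over a in {-1, +1}."""
    preds = [q2.predict(np.column_stack([s, np.full_like(s, a), s * a]))
             for a in (-1.0, 1.0)]
    return np.maximum(*preds)

# Stage 1: learn from trial 1, plugging in the estimated future increment.
s1, a1, y1 = make_trial(1_000)
pseudo = y1 + v2(s1 + 0.3 * a1)               # assumed status transition
q1 = LinearRegression().fit(np.column_stack([s1, a1, s1 * a1]), pseudo)
print("stage-1 coefficients:", q1.coef_.round(3))
```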
The second part of the dissertation focuses on optimizing individualized treatments for mental disorders. Due to disease complexity, substantial diversity in patients' symptomatology within the same diagnostic category is widely observed. Leveraging measurement model theory from psychiatry and psychology, we learn a patient's intrinsic latent mental status from psychological or clinical symptoms under a probabilistic generative model, the restricted Boltzmann machine (RBM), through which patients' heterogeneous symptoms are represented by an economical number of latent variables while the model remains flexible. These latent mental states provide a better characterization of the underlying disorder status than a simple summary score of the symptoms, and they serve as more reliable and representative features for differentiating treatment responses. We then optimize a value function defined by the post-treatment latent states, exploiting a transformation of the observed symptoms based on the RBM without modeling the relationship between the latent mental states before and after treatment. The optimal treatment rules are derived using a weighted large-margin classifier, and we derive the convergence rate of the proposed estimator under the latent models. Simulation studies are conducted to test the performance of the proposed method. Finally, we apply the developed method to real-world studies, demonstrating its utility and advantage in tailoring treatments for patients with major depression and identifying patient subgroups informative for treatment recommendations.
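The pipeline can be caricatured in a few lines: an RBM extracts latent features from binary symptom indicators, and a weighted large-margin classifier turns outcomes into a treatment rule; the weighting below is a simplification of outcome-weighted learning, and all data are synthetic placeholders.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.svm import SVC

# Caricature of the pipeline: an RBM extracts latent features from binary
# symptom indicators; a weighted large-margin classifier maps them to a
# treatment rule. Simplified weighting, synthetic data.
rng = np.random.default_rng(8)
n, n_symptoms = 600, 12
symptoms = rng.binomial(1, 0.4, size=(n, n_symptoms))
treatment = rng.choice([-1, 1], size=n)                  # randomized assignment
outcome = 1 + 0.5 * symptoms[:, 0] * treatment + rng.normal(0, 0.3, n)

rbm = BernoulliRBM(n_components=3, random_state=0).fit(symptoms)
latent = rbm.transform(symptoms)                         # latent state features

# Larger (shifted to nonnegative) outcomes get more weight, so the rule
# favors the treatments that worked well for similar patients.
weights = outcome - outcome.min() + 1e-3
rule = SVC(kernel="rbf").fit(latent, treatment, sample_weight=weights)
print("recommended treatments:", rule.predict(latent[:5]))
```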
In the third part of the dissertation, building on the general framework introduced in the previous part, we propose an integrated learning algorithm that simultaneously learns patients' underlying mental states and recommends optimal treatments for each individual, with improved learning efficiency. It incorporates both pre- and post-treatment outcomes in learning the invariant latent structure and integrates outcome measures from different domains to characterize patients' mental health more comprehensively. A multi-layer neural network is used to allow for complex treatment effect heterogeneity. The optimal treatment policy can be inferred for future patients by comparing their potential mental states under different treatments, given the observed multi-domain pre-treatment measurements. Experiments on simulated data and real-world clinical trial data show that the learned treatment policies compare favorably to alternative methods on heterogeneous treatment effects and have broad utility, leading to better patient outcomes across multiple domains.
The fourth part of the dissertation aims to infer optimal treatments of mental disorders for a target population, accounting for potential distributional disparities between the patient data we collect and the target population of interest. To achieve this, we propose a learning approach that connects measurement theory, an efficient weighting procedure, and a flexible neural network architecture through latent variables. In our method, patients' underlying mental states are represented by a reduced number of latent state variables, allowing domain knowledge to be incorporated, and the invariant latent structure is preserved for interpretability and validity. Subject-specific weights that balance population differences are constructed from these compact latent variables, which capture the major variations and facilitate the weighting procedure owing to the reduced dimensionality. Data from multiple studies can be integrated to learn the latent structure, improving learning efficiency and generalizability. Extensive simulation studies demonstrate the consistent superiority of the proposed method and its weighting scheme over alternative methods when applied to the target population. We also apply the method to real-world studies to recommend treatments to patients with major depressive disorder, demonstrating the broader utility of the learned ITRs in improving the mental states of patients in the target population.
629
Essays on Spatial Econometrics: Theories and Applications. Liu, Xiaotian (22 July 2021)
First Chapter: The ordinary least squares (OLS) estimator for spatial autoregressions may be consistent, as pointed out by Lee (2002), provided that each spatial unit is influenced aggregately by a significant portion of the total units. This paper presents a unified asymptotic distribution result for the properly recentered OLS estimator and proposes a new estimator based on the indirect inference (II) procedure. The resulting estimator can always be used, regardless of the degree of aggregate influence on each spatial unit from the other units, and is consistent and asymptotically normal. The new estimator does not rely on distributional assumptions and is robust to unknown heteroscedasticity. Its good finite-sample performance, in comparison with existing estimators that are also robust to heteroscedasticity, is demonstrated by a Monte Carlo study.

Second Chapter: This paper proposes a new estimation procedure for the first-order spatial autoregressive (SAR) model, where the disturbance term also follows a first-order autoregression and its innovations may be heteroscedastic. The procedure is based on the principle of indirect inference, matching the ordinary least squares estimator of the two SAR coefficients (one in the outcome equation and the other in the disturbance equation) with its approximate analytical expectation. The resulting estimator is shown to be consistent, asymptotically normal, and robust to unknown heteroscedasticity. Monte Carlo experiments demonstrate its finite-sample performance in comparison with existing estimators based on the generalized method of moments. The new estimation procedure is applied to empirical studies of teenage pregnancy rates and Airbnb accommodation prices.

Third Chapter: This paper presents a sample selection model with spatial autoregressive interactions and studies the maximum likelihood (ML) approach to estimating it. Consistency and asymptotic normality of the ML estimator are established via the spatial near-epoch dependence (NED) properties of the selection and outcome variables. Monte Carlo simulations, based on the characteristics of a female labor supply example, show that the proposed estimator has good finite-sample performance. The new model is applied to an empirical study examining the impact of climate change on agriculture in Southeast Asia.
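To illustrate the setting of the first chapter, the sketch below simulates a first-order SAR model with a dense, row-normalized weight matrix, where each unit is aggregately influenced by all others, and estimates the spatial coefficient by OLS; the specification is a toy assumption.

```python
import numpy as np

# Toy illustration: simulate y = rho*W y + x*beta + eps with dense,
# row-normalized weights and estimate rho by OLS (regress y on Wy and x),
# the case where OLS can be consistent (Lee, 2002). Assumed parameters.
rng = np.random.default_rng(9)
n, rho, beta = 400, 0.4, 1.5
W = np.ones((n, n)) - np.eye(n)
W /= W.sum(axis=1, keepdims=True)               # row-normalized weights
x = rng.normal(size=n)
y = np.linalg.solve(np.eye(n) - rho * W, beta * x + rng.normal(size=n))

Z = np.column_stack([W @ y, x])                 # regressors: spatial lag, x
coef = np.linalg.lstsq(Z, y, rcond=None)[0]
print(f"OLS estimate of rho: {coef[0]:.3f} (true value {rho})")
```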
630
Analysis of trends in ambient air quality. Martin, Michael Kelly (January 1977)
Thesis: M.S., Massachusetts Institute of Technology, Sloan School of Management, 1977. Includes bibliographical references. By Michael K. Martin.