381 |
New regression methods for measures of central tendency. Aristodemou, Katerina. January 2014.
Measures of central tendency have been widely used for summarising statistical data, with the mean being the most popular summary statistic. However, in real-life applications it is not always the most representative measure of central location, especially for data that are skewed or contain outliers. Alternative, less biased statistics are the median and the mode. Median and quantile regression have been used in different fields to examine the effect of factors at different points of the distribution. Mode estimation, on the other hand, has found many applications in cases where the analysis focuses on the most typical value or pattern. This thesis demonstrates that the mode also plays an important role in the analysis of big data, which is becoming increasingly important in many sectors of the global economy. However, mode regression has not been widely applied, despite its clear conceptual benefit, because of the computational and theoretical limitations of existing estimators. Similarly, despite the popularity of the binary quantile regression model, computationally straightforward estimation techniques do not exist. Driven by the demand for simple, well-founded and easy-to-implement inference tools, this thesis develops a series of new regression methods for mode and binary quantile regression. Chapter 2 deals with mode regression from the Bayesian perspective and presents one parametric and two non-parametric methods of inference. Chapter 3 demonstrates a mode-based, fast pattern-identification method for big data and proposes the first fully parametric mode regression method, which effectively uncovers the dependence of typical patterns on a number of covariates. The proposed approach is demonstrated through the analysis of a decade-long dataset on Body Mass Index and associated factors from the Health Survey for England. Finally, Chapter 4 presents an alternative binary quantile regression approach, based on nonlinear least asymmetric weighted squares, which can be implemented using standard statistical packages and guarantees a unique solution.
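To make the asymmetric weighting behind the least asymmetric weighted squares approach concrete, the following minimal Python sketch fits a linear predictor by minimising an asymmetrically weighted squared loss. It illustrates the loss only and is not the thesis's binary, nonlinear estimator; the synthetic data, the linear specification and the choice of tau = 0.9 are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

def asymmetric_squared_loss(beta, X, y, tau):
    # Asymmetrically weighted squared residuals: weight tau above the fit,
    # (1 - tau) below it. This shows the weighting idea only; the thesis's
    # binary quantile estimator uses a nonlinear specification.
    r = y - X @ beta
    w = np.where(r >= 0, tau, 1.0 - tau)
    return np.sum(w * r ** 2)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=200)

fit = minimize(asymmetric_squared_loss, x0=np.zeros(3), args=(X, y, 0.9))
print(fit.x)  # coefficients of the tau = 0.9 asymmetric-weighted fit
```

Because the loss is smooth, a standard quasi-Newton optimiser suffices and the minimiser is unique for a full-rank design, which is the practical appeal of squared (rather than absolute) asymmetric weighting.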
|
382 |
Redovisnings- och revisionsbranschens påverkan av digitalisering / The impact of digitization in the accounting and auditing industry. Halvars, Viktoria; Svantorp, Petra. January 2016.
Previous research has shown that technological development has affected many industries. We have chosen to focus on one particular industry, and the purpose of this study is therefore to explain and understand how the accounting and auditing industry has been affected by the advance of digitization. The study is built around three research questions: the first examines how the accounting and auditing industry has developed and changed during the 2000s, the second examines important factors to consider when implementing digitization, and the third examines the change that accounting consultants and auditors are facing. The study is based on empirical data collected from accounting consultants and auditors and answers the three research questions against the background of the theoretical frame of reference. To fulfil the purpose of the study, a qualitative approach was chosen in which eleven semi-structured interviews were conducted with both accounting consultants and auditors. To gain a deeper understanding of the subject, relevant concepts and theories are discussed in a theoretical frame of reference, and the analysis builds on this theory and on quotations from the informants. Based on the informants' perceptions, our conclusion is that the accounting and auditing industry has been affected by the advance of digitization, above all through changed work tasks and through the accessibility and mobility that digitized working methods bring, which have given the informants more freedom in their work. In contrast to previous research on the subject, we have noticed that the auditing industry lags somewhat behind the accounting industry in implementing digitized working methods.
|
383 |
(Mis)trusting health research synthesis studies: exploring transformations of 'evidence'. Petrova, Mila. January 2014.
This thesis explores the transformations of evidence in health research synthesis studies – studies that bring together evidence from a number of research reports on the same or a similar topic. It argues that health research synthesis is a broad and intriguing field in a state of pre-formation, in spite of the fact that it may appear well established if equated with its exemplar method – the systematic review inclusive of meta-analysis. Transformations of evidence are processes by which pieces of evidence are modified from what they are in the primary study report into what is needed in the synthesis study while, supposedly, having their integrity fully preserved. Such processes have received no focused attention in the literature, yet they are key to the validity and reliability of synthesis studies. This work begins to describe them and explore their frequency, scope and drivers. A ‘meta-scientific’ perspective is taken, where ‘meta-scientific’ is understood to include primarily ideas from the philosophy of science and methodological texts in health research and, to a lesser extent, thinking from the social studies of science and the psychology of science. A range of meta-scientific ideas on evidence and the factors that shape it guide the analysis of the processes of “data extraction” and “coding”, during which much evidence is transformed. The core of the analysis involves the application of an extensive Analysis Framework to 17 highly heterogeneous research papers on cancer. Five non-standard ‘injunctions’ complement the Analysis Framework – for comprehensiveness, extensive multiple coding, extreme transparency, combination of critical appraisal and critique, and for first coding as close as possible to the original and then extending towards larger transformations. Findings suggest even lower credibility of the current overall model of health research synthesis than initially expected. Implications are discussed and a radical vision for the future proposed.
|
384 |
New Statistical Methods and Computational Tools for Mining Big Data, with Applications in Plant Sciences. Michels, Kurt Andrew. January 2016.
The purpose of this dissertation is to develop new statistical tools for mining big data in plant sciences. In particular, the dissertation consists of four inter-related projects that address various methodological and computational challenges in phylogenetic methods. Project 1 systematically tests different optimization tools and provides useful strategies for improving optimization in practice. Project 2 develops a new R package, rPlant, which provides a friendly and convenient toolbox for users of iPlant. Project 3 presents a fast and effective group-screening method to identify important genetic factors in genome-wide association studies (GWAS), with theoretical justification and desirable asymptotic properties. Project 4 develops a new statistical tool to identify gene-gene interactions, with the ability to handle interactions between groups of covariates.
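As a rough illustration of the group-screening idea in Project 3 (not the dissertation's method, which comes with its own theoretical guarantees), the sketch below ranks predefined SNP groups by a simple group-level F-type statistic and keeps the top-scoring groups; the synthetic genotype matrix, the grouping and the statistic are assumptions made for the example.

```python
import numpy as np

def group_screen(geno, pheno, groups, keep=2):
    """Rank predefined SNP groups by a simple group-level F-type statistic."""
    n = len(pheno)
    yc = pheno - pheno.mean()
    tss = np.sum(yc ** 2)
    scores = {}
    for name, idx in groups.items():
        G = geno[:, idx] - geno[:, idx].mean(axis=0)
        coef, *_ = np.linalg.lstsq(G, yc, rcond=None)   # fit phenotype on the group
        rss = np.sum((yc - G @ coef) ** 2)
        p = len(idx)
        scores[name] = ((tss - rss) / p) / (rss / (n - p - 1))
    return sorted(scores, key=scores.get, reverse=True)[:keep]

rng = np.random.default_rng(3)
geno = rng.integers(0, 3, size=(300, 12)).astype(float)        # 300 samples, 12 SNPs
pheno = geno[:, 0] - 0.8 * geno[:, 1] + rng.normal(size=300)   # only the first group matters
groups = {f"g{k}": list(range(4 * k, 4 * k + 4)) for k in range(3)}
print(group_screen(geno, pheno, groups))  # "g0" should rank first
```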
|
385 |
Studierendensymposium Informatik 2016 der TU Chemnitz / Student Symposium on Computer Science 2016 at TU Chemnitz. 04 May 2016.
As part of the 180th anniversary of the Technische Universität Chemnitz, the Department of Computer Science held its second Student Symposium on 28 April 2016. The symposium addressed all topics related to computer science and its applications: whether hardware or software, technical solutions or user studies, programming or use, hardcore technology or social issues, everything concerned with computational solutions was welcome. The symposium was restricted neither to the Department of Computer Science nor to the TU Chemnitz; submissions from thematically adjacent disciplines were explicitly solicited, and universities in the region were involved in the planning and organization. The proceedings contain the 21 contributions (full and short papers) that were presented at the symposium.
|
386 |
A Socio-technical Investigation of the Smart Grid: Implications for Demand-side Activities of Electricity Service Providers. Corbett, Jacqueline. 21 January 2013.
Enabled by advanced communication and information technologies, the smart grid represents a major transformation for the electricity sector. Vast quantities of data and two-way communication capabilities create the potential for a flexible, data-driven, multi-directional supply and consumption network well equipped to meet the challenges of the next century. For electricity service providers (“utilities”), the smart grid provides opportunities for improved business practices and new business models; however, a transformation of such magnitude is not without risks.
Three related studies are conducted to explore the implications of the smart grid for utilities’ demand-side activities. An initial conceptual framework, based on organizational information processing theory (OIPT), suggests that utilities’ performance depends on the fit between the information processing requirements and capacities associated with a given demand-side activity. Using secondary data and multiple regression analyses, the first study finds, consistent with OIPT, a positive relationship between utilities’ advanced meter deployments and demand-side management performance. However, it also finds that meters with only data collection capacities are associated with lower performance, suggesting the presence of information waste causing operational inefficiencies. In the second study, interviews with industry participants provide partial support for the initial conceptual model, yield new insights into information processing fit and information waste, and identify “big data” as a central theme of the smart grid. To derive richer theoretical insights, the third study employs a grounded theory approach, examining the experience of one successful utility in detail. Based on interviews and documentary data, the paradox of dynamic stability emerges as an essential enabler of utilities’ performance in the smart grid environment. Within this context, the frames of opportunity, control, and data limitation interact to support dynamic stability and contribute to innovation within tradition.
The main contributions of this thesis include theoretical extensions to OIPT and the development of an emergent model of dynamic stability in relation to big data. The thesis also adds to the green IS literature and identifies important practical implications for utilities as they endeavour to bring the smart grid to reality. / Thesis (Ph.D., Management), Queen's University, 2013.
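The first study's analysis is, at its core, a multiple regression of demand-side management performance on meter-deployment variables. A minimal sketch of such a fit is shown below on synthetic data; the variable names, the data-generating process and the effect signs are invented for illustration and are not the study's actual data or specification.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 150                                       # hypothetical utilities
advanced_meters = rng.uniform(0, 1, n)        # share of meters with two-way capabilities
collect_only_meters = rng.uniform(0, 1, n)    # share of meters that only collect data
dsm_performance = (0.8 * advanced_meters - 0.3 * collect_only_meters
                   + rng.normal(scale=0.5, size=n))

X = sm.add_constant(np.column_stack([advanced_meters, collect_only_meters]))
model = sm.OLS(dsm_performance, X).fit()
print(model.params)  # one positive and one negative slope, by construction of this toy data
```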
|
387 |
An artefact to analyse unstructured document data stores. Botes, André Romeo. January 2014.
Structured data stores have been the dominant technology for the past few decades. However, they lack the functionality to handle the ‘Big Data’ phenomenon. A new technology has recently emerged which stores unstructured data and can handle the ‘Big Data’ phenomenon. This study describes the development of an artefact to aid in the analysis of NoSQL document data stores in terms of relational database model constructs. Design science research (DSR) is the methodology implemented in the study; it is used to assist in the understanding, design and development of the problem, artefact and solution. This study explores the existing literature on DSR, in addition to structured and unstructured data stores. The literature review formulates the descriptive and prescriptive knowledge used in the development of the artefact. The artefact is developed using a series of six activities derived from two DSR approaches. The problem domain is derived from the existing literature and from a real application environment (RAE). The reviewed literature provided a general problem statement; a representative from NFM (the RAE) was interviewed for a situation analysis providing a specific problem statement. An objective is formulated for the development of the artefact, and suggestions are made to address the problem domain in support of the artefact's objective. The artefact is designed and developed using the descriptive knowledge of structured and unstructured data stores, combined with prescriptive knowledge of algorithms, pseudo code, continuous design and object-oriented design. The artefact evolves through multiple design cycles into a final product that analyses document data stores in terms of relational database model constructs. The artefact is evaluated for acceptability and utility, which provides credibility and rigour to the research in the DSR paradigm. Acceptability is demonstrated through simulation, and utility is evaluated in the real application environment: a representative from NFM was interviewed for the evaluation of the artefact. Finally, the study is communicated by describing its findings, summarising the artefact and looking into future possibilities for research and application. / MSc (Computer Science), North-West University, Vaal Triangle Campus, 2014
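To give a flavour of what such an artefact does, the toy Python sketch below scans a set of JSON-like documents and summarises field paths, observed types and occurrence counts, i.e. candidate columns and child tables in relational terms. It is purely illustrative and is not the artefact developed in the study; the sample documents are invented.

```python
from collections import defaultdict

def summarise_documents(docs, prefix=""):
    """Collect field paths, observed types and occurrence counts from documents."""
    summary = defaultdict(lambda: {"types": set(), "count": 0})
    for doc in docs:
        for key, value in doc.items():
            path = f"{prefix}{key}"
            summary[path]["types"].add(type(value).__name__)
            summary[path]["count"] += 1
            if isinstance(value, dict):  # nested document: a candidate child table
                for p, info in summarise_documents([value], path + ".").items():
                    summary[p]["types"] |= info["types"]
                    summary[p]["count"] += info["count"]
    return summary

docs = [
    {"_id": 1, "name": "Ann", "address": {"city": "Vanderbijlpark", "zip": "1900"}},
    {"_id": 2, "name": "Ben", "tags": ["vip"], "address": {"city": "Chemnitz"}},
]
for path, info in summarise_documents(docs).items():
    print(f"{path:15s} types={sorted(info['types'])} seen_in={info['count']} docs")
```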
|
388 |
Data-Centric Network of Things: A Method for Exploiting the Massive Amount of Heterogeneous Data of the Internet of Things in Support of Services. Xiao, Bin. January 2017.
The Internet of Things (IoT) generates massive amounts of heterogeneous data, which should be efficiently utilized to support services in different domains. Specifically, data need to be supplied to services in a way that reflects both the needs of the services and changes in the environment, so that the necessary data can be provided efficiently but without overfeeding. However, it is still very difficult for the IoT to fulfil such data supply with only the existing support of communication, networks and infrastructure, while the most essential issues remain unaddressed, namely heterogeneity, resource coordination and the dynamicity of environments. This calls for a specific study of those issues and for a method that utilizes the massive amount of heterogeneous data to support services in different domains. This dissertation presents a novel method, called the data-centric network of things (DNT), which handles heterogeneity, coordinates resources and understands the changing relations among IoT entities in dynamic environments in order to supply data in support of services. As a result, various services based on the IoT (e.g., smart cities, smart transport, smart healthcare, smart homes) are supported by receiving enough necessary data without being overfed. The contributions of the DNT to IoT and big data research are as follows. First, the DNT enables the IoT to perceive data, resources and the relations among IoT entities in dynamic environments; this perceptibility helps the IoT handle heterogeneity at different levels. Second, the DNT coordinates IoT edge resources to process and disseminate data based on the perceived results, which relieves, to a certain degree, the big data pressure caused by centralized analytics. Third, the DNT manages entity relations for data supply by handling environment dynamicity. Finally, the DNT supplies the necessary data to satisfy different service needs, avoiding both data-hungry and data-overfed states.
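The "enough but not overfed" idea can be pictured with a toy matcher that supplies each service only the fields it declares it needs. This is purely illustrative and assumes nothing about the DNT's actual design; the services, fields and readings are invented.

```python
from typing import Dict, List

def supply(service_needs: Dict[str, List[str]], readings: List[dict]) -> Dict[str, List[dict]]:
    """Give each service only the fields it asked for, avoiding overfeeding."""
    out = {}
    for service, fields in service_needs.items():
        out[service] = [{k: r[k] for k in fields if k in r} for r in readings]
    return out

readings = [
    {"sensor": "bus-12", "speed": 31, "temp": 18, "lat": 59.3, "lon": 18.1},
    {"sensor": "bus-14", "speed": 44, "temp": 19, "lat": 59.4, "lon": 18.0},
]
needs = {"smart_transport": ["sensor", "speed", "lat", "lon"],
         "smart_city_climate": ["sensor", "temp"]}
print(supply(needs, readings)["smart_city_climate"])  # only sensor id and temperature
```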
|
389 |
Bayesian Inference in Large-scale Problems. Johndrow, James Edward. January 2016.
Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, despite the continued relevance of uncertainty quantification in the sciences, where the number of parameters to estimate often exceeds the sample size, despite huge increases in the value of n typically seen in many fields. Thus, the tendency in some areas of industry to dispense with traditional statistical analysis on the basis that "n=all" is of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification, but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and is the primary motivation for the work presented here.

Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second area, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms, and characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.

One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.

Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations, and lack robust uncertainty quantification.
Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and other common population structure inference problems is assessed in simulations and a real data application.

In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis-Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis-Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.

Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.

The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo.
The Markov chain Monte Carlo method is the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel. Comparatively little attention has been paid to convergence and estimation error in these approximating Markov chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.

Data augmentation Gibbs samplers are arguably the most popular class of algorithm for approximately sampling from the posterior distribution for the parameters of generalized linear models. The truncated Normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset. / Dissertation
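For the data augmentation samplers discussed in the final paragraph, the sketch below implements the standard truncated-normal (Albert-Chib) Gibbs sampler for a probit model and runs it in a rare-event setting; the vague Gaussian prior and the simulated data are assumptions for illustration and do not reproduce the dissertation's experiments. With many observations but very few successes, successive draws of the coefficients are highly autocorrelated, which is the slow-mixing behaviour analysed in Chapter 7.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, n_iter=2000, prior_var=100.0):
    """Albert-Chib truncated-normal data augmentation for a probit model."""
    n, p = X.shape
    V = np.linalg.inv(X.T @ X + np.eye(p) / prior_var)  # posterior covariance of beta | z
    L = np.linalg.cholesky(V)
    beta, draws = np.zeros(p), np.empty((n_iter, p))
    rng = np.random.default_rng(0)
    for t in range(n_iter):
        mu = X @ beta
        # z_i | beta, y_i ~ N(mu_i, 1) truncated to (0, inf) if y_i = 1, (-inf, 0] otherwise
        a = np.where(y == 1, -mu, -np.inf)
        b = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(a, b, random_state=rng)
        # beta | z ~ N(V X'z, V)
        beta = V @ (X.T @ z) + L @ rng.standard_normal(p)
        draws[t] = beta
    return draws

# Rare-event setting: large n, very few successes
rng = np.random.default_rng(1)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.uniform(size=n) < 0.005).astype(float)        # roughly 0.5% successes
draws = probit_gibbs(X, y)
b0 = draws[500:, 0]                                    # intercept draws after burn-in
print(np.corrcoef(b0[:-1], b0[1:])[0, 1])              # lag-1 autocorrelation, close to 1
```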
|
390 |
New Advancements of Scalable Statistical Methods for Learning Latent Structures in Big Data. Zhao, Shiwen. January 2016.
Constant technological advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true for analyzing biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantitative technology, could be continuous numbers or counts. With the advancement of high-throughput technology, the abundance of such data has become unprecedentedly rich. Therefore, efficient statistical approaches are crucial in this big data era.

Previous statistical methods for big data often aim to find low dimensional structures in the observed data. For example, in a factor analysis model a latent Gaussian distributed multivariate vector is assumed; with this assumption, a factor model produces a low rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, in which the mixture proportions of topics are represented by a Dirichlet distributed variable. This dissertation proposes several novel extensions to these statistical methods, developed to address challenges in big data. The novel methods are applied in multiple real-world applications, including the construction of condition-specific gene co-expression networks, estimation of shared topics among newsgroups, analysis of promoter sequences, analysis of political-economic risk data, and estimation of population structure from genotype data. / Dissertation
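As a concrete illustration of the factor-analysis example mentioned in the abstract (a latent Gaussian factor model implying a low-rank-plus-diagonal covariance), the sketch below uses scikit-learn on synthetic data; the dimensions and noise level are arbitrary choices and the code is not taken from the dissertation.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n, p, k = 500, 30, 3                                    # samples, observed variables, latent factors
loadings = rng.normal(size=(p, k))
X = rng.normal(size=(n, k)) @ loadings.T + 0.5 * rng.normal(size=(n, p))

fa = FactorAnalysis(n_components=k).fit(X)
# The implied covariance is low rank plus diagonal: Lambda Lambda^T + Psi
low_rank_part = fa.components_.T @ fa.components_
cov_hat = low_rank_part + np.diag(fa.noise_variance_)
print(np.allclose(cov_hat, fa.get_covariance()))        # True: the same decomposition
print(np.linalg.matrix_rank(low_rank_part))             # k: the low-rank structure
```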
|