Global ETD Search

381	New Statistical Methods and Computational Tools for Mining Big Data, with Applications in Plant Sciences Michels, Kurt Andrew January 2016 (has links) The purpose of this dissertation is to develop new statistical tools for mining big data in plant sciences. In particular, the dissertation consists of four inter-related projects that address various methodological and computational challenges in phylogenetic methods. Project 1 systematically tests different optimization tools and provides useful strategies to improve optimization in practice. Project 2 develops a new R package, rPlant, which provides a friendly and convenient toolbox for users of iPlant. Project 3 presents a fast and effective group-screening method to identify important genetic factors in GWAS, with theoretical justification and good asymptotic properties. Project 4 develops a new statistical tool to identify gene-gene interactions, which can also handle interactions between groups of covariates. Forward Regression Genome Wide Association Study Group Data Interactions R Statistics Big Data
382	Studierendensymposium Informatik 2016 der TU Chemnitz / Students Symposium Computer Science in 2016 at the TU Chemnitz 04 May 2016 (has links) (PDF) In the course of the 180-year anniversary of the Technische Universität Chemnitz, the Department of Computer Science held its second Students Symposium on April 28, 2016. The symposium addressed all topics related to computer science and its applications: whether hardware or software, technical solutions or user studies, programming or use, hardcore technology or social issues, everything concerned with computational solutions was welcome. The symposium was limited neither to the Department of Computer Science nor to the TU Chemnitz: submissions from thematically adjacent subjects were explicitly solicited, and universities in the region were involved in planning and organization. The proceedings contain the 21 papers (full and short) that were presented at the symposium. Algorithms Context Recognition Intelligent Systems Collaboration Big Data Web ddc:000 Algorithm Robotics Media
383	A Socio-technical Investigation of the Smart Grid: Implications for Demand-side Activities of Electricity Service Providers Corbett, JACQUELINE 21 January 2013 (has links) Enabled by advanced communication and information technologies, the smart grid represents a major transformation for the electricity sector. Vast quantities of data and two-way communications abilities create the potential for a flexible, data-driven, multi-directional supply and consumption network well equipped to meet the challenges of the next century. For electricity service providers (“utilities”), the smart grid provides opportunities for improved business practices and new business models; however, a transformation of such magnitude is not without risks. Three related studies are conducted to explore the implications of the smart grid on utilities’ demand-side activities. An initial conceptual framework, based on organizational information processing theory, suggests that utilities’ performance depends on the fit between the information processing requirements and capacities associated with a given demand-side activity. Using secondary data and multiple regression analyses, the first study finds, consistent with OIPT, a positive relationship between utilities’ advanced meter deployments and demand-side management performance. However, it also finds that meters with only data collection capacities are associated with lower performance, suggesting the presence of information waste causing operational inefficiencies. In the second study, interviews with industry participants provide partial support for the initial conceptual model, new insights are gained with respect to information processing fit and information waste, and “big data” is identified as a central theme of the smart grid. To derive richer theoretical insights, the third study employs a grounded theory approach examining the experience of one successful utility in detail. 
Based on interviews and documentary data, the paradox of dynamic stability emerges as an essential enabler of utilities’ performance in the smart grid environment. Within this context, the frames of opportunity, control, and data limitation interact to support dynamic stability and contribute to innovation within tradition. The main contributions of this thesis include theoretical extensions to OIPT and the development of an emergent model of dynamic stability in relation to big data. The thesis also adds to the green IS literature and identifies important practical implications for utilities as they endeavour to bring the smart grid to reality. / Thesis (Ph.D, Management) -- Queen's University, 2013-01-21 demand-side smart grid big data dynamic stability information system information processing sociotechnical
384	An artefact to analyse unstructured document data stores Botes, André Romeo January 2014 (has links) Structured data stores have been the dominating technologies for the past few decades. Although dominating, structured data stores lack the functionality to handle the ‘Big Data’ phenomenon. A new technology has recently emerged which stores unstructured data and can handle the ‘Big Data’ phenomenon. This study describes the development of an artefact to aid in the analysis of NoSQL document data stores in terms of relational database model constructs. Design science research (DSR) is the methodology implemented in the study and it is used to assist in the understanding, design and development of the problem, artefact and solution. This study explores the existing literature on DSR, in addition to structured and unstructured data stores. The literature review formulates the descriptive and prescriptive knowledge used in the development of the artefact. The artefact is developed using a series of six activities derived from two DSR approaches. The problem domain is derived from the existing literature and a real application environment (RAE). The reviewed literature provided a general problem statement. A representative from NFM (the RAE) is interviewed for a situation analysis providing a specific problem statement. An objective is formulated for the development of the artefact and suggestions are made to address the problem domain, assisting the artefact’s objective. The artefact is designed and developed using the descriptive knowledge of structured and unstructured data stores, combined with prescriptive knowledge of algorithms, pseudo code, continuous design and object-oriented design. The artefact evolves through multiple design cycles into a final product that analyses document data stores in terms of relational database model constructs. The artefact is evaluated for acceptability and utility. 
This provides credibility and rigour to the research in the DSR paradigm. Acceptability is demonstrated through simulation and the utility is evaluated using a real application environment (RAE). A representative from NFM is interviewed for the evaluation of the artefact. Finally, the study is communicated by describing its findings, summarising the artefact and looking into future possibilities for research and application. / MSc (Computer Science), North-West University, Vaal Triangle Campus, 2014 Design science research Structured data stores Unstructured data stores Artefact NoSQL Big data
385	Data-Centric Network of Things : A Method for Exploiting the Massive Amount of Heterogeneous Data of Internet of Things in Support of Services Xiao, Bin January 2017 (has links) The Internet of Things (IoT) generates massive amounts of heterogeneous data, which should be utilized efficiently to support services in different domains. Specifically, data need to be supplied to services based on an understanding of both the services’ needs and changes in the environment, so that the necessary data can be provided efficiently without overfeeding. However, it is still very difficult for IoT to achieve such data supply with only the existing support from communication, network, and infrastructure, because the most essential issues remain unaddressed: heterogeneity, resource coordination, and the dynamicity of environments. It is therefore necessary to study those issues specifically and to propose a method that utilizes the massive amount of heterogeneous data to support services in different domains. This dissertation presents a novel method, called the data-centric network of things (DNT), which handles heterogeneity, coordinates resources, and understands the changing IoT entity relations in dynamic environments to supply data in support of services. As a result, various IoT-based services (e.g., smart cities, smart transport, smart healthcare, and smart homes) are supported by receiving enough of the necessary data without overfeeding. The contributions of the DNT to IoT and big data research are as follows. First, the DNT enables IoT to perceive data, resources, and the relations among IoT entities in dynamic environments. This perceptibility improves IoT’s ability to handle heterogeneity at different levels. Second, the DNT coordinates IoT edge resources to process and disseminate data based on the perceived results. This relieves, to a certain degree, the big data pressure caused by centralized analytics. 
Third, the DNT manages entity relations for data supply by handling the environment dynamicity. Finally, the DNT supplies the necessary data to satisfy different service needs, avoiding both data-hungry and data-overfed states. Internet of Things Big Data Artificial Intelligence Data Supply Distributed System Computer Systems Datorsystem
386	Bayesian Inference in Large-scale Problems Johndrow, James Edward January 2016 (has links) Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, despite the continued relevance of uncertainty quantification in the sciences, where the number of parameters to estimate often exceeds the sample size, despite huge increases in the value of n typically seen in many fields. Thus, the tendency in some areas of industry to dispense with traditional statistical analysis on the basis that "n=all" is of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification, but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and is the primary motivation for the work presented here. Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is the design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second area, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms, and characterizing their performance in large-scale settings. 
Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory. One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions. Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations, and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. 
In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and other common population structure inference problems is assessed in simulations and a real data application. In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis-Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis-Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models. Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. 
Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data. The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo. The Markov chain Monte Carlo method is the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel. Comparatively little attention has been paid to convergence and estimation error in these approximating Markov chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. 
The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models. Data augmentation Gibbs samplers are arguably the most popular class of algorithm for approximately sampling from the posterior distribution for the parameters of generalized linear models. The truncated Normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset. / Dissertation Statistics Bayesian big data contingency table high-dimensional Markov chain Monte Carlo tail dependence
387	New Advancements of Scalable Statistical Methods for Learning Latent Structures in Big Data Zhao, Shiwen January 2016 (has links) Constant technology advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true for analyzing biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantification technology, could be continuous numbers or counts. With the advancement of high-throughput technology, the abundance of such data has become unprecedentedly rich. Efficient statistical approaches are therefore crucial in this big data era. Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, a factor analysis model assumes a latent multivariate vector with a Gaussian distribution. With this assumption, a factor model produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, which assumes that the mixture proportions of topics are represented by a Dirichlet-distributed variable. This dissertation proposes several novel extensions to these previous statistical methods, developed to address challenges in big data. The novel methods are applied in multiple real-world applications, including construction of condition-specific gene co-expression networks, estimating shared topics among newsgroups, analysis of promoter sequences, analysis of political-economics risk data, and estimating population structure from genotype data. / Dissertation Statistics Bioinformatics Mathematics Bayesian Statistics Big Data Dimension Reduction Latent Structure Method of Moment
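The low-rank covariance mentioned in entry 387 follows from the standard factor analysis model. The block below is a textbook sketch rather than the dissertation's own notation; the symbols $x$, $\Lambda$, $\eta$, $\epsilon$, $\Psi$, $p$ and $k$ are assumptions introduced here for illustration.

```latex
% Standard factor model: p observed variables, k << p latent factors.
x = \Lambda \eta + \epsilon, \qquad
\eta \sim \mathcal{N}(0, I_k), \qquad
\epsilon \sim \mathcal{N}(0, \Psi), \quad \Psi \text{ diagonal}
% With \eta and \epsilon independent, the implied covariance is
% a rank-k matrix plus a diagonal matrix:
\operatorname{Cov}(x) = \Lambda \Lambda^{\top} + \Psi,
\qquad \Lambda \in \mathbb{R}^{p \times k}
```

Under these assumptions the covariance is described by roughly $pk + p$ parameters instead of $p(p+1)/2$, which is the sense in which the estimate is low rank.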
388	Improvement of recommendation system for a wholesale store chain using advanced data mining techniques Videla Cavieres, Iván Fernando January 2015 (has links) Master's in Operations Management / Industrial Civil Engineer / In retail companies, Customer Intelligence areas have many opportunities to improve their strategic decisions using the information they could obtain from records of interactions with their customers. However, processing these large volumes of data has become a challenge. One of the problems faced day to day is segmenting or grouping customers. Most companies form groups according to spending level rather than by similarity of shopping baskets, as the literature proposes. Another challenge for these companies is to increase sales on each customer visit and to build loyalty. One of the techniques used to achieve this is recommender systems. In this work, around half a billion transactional records from a wholesale supermarket chain were processed. When traditional clustering and market basket analysis techniques are applied, the results are of low quality and very difficult to interpret; moreover, they fail to identify groups that allow a customer to be classified according to historical purchases. On the understanding that the simultaneous presence of two products on the same receipt implies a relationship between them, a graph mining method based on social networks was used, which yielded identifiable groups of products, which we call communities, to which a customer can belong. The robustness of the model is verified by the stability of the groups generated over different time periods. Under the same restrictions that the company requires, recommendations are generated based on historical purchases and on customers' membership in the different product groups. 
In this way, customers receive much more relevant recommendations that are not based solely on what other customers also bought. This novel way of solving the customer segmentation problem helps improve by 140% the current recommendation method used by the Chilean wholesale supermarket chain. This translates into an increase of more than 430% in potential revenue. Data mining Multidimensional measurements Consumer preference Consumers - Research Retail Big data
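The idea in entry 388, that two products appearing on the same receipt are related, can be sketched as a co-purchase graph. The code below is a minimal illustration under loud assumptions: the function names, the `min_count` threshold, and the sample receipts are all hypothetical, and connected components stand in for the thesis's community detection method, which the abstract does not specify.

```python
from collections import defaultdict
from itertools import combinations


def co_purchase_edges(receipts, min_count=2):
    """Count how often each product pair shares a receipt; keep frequent pairs.

    `min_count` is an illustrative threshold, not a value from the thesis.
    """
    counts = defaultdict(int)
    for basket in receipts:
        # Deduplicate and sort so each unordered pair has one canonical key.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}


def product_groups(edges):
    """Group products by connected components of the co-purchase graph.

    This is a stand-in for a real community detection algorithm.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(group)
    return groups


# Hypothetical receipts, for illustration only.
receipts = [
    ["rice", "beans", "oil"],
    ["rice", "beans"],
    ["soap", "detergent"],
    ["soap", "detergent", "rice"],
    ["oil", "rice"],
]
edges = co_purchase_edges(receipts, min_count=2)
groups = product_groups(edges)
```

With these sample receipts, the pairs that co-occur at least twice link rice with beans and oil, and soap with detergent, so two product groups emerge. A customer could then be assigned to the groups that cover most of their historical purchases.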
389	INFLUENCE ANALYSIS TOWARDS BIG SOCIAL DATA Han, Meng 03 May 2017 (has links) Large-scale social data from online social networks, instant messaging applications, and wearable devices have recently seen exponential growth in the number of users and activities. The rapid proliferation of social data provides rich information and many possibilities for understanding and analyzing the complex inherent mechanisms that govern the evolution of the new technology age. Influence, a natural product of information diffusion (or propagation), represents the change in an individual’s thoughts, attitudes, and behaviors resulting from interaction with others, and is one of the fundamental processes in social worlds. Influence analysis therefore occupies a very prominent place in the theory, models, and algorithms of social data analysis. In this dissertation, we study influence analysis in the setting of big social data. First, we investigate the uncertainty of influence relationships in the social network. A novel sampling scheme is proposed which enables the development of an efficient algorithm to measure uncertainty. Considering the practicality of neighborhood relationships in real social data, a framework is introduced to transform uncertain networks into deterministic weighted networks in which edge weights can be measured by a Jaccard-like index. Second, focusing on the dynamics of social data, we propose a practical framework that probes only a subset of communities to explore the real changes in social network data. Our probing framework minimizes the possible difference between the observed topology and the actual network through several representative communities. We also propose an algorithm that takes full advantage of our divide-and-conquer strategy, which reduces the computational overhead. 
Third, if we let the number of influenced users be the depth of propagation and the area covered by influenced users be the breadth, most research results focus only on influence depth rather than influence breadth. Timeliness, acceptance ratio, and breadth are three important factors that significantly affect the result of influence maximization in practice, but researchers neglect them most of the time. To fill this gap, we investigate a novel algorithm that incorporates time delay for timeliness, opportunistic selection for acceptance ratio, and broad diffusion for influence breadth. In our model, the breadth of influence is measured by the number of covered communities, and the tradeoff between depth and breadth of influence can be balanced by a specific parameter. Furthermore, we address the problem of privacy-preserving influence maximization in both physical location networks and online social networks. We merge the sensed location information collected from the cyber-physical world and the relationship information gathered from online social networks into a unified framework with a comprehensive model. We then propose an efficient algorithm to solve the influence maximization problem. At the same time, a privacy-preserving mechanism is proposed to protect the cyber-physical location and link information from the application side. Last but not least, to address the challenge of large-scale data, we design an efficient influence maximization framework based on two new models, which incorporate the dynamism of networks and take into account the time constraints on the influence spreading process in practice. All proposed problems and models of influence analysis have been empirically studied and verified on different large-scale, real-world social data in this dissertation. Algorithm Influence Analysis Big Data Social Network Data Mining Data Privacy
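Entry 389 mentions edge weights measured by a "Jaccard-like index" without defining it. As a hedged illustration, the sketch below uses the standard Jaccard index on closed neighborhoods (each node's neighbors plus the node itself). The function names and the toy graph are hypothetical, and the dissertation's actual index may differ.

```python
def jaccard(a, b):
    """Standard Jaccard index: |A intersect B| / |A union B| (0.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weight_edges(neighbors):
    """Assign each undirected edge the Jaccard index of its endpoints'
    closed neighborhoods. `neighbors` maps node -> set of adjacent nodes.

    Assumption: closed neighborhoods are used so that adjacent nodes
    always share at least each other; the source does not specify this.
    """
    weights = {}
    for u, nbrs in neighbors.items():
        for v in nbrs:
            if u < v:  # visit each undirected edge once
                weights[(u, v)] = jaccard(neighbors[u] | {u}, neighbors[v] | {v})
    return weights


# Hypothetical four-node graph, for illustration only.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
weights = weight_edges(graph)
```

In this toy graph, nodes `a` and `b` have identical closed neighborhoods and get weight 1.0, while the edge to the peripheral node `d` gets the lowest weight, 0.5.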
390	What are the Potential Impacts of Big Data, Artificial Intelligence and Machine Learning on the Auditing Profession? Evett, Chantal 01 January 2017 (has links) To maintain public confidence in the financial system, it is essential that most financial fraud is prevented and that incidents of fraud are detected and punished. The responsibility of uncovering creatively implemented fraud is placed, in large part, on auditors. Recent advancements in technology are helping auditors turn the tide against fraudsters. Big Data, made possible by the proliferation, widespread availability and amalgamation of diverse digital data sets, has become an important driver of technological change. Big Data analytics are already transforming the traditional audit. Sampling and testing a limited number of random samples has turned into a much more comprehensive audit that analyzes the entire population of transactions within an account, allowing auditors to flag and investigate all sorts of potentially fraudulent anomalies that were previously invisible. Artificial intelligence (AI) programs, typified by IBM’s Watson, can mimic the thought processes of the human mind and will soon be adopted by the auditing profession. Machine learning (ML) programs, with the ability to change when exposed to new data, are developing rapidly and may take over many of the decision-making functions currently performed by auditors. The SEC has already implemented pioneering fraud-detection software based on AI and ML programs. The evolution of the auditor’s role has already begun. Current accounting students must understand that the traditional auditing skillset will no longer be sufficient. While facing a future with fewer auditing positions available due to increased automation, auditors will need training for roles that will be more data-analytical and computer-science based. Auditing Big Data Analytics Artificial Intelligence Machine Learning Fraud Accounting Accounting Technology and Innovation

Search results