1 |
Robust A-optimal Subsampling for Massive Data Robust Linear Regression. Ziting Tang (8081000), 05 December 2019
This thesis is concerned with massive data analysis via robust, A-optimally efficient non-uniform subsampling. Motivated by the fact that massive data often contain outliers and that uniform sampling is not efficient, we derive a number of sampling distributions by minimizing the sum of the component variances of the subsampling estimate; these sampling distributions are robust against outliers. Massive data pose two computational bottlenecks: the data may exceed a computer's storage space, and computation may require an unacceptably long wait. Both bottlenecks can be addressed simultaneously by selecting a subsample as a surrogate for the full sample and carrying out the data analysis on it. We develop our theory in a typical setting for robust linear regression in which the estimating functions are not differentiable. For an arbitrary sampling distribution, we establish consistency of the subsampling estimate for both fixed and growing dimension (high dimensionality being common in massive data), and we prove asymptotic normality for fixed dimension. We discuss the A-optimal scoring method for fast computing. Large simulations evaluate the numerical performance of our proposed A-optimal sampling distribution, and real data applications are also presented.
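As a rough illustration of the general recipe (pilot fit, score-based sampling probabilities, inverse-probability-weighted refit), here is a minimal Python sketch. The probability formula below is a generic robustness-weighted choice for illustration only, not the thesis's A-optimal distribution, and all function names are placeholders.

```python
import numpy as np

def huber_psi(r, c=1.345):
    """Huber psi function: bounds the influence of large residuals."""
    return np.clip(r, -c, c)

def score_based_subsample(X, y, subsample_size, pilot_size=1000, seed=0):
    """Non-uniform subsampling sketch: pilot fit, robust sampling scores,
    then a weighted solve on the subsample (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape

    # Step 1: pilot estimate from a small uniform subsample.
    idx0 = rng.choice(n, size=min(pilot_size, n), replace=False)
    beta0, *_ = np.linalg.lstsq(X[idx0], y[idx0], rcond=None)

    # Step 2: sampling scores, larger for informative rows but bounded in the
    # residual through psi so that single outliers cannot dominate.
    resid = y - X @ beta0
    scores = np.linalg.norm(X, axis=1) * np.abs(huber_psi(resid)) + 1e-8
    probs = scores / scores.sum()

    # Step 3: with-replacement sampling and inverse-probability weights.
    idx = rng.choice(n, size=subsample_size, replace=True, p=probs)
    w = 1.0 / (subsample_size * probs[idx])

    # Step 4: weighted least-squares solve on the subsample, standing in for
    # solving the weighted robust estimating equations.
    sw = np.sqrt(w)
    beta_hat, *_ = np.linalg.lstsq(sw[:, None] * X[idx], sw * y[idx], rcond=None)
    return beta_hat
```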
|
2 |
Distributed and Multiphase Inference in Theory and Practice: Principles, Modeling, and Computation for High-Throughput Science. Blocker, Alexander Weaver, 18 September 2013
The rise of high-throughput scientific experimentation and data collection has introduced new classes of statistical and computational challenges. The technologies driving this data explosion are subject to complex new forms of measurement error, requiring sophisticated statistical approaches. Simultaneously, statistical computing must adapt to larger volumes of data and new computational environments, particularly parallel and distributed settings. This dissertation presents several computational and theoretical contributions to these challenges. In chapter 1, we consider the problem of estimating the genome-wide distribution of nucleosome positions from paired-end sequencing data. We develop a modeling approach based on nonparametric templates that controls for variability due to enzymatic digestion. We use this to construct a calibrated Bayesian method to detect local concentrations of nucleosome positions. Inference is carried out via a distributed HMC algorithm that scales linearly in complexity with the length of the genome being analyzed. We provide MPI-based implementations of the proposed methods, stand-alone and on Amazon EC2, which can provide inferences on an entire S. cerevisiae genome in less than 1 hour on EC2. We then present a method for absolute quantitation from LC-MS/MS proteomics experiments in chapter 2. We present a Bayesian model for the non-ignorable missing data mechanism induced by this technology, which includes an unusual combination of censoring and truncation. We provide a scalable MCMC sampler for inference in this setting, enabling full-proteome analyses using cluster computing environments. A set of simulation studies and actual experiments demonstrate this approach's validity and utility. We close in chapter 3 by proposing a theoretical framework for the analysis of preprocessing under the banner of multiphase inference. Preprocessing forms an oft-neglected foundation for a wide range of statistical and scientific analyses. We provide some initial theoretical foundations for this area, including distributed preprocessing, building upon previous work in multiple imputation. We demonstrate that multiphase inferences can, in some cases, even surpass standard single-phase estimators in efficiency and robustness. Our work suggests several paths for further research into the statistical principles underlying preprocessing. / Statistics
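The distributed HMC sampler above is specific to the nucleosome model; as background, the leapfrog update at the core of any HMC method can be sketched generically as follows (a textbook step on a user-supplied log density, not the dissertation's implementation).

```python
import numpy as np

def leapfrog(theta, momentum, grad_log_p, step_size, n_steps):
    """Leapfrog integration of Hamiltonian dynamics (generic sketch)."""
    m = momentum + 0.5 * step_size * grad_log_p(theta)   # half momentum step
    for _ in range(n_steps - 1):
        theta = theta + step_size * m                     # full position step
        m = m + step_size * grad_log_p(theta)             # full momentum step
    theta = theta + step_size * m
    m = m + 0.5 * step_size * grad_log_p(theta)           # final half step
    return theta, m

def hmc_step(theta, log_p, grad_log_p, step_size=0.1, n_steps=20, rng=None):
    """One HMC transition with a Metropolis accept/reject correction."""
    rng = rng or np.random.default_rng()
    momentum = rng.standard_normal(theta.shape)
    theta_new, m_new = leapfrog(theta, momentum, grad_log_p, step_size, n_steps)
    current_h = -log_p(theta) + 0.5 * momentum @ momentum
    proposed_h = -log_p(theta_new) + 0.5 * m_new @ m_new
    if np.log(rng.uniform()) < current_h - proposed_h:
        return theta_new
    return theta

# Example target: a standard normal, log p(x) = -0.5 * x @ x.
# theta = hmc_step(np.zeros(3), lambda t: -0.5 * t @ t, lambda t: -t)
```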
|
3 |
Energetic Path Finding Across Massive Terrain Data. Tsui, Andrew N, 01 June 2009
Before there were airplanes, cars, trains, boats, or bicycles, the primary means of transportation was on foot. Unfortunately, many of the trails used by ancient travelers have long since been abandoned. We present a software tool which can help visualize and predict where these forgotten trails might lie through the use of a human-centered cost metric. By comparing the paths generated by our software with known historical trails, we demonstrate how the tool can indicate likely trails used by ancient travelers. In addition, this new tool provides novel visualizations to better help the user understand alternate paths, the effect of terrain, and nearby areas of interest. Such a tool could be used by archaeologists and historians to better visualize and understand the terrain and paths around sites of historical interest.
This thesis is a continuation of previous work, with emphasis on the ability to generate paths which traverse several thousand kilometers. To accomplish this, various graph simplification and path approximation algorithms are explored to construct a real-time path finding algorithm. To this end, we show that it is possible to restrict the search space for a path finding algorithm without compromising accuracy. Combined with a multi-threaded variant of Dijkstra's shortest path algorithm, we present a tool capable of traversing the contiguous US, a dataset containing over 19 billion data points, in under three hours on a 2.5 GHz dual-core processor. The tool is demonstrated on several examples which show the potential archaeological and historical applicability, and provide avenues for future improvements.
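To make the energetic-cost idea concrete, here is a plain single-threaded Dijkstra sketch over a terrain graph with a hypothetical slope-penalized cost per edge; the thesis's human-centered metric, graph simplification, search-space restriction, and multi-threading are far more involved.

```python
import heapq
import math

def energy_cost(dist_m, rise_m):
    """Hypothetical walking-energy cost of one edge: base distance cost plus
    penalties for uphill and downhill grades (illustrative, not the thesis metric)."""
    slope = rise_m / dist_m if dist_m > 0 else 0.0
    return dist_m * (1.0 + 8.0 * max(slope, 0.0) + 2.0 * abs(min(slope, 0.0)))

def energetic_shortest_path(neighbors, source, target):
    """Dijkstra over an implicit graph.  `neighbors(node)` yields
    (next_node, dist_m, rise_m) tuples for each outgoing edge."""
    best = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue                       # stale heap entry
        for nxt, dist_m, rise_m in neighbors(node):
            new_cost = cost + energy_cost(dist_m, rise_m)
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if target not in best:
        return math.inf, []                # target unreachable
    path, node = [target], target
    while node != source:                  # walk predecessors back to source
        node = prev[node]
        path.append(node)
    return best[target], path[::-1]
```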
|
4 |
Multi-Resolution Statistical Modeling in Space and Time With Application to Remote Sensing of the Environment. Johannesson, Gardar, 12 May 2003
No description available.
|
5 |
Connaissance et optimisation de la prise en charge des patients : la science des réseaux appliquée aux parcours de soins / Understanding and optimization of patient care and services: network science applied to healthcare pathways. Jaffré, Marc-Olivier, 26 October 2018
En France, la nécessaire rationalisation des moyens alloués aux hôpitaux a abouti à une concentration des ressources et une augmentation de la complexité des plateaux techniques. Leur pilotage et leur répartition territoriale s’avèrent d’autant plus difficiles, soulevant ainsi la problématique de l’optimisation des systèmes de soins. L’utilisation des données massives produites par ces systèmes pourrait constituer une nouvelle approche en matière d’analyse et d’aide à la décision. Méthode : A partir d’une réflexion sur la notion de performance, différentes approches d’optimisation préexistantes sont d’abord mises en évidence. Le bloc opératoire a été choisi en tant que terrain expérimental. Suit une analyse sur une fusion d’établissements en tant qu’exemple d’une approche d’optimisation par massification. Ces deux étapes permettent de défendre une approche alternative qui associe l’usage de données massives, la science des réseaux et la visualisation des données sous forme cartographique. Deux sets de séjours en chirurgie orthopédique sur la région ex-Midi-Pyrénées sont utilisés. L’enchainement des séjours de soins est considéré en tant que réseau de données. L’ensemble est projeté dans un environnement visuel développé en JavaScript et permettant une fouille dynamique du graphe. Résultats : La possibilité de visualiser des parcours de santé sous forme de graphes NŒUDS-LIENS est démontrée. Les graphes apportent une perception supplémentaire sur les enchainements de séjours et les redondances des parcours. Le caractère dynamique des graphes permet en outre leur fouille. L’approche visuelle subjective est complétée par une série de mesures objectives issues de la science des réseaux. Conclusion : Les plateaux techniques de soins produisent des données massives utiles à leur analyse et potentiellement à leur optimisation. La visualisation graphique de ces données associées à un cadre d’analyse tel que la science des réseaux donne des premiers indicateurs positifs avec notamment la mise en évidence de motifs redondants. La poursuite d’expérimentations à plus large échelle est requise pour valider, renforcer et diffuser ces observations et cette méthode. / In France, the necessary streamlining of the resources allocated to hospitals has led to a concentration of resources and a growing complexity of healthcare facilities. Piloting them and planning their territorial distribution become all the more difficult, raising the problem of optimizing healthcare systems. The use of the massive data produced by these systems, in association with network science, could offer an alternative approach for analysis and decision-making support in healthcare. Method: Various preexisting optimization approaches are first highlighted, based on observations in operating theatres chosen as experimental sites. An analysis of the merger of two hospitals then follows as an example of optimization by massification. These two steps support an alternative approach that combines the use of massive data, network science, and cartographic data visualization. Two sets of orthopedic-surgery stays in the former Midi-Pyrénées region are used to build a network of all sequences of care. The whole is displayed in a visual environment developed in JavaScript that allows dynamic mining of the graph. Results: Visualizing healthcare pathways as node-link graphs is demonstrated. The graphs provide an additional perception of the sequences of stays and the redundancies of the pathways, and their dynamic character also allows direct mining. The initial visual approach is supplemented by a series of objective measures drawn from network science. Conclusion: Healthcare facilities produce massive data valuable for their analysis and, potentially, their optimization. Visualizing these data within a framework such as network science gives encouraging preliminary indicators, notably by uncovering redundant healthcare pathway patterns. Further experimentation with larger and more varied data sets is required to validate, strengthen, and disseminate these observations and this method.
|
6 |
Algorithm design on multicore processors for massive-data analysis. Agarwal, Virat, 28 June 2010
Analyzing massive data sets and streams is computationally very challenging. Data sets in systems biology, network analysis, and security use a network abstraction to construct large-scale graphs. Graph algorithms such as traversal and search are memory-intensive and typically require very little computation, with access patterns that are irregular and fine-grained. The increasing streaming data rates in domains such as security, mining, and finance leave algorithm designers with only a handful of clock cycles (with current general-purpose computing technology) to process every incoming byte of data in-core at real time. This, together with the increasing complexity of mining patterns and other analytics, puts further pressure on an already high computational requirement. Processing streaming data in finance comes with the additional constraint of low latency, which prevents the algorithm from using common techniques, such as batching, to obtain high throughput.

The primary contributions of this dissertation are novel parallel data-analysis algorithms for graph traversal on large-scale graphs, pattern recognition and keyword scanning on massive streaming data, financial market data feed processing and analytics, and data transformation. These algorithms capture the machine-independent aspects of the problems, to guarantee portability with performance to future processors, together with high-performance implementations on multicore processors that embed processor-specific optimizations. Our breadth-first search graph traversal algorithm demonstrates the capability to process massive graphs with billions of vertices and edges on commodity multicore processors at rates competitive with supercomputing results in the recent literature. We also present high-performance, scalable keyword scanning on streaming data using a novel automata-compression algorithm, a model of computation based on small software content-addressable memories (CAMs), and a unique data layout that forces data reuse and minimizes memory traffic. Using a high-level algorithmic approach to process financial feeds, we present a solution that decodes and normalizes option market data at rates an order of magnitude beyond the current needs of the market, yet remains portable and flexible to other feeds in this domain. Throughout the dissertation we discuss in detail the algorithm design challenges of processing massive data and present solutions and techniques that we believe can be used and extended to solve future research problems in this domain.
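The parallel breadth-first search referred to above is tuned to specific multicore memory systems; the level-synchronous frontier pattern it builds on can be sketched sequentially on a CSR-format graph as follows (an illustrative sketch, not the dissertation's implementation).

```python
import numpy as np

def bfs_levels(row_ptr, col_idx, source):
    """Level-synchronous BFS on a graph in compressed sparse row form.
    Each iteration expands the whole frontier at once; that per-level loop
    is the unit of work multicore implementations partition across threads."""
    n = len(row_ptr) - 1
    level = np.full(n, -1, dtype=np.int64)      # -1 means "not yet visited"
    level[source] = 0
    frontier = np.array([source], dtype=np.int64)
    depth = 0
    while frontier.size:
        depth += 1
        next_frontier = []
        for u in frontier:
            for v in col_idx[row_ptr[u]:row_ptr[u + 1]]:
                if level[v] < 0:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = np.array(next_frontier, dtype=np.int64)
    return level
```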
|
7 |
Large scale data collection and storage using smart vehicles: An information-centric approach / Collecte et stockage de données à large échelle par des véhicules intelligents : une approche centrée sur le contenu. Khan, Junaid, 04 November 2016
De nos jours, le nombre de dispositifs ne cesse d’augmenter, ce qui induit une forte demande des applications en données multimédia. Cependant, gérer des données massives générées et consommées par les utilisateurs mobiles dans une zone urbaine reste une problématique de taille pour les réseaux cellulaires existants, qui sont à la fois limités en termes de coût et de bande passante, mais aussi du fait de la nature centrée-connexion de telles données. D’autre part, l’avancée technologique en matière de véhicules autonomes permet de constituer une infrastructure prometteuse capable de prendre en charge le traitement, la sauvegarde et la communication de ces données. En effet, il est maintenant possible de recruter des véhicules intelligents à des fins de collecte, de stockage et de partage des données hétérogènes en provenance d’un réseau routier afin de répondre aux demandes des citoyens via des applications. Par conséquent, nous tirons profit de l’évolution récente en « Information Centric Networking » (ICN) afin d’introduire deux nouvelles approches de collecte et de stockage de contenu par les véhicules, nommées respectivement VISIT et SAVING, plus efficaces et plus proches de l’utilisateur mobile en zone urbaine, remédiant ainsi aux problèmes liés à la bande passante et au coût. VISIT est une plate-forme qui définit de nouvelles mesures de centralité basées sur l’intérêt social des citoyens afin d’identifier et de sélectionner l’ensemble approprié des meilleurs véhicules candidats pour la collecte des données urbaines. SAVING est un système de stockage de données sociales, qui présente une solution de mise en cache collaborative des données entre un ensemble de véhicules, désignés et recrutés parmi d’autres selon une stratégie de théorie des jeux basée sur les réseaux complexes. Nous avons testé VISIT et SAVING sur des données simulées pour environ 2986 véhicules avec des traces de mobilité réalistes en zone urbaine, et les résultats ont montré que les deux méthodes permettent une collecte et un stockage non seulement efficaces mais aussi scalables. / The growth in the number of mobile devices today results in an increasing demand for large amounts of rich multimedia content to support numerous applications. It is, however, challenging for current cellular networks to deal with such increasing demand, in terms of both cost and bandwidth, for the "massive" content generated and consumed by mobile users in an urban environment, due to its connection-centric nature. The technological advancement in modern vehicles allows us to harness their computing, caching, and communication capabilities to supplement the infrastructure network. It is now possible to recruit smart vehicles to collect, store, and share heterogeneous data on urban streets in order to provide citizens with different services. We therefore leverage the recent shift towards Information Centric Networking (ICN) to introduce two schemes, VISIT and SAVING, for the efficient collection and storage of content at vehicles, closer to the urban mobile user, saving bandwidth and cost. VISIT is a platform which defines novel centrality metrics based on the social interest of urban users to identify and select the appropriate set of best candidate vehicles to perform urban data collection. SAVING is a social-aware data storage system which exploits complex networks to present game-theoretic solutions for finding and recruiting vehicles adequate to perform collaborative content caching in an urban environment. VISIT and SAVING are simulated for around 2986 vehicles with realistic urban mobility traces; comparison with other schemes in the literature suggests that both provide not only efficient but also scalable data collection and storage.
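As a toy illustration of the kind of selection step VISIT performs, one could greedily recruit the k vehicles whose street coverage adds the most interest-weighted value; the data structures and scoring below are hypothetical placeholders, not the centrality metrics defined in the thesis.

```python
def select_collector_vehicles(coverage, interest, k):
    """coverage: {vehicle_id: set of street segments the vehicle visits}
    interest: {street_segment: social-interest weight of local users}
    Greedily pick k vehicles by marginal interest-weighted coverage gain."""
    chosen, covered = [], set()
    for _ in range(k):
        best_vehicle, best_gain = None, 0.0
        for vid, segments in coverage.items():
            if vid in chosen:
                continue
            gain = sum(interest.get(s, 0.0) for s in segments - covered)
            if gain > best_gain:
                best_vehicle, best_gain = vid, gain
        if best_vehicle is None:
            break                       # nothing left adds useful coverage
        chosen.append(best_vehicle)
        covered |= coverage[best_vehicle]
    return chosen
```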
|
8 |
Analyzing YouTube Content Demand Patterns and Cacheability in a Swedish Municipal Network. Wang, Hantao, January 2013
User Generated Content (UGC) has enjoyed great popularity since the emergence of a wide range of web services that allow the distribution of user-produced media, from textual information and photo galleries to videos. The boom of the Internet of Things and the newly released HTML5 accelerate both the development of new multimedia formats and the technology for distributing them. YouTube, one of the most popular video sharing sites, has among the highest numbers of video views and video uploads per day in the world. With the rapid growth of multimedia formats and the huge bandwidth demand from subscribers, the sheer volume of traffic is going to severely strain network resources.

Analyzing media streaming traffic patterns and cacheability in live IP-access networks is therefore a pressing issue for network operators and content providers. One possible solution is to cache popular content with a high replay rate in a proxy server at the LAN border or in users' terminals.

Based on this solution, this thesis project develops a measurement framework that relates network cacheability to video category and video duration in a typical Swedish municipal network. Experiments on the selected parameters are performed to investigate potential patterns in user behavior. The analysis shows that Music traffic achieves a rather good network gain as well as a remarkable terminal gain, indicating that it is more efficient to store it close to the end user. Film&Animation traffic, however, is preferable to cache in the network due to its high net gain. It is also optimal to cache video clips with a length between 3 and 5 minutes, especially for Music and Film&Animation traffic. In addition, more than half of the replays occur between 16.00 and 24.00, with peak hours on average from 18.00 to 22.00. Lastly, only around 16% of the videos are globally popular, and very few heavy users tend to be viewers of locally popular videos, reflecting local limits and independent user interests.
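To make the cacheability figures concrete, the network gain discussed above can be approximated by replaying an access log through an idealized cache of unbounded size and counting repeated requests; this is a simplified sketch with assumed record fields, not the thesis's measurement framework.

```python
from collections import Counter

def ideal_cache_gain(requests):
    """requests: iterable of (video_id, size_bytes) tuples from an access log.
    Returns the fraction of traffic made up of replays, i.e. the upper bound
    on what an ideal proxy cache could keep off the upstream network."""
    seen = set()
    total_bytes = saved_bytes = 0
    for video_id, size_bytes in requests:
        total_bytes += size_bytes
        if video_id in seen:
            saved_bytes += size_bytes     # a replay: a cache could serve it
        else:
            seen.add(video_id)
    return saved_bytes / total_bytes if total_bytes else 0.0

def replays_per_category(requests, n=5):
    """requests: iterable of (video_id, size_bytes, category) tuples.
    Counts replays per category, e.g. Music versus Film&Animation."""
    replays, seen = Counter(), set()
    for video_id, _size, category in requests:
        if video_id in seen:
            replays[category] += 1
        seen.add(video_id)
    return replays.most_common(n)
```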
|
9 |
A Computational Fluid Dynamics Feature Extraction Method Using Subjective Logic. Mortensen, Clifton H., 08 July 2010
Computational fluid dynamics simulations are advancing to correctly simulate highly complex fluid flow problems that can require weeks of computation on expensive high-performance clusters. These simulations can generate terabytes of data and pose a severe challenge to a researcher analyzing the data. Presented in this document is a general method to extract computational fluid dynamics flow features concurrently with a simulation and as a post-processing step to drastically reduce researcher post-processing time. This general method uses software agents governed by subjective logic to make decisions about extracted features in converging and converged data sets. The software agents are designed to work inside the Concurrent Agent-enabled Feature Extraction concept and operate efficiently on massively parallel high-performance computing clusters. Also presented is a specific application of the general feature extraction method to vortex core lines. Each agent's belief tuple is quantified using a pre-defined set of information. The information and functions necessary to set each component in each agent's belief tuple are given along with an explanation of the methods for setting the components. A simulation of a blunt fin is run showing convergence of the horseshoe vortex core to its final spatial location at 60% of the converged solution. Agents correctly select between two vortex core extraction algorithms and correctly identify the expected probabilities of vortex cores as the solution converges. A simulation of a delta wing is run showing coherently extracted primary vortex cores as early as 16% of the converged solution. Agents select primary vortex cores extracted by the Sujudi-Haimes algorithm as the most probable primary cores. These simulations show concurrent feature extraction is possible and that intelligent agents following the general feature extraction method are able to make appropriate decisions about converging and converged features based on pre-defined information.
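For readers unfamiliar with subjective logic, an agent's belief tuple is an opinion (belief, disbelief, uncertainty, base rate) whose projected probability is belief + base_rate * uncertainty; the sketch below shows only these standard operators and is not the paper's agent design (the dogmatic-opinion branch of the fusion is simplified).

```python
from dataclasses import dataclass

@dataclass
class Opinion:
    """Subjective-logic opinion: belief + disbelief + uncertainty == 1."""
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def expected_probability(self) -> float:
        # Projected probability used to rank candidate features.
        return self.belief + self.base_rate * self.uncertainty

def cumulative_fusion(a: Opinion, b: Opinion) -> Opinion:
    """Cumulative fusion of two independent opinions about the same feature."""
    k = a.uncertainty + b.uncertainty - a.uncertainty * b.uncertainty
    if k == 0.0:
        # Both opinions are dogmatic (zero uncertainty); average them as a
        # simplification of the usual limiting-case definition.
        return Opinion((a.belief + b.belief) / 2.0,
                       (a.disbelief + b.disbelief) / 2.0, 0.0, a.base_rate)
    return Opinion(
        (a.belief * b.uncertainty + b.belief * a.uncertainty) / k,
        (a.disbelief * b.uncertainty + b.disbelief * a.uncertainty) / k,
        (a.uncertainty * b.uncertainty) / k,
        a.base_rate,
    )

# Two weak observations of a vortex core fuse into a stronger opinion:
# fused = cumulative_fusion(Opinion(0.6, 0.1, 0.3), Opinion(0.5, 0.2, 0.3))
```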
|
10 |
Visualisation de données volumiques massives : application aux données sismiques / Visualization of massive data volumes: applications to seismic data. Castanié, Laurent, 24 November 2006
Les données de sismique réflexion sont une source d'information essentielle pour la modélisation tridimensionnelle des structures du sous-sol dans l'exploration-production des hydrocarbures. Ce travail vise à fournir des outils de visualisation pour leur interprétation. Les défis à relever sont à la fois d'ordre qualitatif et quantitatif. Il s'agit en effet de considérer (1) la nature particulière des données et la démarche d'interprétation et (2) la taille des données. Notre travail s'est donc axé sur ces deux aspects : 1) Du point de vue qualitatif, nous mettons tout d'abord en évidence les principales caractéristiques des données sismiques, ce qui nous permet d'implanter une technique de visualisation volumique adaptée. Nous abordons ensuite l'aspect multimodal de l'interprétation qui consiste à combiner plusieurs sources d'information (sismique et structurale). Selon la nature de ces sources (strictement volumique ou volumique et surfacique), nous proposons deux systèmes de visualisation différents. 2) Du point de vue quantitatif, nous définissons tout d'abord les principales contraintes matérielles intervenant dans l'interprétation, ce qui nous permet d'implanter un système générique de gestion de la mémoire. Initialement destiné au couplage de la visualisation et des calculs sur des données volumiques massives, il est ensuite amélioré et spécialisé pour aboutir à un système dynamique de gestion distribuée de la mémoire sur cluster de PCs. Cette dernière version, dédiée à la visualisation, permet de manipuler des données sismiques à échelle régionale (100-200 Go) en temps réel. Les problématiques sont abordées à la fois dans le contexte scientifique de la visualisation et dans le contexte d'application des géosciences et de l'interprétation sismique. / Seismic reflection data are a valuable source of information for the three-dimensional modeling of subsurface structures in the exploration-production of hydrocarbons. This work focuses on the implementation of visualization techniques for their interpretation. We face both qualitative and quantitative challenges. It is indeed necessary to consider (1) the particular nature of seismic data and the interpretation process, and (2) the size of the data. Our work focuses on these two distinct aspects: 1) From the qualitative point of view, we first highlight the main characteristics of seismic data. Based on this analysis, we implement a volume visualization technique adapted to the specificity of the data. We then focus on the multimodal aspect of interpretation which consists in combining several sources of information (seismic and structural). Depending on the nature of these sources (strictly volumes or both volumes and surfaces), we propose two different visualization systems. 2) From the quantitative point of view, we first define the main hardware constraints involved in seismic interpretation. Focused on these constraints, we implement a generic memory management system. Initially able to couple visualization and data processing on massive data volumes, it is then improved and specialised to build a dynamic system for distributed memory management on PC clusters. This latter version, dedicated to visualization, allows regional-scale seismic data (100-200 GB) to be manipulated in real time. The main aspects of this work are studied both in the scientific context of visualization and in the application context of geosciences and seismic interpretation.
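The memory-management system itself is not reproduced here, but the basic out-of-core pattern it generalizes (a bricked volume, a fixed memory budget, least-recently-used eviction) can be sketched as follows; the brick loader and its arguments are hypothetical.

```python
from collections import OrderedDict

class BrickCache:
    """Keeps at most `budget` volume bricks in memory, evicting the
    least-recently-used brick when a new one must be loaded."""

    def __init__(self, load_brick, budget=256):
        self.load_brick = load_brick      # callable: brick key -> ndarray
        self.budget = budget
        self.bricks = OrderedDict()

    def get(self, key):
        if key in self.bricks:
            self.bricks.move_to_end(key)  # mark as most recently used
            return self.bricks[key]
        brick = self.load_brick(key)      # out-of-core read (disk or network)
        self.bricks[key] = brick
        if len(self.bricks) > self.budget:
            self.bricks.popitem(last=False)   # evict the LRU brick
        return brick

# Hypothetical usage: bricks of 64**3 voxels read on demand from a seismic
# cube exposed through numpy.memmap, with the renderer calling cache.get()
# for each visible brick every frame.
```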
|