1. Evolutionary Granular Kernel Machines. Jin, Bo, 03 May 2007 (has links)
Kernel machines such as Support Vector Machines (SVMs) have been widely used in various data mining applications, with good generalization properties. The performance of SVMs on nonlinear problems is strongly affected by the kernel function, and the complexity of SVM training is mainly determined by the size of the training dataset. How to design a powerful kernel, how to speed up SVM training, and how to train SVMs with millions of examples remain challenging problems in SVM research. To address these problems, powerful and flexible kernel trees called Evolutionary Granular Kernel Trees (EGKTs) are designed to incorporate prior domain knowledge. A Granular Kernel Tree Structure Evolving System (GKTSES) is developed to evolve the structures of Granular Kernel Trees (GKTs) without prior knowledge, and a voting scheme is proposed to reduce the prediction deviation of GKTSES. To speed up EGKT optimization, a master-slave parallel model is implemented. To help SVMs tackle large-scale data mining, a Minimum Enclosing Ball (MEB) based data reduction method is presented and a new MEB-SVM algorithm is designed. All these kernel methods are grounded in Granular Computing (GrC). Overall, Evolutionary Granular Kernel Machines (EGKMs) are investigated to optimize kernels effectively, speed up training greatly, and mine huge amounts of data efficiently.
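As an illustration of the granular-kernel idea (not the thesis's exact EGKT construction), the following sketch builds a tiny two-leaf "kernel tree" that sums RBF kernels defined over separate feature granules and plugs it into an off-the-shelf SVM; the granule slices and widths are made-up values. Since a sum of positive semi-definite kernels is itself a valid kernel, the combined function can be used wherever a standard kernel can.

```python
import numpy as np
from sklearn.svm import SVC

# Granules: disjoint feature subsets, each with its own RBF width (illustrative values).
GRANULES = [(slice(0, 2), 0.5), (slice(2, 4), 2.0)]

def granular_kernel(A, B):
    """Sum of per-granule RBF kernels: a two-leaf 'kernel tree' with a + node."""
    K = np.zeros((A.shape[0], B.shape[0]))
    for cols, gamma in GRANULES:
        d2 = ((A[:, cols][:, None, :] - B[:, cols][None, :, :]) ** 2).sum(-1)
        K += np.exp(-gamma * d2)
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # synthetic labels spanning both granules
clf = SVC(kernel=granular_kernel).fit(X, y)
print("train accuracy:", clf.score(X, y))
```

An evolutionary search over such trees would then mutate the granule boundaries, per-granule widths, and combination operators, using validation accuracy as fitness.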
2. Flexible and efficient computation in large data centres. Gog, Ionel Corneliu, January 2018 (has links)
Increasingly, online computer applications rely on large-scale data analyses to offer personalised and improved products. These large-scale analyses are performed on distributed data processing execution engines that run on thousands of networked machines housed within an individual data centre. These execution engines provide, to the programmer, the illusion of running data analysis workflows on a single machine, and offer programming interfaces that shield developers from the intricacies of implementing parallel, fault-tolerant computations. Many such execution engines exist, but they embed assumptions about the computations they execute, or only target certain types of computations. Understanding these assumptions involves substantial study and experimentation. Thus, developers find it difficult to determine which execution engine is best, and even once they have chosen, they become “locked in”, because engineering effort is required to port workflows. In this dissertation, I first argue that in order to execute data analysis computations efficiently, and to flexibly choose the best engines, the way we specify data analysis computations should be decoupled from the execution engines that run the computations. I propose an architecture for decoupling data processing, together with Musketeer, my proof-of-concept implementation of this architecture. In Musketeer, developers express data analysis computations using their preferred programming interface. These are translated into a common intermediate representation from which code is generated and executed on the most appropriate execution engine. I show that Musketeer can be used to write data analysis computations directly, and these can execute on many execution engines because Musketeer automatically generates code that is competitive with optimised hand-written implementations. The diverse execution engines cause different workflow types to coexist within a data centre, opening up both opportunities for sharing and potential pitfalls for co-location interference. However, in practice, workflows are either placed by high-quality schedulers that avoid co-location interference but choose placements slowly, or by schedulers that choose placements quickly but suffer unpredictable workflow run times due to co-location interference. In this dissertation, I show that schedulers can choose high-quality placements with low latency. I develop several techniques to improve Firmament, a high-quality min-cost flow-based scheduler, to choose placements quickly in large data centres. Finally, I demonstrate that Firmament chooses placements at least as good as other sophisticated schedulers, but at the speeds associated with simple schedulers. These contributions enable more efficient and effective use of data centres for large-scale computation than current solutions.
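The decoupling argument can be made concrete with a toy example: a single front-end plan expressed in a small intermediate representation, with one code generator per target engine. This is only a sketch of the idea; Musketeer's actual IR, operator set, and back ends are far richer, and all names below are hypothetical.

```python
# A toy version of the decoupling idea: one front-end description of a
# dataflow, with pluggable back-end "code generators".
from dataclasses import dataclass

@dataclass
class Op:
    kind: str          # e.g. "scan", "filter"
    args: tuple
    inputs: tuple = ()

def to_sql(op):
    """Generate SQL text for a plan, for a relational engine."""
    if op.kind == "filter":
        return f"SELECT * FROM ({to_sql(op.inputs[0])}) t WHERE {op.args[0]}"
    if op.kind == "scan":
        return f"SELECT * FROM {op.args[0]}"
    raise NotImplementedError(op.kind)

def to_pyspark(op):
    """Generate equivalent PySpark code for the same plan."""
    if op.kind == "filter":
        return f"{to_pyspark(op.inputs[0])}.filter('{op.args[0]}')"
    if op.kind == "scan":
        return f"spark.read.table('{op.args[0]}')"
    raise NotImplementedError(op.kind)

plan = Op("filter", ("age > 30",), (Op("scan", ("users",)),))
print(to_sql(plan))      # the plan targeting a SQL engine
print(to_pyspark(plan))  # the same plan targeting Spark
```

The point is that the plan is written once; choosing (or switching) the execution engine is a code-generation decision rather than a porting effort.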
3. Computational Models of Nuclear Proliferation. Frankenstein, William, 01 May 2016 (has links)
This thesis utilizes social influence theory and computational tools to examine the disparate impact of positive and negative ties in nuclear weapons proliferation. The thesis is broadly divided into two sections: a simulation section, which focuses on government stakeholders, and a large-scale data analysis section, which focuses on the public and domestic actor stakeholders. The simulation section demonstrates that the nonproliferation norm is an emergent behavior of political alliance and hostility networks, and that alliances play a role in present-day nuclear proliferation. This model is robust and captures second-order effects of extended hostility and alliance relations. The large-scale data analysis section demonstrates the role that context plays in sentiment evaluation and highlights how Twitter collection can provide useful input to policy processes. It first presents the results of an on-campus study showing that context plays a role in sentiment assessment. Then, in an analysis of a Twitter dataset of over 7.5 million messages, it assesses the role of ‘noise’ and biases in online data collection. In a deep dive into the Iranian nuclear agreement, it demonstrates that the Middle East is not facing a nuclear arms race, and shows that there is a structural hole in online discussion surrounding nuclear proliferation. Combining both approaches gives policy analysts a complete and generalizable set of computational tools to assess and analyze disparate stakeholder roles in nuclear proliferation.
4. Big Data: le nouvel enjeu de l'apprentissage à partir des données massives / Big Data: the new challenge of learning from massive data. Adjout Rehab, Moufida, 01 April 2016 (has links)
The intersection of globalization and the continuous development of information technologies has led to an explosion in the volume of available data. Capacities for producing, storing, and processing data have crossed such a threshold that a new term has been coined: Big Data. This growth in the quantities of data to be considered requires new processing tools. Classical learning tools are poorly suited to this change in scale, both in computational complexity and in processing time; processing is most often centralized and sequential, which makes learning methods dependent on the capacity of the single machine used. The difficulties in analyzing a large dataset are therefore manifold: in most existing algorithms and data structures, there is a trade-off between efficiency, complexity, and scalability, which motivates the need for scalable solutions.

This thesis addresses the problems encountered by supervised learning on large volumes of data. To face these new challenges, new processes and methods must be developed to make the best use of all the available data. The objective of this thesis is to design scalable versions of classical supervised learning methods, distributing both processing and data in order to increase the capacity of these approaches without harming their accuracy.

Our contribution consists of two parts, each proposing a new learning approach for massive data. Both fall within the domain of supervised predictive learning from voluminous data, covering Multiple Linear Regression and ensemble methods such as Bagging.

The first contribution, named MLR-MR, scales Multiple Linear Regression by distributing the computation over a cluster of machines using a divide-and-conquer approach. The goal is to optimize the processing and the computational load it induces without changing the underlying calculation (QR factorization), so that the method produces the same coefficients as the classical approach.

The second contribution, called Bagging MR_PR_D (Bagging-based MapReduce with Distributed PRuning), implements a scalable version of Bagging in which both phases of the treatment, model learning and model pruning, are distributed. The aim is an algorithm that is efficient and scalable across all processing phases, thereby supporting a wide spectrum of applications.

Both approaches were evaluated on a variety of regression datasets ranging from a few thousand to several million observations. Our experimental results demonstrate the efficiency and speed of our cloud-based distributed approaches, which are competitive in predictive performance while significantly reducing training and prediction time.
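To make the divide-and-conquer QR idea concrete, here is a minimal single-process sketch of the "tall-and-skinny QR" pattern that MLR-MR-style systems rely on, written in plain NumPy rather than the thesis's actual distributed implementation; the map and reduce phases are simulated with a loop over blocks.

```python
import numpy as np

def tsqr_solve(blocks, y_blocks):
    """Least squares via 'tall-and-skinny' QR over data partitions.

    Each worker QR-factorizes its block (map); the small R factors and
    rotated responses are concatenated and solved centrally (reduce),
    giving the same coefficients as a single-machine fit.
    """
    Rs, zs = [], []
    for Xb, yb in zip(blocks, y_blocks):        # map phase (one task per block)
        Q, R = np.linalg.qr(Xb)
        Rs.append(R)
        zs.append(Q.T @ yb)
    R_all = np.vstack(Rs)                       # reduce phase: small matrices only
    z_all = np.concatenate(zs)
    beta, *_ = np.linalg.lstsq(R_all, z_all, rcond=None)
    return beta

rng = np.random.default_rng(1)
X, beta_true = rng.normal(size=(10_000, 5)), np.arange(1.0, 6.0)
y = X @ beta_true + 0.01 * rng.normal(size=10_000)
blocks, y_blocks = np.array_split(X, 4), np.array_split(y, 4)
print(tsqr_solve(blocks, y_blocks))             # ~ [1, 2, 3, 4, 5]
```

Only the small per-block R factors (and rotated responses) cross the network, which is why the approach scales: the communication cost depends on the number of features, not the number of observations.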
5. Prediction of DNA-Binding Proteins and their Binding Sites. Pokhrel, Pujan, 01 May 2018 (has links)
DNA-binding proteins play an important role in various essential biological processes such as DNA replication, recombination, repair, gene transcription, and expression. Identifying DNA-binding proteins and the residues involved in the contacts is important for understanding the DNA-binding mechanism in proteins. Moreover, it has been reported in the literature that mutations of some DNA-binding residues on proteins are associated with certain diseases. Identifying these proteins and their binding mechanisms generally requires experimental techniques, which makes large-scale studies extremely difficult. Thus, the prediction of DNA-binding proteins and their binding sites from sequences alone is one of the most challenging problems in the field of genome annotation. Since the start of the Human Genome Project, many attempts have been made to solve the problem with different approaches, but the accuracy of these methods is still not suitable for large-scale annotation of proteins. Rather than relying solely on existing machine learning techniques, I sought to combine them using a novel stacking technique, together with problem-specific architectures, to solve the problem with better accuracy than existing methods. This thesis presents a solution to the DNA-binding protein prediction problem that performs better than the state-of-the-art approaches.
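For readers unfamiliar with stacking, the sketch below shows the general pattern using scikit-learn: the base learners' out-of-fold predictions become inputs to a meta-learner. The base models, features, and data here are placeholders, not the thesis's problem-specific architecture.

```python
# Minimal stacking example: base learners feed a logistic-regression
# meta-learner via out-of-fold predictions (synthetic data for illustration).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),  # meta-learner over base predictions
    cv=5,                                  # out-of-fold predictions avoid leakage
)
print(cross_val_score(stack, X, y, cv=3).mean())
```

The key design point is the cross-validated generation of base-model predictions: training the meta-learner on in-sample predictions would leak information and inflate accuracy.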
6. Machine-learning based automated segmentation tool development for large-scale multicenter MRI data analysis. Kim, Eun Young, 01 December 2013 (has links)
Background: Volumetric analysis of brain structures from structural Magnetic Resonance (MR) images advances the understanding of the brain by providing a means to study brain morphometric changes quantitatively across aging, development, and disease status. Due to the recent increased emphasis on large-scale multicenter brain MR study designs, the demand for automated brain MRI processing tools has increased as well. This dissertation describes an automatic segmentation framework for subcortical structures in brain MRI that is robust for a wide variety of MR data.
Method: The proposed segmentation framework, BRAINSCut, is an integration of robust data standardization techniques and machine-learning approaches. First, a robust multi-modal pre-processing tool for automated registration, bias correction, and tissue classification was implemented for large-scale heterogeneous multi-site longitudinal MR data analysis. The segmentation framework was then constructed to achieve robustness for large-scale data via the following comparative experiments: 1) find the best machine-learning algorithm among several available approaches in the field; 2) find an efficient intensity normalization technique for the proposed region-specific localized normalization with a choice of robust statistics; 3) find high-quality features that best characterize the MR brain subcortical structures. The tool is built upon 32 handpicked multi-modal multicenter MR images with manual traces of six subcortical structures (nucleus accumbens, caudate nucleus, globus pallidus, putamen, thalamus, and hippocampus) from three experts.
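The following sketch illustrates, under heavily simplified assumptions, two of the ingredients named above: region-localized intensity normalization with robust statistics (median and IQR instead of mean and standard deviation) and a random-forest voxel classifier. The data, feature set, and region definition are synthetic stand-ins for BRAINSCut's actual multi-modal pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def robust_normalize(intensities, region_mask):
    """Normalize voxel intensities using robust statistics of one region."""
    vals = intensities[region_mask]
    med = np.median(vals)
    iqr = np.subtract(*np.percentile(vals, [75, 25]))  # 75th minus 25th percentile
    return (intensities - med) / max(iqr, 1e-6)

rng = np.random.default_rng(0)
t1 = rng.normal(100, 20, size=5000)             # fake T1 intensities for one ROI
mask = rng.random(5000) < 0.3                   # fake candidate region
X = robust_normalize(t1, mask).reshape(-1, 1)   # one feature per voxel
y = (rng.random(5000) < 0.2).astype(int)        # fake manual-trace labels
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.score(X, y))
```

Median/IQR normalization is far less sensitive to pathology-driven intensity outliers than mean/standard-deviation scaling, which is one reason robust statistics help across disease statuses.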
A fundamental task associated with brain MR image segmentation for research and clinical trials is the validation of segmentation accuracy. This dissertation evaluated the proposed segmentation framework in terms of validity and reliability. Three groups of data were employed for the various evaluation aspects: 1) traveling human phantom data for multicenter reliability, 2) a set of repeated scans for measurement stability across various disease statuses, and 3) large-scale data from a Huntington's disease (HD) study for software robustness as well as segmentation accuracy.
Result: Segmentation accuracy for the six subcortical structures was improved by 1) the bias-corrected inputs, 2) the two region-specific intensity normalization strategies, and 3) the random forest machine-learning algorithm with the selected feature-enhanced images. The analysis of traveling human phantom data showed no center-specific bias in volume measurements from BRAINSCut. The repeated-measure reliability of most structures also displayed no specific association with disease progression, except for the caudate nucleus in the group at high risk for HD. The constructed segmentation framework was successfully applied to multicenter MR data from the PREDICT-HD [133] study (< 10% failure rate over 3,000 scan sessions processed).
Conclusion: The random-forest-based segmentation method is effective and robust to large-scale multicenter data variation, especially with a proper choice of intensity normalization techniques. The benefits of proper normalization approaches are more apparent than those of the custom set of feature-enhanced images for the accuracy and robustness of the segmentation tool. BRAINSCut effectively produced subcortical volumetric measurements that are robust to center and disease status, with validity confirmed by human experts and a low failure rate on large-scale multicenter MR data. Sample size estimates, which are crucial for designing efficient clinical and research trials, are provided for the six subcortical structures based on our experiments.
7. Large Scale ETL Design, Optimization and Implementation Based On Spark and AWS Platform. Zhu, Di, January 2017 (has links)
Nowadays, the amount of data generated by users of Internet products is increasing exponentially: clickstreams from websites with millions of users, geospatial information from GIS-based Android and iPhone apps, sensor data from cars and other electronic equipment, and so on. These sources can yield billions of events every day, from which valuable insights can be extracted, for instance for monitoring systems, fraud detection, user behavior analysis, and feature verification. Nevertheless, technical issues emerge accordingly: the heterogeneity and sheer volume of the data, together with the miscellaneous requirements for using it across different dimensions, make the design of data pipelines, transformations, and persistence in a data warehouse much harder. There are traditional ways to build ETLs, from mainframes [1] and RDBMSs to MapReduce and Hive, yet with the emergence and popularization of the Spark framework and AWS, this procedure can evolve into a more robust, efficient, less costly, and easier-to-implement architecture for collecting data, building dimensional models, and running analytics on massive data. Drawing on data from a car transportation company in the Middle East, where billions of user behavior events arrive every day, this thesis contributes an exploratory approach to building and optimizing ETL pipelines based on AWS and Spark, and compares it with current mainstream data pipelines from several aspects.
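As a flavor of the kind of pipeline discussed, here is a minimal PySpark ETL sketch: extract raw events from object storage, apply a few conforming transformations, and load a partitioned table into the warehouse. Bucket names, columns, and paths are hypothetical.

```python
# Minimal extract-transform-load sketch on Spark; all S3 paths and column
# names below are invented for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-example").getOrCreate()

events = spark.read.json("s3://example-bucket/raw/user_events/")   # extract
cleaned = (events
           .filter(F.col("user_id").isNotNull())                   # transform
           .withColumn("event_date", F.to_date("timestamp"))
           .dropDuplicates(["event_id"]))
(cleaned.write                                                     # load
        .mode("overwrite")
        .partitionBy("event_date")
        .parquet("s3://example-bucket/warehouse/fact_user_events/"))
```

Partitioning the fact table by event date is a common dimensional-modeling choice: downstream queries that filter on date then read only the relevant partitions.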
8. Dataflow parallelism for large scale data mining. Daruru, Srivatsava, 20 December 2010 (has links)
The unprecedented and exponential growth of data along with the advent
of multi-core processors has triggered a massive paradigm shift from traditional
single-threaded programming to parallel programming. A number of
parallel programming paradigms have thus been proposed and have become
pervasive and inseparable from any large production environment. Also, with
the massive amounts of data available and the ever-increasing business
need to process and analyze this data quickly and at minimum cost, there is
growing demand for fast data mining algorithms that run on cheap
hardware.
This thesis explores a parallel programming model called dataflow, the essence of which is computation organized by the flow of data through
a graph of operators. This paradigm exhibits pipeline, horizontal and vertical
parallelism and requires only the data of the active operators to be in memory at
any given time, allowing it to scale easily to very large datasets. The thesis describes the dataflow implementation of two data mining applications on
huge datasets. We first develop an efficient dataflow implementation of a
Collaborative Filtering (CF) algorithm based on weighted co-clustering and
test its effectiveness on the large, sparse Netflix dataset. This implementation
of the recommender system was able to rapidly train and predict over 100
million ratings within 17 minutes on a commodity multi-core machine. We
then describe a dataflow implementation of a non-parametric, density-based
clustering algorithm called Auto-HDS to automatically detect small and
dense clusters on a massive astronomy dataset. This implementation was able
to discover dense clusters at varying density thresholds and generate a compact
cluster hierarchy on 100k points in less than 1.3 hours. We also show its ability
to scale to millions of points as we increase the number of available resources.
Our experimental results illustrate the ability of this model to “scale”
well to massive datasets and its ability to rapidly discover useful patterns in
two different applications.
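A toy rendering of the dataflow model described above, using Python generators as streaming operators: records flow through a small operator graph, and only the data held by active operators is resident at any time. This single-threaded sketch shows the shape of the paradigm only; the engine used in the thesis adds the pipeline, horizontal, and vertical parallelism discussed above.

```python
def read_ratings(path_like):                 # source operator (fake inline data)
    for user, item, rating in [(1, 10, 4.0), (1, 11, 2.0), (2, 10, 5.0)]:
        yield user, item, rating

def filter_high(stream, threshold=3.0):      # streaming filter operator
    for user, item, rating in stream:
        if rating >= threshold:
            yield user, item, rating

def count_per_user(stream):                  # aggregating sink operator
    counts = {}
    for user, _, _ in stream:
        counts[user] = counts.get(user, 0) + 1
    return counts

# Wire the operators into a graph; records stream through one at a time.
graph = count_per_user(filter_high(read_ratings("ratings.dat")))
print(graph)                                  # {1: 1, 2: 1}
```

In a real dataflow engine, each operator would run as its own task, so the source, filter, and aggregator process different records simultaneously (pipeline parallelism), and each operator can be replicated across partitions of the data (horizontal parallelism).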
9. Interactive High-Quality Visualization of Large-Scale Particle Data. Ibrahim, Mohamed, 20 November 2019 (has links)
Large-scale particle data sets, such as those computed in molecular dynamics (MD) simulations, are crucial to investigating important processes in physics and thermodynamics. The simulated atoms are usually visualized as hard spheres with Phong shading, where individual particles can be perceived well in close-up views. However, for large-scale simulations with millions of particles, the visualization of large fields-of-view usually suffers from strong aliasing artifacts, because the mismatch between data size and output resolution leads to severe under-sampling of the geometry. In this dissertation, we present novel visualization methods for large-scale particle data that address aliasing while enabling interactive high-quality rendering by sampling only the visible particles of a data set from a given view. The first contribution of this thesis is the novel concept of screen-space normal distribution functions (S-NDFs) for particle data. S-NDFs represent the distribution of surface normals that map to a given pixel in screen space, which enables high-quality re-lighting without re-rendering particles. In order to facilitate interactive zooming, we cache S-NDFs in a screen-space mipmap (S-MIP). Together, these two concepts enable interactive, scale-consistent re-lighting and shading changes, as well as zooming, without having to re-sample the particle data. Our second contribution is a novel architecture for probabilistic culling of large particle data. We decouple the super-sampling for rendering from the determination of sub-pixel particle visibility, and perform culling probabilistically in multiple stages, while incrementally tracking confidence in the visibility data gathered so far to avoid wrong visibility decisions with high probability. Our architecture determines particle visibility with high accuracy, while only sampling a small part of the whole data set. The particles that are not occluded are then super-sampled for high rendering quality, at a fraction of the cost of sampling the entire data set.
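A simplified take on the S-NDF idea, under an assumed bin count and a plain Lambertian shading model: each pixel keeps a histogram over quantized normal directions, and re-lighting reshades the histogram bins without touching the particle data. The dissertation's actual representation and filtering are more sophisticated.

```python
import numpy as np

BINS = 32
bin_normals = np.random.default_rng(0).normal(size=(BINS, 3))
bin_normals /= np.linalg.norm(bin_normals, axis=1, keepdims=True)

def splat(sndf, px, py, normal):
    """Accumulate one sub-pixel sample's normal into that pixel's histogram."""
    b = np.argmax(bin_normals @ normal)      # nearest quantized direction
    sndf[py, px, b] += 1

def relight(sndf, light_dir):
    """Lambertian re-shading from the histograms alone (no re-rendering)."""
    shade = np.clip(bin_normals @ light_dir, 0, None)     # per-bin N . L
    weights = sndf / np.maximum(sndf.sum(-1, keepdims=True), 1)
    return weights @ shade                                # per-pixel radiance

sndf = np.zeros((4, 4, BINS))
splat(sndf, 1, 2, np.array([0.0, 0.0, 1.0]))
print(relight(sndf, np.array([0.0, 0.0, 1.0])))  # brightest at the splatted pixel
```

Because the light direction enters only in `relight`, moving the light re-shades the cached histograms in screen space instead of re-rasterizing millions of spheres.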
10. BlobSeer: Towards efficient data storage management for large-scale, distributed systems. Nicolae, Bogdan, 30 November 2010 (has links) (PDF)
With data volumes increasing at a high rate and the emergence of highly scalable infrastructures (cloud computing, petascale computing), distributed management of data becomes a crucial issue that faces many challenges. This thesis brings several contributions to address such challenges. First, it proposes a set of principles for designing highly scalable distributed storage systems that are optimized for heavy data-access concurrency; in particular, it highlights the potentially large benefits of using versioning in this context. Second, based on these principles, it introduces a series of distributed data and metadata management algorithms that enable high throughput under concurrency. Third, it shows how to efficiently implement these algorithms in practice, dealing with key issues such as high-performance parallel transfers, efficient maintenance of distributed data structures, and fault tolerance. These results are used to build BlobSeer, an experimental prototype that demonstrates both the theoretical benefits of the approach in synthetic benchmarks and its practical benefits in real-life application scenarios: as a storage backend for MapReduce applications, as a storage backend for deployment and snapshotting of virtual machine images in clouds, and as a quality-of-service-enabled data storage service for cloud applications. Extensive experiments on the Grid'5000 testbed show that BlobSeer remains scalable and sustains high throughput even under heavy access concurrency, outperforming several state-of-the-art approaches by a large margin.
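The versioning principle highlighted above can be illustrated with an in-memory toy: writers publish immutable chunk snapshots instead of mutating shared state, so concurrent readers of older versions never block. This sketch omits BlobSeer's distribution, metadata trees, and fault tolerance entirely.

```python
class VersionedBlob:
    """Copy-on-write chunked blob: every write creates a new readable version."""
    CHUNK = 4

    def __init__(self):
        self.versions = [{}]                 # version number -> {chunk index: bytes}

    def write(self, offset, data):
        """Derive a new version from the latest one; old versions stay readable."""
        snap = dict(self.versions[-1])       # shallow copy of the chunk map
        for i, b in enumerate(data):
            idx, off = divmod(offset + i, self.CHUNK)
            chunk = bytearray(snap.get(idx, bytes(self.CHUNK)))
            chunk[off] = b
            snap[idx] = bytes(chunk)         # replace chunk in the new map only
        self.versions.append(snap)
        return len(self.versions) - 1        # new version number

    def read(self, version, offset, size):
        snap = self.versions[version]
        out = bytearray()
        for i in range(offset, offset + size):
            idx, off = divmod(i, self.CHUNK)
            out.append(snap.get(idx, bytes(self.CHUNK))[off])
        return bytes(out)

blob = VersionedBlob()
v1 = blob.write(0, b"hello world!")
v2 = blob.write(6, b"there")
print(blob.read(v1, 0, 12))                  # b'hello world!'
print(blob.read(v2, 0, 12))                  # b'hello there!'
```

Because published versions are immutable, readers need no locks, and unchanged chunks are shared between versions rather than copied, which is what makes versioning cheap under heavy write concurrency.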