1

Finding delta difference in large data sets

Arvidsson, Johan January 2019 (has links)
Finding what differs between two versions of a file can be done with several different techniques and programs. These techniques and programs are often focused on finding differences in text files, in documents, or in class files for programming. An example is the popular git tool, which focuses on displaying the differences between versions of files in a project. A common way to find these differences is to use the longest common subsequence (LCS) algorithm, which finds the longest subsequence shared by the two files as a measure of their similarity. By excluding all similarities from a file, the remaining text is the difference between the files. The LCS algorithm is often able to find these differences in acceptable time. When two lines in a file are compared to see whether they differ, hashing is used: the hash values of corresponding lines in both files are compared. Hashing a line maps its content to a value that is, in practice, unique to that content, so if as little as one character on a line differs between the versions, the hash values of those lines will differ as well. These techniques are very useful when comparing two versions of a file with text content. With data from a database, some, but not all, of these techniques remain useful; a key difference between data in a database and text in a file is that content is not just added and deleted but also updated. This thesis studies how to make use of these techniques to find differences between large data sets, rather than between documents and files, and to do so in reasonable time. Three different methods are studied in theory, and their time and space complexities are given. One of the three methods is then selected for further study through implementation and testing; only one is implemented because of time constraints. The chosen method offers easy maintainability, a straightforward implementation, and good execution time.
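As an illustration of the techniques summarized above, the following Python sketch hashes each line and backtracks through a longest-common-subsequence table to report removed and added lines. It is a minimal sketch of the general approach, not the implementation evaluated in the thesis; the function names and the example records are hypothetical.

```python
import hashlib

def line_hashes(lines):
    # Hash each line so comparisons work on short digests instead of full text.
    return [hashlib.sha1(line.encode("utf-8")).hexdigest() for line in lines]

def lcs_table(a, b):
    # Classic dynamic-programming table for the longest common subsequence.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp

def diff(old_lines, new_lines):
    # Lines outside the LCS form the delta: deletions from old, additions in new.
    a, b = line_hashes(old_lines), line_hashes(new_lines)
    dp = lcs_table(a, b)
    removed, added = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            i, j = i - 1, j - 1                      # common line, keep walking
        elif j > 0 and (i == 0 or dp[i][j - 1] >= dp[i - 1][j]):
            added.append(new_lines[j - 1]); j -= 1   # line only in the new version
        else:
            removed.append(old_lines[i - 1]); i -= 1 # line only in the old version
    return list(reversed(removed)), list(reversed(added))

# Hypothetical example: two versions of a small "table dump".
old = ["id=1,name=Alice", "id=2,name=Bob"]
new = ["id=1,name=Alice", "id=2,name=Bobby", "id=3,name=Carol"]
print(diff(old, new))  # (['id=2,name=Bob'], ['id=2,name=Bobby', 'id=3,name=Carol'])
```

The dynamic-programming table makes this quadratic in the number of lines, which is one reason execution time becomes the central concern once the inputs are large data sets rather than documents.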
2

Pixel Oriented Visualization in XmdvTool

Patro, Anilkumar G 07 September 2004 (has links)
"Many approaches to the visualization of multivariate data have been proposed to date. Pixel oriented techniques map each attribute value of the data to a single colored pixel, theoretically yielding the display of the maximum possible information at a time. A large number of pixel layout methods have been proposed, each of which enables users to perform their visual exploration tasks to varying degrees. Pixel oriented techniques typically maintain the global view of large amounts of data while still preserving the perception of small regions of interest, which makes them particularly interesting for visualizing very large multidimensional data sets. Pixel based methods also provide feedback on the given query by presenting not only the data items fulfilling the query but also the data that approximately fulfill the query. The goal of this thesis was to extend XmdvTool, a public domain multivariate data visualization package, to incorporate pixel based techniques and to explore their strengths and weaknesses. The main challenge here was to seamlessly apply the interaction and distortion techniques used in other visualization methods within XmdvTool to pixel based methods and investigate the capabilities made possible by fusing the various multivariate visualization techniques."
3

Bayesian Inference in Large Data Problems

Quiroz, Matias January 2015 (has links)
In the last decade or so, there has been a dramatic increase in storage capacity and in the ability to process huge amounts of data. This has made large high-quality data sets widely accessible to practitioners. This technological innovation seriously challenges traditional modeling and inference methodology. This thesis is devoted to developing inference and modeling tools for handling large data sets. The four included papers treat various important aspects of this topic, with special emphasis on Bayesian inference by scalable Markov chain Monte Carlo (MCMC) methods. In the first paper, we propose a novel mixture-of-experts model for longitudinal data. The model and inference methodology allow for manageable computations with a large number of subjects, and the model dramatically improves the out-of-sample predictive density forecasts compared to existing models. The second paper aims at developing a scalable MCMC algorithm: ideas from the survey sampling literature are used to estimate the likelihood on a random subset of the data, the likelihood estimate is used within the pseudo-marginal MCMC framework, and we develop a theoretical framework for such subset-based algorithms. The third paper further develops these ideas by introducing the difference estimator into this framework and modifying the methods for estimating the likelihood on a random subset of the data, which yields scalable inference for a wider class of models. Finally, the fourth paper brings the survey sampling tools for estimating the likelihood developed in the thesis into the delayed acceptance MCMC framework. We compare with an existing approach in the literature and document promising results for our algorithm. / At the time of the doctoral defense, the following papers were unpublished and had the following status: Paper 1: submitted. Paper 2: submitted. Paper 3: manuscript. Paper 4: manuscript.
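The core idea behind the scalable MCMC papers, replacing the full-data likelihood with an estimate computed on a random subset, can be sketched as follows. This is a minimal simple-random-sampling estimator, not the refined estimators (such as the difference estimator) developed in the thesis; the Gaussian example model and all function names are assumptions for illustration.

```python
import numpy as np

def subsample_loglik(theta, y, m, rng, loglik_i):
    """Unbiased estimate of the full-data log-likelihood sum from a random subset.

    Draws m observations with replacement and scales the subset mean back up
    to the full data set; such an estimate can be plugged into a pseudo-marginal
    or delayed-acceptance MCMC scheme in place of the exact log-likelihood.
    """
    n = y.shape[0]
    idx = rng.choice(n, size=m, replace=True)
    return n * np.mean(loglik_i(theta, y[idx]))

# Hypothetical Gaussian model: per-observation log-likelihood contributions.
def gaussian_loglik_i(theta, y):
    mu, sigma = theta
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)

rng = np.random.default_rng(0)
y = rng.normal(1.0, 2.0, size=1_000_000)
# Estimate the full-data log-likelihood from only 5,000 of the 1,000,000 points.
print(subsample_loglik((1.0, 2.0), y, m=5_000, rng=rng, loglik_i=gaussian_loglik_i))
```

The variance of such an estimator is what the survey-sampling tools in the thesis are designed to control, since a noisy likelihood estimate degrades the mixing of the resulting MCMC chain.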
4

Financial crisis forecasts and applications to systematic trading strategies

Kornprobst, Antoine 23 October 2017 (has links)
This thesis is constituted of three research papers and is organized around the construction of financial crisis indicators, whose signals are then applied to devise successful systematic trading strategies. The first paper establishes a framework for the construction of our financial crisis indicators. Their predictive power is then demonstrated by using one of them to build an active protective-put strategy, which is able to beat, in terms of performance, a passive strategy as well as, most of the time, multiple paths of a random strategy. The second paper goes further in applying our financial crisis indicators to the construction of systematic trading strategies, using the aggregated signal produced by many of our indicators to govern a portfolio constituted of a mix of cash and shares of an ETF replicating an equity index such as the SP500. Finally, in the third paper, we build financial crisis indicators using a completely different approach. By studying the dynamics of the distribution of the spreads of the components of a CDS index such as the ITRAXX Europe 125, a Bollinger band is built around the empirical cumulative distribution function of the spreads, fitted on a basis of two lognormal distributions chosen beforehand. The crossing of either the upper or the lower boundary of this Bollinger band by the empirical cumulative distribution function is then interpreted in terms of risk and enables us to construct a trading signal.
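The band-crossing mechanics behind the trading signal of the third paper can be illustrated with a generic sketch. The thesis builds its band around the empirical cumulative distribution function of CDS spreads expressed on a two-lognormal basis; the sketch below only shows the generic logic of signalling when a one-dimensional series crosses a rolling Bollinger band, and the function name and parameters are hypothetical.

```python
import numpy as np

def bollinger_signal(series, window=20, k=2.0):
    """Generic Bollinger-band crossing signal on a one-dimensional series.

    Returns +1 at times where the series crosses above the rolling upper band,
    -1 where it crosses below the lower band, and 0 otherwise. How these
    crossings are interpreted in terms of risk is model-specific.
    """
    x = np.asarray(series, dtype=float)
    signals = np.zeros(x.size, dtype=int)
    for t in range(window, x.size):
        hist = x[t - window:t]
        mean, std = hist.mean(), hist.std()
        if x[t] > mean + k * std:
            signals[t] = 1
        elif x[t] < mean - k * std:
            signals[t] = -1
    return signals

# Example on a synthetic spread-like series with a jump near the end.
rng = np.random.default_rng(0)
spreads = np.concatenate([rng.normal(100, 2, 200), rng.normal(115, 2, 20)])
print(bollinger_signal(spreads)[-20:])
```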
5

A multiresolutional approach for large data visualization

Wang, Chaoli 30 November 2006 (has links)
No description available.
6

Cluster analysis of large data sets: new procedures based on the k-means method

Žambochová, Marta January 2005 (has links)
Cluster analysis has become one of the main tools for extracting knowledge from data, a process known as data mining. In this area of data analysis, data sets of large dimensions are often processed, both in the number of objects and in the number of variables that characterize them. Many methods for data clustering have been developed. One of the most widely used is the k-means method, which is suitable for clustering data sets containing a large number of objects. It is based on finding the best clustering with respect to an initial assignment of objects to clusters and the subsequent step-by-step reassignment of objects to clusters according to an optimization function. The aim of this Ph.D. thesis was to compare selected variants of existing k-means methods, to characterize their positive and negative features in detail, to propose new alternatives to this method, and to compare them experimentally with existing approaches. These objectives were met. In this work I focused on modifications of the k-means method for clustering large numbers of objects, specifically on the BIRCH k-means, filtering, k-means++, and two-phase algorithms. I examined the time complexity of the algorithms, the effect of the initialization and of outliers, and the validity of the resulting clusters. Two real data files and several generated data sets were used. The common and distinguishing features of the investigated methods are summarized at the end of the work. The main aim and contribution of the work is the design of my own modifications, which address the bottlenecks of the basic procedure and of the existing variants, together with their implementation and verification. Some modifications accelerated the processing, and applying the main ideas of the k-means++ algorithm to other variants of the k-means method improved the clustering results. The most significant of the proposed changes is a modification of the filtering algorithm, which adds an entirely new capability: the detection of outliers. An accompanying CD is enclosed; it contains the source code of the programs, written in the MATLAB development environment. The programs were created specifically for the purpose of this work and are intended for experimental use. The CD also contains the data files used for the various experiments.
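One of the ideas the thesis carries over to other variants, k-means++ seeding, can be sketched in a few lines. This is a generic illustration of the seeding step only, not the thesis's modified algorithms; the function name and the synthetic example data are assumptions.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: spread out the initial centers with D^2-weighted sampling."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]            # first center chosen uniformly at random
    for _ in range(k - 1):
        # Squared distance of every point to its nearest already-chosen center.
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)

# Example: seed 3 clusters on synthetic 2-D data, then run ordinary k-means from them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ((0, 0), (3, 3), (0, 3))])
print(kmeans_pp_init(X, 3, rng))
```

Starting the iterations from well-spread centers is what gives the k-means++ variant its better clustering results compared to purely random initialization.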
7

A comparison between database management systems with performance testing and large data sets

Brander, Thomas, Dakermandji, Christian January 2016 (has links)
The company Nordicstation handles large amounts of data for Swedbank, where the data is stored using the relational database Microsoft SQL Server 2012 (SQL Server). The existence of other databases designed for handling large amounts of data makes it unclear whether SQL Server is the best solution for this situation. This degree project presents a comparison between databases using performance testing, with regard to the execution time of database queries on large data sets. The chosen databases were SQL Server, Cassandra, and NuoDB. Cassandra is a column-oriented database designed for handling large amounts of data; NuoDB is an in-memory database that uses main memory for data storage and is designed for scalability. The performance tests were executed in a virtual server environment running Windows Server 2012 R2, using a test platform written in Java. SQL Server was the database best suited for grouping, sorting, and arithmetic operations, Cassandra had the shortest execution time for write operations, and NuoDB performed best in read operations. The project concludes that minimizing disk access leads to shorter execution times, but that the scalable solution, NuoDB, suffers severe performance losses when configured with only a single node. Nordicstation is recommended to upgrade to Microsoft SQL Server 2014, or later, because of the possibility to store tables in main memory.
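The measurement idea, timing repeated executions of the same query and comparing databases on a summary statistic, can be sketched as follows. The actual test platform was written in Java and ran against SQL Server, Cassandra, and NuoDB; this Python sketch only illustrates the timing harness, with a stand-in workload instead of a real database driver.

```python
import statistics
import time

def time_query(run_query, repetitions=10):
    """Time a query repeatedly and report the median execution time in milliseconds.

    run_query is any zero-argument callable that executes the query against the
    database under test (hypothetical here; no real driver is assumed).
    """
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Example with a stand-in computation instead of a real database query.
print(time_query(lambda: sum(i * i for i in range(100_000))))
```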
