  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Dimension reduction of streaming data via random projections

Cosma, Ioana Ada January 2009 (has links)
A data stream is a transiently observed sequence of data elements that arrive unordered, with repetitions, and at a very high rate of transmission. Examples include Internet traffic data, networks of banking and credit transactions, and radar-derived meteorological data. The computer science and engineering communities have developed randomised, probabilistic algorithms that estimate statistics of interest over streaming data on the fly, with small computational complexity and storage requirements, by constructing low-dimensional representations of the stream known as data sketches. This thesis combines techniques of statistical inference with algorithmic approaches, such as hashing and random projections, to derive efficient estimators for cardinality, l_{alpha} distance and quasi-distance, and entropy over streaming data. I demonstrate an unexpected connection between two approaches to cardinality estimation that involve indirect record keeping: the first using pseudo-random variates and storing selected order statistics, and the second using random projections. I show that l_{alpha} distances and quasi-distances between data streams, and entropy, can be recovered with full statistical efficiency from random projections that exploit properties of alpha-stable distributions. This is achieved by the method of L-estimation in a single-pass algorithm with modest computational requirements. The proposed estimators have good small-sample performance, further improved by trimming and winsorising; in other words, the value of these summary statistics can be approximated with high accuracy from data sketches of low dimension. Finally, I consider the problem of convergence assessment for Markov chain Monte Carlo methods that simulate from complex, high-dimensional, discrete distributions.
I argue that online, fast, and efficient computation of summary statistics such as cardinality, entropy, and l_{alpha} distances may be a useful qualitative tool for detecting lack of convergence, and illustrate this with simulations of the posterior distribution of a decomposable Gaussian graphical model via the Metropolis-Hastings algorithm.
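The alpha-stable random-projection idea can be illustrated with the alpha = 1 (Cauchy) case. The sketch below uses the simple median-of-absolute-values estimator of the l_1 distance rather than the thesis's L-estimation with trimming; all data and variable names are illustrative, not taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two high-dimensional vectors standing in for accumulated streams,
# each summarised by a k-dimensional sketch.
n, k = 10_000, 500
x = rng.random(n)
y = x + 0.1 * rng.random(n)

# Shared projection matrix with 1-stable (Cauchy) entries: for any v,
# each coordinate of R @ v is Cauchy-distributed with scale ||v||_1.
R = rng.standard_cauchy((k, n))
sx, sy = R @ x, R @ y

# Sketches are linear, so sx - sy is a sketch of x - y; the median of
# absolute values estimates ||x - y||_1, since the median of the
# absolute value of a standard Cauchy variate is 1.
estimate = np.median(np.abs(sx - sy))
truth = np.sum(np.abs(x - y))
print(estimate, truth)
```

Because the sketches are updated additively per element, the distance can be estimated without ever storing the full streams.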
22

StreamCloud: a parallel and distributed stream processing engine

Gulisano, Vincenzo 20 December 2012 (has links) (PDF)
In recent years, applications in domains such as telecommunications, network security, and large-scale sensor networks have run into multiple limitations of the traditional database paradigm. In this context, data stream processing systems have emerged as a solution for applications that demand high processing capacity with low latency. In data stream processing systems, data is not persisted and then processed; instead, it is processed on the fly in memory, producing results continuously. Current data stream processing systems, both centralized and distributed, do not scale with the input load of the system because of a bottleneck caused by concentrating entire data streams on individual nodes. Moreover, they are based on static configurations, which leads to over- or under-provisioning. This doctoral thesis presents StreamCloud, an elastic, parallel-distributed data stream processing system capable of processing large volumes of data. StreamCloud minimizes the cost of distribution and parallelization by means of a novel technique that partitions queries into parallel subqueries and assigns them to independent subsets of nodes. In addition, StreamCloud provides elasticity and load-balancing protocols that optimize resource usage according to the system load. Together with the parallelization and elasticity protocols, StreamCloud defines a fault-tolerance protocol that introduces minimal overhead while providing fast recovery. StreamCloud has been implemented and evaluated with several real-world applications, such as fraud detection and network traffic analysis. 
The evaluation was carried out on a cluster with more than 300 cores, demonstrating StreamCloud's high scalability and the effectiveness of both its elasticity and its fault tolerance.
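Partitioning a query into parallel subqueries relies on routing tuples by key so that each node handles a disjoint share of the key space. A minimal sketch of such key-based routing, with names and data that are purely illustrative, not StreamCloud's actual code:

```python
import hashlib
from collections import defaultdict

def route(key: str, n_nodes: int) -> int:
    # Stable hash, so the same key always reaches the same node and
    # per-key stateful subqueries can run independently in parallel.
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_nodes

# Distribute a stream of (card_id, amount) tuples over 4 nodes.
stream = [("card-17", 30.0), ("card-42", 999.0), ("card-17", 12.5)]
per_node = defaultdict(list)
for key, amount in stream:
    per_node[route(key, 4)].append((key, amount))

# All tuples for "card-17" land on one node, so a fraud-detection
# subquery can aggregate per-card state without cross-node traffic.
```

Because routing depends only on the key, adding nodes only requires remapping hash buckets, which is what makes elasticity protocols like StreamCloud's feasible.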
23

Investigating programming language support for fault-tolerance

Demirkoparan, Ismail January 2023 (has links)
Dataflow systems have become the norm for developing data-intensive computing applications. These systems provide transparent scalability and fault tolerance. For fault tolerance, many dataflow systems adopt a snapshotting approach which persists the state of an operator once it has received a snapshot marker on all of its input channels. This approach requires channels to be blocked, potentially for prolonged durations, until all other input channels have received their markers, to guarantee that no events from the future make it into the operator's present state snapshot. Alignment can therefore have a severe performance impact. In particular, for black-box user-defined operators, the system has no knowledge of how events from different channels affect the operator's state. Thus, the system must conservatively assume that all events affect the same state and align all channels. In this thesis, we argue that alignment between two channels is unnecessary if messages from those channels are not written to the same output channel. We propose a snapshotting approach for fault tolerance, which we call the partial approach. The partial approach does not require alignment when an operator's input channels are independent. Two input channels are independent if their events do not affect the same state and are never written to the same output channel. We propose the use of static code analysis to identify such dependencies. To enable this analysis, we translate operators into finite state machines that make the operator's state explicit. As a proof of concept, we extend the implementation of Arc-Lang, an existing dataflow language, so that applications written in it transparently execute with fault tolerance. We evaluate our approach by comparing it to a baseline eager approach that always requires alignment between the input channels.
The conducted experiments' results show that the partial approach performs about 47% better than the eager approach when the streaming sources produce data at different velocities.
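The independence criterion can be sketched as a static check: two input channels need alignment only if the state they touch or the output channels they feed overlap. A toy version of that check, where the operator model and names are illustrative rather than Arc-Lang's actual analysis:

```python
# Describe each input channel by the state variables its events touch
# and the output channels those events are written to.
channels = {
    "in_a": {"state": {"counts"}, "outputs": {"out_1"}},
    "in_b": {"state": {"totals"}, "outputs": {"out_2"}},
    "in_c": {"state": {"counts"}, "outputs": {"out_1"}},
}

def independent(c1: str, c2: str) -> bool:
    a, b = channels[c1], channels[c2]
    # Independent channels share no state and no output channel,
    # so their snapshot markers need not be aligned.
    return not (a["state"] & b["state"]) and not (a["outputs"] & b["outputs"])

print(independent("in_a", "in_b"))  # True: no alignment needed
print(independent("in_a", "in_c"))  # False: both touch "counts"
```

In the thesis this per-channel information is recovered by translating operators into finite state machines, rather than being declared by hand as here.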
24

Experimental Study on Machine Learning with Approximation to Data Streams

Jiang, Jiani January 2019 (has links)
Real-time transfer of data streams enables many data analytics and machine learning applications in areas such as massive IoT and industrial automation. The large data volume of these streams is a significant burden not only on the transport network but also on the corresponding application servers. Researchers therefore focus on reducing the amount of data that needs to be transferred, via data compression and approximation. Lossy compression techniques can significantly reduce data volume at the price of information loss, and how to compress is highly dependent on the application. However, when the decompressed data is used in a data analysis application such as machine learning, the results may be affected by the information loss. In this thesis, the author studies the impact of data compression on machine learning applications. In particular, from the experimental perspective, it shows the tradeoff among the approximation error bound, the compression ratio, and the prediction accuracy of multiple machine learning methods. The author believes that, with proper choices, data compression can dramatically reduce the amount of data transferred with limited impact on machine learning applications.
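A common error-bounded lossy scheme is uniform quantization: rounding each value to the nearest multiple of 2ε guarantees a worst-case error of ε while shrinking the symbol alphabet the compressor must encode. A minimal sketch of the error-bound vs. size tradeoff, where the scheme, data, and parameters are illustrative and not the thesis's exact method:

```python
import numpy as np

rng = np.random.default_rng(1)
signal = np.cumsum(rng.normal(size=5_000))  # a synthetic sensor stream

def quantize(x: np.ndarray, eps: float) -> np.ndarray:
    # Round to the nearest multiple of 2*eps, so |x - x_hat| <= eps
    # holds pointwise: an error-bounded approximation of the stream.
    return np.round(x / (2 * eps)) * (2 * eps)

for eps in (0.1, 1.0, 10.0):
    approx = quantize(signal, eps)
    n_symbols = np.unique(approx).size   # proxy for compressed size
    max_err = np.abs(signal - approx).max()
    print(eps, n_symbols, max_err)       # larger eps: fewer symbols, more error
```

Feeding `approx` instead of `signal` into a learning pipeline is the experiment the thesis runs at scale: sweep ε and watch prediction accuracy against compression ratio.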
25

A study toward multiresolution of appearance

Hadim, Julien 11 May 2009 (has links)
In recent years, the Bidirectional Texture Function (BTF) has emerged as a flexible solution for realistic, real-time rendering of materials with complex appearance at low computational cost. One drawback of this approach, however, is the huge amount of resulting data: several methods have been proposed to compress and manage it. In this document, we propose a new BTF representation that improves data coherency and thus allows better data compression. In a first part, we study acquisition and digital generation methods for BTFs and, more particularly, compression methods suitable for GPU rendering. We then use our software BTFInspect to determine which of the visual phenomena present in BTFs chiefly drive per-texel data coherence. In a second part, we propose a new BTF representation, named Flat Bidirectional Texture Function (Flat-BTF), which improves data coherency and thus compression. The analysis of the results shows, statistically and visually, the gain in coherency as well as the absence of a noticeable loss of quality compared with the original representation. In a third and last part, we demonstrate how our new representation can be used for real-time rendering applications on GPUs. Finally, we introduce appearance compression based on a GPU quantization method, presented in the context of streaming 3D data between a server holding 3D models and a client that wants to visualize them.
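BTF data is commonly laid out as a matrix with one row per texel and one column per (view, light) pair; the more coherent the layout, the closer this matrix is to low rank, and the fewer components a PCA/SVD-style compressor needs. A toy sketch of that standard factorization approach, with synthetic dimensions and data not taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic BTF: 256 texels x 64 (view, light) conditions, built to be
# nearly low-rank, as a coherent (e.g. Flat-BTF-like) layout would be.
texels, conditions, rank = 256, 64, 4
btf = rng.random((texels, rank)) @ rng.random((rank, conditions))

# Keep the top-k singular components as the compressed representation.
U, s, Vt = np.linalg.svd(btf, full_matrices=False)
k = 4
approx = U[:, :k] * s[:k] @ Vt[:k]

stored = k * (texels + conditions + 1)   # floats kept after truncation
original = texels * conditions
err = np.linalg.norm(btf - approx) / np.linalg.norm(btf)
print(original / stored, err)            # compression ratio, relative error
```

On real BTFs the singular values decay rather than vanish, so k trades quality for size; improving coherence, as Flat-BTF does, steepens that decay.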
26

Finding A Subset Of Non-defective Items From A Large Population: Fundamental Limits And Efficient Algorithms

Sharma, Abhay 05 1900 (has links) (PDF)
Consider a large population containing a small number of defective items. A commonly encountered goal is to identify the defective items, for example, to isolate them. In the classical non-adaptive group testing (NAGT) approach, one groups the items into subsets, or pools, and runs tests for the presence of a defective item on each pool. Using the outcomes of the tests, a fundamental goal of group testing is to reliably identify the complete set of defective items with as few tests as possible. In contrast, this thesis studies a non-defective subset identification problem, where the primary goal is to identify a “subset” of “non-defective” items given the test outcomes. The main contributions of this thesis are: We derive upper and lower bounds on the number of non-adaptive group tests required to identify a given number of non-defective items with arbitrarily small probability of incorrect identification as the population size goes to infinity. We show that an impressive reduction in the number of tests is achievable compared to the approach of first identifying all the defective items and then picking the required number of non-defective items from the complement set. For example, in the asymptotic regime with the population size N → ∞, to identify L non-defective items out of a population containing K defective items, when the tests are reliable, our results show that O(K log(KL/N)) tests are sufficient when L ≪ N − K and K is fixed. In contrast, the number of tests necessary with the conventional approach grows with N as O(K log K log(N/K)). Our results are derived using a general sparse signal model, by virtue of which they are also applicable to other important sparse-signal-based applications such as compressive sensing. We present a bouquet of computationally efficient and analytically tractable non-defective subset recovery algorithms.
By analyzing the probability of error of the algorithms, we obtain bounds on the number of tests required for non-defective subset recovery with arbitrarily small probability of error. By comparing with the information theoretic lower bounds, we show that the upper bounds on the number of tests are order-wise tight up to a log(K) factor, where K is the number of defective items. Our analysis accounts for the impact of both the additive noise (false positives) and dilution noise (false negatives). We also provide extensive simulation results that compare the relative performance of the different algorithms and provide further insights into their practical utility. The proposed algorithms significantly outperform the straightforward approaches of testing items one-by-one, and of first identifying the defective set and then choosing the non-defective items from the complement set, in terms of the number of measurements required to ensure a given success rate. We investigate the use of adaptive group testing in the application of finding a spectrum hole of a specified bandwidth in a given wideband of interest. We propose a group testing based spectrum hole search algorithm that exploits sparsity in the primary spectral occupancy by testing a group of adjacent sub-bands in a single test. This is enabled by a simple and easily implementable sub-Nyquist sampling scheme for signal acquisition by the cognitive radios. Energy-based hypothesis tests are used to provide an occupancy decision over the group of sub-bands, and this forms the basis of the proposed algorithm to find contiguous spectrum holes of a specified bandwidth. We extend this framework to a multistage sensing algorithm that can be employed in a variety of spectrum sensing scenarios, including non-contiguous spectrum hole search.
Our analysis allows one to identify the sparsity and SNR regimes where group testing can lead to significantly lower detection delays compared to a conventional bin-by-bin energy detection scheme. We illustrate the performance of the proposed algorithms via Monte Carlo simulations.
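The core observation behind non-defective subset recovery is simple: under noiseless tests, every item that appears in at least one negative pool is certainly non-defective. A minimal sketch of this rule with random pooling, where the sizes, pooling probability, and noiseless assumption are all illustrative:

```python
import random

random.seed(3)
N, K, n_tests = 200, 5, 60
defective = set(random.sample(range(N), K))

# Random pooling: each item joins each test independently.
pools = [{i for i in range(N) if random.random() < 0.2} for _ in range(n_tests)]
outcomes = [bool(pool & defective) for pool in pools]  # noiseless OR channel

# Any member of a negative pool is non-defective.
non_defective = set()
for pool, positive in zip(pools, outcomes):
    if not positive:
        non_defective |= pool

assert not (non_defective & defective)  # never misclassifies a defective
print(len(non_defective))
```

The thesis's algorithms refine this idea to tolerate additive and dilution noise, where a single negative pool is no longer conclusive.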
27

Performance of message brokers in event-driven architecture: Amazon SNS/SQS vs Apache Kafka

Edeland, Johan, Zivkovic, Ivan January 2023 (has links)
Microservice architecture, which involves breaking down applications into smaller and loosely coupled components, is becoming increasingly common in the development of modern systems. Connections between these components can be established in various ways. One of these approaches is event-driven architecture, where components in the system communicate asynchronously with each other through message queues.  AWS, Amazon Web Services, the largest provider of cloud-based services, offers such a messaging queue: SQS, Amazon Simple Queue Service, which can be integrated with SNS, Amazon Simple Notification Service, to enable "one-to-many" asynchronous communication.  An alternative tool is Apache Kafka, created by LinkedIn and later open-sourced under the Apache Software Foundation. Apache Kafka is an event logging and streaming platform that can also function as a message queue in an event-driven architecture.  The authors of this thesis have been commissioned by Scania to compare and evaluate the performance of these two tools and investigate whether there are use cases where one might be more suitable than the other. To achieve this, two prototypes were developed, each prototype consisting of a producer microservice and a consumer microservice. These prototypes were then used to conduct latency and load tests by producing messages and measuring the time interval until they were consumed.  The results of the tests show that Apache Kafka has a lower average latency than SNS/SQS and scales more efficiently with increasing data volumes, making it more suitable for use cases involving real-time data streaming. Its potential as a message bus for loosely coupled components in the system is also evident. In this context, SNS/SQS is equally valuable, as it operates as a dedicated message bus with good latency and offers a user-friendly and straightforward setup process. 
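The latency methodology described above, timestamping each message at produce time and measuring the interval at consume time, can be sketched broker-agnostically. In this illustrative harness an in-memory queue stands in for Kafka or SNS/SQS, and all names are assumptions rather than the thesis's actual prototypes:

```python
import queue
import statistics
import threading
import time

broker = queue.Queue()  # stand-in for the real message broker
latencies = []

def producer(n_messages: int) -> None:
    for i in range(n_messages):
        # Timestamp each message at the moment it is produced.
        broker.put((i, time.perf_counter()))

def consumer(n_messages: int) -> None:
    for _ in range(n_messages):
        _, sent_at = broker.get()
        latencies.append(time.perf_counter() - sent_at)

t_prod = threading.Thread(target=producer, args=(1_000,))
t_cons = threading.Thread(target=consumer, args=(1_000,))
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()

print(statistics.mean(latencies), max(latencies))  # average and worst-case latency
```

Against a real broker, the producer and consumer would run as separate services with synchronized clocks, and the load tests would vary message rate and size rather than use a fixed batch.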
