Global ETD Search

21	Automata methods and techniques for graph-structured data Shoaran, Maryam 23 April 2011 (has links) Graph-structured data (GSD) is a popular model to represent complex information in a wide variety of applications such as social networks, biological data management, digital libraries, and traffic networks. The flexibility of this model allows the information to evolve and easily integrate with heterogeneous data from many sources. In this dissertation we study three important problems on GSD. A consistent theme of our work is the use of automata methods and techniques to process and reason about GSD. First, we address the problem of answering queries on GSD in a distributed environment. We focus on regular path queries (RPQs) – given by regular expressions matching paths in graph-data. RPQs are the building blocks of almost any mechanism for querying GSD. We present a fault-tolerant, message-efficient, and truly distributed algorithm for answering RPQs. Our algorithm works for the larger class of weighted RPQs on weighted GSDs. Second, we consider the problem of answering RPQs on incomplete GSD, where different data sources are represented by materialized database views. We explore the connection between “certain answers” (CAs) and answers obtained from “view-based rewritings” (VBRs) for RPQs. CAs are answers that can be obtained on each database consistent with the views. Computing all of CAs for RPQs is NP-hard, and one has to resort to an exponential algorithm in the size of the data–view materializations. On the other hand, VBRs are query reformulations in terms of the view definitions. They can be used to obtain query answers in polynomial time in the size of the data. These answers are CAs, but unfortunately for RPQs, not all of the CAs can be obtained in this way. In this work, we show the surprising result that for RPQs under local semantics, using VBRs to answer RPQs gives all the CAs. The importance of this result is that under such semantics, the CAs can be obtained in polynomial time in the size of the data. Third, we focus on XML–an important special case of GSD. The scenario we consider is streaming XML between exchanging parties. The problem we study is flexible validation of streaming XML under the realistic assumption that the schemas of the exchanging parties evolve, and thus diverge from one another. We represent schemas by using Visibly Pushdown Automata (VPAs), which recognize Visibly Pushdown Languages (VPLs). We model evolution for XML by defining formal language operators on VPLs. We show that VPLs are closed under the defined language operators and this enables us to expand the schemas (for XML) in order to account for flexible or constrained evolution. / Graduate Regular Path Queries (RPQs) Graph Data Distributed Algorithm Query Rewriting Local RPQs Certain Answers XML Schema Evolution Streaming Data Visibly Pushdown Automata (VPAs)
22	Algorithm design on multicore processors for massive-data analysis Agarwal, Virat 28 June 2010 (has links) Analyzing massive-data sets and streams is computationally very challenging. Data sets in systems biology, network analysis and security use network abstraction to construct large-scale graphs. Graph algorithms such as traversal and search are memory-intensive and typically require very little computation, with access patterns that are irregular and fine-grained. The increasing streaming data rates in various domains such as security, mining, and finance leaves algorithm designers with only a handful of clock cycles (with current general purpose computing technology) to process every incoming byte of data in-core at real-time. This along with increasing complexity of mining patterns and other analytics puts further pressure on already high computational requirement. Processing streaming data in finance comes with an additional constraint to process at low latency, that restricts the algorithm to use common techniques such as batching to obtain high throughput. The primary contributions of this dissertation are the design of novel parallel data analysis algorithms for graph traversal on large-scale graphs, pattern recognition and keyword scanning on massive streaming data, financial market data feed processing and analytics, and data transformation, that capture the machine-independent aspects, to guarantee portability with performance to future processors, with high performance implementations on multicore processors that embed processorspecific optimizations. Our breadth first search graph traversal algorithm demonstrates a capability to process massive graphs with billions of vertices and edges on commodity multicore processors at rates that are competitive with supercomputing results in the recent literature. We also present high performance scalable keyword scanning on streaming data using novel automata compression algorithm, a model of computation based on small software content addressable memories (CAMs) and a unique data layout that forces data re-use and minimizes memory traffic. Using a high-level algorithmic approach to process financial feeds we present a solution that decodes and normalizes option market data at rates an order of magnitude more than the current needs of the market, yet portable and flexible to other feeds in this domain. In this dissertation we discuss in detail algorithm design challenges to process massive-data and present solutions and techniques that we believe can be used and extended to solve future research problems in this domain. Graph algorithms Parallel computing Massive data Financial market data Streaming data Keyword scanning Data mining Algorithms Pattern recognition systems Computer algorithms
23	Approximation of OLAP queries on data warehouses / Approximation aux requêtes OLAP sur les entrepôts de données Cao, Phuong Thao 20 June 2013 (has links) Nous étudions les réponses proches à des requêtes OLAP sur les entrepôts de données. Nous considérons les réponses relatives aux requêtes OLAP sur un schéma, comme les distributions avec la distance L1 et rapprocher les réponses sans stocker totalement l'entrepôt de données. Nous présentons d'abord trois méthodes spécifiques: l'échantillonnage uniforme, l'échantillonnage basé sur la mesure et le modèle statistique. Nous introduisons également une distance d'édition entre les entrepôts de données avec des opérations d'édition adaptées aux entrepôts de données. Puis, dans l'échange de données OLAP, nous étudions comment échantillonner chaque source et combiner les échantillons pour rapprocher toutes requêtes OLAP. Nous examinons ensuite un contexte streaming, où un entrepôt de données est construit par les flux de différentes sources. Nous montrons une borne inférieure de la taille de la mémoire nécessaire aux requêtes approximatives. Dans ce cas, nous avons les réponses pour les requêtes OLAP avec une mémoire finie. Nous décrivons également une méthode pour découvrir les dépendances statistique, une nouvelle notion que nous introduisons. Nous recherchons ces dépendances en basant sur l'arbre de décision. Nous appliquons la méthode à deux entrepôts de données. Le premier simule les données de capteurs, qui fournissent des paramètres météorologiques au fil du temps et de l'emplacement à partir de différentes sources. Le deuxième est la collecte de RSS à partir des sites web sur Internet. / We study the approximate answers to OLAP queries on data warehouses. We consider the relative answers to OLAP queries on a schema, as distributions with the L1 distance and approximate the answers without storing the entire data warehouse. We first introduce three specific methods: the uniform sampling, the measure-based sampling and the statistical model. We introduce also an edit distance between data warehouses with edit operations adapted for data warehouses. Then, in the OLAP data exchange, we study how to sample each source and combine the samples to approximate any OLAP query. We next consider a streaming context, where a data warehouse is built by streams of different sources. We show a lower bound on the size of the memory necessary to approximate queries. In this case, we approximate OLAP queries with a finite memory. We describe also a method to discover the statistical dependencies, a new notion we introduce. We are looking for them based on the decision tree. We apply the method to two data warehouses. The first one simulates the data of sensors, which provide weather parameters over time and location from different sources. The second one is the collection of RSS from the web sites on Internet. OLAP Réponses proches de la requête Échange de données OLAP Flux de données Distance d'édition Algorithme d'échantillonnage Dépendances statistiques Modèle statistique OLAP Approximate query answering OLAP data exchange Streaming data Edit distance Sampling algorithm Statistical dependencies Statistical model
24	[en] ANOMALY DETECTION IN DATA CENTER MACHINE MONITORING METRICS / [pt] DETECÇÃO DE ANOMALIAS NAS MÉTRICAS DAS MONITORAÇÕES DE MÁQUINAS DE UM DATA CENTER RICARDO SOUZA DIAS 17 January 2020 (has links) [pt] Um data center normalmente possui grande quantidade de máquinas com diferentes configurações de hardware. Múltiplas aplicações são executadas e software e hardware são constantemente atualizados. Para evitar a interrupção de aplicações críticas, que podem causar grandes prejuízos financeiros, os administradores de sistemas devem identificar e corrigir as falhas o mais cedo possível. No entanto, a identificação de falhas em data centers de produção muitas vezes ocorre apenas quando as aplicações e serviços já estão indisponíveis. Entre as diferentes causas da detecção tardia de falhas estão o uso técnicas de monitoração baseadas apenas em thresholds. O aumento crescente na complexidade de aplicações que são constantemente atualizadas torna difícil a configuração de thresholds ótimos para cada métrica e servidor. Este trabalho propõe o uso de técnicas de detecção de anomalias no lugar de técnicas baseadas em thresholds. Uma anomalia é um comportamento do sistema que é incomum e significativamente diferente do comportamento normal anterior. Desenvolvemos um algoritmo para detecção de anomalias, chamado DASRS (Decreased Anomaly Score by Repeated Sequence) que analisa em tempo real as métricas coletadas de servidores de um data center de produção. O DASRS apresentou excelentes resultados de acurácia, compatível com os algoritmos do estado da arte, além de tempo de processamento e consumo de memória menores. Por esse motivo, o DASRS atende aos requisitos de processamento em tempo real de um grande volume de dados. / [en] A data center typically has a large number of machines with different hardware configurations. Multiple applications are executed and software and hardware are constantly updated. To avoid disruption of critical applications, which can cause significant financial loss, system administrators should identify and correct failures as early as possible. However, fault-detection in production data centers often occurs only when applications and services are already unavailable. Among the different causes of late fault-detection are the use of thresholds-only monitoring techniques. The increasing complexity of constantly updating applications makes it difficult to set optimal thresholds for each metric and server. This paper proposes the use of anomaly detection techniques in place of thresholds based techniques. An anomaly is a system behavior that is unusual and significantly different from the previous normal behavior. We have developed an anomaly detection algorithm called Decreased Anomaly Score by Repeated Sequence (DASRS) that analyzes real-time metrics collected from servers in a production data center. DASRS has showed excellent accuracy results, compatible with state-of-the-art algorithms, and reduced processing time and memory consumption. For this reason, DASRS meets the real-time processing requirements of a large volume of data. [pt] SERIE TEMPORAL [pt] DATA CENTER [pt] APRENDIZAGEM NAO SUPERVISIONADA [pt] DADOS DE STREAMING [pt] DETECCAO DE ANOMALIAS [pt] METRICA [pt] MONITORACAO [pt] ALGORITMO [en] TIME SERIE [en] DATA CENTER [en] UNSUPERVISED LEARNING [en] STREAMING DATA [en] ANOMALY DETECTION [en] METRIC [en] MONITORING [en] ALGORITHM

Page generated in 0.0905 seconds