  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

Scalable Validation of Data Streams

Xu, Cheng January 2016 (has links)
In manufacturing industries, sensors installed on industrial equipment generate high volumes of data in real time. To shorten machine downtime and reduce maintenance costs, it is critical to analyze such streams efficiently in order to detect abnormal equipment behavior. To validate data streams and detect anomalies, a data stream management system called SVALI was developed. Based on requirements from the application domain, different stream window semantics are explored and an extensible set of window-forming functions is implemented, where dynamic registration of window aggregations allows incremental evaluation of aggregate functions over windows. To facilitate stream validation at a high level, the system provides two second-order validation functions, model-and-validate and learn-and-validate. Model-and-validate allows the user to define mathematical models based on physical properties of the monitored equipment, while learn-and-validate builds statistical models by sampling the stream in real time as it flows. To validate geographically distributed equipment with short response times, SVALI is a distributed system in which many SVALI instances can be started and run in parallel on board the equipment. Central analyses are made at a monitoring center, where streams of detected anomalies are combined and analyzed on a cluster computer. SVALI is an extensible system in which functions can be implemented using external libraries written in C, Java, and Python without any modification of the original code. The system and the developed functionality have been applied in several applications, both industrial and in sports analytics.
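The learn-and-validate approach described in the abstract, which samples the stream to build a statistical model and flags deviations from it, can be sketched as follows. This is an illustrative sketch in Python, not SVALI's actual API; the window size and sigma threshold are assumed parameters.

```python
import statistics
from collections import deque

def learn_and_validate(stream, window_size=30, threshold=3.0):
    """Flag readings deviating more than `threshold` standard deviations
    from a model learned over a sliding window of recent values."""
    window = deque(maxlen=window_size)
    anomalies = []
    for t, value in enumerate(stream):
        if len(window) == window.maxlen:
            mean = statistics.fmean(window)
            std = statistics.pstdev(window)
            if std > 0 and abs(value - mean) > threshold * std:
                anomalies.append((t, value))
        window.append(value)
    return anomalies

# Steady sensor readings with mild noise, then one spike at index 40.
readings = [20.0 + 0.1 * (i % 3) for i in range(40)] + [95.0] + [20.0] * 10
print(learn_and_validate(readings))  # → [(40, 95.0)]
```

A model-and-validate variant would replace the learned mean/deviation with a user-supplied physical model of the expected value.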
12

Self-describing objects with tangible data structures / Objets intelligents avec des données tangibles

Sinha, Arnab 28 May 2014 (has links)
En informatique ubiquitaire, l'observation du monde physique et de son "contexte" (une représentation haut niveau de la situation physique) est essentielle. Il existe de nombreux moyens pour observer le contexte. Typiquement, cela consiste en un traitement en plusieurs étapes commençant par la récupération de données brutes issues de capteurs. Diverses technologies de capteurs sont utilisées pour la récupération d'informations de bas niveau sur les activités physiques en cours. Ces données sont ensuite rassemblées, analysées et traitées ailleurs dans les systèmes d'information afin d'offrir une reconnaissance de contexte. Les applications déployées réagissent alors en fonction du contexte/de la situation détecté(e). Parmis les capteurs utilisés, les tags RFID, une technologie émergente, permettent de créer un lien virtuel direct entre les objets physiques et les systèmes d'information. En plus de stocker des identifiants, ils offrent un espace mémoire générique aux objets auxquels ils sont attachés, offrant de nouvelles possibilités d'architectures en informatique omniprésente. Dans cette thèse, nous proposons une approche originale tirant parti de l'espace mémoire offerts aux objets réels par les tags RFID. Dans notre approche, les objets supportent directement le système d'information. Ce type d'intégration permet de réduire les communications requises par le traitement à distance. Pour ce faire, des données sémantiques sont tout d'abord attachées aux objets afin de les rendre auto-descriptifs. Ainsi, les données pertinentes concernant une entité physique sont directement disponibles pour un traitement local. Les objets peuvent ensuite être liés virtuellement grâce à des structures de données dédiées ou ad hoc et distribuées sur les objets eux-mêmes. Ce faisant, le traitement des données peut se faire de façon directe. Par exemple, certaines propriétés peuvent être vérifiées localement sur un ensemble d'objets. 
Une relation physique peut être déduite directement de la structure de données, d'où le nom de "structures de données tangibles". Vis-à-vis des approches conventionnelles tirant parti des identifiants, notre approche offre des avantages en termes de vie privée, de mise à l'échelle, d'autonomie et d'indépendance vis-à-vis des infrastructures. Le défi se situe au niveau de son expressivité limitée à cause du faible espace mémoire disponible sur les tags RFID. Les principes sont validés dans deux prototypes aux applications différentes. Le premier prototype est développé dans le domaine de la gestion de déchets afin d'aider le tri et d'améliorer le recyclage. Le deuxième offre des services supplémentaires, tels qu'une assistance lors du montage et de la vérification d'objets composés de plusieurs parties, grâce aux structures de données distribuées sur les différentes parties. / Pervasive computing, or ambient computing, aims to integrate information systems into the environment in a manner as transparent as possible to users. It allows information systems to be tightly coupled with the physical activities within the environment. Everyday objects, along with their environment, are made smarter through embedded computing, sensors, etc., and can communicate among themselves. In pervasive computing, it is necessary to sense the real physical world and to perceive its “context”: a high-level representation of the physical situation. There are various ways to derive the context. Typically, the approach is a multi-step process that begins with sensing. Various sensing technologies are used to capture low-level information about the physical activities, which is then aggregated, analyzed, and processed elsewhere in the information systems to become aware of the context. Deployed applications then react depending on the detected context.
Among sensors, RFID is an important emerging technology that allows a direct digital link between information systems and physical objects. Besides storing identification data, RFID also provides a general-purpose storage space on objects, enabling new architectures for pervasive computing. In this thesis, we defend an original approach that adopts the latter use of RFID, i.e., a digital memory integrated into real objects. The approach builds on the principle that the objects themselves support the information system. This form of integration reduces the need for communication with remote processing. The principle is realized in two ways. First, objects are piggybacked with semantic information about themselves, making them self-describing objects; relevant information associated with a physical entity is thus readily available locally for processing. Second, groups of related objects are digitally linked using dedicated or ad hoc data structures distributed over the objects themselves, which allows direct data processing, such as validating a property involving the objects in proximity. The physical relations among objects can be interpreted digitally from the data structure; this justifies the name “Tangible Data Structures”. Unlike conventional methods based on identifiers, our approach offers benefits in terms of privacy, scalability, autonomy, and reduced dependency on infrastructure. Its challenge lies in limited expressivity, due to the small memory space available on the tags. The principles are validated by prototypes in two different application domains. The first application, developed for the waste management domain, helps in efficient sorting and better recycling. The second provides added services, such as assistance during assembly and verification of composite objects, using a data structure distributed across the individual pieces.
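The idea of packing a compact self-description into a tag's small user memory can be illustrated with a fixed binary layout. The field set and sizes below are hypothetical, chosen only to show how a few bytes can make an object self-describing; they are not the encoding used in the thesis.

```python
import struct

# Hypothetical 16-byte layout for a small RFID user-memory bank:
# material code (2 bytes), mass in grams (4), part index and total
# parts of a composite object (1 each), and a short ASCII label (8).
RECORD = struct.Struct(">H I B B 8s")  # big-endian, 16 bytes total

def encode(material, grams, part, total, label):
    padded = label.encode("ascii")[:8].ljust(8, b"\0")
    return RECORD.pack(material, grams, part, total, padded)

def decode(blob):
    material, grams, part, total, label = RECORD.unpack(blob)
    return {"material": material, "grams": grams, "part": part,
            "total": total, "label": label.rstrip(b"\0").decode("ascii")}

tag = encode(material=42, grams=350, part=2, total=5, label="lid")
print(RECORD.size, decode(tag))
```

With such a layout, a reader can validate a composite object locally, e.g. by checking that all `total` parts are present, without contacting any back-end infrastructure.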
13

Um modelo para manutenção de esquema e de dados em data warehouses implementados em plataformas móveis. / A model to schema and data maintenance in data warehouses implemented at mobile platforms.

Italiano, Isabel Cristina 11 June 2007 (has links)
O presente trabalho propõe uma arquitetura de utilização de Data Warehouses em computadores móveis, descreve os componentes desta arquitetura (dados e processos) que permite o sincronismo dos dados baseado em metadados e limitado ao escopo de acesso de cada usuário. As estruturas de dados e os processos que compõem a arquitetura proposta são reduzidos a problemas conhecidos e já solucionados, justificando sua viabilidade. Além disso, o presente trabalho também fornece diretrizes para avaliar a complexidade e o impacto causados por alterações de esquema no Data Warehouse central que devem ser refletidas nos data marts localizados nas plataformas móveis. A avaliação da complexidade e impacto das alterações nos esquemas do Data Warehouse pode auxiliar os administradores do ambiente a planejar a implementação destas alterações, propondo melhores alternativas no caso de alterações de esquema mais complexas e que causem um impacto maior no ambiente. A importância do trabalho está relacionada a casos reais de necessidade de evolução nas bases de dados analíticas (Data Warehouse) em computadores móveis, nos quais os usuários mantêm seu próprio subconjunto de dados do Data Warehouse para apoiar os processos de negócios. / This work presents an architecture for using Data Warehouses on mobile computers and describes the components of this architecture (data and processes), which allow data synchronization based on metadata and restricted to each user's access scope. The data structures and processes that compose the architecture are reduced to known, already solved problems, justifying its feasibility. In addition, this work provides guidelines for evaluating the complexity and impact caused by schema changes in the central Data Warehouse that must be reflected in the data marts located on the mobile platforms. This evaluation may help environment administrators plan the implementation of such changes and propose better alternatives when dealing with more complex schema changes that cause a greater impact on the environment. The relevance of this work is related to real cases requiring the evolution of analytical databases (Data Warehouses) on mobile computers, in which users keep their own subset of the Data Warehouse data to support their business processes.
14

Contributions to Collective Dynamical Clustering-Modeling of Discrete Time Series

Wang, Chiying 27 April 2016 (has links)
The analysis of sequential data is important in business, science, and engineering, for tasks such as signal processing, user behavior mining, and commercial transactions analysis. In this dissertation, we build upon the Collective Dynamical Modeling and Clustering (CDMC) framework for discrete time series modeling, by making contributions to clustering initialization, dynamical modeling, and scaling. We first propose a modified Dynamic Time Warping (DTW) approach for clustering initialization within CDMC. The proposed approach provides DTW metrics that penalize deviations of the warping path from the path of constant slope. This reduces over-warping, while retaining the efficiency advantages of global constraint approaches, and without relying on domain dependent constraints. Second, we investigate the use of semi-Markov chains as dynamical models of temporal sequences in which state changes occur infrequently. Semi-Markov chains allow explicitly specifying the distribution of state visit durations. This makes them superior to traditional Markov chains, which implicitly assume an exponential state duration distribution. Third, we consider convergence properties of the CDMC framework. We establish convergence by viewing CDMC from an Expectation Maximization (EM) perspective. We investigate the effect on the time to convergence of our efficient DTW-based initialization technique and selected dynamical models. We also explore the convergence implications of various stopping criteria. Fourth, we consider scaling up CDMC to process big data, using Storm, an open source distributed real-time computation system that supports batch and distributed data processing. We performed experimental evaluation on human sleep data and on user web navigation data. Our results demonstrate the superiority of the strategies introduced in this dissertation over state-of-the-art techniques in terms of modeling quality and efficiency.
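The soft slope penalty for DTW alluded to above, penalizing warping paths that stray from the constant-slope path, can be sketched by adding a cost proportional to each cell's distance from the diagonal. This is a generic illustration of the idea, not the thesis's exact metric; the `weight` parameter is an assumption.

```python
import math

def penalized_dtw(a, b, weight=0.1):
    """DTW distance with an extra cost for warping-path cells that
    stray from the constant-slope (diagonal) path, discouraging
    over-warping without a hard global band constraint."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Normalized distance of cell (i, j) from the diagonal.
            drift = abs(i / n - j / m)
            cost = abs(a[i - 1] - b[j - 1]) + weight * drift
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(penalized_dtw([0, 1, 2, 3], [0, 1, 2, 3]))  # → 0.0
```

Identical series align along the diagonal, so both the point-wise cost and the drift penalty vanish; misaligned series pay for each off-diagonal step in proportion to how far it drifts.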
16

Técnicas de combinação para agrupamento centralizado e distribuído de dados / Ensemble techniques for centralized and distributed clustering

Naldi, Murilo Coelho 24 January 2011 (has links)
A grande quantidade de dados gerada em diversas áreas do conhecimento cria a necessidade do desenvolvimento de técnicas de mineração de dados cada vez mais eficientes e eficazes. Técnicas de agrupamento têm sido utilizadas com sucesso em várias áreas, especialmente naquelas em que não há conhecimento prévio sobre a organização dos dados. Contudo, a utilização de diferentes algoritmos de agrupamento, ou variações de um mesmo algoritmo, pode gerar uma ampla variedade de resultados. Tamanha variedade cria a necessidade de métodos para avaliar e selecionar bons resultados. Uma forma de avaliar esses resultados consiste em utilizar índices de validação de agrupamentos. Entretanto, uma grande diversidade de índices de validação foi proposta na literatura, o que torna a escolha de um único índice de validação uma tarefa penosa caso os desempenhos dos índices comparados sejam desconhecidos para a classe de problemas de interesse. Com a finalidade de obter um consenso entre resultados, é possível combinar um conjunto de agrupamentos ou índices de validação em uma única solução final. Combinações de agrupamentos (clustering ensembles) foram bem sucedidas em obter soluções robustas a variações no cenário de aplicação, o que faz do uso de comitês de agrupamentos uma alternativa interessante para encontrar soluções de qualidade razoável, segundo diferentes índices de validação. Adicionalmente, utilizar uma combinação de índices de validação pode tornar a avaliação de agrupamentos mais completa, uma vez que uma maioria dos índices combinados pode compensar o fraco desempenho do restante. Em alguns casos, não é possível lidar com um único conjunto de dados centralizado, por razões físicas ou questões de privacidade, o que gera a necessidade de distribuir o processo de mineração. 
Combinações de agrupamentos também podem ser estendidas para problemas de agrupamento de dados distribuídos, uma vez que informações sobre os dados, oriundas de diferentes fontes, podem ser combinadas em uma única solução global. O principal objetivo desse trabalho consiste em investigar técnicas de combinação de agrupamentos e de índices de validação aplicadas na seleção de agrupamentos para combinação e na mineração distribuída de dados. Adicionalmente, algoritmos evolutivos de agrupamento são estudados com a finalidade de selecionar soluções de qualidade dentre os resultados obtidos. As técnicas desenvolvidas possuem complexidade computacional reduzida e escalabilidade, o que permite sua aplicação em grandes conjuntos de dados ou cenários em que os dados encontram-se distribuídos. / The large amount of data produced in different areas of knowledge creates the need for increasingly efficient and effective data mining techniques. Clustering techniques have been successfully applied in several areas, especially when there is no prior knowledge about the data organization. Nevertheless, the use of different clustering algorithms, or variations of the same algorithm, can generate a wide variety of results, which raises the need for methods to assess and select good ones. One way to evaluate these results consists of using cluster validation indexes. However, a wide variety of validation indexes has been proposed in the literature, which can make choosing a single index challenging if the performance of the compared indexes is unknown for the class of problems of interest. In order to obtain a consensus among different options, a set of clustering results or validation indexes can be combined into a single final solution. Clustering ensembles have successfully produced solutions robust to variations in the application scenario, which makes them an attractive alternative for finding solutions of reasonable quality according to different validation indexes. Moreover, using a combination of validation indexes can make the evaluation more comprehensive, since a majority of the combined indexes can compensate for the poor performance of the remaining ones. In some cases, it is not possible to work with a single centralized data set, for physical reasons or privacy concerns, which creates the need to distribute the mining process. Clustering ensembles can be extended to distributed data mining problems, since information about the data from distributed sources can be combined into a single global solution. The main objective of this research is to investigate combination techniques for clustering results and validation indexes, applied to clustering-ensemble selection and to distributed data mining. Additionally, evolutionary clustering algorithms are studied in order to select quality solutions among the obtained results. The developed techniques have reduced computational complexity and are scalable, allowing their use on large data sets or in scenarios with distributed data.
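The clustering-ensemble idea of combining several partitions into one consensus solution is commonly realized through a co-association (evidence accumulation) matrix. The sketch below is a minimal illustration of that general technique, not the specific methods developed in the thesis; the 0.5 merge threshold is an assumption.

```python
from itertools import combinations

def co_association(partitions, n):
    """Fraction of base partitions that place each pair of the n
    points in the same cluster (evidence accumulation)."""
    M = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                M[i][j] += 1
                M[j][i] += 1
    k = len(partitions)
    return [[v / k for v in row] for row in M]

def consensus_clusters(partitions, n, threshold=0.5):
    """Greedy single-link grouping over the co-association matrix:
    merge any pair co-clustered by more than `threshold` of the base
    partitions."""
    M = co_association(partitions, n)
    cluster_of = list(range(n))
    for i, j in combinations(range(n), 2):
        if M[i][j] > threshold:
            old, new = cluster_of[j], cluster_of[i]
            cluster_of = [new if c == old else c for c in cluster_of]
    return cluster_of

# Three base partitions of four points; the third disagrees on labels
# but agrees on the grouping {0,1} vs {2,3}.
parts = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
print(consensus_clusters(parts, 4))  # → [0, 0, 2, 2]
```

Because the matrix only records whether points co-occur in a cluster, the consensus is invariant to label permutations across the base partitions, which is what makes the combination robust.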
17

PeerDB-Peering into Personal Databases

Ooi, Beng Chin, Tan, Kian Lee 01 1900 (has links)
In this talk, we will present the design and evaluation of PeerDB, a peer-to-peer (P2P) distributed data sharing system. PeerDB distinguishes itself from existing P2P systems in several ways. First, it is a full-fledged data management system that supports fine-grained content-based searching. Second, it facilitates sharing of data without a shared schema. Third, it combines the power of mobile agents with P2P systems to perform operations at peers' sites. Fourth, the PeerDB network is self-configurable, i.e., a node can dynamically optimize the set of peers it communicates with directly, based on some optimization criterion. / Singapore-MIT Alliance (SMA)
18

Policy architecture for distributed storage systems

Belaramani, Nalini Moti 15 October 2009 (has links)
Distributed data storage is a building block for many distributed systems, such as mobile file systems, web service replication systems, and enterprise file systems. New distributed data storage systems are frequently built as new environments, requirements, or workloads emerge. The goal of this dissertation is to advance the science of distributed storage systems by making it easier to build new ones. To achieve this goal, it proposes a new policy architecture, PADS, based on two key ideas: first, by providing a set of common mechanisms in an underlying layer, new systems can be implemented by defining policies that orchestrate these mechanisms; second, policy can be separated into routing policy and blocking policy, each of which addresses a different part of the system design. Routing policy specifies how data flow among nodes in order to meet performance, availability, and resource-usage goals, whereas blocking policy specifies when it is safe to access data in order to meet consistency and durability goals. This dissertation presents a PADS prototype that defines a set of distributed storage mechanisms flexible and general enough to support a large range of systems, a small policy API that is easy to use and captures the right abstractions for distributed storage, and a declarative language for specifying policy that enables quick, concise implementations of complex systems. We demonstrate that PADS significantly reduces development effort by constructing a dozen significant distributed storage systems, spanning a large portion of the design space, over the prototype. We find that each system required only a couple of weeks of implementation effort and a few dozen lines of policy code.
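The routing/blocking separation described above can be sketched in miniature: one function decides where data should flow, another decides whether it is yet safe to read. The names, record shapes, and freshness rule below are illustrative assumptions, not the real PADS API.

```python
# Hypothetical sketch of PADS-style policy separation. Routing policy
# picks which replica to fetch from (performance/availability goals);
# blocking policy decides whether a read is safe (consistency goals).

def routing_policy(replicas, local):
    """Route reads to the replica with the highest version; fall back
    to the local copy if no remote replica is known."""
    return max(replicas, key=lambda r: r["version"], default=local)

def blocking_policy(obj, min_version):
    """Block the read (return False) until the object is fresh enough."""
    return obj["version"] >= min_version

replicas = [
    {"node": "A", "version": 3, "value": "v3"},
    {"node": "B", "version": 5, "value": "v5"},
]
local = {"node": "local", "version": 1, "value": "v1"}

chosen = routing_policy(replicas, local)    # routing: fetch from node B
if blocking_policy(chosen, min_version=4):  # blocking: safe to read?
    print(chosen["value"])                  # → v5
```

The point of the separation is that either policy can change independently: a different routing rule (say, nearest replica) or a different blocking rule (say, quorum acknowledgment) plugs in without touching the other.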
19

A heuristic information retrieval study : an investigation of methods for enhanced searching of distributed data objects exploiting bidirectional relevance feedback

Petratos, Panagiotis January 2004 (has links)
The primary aim of this research is to investigate methods of improving the effectiveness of current information retrieval systems. This aim can be achieved by accomplishing several supporting objectives. A foundational objective is to introduce a novel bidirectional, symmetrical fuzzy logic theory that may prove valuable to information retrieval, including internet searches of distributed data objects. A further objective is to design, implement, and apply the novel theory in an experimental information retrieval system called ANACALYPSE, which automatically computes the relevance of a large number of unseen documents from expert relevance feedback on a small number of documents read. A further objective is to define the methodology used in this work as an experimental information retrieval framework consisting of multiple tables, including various formulae that allow a plethora of syntheses of similarity functions, term weights, relative term frequencies, document weights, bidirectional relevance feedback, and history-adjusted term weights. The evaluation of bidirectional relevance feedback reveals a better correspondence between the system's ranking of documents and users' preferences than feedback-free ranking. The assessment of similarity functions reveals that the Cosine and Jaccard functions perform significantly better than the DotProduct and Overlap functions. The evaluation of history tracking of the documents visited from a root page reveals better system ranking of documents than tracking-free retrieval. The assessment of stemming reveals that retrieval performance remains unaffected, while stop-word removal does not appear to be beneficial and can sometimes be harmful.
The overall evaluation of the experimental information retrieval system, in comparison both to a leading-edge commercial information retrieval system and to the expert's gold standard of judged relevance according to established statistical correlation methods, reveals enhanced retrieval effectiveness.
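The Cosine and Jaccard similarity functions compared in the abstract can be written down directly for weighted term vectors; the Jaccard form below is the Tanimoto extension commonly used for weighted vectors, and the example vectors are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def jaccard(u, v):
    """Weighted Jaccard (Tanimoto) similarity between term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    return dot / denom if denom else 0.0

doc = [1, 2, 0, 1]    # term weights of a document
query = [1, 1, 0, 0]  # term weights of a query
print(round(cosine(doc, query), 3), round(jaccard(doc, query), 3))  # → 0.866 0.6
```

Both reward shared high-weight terms but normalize differently, which is why they can rank the same documents differently, as the comparison against DotProduct and Overlap in the study explores.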
