1 |
Interactive Exploration of Text Databases. Huang, Ziqi. January 2015 (has links)
No description available.
|
2 |
Provenance of visual interpretations in the exploration of data. Al-Naser, Aqeel. January 2015 (has links)
The thesis addresses the problem of capturing and tracking multi-user interpretations of 3D spatial datasets. These interpretations are carried out after the end of the visualization pipeline to identify and extract features of interest, and are subject to human intuition and knowledge. Users may also assess regions of these interpretations. Consequently, the thesis proposes a provenance-enabled interpretation pipeline. It adopts and extends the W3C PROV data model, producing a provenance model for visual interpretations. This model was implemented for seismic imaging interpretation in a proof-of-concept prototype architecture and application. The accumulation of users' interpretations and annotations is captured by the provenance model in fine-grained form, and the captured provenance information can be used to filter data. The work of this thesis was evaluated in three parts. First, a usability evaluation was conducted with postgraduate students in geoscience to illustrate the system's ability to let users amend others' interpretations and trace the history of amendments. Second, a conceptual evaluation was carried out by interviewing domain experts, who confirmed the importance of this research to the industry, shared potential ways of incorporating this work into the seismic interpretation workflow, and highlighted limitations and concerns. Third, a performance evaluation was conducted to illustrate the behaviour of the architecture on commodity machines as well as on a multi-node parallel database, showing that fine-grained provenance can be implemented simply while retaining acceptable performance in realistic visualization tasks. The measurements suggested that the current implementation achieves acceptable performance compared with conventional methods.
The proposed provenance model in an interpretation pipeline is believed to be a promising shift in methods of data management and storage, one that can record and preserve users' interpretations resulting from visualization. The approach and software developed in this thesis represent a step in this direction.
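As an illustrative sketch only (the class names, fields, and revision-chain walk below are assumptions, not the thesis's actual schema), an interpretation record in the spirit of the W3C PROV data model, where an amendment is linked to the interpretation it revises and to its author, might look like:

```python
from dataclasses import dataclass, field

@dataclass
class Interpretation:
    id: str
    region: tuple            # e.g. bounding box of the picked seismic feature
    agent: str               # PROV Agent: the interpreting user
    derived_from: str = None # PROV-style wasRevisionOf: previous interpretation id

@dataclass
class Provenance:
    records: dict = field(default_factory=dict)

    def record(self, interp):
        self.records[interp.id] = interp

    def history(self, interp_id):
        """Walk the revision links back to the original interpretation."""
        chain = []
        while interp_id is not None:
            interp = self.records[interp_id]
            chain.append(interp)
            interp_id = interp.derived_from
        return chain

prov = Provenance()
prov.record(Interpretation("i1", (10, 20, 30, 40), agent="alice"))
prov.record(Interpretation("i2", (12, 20, 30, 40), agent="bob", derived_from="i1"))
print([(i.id, i.agent) for i in prov.history("i2")])  # [('i2', 'bob'), ('i1', 'alice')]
```

Tracing `history("i2")` recovers both the amendment and the interpretation it amended, which is the kind of fine-grained record the evaluation exercises.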
|
3 |
A Stand-Alone Methodology for Data Exploration in Support of Data Mining and Analytics. Gage, Michael. 01 June 2013 (has links) (PDF)
With the emergence of Big Data, that is, data high in volume, variety, and velocity, new analysis techniques need to be developed to use the collected data effectively. Knowledge discovery from databases is a broader methodology encompassing a process for extracting knowledge from that data. Analytics pairs the knowledge with decision making to improve overall outcomes. Organizations have conclusive evidence that analytics provide competitive advantages and improve overall performance. This paper proposes a stand-alone methodology for data exploration, one part of the data mining process used in knowledge discovery from databases and analytics. The goal of the methodology is to reduce the time needed to gain meaningful information about a previously unanalyzed data set using tabular summaries and visualizations. The reduced time will enable faster implementation of analytics in an organization. Two case studies using a prototype implementation are presented, showing the benefits of the methodology.
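The kind of first-look tabular summary the methodology relies on can be sketched as follows (the column names and the statistics chosen are illustrative assumptions, not the prototype's actual output):

```python
import statistics

def summarize(rows):
    """Per-column summary for a first look at an unanalyzed data set:
    distinct-value count for every column, plus mean/min/max when numeric."""
    summary = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        entry = {"distinct": len(set(values))}
        if all(isinstance(v, (int, float)) for v in values):
            entry.update(mean=statistics.mean(values),
                         min=min(values), max=max(values))
        summary[col] = entry
    return summary

rows = [{"region": "west", "sales": 120.0},
        {"region": "east", "sales": 80.0},
        {"region": "west", "sales": 100.0}]
print(summarize(rows))
```

A summary like this gives an analyst a quick sense of cardinalities and ranges before any modeling effort is invested.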
|
4 |
A note on intelligent exploration of semantic data. Thakker, Dhaval; Schwabe, D.; Garcia, D.; Kozaki, K.; Brambilla, M.; Dimitrova, V. 15 July 2019 (has links)
Welcome to this special issue of the Semantic Web journal (SWJ). The special issue compiles three technical contributions that significantly advance the state of the art in the exploration of semantic data using semantic web techniques and technologies.
|
5 |
Information Extraction from data. Sottovia, Paolo. 22 October 2019 (has links)
Data analysis is the process of inspecting, cleaning, extracting, and modeling data with the intention of deriving useful information to support users in their decisions. With the advent of Big Data, data analysis has become more complicated due to the volume and variety of the data. The process begins with the acquisition of the data and the selection of the data that is useful for the desired analysis. With such amounts of data, even expert users are unable to inspect the data and judge whether a dataset is suitable for their purposes. In this dissertation, we focus on five problems in the broad data analysis process that help users find insights in data about which they have little prior knowledge. First, we analyze the data description problem, where the user is looking for a description of the input dataset. We introduce data descriptions: compact, readable, and insightful formulas of boolean predicates that represent a set of data records. Finding the best description for a dataset is computationally expensive and task-specific; we therefore introduce a set of metrics and heuristics for generating meaningful descriptions at interactive speed. Secondly, we look at the problem of order dependency discovery, which uncovers another kind of metadata that may help the user understand the characteristics of a dataset. Our approach leverages the observation that the discovery of order dependencies can be guided by the discovery of a more specific form of dependencies called order compatibility dependencies. Thirdly, textual data encodes much hidden information, and to allow this data to reach its full potential there has been increasing interest in extracting structural information from it. In this regard, we propose a novel approach for extracting events based on temporal co-reference among entities.
We consider an event to be a set of entities that collectively experience relationships among them during a specific period of time. We developed a distributed strategy able to scale to the largest online encyclopedia available, Wikipedia. We then deal with the evolving nature of data by focusing on the problem of finding synonymous attributes in evolving Wikipedia infoboxes. Over time, several attributes may be used to indicate the same characteristic of an entity, which raises several issues when analyzing content from different time periods. To solve this, we propose a clustering strategy that combines two contrasting distance metrics. We developed an approximate solution that we assess over 13 years of Wikipedia history, demonstrating its flexibility and accuracy. Finally, we tackle the problem of identifying movements of attributes in evolving datasets: in an evolving environment, entities not only change their characteristics but sometimes exchange them over time. We propose a strategy able to discover such cases and test it on real datasets. We formally present the five problems, validate them in terms of both theoretical results and experimental evaluation, and demonstrate that the proposed approaches scale efficiently to large amounts of data.
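A minimal sketch of the data description idea, a conjunction of boolean predicates that holds for every record in a set, could look like the following (the data and the naive shared-predicate rule are illustrative; the thesis itself uses dedicated metrics and heuristics to search for the best formula):

```python
def describe(records):
    """Return the conjunction of attribute = value predicates shared by
    every record, as a compact, readable description of the set."""
    shared = dict(records[0])
    for r in records[1:]:
        # keep only the predicates that still hold for this record
        shared = {k: v for k, v in shared.items() if r.get(k) == v}
    return " AND ".join(f"{k} = {v!r}" for k, v in sorted(shared.items()))

rows = [
    {"type": "sensor", "status": "ok", "site": "A"},
    {"type": "sensor", "status": "ok", "site": "B"},
]
print(describe(rows))  # site differs across rows, so it drops out of the formula
```

Here the formula `status = 'ok' AND type = 'sensor'` summarizes the whole set; a real description also needs to balance coverage, length, and readability, which is where the thesis's metrics come in.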
|
6 |
Interpreting clusters generated by hierarchical clustering algorithms. Metz, Jean. 04 August 2006 (has links)
The Data Mining (DM) process consists of the automated extraction of patterns representing knowledge implicitly stored in large databases. In general, DM tasks can be classified into two categories: predictive and descriptive. Tasks in the first category, such as classification and prediction, perform inference on the data in order to make predictions, while tasks in the second category, such as clustering, characterize the general properties of the data. Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects whose class label is not known in advance. Clusters of objects are formed so that objects in the same cluster are highly similar to one another but very dissimilar to objects in other clusters. Clustering can also organize clusters into a hierarchy that groups similar events together; this taxonomy can simplify the interpretation of clusters. In this work, we propose and develop an unsupervised learning module that comprises hierarchical clustering algorithms and several cluster analysis tools, aiming to help the domain specialist interpret clustering results. Since hierarchical clustering groups objects according to similarity measures and organizes the clusters into a hierarchy, the user/specialist can analyze and explore this hierarchy at different levels in order to discover concepts described by its structure.
The proposed module is integrated into a larger system, under development at the Computational Intelligence Laboratory (LABIC), which covers all steps of the DM process, from data pre-processing to knowledge post-processing. To evaluate the module and its use for discovering concepts from the hierarchical structure of clusters, several experiments on natural datasets were carried out, as well as a case study using a real dataset. The results show the viability of the proposed methodology for cluster interpretation, although the complexity of the process depends on the characteristics of the dataset.
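The explore-at-different-levels idea can be sketched with off-the-shelf hierarchical clustering: build the agglomeration hierarchy once, then cut it at different levels to obtain coarser or finer groupings (toy data; the module described above adds analysis tools well beyond this):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two well-separated groups of five points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])

# Build the agglomeration hierarchy once...
Z = linkage(X, method="average")

# ...then explore it at different levels by cutting into k clusters.
for k in (2, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, labels)
```

Cutting at k = 2 recovers the two natural groups, while larger k exposes their internal structure, which is exactly the kind of multi-level exploration a specialist performs when hunting for concepts in the hierarchy.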
|
7 |
Exploratory multidimensional data visualization on the web. Pagliosa, Lucas de Carvalho. 13 November 2015 (has links)
With the growing volume and variety of data, the need to analyze and understand what the data represent and how they are related has become crucial. Visualization techniques based on multidimensional projections have gained space and interest as one of the possible tools to aid this problem, providing a simple and quick way to identify patterns, recognize trends, and extract features that were not obvious in the original set. However, projecting the dataset into a lower-dimensional space may not be sufficient, in some cases, to answer or clarify certain questions asked by the user, making post-projection analysis crucial for the correct interpretation of the visualization. Thus, interactivity in the visualization, tailored to the user's needs, is an essential factor for analysis. In this context, the main objective of this master's project is to create attribute-based visual metaphors, through statistical measures and artifacts for detecting noise and similar groups, to assist the exploration and analysis of projected data. In addition, we propose to make available, in Web browsers, the multidimensional data visualization techniques developed by the Group of Visual and Geometric Processing at ICMC-USP. The development of the project as a Web platform was motivated by the difficulty of installing and running certain visualization projects, mainly due to different versions of IDEs, compilers, and operating systems. Making the project available online for execution also aims to ease access to, and dissemination of, the proposed techniques for the general public.
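As a rough stand-in for the projection techniques the project exposes in the browser, a dataset can be reduced to two dimensions for a scatter plot with plain PCA (illustrative only; the group's multidimensional projection techniques are more elaborate than this):

```python
import numpy as np

def project_2d(X):
    """Project high-dimensional rows of X onto their two main axes of
    variance (plain PCA via SVD), yielding coordinates for a 2-D scatter
    plot; a simple stand-in for richer multidimensional projections."""
    Xc = X - X.mean(axis=0)                       # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                          # n x 2 coordinates

X = np.random.default_rng(1).normal(size=(100, 8))
P = project_2d(X)
print(P.shape)  # (100, 2)
```

The projected coordinates are what the visual metaphors then annotate, e.g. with per-attribute statistics or noise markers.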
|
8 |
Distributed indexing and scalable query processing for interactive big data explorations. Guzun, Gheorghi. 01 August 2016 (links)
The past few years have brought a major surge in the volume of collected data, and more and more enterprises and research institutions find tremendous value in data analysis and exploration. Big Data analytics is used to improve customer experience, perform complex weather data integration and model prediction, and deliver personalized medicine, among many other services.
Advances in technology, along with high interest in big data, can only increase the demand for data collection and mining in the years to come.
As a result, and in order to keep up with the data volumes, data processing has become increasingly distributed. However, most of the distributed processing for large data is done by batch processing and interactive exploration is hardly an option. To efficiently support queries over large amounts of data, appropriate indexing mechanisms must be in place.
This dissertation proposes an indexing and query processing framework that can run on top of a distributed computing engine to support fast, interactive data exploration in data warehouses. Our data processing layer is built around bit-vector based indices. This type of indexing features fast bit-wise operations and scales well to high-dimensional data. Additionally, compression can be applied to reduce the index size, and thus use less memory and network communication.
Our work can be divided into two areas: index compression and query processing.
Two compression schemes are proposed for sparse and dense bit-vectors. The design of these encoding methods is hardware-driven, and the query processing is optimized for the available computing hardware. Query algorithms are proposed for selection, aggregation, and other specialized queries. The query processing is supported on single machines, as well as computer clusters.
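The bit-vector selection idea can be sketched as follows (Python integers stand in for the hardware-oriented, compressed bit-vectors of the actual framework; the class and method names are invented for illustration):

```python
from collections import defaultdict

class BitmapIndex:
    """Minimal bitmap (bit-vector) index sketch: one bit-vector per
    attribute value, held as a Python integer so selection queries reduce
    to bit-wise AND. Compression and distribution are omitted."""
    def __init__(self):
        self.vectors = defaultdict(int)  # (attribute, value) -> bit-vector
        self.n = 0                       # number of rows indexed so far

    def insert(self, row):
        """row: dict of attribute -> value; sets this row's bit in each vector."""
        for attr, val in row.items():
            self.vectors[(attr, val)] |= 1 << self.n
        self.n += 1

    def select(self, **preds):
        """Row ids matching every attribute=value predicate (bit-wise AND)."""
        bits = (1 << self.n) - 1
        for attr, val in preds.items():
            bits &= self.vectors[(attr, val)]
        return [i for i in range(self.n) if bits >> i & 1]

idx = BitmapIndex()
idx.insert({"city": "Iowa City", "year": 2016})
idx.insert({"city": "Ames", "year": 2016})
idx.insert({"city": "Iowa City", "year": 2015})
print(idx.select(city="Iowa City", year=2016))  # [0]
```

Because each predicate is a single machine-word AND per 64 rows, selections stay cheap even over many dimensions, which is what makes the approach attractive for interactive exploration.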
|
9 |
Automatic assessment of OLAP exploration quality. Djedaini, Mahfoud. 06 December 2017 (has links)
Before the arrival of Big Data, the amount of data stored in databases was relatively small and therefore fairly simple to analyze. In that context, the main challenge in the field was to optimize data storage and, above all, the response time of Database Management Systems (DBMSs). Numerous benchmarks, notably those of the TPC consortium, were set up to allow different systems to be evaluated under similar conditions. The arrival of Big Data, however, has completely changed the situation, with ever more data generated every day. Alongside the increase in available memory, new storage methods have emerged based on distributed systems, such as the HDFS file system used notably in Hadoop, to cover the storage and processing needs of Big Data. The growth in data volume therefore makes analysis much harder. In this context, the point is no longer so much to measure data retrieval speed as to produce coherent sequences of queries that quickly identify areas of interest in the data, so that these areas can be analyzed in greater depth and information supporting informed decision making can be extracted. Traditional data analysis is thus becoming more and more tedious, and many approaches have been designed and developed to support analysts in their exploration tasks. However, there is no automatic, unified method for evaluating how well these different approaches support exploration: current benchmarks focus mainly on evaluating systems in terms of temporal, energy, or financial performance. In this thesis, we propose a model, based on supervised machine learning methods, to evaluate the quality of an OLAP exploration.
We use this model to build an evaluation benchmark for exploration support systems, whose general principle is to let these systems generate explorations and then to evaluate them through the explorations they produce.
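The benchmark's principle, scoring an exploration from features of its query sequence learned from labelled examples, can be sketched dependency-free (the features, the toy data, and the nearest-centroid rule below are illustrative assumptions, not the thesis's actual models):

```python
def features(exploration):
    """exploration: list of queries, each a frozenset of (dimension, member)
    pairs. Features: length, deepest query, and repeated (revisited) queries."""
    revisits = len(exploration) - len(set(exploration))
    return (len(exploration), max(len(q) for q in exploration), revisits)

def centroid(vectors):
    """Component-wise mean of a list of feature tuples."""
    return tuple(sum(c) / len(vectors) for c in zip(*vectors))

def quality(exploration, good, poor):
    """1 if the exploration's features sit closer to the centroid of the
    good examples than to that of the poor ones, else 0."""
    f = features(exploration)
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(f, c))
    return 1 if dist(centroid(good)) <= dist(centroid(poor)) else 0

# Labelled examples: a focused drill-down vs. a repetitive, shallow session.
good = [features([frozenset({("year", 2017)}),
                  frozenset({("year", 2017), ("city", "Blois")})])]
poor = [features([frozenset({("year", 2017)})] * 3)]
print(quality([frozenset({("year", 2017), ("city", "Tours")})], good, poor))  # 1
```

Letting competing support systems generate explorations and scoring each one this way is the benchmark's general loop, with a proper learned model in place of the toy rule.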
|