• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 13
  • 3
  • 1
  • 1
  • 1
  • Tagged with
  • 19
  • 19
  • 9
  • 7
  • 6
  • 5
  • 5
  • 4
  • 4
  • 4
  • 4
  • 4
  • 4
  • 4
  • 3
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

CSVValidation: uma ferramenta para validação de arquivos CSV a partir de metadados / CSV Validation: uma ferramenta para validação de arquivos CSV a partir de metadados

OLIVEIRA, Hugo Santos 14 August 2015 (has links)
Submitted by Irene Nascimento (irene.kessia@ufpe.br) on 2017-03-14T18:10:49Z No. of bitstreams: 2 license_rdf: 1232 bytes, checksum: 66e71c371cc565284e70f40736c94386 (MD5) Dissertação Hugo Santos de Oliveira - Versão Depósito Bib Central.pdf: 2529045 bytes, checksum: a83fb438eaa8daaa0b4dcba01cb0b729 (MD5) / Made available in DSpace on 2017-03-14T18:10:49Z (GMT). No. of bitstreams: 2 license_rdf: 1232 bytes, checksum: 66e71c371cc565284e70f40736c94386 (MD5) Dissertação Hugo Santos de Oliveira - Versão Depósito Bib Central.pdf: 2529045 bytes, checksum: a83fb438eaa8daaa0b4dcba01cb0b729 (MD5) Previous issue date: 2015-08-14 / Modelos de dados tabulares têm sido amplamente utilizados para a publicação de dados na Web, devido a sua simplicidade de representação e facilidade de manipulação. Entretanto, nem sempre os dados são dispostos em arquivos tabulares de maneira adequada, o que pode causar dificuldades no momento do processamento dos dados. Dessa forma, o consórcio W3C tem trabalhado em uma proposta de especificação padrão para representação de dados em formatos tabulares. Neste contexto, este trabalho tem como objetivo geral propor uma solução para o problema de validação de arquivos de Dados Tabulares. Estes arquivos, são representados no formato CSV e descritos por metadados, os quais são representados em JSON e definidos de acordo com a especificação proposta pelo W3C. A principal contribuição deste trabalho foi a definição do processo de validação de arquivos de dados tabulares e dos algoritmos necessários para a execução desse processo, além da implementação de um protótipo que tem por objetivo realizar a validação dos dados tabulares, conforme especificado pelo W3C. Outra importante contribuição foi a realização de experimentos com fontes de dados disponíveis na Web, com o objetivo de avaliar a abordagem proposta neste trabalho. / Tabular data models have been used a lot for publishing data on the Web because of its simplicity of representation and easy manipulation. However, in some cases the data are not disposed in tabular files appropriately, which can cause data processing problems. Thus, the W3C proposed a standard specification for representing data in tabular format. In this context this work has as main objective to propose a solution to the problem of validating tabular data files, represented in CSV, files and described by metadata represented as JSON files and described, according to the specification proposed by the W3C. The main contribution of this work is the definition of a tabular data file validation process and algorithms necessary for the implementation of this process as well as the implementation of a prototype that aimed to validate tabular data as specified by the W3C. Other important contribution is the execution of experiments with data sources available on the Web with the objective to evaluate the approach proposed in this work.
2

A Framework for Automated Discovery and Analysis of Suspicious Trade Records

Datta, Debanjan 27 May 2022 (has links)
Illegal logging and timber trade presents a persistent threat to global biodiversity and national security due to its ties with illicit financial flows, and causes revenue loss. The scale of global commerce in timber and associated products, combined with the complexity and geographical spread of the supply chain entities present a non-trivial challenge in detecting such transactions. International shipment records, specifically those containing bill of lading is a key source of data which can be used to detect, investigate and act upon such transactions. The comprehensive problem can be described as building a framework that can perform automated discovery and facilitate actionability on detected transactions. A data driven machine learning based approach is necessitated due to the volume, velocity and complexity of international shipping data. Such an automated framework can immensely benefit our targeted end-users---specifically the enforcement agencies. This overall problem comprises of multiple connected sub-problems with associated research questions. We incorporate crucial domain knowledge---in terms of data as well as modeling---through employing expertise of collaborating domain specialists from ecological conservationist agencies. The collaborators provide formal and informal inputs spanning across the stages---from requirement specification to the design. Following the paradigm of similar problems such as fraud detection explored in prior literature, we formulate the core problem of discovering suspicious transactions as an anomaly detection task. The first sub-problem is to build a system that can be used find suspicious transactions in shipment data pertaining to imports and exports of multiple countries with different country specific schema. We present a novel anomaly detection approach---for multivariate categorical data, following constraints of data characteristics, combined with a data pipeline that incorporates domain knowledge. The focus of the second problem is U.S. specific imports, where data characteristics differ from the prior sub-problem---with heterogeneous attributes present. This problem is important since U.S. is a top consumer and there is scope of actionable enforcement. For this we present a contrastive learning based anomaly detection model for heterogeneous tabular data, with performance and scalability characteristics applicable to real world trade data. While the first two problems address the task of detecting suspicious trades through anomaly detection, a practical challenge with anomaly detection based systems is that of relevancy or scenario specific precision. The third sub-problem addresses this through a human-in-the-loop approach augmented by visual analytics, to re-rank anomalies in terms of relevance---providing explanations for cause of anomalies and soliciting feedback. The last sub-problem pertains to explainability and actionability towards suspicious records, through algorithmic recourse. Algorithmic recourse aims to provides meaningful alternatives towards flagged anomalous records, such that those counterfactual examples are not judged anomalous by the underlying anomaly detection system. This can help enforcement agencies advise verified trading entities in modifying their trading patterns to avoid false detection, thus streamlining the process. We present a novel formulation and metrics for this unexplored problem of algorithmic recourse in anomaly detection. and a deep learning based approach towards explaining anomalies and generating counterfactuals. Thus the overall research contributions presented in this dissertation addresses the requirements of the framework, and has general applicability in similar scenarios beyond the scope of this framework. / Doctor of Philosophy / Illegal timber trade presents multiple global challenges to ecological biodiversity, vulnerable ecosystems, national security and revenue collection. Enforcement agencies---the target end-users of this framework---face a myriad of challenges in discovering and acting upon shipments with illegal timber that violate national and transnational laws due to volume and complexity of shipment data, coupled with logistical hurdles. This necessitates an automated framework based upon shipment data that can address this task---through solving problems of discovery, analysis and actionability. The overall problem is decomposed into self contained sub-problems that address the associated specific research questions. These comprise of anomaly detection in multiple types of high dimensional tabular data, improving precision of anomaly detection through expert feedback and algorithmic recourse for anomaly detection. We present data mining and machine learning solutions to each of the sub-problems that overcome limitations and inapplicability of prior approaches. Further, we address two broader research questions. First is incorporation domain knowledge into the framework, which we accomplish through collaboration with domain experts from environmental conservation organizations. Secondly, we address the issue of explainability in anomaly detection for tabular data in multiple contexts. Such real world data presents with challenges of complexity and scalability, especially given the tabular format of the data that presents it's own set of challenges in terms of machine learning. The solutions presented to these machine learning problems associated with each of components of the framework provide an end-to-end solution to it's requirements. More importantly, the models and approaches presented in this dissertation have applicability beyond the application scenario with similar data and application specific challenges.
3

Combining Cell Painting, Gene Expression and Structure-Activity Data for Mechanism of Action Prediction

Everett Palm, Erik January 2023 (has links)
The rapid progress in high-throughput omics methods and high-resolution morphological profiling, coupled with the significant advances in machine learning (ML) and deep learning (DL), has opened new avenues for tackling the notoriously difficult problem of predicting the Mechanism of Action (MoA) for a drug of clinical interest. Understanding a drug's MoA can enrich our knowledge of its biological activity, shed light on potential side effects, and serve as a predictor of clinical success.  This project aimed to examine whether incorporating gene expression data from LINCS L1000 public repository into a joint model previously developed by Tian et al. (2022), which combined chemical structure and morphological profiles derived from Cell Painting, would have a synergistic effect on the model's ability to classify chemical compounds into ten well-represented MoA classes. To do this, I explored the gene expression dataset to assess its quality, volume, and limitations. I applied a variety of ML and DL methods to identify the optimal single model for MoA classification using gene expression data, with a particular emphasis on transforming tabular data into image data to harness the power of convolutional neural networks. To capitalize on the complementary information stored in different modalities, I tested end-to-end integration and soft-voting on sets of joint models across five stratified data splits.  The gene expression dataset was relatively low in quality, with many uncontrollable factors that complicated MoA prediction. The highest-performing gene expression model was a one-dimensional convolutional neural network, with an average macro F1 score of 0.40877 and a standard deviation of 0.034. Approaches converting tabular data into image data did not significantly outperform other methods. Combining optimized single models resulted in a performance decline compared to the best single model in the combination. To take full advantage of algorithmic developments in drug development and high-throughput multi-omics data, my project underscores the need for standardizing data generation and optimizing data fusion methods.
4

Tracking and visualizing dimension space coverage for exploratory data analysis

Sarvghad Batn Moghaddam, Ali 15 August 2016 (has links)
In this dissertation, I investigate interactive visual history for collaborative exploratory data analysis (EDA). In particular, I examine use of analysis history for improving the awareness of the dimension space coverage 1 2 3 to better support data exploration. Commonly, interactive history tools facilitate data analysis by capturing and representing information about the analysis process. These tools can support a wide range of use-cases from simple undo and redo to complete reconstructions of the visualization pipeline. In the con- text of exploratory collaborative Visual Analytics (VA), history tools are commonly used for reviewing and reusing past states/actions and do not efficiently support other use-cases such as understanding the past analysis from the angle of dimension space coverage. How- ever, such knowledge is essential for exploratory analysis which requires constant formulation of new questions about data. To carry out exploration, an analyst needs to understand “what has been done” versus “what is remaining” to explore. Lack of such insight can result in premature fixation on certain questions, compromising the coverage of the data set and breadth of exploration [80]. In addition, exploration of large data sets sometimes requires collaboration between a group of analysts who might be in different time/location settings. In this case, in addition to personal analysis history, each team member needs to understand what aspects of the problem his or her collaborators have explored. Such scenarios are common in domains such as science and business [34] where analysts explore large multi-dimensional data sets in search of relationships, patterns and trends. Currently, analysts typically rely on memory and/or externalization to keep track of investigated versus uninvestigated aspects of the problem. Although analysis history 4 mechanisms have the potential to assist analyst(s) with this problem, most common visual representations of history are geared towards reviewing & reusing the visualization pipeline or visualization states. I started this research with an observational user study to gain a better understanding of analysts’ history needs in the context of collaborative exploratory VA. This study showed that understanding the coverage of dimension space by using linear history 5 was cumbersome and inefficient. To address this problem, I investigated how alternate visual representations of analysis history could support this use-case. First, I designed and evaluated Footprint-I, a visual history tool that represented analysis from the angle of dimension space coverage (i.e. history of investigation of data dimensions; specifically, this approach revealed which dimensions had been previously investigated and in which combinations). I performed a user study that evaluated participants’ ability to recall the scope of past analysis using my proposed design versus a linear representation of analysis history. I measured participants’ task duration and accuracy in answering questions about a past exploratory VA session. Findings of this study showed that participants with access to dimension space coverage information were both faster and more accurate in understanding dimension space coverage information. Next, I studied the effects of providing coverage information on collaboration. To investigate this question, I designed and implemented Footprint-II, the next version of Footprint-I. In this version, I redesigned the representation of dimension space coverage to be more usable and scalable. I conducted a user study that measured the effects of presenting history from the angle of dimension space coverage on task coordination (tacit breakdown of a common task between collaborators). I asked each participant to assume the role of a business data analyst and continue a exploratory analysis work which was started by a collaborator. The results of this study showed that providing dimension space coverage information helped participants to focus on dimensions that were not investigated in the initial analysis, hence improving tacit task coordination. Finally, I investigated the effects of providing live dimension space coverage information on VA outcomes. To this end, I designed and implemented a standalone prototype VA tool with a visual history module. I used scented widgets [76] to incorporate real-time dimension space coverage information into the GUI widgets. Results of a user study showed that providing live dimension space coverage information increased the number of top-level findings. Moreover, it expanded the breadth of exploration (without compromising the depth) and helped analysts to formulate and ask more questions about their data. / Graduate / 0984 / ali.sarvghad@gmail.com
5

Synthetic Data Generation Using Transformer Networks / Textgenerering med transformatornätverk : Skapa text från ett syntetiskt dataset i tabellform

Campos, Pedro January 2021 (has links)
One of the areas propelled by the advancements in Deep Learning is Natural Language Processing. These continuous advancements allowed the emergence of new language models such as the Transformer [1], a deep learning model based on attention mechanisms that takes a sequence of symbols as input and outputs another sequence, attending to the input during its generation. This model is often used in translation, text summarization and text generation, outperforming previous used methods such as Recurrent Neural Networks and Generative Adversarial Networks. The problem statement provided by the company Syndata for this thesis is related to this new architecture: Given a tabular dataset, create a model based on the Transformer that can generate text fields considering the underlying context from the rest of the accompanying fields. In an attempt to accomplish this, Syndata has previously implemented a recurrent model, nevertheless, they’re confident that a Transformer could perform better at this task. Their goal is to improve the solution provided with the implementation of a model based on the Transformer architecture. The implemented model should then be compared to the previous recurrent model and it’s expected to outperform it. Since there aren’t many published research articles where Transformers are used for synthetic tabular data generation, this problem is fairly original. Four different models were implemented: a model that is based on the GPT architecture [2], an LSTM [3], a Bidirectional-LSTM with an Encoder- Decoder structure and the Transformer. The first two models are autoregressive models and the later two are sequence-to-sequence models which have an Encoder-Decoder architecture. We evaluated each one of them based on 3 different aspects: on the distribution similarity between the real and generated datasets, on how well each model was able to condition name generation considering the information contained in the accompanying fields and on how much real data the model compromised after generation, which addresses a privacy related issue. We found that the Encoder-Decoder models such as the Transformer and the Bidirectional LSTM seem to perform better for this type of synthetic data generation where the output (or the field to be predicted) has to be conditioned by the rest of the accompanying fields. They’ve outperformed the GPT and the RNNmodels in the aspects that matter most to Syndata: keeping customer data private and being able to correctly condition the output with the information contained in the accompanying fields. / Deep learning har lett till stora framsteg inom textbaserad språkteknologi (Natural Language Processing) där en typ av maskininlärningsarkitektur kallad Transformers[1] har haft ett extra stort intryck. Dessa modeller använder sig av en så kallad attention mekanism, tränas som språkmodeller (Language Models), där de tar in en sekvens av symboler och matar ut en annan. Varje steg i den utgående sekvensen beror olika mycket på steg i den ingående sekvensen givet vad denna attention mekanism lärt sig vara relevant. Dessa modeller används för översättning, sammanfattning och textgenerering och har överträffat andra arkitekturer som Recurrent Neural Networks, RNNs samt Generative Adversarial Networks. Problemformuleringen för denna avhandling kom från företaget Syndata och är relaterat till denna arkitektur: givet tabellbaserad data, implementera en Transformer som genererar textfält beroende av informationen i de medföljande tabellfälten. Syndata har tidigare implementerat ett RNN för detta ändamål men är övertygande om att en Transformer kan prestera bättre. Målet för denna avhandling är att implementera en Transformer och jämföra med den tidigare implementationen med hypotesen att den kommer att prestera bättre. Det underliggande målet är att givet data i tabellform kunna generera ny syntetisk data, användbar för industrin, där problem kring integritet och privat information kan minimeras. Fyra modeller implementerades: en Transformermodel baserad på GPT- arkitekturen[ 2], en LSTM[3]-modell, en encoder-decoder Transformer och en BiLSTM-modell. De två förstnämnda modellerna är auto-regressiva och de senare två är sequence-to-sequence som har en encoder-decoder arkitektur. Dessa modeller utvärderades och jämfördes givet tre kriterier: hur lik sannolikhetsfördelningen mellan den verkliga och den genererade datamängden, hur mycket varje modell baserade generationen på de medföljande fälten och hur mycket verklig data som komprometteras genom synteseringen. Slutsatsen var att Encoder-Decoder varianterna, Transformern och BiLSTM, var bättre för att syntesera data i tabellformat, där utdatan (eller fälten som ska genereras) ska uppvisa ett starkt beroende av resten av de medföljande fälten. De överträffade GPT- och RNN- modellerna i de aspekter som betyder mest för Syndata att hålla kunddata privat och att den syntetiserade datan ska vara beroende av informationen i de medföljande fälten.
6

Tabular Information Extraction from Datasheets with Deep Learning for Semantic Modeling

Akkaya, Yakup 22 March 2022 (has links)
The growing popularity of artificial intelligence and machine learning has led to the adop- tion of the automation vision in the industry by many other institutions and organizations. Many corporations have made it their primary objective to make the delivery of goods and services and manufacturing in a more efficient way with minimal human intervention. Au- tomated document processing and analysis is also a critical component of this cycle for many organizations that contribute to the supply chain. The massive volume and diver- sity of data created in this rapidly evolving environment make this a highly desired step. Despite this diversity, important information in the documents is provided in the tables. As a result, extracting tabular data is a crucial aspect of document processing. This thesis applies deep learning methodologies to detect table structure elements for the extraction of data and preparation for semantic modelling. In order to find optimal structure definition, we analyzed the performance of deep learning models in different formats such as row/column and cell. The combined row and column detection models perform poorly compared to other models’ detection performance due to the highly over- lapping nature of rows and columns. Separate row and column detection models seem to achieve the best average F1-score with 78.5% and 79.1%, respectively. However, de- termining cell elements from the row and column detections for semantic modelling is a complicated task due to spanning rows and columns. Considering these facts, a new method is proposed to set the ground-truth information called a content-focused annota- tion to define table elements better. Our content-focused method is competent in handling ambiguities caused by huge white spaces and lack of boundary lines in table structures; hence, it provides higher accuracy. Prior works have addressed the table analysis problem under table detection and table structure detection tasks. However, the impact of dataset structures on table structure detection has not been investigated. We provide a comparison of table structure detection performance with cropped and uncropped datasets. The cropped set consists of only table images that are cropped from documents assuming tables are detected perfectly. The uncropped set consists of regular document images. Experiments show that deep learning models can improve the detection performance by up to 9% in average precision and average recall on the cropped versions. Furthermore, the impact of cropped images is negligible under the Intersection over Union (IoU) values of 50%-70% when compared to the uncropped versions. However, beyond 70% IoU thresholds, cropped datasets provide significantly higher detection performance.
7

Uma nova metáfora visual escalável para dados tabulares e sua aplicação na análise de agrupamentos / A scalable visual metaphor for tabular data and its application on clustering analysis

Mosquera, Evinton Antonio Cordoba 19 September 2017 (has links)
A rápida evolução dos recursos computacionais vem permitindo que grandes conjuntos de dados sejam armazenados e recuperados. No entanto, a exploração, compreensão e extração de informação útil ainda são um desafio. Com relação às ferramentas computacionais que visam tratar desse problema, a Visualização de Informação possibilita a análise de conjuntos de dados por meio de representações gráficas e a Mineração de Dados fornece processos automáticos para a descoberta e interpretação de padrões. Apesar da recente popularidade dos métodos de visualização de informação, um problema recorrente é a baixa escalabilidade visual quando se está analisando grandes conjuntos de dados, resultando em perda de contexto e desordem visual. Com intuito de representar grandes conjuntos de dados reduzindo a perda de informação relevante, o processo de agregação visual de dados vem sendo empregado. A agregação diminui a quantidade de dados a serem representados, preservando a distribuição e as tendências do conjunto de dados original. Quanto à mineração de dados, visualização de informação vêm se tornando ferramental essencial na interpretação dos modelos computacionais e resultados gerados, em especial das técnicas não-supervisionados, como as de agrupamento. Isso porque nessas técnicas, a única forma do usuário interagir com o processo de mineração é por meio de parametrização, limitando a inserção de conhecimento de domínio no processo de análise de dados. Nesta dissertação, propomos e desenvolvemos uma metáfora visual baseada na TableLens que emprega abordagens baseadas no conceito de agregação para criar representações mais escaláveis para a interpretação de dados tabulares. Como aplicação, empregamos a metáfora desenvolvida na análise de resultados de técnicas de agrupamento. O ferramental resultante não somente suporta análise de grandes bases de dados com reduzida perda de contexto, mas também fornece subsídios para entender como os atributos dos dados contribuem para a formação de agrupamentos em termos da coesão e separação dos grupos formados. / The rapid evolution of computing resources has enabled large datasets to be stored and retrieved. However, exploring, understanding and extracting useful information is still a challenge. Among the computational tools to address this problem, information visualization techniques enable the data analysis employing the human visual ability by making a graphic representation of the data set, and data mining provides automatic processes for the discovery and interpretation of patterns. Despite the recent popularity of information visualization methods, a recurring problem is the low visual scalability when analyzing large data sets resulting in context loss and visual disorder. To represent large datasets reducing the loss of relevant information, the process of aggregation is being used. Aggregation decreases the amount of data to be represented, preserving the distribution and trends of the original dataset. Regarding data mining, information visualization has become an essential tool in the interpretation of computational models and generated results, especially of unsupervised techniques, such as clustering. This occurs because, in these techniques, the only way the user interacts with the mining process is through parameterization, limiting the insertion of domain knowledge in the process. In this thesis, we propose and develop the new visual metaphor based on the TableLens that employs approaches based on the concept of aggregation to create more scalable representations of tabular data. As application, we use the developed metaphor in the analysis of the results of clustering techniques. The resulting framework does not only support large database analysis but also provides insights into how data attributes contribute to clustering regarding cohesion and separation of the composed groups
8

Plateforme visuelle pour l'intégration de données faiblement structurées et incertaines / A visual platform to integrate poorly structured and unknown data

Da Silva Carvalho, Paulo 19 December 2017 (has links)
Nous entendons beaucoup parler de Big Data, Open Data, Social Data, Scientific Data, etc. L’importance qui est apportée aux données en général est très élevée. L’analyse de ces données est importante si l’objectif est de réussir à en extraire de la valeur pour pouvoir les utiliser. Les travaux présentés dans cette thèse concernent la compréhension, l’évaluation, la correction/modification, la gestion et finalement l’intégration de données, pour permettre leur exploitation. Notre recherche étudie exclusivement les données ouvertes (DOs - Open Data) et plus précisément celles structurées sous format tabulaire (CSV). Le terme Open Data est apparu pour la première fois en 1995. Il a été utilisé par le groupe GCDIS (Global Change Data and Information System) (États-Unis) pour encourager les entités, possédant les mêmes intérêts et préoccupations, à partager leurs données [Data et System, 1995]. Le mouvement des données ouvertes étant récent, il s’agit d’un champ qui est actuellement en grande croissance. Son importance est actuellement très forte. L’encouragement donné par les gouvernements et institutions publiques à ce que leurs données soient publiées a sans doute un rôle important à ce niveau. / We hear a lot about Big Data, Open Data, Social Data, Scientific Data, etc. The importance currently given to data is, in general, very high. We are living in the era of massive data. The analysis of these data is important if the objective is to successfully extract value from it so that they can be used. The work presented in this thesis project is related with the understanding, assessment, correction/modification, management and finally the integration of the data, in order to allow their respective exploitation and reuse. Our research is exclusively focused on Open Data and, more precisely, Open Data organized in tabular form (CSV - being one of the most widely used formats in the Open Data domain). The first time that the term Open Data appeared was in 1995 when the group GCDIS (Global Change Data and Information System) (from United States) used this expression to encourage entities, having the same interests and concerns, to share their data [Data et System, 1995]. However, the Open Data movement has only recently undergone a sharp increase. It has become a popular phenomenon all over the world. Being the Open Data movement recent, it is a field that is currently growing and its importance is very strong. The encouragement given by governments and public institutions to have their data published openly has an important role at this level.
9

Uma nova metáfora visual escalável para dados tabulares e sua aplicação na análise de agrupamentos / A scalable visual metaphor for tabular data and its application on clustering analysis

Evinton Antonio Cordoba Mosquera 19 September 2017 (has links)
A rápida evolução dos recursos computacionais vem permitindo que grandes conjuntos de dados sejam armazenados e recuperados. No entanto, a exploração, compreensão e extração de informação útil ainda são um desafio. Com relação às ferramentas computacionais que visam tratar desse problema, a Visualização de Informação possibilita a análise de conjuntos de dados por meio de representações gráficas e a Mineração de Dados fornece processos automáticos para a descoberta e interpretação de padrões. Apesar da recente popularidade dos métodos de visualização de informação, um problema recorrente é a baixa escalabilidade visual quando se está analisando grandes conjuntos de dados, resultando em perda de contexto e desordem visual. Com intuito de representar grandes conjuntos de dados reduzindo a perda de informação relevante, o processo de agregação visual de dados vem sendo empregado. A agregação diminui a quantidade de dados a serem representados, preservando a distribuição e as tendências do conjunto de dados original. Quanto à mineração de dados, visualização de informação vêm se tornando ferramental essencial na interpretação dos modelos computacionais e resultados gerados, em especial das técnicas não-supervisionados, como as de agrupamento. Isso porque nessas técnicas, a única forma do usuário interagir com o processo de mineração é por meio de parametrização, limitando a inserção de conhecimento de domínio no processo de análise de dados. Nesta dissertação, propomos e desenvolvemos uma metáfora visual baseada na TableLens que emprega abordagens baseadas no conceito de agregação para criar representações mais escaláveis para a interpretação de dados tabulares. Como aplicação, empregamos a metáfora desenvolvida na análise de resultados de técnicas de agrupamento. O ferramental resultante não somente suporta análise de grandes bases de dados com reduzida perda de contexto, mas também fornece subsídios para entender como os atributos dos dados contribuem para a formação de agrupamentos em termos da coesão e separação dos grupos formados. / The rapid evolution of computing resources has enabled large datasets to be stored and retrieved. However, exploring, understanding and extracting useful information is still a challenge. Among the computational tools to address this problem, information visualization techniques enable the data analysis employing the human visual ability by making a graphic representation of the data set, and data mining provides automatic processes for the discovery and interpretation of patterns. Despite the recent popularity of information visualization methods, a recurring problem is the low visual scalability when analyzing large data sets resulting in context loss and visual disorder. To represent large datasets reducing the loss of relevant information, the process of aggregation is being used. Aggregation decreases the amount of data to be represented, preserving the distribution and trends of the original dataset. Regarding data mining, information visualization has become an essential tool in the interpretation of computational models and generated results, especially of unsupervised techniques, such as clustering. This occurs because, in these techniques, the only way the user interacts with the mining process is through parameterization, limiting the insertion of domain knowledge in the process. In this thesis, we propose and develop the new visual metaphor based on the TableLens that employs approaches based on the concept of aggregation to create more scalable representations of tabular data. As application, we use the developed metaphor in the analysis of the results of clustering techniques. The resulting framework does not only support large database analysis but also provides insights into how data attributes contribute to clustering regarding cohesion and separation of the composed groups
10

Syntetisering av tabulär data: En systematisk litteraturstudie om verktyg för att skapa syntetiska dataset

Allergren, Erik, Hildebrand, Clara January 2023 (has links)
De senaste åren har efterfrågan på stora mängder data för att träna maskininläringsalgoritmer ökat. Algoritmerna kan användas för att lösa stora som små samhällsfrågor och utmaningar. Ett sätt att möta efterfrågan är att generera syntetisk data som bibehåller statistiska värden och egenskaper från verklig data. Den syntetiska datan möjliggör generering av stora mängder data men är också bra då den minimerar risken för att personlig integritet röjd och medför att data kan tillgängliggöras för forskning utan att identiteter röjs. I denna studie var det övergripande syftet att undersöka och sammanställa vilka verktyg för syntetisering av tabulär data som finns beskrivna i vetenskapliga publiceringar på engelska. Studien genomfördes genom att följa de åtta stegen i en systematisk litteraturstudie med tydligt definierade kriterier för vilka artiklar som skulle inkluderas eller exkluderas. De främsta kraven för artiklarna var att de beskrivna verktygen existerar i form av kod eller program, alltså inte enbart i teorin, samt var generella och applicerbara på olika tabulära dataset. Verktygen fick därmed inte bara fungera eller vara anpassad till ett specifikt dataset eller situation. De verktyg som fanns beskrivna i de återstående artiklarna efter genomförd sökning och därmed representeras i resultatet är (a) Synthpop, ett verktyg som togs fram i ett projekt för UK Longitudinal Studies för att kunna hantera känslig data och personuppgifter; (b) Gretel, ett kommersiellt och open-source verktyg som uppkommit för att möta det ökade behovet av träningsdata; (c) UniformGAN, en ny variant av GAN (Generative Adversarial Network) som genererar syntetiska tabulära dataset medan sekretess säkerställs samt; (d) Synthia, ett open-source paket för Python som är gjort för att generera syntetisk data med en eller flera variabler, univariat och multivariat data. Resultatet visade att verktygen använder sig av olika metoder och modeller för att framställa syntetisk data samt har olika grad av tillgänglighet. Gretel framträdde mest från verktygen, då den är mer kommersiell med fler tjänster samt erbjuder möjligheten att generera syntetiskt data utan att ha goda kunskaper i programmering. / During the last years the demand for big amounts of data to train machine learning algorithms has increased. The algorithms can be used to solve real world problems and challenges. A way to meet the demand is to generate synthetic data that preserve the statistical values and characteristics from real data. The synthetic data makes it possible to obtain large amounts of data, but is also good since it minimizes the risk for privacy issues in micro data. In that way, this type of data can be made accessible for important research without disclosure and potentially harming personal integrity. In this study, the overall aim was to examine and compile which tools for generation of synthetic data are described in scientific articles written in English. The study was conducted by following the eight steps of systematic literature reviews with clearly defined requirements for which articles to include or exclude. The primary requirements for the articles were that the described tools where existing in the form of accessible code or program and that they could be used for general tabular datasets. Thus the tools could not be made just for a specific dataset or situation. The tools that were described in the remaining articles after the search, and consequently included in the result of the study, was (a) Synthpop, a tool developed within the UK Longitudinal Studies to handle sensitive data containing personal information; (b) Gretel, a commercial and open source tool that was created to meet the demand for training data; (c) UniformGAN, a new Generative Adversarial Network that generates synthetic data while preserving privacy and (d) Synthia, a Python open-source package made to generate synthetic univariate and multivariate data. The result showed that the tools use different methods and models to generate synthetic data and have different degrees of accessibility. Gretel is distinguished from the other tools, since it is more commercial with several services and offers the possibility to generate synthetic data without good knowledge in programming.

Page generated in 0.0545 seconds