Global ETD Search

31	Supervised metric learning with generalization guarantees / Apprentissage supervisé de métriques avec garanties en généralisation Bellet, Aurélien 11 December 2012 In recent years, the crucial importance of metrics in machine learning algorithms has led to an increasing interest in optimizing distance and similarity functions using knowledge from training data to make them suitable for the problem at hand. This area of research is known as metric learning. Existing methods typically aim at optimizing the parameters of a given metric with respect to some local constraints over the training sample. The learned metrics are generally used in nearest-neighbor and clustering algorithms. When data consist of feature vectors, a large body of work has focused on learning a Mahalanobis distance, which is parameterized by a positive semi-definite matrix. Recent methods offer good scalability to large datasets. Less work has been devoted to metric learning from structured objects (such as strings or trees), because it often involves complex procedures. Most of the work has focused on optimizing a notion of edit distance, which measures (in terms of number of operations) the cost of turning one object into another. We identify two important limitations of current supervised metric learning approaches. First, they improve the performance of local algorithms such as k-nearest neighbors, but metric learning for global algorithms (such as linear classifiers) has not really been studied so far. Second, and perhaps more importantly, the question of the generalization ability of metric learning methods has been largely ignored. In this thesis, we propose theoretical and algorithmic contributions that address these limitations. Our first contribution is the derivation of a new kernel function built from learned edit probabilities. Unlike other string kernels, it is guaranteed to be valid and parameter-free. 
Our second contribution is a novel framework for learning string and tree edit similarities inspired by the recent theory of (epsilon,gamma,tau)-good similarity functions and formulated as a convex optimization problem. Using uniform stability arguments, we establish theoretical guarantees for the learned similarity that give a bound on the generalization error of a linear classifier built from that similarity. In our third contribution, we extend the same ideas to metric learning from feature vectors by proposing a bilinear similarity learning method that efficiently optimizes the (epsilon,gamma,tau)-goodness. The similarity is learned based on global constraints that are more appropriate to linear classification. Generalization guarantees are derived for our approach, highlighting that our method minimizes a tighter bound on the generalization error of the classifier. Our last contribution is a framework for establishing generalization bounds for a large class of existing metric learning algorithms. It is based on a simple adaptation of the notion of algorithmic robustness and allows the derivation of bounds for various loss functions and regularizers. Metric learning; Statistical learning; Convex optimization; Classification; Structured data; Edit distance; Generalization bounds
32	Semi-supervised structured prediction models Brefeld, Ulf 14 March 2008 Learning mappings between arbitrary structured input and output variables is a fundamental problem in machine learning. It covers many natural learning tasks and challenges the standard model of learning a mapping from independently drawn instances to a small set of labels. Potential applications include classification with a class taxonomy, named entity recognition, and natural language parsing. In these structured domains, labeled training instances are generally expensive to obtain while unlabeled inputs are readily available and inexpensive. This thesis deals with semi-supervised learning of discriminative models for structured output variables. The analytical techniques and algorithms of classical semi-supervised learning are lifted to the structured setting. Several approaches based on different assumptions about the data are presented. Co-learning, for instance, maximizes the agreement among multiple hypotheses, while transductive approaches rely on an implicit cluster assumption. Furthermore, a case study on email batch detection in message streams is presented. The tasks involved exhibit an inherent cluster structure, and the presented solution exploits the streaming nature of the data. The different approaches are developed into semi-supervised structured prediction models, and efficient optimization strategies for them are presented. The novel algorithms generalize state-of-the-art approaches in structural learning such as structural support vector machines. Empirical results show that the semi-supervised algorithms lead to significantly lower error rates than their fully supervised counterparts in many application areas, including multi-class classification, named entity recognition, and natural language parsing. 
Learning with structured data; semi-supervised learning; kernel machines; natural language processing; 004 Computer science; 28 Computer science, data processing; ddc:004
34	Uma técnica de indexação de dados semi-estruturados para o processamento eficiente de consultas com ramificação Viana, Talles Brito 20 April 2012 Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES / The explosive growth of web-based information systems has created various sources and vast quantities of semi-structured data, which need to be indexed by search engines in order to allow the retrieval of documents according to user needs. However, one of the major challenges in the development of indexing techniques for semi-structured data is how to index not only textual but also structural content. The main issue is how to efficiently handle branching path expressions without introducing precision loss or undesired growth of query processing costs and index file sizes. Several proposals for indexing semi-structured data can be found in the literature. Despite their relevant contributions, existing proposals suffer from at least one of the problems related to precision loss, storage space requirements, and query processing costs. In this context, this thesis proposes an efficient, lossless path-based indexing technique (named BranchGuide) for semi-structured data, which deals with a well-defined class of branching path expressions. This class includes branching paths that express parent-child dependencies between elements, with optional restrictions on the textual values of the attributes of those elements. As shown by experimental evaluation, the adoption of the BranchGuide technique results in excellent query processing times and generates smaller index files than a structural join indexing technique. 
Data Processing; Indexing Techniques; Information Retrieval; Semi-Structured Data
35	Hlasem ovládaný elektronický zubní kříž / Voice controlled electronic health record in dentistry Hippmann, Radek January 2012 Title: Voice controlled electronic health record in dentistry Author: MUDr. Radek Hippmann Department: Department of paediatric stomatology, Faculty hospital Motol Supervisor: Prof. MUDr. Taťjana Dostalová, DrSc., MBA Supervisor's e-mail: Tatjana.Dostalova@fnmotol.cz This PhD thesis concerns the development of a complex electronic health record (EHR) for the field of dentistry. The system is enhanced with voice control based on an automatic speech recognition (ASR) system and a text-to-speech (TTS) module for speech synthesis. The first part of the thesis describes the overall problem and defines the particular areas whose combination is essential for creating an EHR system in this field, mainly the basic delimitation of terms and areas in dentistry. Next, we address temporomandibular joint (TMJ) issues, which are often ignored, and describe trends in EHR and voice technologies. The methodological part describes the technologies used for EHR system creation, voice recognition, and TMJ disease classification. The following part describes the results, which correspond to the knowledge base in dentistry and TMJ. From this knowledge base originates the graphical user interface DentCross, which serves for dental data...
36	Uma abordagem de predição estruturada baseada no modelo perceptron Coelho, Maurício Archanjo Nunes 25 June 2015 CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / The theory of supervised learning has advanced significantly in recent decades. Several methods are widely used to solve a wide variety of problems, for example: expert systems for obtaining true/false answers, the Perceptron model for class separation, Support Vector Machines (SVMs) and the Incremental Margin Algorithm (IMA), which aim to increase the separation margin, their multi-class versions, and artificial neural networks, which allow relatively complex inputs. But how can we solve tasks that require answers as complex as the questions? Such answers may consist of several interrelated decisions that must be weighed one by one to arrive at a satisfactory and globally consistent solution. As will be seen throughout the thesis, there are problems of relevant interest that have these requirements. One question that naturally arises is the need to deal with the combinatorial explosion of possible solutions. 
One alternative is to build models that compactly capture certain structural properties of the problem: sequential correlations, temporal and spatial constraints, etc. These models, called structured models, include, among others, graphical models, such as Markov networks, and combinatorial optimization problems, such as weighted matchings, graph cuts, and data clustering with similarity and correlation patterns. This thesis formulates, presents, and discusses efficient online strategies for structured prediction based on the class-separation principle derived from the Perceptron model, and defines a set of supervised learning algorithms that are efficient compared with other approaches. Two experimental applications are also carried out and described: inferring the costs of the various features relevant to path search on a variety of maps, and inferring the generating parameters of Markov graphs. These applications are practical in nature and emphasize the importance of the proposed approach. 
CNPQ::CIENCIAS EXATAS E DA TERRA; Machine Learning; Multi-class Perceptron; Path Planning; Prediction of Structured Data; Markov Graphs
38	ADVANCED INTERFACE FOR QUERYING GRAPH DATA Mayes, Stephen Frederick January 2008 No description available. Computer Science; Query; Query Interface; Graph Data; Advanced Query Interface; Semi-structured data; Pathways; Biological Pathway Data; Path Query; Neighborhood Query
39	Abordagem para integração automática de dados estruturados e não estruturados em um contexto Big Data / Approach for automatic integration of structured and unstructured data in a Big Data context Keylla Ramos Saes 22 November 2018 The increase in data available for use has sparked interest in generating knowledge by integrating such data. However, the integration task requires knowledge of the data and of the data models used to represent them. That is, carrying out data integration requires the participation of computing experts, which limits the scalability of this type of task. In the context of Big Data, this limitation is reinforced by the wide variety of sources and heterogeneous data representation models, such as relational models with structured data and non-relational models with unstructured data; this variety of representations adds complexity to the data integration process. Handling this scenario requires integration tools that reduce or even eliminate the need for human intervention. 
As a contribution, this work offers the possibility of integrating diverse data representation models and heterogeneous data sources, through an approach that allows the use of varied techniques, such as algorithms that compare data by structural similarity and artificial intelligence algorithms, which, by generating integrating metadata, make the integration of heterogeneous data possible. This flexibility, which makes it possible to deal with the growing variety of data, is provided by the modularization of the proposed architecture, which enables data integration in a Big Data context automatically, without the need for human intervention. Big Data; Data integration; Heterogeneous data integration; Non-relational database; NoSQL; Relational database; Structured data; Unstructured data

Search results