221 |
[pt] AGRUPAMENTO FUZZY APLICADO À INTEGRAÇÃO DE DADOS MULTI-ÔMICOS / [en] FUZZY CLUSTERING APPLIED TO MULTI-OMICS DATA
SARAH HANNAH LUCIUS LACERDA DE GOES TELLES CARVALHO ALVES 05 October 2021 (has links)
[pt] Os avanços nas tecnologias de obtenção de dados multi-ômicos têm disponibilizado diferentes níveis de informação molecular que aumentam progressivamente em volume e variedade. Neste estudo, propõe-se uma metodologia de integração de dados clínicos e multi-ômicos, com o objetivo de identificar subtipos de câncer por agrupamento fuzzy, representando assim as gradações entre os diferentes perfis moleculares. Uma melhor caracterização de tumores em subtipos moleculares pode contribuir para uma medicina mais
personalizada e assertiva. Os conjuntos de dados ômicos a serem integrados são definidos utilizando um classificador com classe-alvo definida por resultados da literatura. Na sequência, é realizado o pré-processamento dos conjuntos de dados para reduzir a alta dimensionalidade. Os dados selecionados são
integrados e em seguida agrupados. Optou-se pelo algoritmo fuzzy C-means pela sua capacidade de considerar a possibilidade dos pacientes terem características de diferentes grupos, o que não é possível com métodos clássicos de agrupamento. Como estudo de caso, utilizou-se dados de câncer colorretal
(CCR). O CCR tem a quarta maior incidência na população mundial e a terceira maior no Brasil. Foram extraídos dados de metilação, expressão de miRNA e mRNA do portal do projeto The Cancer Genome Atlas (TCGA). Observou-se que a adição dos dados de expressão de miRNA e metilação a um classificador de expressão de mRNA da literatura aumentou a acurácia deste em 5 pontos percentuais. Assim, foram usados dados de metilação, expressão de miRNA e mRNA neste trabalho. Os atributos de cada conjunto de dados foram selecionados, obtendo-se redução significativa do número de atributos. A identificação dos grupos foi realizada com o algoritmo fuzzy C-means. A variação dos hiperparâmetros deste algoritmo, número de grupos e parâmetro de fuzzificação, permitiu a escolha da combinação de melhor desempenho. A escolha da melhor configuração considerou o efeito da variação dos parâmetros nas características biológicas, em especial na sobrevida global dos pacientes. Observou-se que o agrupamento gerado permitiu identificar que as amostras consideradas não agrupadas têm características biológicas compartilhadas entre grupos de diferentes prognósticos. Os resultados obtidos com a combinação de dados clínicos e ômicos mostraram-se promissores para melhor predizer o fenótipo. / [en] The advances in technologies for obtaining multi-omic data provide different levels of molecular information that progressively increase in volume and variety. This study proposes a methodology for integrating clinical and multi-omic data, whose aim is the identification of cancer subtypes using a fuzzy clustering
algorithm, representing the different degrees between molecular profiles. A better characterization of tumors in molecular subtypes can contribute to a more personalized and assertive medicine. A classifier that uses a target class from literature results indicates which omic data sets should be integrated.
Next, data sets are pre-processed to reduce their high dimensionality. The selected data are integrated and then clustered. The fuzzy C-means algorithm was chosen due to its ability to model characteristics shared by patients across different groups, which is not possible with classical clustering methods. As a case study, colorectal cancer (CRC) data were used. CRC has
the fourth highest incidence in the world population and the third highest in Brazil. Methylation, miRNA and mRNA expression data were extracted from The Cancer Genome Atlas (TCGA) project portal. It was observed that the addition of miRNA expression and methylation data to a literature mRNA expression classifier increased its accuracy by 5 percentage points. Therefore, methylation, miRNA and mRNA expression data were used in this work. The attributes of each data set were pre-selected, obtaining a significant reduction in the number of attributes. Groups were identified using the fuzzy C-means
algorithm. The variation of the hyperparameters of this algorithm, the number of groups and the fuzzification parameter, allowed the best-performing combination to be chosen. This choice considered the effect of parameter variation on biological characteristics, especially on the overall survival of patients. The clusters showed that patients considered not grouped had biological characteristics shared between groups of different prognoses. The combination of clinical and omic data showed promising results for better predicting the phenotype.
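The fuzzy C-means algorithm discussed above alternates between updating cluster centers and updating membership degrees, with the number of groups and the fuzzification parameter m as its hyperparameters. A minimal sketch on synthetic data follows; this is an illustration of the standard algorithm, not the thesis code, and all variable names and the toy data are assumptions:

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy C-means: returns (centers, membership matrix U).

    m is the fuzzification parameter: m close to 1 approaches hard
    k-means, larger m gives softer (more shared) memberships."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per sample
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                  # avoid division by zero
        inv = d ** (-2.0 / (m - 1))            # standard FCM membership update
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

# two well-separated synthetic "molecular profiles"
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(10, 0.1, (10, 2))])
centers, U = fuzzy_c_means(X, n_clusters=2)
```

Unlike hard clustering, every sample receives a membership degree for every cluster, which is what lets "not grouped" samples show characteristics shared between groups.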
|
222 |
Semantic Integration across Heterogeneous Databases : Finding Data Correspondences using Agglomerative Hierarchical Clustering and Artificial Neural Networks / Semantisk integrering mellan heterogena databaser : Hitta datakopplingar med hjälp av hierarkisk klustring och artificiella neuronnät
Hobro, Mark January 2018 (has links)
The process of data integration is an important part of the database field when it comes to database migrations and the merging of data. The research in the area has grown with the addition of machine learning approaches in the last 20 years. Due to the complexity of the research field, no go-to solutions have appeared. Instead, a wide variety of ways of enhancing database migrations have emerged. This thesis examines how well a learning-based solution performs for the semantic integration problem in database migrations. Two algorithms are implemented. One is based on information retrieval theory, with the goal of yielding a matching result that can be used as a benchmark for measuring the performance of the machine learning algorithm. The machine learning approach is based on grouping data with agglomerative hierarchical clustering and then training a neural network to recognize patterns in the data. This makes it possible to predict potential data correspondences across two databases. The results show that agglomerative hierarchical clustering performs well in the task of grouping the data into classes. The classes can in turn be used for training a neural network. The matching algorithm gives a high recall of matching tables, but improvements are needed to achieve both high recall and high precision. The conclusion is that the proposed learning-based approach, using agglomerative hierarchical clustering and a neural network, works as a solid base for semi-automating the data integration problem seen in this thesis. But the solution needs to be enhanced with scenario-specific algorithms and rules to reach the desired performance. / Dataintegrering är en viktig del inom området databaser när det kommer till databasmigreringar och sammanslagning av data. Forskning inom området har ökat i takt med att maskininlärning blivit ett attraktivt tillvägagångssätt under de senaste 20 åren. På grund av komplexiteten av forskningsområdet har inga optimala lösningar hittats.
Istället har flera olika tekniker framställts, som tillsammans kan förbättra databasmigreringar. Denna avhandling undersöker hur bra en lösning baserad på maskininlärning presterar för dataintegreringsproblemet vid databasmigreringar. Två algoritmer har implementerats. En är baserad på informationssökningsteori, som främst används för att ha en prestandamässig utgångspunkt för algoritmen som är baserad på maskininlärning. Den algoritmen består av ett första steg, där data grupperas med hjälp av hierarkisk klustring. Sedan tränas ett artificiellt neuronnät att hitta mönster i dessa grupperingar, för att kunna göra förutsägelser huruvida olika datainstanser har ett samband mellan två databaser. Resultatet visar att agglomerativ hierarkisk klustring presterar väl i uppgiften att klassificera den data som använts. Resultatet av matchningsalgoritmen visar på att en stor mängd av de matchande tabellerna kan hittas. Men förbättringar behöver göras för att både ge en hög återkallelse av matchningar och en hög precision för de matchningar som hittas. Slutsatsen är att ett inlärningsbaserat tillvägagångssätt, i detta fall att använda agglomerativ hierarkisk klustring och sedan träna ett artificiellt neuronnät, fungerar bra som en bas för att till viss del automatisera ett dataintegreringsproblem likt det som presenterats i denna avhandling. För att få bättre resultat, krävs att lösningen förbättras med mer situationsspecifika algoritmer och regler.
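The first stage described above, grouping data with agglomerative hierarchical clustering to obtain classes for training a network, can be sketched as follows. The toy "column fingerprint" features are an assumption for illustration, not the thesis setup:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy fingerprints for six database columns: (mean value length, digit ratio);
# two obvious groups, mimicking id-like vs. name-like columns
features = np.array([
    [5.0, 0.90], [5.2, 0.95], [4.8, 0.88],     # short, digit-heavy (id-like)
    [20.0, 0.05], [22.0, 0.02], [19.5, 0.04],  # long, text-heavy (name-like)
])

Z = linkage(features, method="average")            # agglomerative merge tree
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 classes
```

The resulting `labels` could then serve as training classes for the neural network that predicts correspondences between columns of the two databases.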
|
223 |
Hippocratic data sharing in e-government space with contract management
Aiyadurai, Yoganand January 2015 (has links)
Submitted in partial fulfillment of the requirement of the degree Magister Technologiae: Information Technology, Durban University of Technology, Durban, South Africa, 2015. / The research reported in this dissertation focuses on seamless data sharing in e-government space because of the intrinsic complexity, disparity and heterogeneity of government information systems as well as the need to improve government service delivery. The often observed bureaucracy in government processes, especially when verifying information, coupled with the high interdependency of government departments and diversity in government operations, has made it difficult to improve government service delivery efficiency. These challenges raise the need to find better ways to seamlessly share data between government and citizens, government and businesses, government and suppliers, and government and public institutions. Obviously, efficient automatic data sharing is an important phenomenon that contributes to improvements in communication, collaboration, interaction and efficiency in the service delivery process because it reduces information verification time and improves reliability of information.
The general applications of data sharing systems become perceptible in institutions such as banks and government establishments where information verification is highly necessary in the process of service delivery. Data sharing usually occurs between a data holder and a data requester when copies of authorized data are transported from the source databases to the requester. This data sharing process should guarantee a high level of privacy because of the confidential nature of certain data. A data integration gateway (DIG) is being proposed in this research as a methodological solution to seamlessly share data in e-government space, using Hippocratic database principles to enforce data privacy.
The DIG system is a centralized web application that utilizes a lightweight database within the government data centre to hold information on data contracts, data sources, connection strings and data destinations. The data sharing policies are stated as contracts, and once agreements on how to share data are established between different data publishers, it is possible to ensure a seamless integration of data from different sources using the DIG application proposed in this dissertation. The application is flexible enough to support the sharing of publisher data stored in any kind of database. The proposed DIG application promises to reduce costs of system maintenance and improve service delivery efficiency without any change to the existing hardware infrastructure and information systems residing within different government departments.
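The contract-driven disclosure described above follows the Hippocratic-database principle of limited disclosure: only fields authorized by a contract for a stated purpose are released. A minimal sketch; the contract structure, requester names and record fields are all hypothetical:

```python
# hypothetical contract registry: (requester, purpose) -> fields it may receive
CONTRACTS = {
    ("tax_office", "income_verification"): {"citizen_id", "annual_income"},
}

CITIZEN_RECORD = {
    "citizen_id": "ZA-100-200",
    "annual_income": 350_000,
    "medical_aid": "private",   # confidential: no contract covers it
}

def share(requester, purpose, record):
    """Release only the fields a data-sharing contract authorises;
    refuse requests with no matching contract."""
    allowed = CONTRACTS.get((requester, purpose))
    if allowed is None:
        raise PermissionError("no data-sharing contract for this request")
    return {k: v for k, v in record.items() if k in allowed}
```

In a DIG-style gateway the registry would live in the central lightweight database, with the source databases queried through stored connection strings.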
|
224 |
Prise en compte des dépendances entre données thématiques utilisateur et données topographiques lors d’un changement de niveau de détail / Taking into account the dependences between user thematic data and topographic data when the level of detail is changed
Jaara, Kusay 10 March 2015 (has links)
Avec l'importante disponibilité de données topographiques de référence, la création des données géographiques n'est plus réservée aux professionnels de l'information géographique. De plus en plus d'utilisateurs saisissent leurs propres données, que nous appelons données thématiques, en s'appuyant sur ces données de référence qui jouent alors le rôle de données support. Les données thématiques ainsi saisies font sens en tant que telles, mais surtout de par leurs relations avec les données topographiques. La non prise en compte des relations entre données thématiques et topographiques lors de traitements modifiant les unes ou les autres peut engendrer des incohérences, notamment pour les traitements liés au changement de niveau de détail. L'objectif de la thèse est de définir une méthodologie pour préserver la cohérence entre les données thématiques et topographiques lors d'un changement de niveau de détail. Nous nous concentrons sur l'adaptation des données thématiques suite à une modification des données topographiques, processus que nous appelons migration des données thématiques. Nous proposons d'abord un modèle pour la migration de données thématiques ponctuelles sur réseau composé de : (1) un modèle pour décrire le référencement des données thématiques sur les données topographiques par des relations spatiales (2) une méthode de relocalisation basée sur ces relations. L'approche consiste à identifier les relations finales attendues en fonction des relations initiales et des changements sur les données topographiques entre les états initial et final. La relocalisation est alors effectuée grâce à une méthode multicritère de manière à respecter au mieux les relations attendues. Une mise en œuvre est présentée sur des cas d'étude jouets et sur un cas réel fourni par un service de l'Etat gestionnaire de réseau routier. 
Nous discutons enfin l'extension du modèle proposé pour traiter la prise en compte des relations pour d'autres applications que la migration de données thématiques. / With the large availability of reference topographic data, creating geographic data is not exclusive to experts of geographic information any more. More and more users rely on reference data to create their own data, hereafter called thematic data. Reference data then play the role of support for thematic data. Thematic data make sense by themselves, but even more by their relations with topographic data. Not taking into account the relations between thematic and topographic data during processes that modify the former or the latter may cause inconsistencies, especially for processes that are related to changing the level of detail. The objective of this thesis is to define a methodology to preserve the consistency between thematic and topographic data when the level of detail is modified. This thesis focuses on the adaptation of thematic data after a modification of topographic data: we call this process thematic data migration. We first propose a model for the migration of punctual thematic data hosted by a network. This model is composed of: (1) a model to describe the referencing of thematic data on topographic data using spatial relations and (2) a method to re-locate thematic data based on these relations. The approach consists in identifying the expected final relations according to the initial relations and the modifications of topographic data between the initial and the final state. The thematic data are then re-located using a multi-criteria method in order to satisfy, as much as possible, the expected relations. An implementation is presented on toy problems and on a real use case provided by a French public authority in charge of road network management. The extension of the proposed model to take into account the relations for other applications than thematic data migration is also discussed.
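A toy version of the re-location step, snapping a punctual thematic feature onto the nearest segment of the modified network so that its "on the road" relation is preserved, could look like this. It uses a single geometric criterion, whereas the thesis uses a multi-criteria method over several spatial relations:

```python
def project_on_segment(p, a, b):
    """Closest point to p on the segment [a, b] (2-D tuples)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:                      # degenerate segment
        return a
    # clamp the projection parameter to stay on the segment
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return (ax + t * dx, ay + t * dy)

def relocate(p, segments):
    """Snap p to the nearest segment of the new (modified) network."""
    def dist_sq(q):
        return (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2
    return min((project_on_segment(p, a, b) for a, b in segments), key=dist_sq)
```

For example, a point hovering near a road segment is moved onto it:

```python
segments = [((0.0, 0.0), (2.0, 0.0)), ((0.0, 5.0), (2.0, 5.0))]
relocate((1.0, 0.5), segments)   # lands on the first segment
```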
|
225 |
Master Data Management a jeho využití v praxi / Master Data Management and its usage in practice
Kukačka, Pavel January 2011 (has links)
This thesis deals with Master Data Management (MDM), specifically its implementation. The main objectives are to analyze and capture general approaches to MDM implementation, including best practices; to describe and evaluate an MDM project implemented with Microsoft SQL Server 2008 R2 Master Data Services (MDS) in the Czech environment; and, on the basis of this theoretical background, the experience from the implemented project and the available technical literature, to create a general procedure for implementing the MDS tool. These objectives are achieved through the following procedures: exploration of information resources (printed, electronic, and personal appointments with consultants of Clever Decision), cooperation on a project realized by Clever Decision, and analysis of the Microsoft SQL Server 2008 R2 Master Data Services tool. The contributions of this work largely mirror its goals; the main contribution is the creation of a general procedure for implementing the MDS tool. The thesis is divided into two parts. The first (theoretically oriented) part deals with basic concepts (including their delimitation against other systems), architecture, implementation styles, market trends and best practices. The second (practically oriented) part first describes the implemented MDS project and then presents the general procedure for implementing the MDS tool.
|
226 |
Découverte de biomarqueurs prédictifs en cancer du sein par intégration transcriptome-interactome / Biomarkers discovery in breast cancer by Interactome-Transcriptome Integration
Garcia, Maxime 20 December 2013 (has links)
L’arrivée des technologies à haut débit pour mesurer l’expression des gènes a permis l’utilisation de signatures génomiques pour prédire des conditions cliniques ou la survie du patient. Cependant, de telles signatures ont des limitations, comme la dépendance au jeu de données d’entraînement et le manque de généralisation. Nous proposons un nouvel algorithme, Integration Transcriptome-Interactome (ITI) (Garcia et al.), pour extraire une signature généralisable prédisant la rechute métastatique dans le cancer du sein par superposition d’un très large jeu de données d’interaction protéine-protéine sur de multiples jeux de données d’expression des gènes. Cette méthode ré-implémente l’algorithme de Chuang et al., avec la capacité supplémentaire d’extraire une signature génomique à partir de plusieurs jeux de données d’expression des gènes simultanément. Une analyse non supervisée et une analyse supervisée ont été réalisées sur un compendium de jeux de données issus de puces à ADN en cancer du sein. Les performances des signatures trouvées par ITI ont été comparées aux performances des signatures préalablement publiées (Wang et al., Van De Vijver et al., Sotiriou et al.). Nos résultats montrent que les signatures ITI sont plus stables et plus généralisables, et sont plus performantes pour classifier un jeu de données indépendant. Nous avons trouvé des sous-réseaux formant des complexes précédemment reliés à des fonctions biologiques impliquées dans la métastase et le cancer du sein. Plusieurs gènes directeurs ont été détectés, dont CDK1, NCK1 et PDGFB, certains n’étant pas encore reliés à la rechute métastatique dans le cancer du sein. / [en] High-throughput gene-expression profiling technologies yield genomic signatures to predict clinical conditions or patient outcome. However, such signatures have limitations, such as dependency on the training set and lack of generalization. We propose a novel algorithm, Interactome-Transcriptome Integration (ITI) (Garcia et al.)
to extract a generalizable signature predicting breast cancer relapse by superimposing large-scale protein-protein interaction data over several gene-expression data sets. This method re-implements the Chuang et al. algorithm, with the added capability of extracting a genomic signature from several gene-expression data sets simultaneously. Unsupervised and supervised analyses were performed on a breast cancer compendium of DNA microarray data sets. The performance of the signatures found with ITI was compared with that of previously published signatures (Wang et al., Van De Vijver et al., Sotiriou et al.). Our results show that ITI’s signatures are more stable and more generalizable, and perform better when classifying an independent data set. We found that the subnetworks formed complexes functionally linked to metastasis and breast cancer. Several driver genes were detected, including CDK1, NCK1 and PDGFB, some not previously linked to breast cancer relapse.
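The core scoring idea of Chuang-style network signatures, which ITI builds on, can be sketched as follows: a subnetwork's activity per sample is the average z-scored expression of its member genes, and its discriminative power is measured against the relapse labels. The expression values below are invented toy data; only the gene names come from the abstract:

```python
import numpy as np

def subnetwork_activity(expr, genes):
    """Per-sample activity of a subnetwork: mean z-scored expression
    of its member genes (the Chuang-style aggregation)."""
    z = []
    for g in genes:
        v = np.asarray(expr[g], dtype=float)
        z.append((v - v.mean()) / v.std())
    return np.mean(z, axis=0)

def discriminative_score(activity, labels):
    """|difference of class means| / pooled std: how well the activity
    separates relapse from non-relapse samples."""
    a, b = activity[labels == 1], activity[labels == 0]
    pooled = np.sqrt((a.var() + b.var()) / 2.0)
    return abs(a.mean() - b.mean()) / (pooled if pooled > 0 else 1.0)

# invented expression values for two genes named in the abstract
expr = {"CDK1": [1, 2, 1, 8, 9, 8], "NCK1": [0, 1, 0, 7, 8, 7]}
labels = np.array([0, 0, 0, 1, 1, 1])     # 1 = metastatic relapse
score = discriminative_score(subnetwork_activity(expr, ["CDK1", "NCK1"]), labels)
```

A greedy search over the interaction network would grow a subnetwork gene by gene while this score keeps improving, which is what makes the resulting signature a set of connected subnetworks rather than a flat gene list.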
|
227 |
Feeding a data warehouse with data coming from web services. A mediation approach for the DaWeS prototype / Alimenter un entrepôt de données par des données issues de services web. Une approche médiation pour le prototype DaWeS
Samuel, John 06 October 2014 (has links)
Cette thèse traite de l’établissement d’une plateforme logicielle nommée DaWeS permettant le déploiement et la gestion en ligne d’entrepôts de données alimentés par des données provenant de services web et personnalisés à destination des petites et moyennes entreprises. Ce travail s’articule autour du développement et de l’expérimentation de DaWeS. L’idée principale implémentée dans DaWeS est l’utilisation d’une approche virtuelle d’intégration de données (la médiation) en tant que processus ETL (extraction, transformation et chargement des données) pour les entrepôts de données gérés par DaWeS. A cette fin, un algorithme classique de réécriture de requêtes (l’algorithme inverse-rules) a été adapté et testé. Une étude théorique sur la sémantique des requêtes conjonctives et datalog exprimées avec des relations munies de limitations d’accès (correspondant aux services web) a été menée. Cette dernière permet l’obtention de bornes supérieures sur les nombres d’appels aux services web requis dans l’évaluation de telles requêtes. Des expérimentations ont été menées sur des services web réels dans trois domaines : le marketing en ligne, la gestion de projets et les services d’aide aux utilisateurs. Une première série de tests aléatoires a été effectuée pour tester le passage à l’échelle. / The role of data warehouse for business analytics cannot be undermined for any enterprise, irrespective of its size. But the growing dependence on web services has resulted in a situation where the enterprise data is managed by multiple autonomous and heterogeneous service providers. We present our approach and its associated prototype DaWeS [Samuel, 2014; Samuel and Rey, 2014; Samuel et al., 2014], a DAta warehouse fed with data coming from WEb Services to extract, transform and store enterprise data from web services and to build performance indicators from them (stored enterprise data) hiding from the end users the heterogeneity of the numerous underlying web services.
Its ETL process is grounded on a mediation approach usually used in data integration. This enables DaWeS (i) to be fully configurable in a declarative manner only (XML, XSLT, SQL, datalog) and (ii) to make part of the warehouse schema dynamic so it can be easily updated. (i) and (ii) allow DaWeS managers to shift from development to administration when they want to connect to new web services or to update the APIs (Application programming interfaces) of already connected ones. The aim is to make DaWeS scalable and adaptable to smoothly face the ever-changing and growing web services offer. We point out the fact that this also enables DaWeS to be used with the vast majority of actual web service interfaces defined with basic technologies only (HTTP, REST, XML and JSON) and not with more advanced standards (WSDL, WADL, hRESTS or SAWSDL) since these more advanced standards are not widely used yet to describe real web services. In terms of applications, the aim is to allow a DaWeS administrator to provide to small and medium companies a service to store and query their business data coming from their usage of third-party services, without having to manage their own warehouse. In particular, DaWeS enables the easy design (as SQL Queries) of personalized performance indicators. We present in detail this mediation approach for ETL and the architecture of DaWeS. Besides its industrial purpose, working on building DaWeS brought forth further scientific challenges like the need for optimizing the number of web service API operation calls or handling incomplete information. We propose a bound on the number of calls to web services. This bound is a tool to compare future optimization techniques. We also present a heuristics to handle incomplete information.
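The call-counting concern above can be illustrated with a toy mediation query over two operations with access limitations: the binding patterns force one call to list a project's tasks plus one call per task, a concrete instance of the kind of bound on web service calls studied in the thesis. All service names and data below are invented stand-ins, not real APIs:

```python
# call counter shared by the two stand-in web-service operations
CALLS = {"n": 0}

def list_tasks(project):
    """Operation 1: project -> tasks (input must be bound before calling)."""
    CALLS["n"] += 1
    return {"p1": ["t1", "t2", "t3"]}.get(project, [])

def task_hours(task):
    """Operation 2: task -> logged hours (input must be bound)."""
    CALLS["n"] += 1
    return {"t1": 5, "t2": 3, "t3": 8}.get(task, 0)

def total_hours(project):
    """Evaluate hours(p, h) :- tasks(p, t), logged(t, h) left to right:
    the access patterns impose 1 + |tasks| service calls."""
    return sum(task_hours(t) for t in list_tasks(project))
```

For a project with n tasks the evaluation costs exactly 1 + n calls, and it is this kind of upper bound, derived from the query shape and the access limitations, that lets future optimization techniques be compared.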
|
228 |
Systems Integration Tool: uma ferramenta para integração e visualização de dados em larga escala e sua aplicação em cana-de-açúcar / Systems Integration Tool: an integration and visualization tool for big data and their application on sugarcane
Piovezani, Amanda Rusiska 14 December 2017 (has links)
As respostas das plantas ao ambiente são orquestradas por fatores genéticos, bem como sua flexibilidade metabólica, uma vez que essas são sésseis. A forma com que os padrões gênicos e metabólicos redundam entre as células reflete nos diferentes níveis organizacionais (célula, tecido, órgão e até o organismo como um todo). Por isso, para entendermos as respostas das plantas em determinados estágios de desenvolvimento ou condições é importante explorarmos ao máximo os diferentes níveis de regulação. Neste sentido, tem crescido a quantidade de dados biológicos obtidos através de métodos que produzem dados em larga escala, visando um estudo de forma sistêmica. Embora existam várias ferramentas para a integração de dados biológicos, elas estão desenvolvidas para organismos modelos, inviabilizando análises para outros, como a cana-de-açúcar, que possui vários dados biológicos disponíveis, mas com genoma complexo e incompleto. Tendo em vista a importância econômica da cana-de-açúcar e o interesse em entendermos o processo de degradação da parede celular, desenvolvemos a ferramenta SIT (Systems Integration Tool), para integração dos dados disponíveis (transcritoma, proteoma e atividade enzimática). A implementação da ferramenta foi realizada utilizando as linguagens de programação Perl e Java. SIT possui uma interface gráfica, podendo ser executada localmente, a qual possibilita a integração de até seis diferentes conjuntos de dados. A visualização do resultado é obtida na forma de redes complexas, permitindo ao usuário a visualização e edição dinâmica da integração. O uso da SIT permitiu no presente estudo, entre outros, a identificação de elementos chave na degradação da parede celular, presentes nos diferentes conjuntos de dados explorados, apontando portanto, potenciais alvos de estudos experimentais. 
SIT pode ser aplicada a diferentes conjuntos de dados, o que poderá auxiliar estudos futuros em várias áreas do conhecimento. / Plants are sessile organisms, and their responses to environmental stimuli are orchestrated by genetic factors, as well as by their metabolic flexibility. Inside the cell, genetic and metabolic patterns are shared among cells, and this is reflected at different organizational levels (cell, tissue, organ, up to the whole organism). Thus, to understand plant responses to certain developmental stages or conditions, it is important to explore the different regulatory levels. Recently, there has been a large increase in the availability of biological data, due to advances in next-generation sequencing techniques, which now enable more profound systems biology studies. Despite the availability of several integration tools for the analysis of biological data, these were developed for model organisms. Such tools are only partially effective for sugarcane, for which large amounts of data are available but whose genome is complex and incomplete. Due to the economic importance of sugarcane and aiming at understanding the cell wall degradation process, we developed the software Systems Integration Tool (SIT). The tool integrates available data (transcriptomics, proteomics, and enzymatic activity). The implementation was performed in Perl and Java. SIT has a graphical interface, can be executed locally, and enables the integration of up to six different data sets. Integration results are generated as complex networks, allowing the users to visualize and dynamically edit the networks. The present study allowed the identification of key cell wall regulatory elements present in the different data sets explored, pointing to potential targets for experimental validation. SIT can be applied to various data sets and can support future studies in different areas of knowledge.
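The layered-integration idea behind SIT, merging transcriptome, proteome and enzyme-activity links into one network and flagging nodes that span several layers as candidate key elements, can be sketched as follows. The node identifiers are invented, not actual sugarcane data, and this stdlib-only sketch stands in for SIT's Perl/Java implementation:

```python
from collections import defaultdict

# one edge list per omics layer (invented identifiers)
layers = {
    "mrna":    [("SHCEL1", "SHCEL2")],          # co-expression link
    "protein": [("SHCEL1", "SHEXP1")],          # protein interaction
    "enzyme":  [("SHEXP1", "cellulase_act")],   # enzyme-activity link
}

# merge all layers into a single labelled adjacency structure
adj = defaultdict(set)
for layer, edges in layers.items():
    for u, v in edges:
        adj[u].add((v, layer))
        adj[v].add((u, layer))

# nodes connected through more than one edge bridge layers:
# candidate key elements in cell wall degradation
hubs = sorted(n for n, neighbours in adj.items() if len(neighbours) > 1)
```

Rendering `adj` as an editable complex network is then a visualization concern; the integration itself reduces to this cross-layer merge.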
|
229 |
Integration framework for artifact-centric processes in the internet of things / Cadre d'intégration pour les processus centrés artéfacts dans l'Internet des objets
Abi Assaf, Maroun 09 July 2018 (has links)
La démocratisation des objets communicants fixes ou mobiles pose de nombreux défis concernant leur intégration dans des processus métiers afin de développer des services intelligents. Dans le contexte de l’Internet des objets, les objets connectés sont des entités hétérogènes et dynamiques qui englobent des fonctionnalités et propriétés cyber-physiques et interagissent via différents protocoles de communication. Pour pallier les défis d’interopérabilité et d’intégration, il est primordial d’avoir une vue unifiée et logique des différents objets connectés afin de définir un ensemble de langages, outils et architectures permettant leur intégration et manipulation à grande échelle. L'artéfact métier a récemment émergé comme un modèle d’objet (métier) autonome qui encapsule ses données, un ensemble de services manipulant ses données, ainsi qu'un cycle de vie à base d’états. Le cycle de vie désigne le comportement de l’objet et son évolution à travers ses différents états pour atteindre son objectif métier. La modélisation des objets connectés sous forme d’artéfact métier étendu nous permet de construire un paradigme intuitif pour exprimer facilement des processus d’intégration d’objets connectés dirigés par leurs données. Face aux changements contextuels et à la réutilisation des objets connectés dans différentes applications, les processus dirigés par les données (appelés aussi « artifacts » au sens large) restent relativement invariants vu que leurs structures de données ne changent pas. Or, les processus centrés sur les services requièrent souvent des changements dans leurs flux d'exécution. Cette thèse propose un cadre d'intégration de processus centré sur les artifacts et leur application aux objets connectés.
Pour cela, nous avons construit une vue logique unifiée et globale d’artéfact permettant de spécifier, définir et interroger un très grand nombre d'artifacts distribués, ayant des fonctionnalités similaires (maisons intelligentes ou voitures connectées, …). Le cadre d'intégration comprend une méthode de modélisation conceptuelle des processus centrés artifacts, des algorithmes d'appariement inter-artifacts et une algèbre de définition et de manipulation d’artifacts. Le langage déclaratif, appelé AQL (Artifact Query Language), permet en particulier d’interroger des flux continus d’artifacts. Il s'appuie sur une syntaxe de type SQL pour réduire les efforts d'apprentissage. Nous avons également développé un prototype pour valider nos contributions et mener des expériences dans le contexte de l’Internet des objets. / The emergence of fixed or mobile communicating objects poses many challenges regarding their integration into business processes in order to develop smart services. In the context of the Internet of Things, connected devices are heterogeneous and dynamic entities that encompass cyber-physical features and properties and interact through different communication protocols. To overcome the challenges related to interoperability and integration, it is essential to build a unified and logical view of different connected devices in order to define a set of languages, tools and architectures allowing their integration and manipulation at large scale. The business artifact has recently emerged as an autonomous (business) object model that encapsulates attribute-value pairs, a set of services manipulating its attribute data, and a state-based lifecycle. The lifecycle represents the behavior of the object and its evolution through its different states in order to achieve its business objective.
Modeling connected devices and smart objects as extended business artifacts allows us to build an intuitive paradigm to easily express data-driven integration processes over connected objects. When handling contextual changes and the reuse of connected devices in different applications, data-driven processes (or artifact processes in the broad sense) remain relatively invariant, as their data structures do not change. However, service-centric or activity-based processes often require changes in their execution flows. This thesis proposes a framework for integrating artifact-centric processes and its application to connected devices. To this end, we introduce a logical and unified view of a "global" artifact allowing the specification, definition and querying of a very large number of distributed artifacts with similar functionalities (smart homes or connected cars, ...). The framework includes a conceptual modeling method for artifact-centric processes, inter-artifact mapping algorithms, and an artifact definition and manipulation algebra. A declarative language, called AQL (Artifact Query Language), is designed in particular to query continuous streams of artifacts. AQL relies on a syntax similar to SQL in relational databases in order to flatten its learning curve. We have also developed a prototype to validate our contributions and conducted experiments in the context of the Internet of Things.
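The core notion above, a business artifact bundling attribute-value data, the services that manipulate it, and a state-based lifecycle, can be illustrated with a minimal sketch. Everything here (the `BusinessArtifact` class, the lamp example, the state and service names) is illustrative and assumed, not the thesis's actual model:

```python
# Minimal sketch of a business artifact: attribute-value data, services
# manipulating those data, and a state-based lifecycle that constrains
# which service may run in which state. All names are illustrative.

class BusinessArtifact:
    def __init__(self, transitions, initial):
        self.data = {}                  # attribute-value pairs
        self.transitions = transitions  # (state, service name) -> next state
        self.state = initial

    def invoke(self, service, **kwargs):
        """Run a service only if the lifecycle allows it in the current state."""
        key = (self.state, service.__name__)
        if key not in self.transitions:
            raise RuntimeError(f"{service.__name__} not allowed in state {self.state!r}")
        service(self.data, **kwargs)
        self.state = self.transitions[key]

# Two services manipulating the artifact's data.
def switch_on(data, level=100):
    data["brightness"] = level

def switch_off(data):
    data["brightness"] = 0

# A connected lamp modeled as an artifact with a two-state lifecycle.
lamp = BusinessArtifact(
    transitions={("off", "switch_on"): "on", ("on", "switch_off"): "off"},
    initial="off",
)
lamp.invoke(switch_on, level=80)
print(lamp.state, lamp.data)  # on {'brightness': 80}
```

The lifecycle table makes the object's evolution explicit: invoking `switch_on` again while already in state `on` would be rejected, which is the behavioral guarantee the artifact model adds over a plain data record.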
|
230 |
A Resource-Oriented Architecture for Integration and Exploitation of Linked Data / Design of a Service-Oriented Architecture for the Integration and Exploitation of Linked Data. De Vettor, Pierre, 29 September 2016 (has links)
This thesis focuses on the integration of raw data coming from heterogeneous sources on the Web. The overall objective is to provide a generic, modular architecture able to combine these heterogeneous data semantically and intelligently in order to make them reusable. This work is motivated by a real scenario from the Audience Labs company that allows the architecture to be scaled up. In this report, we propose new models and techniques for adapting the combination and integration process to the diversity of the data sources involved. The challenges are transparent and dynamic management of data sources; scalability and responsiveness with respect to the number of sources; adaptability to source characteristics; and, finally, consistency of the produced data (coherent data, without errors or duplicates). To address these challenges, we propose a meta-model that represents sources according to their characteristics, related to data access (URI) or extraction (format) as well as to the sources' physical capabilities (latency, volume). Building on this formalization, we propose different data-access strategies in order to adapt the processing to the specifics of each source. Based on these models and strategies, we propose a resource-oriented architecture in which every component is accessible over HTTP via its URI. From the sources' characteristics, specific, adapted execution workflows are generated, orchestrating the different tasks of the integration process optimally by assigning each task a priority. Processing times are thereby reduced, as are the volumes of data exchanged. To improve the quality of the data produced by our approach, the emphasis is put on the uncertainty that may appear in Web data. 
We propose a model to represent this uncertainty through the concept of uncertain Web resources, based on a probabilistic model in which each resource can have several possible representations, each with a given probability. This approach leads to a further optimization of the architecture, allowing uncertainty to be taken into account while combining the data. / In this thesis, we focus on the integration of raw data coming from heterogeneous and multi-origin data sources on the Web. The global objective is to provide a generic and adaptive architecture able to analyze and combine this heterogeneous, informal, and sometimes meaningless data into a coherent smart data set. We define smart data as significant, semantically explicit data, ready to be used to fulfill the stakeholders' objectives. This work is motivated by a live scenario from the French Audience Labs company. In this report, we propose new models and techniques to adapt the combination and integration process to the diversity of data sources. We focus on transparency and dynamicity in data source management; scalability and responsiveness according to the number of data sources; adaptability to data source characteristics; and, finally, consistency of the produced data (coherent data, without errors or duplicates). In order to address these challenges, we first propose a meta-model to represent the variety of data source characteristics, related to access (URI, authentication), extraction (request format), or physical capabilities (volume, latency). By relying on this coherent formalization of data sources, we define different data access strategies in order to adapt access and processing to data source capabilities. With the help of these models and strategies, we propose a distributed resource-oriented software architecture, where each component is freely accessible through REST via its URI. 
The orchestration of the different tasks of the integration process can then be optimized with respect to data source and data characteristics. These characteristics allow us to generate an adapted workflow in which tasks are prioritized to speed up the process and to limit the quantity of data transferred. In order to improve the data quality of our approach, we then focus on the data uncertainty that can appear in a Web context, and propose a model to represent it. We introduce the concept of uncertain Web resource, based on a probabilistic model where each resource can have different possible representations, each with a probability. This approach is the basis of a new architecture optimization allowing uncertainty to be taken into account during our combination process.
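The probabilistic model above, one resource carrying several possible representations with attached probabilities, can be sketched as follows. The dictionary encoding, helper names (`most_likely`, `expected`), and the example values are assumptions for illustration, not the thesis's actual data model:

```python
# Sketch of an uncertain Web resource: one logical resource with several
# candidate representations, each weighted by a probability. Two simple
# ways to exploit the model during data combination are shown: picking
# the most likely representation, and computing an expected value over a
# numeric attribute. All values below are made up.

def most_likely(representations):
    """Return the candidate representation with the highest probability."""
    return max(representations, key=lambda r: r["p"])

def expected(representations, attr):
    """Probability-weighted average of a numeric attribute."""
    return sum(r["p"] * r["value"][attr] for r in representations)

# Three conflicting representations of the same city resource,
# e.g. harvested from three Web sources of unequal reliability.
city = [
    {"p": 0.7, "value": {"name": "Lyon",  "population": 513275}},
    {"p": 0.2, "value": {"name": "Lyon",  "population": 500000}},
    {"p": 0.1, "value": {"name": "Lyons", "population": 513275}},
]

assert abs(sum(r["p"] for r in city) - 1.0) < 1e-9  # probabilities sum to 1
print(most_likely(city)["value"]["name"])  # Lyon
print(round(expected(city, "population")))
```

A combination step can then either resolve each uncertain resource to its most likely representation before merging, or propagate the full distribution so that downstream consumers see the uncertainty.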
|