Global ETD Search

171	Um método de integração de dados armazenados em bancos de dados relacionais e NOSQL / A method for integration data stored in databases relational and NOSQL Vilela, Flávio de Assis 08 October 2015 The growth in the quantity and variety of data available on the Web contributed to the emergence of the NOSQL approach, which targets new demands such as availability, schema flexibility and scalability. At the same time, relational databases are widely used for storing and manipulating structured data; they provide stability and data integrity, and their data are accessed through a standard language such as SQL. This work presents a method for integrating data stored in heterogeneous sources, in which an input query in standard SQL produces a unified answer based on the partial answers of relational and NOSQL databases. Keywords: Data integration; Relational database; NoSQL database
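To make the idea in entry 171 concrete, the following is a minimal Python sketch of combining partial answers from a relational source and a document-style source and then answering one SQL query over the combined rows. It is not the method of the dissertation, whose details the abstract does not give. The function `unified_answer`, the staging table `items`, and the sample data are all hypothetical, and the NoSQL source is simulated with a list of dictionaries.

```python
import sqlite3


def unified_answer(sql, relational_rows, document_rows, columns):
    """Answer `sql` over a staging table `items` filled from both sources.

    Assumption: both partial answers can be projected onto the same columns.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE items ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    # Partial answer 1: tuples returned by the relational source.
    conn.executemany(f"INSERT INTO items VALUES ({placeholders})", relational_rows)
    # Partial answer 2: documents from the NoSQL source; absent fields become NULL.
    conn.executemany(
        f"INSERT INTO items VALUES ({placeholders})",
        [tuple(doc.get(col) for col in columns) for doc in document_rows],
    )
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


# Hypothetical sample data for the two sources.
relational = [(1, "Ana", "Goiânia"), (2, "Bruno", "Manaus")]
documents = [{"id": 3, "name": "Carla"}, {"id": 4, "name": "Davi", "city": "Recife"}]

rows = unified_answer(
    "SELECT name FROM items WHERE city IS NOT NULL ORDER BY id",
    relational,
    documents,
    ["id", "name", "city"],
)
print(rows)
```

Staging both partial answers in one table keeps the sketch short. A real mediator would more likely rewrite the input query per source and push filters down to each one.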
172	Casamento de esquemas de banco de dados aplicando aprendizado ativo Rodrigues, Diego de Azevedo 12 March 2013 FAPEAM - Fundação de Amparo à Pesquisa do Estado do Amazonas / Given two database schemas within the same domain, the schema matching problem is the task of finding pairs of schema elements that have the same semantics in that domain. Traditionally, this task was performed manually by a specialist, which made it tedious and costly because the specialist had to know both the schemas and their domain well. Currently the process is assisted by semi-automatic schema matching methods. Current methods use various heuristics to generate matchings, and many of them share a common model: they build a similarity matrix between the schema elements using functions called matchers and, based on the matrix values, decide according to some criterion which matchings are correct. This thesis presents an active-learning-based method that uses the similarity matrix generated by the matchers, a machine learning algorithm and interventions by a specialist to generate matchings. The method differs from others because it has no fixed heuristic and draws on the specialist's expertise only when necessary. In our experiments, we evaluate the proposed method against a baseline on two datasets: the first is the one originally used by the baseline, and the second contains schemas from a benchmark for schema integration. We show that the baseline achieves good results on its original dataset, but that its fixed strategy is not as effective for other schemas. The proposed active-learning method, in contrast, is consistent on both datasets, achieving an average F-measure of 0.64. Keywords: Schema matching; Data integration; Active learning
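The loop described in entry 172 (matcher scores, a learner, and a specialist consulted only when needed) can be sketched as pool-based active learning with uncertainty sampling. This is a deliberately small Python illustration, not the algorithm evaluated in the thesis. The threshold learner stands in for the unspecified machine learning algorithm, and `fit_threshold`, `active_match`, `f_measure`, the sample scores, and the `oracle` callback are all hypothetical.

```python
from statistics import mean


def fit_threshold(labeled):
    """Learn a cut-off on the mean matcher score from (score, label) pairs."""
    pos = [s for s, is_match in labeled if is_match]
    neg = [s for s, is_match in labeled if not is_match]
    if pos and neg:
        return (min(pos) + max(neg)) / 2
    return 0.5  # default until both classes have been observed


def active_match(candidates, oracle, budget):
    """candidates: {pair: [matcher scores]}; oracle(pair) -> bool plays the specialist."""
    scores = {pair: mean(vals) for pair, vals in candidates.items()}
    labels = {}
    for _ in range(budget):
        unlabeled = [p for p in scores if p not in labels]
        if not unlabeled:
            break
        threshold = fit_threshold([(scores[p], y) for p, y in labels.items()])
        # Uncertainty sampling: query the pair closest to the decision boundary.
        query = min(unlabeled, key=lambda p: abs(scores[p] - threshold))
        labels[query] = oracle(query)
    threshold = fit_threshold([(scores[p], y) for p, y in labels.items()])
    # Specialist answers override the learned threshold.
    return {p for p, s in scores.items() if labels.get(p, s >= threshold)}


def f_measure(predicted, truth):
    """Harmonic mean of precision and recall over sets of matchings."""
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical rows of a similarity matrix: two matcher scores per element pair.
candidates = {
    ("name", "full_name"): [0.9, 0.7],
    ("name", "address"): [0.3, 0.1],
    ("zip", "postal_code"): [0.4, 0.56],
    ("phone", "fax"): [0.6, 0.5],
}
truth = {("name", "full_name"), ("zip", "postal_code")}
predicted = active_match(candidates, oracle=lambda pair: pair in truth, budget=2)
print(sorted(predicted), f_measure(predicted, truth))
```

With a budget of two questions, the sketch spends both on the two borderline pairs, which are exactly the ones a fixed 0.5 cut-off would get wrong.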
173	Integração de dados na inferência de redes de genes: avaliação de informações biológicas e características topológicas / Data integration in gene networks inference: evaluation of biological and topological features Fabio Fernandes da Rocha Vicente 02 May 2016 Cellular components do not act in isolation but through a network of interactions. It is therefore essential to discover how genes interact with each other and to understand the dynamics of the biological system. This knowledge can contribute to the treatment of diseases, to plant breeding and to increased agricultural production, for example. Many gene networks are unknown or only partially known. In this context, the inference of Gene Networks (GNs) has emerged as a possible solution; it aims to recover the network from gene expression data using probabilistic models. However, an intrinsic problem of network inference is formally described as the curse of dimensionality (the number of variables is much larger than the number of samples). In the biological setting the problem is aggravated, because there are thousands of genes and only one or two dozen expression samples. Inference models try to work around this problem by proposing solutions that minimize the estimation error. Even so, the prediction models still produce many ties; that is, expression data alone are often not enough to decide the correct interaction between genes. Integrating other biological data with the gene expression data has been proposed as a possible solution. However, these data are heterogeneous: they refer to physical interactions, functional relationships and localization, among others. They are also represented in different ways: as quantitative or qualitative data, as nominal or ordinal attributes, sometimes organized in a hierarchical structure, sometimes as a graph, and sometimes as descriptive annotation. In addition, it is unclear how each type of data can contribute to the inference and to reducing the error of the models. It is therefore important to understand the relationships among the available biological data and to investigate how to integrate them into the inference. In this work, three data integration methodologies were developed and the contribution of each type of data was analyzed. The results showed that the combined use of expression data and other biological data improves network prediction, and that the potential for error reduction differs according to the type of data. Experiments that include topological features in the models also showed that knowledge of the network topology reduces the error and yields inferred networks that are consistent with the expected topology. Keywords: Bioinformatics; Complex networks; Data integration; Gene networks; Pattern recognition
174	Modelo navegacional dinâmico, para implementação da integração inter-estrutural de dados. / Dynamic navigational model for implementation of the data inter-structural integration. José Gomes Neto 04 November 2016 Over the last decade, substantial changes have been observed in the types of data being processed, compared with the conventional definition of structured data. In this context, computational systems that mostly access conventional, centralized databases storing structured data increasingly need to access and process unstructured, distributed data in large volumes as well. Factors such as the versatility needed to host unstructured data and the coexistence, integration and diffusion of complex data at speeds higher than those observed until now restrict, in certain situations, the use of conventional data models. This thesis therefore proposes and formalizes a post-relational data model based on the concepts of complex graphs, also known as Complex Networks. Using the graph model, it defines a way to implement inter-structural data integration, that is, the integration of traditional structured data with the more recently used unstructured data, such as multimedia data. This integration covers all the operations found in a database: querying, inserting, updating and deleting data. The name given to this approach and its implementation is Dynamic Navigational Model (Modelo Navegacional Dinâmico, MND). The model represents different data structures and, above all, allows them to coexist in an integrated way, making the resulting information more complete and comprehensive. MND thus combines the benefits of the Complex Network structure with unstructured data, particularly for integrating data with distinct structures. It retains the relational data handling already available in other databases and gives applications that need this integration better use of resources. Keywords: Databases; Unstructured data; Data integration; Data models; Complex networks
176	Test Data Extraction and Comparison with Test Data Generation Raza, Ali 01 August 2011 Testing an integrated information system that relies on data from multiple sources can be a challenge, particularly when the data is confidential. This thesis describes a novel test data extraction approach, called semantic-based test data extraction for integrated systems (iSTDE), that solves many of the problems associated with creating realistic test data for integrated information systems containing confidential data. iSTDE reads a consistent cross-section of data from the production databases, manipulates that data to obscure individual identities while preserving the overall semantic data characteristics that are critical to thorough system testing, and then moves the test data to an external test environment. This thesis also presents a theoretical study that compares test-data extraction with a competing technique, test-data generation. Specifically, this thesis a) describes a comparison method that includes a comprehensive list of characteristics essential for testing database applications, organized into seven areas, b) presents an analysis of the relative strengths and weaknesses of the different test-data creation techniques, and c) reports a number of specific conclusions that will help testers make appropriate choices. Keywords: Data Integration; Data Sensitization/Anonymization; Health Informatics; Software Engineering; Test Data Extraction; Testing Data-Centric Applications; Computer Sciences
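One way to picture the "obscure identities but keep the data consistent" step in entry 176 is keyed pseudonymization of identifiers, so that rows extracted from different sources still join on the same key. The Python sketch below is an assumption-laden illustration and not iSTDE itself. The functions `pseudonym` and `extract_for_test`, the field names, the secret, and the sample rows are all hypothetical.

```python
import hashlib
import hmac


def pseudonym(value, secret):
    """Map an identifier to a stable token using a keyed hash (HMAC-SHA-256)."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "P" + digest[:12]


def extract_for_test(rows, id_field, drop_fields, secret):
    """Copy rows for a test environment, dropping and masking identifying fields."""
    extracted = []
    for row in rows:
        # Drop directly identifying fields such as names.
        clean = {k: v for k, v in row.items() if k not in drop_fields}
        # Replace the identifier consistently so cross-source joins still work.
        clean[id_field] = pseudonym(str(row[id_field]), secret)
        extracted.append(clean)
    return extracted


# Hypothetical key and production rows from two integrated sources.
SECRET = b"hypothetical-key-kept-outside-the-test-environment"
patients = [{"patient_id": 17, "name": "Jane Roe", "birth_year": 1980}]
visits = [{"patient_id": 17, "clinic": "North", "diagnosis": "J45"}]

test_patients = extract_for_test(patients, "patient_id", {"name"}, SECRET)
test_visits = extract_for_test(visits, "patient_id", {"name"}, SECRET)
print(test_patients, test_visits)
```

Masking identifiers alone does not guarantee anonymity, because quasi-identifiers such as birth year can still single people out. The sketch only shows how referential consistency can survive the masking step.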
177	A Web-Based Approach to the Integration of Diverse Data Sources for GIS Shea, Geoffrey Yu Kai, Surveying & Spatial Information Systems, Faculty of Engineering, UNSW January 2001 The rigorous development of GIS over the past decades has enabled application developers to create powerful systems that facilitate the management of spatial data. Unfortunately, each of these systems is specific to a local service, with little or no interconnection with services in other locales. This makes it virtually impossible to perform dynamic and interactive GIS operations across multiple locales that have similar or dissimilar system configurations. The Spatial Data Transfer Standard (SDTS) partially resolved the problem by offering an excellent conceptual and logical abstraction model for data exchange. Recent advances in the Internet have shown the GIS community how an ideal concept of information interchange might be realized. A suite of new technologies that embraces Extensible Markup Language (XML), Scalable Vector Graphics (SVG), Portable Network Graphics (PNG) and Java creates a powerful new perspective that can be applied to all phases of online GIS development. Online GIS is a Web-based approach to integrating diverse spatial data sources for GIS applications. To address the spatial data integration options and implications of the Web-based approach, the investigation was undertaken in five phases: (1) Determine the mapping requirements of graphic and non-graphic spatial data for online GIS applications; (2) Analyze the requirements of spatial data integration for online environments; (3) Investigate a suitable method for integrating different formats of spatial data; (4) Study the feasibility and applicability of setting up the online GIS; and (5) Develop a prototype for online sharing of teaching resources.
A critical review of current Internet technology led to a proposed conceptual framework for spatial data integration. The framework was based on the emerging Internet technologies XML, SVG, PNG and Java. It comprised four loosely coupled modules: Application Interface, Presentation, Integrator and Data. This loosely coupled framework provides an environment that is independent of the underlying GIS data structure and makes it easy to change or update the system as new tasks or knowledge are acquired. A feasibility study was conducted to test the applicability of the proposed conceptual framework. Detailed user requirements and a system specification were devised from the feasibility study and provided guidelines for online GIS application development. They were expressed in terms of six aspects: (1) User; (2) Teaching resources management; (3) Data; (4) Cartography; (5) Functions; and (6) Software development configuration. A prototype system based on some of the devised system specifications was developed. The prototype software design adopted a three-tier client-server architecture. Because native support for SVG and PNG was inadequate in all currently available Web browsers, the prototype was implemented in HTML, Java and a vendor-specific vector format. The prototype demonstrated how teaching resources from a variety of sources and formats (including map data and non-map resources) were integrated and shared. Its implementation revealed that the Web is still an ideal medium for cost-effectively providing wider accessibility of geographical information to a larger number of users through a corporate intranet or the Internet. The investigation concluded that current WWW technology is limited in its capability for spatial data integration and for delivering online functionality.
However, the development of an XML-based GIS data model and of the graphic standards SVG and PNG for structuring and transferring spatial data on the Internet appears to be providing solutions to the current limitations. It is believed that an ideal world, in which everyone retrieves spatial information contextually through a Web browser regardless of the information's format and location, will eventually become reality. Keywords: data integration; Web mapping; online GIS; Internet mapping; geographic information systems; cartography; data processing
178	Capability-based Description and Discovery of Services Devereux, Drew Unknown Date Whenever autonomous entities work together to meet each other's needs, there arises the problem of how an entity with a need can find and use entities with the capability to meet that need. This problem is seen in Web service architectures, agent systems, and data integration systems, among others. Solutions have been proposed in each of these fields, but they all depend on implementation and interface. Hence all are restricted to their particular field, and all require their participants to conform to certain assumptions about implementation and interface. This failure to support service autonomy is conceptually unattractive and impractical. In this thesis we show how to describe and matchmake service capabilities and client needs in a way that is independent of implementation and interface. The result is a service discovery solution that fully supports the right of services to choose their own implementation and interface. Our representation is capable of capturing capabilities across a range of service types, from Web services to agents to data sources, while ignoring the implementation and interface details that distinguish them. Thus, our solution unifies these fields for description and discovery purposes, allowing data sources with complex language interfaces to compete against form-based Web services and frame-and-slot agents, for example. Moreover, our solution captures the most important aspects of capability: the conceptual meaning of, and limitations on, what a service can achieve; which requests can be expressed through a service's interface; and limitations on which attributes of information a service can return.
The provision of an interface-independent capability description raises the additional question of how to enable a client to invoke the service to which it has been matched and to correctly interpret the results returned; we solve this by providing an interface description that maps client objectives onto invocations, and returned results onto a canonical result format. Keywords: 280107 Global Information Systems; semantic Web; capability; data integration; Web services; agents
179	Information Integration in a Grid Environment Applications in the Bioinformatics Domain Radwan, Ahmed M. 16 December 2010 Grid computing emerged as a framework for supporting complex operations over large datasets; it enables the harnessing of large numbers of processors working in parallel to solve computing problems that typically spread across various domains. We focus on the problems of data management in a grid/cloud environment. We study the broader context of designing a service-oriented architecture (SOA) for information integration and identify the main components for realizing this architecture. The BioFederator is a web services-based data federation architecture for bioinformatics applications. Based on collaborations with bioinformatics researchers, several domain-specific data federation challenges and needs are identified. The BioFederator addresses these challenges and provides an architecture that incorporates a series of utility services; these address issues such as automatic workflow composition, domain semantics and the distributed nature of the data. The design also incorporates a series of data-oriented services that facilitate the actual integration of data. Schema integration is a core problem in the BioFederator context. Previous methods for schema integration rely on the exploration, implicit or explicit, of the multiple design choices that are possible for the integrated schema. Such exploration relies heavily on user interaction; thus, it is time consuming and labor intensive. Furthermore, previous methods have ignored the additional information that typically results from the schema matching process, that is, the weights and in some cases the directions that are associated with the correspondences. We propose a more automatic approach to schema integration that is based on the use of directed and weighted correspondences between the concepts that appear in the source schemas.
A key component of our approach is a ranking mechanism for the automatic generation of the best candidate schemas. The algorithm gives more weight to schemas that combine the concepts with higher similarity or coverage. Thus, the algorithm makes certain decisions that otherwise would likely be taken by a human expert. We show that the algorithm runs in polynomial time and moreover has good performance in practice. The proposed methods and algorithms are compared to state-of-the-art approaches. The BioFederator design, services, and usage scenarios are discussed. We demonstrate how our architecture can be leveraged in real-world bioinformatics applications. We performed a whole human genome annotation for nucleosome exclusion regions. The resulting annotations were studied and correlated with tissue specificity, gene density and other important gene regulation features. We also study data processing models in grid environments. MapReduce is a popular parallel programming model that is proven to scale. However, using low-level MapReduce for general data processing tasks poses the problem of developing, maintaining and reusing custom low-level user code. Several frameworks have emerged to address this problem; these frameworks share a top-down approach, in which a high-level language is used to describe the problem semantics and the framework takes care of translating this problem description into MapReduce constructs. We highlight several issues in the existing approaches and propose instead a novel refined MapReduce model that addresses the maintainability and reusability issues without sacrificing the low-level controllability offered by directly writing MapReduce code. We present MapReduce-LEGOS (MR-LEGOS), an explicit model for composing MapReduce constructs from simpler components, namely "Maplets", "Reducelets" and optionally "Combinelets".
Maplets and Reducelets are standard MapReduce constructs that can be composed to define aggregated constructs describing the problem semantics. This composition can be viewed as defining a micro-workflow inside the MapReduce job. Using the proposed model, complex problem semantics can be defined in the encompassing micro-workflow provided by MR-LEGOS while keeping the building blocks simple. We discuss the design details, main features and usage scenarios. Through experimental evaluation, we show that the proposed design is highly scalable and has good performance in practice. Keywords: Data Federation; Data Integration; Schema Integration; Bioinformatics; Grid Computing; Cloud Computing; MapReduce; Hadoop; Data Management; Extract Transform Load
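The composition idea in entry 179 (small map and reduce building blocks chained into a micro-workflow inside one job) can be illustrated with a single-process Python sketch. This is not the MR-LEGOS API, which the abstract does not show. The helpers `chain_maplets` and `run_job`, the individual maplets, and the reducelet are hypothetical names, and the shuffle step is simulated in memory rather than on Hadoop.

```python
from collections import defaultdict


def chain_maplets(*maplets):
    """Compose maplets; each takes one (key, value) pair and yields zero or more pairs."""
    def composed(key, value):
        pairs = [(key, value)]
        for maplet in maplets:
            pairs = [out for k, v in pairs for out in maplet(k, v)]
        return pairs
    return composed


def run_job(records, mapper, reducelet):
    """Single-process stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)  # in-memory "shuffle"
    return {k: reducelet(k, values) for k, values in groups.items()}


# Three simple maplets that together form the map side of a word count.
def tokenize(_, line):
    for word in line.split():
        yield word, 1


def lowercase(word, count):
    yield word.lower(), count


def drop_short(word, count):
    if len(word) > 2:
        yield word, count


def sum_reducelet(_, values):
    return sum(values)


lines = ["Grid data on a Grid", "data integration"]
counts = run_job(
    enumerate(lines),
    chain_maplets(tokenize, lowercase, drop_short),
    sum_reducelet,
)
print(counts)
```

Each building block stays a few lines long and can be reused in other chains, which is the maintainability argument the abstract makes for composing constructs rather than writing one monolithic mapper.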
180	Novel Bioinformatics Applications for Protein Allergology, Genome-Wide Association and Retrovirology Studies Martínez Barrio, Álvaro January 2010 Recently, the number of data sources within the Life Sciences has grown exponentially, to the point where efficiently managing their integration has become a difficult problem. The data avalanche we are experiencing may mark a turning point in science, with a change of orientation from proprietary to publicly available data and a concomitant acceptance of studies based on the latter. To investigate these issues, a Network of Excellence (EMBRACE) was launched with the aim of integrating the major databases and the most popular bioinformatics software tools. The focus of this thesis is therefore the problem of seamlessly integrating varied data sources and/or distributed research tools. In paper I, we developed a web service to facilitate allergenicity risk assessment, based on allergen descriptors, in order to characterize proteins with the potential for sensitization and cross-reactivity. In paper II, a web service was developed that uses a lightweight protocol to integrate human endogenous retrovirus (ERV) data within a public genome browser. This new data catalogue and many other publicly available sources were integrated and tested in a bioinformatics-rich client application. In paper III, GeneFinder, a distributed tool for genome-wide association studies, was developed and tested. Useful information about a particular genomic region can be easily retrieved and assessed. Finally, in paper IV, we developed a prototype pipeline to mine the dog genome for endogenous retroviruses and to display the transcriptional landscape of these retroviral integrations. Moreover, we further characterized a group that until this point was believed to be primate-specific. Our results also revealed that the dog has been very effective in protecting itself from such integrations.
This work integrates different applications in the fields of protein allergology, biotechnology, genome association studies and endogenous retroviruses. / EMBRACE NoE EU FP6 Keywords: data integration; web services; protein allergology; risk assessment; cross reactivity; endogenous retroviruses; ERV; dog; canine; GWAS; genome-wide association studies

Search results