121

Information Integration in a Grid Environment: Applications in the Bioinformatics Domain

Radwan, Ahmed M. 16 December 2010 (has links)
Grid computing emerged as a framework for supporting complex operations over large datasets; it enables the harnessing of large numbers of processors working in parallel to solve computing problems that typically span various domains. We focus on the problems of data management in a grid/cloud environment. The broader context of designing a service-oriented architecture (SOA) for information integration is studied, identifying the main components for realizing this architecture. The BioFederator is a web services-based data federation architecture for bioinformatics applications. Based on collaborations with bioinformatics researchers, several domain-specific data federation challenges and needs are identified. The BioFederator addresses such challenges and provides an architecture that incorporates a series of utility services; these address issues like automatic workflow composition, domain semantics, and the distributed nature of the data. The design also incorporates a series of data-oriented services that facilitate the actual integration of data.

Schema integration is a core problem in the BioFederator context. Previous methods for schema integration rely on the exploration, implicit or explicit, of the multiple design choices that are possible for the integrated schema. Such exploration relies heavily on user interaction; thus, it is time consuming and labor intensive. Furthermore, previous methods have ignored the additional information that typically results from the schema matching process, that is, the weights and, in some cases, the directions that are associated with the correspondences. We propose a more automatic approach to schema integration that is based on the use of directed and weighted correspondences between the concepts that appear in the source schemas. A key component of our approach is a ranking mechanism for the automatic generation of the best candidate schemas. The algorithm gives more weight to schemas that combine the concepts with higher similarity or coverage. Thus, the algorithm makes certain decisions that otherwise would likely be taken by a human expert. We show that the algorithm runs in polynomial time and, moreover, has good performance in practice. The proposed methods and algorithms are compared to state-of-the-art approaches.

The BioFederator design, services, and usage scenarios are discussed. We demonstrate how our architecture can be leveraged in real-world bioinformatics applications. We performed a whole human genome annotation for nucleosome exclusion regions. The resulting annotations were studied and correlated with tissue specificity, gene density, and other important gene regulation features.

We also study data processing models in grid environments. MapReduce is a popular parallel programming model that has been proven to scale. However, using low-level MapReduce for general data processing tasks poses the problem of developing, maintaining, and reusing custom low-level user code. Several frameworks have emerged to address this problem; these frameworks share a top-down approach, where a high-level language is used to describe the problem semantics, and the framework takes care of translating this problem description into MapReduce constructs. We highlight several issues in the existing approaches and propose instead a novel refined MapReduce model that addresses the maintainability and reusability issues without sacrificing the low-level controllability offered by directly writing MapReduce code.
We present MapReduce-LEGOS (MR-LEGOS), an explicit model for composing MapReduce constructs from simpler components, namely "Maplets", "Reducelets", and optionally "Combinelets". Maplets and Reducelets are standard MapReduce constructs that can be composed to define aggregated constructs describing the problem semantics. This composition can be viewed as defining a micro-workflow inside the MapReduce job. Using the proposed model, complex problem semantics can be defined in the encompassing micro-workflow provided by MR-LEGOS while keeping the building blocks simple. We discuss the design details, main features, and usage scenarios. Through experimental evaluation, we show that the proposed design is highly scalable and has good performance in practice.
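To make the composition idea concrete, here is a minimal, self-contained sketch in plain Python. It is not the thesis's Hadoop-based implementation, and the names (compose_maplets, run_job, the toy maplets and reducelet) are hypothetical; the sketch only illustrates how small map-side components could be chained into one mapper and paired with a reducer, forming a micro-workflow inside a single job.

```python
from collections import defaultdict

def compose_maplets(*maplets):
    """Chain maplets so the records emitted by one feed the next (a micro-workflow)."""
    def composed(record):
        records = [record]
        for maplet in maplets:
            records = [out for r in records for out in maplet(r)]
        return records
    return composed

# Two toy maplets: tokenize a line into (word, 1) pairs, then drop short words.
def tokenize(record):
    _, line = record
    return [(word.lower(), 1) for word in line.split()]

def drop_short(record):
    word, count = record
    return [(word, count)] if len(word) > 3 else []

# A toy reducelet: sum the counts collected for one key.
def sum_counts(key, values):
    return (key, sum(values))

def run_job(lines, mapper, reducer):
    """Tiny in-memory stand-in for a MapReduce job; grouping by key plays the shuffle."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in mapper((offset, line)):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

if __name__ == "__main__":
    data = ["Grid computing meets MapReduce", "MapReduce jobs compose small maplets"]
    mapper = compose_maplets(tokenize, drop_short)
    print(run_job(data, mapper, sum_counts))
```

In an actual Hadoop job, the composed mapper and the reducelet would be handed to the framework, with the shuffle replacing the in-memory grouping used here.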
122

SparkBLAST: utilização da ferramenta Apache Spark para a execução do BLAST em ambiente distribuído e escalável / SparkBLAST: using Apache Spark to run BLAST in a distributed and scalable environment

Castro, Marcelo Rodrigo de 13 February 2017 (has links)
Funding: Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), and Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ).

With the evolution of next-generation sequencing devices, the cost of obtaining genomic data has been reduced significantly. With reduced costs for sequencing, the amount of genomic data to be processed has increased exponentially. Such data growth outpaces the rate at which computing power can be increased year after year by hardware and software evolution. Thus, the higher rate of data growth in bioinformatics raises the need for exploiting more efficient and scalable techniques based on parallel and distributed processing, including platforms such as clusters and cloud computing. BLAST is a widely used tool for genomic sequence alignment, which has native support for multicore-based parallel processing; however, its scalability is limited to a single machine. Cloud computing, on the other hand, has emerged as an important technology for supporting rapid and elastic provisioning of large amounts of resources. Current frameworks like Apache Hadoop and Apache Spark provide support for the execution of distributed applications. Such environments provide mechanisms for embedding external applications in order to compose large distributed jobs that can be executed on clusters and cloud platforms. In this work, we used Spark to support the highly scalable and efficient parallelization of BLAST (Basic Local Alignment Search Tool) to execute on dozens to hundreds of processing cores on a cloud platform. As a result, our prototype demonstrated better performance and scalability than CloudBLAST, a Hadoop-based parallelization of BLAST.

With the reduction in cost and the evolution of genome sequencing devices, there has been a large increase in the amount of genomic data. This data has grown at rates higher than the rate at which industry has been able to increase the power of computers each year. To better meet the need for data processing and analysis in bioinformatics, parallel and distributed systems are used, for example clusters, grids, and computational clouds. However, many tools, such as BLAST, which aligns sequences against databases, were not developed to be processed in a distributed and scalable way. The current frameworks Apache Hadoop and Apache Spark allow applications to be executed in a distributed and parallel fashion, provided the applications can be properly adapted and parallelized. Studies that improve the performance of bioinformatics applications have become a continuous effort. Spark has proven to be a robust tool for massive data processing. In this master's research, Apache Spark was used to support the parallelization of BLAST (Basic Local Alignment Search Tool). Experiments carried out on Google Cloud and Microsoft Azure show that the speedup obtained was similar to or better than that of similar work already developed on Hadoop.
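As an illustration of the general approach (not the actual SparkBLAST code), the PySpark sketch below shows how an external aligner can be driven from Spark by piping each partition through the blastn binary. The HDFS paths, the database location, and the one-record-per-line input convention are assumptions made for this example, and a BLAST+ installation is assumed on every worker.

```python
from pyspark import SparkContext

sc = SparkContext(appName="sparkblast-sketch")

# Assumption: the input stores one FASTA record per line, with the header and
# sequence separated by a tab, so Spark's line-based partitioning never splits
# a record. Paths and the BLAST database name are placeholders.
queries = sc.textFile("hdfs:///user/demo/queries_one_record_per_line.fa",
                      minPartitions=64)

# Restore the two-line FASTA layout, then stream each partition through the
# external blastn binary; blastn reads queries from stdin when no -query file
# is given and writes tabular hits (-outfmt 6) to stdout.
fasta = queries.map(lambda rec: rec.replace("\t", "\n"))
hits = fasta.pipe("blastn -db /data/blastdb/refdb -outfmt 6")

hits.saveAsTextFile("hdfs:///user/demo/blast_hits")
sc.stop()
```

The design choice here mirrors the text above: Spark handles partitioning, scheduling, and fault tolerance, while the unmodified aligner does the actual sequence comparison.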
123

Scheduling workflows to optimize for execution time

Peters, Mathias January 2018 (has links)
Many functions in today's society are immensely dependent on data. Data drives everything from business decisions to self-driving cars to intelligent home assistants like Amazon Echo and Google Home. To make good decisions based on data, of which exabytes are generated every day, that data somehow has to be processed. Data processing can be complex and time-consuming. One way of reducing the complexity is to create workflows that consist of several steps that together produce the right result. Klarna is an example of a company that relies on workflows for transforming and analyzing data. As a company whose core business involves analyzing customer data, being able to do those analyses faster leads to direct business value in the form of more well-informed decisions. The workflows Klarna uses are currently all written in a sequential form. However, workflows where independent tasks are executed in parallel are more performant than workflows where only one task is executed at any point in time. Due to limitations in human attention span, parallelized workflows are harder for humans to write than sequential workflows. In this work, a computer application was created that automates the parallelization of a workflow, letting humans write sequential workflows while still getting the performance of parallelized workflows. The application does this by taking a simple sequential workflow, identifying dependencies in the workflow, and then scheduling it in a way that is as parallel as possible given the identified dependencies. Such a solution has not been created before. Experimental evaluation shows that parallelization of a sequential workflow used in daily production at Klarna can reduce execution time by up to 80%, showing that the application can bring value to Klarna and other organizations that use workflows to analyze big data.
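The scheduling idea (run every task as soon as its dependencies are done) can be sketched in a few lines of Python. This is an illustrative toy, not the application built in the thesis; the workflow definition and task bodies are hypothetical, and the sketch assumes the dependency graph is acyclic.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def make_task(name):
    """Stand-in for a real workflow step (e.g. a query or a data export)."""
    def run():
        print(f"running {name}")
        return name
    return run

# Hypothetical workflow: task name -> (prerequisite task names, callable).
workflow = {
    "extract_a": (set(),                      make_task("extract_a")),
    "extract_b": (set(),                      make_task("extract_b")),
    "transform": ({"extract_a", "extract_b"}, make_task("transform")),
    "report":    ({"transform"},              make_task("report")),
}

def run_parallel(workflow, max_workers=4):
    """Run each task as soon as all of its dependencies have finished.
    Assumes the dependency graph is acyclic."""
    done, running = set(), {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(done) < len(workflow):
            # Submit every not-yet-started task whose prerequisites are satisfied.
            for name, (deps, fn) in workflow.items():
                if name not in done and name not in running and deps <= done:
                    running[name] = pool.submit(fn)
            # Block until at least one running task finishes, then record it.
            finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
            for name in [n for n, fut in running.items() if fut in finished]:
                running.pop(name).result()
                done.add(name)

if __name__ == "__main__":
    run_parallel(workflow)
```

Here extract_a and extract_b run concurrently, while transform and report wait for their prerequisites, which is exactly the speed-up opportunity a sequential script forgoes.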
124

Melhorando o processamento de dados com Hadoop na nuvem através do uso transparente de instâncias oportunistas com qualidade de serviço / Improving Hadoop data processing in the cloud through the transparent use of opportunistic instances with quality of service

NÓBREGA, Telles Mota Vidal. 09 May 2018 (has links)
Computational clouds offer users the ability to acquire resources over the internet quickly, cheaply, and securely. However, a large part of these clouds remains idle because of resource reservation. Aiming to increase cloud utilization, cloud providers have created an instance model that reuses idle resources, known as opportunistic instances. These instances are cheaper than dedicated-resource instances, but they are volatile and can be preempted from the user at any moment, which makes them unsuitable for some types of application. Data processing, following the trend of other applications, has been migrating to the cloud and can benefit from opportunistic instances because of its fault-tolerant nature, resulting in the creation of clusters at a lower cost than instances with dedicated resources. This work proposes using idle resources to create another model of opportunistic instances: opportunistic instances with quality of service, which are created based on a prediction of the state of the cloud. The prediction is made from historical resource-usage data such as CPU and RAM, thereby reducing the risk of losing instances before processing finishes. Even with the predictor, the risk of losing a machine still exists, and for this case we propose live migration, moving the virtual machine to another server and thus avoiding its destruction. With our approach, using only two opportunistic instances during the experiments, we obtained a 10% reduction in data processing time in a cluster with 2 workers and 1 master. Moreover, when using migration, processing time improves by approximately 70% compared with the cases where an instance is lost.

Cloud computing offers users fast, cheap, and secure acquisition of resources over the Internet. However, these clouds have many idle resources due to resource reservation. Aiming to increase resource usage, cloud providers have created an instance model that uses these idle resources, known as opportunistic instances. These instances are cheaper than dedicated-resource instances, but they are volatile and can be destroyed at any time, which makes them unsuitable for some types of application. Data processing, following the trend of other applications, has been migrating to the cloud and can benefit from opportunistic instances due to its fault-tolerant nature, resulting in the creation of clusters at a lower cost compared to instances with dedicated resources. In this work, we propose the use of idle resources to create another model of opportunistic instances. This model aims to create opportunistic instances with quality of service, which are instances created based on a prediction of the state of the cloud. The prediction is made from historical resource-usage data such as CPU and RAM, thus reducing the risk of losing instances before processing ends. Even with a predictor, the risk of losing a machine still exists, and for this case we propose the use of live migration, moving the virtual machine to a different server and thus avoiding its destruction. With our approach, using only two opportunistic instances during the experiments, we found a decrease of 10% in data processing time in a cluster with 2 workers and 1 master. Furthermore, when using migration, we observe an improvement of approximately 70% in processing time compared with the case where an instance is lost.
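For intuition only, the following sketch shows the flavour of such a prediction step: a host is judged safe for hosting an opportunistic instance when a moving average of its historical CPU and RAM usage stays below a threshold. The window size, thresholds, and host names are invented for the example and do not reflect the predictor actually used in the dissertation.

```python
from statistics import mean

def predict_is_safe(cpu_history, ram_history, horizon=6,
                    cpu_threshold=0.7, ram_threshold=0.8):
    """Toy predictor: a host is considered safe for an opportunistic instance
    if the recent moving averages of CPU and RAM usage stay below thresholds.
    Window size and thresholds are illustrative assumptions."""
    recent_cpu = mean(cpu_history[-horizon:])
    recent_ram = mean(ram_history[-horizon:])
    return recent_cpu < cpu_threshold and recent_ram < ram_threshold

def place_or_migrate(hosts, usage):
    """Pick a host predicted to stay idle; in a real system a loaded host
    would instead trigger live migration of the opportunistic VM."""
    for host in hosts:
        cpu, ram = usage[host]
        if predict_is_safe(cpu, ram):
            return host
    return None  # no safe host: keep the VM where it is or migrate later

usage = {
    "host-1": ([0.9, 0.85, 0.8, 0.9, 0.95, 0.9], [0.7, 0.75, 0.8, 0.85, 0.9, 0.9]),
    "host-2": ([0.2, 0.25, 0.3, 0.2, 0.15, 0.2], [0.4, 0.45, 0.4, 0.35, 0.4, 0.4]),
}
print(place_or_migrate(["host-1", "host-2"], usage))  # -> host-2
```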
125

Integrating Heterogeneous Data

Nieva, Gabriel January 2016 (has links)
Technological advances, particularly in the areas of processing and storage, have made it possible to gather an unprecedentedly vast and heterogeneous amount of data. The evolution of the internet (particularly social media, the Internet of Things, and mobile technology), together with new business trends, has ushered in the age of Big Data and added complexity to the integration task. The objective of this study has been to explore the question of data heterogeneity through a systematic literature review. The study surveys the drivers of this data heterogeneity and its inner workings, and it explores the interrelated fields and technologies that deal with the capture, organization, and mining of this data, as well as their limitations. Developments such as Hadoop and its suite of components, together with new computing paradigms such as cloud computing and virtualization, help to manage the unprecedented amount of rapidly changing, heterogeneous data we see today. Despite these dramatic developments, the study shows that there are gaps which need to be filled in order to tackle the challenges of Web 3.0.
126

Uma arquitetura para internet das coisas para análise da concentração de monóxido de carbono na grande São Paulo por meio de técnicas de Big Data / An Internet of Things architecture for analyzing carbon monoxide concentration in Greater São Paulo using Big Data techniques

Borges, Marco Aurelio 18 August 2017 (has links)
Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and Fundo Mackenzie de Pesquisa.

The use of sensors to monitor a given environment, combined with the Internet as a means of communication, is popularly known as the Internet of Things (IoT). The amount of information generated in this environment has led to an unprecedented increase in data collection. One of the major challenges for its development lies in storing and processing this huge volume of data within acceptable measurement and analysis parameters. This research takes up this challenge, from storing and compiling data from different sensors to carrying out an exploratory analysis of the information gathered. In this research, sensors that collect data from a specific part of the São Paulo Metropolitan Area (SMA) were analysed; these sensors are capable of measuring carbon monoxide (CO) levels. The research also analyses architectures for both batch and stream sensor processing and uses one of them for the construction of a Big Data environment. Big Data tools were used for storing, processing, and visualizing the IoT data. During the experiments, carbon monoxide sensors (MQ7) were analysed; they were connected through a microcontroller unit that supports the Transmission Control Protocol/Internet Protocol (TCP/IP). This project highlights the tools necessary to execute and analyse the data in a dynamic manner. The data collected by the sensors show that the average levels of carbon monoxide are well above the international standards set by the World Health Organization (WHO).

The use of sensors to monitor a given environment, allied with the internet as a means of communication, is popularly called the Internet of Things (IoT). The amount of information generated in this IoT environment has driven an accumulation of data never before imagined. One of the important challenges for its development is to store and process this large volume of data within acceptable measurement and analysis parameters. This research addresses that challenge, from the storage and compilation of data from several sensors to the exploratory analysis of the information obtained. The research analysed data-collection sensors in the São Paulo Metropolitan Region (RMSP), with sensors capable of measuring carbon monoxide (CO) levels. It also analyses architectures for batch and stream processing of sensor data and uses one of them to build a Big Data environment. Big Data tools were used for storing, processing, and visualizing this IoT data. In the experiments developed in this research, carbon monoxide sensors (MQ7) were analysed, connected through a microcontroller unit that supports the Transmission Control Protocol/Internet Protocol (TCP/IP). This project highlights how to compile, execute, and analyse these data dynamically. The data obtained by the IoT sensors show that the average of the collected levels is well above the international standards established by the WHO.
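As a small illustration of the kind of check implied by that conclusion, the sketch below keeps a rolling window of MQ7 readings and flags windows whose average exceeds a guideline value. The 9 ppm threshold is an assumption standing in for the WHO 8-hour CO guideline, and the window length and readings are invented.

```python
from collections import deque

# Assumed guideline: roughly 9 ppm (about 10 mg/m3) as an 8-hour mean;
# treat the exact figure as a configurable threshold, not an authoritative value.
WHO_8H_LIMIT_PPM = 9.0

class RollingCOMonitor:
    """Keeps a rolling window of MQ7 readings (in ppm) and flags windows
    whose average exceeds the configured guideline value."""
    def __init__(self, window_size=480, limit_ppm=WHO_8H_LIMIT_PPM):
        self.readings = deque(maxlen=window_size)  # e.g. one reading per minute
        self.limit_ppm = limit_ppm

    def add(self, ppm):
        self.readings.append(ppm)
        average = sum(self.readings) / len(self.readings)
        return average > self.limit_ppm  # True: rolling average exceeds the limit

monitor = RollingCOMonitor(window_size=5, limit_ppm=9.0)
for reading in [4.2, 8.1, 12.5, 15.0, 11.3]:
    exceeded = monitor.add(reading)
    print(f"reading={reading:5.1f} ppm  exceeds_guideline={exceeded}")
```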
127

Spatial Data Mining Analytical Environment for Large Scale Geospatial Data

Yang, Zhao 16 December 2016 (has links)
Nowadays, many applications are continuously generating large-scale geospatial data. Vehicle GPS tracking data, aerial surveillance drones, LiDAR (Light Detection and Ranging), world-wide spatial networks, and high-resolution optical or Synthetic Aperture Radar imagery all generate huge amounts of geospatial data. However, as data collection increases, our ability to process this large-scale geospatial data in a flexible fashion is still limited. We propose a framework for processing and analyzing large-scale geospatial and environmental data using a "Big Data" infrastructure. Existing Big Data solutions do not include a specific mechanism to analyze large-scale geospatial data. In this work, we extend HBase with a spatial index (R-Tree) and HDFS to support geospatial data and demonstrate its analytical use with some common geospatial data types and data mining technology provided by the R language. The resulting framework has a robust capability to analyze large-scale geospatial data using spatial data mining and makes its outputs available to end users.
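The spatial-index side of such a framework can be illustrated with a tiny standalone example using the third-party rtree Python package (a libspatialindex binding). This is only a sketch of the R-Tree idea, not the HBase extension described in the thesis, and the coordinates are made-up sample points.

```python
# pip install rtree  (requires libspatialindex)
from rtree import index

# Insert a few GPS points as degenerate bounding boxes (minx, miny, maxx, maxy).
points = {
    1: (-90.07, 29.95),   # sample point near New Orleans
    2: (-90.10, 29.98),
    3: (-118.24, 34.05),  # sample point near Los Angeles
}
idx = index.Index()
for pid, (lon, lat) in points.items():
    idx.insert(pid, (lon, lat, lon, lat))

# Window query: which points fall inside a bounding box around New Orleans?
window = (-90.2, 29.9, -90.0, 30.0)
print(sorted(idx.intersection(window)))   # -> [1, 2]

# Nearest-neighbour query for an arbitrary location.
print(list(idx.nearest((-90.08, 29.96, -90.08, 29.96), 1)))  # -> [1]
```

The same window and nearest-neighbour primitives are what a distributed store needs to expose so that spatial data mining routines (for example in R) can avoid scanning every record.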
128

Apache Hadoop jako analytická platforma / Apache Hadoop as analytics platform

Brotánek, Jan January 2017 (has links)
This diploma thesis focuses on integrating the Hadoop platform into an existing data warehouse architecture. The theoretical part describes the properties of Big Data together with methods and models for processing them. The Hadoop framework, its components, and its distributions are discussed, as are the components that enable end users, developers, and analysts to access a Hadoop cluster. The practical part presents a case study of batch data extraction from the current Oracle-based data warehouse with the aid of the Sqoop tool, transformation of the data in the relational structures of the Hive component, and uploading it back to the original source. Data compression and query efficiency with various storage formats are also discussed. The quality and consistency of the manipulated data are checked during all phases of the process. Part of the practical section covers ways of capturing and storing streaming data: the Flume tool is used to capture the stream, and the data are then transformed with Pig. The purpose of implementing the process is to move part of the data and its processing from the current data warehouse to the Hadoop cluster; therefore, a process for integrating the current data warehouse with the Hortonworks Data Platform and its components was designed.
129

Analýza Big Data v oblasti zdravotnictví / Big Data analysis in healthcare

Nováková, Martina January 2014 (has links)
This thesis deals with the analysis of Big Data in healthcare. The aim is to define the term Big Data, to acquaint the reader with data growth in the world and in the health sector, to explain the concept of a data expert, and to define the members of a data expert team. The following chapters define the phases of Big Data analysis according to the EMC2 methodology and describe basic technologies for analysing Big Data. I consider the part dealing with the tasks in which Big Data technologies are already used in healthcare to be particularly beneficial and interesting. In the practical part, I perform a Big Data analysis task focusing on meteorotropic diseases, using real medical and meteorological data. The reader is acquainted not only with one of the recommended methods of analysis and the statistical models used, but also with terms from the fields of biometeorology and healthcare. An integral part of the analysis is information about its limitations, consultation on the results, and the conclusions of experts in meteorology and healthcare.
130

Srovnání distribuovaných "NoSQL" databází s důrazem na výkon a škálovatelnost / Comparison of distributed "NoSQL" databases with focus on performance and scalability

Vrbík, Tomáš January 2011 (has links)
This paper focuses on NoSQL database systems. These systems currently serve as a supplement to, rather than a replacement for, relational database systems. The aim of this paper is to compare four selected NoSQL database systems (MongoDB, Apache Cassandra, Apache HBase, and Redis) with a main focus on performance and scalability. The performance comparison is done using a simulated workload in a four-node cluster environment. One relational SQL database is also benchmarked to provide a comparison between the classic and modern ways of maintaining structured data. As a result of the comparison, I found that none of these database systems can be labeled "the best", as each of the compared systems is suitable for a different production deployment.
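For a flavour of what a simulated key-value workload looks like in code, here is a micro-benchmark sketch against one of the compared systems (Redis) using the redis-py client. It is not the benchmark harness used in the thesis; it assumes a Redis server reachable on localhost, and the operation count is arbitrary.

```python
# pip install redis  -- assumes a local Redis server is running on port 6379.
import time
import redis

def bench_redis(n_ops=10_000, host="localhost", port=6379):
    r = redis.Redis(host=host, port=port)

    start = time.perf_counter()
    for i in range(n_ops):          # write phase
        r.set(f"key:{i}", f"value:{i}")
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    for i in range(n_ops):          # read phase
        r.get(f"key:{i}")
    read_s = time.perf_counter() - start

    print(f"{n_ops} writes: {write_s:.2f}s ({n_ops / write_s:.0f} ops/s)")
    print(f"{n_ops} reads:  {read_s:.2f}s ({n_ops / read_s:.0f} ops/s)")

if __name__ == "__main__":
    bench_redis()
```

A fair cross-system comparison would of course use the same record sizes, access distributions, and cluster topology for every database, which is what the simulated workload in the paper is for.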
