21

E-science and current scientific research practices

Appel, Andre Luiz 24 March 2014 (has links)
Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / This study investigated the relationship between new practices of collaborative production of scientific knowledge and the development and use of technological platforms that support collaborative research, a movement known as e-Science. It took into account the differing views, perspectives, and interests of the actors in these practices, their choices about adopting and using e-Science research platforms, and the definitions of rights of access to and use of research data and results within those practices. The concept of e-Science explored here refers to a movement toward a data-intensive science in which collaboration takes place through research platforms built on advanced computing. The methodological approach began with a documentary phase, a literature survey exploring the history, concepts, and practices related to e-Science. As a second step, an empirical study examined one e-Science experience: the case of the Conseil Européen pour la Recherche Nucléaire (CERN). For this case study, we interviewed researchers and specialists who collaborate on CERN-based projects and on other research programs and initiatives related to e-Science.
The interviews were analyzed against categories and key concepts drawn from the initial research questions and the theoretical framework. One of the main results highlighted throughout the work is the perception of governance as a significant dimension of e-Science. Also prominent are the implications of research-funding conditions and of the ways actors and research groups organize themselves to enable collaboration, as well as the recognition of data as important assets in the processes of science production and of how this affects the structures for assessing and measuring outcomes in those processes. Overall, the proposed objectives and covered topics converged; the major challenge for the work was the lack of Brazilian studies and publications on the subject. As future work, there is potential for developing a framework for analyzing and studying governance in e-Science, focused on opening up the processes of creating and handling data from collaborative research, and for initiatives by the Brazilian scientific community aligned with these processes, securing it a strategic position in the field of e-Science collaborations.
22

Building Evolutionary Clustering Algorithms on Spark

Fu, Xinye January 2017 (has links)
Evolutionary clustering (EC) is a family of clustering algorithms for handling noise in time-evolving data. By taking history into account, EC can track the drift of the true clustering over time. EC tries to fit the clustering result to both the current data and the historical data/model, so each EC algorithm defines a snapshot cost (SC) and a temporal cost (TC) to capture these two requirements. EC algorithms minimize SC and TC by different methods, and they differ in their ability to handle a changing number of clusters, the addition and deletion of nodes, and so on. To date there are more than ten EC algorithms but no survey of them, so this thesis presents one. The survey first introduces the application scenarios of EC, the definition of EC, and the history of EC algorithms. It then introduces, one by one, two categories of EC algorithms: model-level algorithms and data-level algorithms. Each algorithm is compared with the others, and a performance prediction is given: in theory, algorithms that optimize the whole problem (i.e., that optimize the change parameter, or do not use a change parameter for control) and that accept a change in the number of clusters perform best.
EC algorithms typically process large datasets and involve many iterative, data-intensive computations, so they are well suited to implementation on Spark. Since no Spark implementation of an EC algorithm existed, four EC algorithms were implemented on Spark in this project. The thesis covers three aspects of the implementation. First, algorithms that parallelize well and have wide applicability were selected for implementation. Second, the program design of each algorithm is described in detail. Finally, the implementations are verified through correctness and efficiency experiments.
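The snapshot/temporal trade-off described in this abstract can be made concrete with a small sketch. This is not code from the thesis: it is a minimal illustration, assuming the common formulation in which the total cost is the snapshot cost plus a change parameter `cp` times the temporal cost, with SC measured as the sum of squared distances to assigned centroids and TC as the squared drift of centroids between timesteps.

```python
import numpy as np

def snapshot_cost(X, centroids, labels):
    """SC: sum of squared distances from each point to its assigned centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))

def temporal_cost(centroids, prev_centroids):
    """TC: squared drift between current and previous centroids."""
    return float(np.sum((centroids - prev_centroids) ** 2))

def evolutionary_cost(X, centroids, labels, prev_centroids, cp=0.5):
    """Total EC objective: fit the current data (SC) while staying
    close to the historical model (TC), weighted by change parameter cp."""
    return snapshot_cost(X, centroids, labels) + cp * temporal_cost(centroids, prev_centroids)
```

An EC algorithm would minimize `evolutionary_cost` at each timestep; with `cp = 0` it degenerates to ordinary (history-free) clustering.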
23

A Shared-Memory Coupled Architecture to Leverage Big Data Frameworks in Prototyping and In-Situ Analytics for Data Intensive Scientific Workflows

Lemon, Alexander Michael 01 July 2019 (has links)
There is a pressing need for creative new data analysis methods which can sift through scientific simulation data and produce meaningful results. The types of analyses and the amount of data handled by current methods are still quite restricted, and new methods could provide scientists with a large productivity boost. New methods could be simple to develop in big data processing systems such as Apache Spark, which is designed to process many input files in parallel while treating them logically as one large dataset. This distributed model, combined with the large number of analysis libraries created for the platform, makes Spark ideal for processing simulation output.
Unfortunately, the filesystem becomes a major bottleneck in any workflow that uses Spark in such a fashion. Faster transports are not intrinsically supported by Spark, and its interface almost denies the possibility of maintainable third-party extensions. By leveraging the semantics of Scala and Spark's recent scheduler upgrades, we force co-location of Spark executors with simulation processes and enable fast local inter-process communication through shared memory. This provides a path for bulk data transfer into the Java Virtual Machine, removing the current Spark ingestion bottleneck.
Besides showing that our system makes this transfer feasible, we also demonstrate a proof-of-concept system integrating traditional HPC codes with bleeding-edge analytics libraries. This provides scientists with guidance on how to apply our libraries to gain a new and powerful tool for developing new analysis techniques in large scientific simulation pipelines.
24

Advanced middleware support for distributed data-intensive applications

Du, Wei 12 September 2005 (has links)
No description available.
25

Supporting Fault Tolerance and Dynamic Load Balancing in FREERIDE-G

Bicer, Tekin 23 August 2010 (has links)
No description available.
26

Specification, Configuration and Execution of Data-intensive Scientific Applications

Kumar, Vijay Shiv 14 December 2010 (has links)
No description available.
27

Optimizing data management for MapReduce applications on large-scale distributed infrastructures

Moise, Diana Maria 16 December 2011 (has links) (PDF)
Data-intensive applications are widely used across many domains to extract and process information, design complex systems, run simulations of real-world models, and so on. These applications pose complex challenges in both storage and computation. Within data-intensive applications, we focus on the MapReduce paradigm and its implementations. Introduced by Google, the MapReduce abstraction revolutionized the data-intensive community and quickly spread to a variety of research and production domains. An open-source implementation of the abstraction put forward by Google was provided by Yahoo! through the Hadoop project. The Hadoop framework is considered the reference implementation of MapReduce and is now widely used for various purposes and on several infrastructures. We propose a distributed file system, optimized for highly concurrent access, that can serve as the storage layer for MapReduce applications. We designed the BlobSeer File System (BSFS), based on BlobSeer, a highly efficient distributed storage service that facilitates large-scale data sharing. We also study several aspects of managing intermediate data in MapReduce environments, exploring the requirements on MapReduce intermediate data at two levels: within a single MapReduce job, and during the execution of pipelines of MapReduce applications. Finally, we propose extensions to Hadoop, a popular open-source MapReduce environment, such as support for the append operation. This work also includes evaluation and results obtained on large-scale infrastructures: computing grids and clouds.
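For readers unfamiliar with the paradigm this entry builds on, the MapReduce contract can be shown in miniature. This sketch is not from the thesis and ignores the distribution, shuffling, and fault tolerance that Hadoop provides; it only illustrates the map (emit key-value pairs per record) and reduce (aggregate per key) phases with the classic word count.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in record.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each key. A real framework shuffles
    pairs so that each reducer sees exactly one key's values."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

records = ["map reduce map", "reduce"]
result = reduce_phase(chain.from_iterable(map_phase(r) for r in records))
# result == {"map": 2, "reduce": 2}
```

The intermediate data the thesis studies is precisely the stream of `(key, value)` pairs between the two phases, which in Hadoop is materialized and shuffled across the network.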
28

Analyzing hybrid architectures for massively parallel graph analysis

Ediger, David 08 April 2013 (has links)
The quantity of rich, semi-structured data generated by sensor networks, scientific simulation, business activity, and the Internet grows daily. The objective of this research is to investigate architectural requirements for emerging applications in massive graph analysis. Using emerging hybrid systems, we will map applications to architectures and close the loop between software and hardware design in this application space. Parallel algorithms and specialized machine architectures are necessary to handle the immense size and rate of change of today's graph data. To highlight the impact of this work, we describe a number of relevant application areas ranging from biology to business and cybersecurity. With several proposed architectures for massively parallel graph analysis, we investigate the interplay of hardware, algorithm, data, and programming model through real-world experiments and simulations. We demonstrate techniques for obtaining parallel scaling on multithreaded systems using graph algorithms that are orders of magnitude faster and larger than the state of the art. The outcome of this work is a proposed hybrid architecture for massive-scale analytics that leverages key aspects of data-parallel and highly multithreaded systems. In simulations, the hybrid systems incorporating a mix of multithreaded, shared memory systems and solid state disks performed up to twice as fast as either homogeneous system alone on graphs with as many as 18 trillion edges.
29

Performance Evaluation of Data Intensive Computing In The Cloud

Kaza, Bhagavathi 01 January 2013 (has links)
Big data is a topic of active research in the cloud community. With increasing demand for data storage in the cloud, study of data-intensive applications is becoming a primary focus. Data-intensive applications involve high CPU usage for processing large volumes of data on the scale of terabytes or petabytes. While some research exists for the performance effect of data intensive applications in the cloud, none of the research compares the Amazon Elastic Compute Cloud (Amazon EC2) and Google Compute Engine (GCE) clouds using multiple benchmarks. This study performs extensive research on the Amazon EC2 and GCE clouds using the TeraSort, MalStone and CreditStone benchmarks on Hadoop and Sector data layers. Data collected for the Amazon EC2 and GCE clouds measure performance as the number of nodes is varied. This study shows that GCE is more efficient for data-intensive applications compared to Amazon EC2.
30

Storage Format Selection and Optimization for Materialized Intermediate Results in Data-Intensive Flows

Munir, Rana Faisal 01 February 2021 (has links)
Modern organizations produce and collect large volumes of data that need to be processed repeatedly and quickly to gain business insights. For such processing, Data-intensive Flows (DIFs) are typically deployed on distributed processing frameworks. The DIFs of different users have many computation overlaps (i.e., parts of the processing are duplicated), which wastes computational resources and increases the overall cost. The output of these computation overlaps (known as intermediate results) can be materialized for reuse, which, when done properly, reduces cost and saves computational resources. Furthermore, the way such outputs are materialized must be considered, as different storage layouts (i.e., horizontal, vertical, and hybrid) can be used to reduce the I/O cost. In this PhD work, we first propose a novel approach for automatically materializing the intermediate results of DIFs through a multi-objective optimization method, which can handle multiple, conflicting quality metrics. Next, we study the behavior of the different DIF operators that are the first to process the loaded materialized results. Based on this study, we devise a rule-based approach that decides the storage layout for materialized results according to the subsequent operation types. Although the heuristic rules improve cost in general, they do not consider the amount of data read when making the choice, which can lead to a wrong decision. We therefore design a cost model capable of finding the right storage layout for every scenario. The cost model uses data and workload characteristics to estimate the I/O cost of a materialized intermediate result under each storage layout and chooses the one with minimum cost. The results show that well-chosen storage layouts reduce the loading time of materialized results and, overall, improve the performance of DIFs. The thesis also focuses on optimizing the configurable parameters of hybrid layouts.
We propose ATUN-HL (Auto TUNing Hybrid Layouts), which, based on the same cost model and given the workload and data characteristics, finds optimal values for the configurable parameters of hybrid layouts (i.e., Parquet). Finally, the thesis also studies the impact of parallelism in DIFs and hybrid layouts. Our proposed cost model supports an approach for fine-tuning parallelism by deciding the number of tasks and machines used to process the data. Thus, the cost model proposed in this thesis enables choosing the best possible storage layout for materialized intermediate results, tuning the configurable parameters of hybrid layouts, and estimating the number of tasks and machines for the execution of DIFs.
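The layout decision this abstract describes can be illustrated with a deliberately tiny cost model. This sketch is not the thesis's model: it assumes only that a horizontal (row) layout reads every byte of every row regardless of the projection, while a vertical (column) layout reads only the projected columns, and it ignores tuple reconstruction, compression, and block effects that a real model must account for.

```python
def scan_cost(n_rows, col_bytes, projected, layout):
    """Rough I/O cost (bytes read) of scanning `projected` columns
    from a table of `n_rows` rows with per-column widths `col_bytes`."""
    total_row = sum(col_bytes.values())
    if layout == "horizontal":
        return n_rows * total_row                                  # full rows read
    if layout == "vertical":
        return n_rows * sum(col_bytes[c] for c in projected)       # only needed columns
    raise ValueError(f"unknown layout: {layout}")

def choose_layout(n_rows, col_bytes, projected):
    """Pick the layout with the minimum estimated scan cost."""
    return min(("horizontal", "vertical"),
               key=lambda layout: scan_cost(n_rows, col_bytes, projected, layout))
```

For example, projecting one 8-byte column out of a wide row makes the vertical layout win by a large margin, which is the kind of estimate the thesis's cost model produces before choosing among horizontal, vertical, and hybrid layouts.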
