Global ETD Search

11	Analyzing hybrid architectures for massively parallel graph analysis
Ediger, David. 08 April 2013.
The quantity of rich, semi-structured data generated by sensor networks, scientific simulation, business activity, and the Internet grows daily. The objective of this research is to investigate architectural requirements for emerging applications in massive graph analysis. Using emerging hybrid systems, we will map applications to architectures and close the loop between software and hardware design in this application space. Parallel algorithms and specialized machine architectures are necessary to handle the immense size and rate of change of today's graph data. To highlight the impact of this work, we describe a number of relevant application areas ranging from biology to business and cybersecurity. With several proposed architectures for massively parallel graph analysis, we investigate the interplay of hardware, algorithm, data, and programming model through real-world experiments and simulations. We demonstrate techniques for obtaining parallel scaling on multithreaded systems using graph algorithms that are orders of magnitude faster and larger than the state of the art. The outcome of this work is a proposed hybrid architecture for massive-scale analytics that leverages key aspects of data-parallel and highly multithreaded systems. In simulations, the hybrid systems incorporating a mix of multithreaded, shared memory systems and solid state disks performed up to twice as fast as either homogeneous system alone on graphs with as many as 18 trillion edges.
Keywords: Data intensive computing; Computer architectures; Cray XMT; Streaming graph algorithms; Multithreaded graph algorithms; Computer algorithms; Graph algorithms; Parallel algorithms
12	Performance Evaluation of Data Intensive Computing In The Cloud
Kaza, Bhagavathi. 01 January 2013.
Big data is a topic of active research in the cloud community. With increasing demand for data storage in the cloud, the study of data-intensive applications is becoming a primary focus. Data-intensive applications involve high CPU usage for processing large volumes of data on the scale of terabytes or petabytes. While some research exists on the performance of data-intensive applications in the cloud, none of it compares the Amazon Elastic Compute Cloud (Amazon EC2) and Google Compute Engine (GCE) clouds using multiple benchmarks. This study evaluates the Amazon EC2 and GCE clouds extensively using the TeraSort, MalStone and CreditStone benchmarks on the Hadoop and Sector data layers. The data collected for the Amazon EC2 and GCE clouds measure performance as the number of nodes is varied. This study shows that GCE is more efficient than Amazon EC2 for data-intensive applications.
Keywords: Thesis; University of North Florida; UNF; Other Computer Engineering
13	Autonomie, sécurité et QoS de bout en bout dans un environnement de Cloud Computing / Security, QoS and self-management within an end-to-end Cloud Computing environment
Hamze, Mohamad. 07 December 2015.
Today, Cloud Networking is one of the recent research areas within the Cloud Computing research community. The main challenges of Cloud Networking concern not only the guarantee of Quality of Service (QoS) and security but also their management in conformance with a corresponding Service Level Agreement (SLA). In this thesis, we propose a framework for resource allocation according to an end-to-end SLA established between a Cloud Service User (CSU) and several Cloud Service Providers (CSPs) within a Cloud Networking environment (Inter-Cloud Broker and Federation architectures). We focus on NaaS and IaaS Cloud services. We then propose the self-establishment of several kinds of SLAs and the self-management of the corresponding Cloud resources in conformance with these SLAs using specific autonomic cloud managers. In addition, we extend the proposed architectures and the corresponding SLAs to deliver a service level that includes a security guarantee. Moreover, we allow autonomic cloud managers to extend their self-management objectives to security functions (self-protection) while studying the impact of the proposed security on the QoS guarantee. Finally, we validate the proposed architecture with different simulation scenarios. Within these simulations, we consider videoconferencing and intensive computing applications in order to provide them with QoS and security guarantees in a Cloud self-management environment. The results show that our contributions enable good performance for these applications. In particular, we observe that the Broker architecture is the most economical while meeting the QoS and security requirements. In addition, we observe that Cloud self-management reduces violations and penalties and limits the impact of security on the QoS guarantee.
Keywords: Cloud Computing; Cloud Networking; Inter-Cloud; Service Level Agreement; Quality of Service; Security; Self-management; Videoconferencing; Intensive Computing; 004.6
14	Mining brain imaging and genetics data via structured sparse learning
Yan, Jingwen. 29 April 2015. Indiana University-Purdue University Indianapolis (IUPUI).
Alzheimer's disease (AD) is a neurodegenerative disorder characterized by gradual loss of brain functions, usually preceded by memory impairments. It widely affects Americans over 65 years old and is listed as the sixth leading cause of death. More importantly, unlike other diseases, loss of brain function during AD progression usually leads to a significant decline in self-care abilities. This places considerable pressure on family members, friends, communities, and society as a whole because of time-consuming daily care and high health care expenditures. In the past decade, while deaths attributed to the number one cause, heart disease, have decreased 16 percent, deaths attributed to AD have increased 68 percent. These conditions will continue to deteriorate as the population ages over the next several decades. To prevent such a health care crisis, substantial efforts have been made to help cure, slow, or stop the progression of the disease. The massive data generated through these efforts, such as multimodal neuroimaging scans and next-generation sequencing data, provide unprecedented opportunities for researchers to examine the disease in depth, with more confidence and precision. While many efforts have applied existing machine learning and statistical models, the correlated structure and high dimensionality of imaging and genetics data are generally ignored or avoided through targeted analysis. Their performance in imaging genetics studies is therefore quite limited and leaves considerable room for improvement.
The primary contribution of this work lies in the development of novel prior knowledge-guided regression and association models, and their applications to various neurobiological problems, such as the identification of imaging biomarkers related to cognitive performance and of imaging genetics associations. In summary, this work has achieved the following research goals: (1) exploration of multimodal imaging biomarkers for various cognitive functions using group-guided learning algorithms, (2) development and application of a novel network-structure-guided sparse regression model, (3) development and application of a novel network-structure-guided sparse multivariate association model, and (4) improvement of computational efficiency through parallelization strategies.
Keywords: Alzheimer's disease; Biomarker discovery; Data intensive computing; Imaging genetics; Structured sparse learning; Alzheimer's disease -- Research; Alzheimer's disease -- Patients -- Care; Medical genetics; Diagnostic imaging; Image processing; Neural networks (Neurobiology); Brain -- Physiology
15	An I/O-aware scheduler for containerized data-intensive HPC tasks in Kubernetes-based heterogeneous clusters / En I/O-medveten schemaläggare för containeriserade dataintensiva HPC-uppgifter i Kubernetes-baserade heterogena kluster
Wu, Zheyun. January 2022.
Cloud-native is a new computing paradigm that takes advantage of key characteristics of cloud computing, where applications are packaged as containers. The lifecycle of containerized applications is typically managed by container orchestration tools such as Kubernetes, the most popular container orchestration system, which automates the deployment, maintenance, and scaling of containers. Kubernetes has become the de facto standard for container orchestrators in the cloud-native era. Meanwhile, with the increasing demand for High-Performance Computing (HPC) over the past years, containerization is being adopted by the HPC community, and various processors and special-purpose hardware are used to accelerate HPC applications. The architecture of cloud systems has been gradually shifting from homogeneous to heterogeneous, with different processors and hardware accelerators, which raises a new challenge: how to exploit different computing resources efficiently? Much effort has been devoted to improving the utilization of computing resources in heterogeneous systems from the perspective of task scheduling, which aims to match different types of tasks to the optimal computing devices for execution. Existing proposals do not take into account the variation in I/O performance between heterogeneous nodes when scheduling tasks. However, I/O performance is an important but often overlooked factor that can be a performance bottleneck for HPC tasks. This thesis proposes an I/O-aware scheduler named cmio-scheduler for containerized data-intensive HPC tasks in Kubernetes-based heterogeneous clusters, which is aware of the I/O throughput of compute nodes when making task placement decisions. In principle, cmio-scheduler assigns each data-intensive HPC task to the node that fulfills the task's requirements for CPU, memory, and GPU and has the highest I/O throughput. The experimental results demonstrate that cmio-scheduler reduces the execution time by 19.32% for the overall workflow and 15.125% for parallelizable tasks on average.
Keywords: Cloud-native; Containers; Kubernetes; High-performance computing (HPC); Data-intensive computing; Task scheduling; Heterogeneous systems; Computer and Information Sciences
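The placement rule stated in the abstract of entry 15 (keep only nodes that satisfy the task's CPU, memory, and GPU requirements, then pick the one with the highest I/O throughput) can be illustrated with a minimal sketch. This is not the thesis's cmio-scheduler code and does not use the Kubernetes API; the `Node` and `Task` types, their field names, and the `pick_node` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical view of a compute node's free resources.
    name: str
    free_cpu: float            # free CPU cores
    free_memory_gib: float     # free memory in GiB
    free_gpus: int             # free GPUs
    io_throughput: float       # measured I/O throughput, arbitrary units


@dataclass
class Task:
    # Hypothetical resource request of one data-intensive task.
    cpu: float
    memory_gib: float
    gpus: int = 0


def pick_node(task: Task, nodes: List[Node]) -> Optional[Node]:
    """Filter nodes by resource fit, then choose the highest I/O throughput."""
    feasible = [
        n for n in nodes
        if n.free_cpu >= task.cpu
        and n.free_memory_gib >= task.memory_gib
        and n.free_gpus >= task.gpus
    ]
    if not feasible:
        return None  # no node can host the task
    return max(feasible, key=lambda n: n.io_throughput)
```

The two-step shape (filter, then score) is an assumption about how such a rule is commonly organized; how the thesis measures I/O throughput or breaks ties is not stated in the abstract.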
16	A comparative study of the Data Warehouse and Data Lakehouse architecture / En komparativ studie av Data Warehouse- och Data Lakehouse-arkitektur
Salqvist, Philip. January 2024.
This thesis aimed to assess a given Data Warehouse against a well-suited Data Lakehouse in terms of read performance and scalability. Using the TPC-DS benchmark, these systems were tested with synthetic datasets reflecting the specific needs of a Decision Support System (DSS). Moreover, this research aimed to determine whether certain categories of queries resulted in notably large discrepancies between the systems, which might help pinpoint the architectural differences that cause these discrepancies. Initial research identified BigQuery and Delta Lake as top candidates due to their exceptional read performance and scalability, prompting further investigation into both. The most significant latency difference was noted in the initial benchmark using a dataset scale of 2 GB, with BigQuery outperforming Delta Lake. As the dataset size grew, BigQuery's latency increased by 336%, while Delta Lake's went up by just 40%. However, BigQuery still maintained a significantly lower overall latency across all scales. Detailed query analysis showed BigQuery excelling especially with complex queries, those involving extensive aggregation and multiple join operations, which have a high potential for generating large intermediate data during the shuffle stage. It was hypothesized that some of the read performance discrepancies could be attributed to BigQuery's in-memory shuffling capability, whereas Delta Lake might spill intermediate data to disk. Delta Lake's hardware utilization metrics further supported this theory, displaying a trend where peaks in memory usage and disk write rate coincided with queries showing high discrepancies. Meanwhile, CPU utilization remained low. This pattern suggests an I/O-bound system rather than a CPU-bound one, possibly explaining the observed performance differences. Future studies are encouraged to explicitly monitor shuffle operations, aiming for a more rigorous correlation between high-discrepancy queries and data spillage during the shuffle phase. Further research should also include larger dataset sizes; this thesis was constrained to a maximum dataset size of 64 GB due to limited resources.
Keywords: Data-Intensive Computing; Data Lakehouse; BigQuery; Delta Lake; Data storage system; Data Lakehouse architecture; Computer and Information Sciences
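The abstract of entry 16 reports relative latency growth (336% for BigQuery, 40% for Delta Lake) alongside the observation that BigQuery stayed faster in absolute terms. A small sketch shows how both statements can hold at once. The latency values below are hypothetical, chosen only to reproduce the two percentages; the abstract does not report absolute latencies.

```python
def percent_increase(baseline: float, scaled: float) -> float:
    """Relative growth from baseline to scaled, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (scaled - baseline) / baseline * 100.0


# Hypothetical latencies in seconds at the smallest and largest scale.
bigquery_small, bigquery_large = 10.0, 43.6
delta_small, delta_large = 100.0, 140.0

bigquery_growth = round(percent_increase(bigquery_small, bigquery_large), 1)
delta_growth = round(percent_increase(delta_small, delta_large), 1)

# A 336% increase means the final latency is 4.36 times the baseline,
# yet with these illustrative numbers it is still below the other system's.
print(bigquery_growth, delta_growth, bigquery_large < delta_large)
```

Under these assumed numbers the system with the steeper relative growth still has the lower absolute latency, which matches the qualitative pattern the abstract describes.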
