31

Knihovna znovupoužitelných komponent a utilit pro framework Angular 2 / Library of Reusable Components and Utilities for the Angular 2 Framework

Branderský, Gabriel January 2017
This thesis deals with the creation of a library of reusable components and utilities intended for use in data-intensive applications. A typical component of such applications is the table, which is considered the main component of the library. To ensure high cohesion, all other components and utilities are closely integrated with it. The resulting set of components can be used declaratively and allows various configurations. The user interface is also tailored to data-intensive applications, with a variety of UI elements.
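The declarative usage the abstract describes might look like the following minimal Angular sketch. The `data-table` and `data-column` selectors, their inputs, and the surrounding component are hypothetical illustrations of the pattern, not the library's actual API:

```typescript
import { Component } from '@angular/core';

// Hypothetical declarative usage of a reusable table component; the
// data-table/data-column selectors and their inputs are assumptions, and
// the table components are presumed to be declared by an imported module.
@Component({
  selector: 'orders-view',
  template: `
    <data-table [rows]="orders" [pageSize]="50" [sortable]="true">
      <data-column field="id" header="Order ID"></data-column>
      <data-column field="customer" header="Customer"></data-column>
      <data-column field="total" header="Total" format="currency"></data-column>
    </data-table>
  `,
})
export class OrdersViewComponent {
  // In a data-intensive application these rows would typically be paged
  // in from a backend service rather than held in memory.
  orders = [
    { id: 1, customer: 'Acme', total: 1250.0 },
    { id: 2, customer: 'Globex', total: 830.5 },
  ];
}
```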
32

Advancing information privacy concerns evaluation in personal data intensive services

Rohunen, A. (Anna) 04 December 2019
Abstract: When personal data are collected and utilised to produce personal data intensive services, users of these services are exposed to the possibility of privacy losses. Users' information privacy concerns may lead to non-adoption of new services and technologies, affecting the quality and completeness of the collected data. These issues make it challenging to fully reap the benefits brought by the services. Evaluating information privacy concerns makes it possible to address them in the design and development of personal data intensive services. This research investigated how privacy concerns evaluation should be developed so that it remains valid in evolving data collection contexts. The research was conducted in two phases: the first employed a mixed-method research design, the second a literature review methodology. In Phase 1, two empirical studies were conducted, following a mixed-method exploratory sequential design. In both studies, the data subjects' privacy behaviour and the privacy concerns associated with mobility data collection were first explored qualitatively; quantitative instruments were then developed from the qualitative results to generalise the findings. Phase 2 was designed to provide an extensive view of privacy behaviour and of the possibilities for developing privacy concerns evaluation in new data collection contexts. It consisted of two review studies: a systematic literature review of privacy behaviour models and a review of changes in EU data privacy legislation. The results show that in evolving data collection contexts, privacy behaviour and concerns have characteristics that differ from earlier ones. Privacy concerns have aspects specific to these contexts, and their multifaceted nature appears emphasised. Because privacy concerns are related to other antecedents of privacy behaviour, it may be reasonable to incorporate some of these antecedents into evaluations. The existing privacy concerns evaluation instruments serve as valid starting points for evaluations in evolving personal data collection contexts, but they need to be revised and adapted to the new contexts. Developing privacy concerns evaluation may be challenging due to the incoherence of the existing privacy behaviour research; more overarching research is called for to facilitate the application of the existing knowledge.
33

Intermediate Results Materialization Selection and Format for Data-Intensive Flows

Munir, Rana Faisal, Nadal, Sergi, Romero, Oscar, Abelló, Alberto, Jovanovic, Petar, Thiele, Maik, Lehner, Wolfgang 14 June 2023
Data-intensive flows deploy a variety of complex data transformations to build information pipelines from data sources to different end users. As data are processed, these workflows generate large intermediate results, typically pipelined from one operator to the following ones. Materializing intermediate results that are shared among multiple flows brings benefits not only in performance but also in resource usage and consistency. Similar ideas have been proposed in the context of data warehouses, where they are studied as the materialized view selection problem. With the rise of Big Data systems, new challenges emerge because service level agreements capture new quality metrics that must be taken into account. Moreover, the way such results are stored must be reconsidered, as different data layouts can be used to reduce the I/O cost. In this paper, we propose a novel approach for the automatic, multi-objective selection of intermediate results to materialize in data-intensive flows, which can tackle multiple and conflicting quality objectives. In addition, our approach chooses the optimal storage data format for the selected materialized intermediate results based on their subsequent access patterns. The experimental results show that our approach provides a 40% better average speedup than the current state of the art, as well as an 18% improvement in disk access time compared to fixed-format solutions.
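The format-selection step can be pictured with a toy heuristic like the sketch below: favour a columnar layout when downstream operators run column-selective scans, and a row-oriented layout when they read whole records. This is an illustrative sketch of the idea only; the paper's actual cost model, thresholds, and type names are not reproduced here:

```typescript
type StorageFormat = 'parquet' | 'avro' | 'csv';

interface AccessPattern {
  scans: number;          // full-table or column scans by downstream operators
  pointLookups: number;   // row-level reads of individual records
  columnsRead: number;    // distinct columns touched downstream
  totalColumns: number;   // total columns in the intermediate result
}

// Toy heuristic: columnar formats (e.g. Parquet) pay off when downstream
// access is scan-heavy and column-selective; row-oriented formats (e.g. Avro)
// suit record-at-a-time reads. The 0.5 threshold is an illustrative assumption.
function chooseFormat(p: AccessPattern): StorageFormat {
  const selectivity = p.columnsRead / p.totalColumns;
  if (p.scans >= p.pointLookups && selectivity < 0.5) {
    return 'parquet';
  }
  if (p.pointLookups > p.scans) {
    return 'avro';
  }
  return 'csv'; // fallback for wide scans that read most columns
}

// Scan-heavy access touching 3 of 40 columns favours a columnar layout.
console.log(chooseFormat({ scans: 12, pointLookups: 1, columnsRead: 3, totalColumns: 40 })); // 'parquet'
```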
34

Optimizing data management for MapReduce applications on large-scale distributed infrastructures / Optimisation de la gestion des données pour les applications MapReduce sur des infrastructures distribuées à grande échelle

Moise, Diana Maria 16 December 2011
Data-intensive applications are nowadays widely used in various domains to extract and process information, to design complex systems, to perform simulations of real models, etc. These applications exhibit challenging requirements in terms of both storage and computation. Specialized abstractions like Google's MapReduce were developed to efficiently manage the workloads of data-intensive applications. The MapReduce abstraction has revolutionized the data-intensive community and has rapidly spread to various research and production areas. An open-source implementation of Google's abstraction was provided by Yahoo! through the Hadoop project. This framework is considered the reference MapReduce implementation and is currently heavily used for various purposes and on several infrastructures. To achieve high-performance MapReduce processing, we propose a concurrency-optimized file system for MapReduce frameworks. As a starting point, we rely on BlobSeer, a framework designed to efficiently store data generated by data-intensive applications running at large scale. We built the BlobSeer File System (BSFS) with the goal of providing high throughput under heavy concurrency to MapReduce applications. We also study several aspects of intermediate data management in MapReduce frameworks, investigating the requirements of MapReduce intermediate data at two levels: inside a single job, and during the execution of pipelined applications. Finally, we show how BSFS enables extensions to the de facto MapReduce implementation, Hadoop, such as support for the append operation. This work also comprises the evaluation and the results obtained in the context of grid and cloud environments.
35

Optimisation de la gestion des données pour les applications MapReduce sur des infrastructures distribuées à grande échelle / Optimizing data management for MapReduce applications on large-scale distributed infrastructures

Moise, Diana 16 December 2011
Data-intensive applications are widely used in various domains to extract and process information, to design complex systems, to perform simulations of real models, etc. These applications pose complex challenges in terms of both storage and computation. In the context of data-intensive applications, we focus on the MapReduce paradigm and its implementations. Introduced by Google, the MapReduce abstraction has revolutionized the data-intensive community and has rapidly spread to various research and production domains. An open-source implementation of the abstraction put forward by Google was provided by Yahoo! through the Hadoop project. The Hadoop framework is considered the reference implementation of MapReduce and is currently widely used for various purposes and on several infrastructures. We propose a distributed file system, optimized for highly concurrent access, that can serve as a storage layer for MapReduce applications. We designed the BlobSeer File System (BSFS), based on BlobSeer, a highly efficient distributed storage service that facilitates large-scale data sharing. We also study several aspects of intermediate data management in MapReduce environments, examining the requirements of MapReduce intermediate data at two levels: within a single MapReduce job and during the execution of pipelines of MapReduce applications. Finally, we propose extensions to Hadoop, a popular open-source MapReduce environment, such as support for the append operation. This work also includes the evaluation and the results obtained on large-scale infrastructures: computing grids and clouds.
36

Mining brain imaging and genetics data via structured sparse learning

Yan, Jingwen 29 April 2015
Indiana University-Purdue University Indianapolis (IUPUI) / Alzheimer's disease (AD) is a neurodegenerative disorder characterized by gradual loss of brain functions, usually preceded by memory impairments. It widely affects Americans over 65 years old and is listed as the 6th leading cause of death. More importantly, unlike in other diseases, the loss of brain function in AD progression usually leads to a significant decline in self-care abilities, which exerts considerable pressure on family members, friends, communities, and society as a whole through time-consuming daily care and high health care expenditures. In the past decade, while deaths attributed to the number one cause, heart disease, have decreased 16 percent, deaths attributed to AD have increased 68 percent; these trends will continue to worsen as the population ages over the next several decades. To prevent such a health care crisis, substantial efforts have been made to help cure, slow, or stop the progression of the disease. The massive data generated through these efforts, such as multimodal neuroimaging scans and next-generation sequences, provide unprecedented opportunities for researchers to examine the disease with greater confidence and precision. Although many efforts have been made to apply existing machine learning and statistical models, the correlated structure and high dimensionality of imaging and genetics data are generally ignored or avoided through targeted analysis, so the performance of these models on imaging genetics studies is quite limited and leaves much room for improvement. The primary contribution of this work lies in the development of novel prior-knowledge-guided regression and association models, and their application to various neurobiological problems, such as the identification of imaging biomarkers related to cognitive performance and of imaging genetics associations. In summary, this work has achieved the following research goals: (1) exploration of multimodal imaging biomarkers of various cognitive functions using group-guided learning algorithms, (2) development and application of a novel network-structure-guided sparse regression model, (3) development and application of a novel network-structure-guided sparse multivariate association model, and (4) improvement of computational efficiency through parallelization strategies.
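The network-structure-guided sparse regression of goals (2) and (3) can be illustrated with a generic penalized least-squares formulation; this is a sketch of the general model family, and the thesis's exact penalties and weighting may differ:

```latex
\min_{w \in \mathbb{R}^{p}} \; \frac{1}{2}\,\lVert y - Xw \rVert_2^2
  \;+\; \lambda_1 \lVert w \rVert_1
  \;+\; \lambda_2 \sum_{(i,j) \in E} \left( w_i - w_j \right)^2
```

Here X holds the imaging/genetic features, y is a cognitive outcome, and E is the edge set of a prior network (e.g., brain connectivity or gene pathways); the l1 term induces sparsity, while the quadratic network penalty encourages features connected in the prior network to receive similar weights.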
37

An I/O-aware scheduler for containerized data-intensive HPC tasks in Kubernetes-based heterogeneous clusters / En I/O-medveten schemaläggare för containeriserade dataintensiva HPC-uppgifter i Kubernetes-baserade heterogena kluster

Wu, Zheyun January 2022
Cloud-native is a new computing paradigm that takes advantage of key characteristics of cloud computing, where applications are packaged as containers. The lifecycle of containerized applications is typically managed by container orchestration tools such as Kubernetes, the most popular container orchestration system, which automates the deployment, maintenance, and scaling of containers. Kubernetes has become the de facto standard for container orchestrators in the cloud-native era. Meanwhile, with the increasing demand for High-Performance Computing (HPC) in recent years, containerization is being adopted by the HPC community, and various processors and special-purpose hardware are utilized to accelerate HPC applications. The architecture of cloud systems has been gradually shifting from homogeneous to heterogeneous, with different processors and hardware accelerators, which raises a new challenge: how can different computing resources be exploited efficiently? Much effort has been devoted to improving the efficiency of resource use in heterogeneous systems from the perspective of task scheduling, which aims to match different types of tasks to the optimal computing devices for execution. Existing proposals do not take into account the variation in I/O performance between heterogeneous nodes when scheduling tasks. However, I/O performance is an important but often overlooked factor that can be a potential performance bottleneck for HPC tasks. This thesis proposes an I/O-aware scheduler named cmio-scheduler for containerized data-intensive HPC tasks in Kubernetes-based heterogeneous clusters, which takes the I/O throughput of compute nodes into account when making task placement decisions. In principle, cmio-scheduler assigns a data-intensive HPC task to the node that fulfills the task's requirements for CPU, memory, and GPU and has the highest I/O throughput, as sketched below. The experimental results demonstrate that cmio-scheduler reduces execution time by 19.32% for the overall workflow and by 15.125% for parallelizable tasks on average.
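The placement rule can be pictured as a filter-and-score pass over candidate nodes. This is a minimal sketch of the idea only; the field names and units are assumptions, and the real cmio-scheduler is integrated into Kubernetes' scheduling framework rather than implemented as a standalone function:

```typescript
interface NodeInfo {
  name: string;
  freeCpuMillis: number;      // unallocated CPU, in millicores
  freeMemMiB: number;         // unallocated memory
  freeGpus: number;           // unallocated GPUs
  ioThroughputMBps: number;   // measured I/O throughput of the node
}

interface TaskRequest {
  cpuMillis: number;
  memMiB: number;
  gpus: number;
}

// Filter nodes that satisfy the task's CPU/memory/GPU requests, then pick
// the one with the highest measured I/O throughput, mirroring the rule above.
function placeTask(task: TaskRequest, nodes: NodeInfo[]): NodeInfo | undefined {
  const feasible = nodes.filter(
    (n) =>
      n.freeCpuMillis >= task.cpuMillis &&
      n.freeMemMiB >= task.memMiB &&
      n.freeGpus >= task.gpus,
  );
  return feasible.reduce<NodeInfo | undefined>(
    (best, n) =>
      best === undefined || n.ioThroughputMBps > best.ioThroughputMBps ? n : best,
    undefined,
  );
}

// Example: a GPU-less data-intensive task lands on the feasible node
// with the highest I/O throughput.
const nodes: NodeInfo[] = [
  { name: 'node-a', freeCpuMillis: 4000, freeMemMiB: 8192, freeGpus: 0, ioThroughputMBps: 450 },
  { name: 'node-b', freeCpuMillis: 8000, freeMemMiB: 16384, freeGpus: 1, ioThroughputMBps: 900 },
];
console.log(placeTask({ cpuMillis: 2000, memMiB: 4096, gpus: 0 }, nodes)?.name); // 'node-b'
```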
38

Using Cloud Technologies to Optimize Data-Intensive Service Applications

Lehner, Wolfgang, Habich, Dirk, Richly, Sebastian, Assmann, Uwe 01 November 2022
The role of data analytics is increasing in several application domains as a way to cope with the large amounts of captured data. Generally, data analytics are data-intensive processes whose efficient execution is a challenging task. Each process consists of a collection of related, structured activities in which huge data sets have to be exchanged between several loosely coupled services. Implementing such processes in a service-oriented environment offers some advantages, but realizing the data flows efficiently is difficult. In this paper, we therefore propose a novel SOA-aware approach with a special focus on the data flow. The tight interaction of new cloud technologies with SOA technologies enables us to optimize the execution of data-intensive service applications by reducing the data exchange tasks to a minimum. Fundamentally, our core concept for optimizing the data flows is based on data clouds. Moreover, our approach can be used to derive efficient process execution strategies with respect to different optimization objectives for the data flows.
39

A comparative study of the Data Warehouse and Data Lakehouse architecture / En komparativ studie av Data Warehouse- och Data Lakehouse-arkitektur

Salqvist, Philip January 2024
This thesis assessed a given Data Warehouse against a well-suited Data Lakehouse in terms of read performance and scalability. Using the TPC-DS benchmark, the systems were tested with synthetic datasets reflecting the specific needs of a Decision Support (DSS) system. The research also aimed to determine whether certain categories of queries resulted in notably large discrepancies between the systems, which might help pinpoint the architectural differences that cause them. Initial research identified BigQuery and Delta Lake as top candidates due to their exceptional read performance and scalability, prompting further investigation into both. The most significant latency difference was noted in the initial benchmark using a dataset scale of 2 GB, with BigQuery outperforming Delta Lake. As the dataset size grew, BigQuery's latency increased by 336%, while Delta Lake's went up by just 40%; BigQuery nevertheless maintained a significantly lower overall latency across all scales. Detailed query analysis showed BigQuery excelling especially on complex queries, those involving extensive aggregation and multiple join operations, which have a high potential for generating large intermediate data during the shuffle stage. It was hypothesized that some of the read performance discrepancies could be attributed to BigQuery's in-memory shuffling capability, whereas Delta Lake might spill intermediate data to disk. Delta Lake's hardware utilization metrics further supported this theory, displaying a trend where peaks in memory usage and disk write rate coincided with queries showing high discrepancies, while CPU utilization remained low. This pattern suggests an I/O-bound system rather than a CPU-bound one, possibly explaining the observed performance differences. Future studies are encouraged to explicitly monitor shuffle operations, aiming for a more rigorous correlation between high-discrepancy queries and data spillage during the shuffle phase. Further research should also include larger dataset sizes; this thesis was constrained to a maximum dataset size of 64 GB due to limited resources.
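To see how a 336% relative increase can coexist with BigQuery keeping the lower absolute latency, it helps to plug in illustrative numbers (hypothetical values chosen only to match the reported growth rates, not the thesis's measurements):

```latex
\begin{aligned}
t_{\mathrm{BQ}}(64\,\mathrm{GB}) &= t_{\mathrm{BQ}}(2\,\mathrm{GB}) \times (1 + 3.36) = 5\,\mathrm{s} \times 4.36 = 21.8\,\mathrm{s},\\
t_{\mathrm{DL}}(64\,\mathrm{GB}) &= t_{\mathrm{DL}}(2\,\mathrm{GB}) \times (1 + 0.40) = 40\,\mathrm{s} \times 1.40 = 56\,\mathrm{s}.
\end{aligned}
```

Starting from a much lower base, BigQuery can grow several times faster in relative terms and still finish well ahead in absolute terms.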
