1 |
A Different Threshold Approach to Data Replication in Data Grids
Huang, Yen-Wei, 21 January 2008
Certain scientific application domains, such as High-Energy Physics or Earth Observation, are expected to produce several Petabytes (2^20 Gigabytes) of data to be analyzed and evaluated by scientists all over the world. In the context of data grid technology, data replication is used mainly to reduce access latency and bandwidth consumption. In this thesis, we adopt the typical Data Grid architecture with three kinds of nodes: server, cache, and client nodes. A server node represents a main storage site, a client node represents a site where data access requests are generated, and a cache node represents an intermediate storage site. The access latency of the hierarchical storage system may range from seconds up to hours. A static replication strategy can reduce such long delays, but it cannot adapt to changes in users' behavior; therefore, dynamic data replication strategies are used in Data Grids. Three fundamental design issues in a dynamic replication strategy are: (1) when to create the replicas, (2) which files to replicate, and (3) where to place the replicas.

Two well-known replication strategies are Fast-Spread and Cascading, each of which works well for a different kind of access pattern: Fast-Spread works well for random access patterns, while Cascading works well for patterns that exhibit locality. However, if we use one strategy for one kind of access pattern and another strategy for another, the system may become too complex. Therefore, in this thesis we propose a single strategy that works for any kind of access pattern: a Different Threshold (DT) approach to data replication in Data Grids, which adapts dynamically to several kinds of access patterns and provides even better performance than the Cascading and Fast-Spread strategies. In our approach, there are different thresholds for different layers. Based on this approach, we first propose a static DT strategy in which the threshold at each layer is fixed; by carefully adjusting the differences between the thresholds T_i, where i is the i-th layer of the tree structure, we can provide better performance than the two well-known strategies above. Moreover, among a large number of data files there may exist some hot data files, i.e., the files that are requested most often. To reduce the number of requests for the hot files, we next propose the dynamic DT strategy, in which each data file has its own threshold; we let replication of hot files occur earlier than that of normal files by decreasing their thresholds sooner. Our simulation results show that the response time of the static DT strategy is less than that of the Cascading and Fast-Spread strategies, and that the dynamic DT strategy performs better still than the static one.
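The abstract describes the DT mechanism only in prose; the following is a minimal sketch of the idea as stated, assuming a per-file request counter at each cache node and a simple rule that a node replicates a file once the counter reaches that layer's threshold T_i, with hot files given a lowered threshold in the dynamic variant. All class, function, and parameter names (CacheNode, on_request, discount) are illustrative assumptions, not taken from the thesis.

```python
# Hedged sketch of the static/dynamic DT idea described above.
# Assumptions (not from the thesis): each cache node keeps a per-file
# request counter; when the counter at layer i reaches that layer's
# threshold T_i, the file is replicated to that node. In the dynamic
# variant, thresholds of "hot" files are lowered so they replicate earlier.

class CacheNode:
    def __init__(self, layer, base_thresholds):
        self.layer = layer                       # i-th layer of the tree
        self.base_thresholds = base_thresholds   # fixed T_i per layer (static DT)
        self.counters = {}                       # per-file request counts
        self.replicas = set()                    # files replicated at this node

    def threshold_for(self, file_id, hot_files, discount=0.5):
        """Static DT uses T_i as-is; dynamic DT lowers T_i for hot files."""
        t = self.base_thresholds[self.layer]
        return t * discount if file_id in hot_files else t

    def on_request(self, file_id, hot_files=frozenset()):
        """Count a request; replicate once the (possibly lowered) threshold is hit."""
        self.counters[file_id] = self.counters.get(file_id, 0) + 1
        if (file_id not in self.replicas and
                self.counters[file_id] >= self.threshold_for(file_id, hot_files)):
            self.replicas.add(file_id)           # place a replica at this layer

# Example: a layer-1 node with thresholds T_0..T_2; file "f" is hot,
# so it replicates after 4 requests instead of 8.
node = CacheNode(layer=1, base_thresholds=[4, 8, 16])
for _ in range(4):
    node.on_request("f", hot_files={"f"})
assert "f" in node.replicas
```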
|
2 |
Scheduling distributed data-intensive applications on global grids
Venugopal, Srikumar, date unknown
The next generation of scientific experiments and studies is being carried out by large collaborations of researchers distributed around the world, engaged in the analysis of huge collections of data generated by scientific instruments. Grid computing has emerged as an enabler for such collaborations, as it helps communities share resources to achieve common objectives. Data Grids provide services for accessing, replicating, and managing data collections in these collaborations. Applications used in such Grids are distributed and data-intensive: they access and process distributed datasets to generate results, and they need to access distributed data and computational resources transparently and efficiently. This thesis investigates the properties of data-intensive computing environments and presents a software framework and algorithms for mapping distributed data-oriented applications to Grid resources. (For the complete abstract, open the document.)
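The abstract stays at the framework level, but the mapping decision it mentions can be illustrated with a toy cost model: choose, for each job, the resource that minimizes estimated data-transfer time plus execution time. This is a generic sketch under assumed inputs (dataset sizes, link bandwidths, resource speeds), not the algorithm actually developed in the thesis.

```python
# Illustrative sketch (not the thesis's algorithm): map a data-intensive job
# to the compute resource minimizing estimated transfer time + compute time.
# Assumptions: known dataset size, link bandwidths, and resource speeds.

def best_resource(job, resources, bandwidth):
    """job: dict with 'dataset_gb', 'work' (abstract compute units), 'data_site'.
    resources: name -> speed (units/s). bandwidth: (data_site, name) -> GB/s."""
    def cost(r):
        transfer = job["dataset_gb"] / bandwidth[(job["data_site"], r)]
        compute = job["work"] / resources[r]
        return transfer + compute
    return min(resources, key=cost)

job = {"dataset_gb": 100.0, "work": 500.0, "data_site": "cern"}
resources = {"melbourne": 10.0, "geneva": 5.0}
bandwidth = {("cern", "melbourne"): 0.1, ("cern", "geneva"): 1.0}
# geneva: 100/1.0 + 500/5 = 200 s; melbourne: 100/0.1 + 500/10 = 1050 s
assert best_resource(job, resources, bandwidth) == "geneva"
```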
|
3 |
Data Integration Over Horizontally Partitioned Databases In Service-oriented Data Grids
Sonmez Sunercan, Hatice Kevser, 1 September 2010
Information integration over distributed and heterogeneous resources is challenging in many respects: coping with various kinds of heterogeneity, including data model, platform, and access interfaces; coping with various forms of data distribution and maintenance policies; scalability; performance; security and trust; reliability and resilience; legal issues; and so on. Each of these dimensions clearly deserves a separate thread of research effort. The challenge most relevant to the work presented in this thesis is coping with various forms of data distribution and maintenance policies.
This thesis aims to provide a service-oriented data integration solution over Data Grids for cases where distributed data sources are partitioned with overlapping sections of various proportions. This is an interesting variation that combines both replicated and partitioned data within the same data management framework; the data management infrastructure must therefore deal with specific challenges in identifying, accessing, and aggregating partitioned data with varying proportions of overlap. To provide a solution, we have extended OGSA-DAI DQP, a well-known service-oriented data access and integration middleware with distributed query processing facilities, by incorporating a UnionPartitions operator into its algebra to cope with unusual forms of horizontally partitioned databases. Our solution thus extends OGSA-DAI DQP in two ways: (1) a new operator type is added to the algebra to perform a specialized union of partitions with different characteristics, and (2) the OGSA-DAI DQP Federation Description is extended with additional metadata to facilitate the successful execution of the newly introduced operator.
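The abstract does not specify the UnionPartitions operator beyond its purpose; the sketch below shows one plausible behavior for a union over overlapping horizontal partitions, assuming rows share a schema and carry a primary key by which duplicates across overlapping sections can be detected. The function and parameter names are illustrative, not the OGSA-DAI DQP API.

```python
# Hedged sketch: union of horizontally partitioned tables whose partitions
# may overlap. Assumption (not from the thesis): rows share a schema and a
# primary key, so overlapping sections can be de-duplicated by key.

def union_partitions(partitions, key):
    """Merge row iterables from several partitions, keeping one row per key."""
    seen = set()
    for partition in partitions:      # each partition: an iterable of dict rows
        for row in partition:
            k = row[key]
            if k not in seen:         # skip rows already produced by an
                seen.add(k)           # overlapping partition
                yield row

# Example: two partitions overlapping on id=2.
p1 = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
p2 = [{"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
rows = list(union_partitions([p1, p2], key="id"))
assert [r["id"] for r in rows] == [1, 2, 3]
```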
|
4 |
Placement of replicas in large-scale data grid environments
Shorfuzzaman, Mohammad, 26 March 2012
Data Grids provide services and infrastructure for distributed data-intensive applications accessing massive, geographically distributed datasets. An important technique for speeding up access in Data Grids is replication, which provides nearby data access. Although data replication is one of the major techniques for improving data access performance, the replica placement problem has not been widely studied for large-scale Grid environments. In this thesis, I propose improved data placement techniques for replicating potentially large data files in wide-area data grids. These techniques aim at faster data access as well as efficient utilization of bandwidth and storage resources. At the core of my approach is a new, highly distributed replica placement algorithm that places data in strategic locations to improve overall data access performance while satisfying varying user/application and system demands. This improved efficiency of access to large data will improve the practicality of large-scale data- and compute-intensive collaborative scientific endeavors.

My thesis makes several contributions towards improving the state of the art for replica placement in large-scale data grid environments. The major contributions are: (i) a new popularity-driven dynamic replica placement algorithm for hierarchically structured data grids that balances storage space utilization and access latency (see the sketch below); (ii) an adaptive version of the base algorithm that dynamically adjusts the frequency and degree of replication based on factors such as data request arrival rates and available storage capacities; (iii) a new, highly distributed algorithm that determines a near-optimal replica placement while minimizing replication cost (access and update) for a given traffic pattern; (iv) a distributed QoS-aware replica placement algorithm that supports multiple quality requirements, from both user and system perspectives, to support efficient transfers of large replicas.

Simulation results using widely observed data access patterns demonstrate how the effectiveness of my replica placement techniques is affected by factors such as grid network characteristics (topology, number of nodes, storage and workload capacities of replica servers, link capacities, traffic pattern), QoS requirements, and so on. Finally, I compare the performance of my algorithms to a number of relevant algorithms from the literature and demonstrate their usefulness and superiority for the conditions of interest.
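As a rough illustration of the popularity-driven idea in contribution (i), a common pattern for hierarchical grids is to aggregate request counts up the tree and replicate at the lowest node whose subtree demand crosses a threshold, so that one copy serves all clients below it. The sketch below assumes exactly that rule; it is not the thesis's algorithm, and all names and the threshold rule are hypothetical.

```python
# Hedged sketch of popularity-driven placement in a hierarchy (assumed rule,
# not the thesis's algorithm): aggregate per-file request counts bottom-up and
# replicate at the lowest node whose subtree demand crosses a threshold.

class GridNode:
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)
        self.local_requests = 0          # requests from clients attached here

    def subtree_demand(self):
        return self.local_requests + sum(c.subtree_demand() for c in self.children)

def place_replicas(node, threshold, placements):
    """Post-order walk: prefer the lowest subtree that crosses the threshold."""
    placed_below = False
    for child in node.children:
        placed_below |= place_replicas(child, threshold, placements)
    if not placed_below and node.subtree_demand() >= threshold:
        placements.append(node.name)     # one replica covers this subtree
        return True
    return placed_below

# Example: two leaf sites under one regional node; neither leaf alone
# crosses the threshold, so the replica lands at their common parent.
leaf_a, leaf_b = GridNode("site-a"), GridNode("site-b")
leaf_a.local_requests, leaf_b.local_requests = 3, 4
region = GridNode("region-1", [leaf_a, leaf_b])
placements = []
place_replicas(region, threshold=5, placements=placements)
assert placements == ["region-1"]
```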
|
5 |
A Grid-based Middleware for Scalable Processing of Remote Data
Glimcher, Leonid S., 24 June 2008
No description available.
|
6 |
Improving Scheduling in Heterogeneous Grid and Hadoop Systems
Rasooli Oskooei, Aysan, 10 1900
Advances in network technologies and computing resources have led to the deployment of large-scale computational systems, such as those following Grid or Cloud architectures. The scheduling problem is a significant issue in these distributed computing environments, where a scheduling algorithm should consider multiple objectives and performance metrics. Moreover, heterogeneity is increasing at both the application and resource levels, and it can have a huge impact on performance metrics such as average completion time. However, traditional Grid and Cloud scheduling algorithms neglect heterogeneity in their scheduling decisions. This PhD dissertation studies the scheduling challenges in Computational Grid, Data Grid, and Cloud computing systems, and introduces new scheduling algorithms for each of these systems.

The main issue in Grid scheduling is the wide distribution of resources; as a result, gathering full state information can add huge overhead to the scheduler. This thesis introduces a Computational Grid scheduling algorithm that minimizes completion times (by considering system heterogeneity) while requiring zero dynamic state information. Simulation results show the promising performance of this algorithm and its robustness with respect to errors in parameter estimates.

In the case of Data Grid schedulers, an efficient scheduling decision should select a combination of resources for a task that simultaneously mitigates both the computation and the communication costs, so these schedulers need to consider a large search space to find an appropriate combination. This thesis introduces a new Data Grid scheduling algorithm that makes replication and scheduling decisions dynamically. The proposed algorithm reduces the search space, decreases the required state information, and improves performance by considering system heterogeneity. Simulation results illustrate the promising performance of the introduced algorithm.

Cloud computing can be considered the next generation of Grid computing. One of the main challenges in Cloud systems is the enormous expansion of data in different applications. The MapReduce programming model and the Hadoop framework were designed as a solution for executing large-scale data-intensive applications. A large number of (heterogeneous) users sharing the same Hadoop cluster can create tension between the various performance metrics by which such systems are measured. This research introduces and implements a Hadoop scheduling system, named COSHH (Classification and Optimization based Scheduler for Heterogeneous Hadoop systems), which uses system information such as estimated job arrival rates and mean job execution times to make scheduling decisions, and which considers heterogeneity at both the application and cluster levels. The main objective of COSHH is to improve the average completion time of jobs; however, because it is also concerned with other key Hadoop performance metrics, it achieves competitive performance under the minimum share satisfaction, fairness, and locality metrics with respect to other well-known Hadoop schedulers. Unlike most Hadoop schedulers, which assume homogeneous clusters, the proposed scheduler can be applied efficiently in heterogeneous clusters.

A Hadoop system can be described by three factors: cluster, workload, and user. Each factor is either heterogeneous or homogeneous, which reflects the heterogeneity level of the Hadoop system. This research studies the effect of heterogeneity in each of these factors on the performance of Hadoop schedulers. Three schedulers that consider different levels of Hadoop heterogeneity are used for the analysis: FIFO, Fair Sharing, and COSHH. Performance issues for Hadoop schedulers are identified, and experiments are provided to evaluate them. The reported results suggest guidelines for selecting an appropriate scheduler for different Hadoop systems; the focus of these guidelines is on systems that do not have significant fluctuations in the number of resources or jobs.

Scheduling tasks and resources in a scalable manner is a considerable challenge in Hadoop, and the potentially heterogeneous nature of deployed Hadoop systems tends to increase this challenge. This thesis analyzes the performance of widely used Hadoop schedulers, including FIFO and Fair Sharing, and compares them with COSHH. Based on the resulting insights, a hybrid solution is introduced that selects appropriate scheduling algorithms for scalable and heterogeneous Hadoop systems with respect to the number of incoming jobs and available resources. The proposed hybrid scheduler follows the guidelines provided for heterogeneous Hadoop systems when the number of jobs and resources changes considerably.

To improve the performance of high-priority users, Hadoop guarantees minimum numbers of resource shares for these users at each point in time. This research compares different scheduling algorithms based on minimum share consideration, under different heterogeneous and homogeneous environments. For this evaluation, a real Hadoop system is developed and evaluated using Facebook workloads. Based on the experiments performed, a reliable scheduling algorithm is suggested that can work efficiently in different environments.

Doctor of Philosophy (PhD)
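Beyond its use of estimated job arrival rates and mean execution times, COSHH's internals are not given in the abstract. The toy sketch below captures only the flavor of classification-based scheduling: group jobs into classes by estimated mean execution time, then match each class to the resource type that runs it fastest. The two-class rule and all names are illustrative assumptions, not the actual COSHH design.

```python
# Toy sketch in the spirit of classification-based scheduling (assumed rule,
# not the actual COSHH algorithm): group jobs into classes by estimated mean
# execution time, then match each class to the fastest resource type for it.

def classify(jobs, boundary):
    """Split jobs into 'short'/'long' classes by estimated mean exec time (s)."""
    return {
        "short": [j for j in jobs if j["mean_exec_s"] <= boundary],
        "long":  [j for j in jobs if j["mean_exec_s"] > boundary],
    }

def match(classes, speedup):
    """speedup[(cls, rtype)]: relative speed of resource type rtype on class cls."""
    resource_types = {r for (_, r) in speedup}
    return {cls: max(resource_types, key=lambda r: speedup[(cls, r)])
            for cls in classes}

# Example: long jobs go to the resource type that runs them fastest.
jobs = [{"id": 1, "mean_exec_s": 30}, {"id": 2, "mean_exec_s": 900}]
classes = classify(jobs, boundary=60)
speedup = {("short", "small-node"): 1.0, ("short", "big-node"): 1.1,
           ("long", "small-node"): 0.4, ("long", "big-node"): 2.0}
assignment = match(classes, speedup)
assert assignment["long"] == "big-node"
```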
|