About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
81

Biophysically Accurate Brain Modeling and Simulation using Hybrid MPI/OpenMP Parallel Processing

Hu, Jingzhen, May 2012
To better understand the behavior of the human brain, it is important to perform large-scale neural network simulations, which may reveal the relationship between whole-network activity and the biophysical dynamics of individual neurons. However, given the complexity of the network and the large number of variables, researchers typically choose either to simulate smaller neural networks or to use simple spiking neuron models. Recently, supercomputing platforms have been employed to greatly speed up the simulation of large brain models, but these efforts remain limited by the simplicity of the modeled network structures and the lack of biophysical detail in the neuron models. In this work, we propose a parallel simulator that uses biophysically realistic neural models for the simulation of large-scale neural networks. To improve its performance, we adopt several techniques, such as mathematically merging linear synaptic receptors and using two-level time steps, which significantly accelerate the simulation. In addition, we explore parallel efficiency through three implementation strategies: MPI parallelization, MPI parallelization with dynamic load balancing schemes, and hybrid MPI/OpenMP parallelization. Through experimental studies, we illustrate the limitations of the MPI implementation caused by imbalanced workload among processors, and show that the two developed MPI load balancing schemes are unable to improve simulation efficiency on the targeted parallel platform. Using 32 processors, the proposed hybrid approach, on the other hand, is more efficient than the MPI implementation and is about 31X faster than a serial implementation of the simulator for a network of more than 100,000 neurons. Finally, we show that for large neural networks the presented approach is able to simulate the transition from 3 Hz delta oscillation to epileptic behavior caused by alterations of the underlying cellular mechanisms.
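A minimal sketch of the hybrid MPI/OpenMP strategy this abstract describes: each MPI rank owns a static block of neurons, OpenMP threads update them in parallel within a node, and spike counts are exchanged across nodes each step. The neuron update here is a placeholder leaky integrate-and-fire rule, not the thesis's biophysically detailed models, and the two-level time stepping and synaptic-receptor merging are omitted.

```cpp
// Hypothetical hybrid MPI/OpenMP time-stepping loop: each rank owns a
// block of neurons; OpenMP threads update them; spikes are summed globally.
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int total_neurons = 100000;           // network size from the abstract
    const int local_n = total_neurons / nranks; // even static partition (no balancing)
    std::vector<double> v(local_n, -65.0);      // membrane potentials (mV)

    const double dt = 0.1;                      // single coarse time step (ms)
    for (int step = 0; step < 1000; ++step) {
        int local_spikes = 0;
        // Shared-memory parallelism inside the node.
        #pragma omp parallel for reduction(+ : local_spikes)
        for (int i = 0; i < local_n; ++i) {
            // Placeholder leaky integrate-and-fire update; the thesis uses
            // biophysically detailed, conductance-based neuron models.
            v[i] += dt * (-(v[i] + 65.0) / 10.0 + 1.5);
            if (v[i] > -50.0) { v[i] = -65.0; ++local_spikes; }
        }
        // Distributed-memory exchange across nodes.
        int global_spikes = 0;
        MPI_Allreduce(&local_spikes, &global_spikes, 1, MPI_INT, MPI_SUM,
                      MPI_COMM_WORLD);
        if (rank == 0 && step % 100 == 0)
            std::printf("step %d: %d spikes\n", step, global_spikes);
    }
    MPI_Finalize();
    return 0;
}
```

The even static partition is exactly what creates the workload imbalance the abstract reports for the pure MPI version; the hybrid version mitigates it because threads within a node share work dynamically.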
82

Dynamic Load Balancing Schemes for Large-scale HLA-based Simulations

De Grande, Robson E., 26 July 2012
Dynamic balancing of computation and communication load is vital for the execution stability and performance of distributed, parallel simulations deployed on the shared, unreliable resources of large-scale environments. High Level Architecture (HLA) based simulations can experience a decrease in performance due to imbalances that are produced initially and/or during run-time. These imbalances are generated by the dynamic load changes of distributed simulations or by unknown, unmanaged background processes resulting from the non-dedication of shared resources. Because of the dynamic execution characteristics of the elements that compose distributed simulation applications, the computational load and interaction dependencies of each simulation entity change during run-time. These dynamic changes lead to an irregular load and communication distribution, which increases resource overhead and execution delays. A static partitioning of load is limited to deterministic applications and is incapable of predicting the dynamic changes caused by distributed applications or by external background processes. Given the importance of dynamically balancing load for distributed simulations, many balancing approaches have been proposed to offer sub-optimal balancing solutions, but they are limited to certain simulation aspects, specific to particular applications, or unaware of the characteristics of HLA-based simulations. Therefore, schemes for balancing the communication and computational load during the execution of distributed simulations are devised, adopting a hierarchical architecture. First, to enable the development of such balancing schemes, a migration technique is employed to perform reliable, low-latency simulation load transfers. Then, a centralized balancing scheme is designed; this scheme employs local and cluster monitoring mechanisms to observe distributed load changes and identify imbalances, and it uses load reallocation policies to determine a distribution of load that minimizes imbalances. To overcome the drawbacks of this scheme, such as bottlenecks, overheads, global synchronization, and a single point of failure, a distributed redistribution algorithm is designed. Extensions of the distributed balancing scheme are also developed to improve the detection of, and the reaction to, load imbalances. These extensions introduce communication delay detection, migration latency awareness, self-adaptation, and load oscillation prediction into the load redistribution algorithm. The developed balancing systems successfully improved the use of shared resources and increased the performance of distributed simulations.
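At the heart of any such reallocation policy is a redistribution step that moves load from overloaded to underloaded hosts. A toy sketch under strong assumptions (load reduced to one scalar per host; the thesis's monitoring, migration machinery, and HLA specifics are all omitted):

```cpp
// Toy centralized rebalancing step: repeatedly migrate one unit of load
// from the most-loaded to the least-loaded host while the spread exceeds
// a tolerance. Real schemes must also weigh migration latency and cost.
#include <algorithm>
#include <vector>
#include <cstdio>

void rebalance(std::vector<double>& load, double tolerance, int max_migrations) {
    for (int m = 0; m < max_migrations; ++m) {
        auto hi = std::max_element(load.begin(), load.end());
        auto lo = std::min_element(load.begin(), load.end());
        if (*hi - *lo <= tolerance) break;      // balanced enough, stop
        double unit = std::min(1.0, (*hi - *lo) / 2.0);
        *hi -= unit;                            // migrate one entity's worth
        *lo += unit;
    }
}

int main() {
    std::vector<double> load = {9.0, 2.0, 4.0, 1.0};  // per-host load samples
    rebalance(load, 0.5, 100);
    for (double l : load) std::printf("%.2f ", l);
    std::printf("\n");
    return 0;
}
```

A centralized controller of this kind is simple but, as the abstract notes, introduces bottlenecks and a single point of failure, which is what motivates the distributed redistribution algorithm.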
83

Scalable and Transparent Parallelization of Multiplayer Games

Simion, Bogdan, 15 February 2010
In this thesis, we study the parallelization of multiplayer games using software transactional memory (STM) support. We show that STM provides not only ease of programming but also better scalability than is achievable with state-of-the-art lock-based programming for this realistic, high-impact application. We evaluate and compare two parallel implementations of a simplified version (named SynQuake) of the popular game Quake. While STM SynQuake automatically maintains the consistency of each potentially complex game action, lock-based SynQuake inherently requires conservative locking of the surrounding objects within a bounding box for the duration of the game action. This leads to higher scalability of STM SynQuake over lock-based SynQuake, due to increased false sharing in the latter. Task assignment to threads has a second-order effect on the scalability of STM SynQuake, impacting the application's true-sharing patterns. We show that a locality-aware task assignment provides the best trade-off between load balancing and conflict reduction.
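As an illustration of what a locality-aware task assignment might look like, here is a hedged sketch assuming the game map is cut into vertical stripes, one per thread (the thesis does not specify this particular partitioning): entities in the same stripe are handled by the same thread, so cross-thread conflicts concentrate at stripe borders.

```cpp
// Locality-aware task assignment sketch: players in the same map stripe
// are handled by the same thread, so threads mostly touch disjoint map
// state (fewer aborts under STM, less contention under locks).
#include <algorithm>
#include <vector>
#include <cstdio>

struct Player { float x, y; };

// Assign each player to a thread by vertical stripe of the map.
std::vector<int> assign_by_stripe(const std::vector<Player>& players,
                                  float map_width, int num_threads) {
    std::vector<int> owner(players.size());
    const float stripe = map_width / num_threads;
    for (size_t i = 0; i < players.size(); ++i) {
        int t = static_cast<int>(players[i].x / stripe);
        owner[i] = std::min(t, num_threads - 1);  // clamp players on the map edge
    }
    return owner;
}

int main() {
    std::vector<Player> players = {{10, 5}, {90, 40}, {55, 20}, {12, 33}};
    auto owner = assign_by_stripe(players, 100.0f, 4);
    for (size_t i = 0; i < owner.size(); ++i)
        std::printf("player %zu -> thread %d\n", i, owner[i]);
    return 0;
}
```

Pure striping favors conflict reduction over load balance; a scheme like the one evaluated in the thesis has to weigh both, for example by resizing regions as players cluster.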
84

iC2mpi: A Platform for Parallel Execution of Graph-Structured Iterative Computations

Botadra, Harnish, 2 August 2006
Parallelization of sequential programs is often daunting because of the substantial development cost involved. Various solutions have been proposed to address this concern, including directive-based approaches and parallelization platforms. These solutions have not always been successful, in part because many try to address all types of applications. We propose a platform for parallelization of a class of applications that have similar computational structure, namely graph-structured iterative applications. iC2mpi is a unique proof-of-concept prototype platform that provides relatively easy parallelization of existing sequential programs and facilitates experimentation with static partitioning and dynamic load balancing schemes. We demonstrate with various generic application graph topologies and an existing application, namely a time-stepped battlefield management simulation, that our platform can produce good performance with very little effort.
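The class of applications targeted here shares a common shape: per-node values recomputed from neighbor values, iterated to convergence. A minimal serial skeleton of that structure follows; the iC2mpi API itself is not shown in the abstract, so this is a generic stand-in for the kind of loop such a platform would partition across processors.

```cpp
// Skeleton of a graph-structured iterative computation: each iteration,
// every node recomputes its value from its neighbors' values, until the
// largest change falls below a threshold. Serial here; a platform like
// iC2mpi would distribute the node loop over a static partition of the
// graph and exchange boundary values between iterations.
#include <algorithm>
#include <cmath>
#include <vector>
#include <cstdio>

int main() {
    // Ring of 6 nodes; each node averages with its two neighbors.
    const int n = 6;
    std::vector<std::vector<int>> nbr(n);
    for (int i = 0; i < n; ++i) nbr[i] = {(i + n - 1) % n, (i + 1) % n};

    std::vector<double> val = {1, 0, 0, 0, 0, 0}, next(n);
    for (int iter = 0; iter < 1000; ++iter) {
        double delta = 0.0;
        for (int i = 0; i < n; ++i) {
            double sum = val[i];
            for (int j : nbr[i]) sum += val[j];
            next[i] = sum / (1 + nbr[i].size());
            delta = std::max(delta, std::fabs(next[i] - val[i]));
        }
        val.swap(next);
        if (delta < 1e-9) { std::printf("converged after %d iterations\n", iter); break; }
    }
    return 0;
}
```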
86

Improving Cache Behavior in CMP Architectures through Cache Partitioning Techniques

Moretó Planas, Miquel, 19 March 2010
The evolution of microprocessor design in the last few decades has changed significantly, moving from simple in-order, single-core architectures to superscalar and vector architectures in order to extract the maximum available instruction-level parallelism. Executing several instructions from the same thread in parallel allows significantly improving the performance of an application. However, there is only a limited amount of parallelism available in each thread, because of data and control dependences. Furthermore, designing a high-performance, single, monolithic processor has become very complex due to power and chip-latency constraints. These limitations have motivated the use of thread-level parallelism (TLP) as a common strategy for improving processor performance. Multithreaded processors allow executing different threads at the same time, sharing some hardware resources. There are several flavors of multithreaded processors that exploit TLP, such as chip multiprocessors (CMP), coarse-grain multithreading, fine-grain multithreading, simultaneous multithreading (SMT), and combinations of them.

To improve cost and power efficiency, the computer industry has adopted multicore chips. In particular, CMP architectures have become the most common design decision (sometimes combined with multithreaded cores). Firstly, CMPs reduce design costs and average power consumption by promoting design re-use and simpler processor cores. For example, it is less complex to design a chip with many small, simple cores than a chip with fewer, larger, monolithic cores; furthermore, simpler cores have less power-hungry centralized hardware structures. Secondly, CMPs reduce costs by improving hardware resource utilization. On a multicore chip, co-scheduled threads can share costly microarchitecture resources that would otherwise be underutilized. Higher resource utilization improves aggregate performance and enables lower-cost design alternatives.

One of the resources with the greatest impact on the final performance of an application is the cache hierarchy. Caches store data recently used by applications in order to take advantage of temporal and spatial locality, providing fast access to data and improving application performance. Caches with low latencies have to be small, which prompts the design of a cache hierarchy organized into several levels.

In CMPs, the cache hierarchy is normally organized into a first level (L1) of instruction and data caches private to each core, while a last-level cache (LLC) is normally shared among the different cores in the processor (L2, L3, or both). Shared caches increase resource utilization and system performance. Large caches improve performance and efficiency by increasing the probability that each application can access data from a closer level of the cache hierarchy; they also allow an application to make use of the entire cache if needed.

A second advantage of having a shared cache in a CMP design has to do with cache coherency. In parallel applications, different threads share the same data and keep a local copy of this data in their caches. With multiple processors, it is possible for one processor to change the data, leaving another processor's cache with outdated data. A cache coherency protocol monitors changes to data and ensures that all processor caches have the most recent data. When the parallel application executes on the same physical chip, the cache coherency circuitry can operate at the speed of on-chip communication, rather than having to use the much slower chip-to-chip communication required with discrete processors on separate chips. These coherence protocols are also simpler to design with a unified, shared level of cache on chip.

Due to the advantages that multicore architectures offer, chip vendors use CMP architectures in current high-performance, network, real-time, and embedded systems. Several of these commercial processors have a level of the cache hierarchy shared by different cores. For example, the Sun UltraSPARC T2 has a 16-way, 4MB L2 cache shared by 8 cores, each of them up to 8-way SMT. Other processors, like the Intel Core 2 family, also share up to a 12MB, 24-way L2 cache. In contrast, the AMD K10 family has a private L2 cache per core and a shared L3 cache of up to 6MB and 64 ways.

As the long-term trend of increasing integration continues, the number of cores per chip is also projected to increase with each successive technology generation. Significant studies have shown that processors with hundreds of cores per chip will appear on the market in the following years. The manycore era has already begun. Although this era provides many opportunities, it also presents many challenges. In particular, higher hardware resource sharing among concurrently executing threads can cause an individual thread's performance to become unpredictable and might lead to violations of individual applications' performance requirements. Current resource management mechanisms and policies are no longer adequate for future multicore systems.

Some applications present low re-use of their data and pollute caches with data streams (such as multimedia, communications, or streaming applications), or have many compulsory misses that cannot be solved by assigning more cache space to the application. Traditional eviction policies such as Least Recently Used (LRU), pseudo-LRU, or random are demand-driven; that is, they tend to give more space to the application that has more accesses to the cache hierarchy. When no direct control over shared resources is exercised (the last-level cache in this case), it is possible for a particular thread to allocate most of the shared resources, degrading other threads' performance. As a consequence, high resource sharing and resource utilization can cause systems to become unstable and violate individual applications' requirements. If we want to provide Quality of Service (QoS) to applications, we need to enhance the control over shared resources and enrich the collaboration between the OS and the architecture.

In this thesis, we propose software and hardware mechanisms to improve cache sharing in CMP architectures. We take a holistic approach, coordinating the targets of software and hardware to improve system aggregate performance and provide QoS to applications. We make use of explicit resource allocation techniques to control the shared cache in a CMP architecture, with resource allocation targets driven by hardware and software mechanisms.

The main contributions of this thesis are the following:

- We have characterized different single- and multithreaded applications and classified workloads with a systematic method to better understand and explain the cache-sharing effects on a CMP architecture. We have also made a special effort to study previous cache partitioning techniques for CMP architectures, in order to acquire the insight needed to propose improved mechanisms.

- In CMP architectures with out-of-order processors, cache misses can be served in parallel and share the miss penalty of accessing main memory. We take this fact into account to propose new cache partitioning algorithms guided by the memory-level parallelism (MLP) of each application. With these algorithms, system performance is improved (in terms of throughput and fairness) without significantly increasing the hardware required by previous proposals.

- Driving cache-partition decisions with indirect indicators of performance, such as misses, MLP, or data re-use, may lead to suboptimal cache partitions. Ideally, the appropriate metric to drive cache partitions should be the target metric to optimize, which is normally related to IPC. Thus, we have developed a hardware mechanism, OPACU, which is able to obtain accurate run-time predictions of the performance of an application when running with different cache assignments.

- Using these performance predictions, we have introduced a new framework to manage shared caches in CMP architectures, FlexDCP, which allows the OS to optimize different IPC-related target metrics, such as throughput or fairness, and to provide QoS to applications. FlexDCP allows an enhanced coordination between the hardware and software layers, which leads to improved system performance and flexibility.

- Next, we have made use of performance estimations to reduce the load-imbalance problem in parallel applications. We have built a run-time mechanism that detects parallel applications sensitive to cache allocation and, in these situations, reduces load imbalance by assigning more cache space to the slowest threads. This mechanism helps reduce the long optimization time, in terms of man-years of effort, devoted to large-scale parallel applications.

- Finally, we have stated the main characteristics that future multicore processors with thousands of cores should have. An enhanced coordination between the software and hardware layers has been proposed to better manage the shared resources in these architectures.
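The greedy way-allocation idea underlying such partitioning schemes can be sketched as follows. This is a toy in the spirit of utility-based cache partitioning; the per-application miss curves are made up, and the MLP weighting and run-time IPC prediction (OPACU) described above are omitted: cache ways are handed out one at a time to whichever application benefits most.

```cpp
// Greedy utility-based way partitioning: ways are granted one at a time
// to the application whose miss count drops the most with one extra way.
// The thesis refines this proxy by weighting misses with MLP and, later,
// by predicting IPC directly instead of using misses at all.
#include <vector>
#include <cstdio>

int main() {
    // misses[a][w] = misses of app a when given w ways (hypothetical curves).
    std::vector<std::vector<double>> misses = {
        {100, 60, 40, 35, 33, 32, 31, 31, 31},   // app 0: cache-friendly
        {100, 98, 96, 94, 92, 90, 88, 86, 84},   // app 1: streaming-like
    };
    const int total_ways = 8;
    std::vector<int> ways(misses.size(), 0);

    for (int w = 0; w < total_ways; ++w) {
        int best = 0;
        double best_gain = -1.0;
        for (size_t a = 0; a < misses.size(); ++a) {
            double gain = misses[a][ways[a]] - misses[a][ways[a] + 1];
            if (gain > best_gain) { best_gain = gain; best = (int)a; }
        }
        ++ways[best];   // the way goes where it saves the most misses
    }
    for (size_t a = 0; a < ways.size(); ++a)
        std::printf("app %zu: %d ways\n", a, ways[a]);
    return 0;
}
```

Under a demand-driven policy like LRU, the streaming-like application would instead claim space in proportion to its accesses, which is precisely the pollution problem the abstract describes.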
87

Achieving Scalable, Exhaustive Network Data Processing by Exploiting Parallelism

Mawji, Afzal, January 2004
Telecommunications companies (telcos) and Internet Service Providers (ISPs) monitor the traffic passing through their networks for the purposes of network evaluation and planning for future growth. Most monitoring techniques currently use a form of packet sampling. However, exhaustive monitoring is a preferable solution because it ensures accurate traffic characterization and also allows encoding operations, such as compression and encryption, to be performed. To overcome the very high computational cost of exhaustive monitoring and encoding of data, this thesis suggests exploiting parallelism. By utilizing a parallel cluster in conjunction with load balancing techniques, a simulation is created to distribute the load across the parallel processors. It is shown that a very scalable system, capable of supporting a fairly high data rate, can potentially be designed and implemented. A complete system is then implemented in the form of a transparent Ethernet bridge, ensuring that the system can be deployed into a network without any change to that network. The system focuses its encoding efforts on obtaining the maximum compression rate and, to that end, utilizes the concept of streams, which separates data packets into individual flows that are correlated and whose redundancy can be removed through compression. Experiments show that compression rates are favourable and confirm good throughput rates and high scalability.
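The "streams" concept can be sketched as follows, assuming flows are keyed by the usual 5-tuple of addresses, ports, and protocol (the bridge, cluster distribution, and actual compressor from the thesis are omitted; the flow keys and payloads below are made up):

```cpp
// Stream separation sketch: group packets by flow key so that each
// compression context sees only correlated data from a single flow,
// which typically compresses better than an interleaved packet stream.
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <tuple>

struct FlowKey {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
    bool operator<(const FlowKey& o) const {
        return std::tie(src_ip, dst_ip, src_port, dst_port, proto) <
               std::tie(o.src_ip, o.dst_ip, o.src_port, o.dst_port, o.proto);
    }
};

int main() {
    std::map<FlowKey, std::string> streams;
    // Toy "packets": (flow key, payload). A real bridge parses Ethernet/IP headers.
    FlowKey a{0x0A000001, 0x0A000002, 1234, 80, 6};
    FlowKey b{0x0A000003, 0x0A000002, 5678, 80, 6};
    streams[a] += "GET /index.html ";
    streams[b] += "GET /logo.png ";
    streams[a] += "GET /index.html ";   // repeated content within one flow
    for (const auto& [key, data] : streams)
        std::printf("flow %u:%u -> %zu bytes buffered for compression\n",
                    (unsigned)key.src_ip, (unsigned)key.src_port, data.size());
    return 0;
}
```

Buffering per flow means a compressor's history window sees repeated, correlated content rather than interleaved packets from unrelated flows.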
88

Load Balancing Schemes for Distributed Real-Time Interactive Virtual World Simulations

Cunningham, Ian Joseph, January 2000
Over the last several years, there has been tremendous growth in online gaming (i.e., playing games over the Internet). The Massively Multiplayer Online Role Playing Game (MMORPG) is one type of online game. An MMORPG is played within a virtual world. Users have an in-game representation, called an avatar, that they control. Typically there are over a thousand avatars in the virtual world at one time. Users use client software to connect to an MMORPG server over the Internet. If just one server is used, then the number of avatars that can be supported in the virtual world at one time is severely limited. In order to overcome this, a multi-server approach is needed. Unlike traditional load balancing and partitioning schemes, which generally use task partitioning, data partitioning is required in this case. This thesis investigates schemes for partitioning and load balancing MMORPG applications on a network of processors. In particular, three different schemes were developed and examined: Static Av, Static MS, and Dynamic MS. Static Av assigns avatars to each server, one at a time, as they enter the simulation. Static MS assigns equal-sized portions of the map of the virtual world to each server; an avatar is assigned to the server that owns the part of the map the avatar is "standing" on. Dynamic MS divides the map into many more segments than there are servers; the map segments are dynamically distributed among the servers based on the results of a load balancing algorithm. The thesis details the algorithms and the performance associated with each of the schemes. In summary, Static Av does not perform well, whereas Static MS and Dynamic MS can be used to parallelize MMORPG applications. To the best of our knowledge, this is the first published work that looks at the issue of parallelizing and load balancing such applications.
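A hedged sketch of the Dynamic MS redistribution step, assuming a segment's load is simply its avatar count and ignoring migration cost and segment adjacency (both of which a real implementation would have to consider):

```cpp
// Dynamic MS sketch: many map segments, few servers. Sort segments by
// avatar count and assign each to the currently least-loaded server
// (greedy longest-processing-time-first scheduling).
#include <algorithm>
#include <vector>
#include <cstdio>

int main() {
    std::vector<int> seg_load = {40, 3, 25, 7, 18, 9, 30, 5};  // avatars per segment
    const int servers = 3;
    std::vector<int> srv_load(servers, 0);

    std::vector<int> order(seg_load.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return seg_load[a] > seg_load[b]; });

    for (int s : order) {
        int tgt = (int)(std::min_element(srv_load.begin(), srv_load.end()) -
                        srv_load.begin());
        srv_load[tgt] += seg_load[s];
        std::printf("segment %d (%d avatars) -> server %d\n", s, seg_load[s], tgt);
    }
    return 0;
}
```

Having many more segments than servers is what makes this work: fine-grained units let the balancer even out totals, which a one-region-per-server static split (Static MS) cannot do once avatars cluster.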
89

On Optimizing Traffic Distribution for Clusters of Network Intrusion Detection and Prevention Systems

Le, Anh, January 2008
To address the overload conditions caused by increasing network traffic volume, recent literature in the network intrusion detection and prevention field has proposed the use of clusters of network intrusion detection and prevention systems (NIDPSs). We observe that simple traffic distribution schemes are usually used for NIDPS clusters. These schemes have two major drawbacks: (1) the loss of correlation information caused by the traffic distribution, because correlated flows are not sent to the same NIDPS, and (2) the unbalanced loads of the NIDPSs. The first drawback severely affects the ability to detect intrusions that require analysis of correlated flows. The second greatly increases the chance of overloading an NIDPS even when the loads of the others are low. In this thesis, we address these two drawbacks. In particular, we propose two novel traffic distribution systems, the Correlation-Based Load Balancer and the Correlation-Based Load Manager, as two different solutions to the NIDPS traffic distribution problem. On the one hand, the Load Balancer and the Load Manager both consider the current loads of the NIDPSs while distributing traffic, to provide fine-grained load balancing and dynamic load distribution, respectively. On the other hand, both systems take traffic correlation into account in their distributions, thereby significantly reducing the loss of correlation information. We have implemented prototypes of both systems and evaluated them using extensive simulations and real traffic traces. Overall, the evaluation results show that both systems have low overhead in terms of the delays introduced to packets. More importantly, compared to a naive hash-based distribution, the Load Balancer significantly improves the anomaly-based detection accuracy for DDoS attacks and port scans, the two major attacks that require analysis of correlated flows, while the Load Manager successfully maintains the NIDPSs' anomaly-based detection accuracy for these two attacks.
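A minimal sketch of correlation-aware distribution, assuming the correlation key is the destination IP (a natural choice for DDoS and port-scan detection, where flows toward one target must be analyzed together); the key-to-sensor pinning table and load threshold are hypothetical, not the thesis's actual design:

```cpp
// Correlation-aware distribution sketch: all flows sharing a correlation
// key (here, the destination IP) are pinned to one NIDPS via a mapping
// table; a key group is only ever moved to the least-loaded sensor as a
// whole, so correlated flows are never split across sensors.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

int assign(uint32_t dst_ip, const std::vector<double>& load,
           std::unordered_map<uint32_t, int>& pinned, double overload) {
    auto it = pinned.find(dst_ip);
    if (it != pinned.end() && load[it->second] < overload)
        return it->second;                       // keep the group together
    int tgt = (int)(std::min_element(load.begin(), load.end()) - load.begin());
    pinned[dst_ip] = tgt;                        // (re)pin the whole key group
    return tgt;
}

int main() {
    std::vector<double> load = {0.2, 0.9, 0.4};  // per-NIDPS load estimates
    std::unordered_map<uint32_t, int> pinned;
    for (uint32_t ip : {0xC0A80001u, 0xC0A80001u, 0xC0A80002u}) {
        int s = assign(ip, load, pinned, 0.95);
        load[s] += 0.05;                         // crude load accounting
        std::printf("flow to dst %08x -> NIDPS %d\n", (unsigned)ip, s);
    }
    return 0;
}
```

A plain hash on the full 5-tuple would balance load but scatter a victim's flows across sensors, which is exactly the correlation loss the abstract identifies.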
90

Evaluation of Load Balancing Algorithms in IP Networks: A Case Study at TeliaSonera

Hasselström, Emil and Sjögren, Therese, January 2005
The principle of load balancing is to distribute the data load more evenly over the network in order to increase network performance and efficiency. With dynamic load balancing, the routing is updated at certain intervals. This thesis was developed to evaluate load balancing methods in the IP network of TeliaSonera. Load balancing using short path routing, bottleneck load balancing, and load balancing using MPLS have been evaluated. Short path routing is a flow-sharing technique that allows routing on paths other than the shortest one; load balancing using short path routing is achieved by dynamic updates of the link weights. Bottleneck load balancing is by nature a dynamic algorithm; unlike load balancing using short path routing, it updates the flow sharing, not the metrics. The algorithm uses information about the current flow sharing and link loads to detect bottlenecks within the network, and this information is used to calculate new flow-sharing parameters. When using MPLS, one or more complete routing paths (LSPs) are defined at each edge LSR before any traffic is sent; MPLS brings the ability to perform flow sharing by defining the paths to be used and how the outgoing data load is to be shared among them. The model has been built from data about the network supplied by TeliaSonera and consists of a topology part, a traffic part, a routing part, and a cost part. The traffic model consists of an OD (origin-destination) demand matrix, which has been estimated from collected link loads using two estimation models: the gravity model and an optimisation model. The algorithms have been analysed under several scenarios: normal network, core node failure, core link failure, and DWDM system failure. A cost function in which the cost increases as the link load increases has been used to evaluate the algorithms. The signalling requirements for implementing the load balancing algorithms have also been investigated.
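The load-dependent cost idea can be sketched as below, assuming a convex piecewise-linear per-link cost of the kind common in the traffic-engineering literature (Fortz-Thorup-style breakpoints are used here; the exact cost function from the thesis is not given in the abstract):

```cpp
// Convex link cost sketch: the cost per unit of flow rises steeply as a
// link approaches capacity, so routings that spread load score better.
#include <cstdio>
#include <utility>
#include <vector>

double link_cost(double load, double capacity) {
    double u = load / capacity;                // link utilization
    if (u < 1.0 / 3) return u;                 // piecewise-linear, convex
    if (u < 2.0 / 3) return 3 * u - 2.0 / 3;
    if (u < 0.9)     return 10 * u - 16.0 / 3;
    return 70 * u - 178.0 / 3;                 // heavy penalty near saturation
}

int main() {
    // Total network cost = sum of link costs; compare two ways of
    // splitting 10 units of demand over two 10-unit links.
    std::vector<std::pair<double, double>> even = {{5, 10}, {5, 10}};
    std::vector<std::pair<double, double>> skew = {{9, 10}, {1, 10}};
    auto total = [](const std::vector<std::pair<double, double>>& links) {
        double c = 0;
        for (auto [l, cap] : links) c += link_cost(l, cap);
        return c;
    };
    std::printf("even split cost %.3f, skewed cost %.3f\n", total(even), total(skew));
    return 0;
}
```

Because the cost is convex, the even split is always at least as cheap as the skewed one, which is what drives both the link-weight updates of short path routing and the flow-sharing updates of the bottleneck algorithm toward balanced solutions.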
