221. Monitoring-as-a-service in the cloud. Meng, Shicong. 03 April 2012.
State monitoring is a fundamental building block for cloud services.
The demand for providing state monitoring as a service (MaaS) continues to grow, as evidenced by CloudWatch from Amazon EC2, which allows cloud consumers to pay for monitoring a selection of performance metrics with coarse-grained periodic sampling of runtime states. One of the key challenges for wide deployment of MaaS is to strike a better balance among a set of critical quality and performance parameters, such as accuracy, cost, scalability, and customizability.
This dissertation is dedicated to the research and development of an elastic framework for providing state monitoring as a service (MaaS). We analyze the limitations of existing techniques, systematically identify the needs and challenges at different layers of a cloud monitoring service platform, and develop a suite of distributed monitoring techniques to support a flexible monitoring infrastructure, cost-effective state monitoring, and monitoring-enhanced cloud management. At the monitoring infrastructure layer, we develop techniques to support multi-tenancy of monitoring services by exploiting cost sharing between monitoring tasks and safeguarding monitoring resource usage. To provide elasticity in monitoring, we propose techniques that allow the monitoring infrastructure to self-scale with monitoring demand. At the cost-effective state monitoring layer, we devise several new state monitoring functionalities to meet the unique functional requirements of cloud monitoring. Violation-likelihood state monitoring explores the benefits of consolidating monitoring workloads by allowing utility-driven tuning of the monitoring intensity of individual monitoring tasks and by identifying correlations between monitoring tasks. Window-based state monitoring leverages distributed windows to achieve the best monitoring accuracy and communication efficiency. Reliable state monitoring is robust to both transient and long-lasting communication issues caused by component failures or cross-VM performance interference. At the monitoring-enhanced cloud management layer, we devise a novel technique that learns the performance characteristics of both the cloud infrastructure and cloud applications from cumulative performance monitoring data to increase cloud deployment efficiency.
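The window-based idea can be illustrated with a minimal sketch (hypothetical Python, not the dissertation's implementation; the class, thresholds, and reporting protocol are assumed): a monitor reports a violation only when enough samples inside a sliding window breach the threshold, suppressing transient spikes and the reporting traffic they would otherwise generate.

```python
from collections import deque

class WindowedMonitor:
    """Report a state violation only if at least k of the last w samples
    exceed the local threshold, suppressing transient spikes (and traffic)."""
    def __init__(self, threshold: float, window: int, k: int):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # sliding window of recent flags
        self.k = k

    def observe(self, value: float) -> bool:
        self.samples.append(value > self.threshold)
        # Fire only on persistent violations within the window.
        return sum(self.samples) >= self.k

monitor = WindowedMonitor(threshold=0.9, window=10, k=3)
for cpu in [0.5, 0.95, 0.6, 0.92, 0.97, 0.4]:
    if monitor.observe(cpu):
        print("violation reported to coordinator")  # fires once, at the 5th sample
```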
222. Distributed database support for networked real-time multiplayer games. Grimm, Henrik. January 2002.
The focus of this dissertation is on large-scale and long-running networked real-time multiplayer games. In this type of game, each player controls one or many entities, which interact in a shared virtual environment. Three attributes (scalability, security, and fault tolerance) are considered essential for this type of game. The usual approaches for building such games, using a client/server or peer-to-peer architecture, fail to achieve all three attributes. We propose a server-network architecture that supports these attributes. In this architecture, a cluster of servers collectively manages the game state, with each server managing a separate region of the virtual environment. We discuss how the architecture can be extended using proxies, and we compare it to other similar architectures. Further, we investigate how a distributed database management system can support the proposed architecture. Since efficiency is very important in this type of game, some properties of traditional database systems must be relaxed. We also show how methods for increasing scalability, such as interest management and dead reckoning, can be implemented in a database system. Finally, we suggest how the proposed architecture can be validated using a simulation of a large-scale game.
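Dead reckoning, one of the scalability methods mentioned, lets each node extrapolate a remote entity's motion and send state updates only when the prediction error grows too large. A minimal sketch (the threshold and state layout are illustrative assumptions, not the thesis's design):

```python
import math

def dead_reckon(pos, vel, dt):
    """Extrapolate an entity's position from its last known state."""
    return (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)

def needs_update(true_pos, predicted_pos, threshold=1.0):
    """Send a state update only when prediction error exceeds the threshold."""
    return math.dist(true_pos, predicted_pos) > threshold

# Server side: suppress updates while clients can predict accurately.
last_sent = ((0.0, 0.0), (2.0, 1.0))   # (position, velocity) last broadcast
true_pos = (2.3, 1.4)                   # actual position after dt = 1.0
predicted = dead_reckon(*last_sent, dt=1.0)
if needs_update(true_pos, predicted):
    print("broadcast new state")        # error is 0.5 here: no broadcast
```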
223. Re-authentication of Critical Operations / Återautentisering av Kritiska Operationer. Yachouh, Marwan. January 2002.
This is a study on the development of a re-authentication prototype. Re-authentication serves as a receipt for, e.g., system administrators, authorising them to carry out a critical operation in a system that is already protected by a security architecture. A critical operation is one that can cause serious damage to a network node, or a set of network nodes, if it is carried out without a second thought. The purpose is to prevent mistakes and to secure the users' audit trail.

The main task is to propose and implement a re-authentication prototype, that is, to incorporate the re-authentication prototype into an already complete security architecture while preserving the security and performance level of that architecture.

This thesis addresses the problem by using digitally signed certificates to provide the necessary security. The certificates used are called re-authentication certificates and follow the X.509 attribute certificate standard. A re-authentication certificate is optimised so that it holds authorisation information for only one critical operation. An access control decision function decides whether the re-authentication certificate and its owner are authentic; on the basis of that decision, the user can be granted the authority to execute critical operations.

The finished prototype confirms that re-authentication can be incorporated into the security architecture. The report also shows that the security status of the architecture is preserved. The performance of the prototype is difficult to prove, since the prototype implementation only initialises the objects required to demonstrate the security properties; a performance test can therefore never show how the prototype would perform in an authentic environment. The performance is assumed to be adequate, since the prototype uses the same authentication function as the security architecture.
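The decision logic can be sketched as follows. To keep the example self-contained, this hypothetical sketch substitutes an HMAC-signed token for the thesis's digitally signed X.509 attribute certificate; all names, fields, and the shared key are assumptions, not the prototype's actual design.

```python
import hmac, hashlib, json, time

AUTHORITY_KEY = b"shared-secret"  # stand-in for the signing authority's key

def issue_reauth_cert(user: str, operation: str, ttl: int = 60) -> dict:
    """Issue a single-operation re-authentication 'certificate'."""
    body = {"user": user, "operation": operation, "expires": time.time() + ttl}
    payload = json.dumps(body, sort_keys=True).encode()
    return {"body": body,
            "sig": hmac.new(AUTHORITY_KEY, payload, hashlib.sha256).hexdigest()}

def access_decision(cert: dict, user: str, operation: str) -> bool:
    """Access control decision function: authentic, unexpired, and scoped
    to exactly this user and this one critical operation."""
    payload = json.dumps(cert["body"], sort_keys=True).encode()
    expected = hmac.new(AUTHORITY_KEY, payload, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(cert["sig"], expected)
            and cert["body"]["user"] == user
            and cert["body"]["operation"] == operation
            and cert["body"]["expires"] > time.time())

cert = issue_reauth_cert("admin", "reboot-core-router")
assert access_decision(cert, "admin", "reboot-core-router")
assert not access_decision(cert, "admin", "delete-database")  # wrong operation
```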
224. Fault tolerance in distributed systems: a coding-theoretic approach. Balasubramanian, Bharath. 19 November 2012.
Distributed systems are rapidly increasing in importance due to the need for scalable computations on huge volumes of data. This fact is reflected in many real-world distributed applications such as Amazon's EC2 cloud computing service, Facebook's Cassandra key-value store, and Apache's Hadoop MapReduce framework. Multi-core architectures developed by companies such as Intel and AMD have further brought this to prominence, since workloads can now be distributed across many individual cores. The nodes or entities in such systems are often built using commodity hardware and are prone to physical failures and security vulnerabilities. Achieving fault tolerance in such systems is a challenging task, since it is not easy to observe and control these distributed entities. Replication is the standard approach to fault tolerance in distributed systems. Its main advantage is that the backups incur very little overhead in terms of the time taken for normal operation or recovery. However, replication is grossly wasteful in terms of the number of backups required for fault tolerance. The large number of backups has two major implications. First, the total space or memory required for fault tolerance is considerably high. Second, there is a significant cost in resources such as the power required to run the backup processes. Given the large number of distributed servers employed in real-world applications, it is hard to provide fault tolerance while achieving both space and operational efficiency. In the world of data fault tolerance and communication, coding theory is used as the space-efficient alternative to replication. A direct application of coding theory to distributed servers, treating the servers as blocks of data, is very inefficient in terms of updates to the backups, primarily because each update to a server affects many blocks in memory, all of which have to be re-encoded at the backups. This leads us to the following thesis statement: can we design a mechanism for fault tolerance in distributed systems that combines the space efficiency of coding theory with the low operational overhead of replication? We present a new paradigm to solve this problem, broadly referred to as fusion. We provide fusion-based solutions for two models of computation that are representative of a large class of applications: (i) systems modeled as deterministic finite state machines and (ii) systems modeled as programs containing data structures. For finite state machines, we use the notion of Hamming distances to present a polynomial-time algorithm that generates efficient backup state machines. For programs hosting data structures, we use a combination of erasure codes and selective replication to generate efficient backups for the most commonly used data structures, such as queues, array lists, linked lists, vectors, and maps. We present theoretical and experimental results that demonstrate the efficiency of our schemes over replication. Finally, we use our schemes to design an efficient solution for fault tolerance in two real-world applications: Amazon's Dynamo key-value store and Google's MapReduce framework.
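The space advantage of coding over replication shows up even in a toy version of the fusion idea: one parity structure can back up several primaries. The sketch below (illustrative only; the dissertation's fusion operator is more general, covering state machines and incremental updates) XORs three equal-length array lists into a single fused backup and recovers any one failed primary from the parity plus the survivors.

```python
from functools import reduce

# Three primary "array list" hosts, each an integer array of the same length.
primaries = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]

# Fused backup: one XOR parity array covers all three primaries, using a
# third of the space that replicating each host separately would need.
fused = [reduce(lambda a, b: a ^ b, col) for col in zip(*primaries)]

def recover(failed: int) -> list:
    """Rebuild the failed host from the fused backup and the survivors."""
    survivors = [p for i, p in enumerate(primaries) if i != failed]
    return [f ^ reduce(lambda a, b: a ^ b, col)
            for f, col in zip(fused, zip(*survivors))]

assert recover(1) == [1, 5, 9]  # host 1 rebuilt after a crash
```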
225. Robot data and control server for Internet-based training on ground robots. Kalyadin, Dmitry. 01 June 2007.
To meet the emerging need for remote robot training and reachback, this thesis describes a system that allows for convenient web-browser-based robot operation over the Internet, while providing the means for recording and playback of all video, data, and user actions. Training of first-responder personnel on rescue robots is hindered by the fact that these devices are very expensive and are only affordable by a few specialized organizations that make them available by request at the time of a disaster. The system described in this thesis allows first responders to practice on the robots without having to be physically present at the same location. With these remote-presence capabilities, the system can also be used in a real-world response to transmit robot video and data to persons not present at the site of the incident, such as structural engineers or medical doctors.
The recording capability will be used as an aid during training and to help resolve accountability issues in real-world scenarios. Similar demands in the area of network video surveillance are met by the use of a network DVR that records and relays video and controls between IP cameras and Internet clients. The server implemented in this thesis is unique in that it extends these capabilities to include data from various robot sensors. All of the above video, data, and controls are combined into a convenient web-browser-based graphical user interface. The server was implemented and tested using rescue robots, but could be tailored to any other distributed robot architecture where reliable and convenient web-browser-based robot operation over the Internet is desired.
System testing validated the server's capability for remote multi-user robot operation, as well as its unique ability to store and play back an external camera view along with robot video and data, to help with situation awareness. Conclusions drawn from the experiments indicate that this system can indeed be used for Internet robot training, as well as for other robotics research, such as bandwidth regulation techniques or human-robot interaction studies by non-computer-science researchers who do not have physical access to robots.
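The record-and-playback capability at the heart of such a server can be sketched as a timestamped event log that replays sensor data and user commands with their original relative timing. A hypothetical sketch (class and field names are assumed, not the thesis's implementation):

```python
import json, time

class MissionRecorder:
    """Log video frame references, sensor readings, and user commands with
    timestamps so a training session can be replayed in order."""
    def __init__(self):
        self.events = []

    def record(self, kind: str, payload):
        self.events.append({"t": time.monotonic(), "kind": kind, "payload": payload})

    def playback(self, speed: float = 1.0):
        """Re-emit events with original relative timing (scaled by speed)."""
        if not self.events:
            return
        t0, start = self.events[0]["t"], time.monotonic()
        for ev in self.events:
            delay = (ev["t"] - t0) / speed - (time.monotonic() - start)
            if delay > 0:
                time.sleep(delay)
            yield ev

rec = MissionRecorder()
rec.record("control", {"cmd": "forward", "speed": 0.4})
rec.record("sensor", {"co2_ppm": 415})
for ev in rec.playback(speed=2.0):   # replay at double speed
    print(json.dumps(ev))
```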
226. Policy architecture for distributed storage systems. Belaramani, Nalini Moti. 15 October 2009.
Distributed data storage is a building block for many distributed systems
such as mobile file systems, web service replication systems, enterprise file
systems, etc. New distributed data storage systems are frequently built as new
environments, requirements, or workloads emerge. The goal of this dissertation
is to develop the science of distributed storage systems by making it easier
to build new systems. In order to achieve this goal, it proposes a new policy
architecture, PADS, that is based on two key ideas: first, by providing a set of
common mechanisms in an underlying layer, new systems can be implemented
by defining policies that orchestrate these mechanisms; second, policy can be
separated into routing and blocking policy, each addressing a different part of the
system design. Routing policy specifies how data flow among nodes in order
to meet performance, availability, and resource usage goals, whereas blocking
policy specifies when it is safe to access data in order to meet consistency and
durability goals. This dissertation presents a PADS prototype that defines a set of distributed
storage mechanisms that are sufficiently flexible and general to support
a large range of systems, a small policy API that is easy to use and captures
the right abstractions for distributed storage, and a declarative language
for specifying policy that enables quick, concise implementations of complex
systems.
We demonstrate that PADS is able to significantly reduce development
effort by constructing a dozen significant distributed storage systems spanning
a large portion of the design space over the prototype. We find that each
system required only a couple of weeks of implementation effort and a few dozen lines of policy code.
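The routing/blocking split can be illustrated with a hypothetical sketch. PADS itself expresses policy in a declarative language over its own mechanism layer; the Python rendering below, including the event fields and node roles, is an assumed illustration of the division of labor, not the PADS API.

```python
# Routing policy: *where* data flows. E.g., push every write at a mobile
# node to its home server, and pull read misses from the nearest replica.
def routing_policy(event, node):
    if event["type"] == "write" and node["role"] == "mobile":
        return [("push", event["obj"], node["home_server"])]
    if event["type"] == "read_miss":
        return [("pull", event["obj"], node["nearest_replica"])]
    return []

# Blocking policy: *when* access is safe. E.g., block a read until the
# local copy is known to be durable and causally consistent.
def blocking_policy(obj_state):
    return obj_state["durable"] and obj_state["causally_consistent"]

event = {"type": "write", "obj": "/doc/42"}
node = {"role": "mobile", "home_server": "srv-A", "nearest_replica": "srv-B"}
print(routing_policy(event, node))   # [('push', '/doc/42', 'srv-A')]
print(blocking_policy({"durable": True, "causally_consistent": False}))  # False
```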
227. Machine Vision and Autonomous Integration Into an Unmanned Aircraft System. Van Horne, Chris.
ITC/USA 2013 Conference Proceedings / The Forty-Ninth Annual International Telemetering Conference and Technical Exhibition / October 21-24, 2013 / Bally's Hotel & Convention Center, Las Vegas, NV
The University of Arizona's Aerial Robotics Club (ARC) sponsors the development of an unmanned aerial vehicle (UAV) able to compete in the annual Association for Unmanned Vehicle Systems International (AUVSI) Seafarer Chapter Student Unmanned Aerial Systems competition. Modern programming frameworks are utilized to develop a robust distributed imagery and telemetry pipeline as the backend for a mission operator user interface. This paper discusses the design changes made for the 2013 AUVSI competition, including the integration of low-latency first-person view, updates to the distributed task backend, and incremental, asynchronous updates to the operator's user interface for real-time data analysis.
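A telemetry pipeline that feeds incremental, asynchronous updates to an operator UI can be sketched as a producer/consumer pair around a bounded queue (an illustrative asyncio sketch with assumed packet fields, not the ARC team's actual backend):

```python
import asyncio, random

async def telemetry_source(queue: asyncio.Queue):
    """Simulated UAV downlink: push telemetry packets as they arrive."""
    for seq in range(5):
        await asyncio.sleep(random.uniform(0.01, 0.05))
        await queue.put({"seq": seq, "alt_m": 120 + random.random()})
    await queue.put(None)  # end-of-stream sentinel

async def operator_ui(queue: asyncio.Queue):
    """UI task: render incremental updates without blocking the pipeline."""
    while (packet := await queue.get()) is not None:
        print(f"update #{packet['seq']}: altitude {packet['alt_m']:.1f} m")

async def main():
    queue = asyncio.Queue(maxsize=64)  # bounded queue gives backpressure
    await asyncio.gather(telemetry_source(queue), operator_ui(queue))

asyncio.run(main())
```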
228. Modeling Large Social Networks in Context. Ho, Qirong. 01 July 2014.
Today's social and internet networks contain millions or even billions of nodes, along with copious amounts of side information (context) such as text, attribute, temporal, image, and video data. A thorough analysis of a social network should consider both the graph and the associated side information, yet we also expect such analysis algorithms to execute in a reasonable amount of time on even the largest networks. Towards the goal of rich analysis on societal-scale networks, this thesis provides (1) modeling and algorithmic techniques for incorporating network context into existing network analysis algorithms based on statistical models, and (2) strategies for network data representation, model design, algorithm design, and distributed multi-machine programming that, together, ensure scalability to very large networks. The methods presented herein combine the flexibility of statistical models with key ideas and empirical observations from the data mining and social networks communities, and are supported by software libraries for cluster computing based on original distributed systems research. These efforts culminate in a novel mixed-membership triangle motif model that easily scales to large networks with over 100 million nodes on just a few cluster machines, and that can be readily extended to accommodate network context using the other techniques presented in this thesis.
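The raw statistic underlying a triangle motif model can be computed in a few lines: for each node, count the pairs of its neighbors that are themselves connected. An illustrative sketch on a toy graph (the thesis's mixed-membership model is a statistical model built over such motifs, and real inputs are sharded across cluster machines rather than held in one dictionary):

```python
from itertools import combinations

# Toy undirected graph as adjacency sets.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

def triangles_at(node: str) -> int:
    """Count triangle motifs incident on one node: pairs of its
    neighbors that are themselves connected."""
    return sum(1 for u, v in combinations(adj[node], 2) if v in adj[u])

for n in sorted(adj):
    print(n, triangles_at(n))   # a: 2, b: 1, c: 2, d: 1
```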
229. A study of transient bottlenecks: understanding and reducing latency long-tail problem in n-tier web applications. Wang, Qingyang. 21 September 2015.
An essential requirement of cloud computing or data centers is to simultaneously achieve good performance and high utilization for cost efficiency. High utilization through virtualization and hardware resource sharing is critical for both cloud providers and cloud consumers to reduce management and infrastructure costs (e.g., energy cost, hardware cost) and to increase cost-efficiency. Unfortunately, achieving good performance (e.g., low latency) for web applications at high resource utilization remains an elusive goal. Both practitioners and researchers have experienced the latency long-tail problem in clouds during periods of even moderate utilization (e.g., 50%). In this dissertation, we show that transient bottlenecks are an important contributing factor to the latency long-tail problem. Transient bottlenecks are bottlenecks with a short lifespan, on the order of tens of milliseconds. Though short-lived, a transient bottleneck can cause a long-tail response time distribution that spans 2 to 3 orders of magnitude, from tens of milliseconds to tens of seconds, due to the propagation and amplification of queuing effects caused by complex inter-tier resource dependencies in the system. Transient bottlenecks can arise from a wide range of factors at different system layers. For example, we have identified transient bottlenecks caused by CPU dynamic voltage and frequency scaling (DVFS) control at the CPU architecture layer, Java garbage collection (GC) at the system software layer, and virtual machine (VM) consolidation at the application layer. These factors interact with naturally bursty workloads from clients, often leading to transient bottlenecks that cause overall performance degradation even when all system resources are far from saturated (e.g., less than 50% utilized). By combining fine-grained monitoring tools with a sophisticated analytical method to generate and analyze monitoring data, we are able to detect and study transient bottlenecks in a systematic way.
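The core observation, that fine-grained traces reveal saturation which coarse averages hide, can be sketched as follows (an illustrative sketch: the 50 ms sampling period, thresholds, and trace values are assumed, not the dissertation's monitoring toolchain):

```python
# Fine-grained utilization trace (one sample per 50 ms) for one tier's CPU.
trace = [0.20, 0.30, 0.99, 1.00, 0.98, 0.15, 0.25, 1.00, 0.97, 0.10]

def transient_bottlenecks(samples, saturated=0.95, max_len=4):
    """Find short runs of saturation that coarse averages would miss."""
    runs, start = [], None
    for i, u in enumerate(samples + [0.0]):       # sentinel flushes last run
        if u >= saturated and start is None:
            start = i
        elif u < saturated and start is not None:
            if i - start <= max_len:              # short-lived: transient
                runs.append((start * 50, (i - start) * 50))  # (onset ms, length ms)
            start = None
    return runs

print(transient_bottlenecks(trace))               # [(100, 150), (350, 100)]
print(f"mean utilization: {sum(trace)/len(trace):.0%}")  # ~59%, looks moderate
```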
230. Physics Aware Programming Paradigm and Runtime Manager. Zhang, Yeliang. January 2007.
The overarching goal of this dissertation research is to realize a virtual collaboratory for the investigation of large-scale scientific computing applications, which generally experience different execution phases at runtime; each phase has different computational, communication, and storage requirements, as well as different physical characteristics. Consequently, an optimal solution or numerical scheme for one execution phase might not be appropriate for the next phase of the application execution. Choosing the ideal numerical algorithms and solutions for all application runtime phases remains an active research area. In this dissertation, we present the Physics Aware Programming (PAP) paradigm, which enables programmers to identify the appropriate solution methods to exploit the heterogeneity and dynamism of the application execution states. We implement a Physics Aware Runtime Manager (PARM) to exploit the PAP paradigm. PARM periodically monitors and analyzes the runtime characteristics of the application to identify its current execution phase (state). For each change in the application execution phase, PARM adaptively exploits the spatial and temporal attributes of the application in the current state to identify the ideal numerical algorithms/solvers that optimize its performance. We have evaluated our approach using a real-world application commonly used in subsurface modeling (Variably Saturated Aquifer Flow and Transport, VSAFT2D), a diffusion problem kernel, and a seismic problem kernel. We evaluated the performance gain of the PAP paradigm with up to 2,000,000 nodes in the computational domain, implemented on 32 processors. Our experimental results show that by exploiting the application's physics characteristics at runtime and applying the appropriate numerical scheme with adapted spatial and temporal attributes, a significant speedup can be achieved (around 80%), while the overhead injected by PAP is negligible (less than 2%). We also show that the results using PAP are as accurate as the numerical solutions that use a fine grid resolution.
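The monitor-and-switch loop at the core of such a runtime manager can be sketched minimally (the phase metric, solver names, and threshold below are hypothetical, not PARM's actual configuration):

```python
import random

# Hypothetical solver table: which numerical scheme suits which phase.
SOLVERS = {"smooth": "coarse-grid explicit", "sharp-front": "fine-grid implicit"}

def detect_phase(gradient_norm: float) -> str:
    """Classify the current execution phase from a monitored physics metric."""
    return "sharp-front" if gradient_norm > 1.0 else "smooth"

def runtime_manager(steps: int):
    current = None
    for step in range(steps):
        gradient_norm = random.uniform(0.0, 2.0)   # stand-in for monitoring
        phase = detect_phase(gradient_norm)
        if phase != current:                        # phase change detected
            current = phase
            print(f"step {step}: switching to {SOLVERS[phase]} solver")
        # ... advance the simulation one step with the selected solver ...

runtime_manager(steps=8)
```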