1

Evaluation of basis functions for generating approximate linear programming (ALP) average cost solutions and policies for multiclass queueing networks

Gurfein, Kate Elizabeth, 16 August 2012
The average cost of operating a queueing network depends on several factors, such as the complexity of the network and the service policy used. Approximate linear programming (ALP) is a method that can be used to compute an accurate lower bound on the optimal average cost as well as to generate policies for operating the network. These average cost solutions and policies depend on the type of basis function used in the ALP. In this paper, the ALP average cost solutions and policies are analyzed for twelve networks with four different types of basis functions (quadratic, linear, pure exponential, and mixed exponential). An approximate bound on the optimality gap between the ALP average cost solution and the optimal average cost solution is computed for each system, and the size of this bound is determined relative to the ALP average cost solution. Using the same set of networks, the performance of ALP-generated policies is compared to the performance of the heuristic policies first-buffer-first-served (FBFS), last-buffer-first-served (LBFS), highest-queue-first-served (HQFS), and random-queue-first-served (RQFS). In general, ALP-generated average cost solutions are considerably smaller than the simulated average cost under the corresponding policy, and therefore the approximate bounds on the optimality gaps are quite large. This bound increases with the complexity of the queueing network. Some ALP-generated policies are not stabilizing policies for their corresponding networks, especially those produced using pure exponential and mixed exponential basis functions. For almost all systems, at least one of the heuristic policies results in a mean average cost less than or nearly equal to the smallest mean average cost of all ALP-generated policies in simulation runs. This means that generally there exists a heuristic policy that can perform as well as or better than any ALP-generated policy. In conclusion, a useful bound on the optimality gap between the ALP average cost solution and the optimal average cost solution cannot be computed with this method. Further, heuristic policies, which are more computationally tractable than ALP-generated policies, can generally match or exceed the performance of ALP-generated policies, and thus computing such policies is often unnecessary for realizing cost benefits in queueing networks.
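To make the ALP construction concrete, below is a minimal sketch of the average-cost ALP with a quadratic basis, solved with scipy. It assumes a truncated single-server birth-death queue with one fixed service rule, so the constraints range over states only; the multiclass networks, scheduling actions, and parameters studied in the thesis are richer, and everything named here is an illustrative assumption. The LP maximizes a scalar eta subject to eta + h(x) - (Ph)(x) <= c(x), with h restricted to the span of the basis, which yields a lower bound on the chain's optimal average cost.

    import numpy as np
    from scipy.optimize import linprog

    # Truncated single-server queue (birth-death chain), uniformized.
    N = 20                        # buffer capacity (illustrative)
    arrival, service = 0.3, 0.5   # transition probabilities after uniformization
    states = np.arange(N + 1)
    cost = states.astype(float)   # holding cost = number of jobs in queue

    # Quadratic basis: h(x) = w1*x + w2*x^2
    Phi = np.column_stack([states, states**2]).astype(float)

    # Transition matrix: serve whenever the queue is nonempty.
    P = np.zeros((N + 1, N + 1))
    for x in states:
        P[x, min(x + 1, N)] += arrival
        P[x, max(x - 1, 0)] += service
        P[x, x] += 1.0 - arrival - service

    # ALP: maximize eta subject to eta + (Phi w)(x) - (P Phi w)(x) <= cost(x).
    # Decision variables are [eta, w1, w2]; linprog minimizes, so negate eta.
    A_ub = np.hstack([np.ones((N + 1, 1)), Phi - P @ Phi])
    res = linprog(c=[-1.0, 0.0, 0.0], A_ub=A_ub, b_ub=cost,
                  bounds=[(None, None)] * 3)
    print("ALP lower bound on the optimal average cost:", -res.fun)

Because any feasible (eta, w) satisfies the averaged constraint eta <= pi' c under the stationary distribution pi, the LP value is a valid lower bound no matter how crude the basis is; richer bases tighten it.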
2

Optimization-based Approximate Dynamic Programming

Petrik, Marek, 01 September 2010
Reinforcement learning algorithms hold promise in many complex domains, such as resource management and planning under uncertainty. Most reinforcement learning algorithms are iterative: they successively approximate the solution based on a set of samples and features. Although these iterative algorithms can achieve impressive results in some domains, they are not sufficiently reliable for wide applicability; they often require extensive parameter tweaking to work well and provide only weak guarantees of solution quality. Some of the most interesting reinforcement learning algorithms are based on approximate dynamic programming (ADP). ADP, also known as value function approximation, approximates the value of being in each state. This thesis presents new reliable algorithms for ADP that use optimization instead of iterative improvement. Because these optimization-based algorithms explicitly seek solutions with favorable properties, they are easy to analyze, offer much stronger guarantees than iterative algorithms, and have few or no parameters to tweak. In particular, we improve on approximate linear programming, an existing method, and derive approximate bilinear programming, a new robust approximate method. The strong guarantees of optimization-based algorithms not only increase confidence in the solution quality, but also make it easier to combine the algorithms with other ADP components. The other components of ADP are the samples and features used to approximate the value function. Relying on the simplified analysis of optimization-based methods, we derive new bounds on the error due to missing samples. These bounds are simpler, tighter, and more practical than the existing bounds for iterative algorithms and can be used to evaluate solution quality in practical settings. Finally, we propose homotopy methods that use the sampling bounds to automatically select good approximation features for optimization-based algorithms. Automatic feature selection significantly increases the flexibility and applicability of the proposed ADP methods. The methods presented in this thesis can potentially be used in many practical applications in artificial intelligence, operations research, and engineering. Our experimental results show that optimization-based methods may perform well on resource-management problems and standard benchmark problems and therefore represent an attractive alternative to traditional iterative methods.
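As a toy illustration of the iterative-versus-optimization contrast, the sketch below evaluates a fixed policy on a small random MDP in two ways: projected value iteration (an iterative scheme) and a small LP whose constraints enforce v >= Tv over a linear feature span (the optimization-based route). The MDP, the features, and the iteration count are illustrative assumptions, not details from the thesis.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n, k, gamma = 30, 4, 0.9
    P = rng.dirichlet(np.ones(n), size=n)       # fixed-policy transition matrix
    r = rng.uniform(size=n)                     # rewards
    Phi = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

    # Iterative route: projected value iteration. Apply the Bellman backup,
    # then project back onto span(Phi) by least squares. This is not
    # guaranteed to converge in general, which is part of the thesis's
    # motivation for optimization-based alternatives.
    w = np.zeros(k)
    for _ in range(200):
        backup = r + gamma * P @ (Phi @ w)
        w = np.linalg.lstsq(Phi, backup, rcond=None)[0]

    # Optimization route: an approximate-linear-programming style LP.
    # Minimize a weighted sum of Phi w subject to Phi w >= r + gamma P Phi w,
    # which forces Phi w to dominate the true value function pointwise.
    rho = np.full(n, 1.0 / n)                   # state-relevance weights
    res = linprog(c=rho @ Phi,
                  A_ub=-(Phi - gamma * P @ Phi), b_ub=-r,
                  bounds=[(None, None)] * k)

    v_true = np.linalg.solve(np.eye(n) - gamma * P, r)
    print("iterative max error:", np.abs(Phi @ w - v_true).max())
    print("LP solution max gap:", (Phi @ res.x - v_true).max())

The LP solution comes with a one-sided guarantee by construction (it upper-bounds the true values), whereas the iterative solution's quality depends on properties of the projected Bellman operator.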
3

Approximate Dynamic Programming and Reinforcement Learning - Algorithms, Analysis and an Application

Lakshminarayanan, Chandrashekar, January 2015
Problems involving optimal sequential decision making in uncertain dynamic systems arise in domains such as engineering, science, and economics. Such problems can often be cast in the framework of a Markov Decision Process (MDP). Solving an MDP requires computing the optimal value function and the optimal policy, and dynamic programming (DP) and the Bellman equation (BE) are at the heart of solution methods. The three important exact DP methods are value iteration, policy iteration, and linear programming. The exact DP methods compute the optimal value function and the optimal policy, but they are often inadequate in practice because the state space is large, and one might have to resort to approximate methods that compute sub-optimal policies. Further, in certain cases the system observations are available only in the form of noisy samples, and we need to design algorithms that learn from these samples. In this thesis we study theoretical questions pertaining to approximate and learning algorithms, and we also present an application of MDPs in the domain of crowdsourcing.

Approximate dynamic programming (ADP) methods handle the issue of large state spaces by computing an approximate value function and/or a sub-optimal policy. In this thesis, we are concerned with conditions that result in provably good policies. Motivated by the limitations of the projected Bellman equation (PBE) in conventional linear algebra, we study the PBE in the (min, +) linear algebra. It is well known that deterministic optimal control problems with a cost/reward criterion are (min, +)/(max, +) linear, and ADP methods have been developed for such systems in the literature. However, it is straightforward to show that infinite-horizon discounted reward/cost MDPs are neither (min, +) nor (max, +) linear. We develop novel ADP schemes, namely Approximate Q Iteration (AQI) and Variational Approximate Q Iteration (VAQI), where the approximate solution is a (min, +) linear combination of a set of basis functions whose span constitutes a subsemimodule. We show that the new ADP methods are convergent, and we present a bound on the performance of the resulting sub-optimal policy.
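As a rough illustration of the (min, +) linear approximation underlying AQI/VAQI, the sketch below represents a function as a (min, +) combination Q(s) = min_j (w_j + phi_j(s)) and computes the residuated (best-from-below) projection onto the span of the basis; the basis matrix and target values are illustrative assumptions, not data from the thesis.

    import numpy as np

    def minplus_matvec(Phi, w):
        # (min,+) "matrix-vector product": out[s] = min_j (Phi[s, j] + w[j])
        return np.min(Phi + w[None, :], axis=1)

    def minplus_project(Phi, q):
        # Residuation: the largest w with minplus_matvec(Phi, w) <= q is
        # w[j] = min_s (q[s] - Phi[s, j]); applying the map again gives the
        # best approximation of q from below within the (min,+) span of Phi.
        w = np.min(q[:, None] - Phi, axis=0)
        return w, minplus_matvec(Phi, w)

    Phi = np.array([[0.0, 2.0],
                    [1.0, 0.5],
                    [3.0, 0.0]])      # columns are basis functions (illustrative)
    q = np.array([1.0, 0.8, 1.5])     # target values, e.g. a Q-function
    w, q_hat = minplus_project(Phi, q)
    print(w, q_hat)                   # q_hat <= q componentwise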
The approximate linear program (ALP) makes use of linear function approximation (LFA) and offers theoretical performance guarantees. Nevertheless, the ALP is difficult to solve due to the presence of a large number of constraints, and in practice a reduced linear program (RLP) is solved instead. The RLP has a tractable number of constraints sampled from the original constraints of the ALP. Though the RLP is known to perform well in experiments, theoretical guarantees are available only for a specific RLP obtained under idealized assumptions. In this thesis, we generalize the RLP to define a generalized reduced linear program (GRLP), which has a tractable number of constraints obtained as positive linear combinations of the original constraints of the ALP. The main contribution here is a novel theoretical framework for obtaining error bounds for any given GRLP.

Reinforcement learning (RL) algorithms can be viewed as sample-trajectory-based solution methods for MDPs. Typically, RL algorithms that make use of stochastic approximation (SA) are iterative schemes taking small steps towards the desired value at each iteration. Actor-critic algorithms form an important sub-class of RL algorithms, wherein the critic is responsible for policy evaluation and the actor is responsible for policy improvement. The actor and critic iterations have different step-size schedules; in particular, the step sizes used by the actor updates generally have to be much smaller than those used by the critic updates. SA schemes that use different step-size schedules for different sets of iterates are known as multi-timescale stochastic approximation schemes. One of the most important conditions required to ensure the convergence of the iterates of a multi-timescale SA scheme is that the iterates be stable, i.e., uniformly bounded almost surely. However, conditions that imply the stability of the iterates in a multi-timescale SA scheme have not been well established. In this thesis, we provide verifiable conditions that imply the stability of two-timescale stochastic approximation schemes. As an example, we also demonstrate that the stability of a widely used actor-critic RL algorithm follows from our analysis; a toy two-timescale update is sketched after this abstract.

Crowdsourcing is a mode of organizing work by dividing it into smaller chunks of tasks and outsourcing them to a large, distributed group of people in the form of an open call. Recently, crowdsourcing has become a major pool for human intelligence tasks (HITs) such as image labeling, form digitization, natural language processing, machine translation evaluation, and user surveys. Large organizations/requesters are increasingly interested in crowdsourcing the HITs generated out of their internal requirements. Task starvation leads to huge variation in the completion times of tasks posted to the crowd. This is an issue for frequent requesters who desire predictability in completion times, specified in terms of the percentage of tasks completed within a stipulated amount of time. An important task attribute that affects the completion time of a task is its price. However, a pricing policy that does not take the dynamics of the crowd into account might fail to achieve the desired predictability in completion times. Here, we make use of the MDP framework to compute a pricing policy that achieves predictable completion times in simulations as well as real-world experiments.
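To illustrate the two-timescale step-size structure referenced above, here is a minimal sketch of a toy coupled recursion in which a fast ("critic-like") iterate tracks a quantity that depends on a slow ("actor-like") iterate; the recursions, step-size exponents, and noise model are illustrative assumptions, not the algorithm analyzed in the thesis.

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = 0.0, 0.0      # x: slow ("actor-like"), y: fast ("critic-like")
    for n in range(1, 100001):
        a_n = 1.0 / n            # slow step size
        b_n = 1.0 / n ** 0.6     # fast step size; a_n / b_n -> 0
        # Fast iterate tracks the moving target y*(x) = 2x despite noise.
        y += b_n * (2.0 * x - y + rng.normal(scale=0.1))
        # Slow iterate sees y as almost equilibrated and drifts to x* = 1.
        x += a_n * (1.0 - x + rng.normal(scale=0.1))
    print(x, y)   # expect x close to 1 and y close to 2

The separation of step sizes is what lets the analysis treat y as converged when studying x; the stability question the thesis addresses is whether both iterates remain uniformly bounded along the way.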
