1.
Rethinking Serverless for Machine Learning Inference. Ellore, Anish Reddy. 21 August 2023
In the era of artificial intelligence and machine learning, AI/ML inference tasks have become exceedingly popular. However, executing these workloads on dedicated hardware may not be feasible for many users due to high maintenance costs, varying load patterns, and time to production. Furthermore, ML inference workloads are stateless, and most are not extremely latency-sensitive: tasks such as fake-review removal, abusive-language detection, tweet classification, image tagging, and free-tier chatbots do not require real-time inference. These characteristics make serverless platforms a good fit for deployment, and in this work we identify the bottlenecks involved in hosting inference jobs on serverless platforms and optimize serverless for better performance and resource utilization. Specifically, we identify model loading and model memory duplication as the major bottlenecks in serverless inference, and to address them we propose a new approach that rethinks the way FaaS requests are served. To support this design, we employ a hybrid scaling approach to implement the autoscaling feature of serverless. / Master of Science / Most modern software applications leverage machine learning to incorporate intelligent features. For instance, platforms like Yelp employ machine learning algorithms to detect fake reviews, intelligent chatbots such as ChatGPT provide interactive conversations, and Netflix relies on machine learning to recommend personalized content to its users. Creating these machine learning services involves several stages: collecting data, training a model on the collected data, and serving the trained model to deploy the service. This final stage, known as inference, is crucial for delivering real-time predictions or responses to user queries. In our research, we select serverless computing as the infrastructure for deploying these popular inference workloads.
Serverless computing, also referred to as Function as a Service (FaaS), is an execution paradigm in cloud computing that lets users run their code efficiently by providing scalability, elasticity, and fine-grained billing. In this work we identify model loading and model memory duplication as the major bottlenecks in serverless inference; the sketch below illustrates both. To solve these problems we propose a new approach that rethinks the way FaaS requests are served, and to support this design we use a hybrid scaling approach to implement the autoscaling feature of serverless.
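To see why model loading dominates, consider a minimal Python sketch of a FaaS-style handler (ours, not the thesis's code; the names load_model and handler are illustrative assumptions). A naive handler would deserialize the model on every request; the common warm-container workaround below caches it at module scope so only cold starts pay the load cost. The same sketch also shows where memory duplication comes from: every warm container caches its own private copy of the same model.

```python
import time

MODEL_PATH = "/opt/models/resnet50.pt"  # hypothetical model artifact

_model = None  # module-level cache; survives across warm invocations of one container


def load_model(path):
    """Stand-in for an expensive deserialization step (e.g., torch.load)."""
    time.sleep(2.0)  # simulate a multi-second model load from disk or network
    return {"path": path}


def handler(event, context=None):
    # A naive handler would call load_model() on EVERY request.
    # Caching amortizes the load cost over all requests served by this
    # warm container -- but each concurrently warm container still holds
    # its own copy of the weights, which is the memory duplication
    # problem the thesis targets.
    global _model
    if _model is None:  # true only on a cold start
        _model = load_model(MODEL_PATH)
    return {"prediction": f"ran {_model['path']} on {event['input']}"}


if __name__ == "__main__":
    t0 = time.time()
    handler({"input": "img1"})  # cold: pays the model load
    print(f"cold start: {time.time() - t0:.2f}s")

    t0 = time.time()
    handler({"input": "img2"})  # warm: model already cached
    print(f"warm call:  {time.time() - t0:.4f}s")
```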
2.
Stochastic modelling of flood phenomena based on the combination of mechanistic and systemic approaches (Couplage entre approches mécaniste et systémique pour la modélisation stochastique des phénomènes de crues). Boutkhamouine, Brahim. 14 December 2018
Flood-forecasting systems describe the rainfall-runoff transformation using simplified representations, based either on empirical descriptions or on equations of classical mechanics for the physical processes involved. The performance of existing flood-forecasting models is affected by several sources of uncertainty, coming not only from the approximations involved but also from imperfect knowledge of the input data, the initial conditions of the river basin, and the model parameters. Quantifying these uncertainties enables decision makers to better interpret predictions and constitutes a valuable aid for flood risk management. Uncertainty analysis of existing rainfall-runoff models is most often performed with Monte Carlo (MC) simulations, which require a large number of model runs and hence potentially long computation times; quantifying the uncertainties of hydrological models in real time therefore remains a challenge. In this thesis, we develop a flood-forecasting methodology based on Bayesian networks (BNs). BNs are directed acyclic graphs in which nodes correspond to the variables characterizing the modelled system and arcs represent the probabilistic dependencies between these variables. The proposed methodology builds the BNs from the main hydrological factors controlling flood generation, using both the available observations of the system response and the deterministic equations describing the processes involved, and is designed to account for the temporal variability of the variables involved. The conditional probability tables (the model parameters) can be specified from observed data, from existing deterministic models, or from expert opinion. Thanks to their inference algorithms, BNs can rapidly propagate different sources of uncertainty through the graph to estimate their effect on the model output (e.g., river flow); a toy illustration follows this abstract. Several case studies are tested. The first is the Salat river basin in south-west France, where a BN is used to simulate the discharge at a given station from streamflow observations at three hydrometric stations located upstream. The model shows good performance in estimating the discharge at the outlet. Used in reverse, it also gives satisfactory results when characterizing the discharge at an upstream station by propagating discharge observations from downstream stations back through the graph. The second case study is the Sagelva basin in Norway, where a BN is used to model the evolution of the snow water equivalent (SWE) from available weather observations. The performance of this model depends on the training dataset used to specify its parameters; in the absence of relevant observation data, a methodology for learning the BN parameters from a deterministic model is proposed and tested. The resulting BN can be used to perform uncertainty analysis in real time without resorting to Monte Carlo simulations. In view of the results obtained on these case studies, BNs prove to be a useful and effective decision-support tool for flood risk management.
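As a concrete illustration of the Salat-type setup, the sketch below (ours, not the thesis's model) builds a tiny discrete BN in which a downstream discharge Q depends on three upstream stations U1, U2, and U3, each coarsely discretized to low/high states. It assumes the Python library pgmpy, which the thesis does not name, and all probabilities are invented for illustration.

```python
from pgmpy.models import BayesianNetwork  # DiscreteBayesianNetwork in newer pgmpy
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Downstream discharge Q depends on three upstream stations U1, U2, U3.
# States: 0 = low flow, 1 = high flow (invented discretization).
model = BayesianNetwork([("U1", "Q"), ("U2", "Q"), ("U3", "Q")])

# Marginal (prior) probabilities of low/high flow at each upstream station.
priors = [TabularCPD(u, 2, [[0.8], [0.2]]) for u in ("U1", "U2", "U3")]

# P(Q | U1, U2, U3): the 8 columns enumerate parent states, U3 varying fastest.
# Rough rule: the more upstream stations run high, the likelier Q is high.
p_high = [0.05, 0.30, 0.30, 0.70, 0.30, 0.70, 0.70, 0.95]
cpd_q = TabularCPD(
    "Q", 2,
    values=[[1 - p for p in p_high], p_high],
    evidence=["U1", "U2", "U3"], evidence_card=[2, 2, 2],
)

model.add_cpds(*priors, cpd_q)
assert model.check_model()

infer = VariableElimination(model)
# Forward use: distribution of Q given high flow observed at two upstream stations.
print(infer.query(["Q"], evidence={"U1": 1, "U2": 1}))
# Reverse use (as in the Salat study): infer an upstream station's state
# from an observed downstream discharge.
print(infer.query(["U1"], evidence={"Q": 1}))
```

Because each query is exact propagation over the graph rather than sampling, the posterior on Q (or, in reverse, on U1) is obtained without any Monte Carlo runs, which is the speed argument the abstract makes for real-time use.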
3.
ACCELERATING SPARSE MACHINE LEARNING INFERENCE. Ashish Gondimalla. 17 May 2024
Convolutional neural networks (CNNs) have become important workloads due to their impressive accuracy in tasks like image classification and recognition. Convolution operations are compute-intensive, and their cost grows profoundly with newer and better CNN models. However, convolutions exhibit characteristics, such as sparsity, that can be exploited. In this dissertation, we propose three works that capture sparsity for faster performance and reduced energy.

The first work is an accelerator design called SparTen that improves convolutions with fine-grained, two-sided sparsity (i.e., sparsity in both filters and feature maps). SparTen identifies the efficient inner join as the key primitive for hardware acceleration of sparse convolution (a software sketch of this primitive follows the abstract). In addition, SparTen proposes load-balancing schemes for higher compute-unit utilization. SparTen performs 4.7x, 1.8x, and 3x better than a dense architecture, a one-sided architecture, and SCNN, the previous state-of-the-art accelerator, respectively. The second work, BARISTA, scales up SparTen (and SparTen-like proposals) to large-scale implementations with as many compute units as recent dense accelerators (e.g., Google's Tensor Processing Unit) to achieve the full speedups afforded by sparsity. However, at such large scales, buffering, on-chip bandwidth, and compute utilization are highly intertwined: optimizing for one factor strains another and may invalidate some optimizations proposed in small-scale implementations. BARISTA proposes novel techniques to balance the three factors in large-scale accelerators. BARISTA performs 5.4x, 2.2x, 1.7x, and 2.5x better than dense, one-sided, naively scaled two-sided, and iso-area two-sided architectures, respectively. The last work, EUREKA, builds an efficient tensor core that executes dense, structured-sparse, and unstructured-sparse workloads without losing efficiency. EUREKA achieves this by proposing novel techniques that improve compute utilization by slightly tweaking operand stationarity. EUREKA achieves speedups of 5x and 2.5x, along with energy reductions of 3.2x and 1.7x, over dense and structured-sparse execution, respectively. EUREKA incurs area and power overheads of only 6% and 11.5%, respectively, over Ampere.
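To make the inner-join primitive concrete, here is a minimal software analogue (ours, not SparTen's hardware design) of a two-sided sparse dot product: both operands are stored in compressed form as sorted (index, value) lists, and a two-pointer intersection performs a multiply-accumulate only where a filter nonzero meets a feature-map nonzero, skipping all the work a dense datapath would waste on zeros.

```python
def sparse_inner_join(a_idx, a_val, b_idx, b_val):
    """Dot product of two sparse vectors given as sorted coordinate lists.

    a_idx/b_idx: strictly increasing positions of nonzeros (compressed form).
    a_val/b_val: the corresponding nonzero values.
    """
    i = j = 0
    acc = 0.0
    while i < len(a_idx) and j < len(b_idx):
        if a_idx[i] == b_idx[j]:
            # Match: both operands are nonzero at this position, so this is
            # the only case that performs a multiply-accumulate.
            acc += a_val[i] * b_val[j]
            i += 1
            j += 1
        elif a_idx[i] < b_idx[j]:
            i += 1  # advance whichever list is behind
        else:
            j += 1
    return acc


# Example: filter and feature-map slices that are mostly zeros.
# Dense form:  a = [0, 3, 0, 0, 5, 0, 2, 0],  b = [1, 0, 0, 0, 4, 0, 6, 0]
a_idx, a_val = [1, 4, 6], [3.0, 5.0, 2.0]
b_idx, b_val = [0, 4, 6], [1.0, 4.0, 6.0]
print(sparse_inner_join(a_idx, a_val, b_idx, b_val))  # 5*4 + 2*6 = 32.0
```

A hardware implementation replaces this sequential pointer walk with parallel intersection logic, but the savings come from the same place: only matching nonzero pairs ever reach the multipliers.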