Global ETD Search

151	Prédiction de la qualité des bois de chêne pour l’élevage des vins et des alcools : comparaison des approches physicochimiques, sensorielles et moléculaires Guichoux, Erwan 06 April 2011 Au cours du vieillissement, les caractéristiques organoleptiques du vin se modifient au contact du bois de chêne. Le composé aromatique le plus important, la whisky-lactone, aux notes noix de coco et boisé, est facilement détectable et apprécié par les consommateurs. Quercus petraea et Q. robur, les deux principales espèces européennes de chêne utilisées pour le vieillissement des vins, ont des profils aromatiques très contrastés, particulièrement pour la whisky-lactone. Parvenir à identifier l’espèce de chêne permettrait de fournir aux tonnelleries des lots de bois plus homogènes. L’objectif de cette étude est d’identifier l’espèce de chêne à partir de bois sec, à l’aide de marqueurs moléculaires utilisables dans un contexte industriel. Le bois sec est un tissu mort dans lequel l’ADN est très dégradé et donc difficilement accessible. Pour optimiser l’extraction d’ADN à partir de ce tissu, nous avons développé une méthode de PCR en temps-réel ciblant l’ADN chloroplastique, permettant ainsi d’évaluer l’efficacité des différents protocoles d’extraction. Nous avons également développé des marqueurs moléculaires (SSRs et SNPs) fortement différenciés entre espèces et particulièrement bien adaptés au bois. Grâce à des protocoles d’extraction d’ADN optimisés et ces marqueurs performants, nous avons pu identifier l’espèce sur des lots de bois séchés pendant deux ans. De plus, par l’étude de 262 SNPs dont la moitié est fortement différenciée entre espèces, nous avons démontré que les gènes sélectionnés (loci « outlier ») sont très performants pour délimiter ces deux espèces proches. 
Ils permettent également de détecter des processus démographiques fins (flux de gènes intra- et interspécifiques), alors que les gènes a priori non-sélectionnés (loci neutres) se révèlent peu informatifs. / Most aromatic compounds in wine are directly induced during maturation by contact with oak wood. For example, whisky-lactone, the most important aromatic compound, which gives a coconut and woody taste, is easily detected and appreciated by consumers. Quercus petraea and Q. robur, the two major European oak species used for wine maturation, have highly contrasting aromatic patterns, especially for whisky-lactone. Identifying the species used for cooperage will facilitate the maturation process, for instance by providing wineries with more homogeneous batches of barrels. The objective of our study is to characterize the oak species directly from dry wood, using molecular markers that will be applicable in an industrial context. Unfortunately, dry wood is a dead tissue in which DNA is highly degraded and difficult to access. To optimize DNA recovery from dry wood, we developed a quantitative PCR protocol based on chloroplast DNA to evaluate the efficiency of DNA isolation protocols. We identified and developed molecular markers (SSRs and SNPs) adapted to dry wood that are particularly diagnostic. Using an optimized DNA isolation protocol and these powerful markers, the species identity of wood samples dried for two years could be successfully characterized. Using 262 SNPs highly differentiated between the two species, we also demonstrate that genes under selection (outlier loci) have outstanding power to delimit the two oak species and provide unique insights into intra- and interspecific gene flow, whereas genes lacking such a signature (putatively neutral loci) provide little or no resolution. Quercus ADN dégradé Bois Loci outliers Méthodes d'affectation Multiplex SSRs SNPs Quercus Degraded DNA Wood Outlier loci Assignment methods SSR multiplex SNPs
152	Analyse robuste de formes basée géodésiques et variétés de formes / Robust shape analysis based on geodesics and shape manifolds Abboud, Michel 15 December 2017 L’un des problèmes majeurs en analyse de formes est celui de l’analyse statistique en présence de formes aberrantes. On assiste avec l’évolution des moyens de collecte automatique des données, à la présence des valeurs aberrantes qui peuvent affecter énormément l’analyse descriptive des formes. En effet, les approches de l’état de l’art ne sont pas assez robustes à la présence de formes aberrantes. En particulier, la forme moyenne calculée penche vers les observations aberrantes et peut ainsi porter des déformations irrégulières. Aussi, l’analyse par ACP de la variabilité dans une classe de formes donnée conduit à des modes de variation qui décrivent plutôt la variabilité portée par ces formes aberrantes. Dans ce travail de thèse, nous proposons un schéma d’analyse robuste aux aberrations qui peuvent entacher une classe de formes donnée. Notre approche est une variante robuste de l’ACP qui consiste à détecter et à restaurer les formes aberrantes préalablement à une ACP menée dans l’espace tangent relatif à la forme moyenne. Au lieu de simplement éliminer les formes aberrantes, nous voulons bénéficier de la variabilité locale correcte qui y est présente en intégrant leur version restaurée dans l’analyse. Nous proposons également une approche variationnelle et une ACP élastique pour l’analyse de la variabilité d’un ensemble de formes en s’appuyant sur une métrique robuste basée géodésique. La troisième contribution de la thèse se situe au niveau des algorithmes de classification des formes basée sur les statistiques de formes : classification utilisant la moyenne intrinsèque, ou relaxée, par ACP tangente et par formes propres. Les approches proposées sont évaluées et comparées aux approches de l’état de l’art sur les bases de formes HAND et MPEG-7. 
Les résultats obtenus démontrent la capacité du schéma proposé à surpasser la présence de formes aberrantes et fournir des modes de variation qui caractérisent la variabilité des formes étudiées. / A major and complex problem in shape analysis is the statistical analysis of a set of shapes containing aberrant shapes. With the evolution of automatic data acquisition, outliers can occur and their presence may greatly affect the descriptive analysis of shapes. Indeed, state-of-the-art approaches are not robust enough to outliers. In particular, the computed mean shape deviates towards the aberrant observations and thus carries irregular deformations. Similarly, PCA of the variability in a given class of shapes leads to variation modes that describe the variability carried by these aberrant shapes rather than that of the class. In this thesis, we propose a robust analysis scheme to handle the effects of aberrations that can occur in a given set. Our approach is a robust variant of PCA that consists in detecting and restoring aberrant shapes prior to a PCA in the tangent space relative to the mean shape. Instead of simply rejecting outliers, we want to benefit from the correct local variability they contain by integrating their restored versions into the analysis. We also propose a variational approach and an elastic PCA for the analysis of the variability of a set of shapes using a robust geodesic-based metric. The third contribution of the thesis lies in shape classification algorithms based on shape statistics: classification using the intrinsic mean shape, or a relaxed one, by tangent PCA and by eigenshapes. The proposed schemes are evaluated and compared with existing schemes on two shape databases, HAND and MPEG-7. The results show the proposed scheme’s ability to overcome the presence of aberrant shapes and provide variation modes that characterize the variability of the studied shapes. 
Espace de formes Formes aberrantes Analyse de formes ACP robuste ACP élastique Détection Restauration Classification Shape space Outliers Shape analysis Robust PCA Elastic PCA Detection Restoration Classification
153	Stabilní rozdělení a jejich aplikace / Stable distributions and their applications Volchenkova, Irina January 2016 The aim of this thesis is to show that the use of heavy-tailed distributions in finance is theoretically unfounded and may cause significant misunderstandings and fallacies in model interpretation. The main reason seems to be a wrong understanding of the concept of the distributional tail. Also, in models based on real data it seems more reasonable to concentrate on the central part of the distribution rather than on the tails.
154	Novos algoritmos de aprendizado para classificação de padrões utilizando floresta de caminhos ótimos / New learning algorithms for pattern classification using optimum-path forest Castelo Fernández, César Christian 05 November 2011 Orientadores: Pedro Jussieu de Rezende, Alexandre Xavier Falcão / Dissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Computação / Resumo: O Reconhecimento de Padrões pode ser definido como a capacidade de identificar a classe de algum objeto dentre um dado conjunto de classes, baseando-se na informação fornecida por amostras conhecidas (conjunto de treinamento). Nesta dissertação, o foco de estudo é o paradigma de classificação supervisionada, no qual se conhece a classe de todas as amostras utilizadas para o projeto do classificador. Especificamente, estuda-se o Classificador baseado em Floresta de Caminhos Ótimos (Optimum-Path Forest - OPF) e propõem três novos algoritmos de aprendizado, os quais representam melhorias em comparação com o Classificador OPF tradicional. Primeiramente, é desenvolvida uma metodologia simples, porém efetiva, para detecção de outliers no conjunto de treinamento. O método visa uma melhoria na acurácia do Classificador OPF tradicional através da troca desses outliers por novas amostras do conjunto de avaliação e sua exclusão do processo de aprendizagem. Os outliers são detectados computando uma penalidade para cada amostra baseada nos seus acertos e erros na classificação, o qual pode ser medido através do número de falsos positivos/negativos e verdadeiros positivos/negativos obtidos por cada amostra. O método obteve uma melhoria na acurácia em comparação com o OPF tradicional, com apenas um pequeno aumento no tempo de treinamento. 
Em seguida, é proposto um aprimoramento ao primeiro algoritmo, que permite detectar com maior precisão os outliers presentes na base de dados. Neste caso, utiliza-se a informação de falsos positivos/negativos e verdadeiros positivos/negativos de cada amostra para explorar intrinsecamente as relações de adjacência de cada amostra e determinar se é outlier. Uma inovação do método é que não existe necessidade de se computar explicitamente tal adjacência, como é feito nas técnicas tradicionais, o qual pode ser inviável para grandes bases de dados. O método obteve uma boa taxa de detecção de outliers e um tempo de treinamento muito baixo em vista do tamanho das bases de dados utilizadas. Finalmente, é abordado o problema de se selecionar um número tão pequeno quanto possível de amostras de treinamento e se obter a maior acurácia possível sobre o conjunto de teste. Propõe-se uma metodologia que se inicia com um pequeno conjunto de treinamento e, através da classificação de um conjunto bem maior de avaliação, aprende quais amostras são as mais representativas para o conjunto de treinamento. Os resultados mostram que é possível obter uma melhor acurácia que o Classificador OPF tradicional ao custo de um pequeno incremento no tempo de treinamento, mantendo, no entanto, o conjunto de treinamento menor que o conjunto inicial, o que significa um tempo de teste reduzido / Abstract: Pattern recognition can be defined as the capacity to identify the class of an object among a given set of classes, based on the information provided by known samples (training set). In this dissertation, the focus is on the supervised classification approach, for which we are given the classes of all the samples used in the design of the classifier. Specifically, the Optimum-Path Forest Classifier (OPF) is studied and three new learning algorithms are proposed, which represent improvements to the traditional OPF classifier. 
First of all, a simple yet effective methodology is developed for the detection of outliers in a training set. This method aims at improving OPF's accuracy by swapping outliers for new samples from the evaluation set and excluding them from the learning process itself. Outliers are detected by computing a penalty for each sample based on its classification hits and misses, which can be measured through the number of false positives/negatives and true positives/negatives obtained by each sample. The method achieved an accuracy improvement over the traditional OPF, with just a slight increase in training time. An improvement to the first algorithm is then proposed, allowing for a more precise detection of outliers present in the dataset. In this case, the information on the number of false positives/negatives and true positives/negatives of each sample is used to explore the adjacency relations of each sample and determine whether it is an outlier. The method's merit is that there is no need to compute the adjacency explicitly, as traditional techniques do, which could be infeasible for large datasets. The method achieves a good outlier detection rate and a very low training time, considering the size of the datasets. Finally, the problem of choosing a small number of training samples while achieving high accuracy on the testing set is addressed. We propose a methodology that starts with a small training set and, through the classification of a much larger evaluation set, learns which samples are the most representative for the training set. 
The results show that it is possible to achieve higher accuracy than the traditional OPF's at the cost of a slight increment in the training time, preserving, however, a smaller training set than the original one, leading to a lower testing time / Mestrado / Ciência da Computação / Mestre em Ciência da Computação Reconhecimento de padrões Aprendizado de máquina Teoria dos grafos Valores estranhos (Estatistica) Processamento de imagens Visão por computador Pattern recognition Machine learning Graph theory Outliers (Statistics) Image processing Machine vision
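The penalty-based outlier detection described in this abstract can be sketched as follows. The abstract does not give the penalty formula, so this is only a minimal illustration under stated assumptions: the penalty is taken to be the fraction of misses (false positives plus false negatives) among all classification outcomes recorded for a sample, and `SampleStats`, `penalty`, `find_outliers`, `swap_outliers` and the threshold of 0.5 are hypothetical names and values, not the dissertation's.

```python
from dataclasses import dataclass


@dataclass
class SampleStats:
    """Classification outcomes recorded for one training sample (hypothetical structure)."""
    tp: int = 0
    tn: int = 0
    fp: int = 0
    fn: int = 0


def penalty(stats: SampleStats) -> float:
    """Assumed penalty: share of misses (FP + FN) among all recorded outcomes."""
    total = stats.tp + stats.tn + stats.fp + stats.fn
    if total == 0:
        return 0.0
    return (stats.fp + stats.fn) / total


def find_outliers(stats_by_sample, threshold=0.5):
    """Return the ids of samples whose penalty exceeds the (assumed) threshold."""
    return [sid for sid, s in stats_by_sample.items() if penalty(s) > threshold]


def swap_outliers(training, outliers, evaluation):
    """Drop outliers from the training set and refill it from the evaluation set."""
    outlier_set = set(outliers) & set(training)
    kept = [s for s in training if s not in outlier_set]
    candidates = [s for s in evaluation if s not in set(training)]
    return kept + candidates[:len(outlier_set)]


stats = {
    "a": SampleStats(tp=5),
    "b": SampleStats(tp=1, fp=3, fn=2),
    "c": SampleStats(tn=4, fn=1),
}
flagged = find_outliers(stats)
new_training = swap_outliers(["a", "b", "c"], flagged, ["x", "y"])
```

Here sample `b` misses in five of six outcomes and is the only one flagged, so it is replaced by the first unused evaluation sample.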
155	On biclusters aggregation and its benefits for enumerative solutions = Agregação de biclusters e seus benefícios para soluções enumerativas Oliveira, Saullo Haniell Galvão de, 1988- 27 August 2018 Orientador: Fernando José Von Zuben / Dissertação (mestrado) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação / Previous issue date: 2015 / Resumo: Biclusterização envolve a clusterização simultânea de objetos e seus atributos, definindo modelos locais de relacionamento entre os objetos e seus atributos. Assim como a clusterização, a biclusterização tem uma vasta gama de aplicações, desde suporte a sistemas de recomendação, até análise de dados de expressão gênica. Inicialmente, diversas heurísticas foram propostas para encontrar biclusters numa base de dados numérica. No entanto, tais heurísticas apresentam alguns inconvenientes, como não encontrar biclusters relevantes na base de dados e não maximizar o volume dos biclusters encontrados. Algoritmos enumerativos são uma proposta recente, especialmente no caso de bases numéricas, cuja solução é um conjunto de biclusters maximais e não redundantes. Contudo, a habilidade de enumerar biclusters trouxe mais um cenário desafiador: em bases de dados ruidosas, cada bicluster original se fragmenta em vários outros biclusters com alto nível de sobreposição, o que impede uma análise direta dos resultados obtidos. Essa fragmentação irá ocorrer independente da definição escolhida de coerência interna no bicluster, sendo mais relacionada com o próprio nível de ruído. 
Buscando reverter essa fragmentação, nesse trabalho propomos duas formas de agregação de biclusters a partir de resultados que apresentem alto grau de sobreposição: uma baseada na clusterização hierárquica com single linkage, e outra explorando diretamente a taxa de sobreposição dos biclusters. Em seguida, um passo de poda é executado para remover objetos ou atributos indesejados que podem ter sido incluídos como resultado da agregação. As duas propostas foram comparadas entre si e com o estado da arte, em diversos experimentos, incluindo bases de dados artificiais e reais. Essas duas novas formas de agregação não só reduziram significativamente a quantidade de biclusters, essencialmente defragmentando os biclusters originais, mas também aumentaram consistentemente a qualidade da solução, medida em termos de precisão e recuperação, quando os biclusters são conhecidos previamente / Abstract: Biclustering involves the simultaneous clustering of objects and their attributes, thus defining local models for the two-way relationship of objects and attributes. Just like clustering, biclustering has a broad set of applications, ranging from advanced support for recommender systems of practical relevance to a decisive role in data mining techniques devoted to gene expression data analysis. Initially, heuristics were proposed to find biclusters, and their main drawbacks are the possibility of missing some existing biclusters and the inability to maximize the volume of the obtained biclusters. Recently, efficient algorithms were conceived to enumerate all the biclusters, particularly in numerical datasets, so that they compose a complete set of maximal and non-redundant biclusters. However, the ability to enumerate biclusters revealed a challenging scenario: in noisy datasets, each true bicluster becomes highly fragmented, with a high degree of overlap among the fragments, thus preventing a direct analysis of the obtained results. 
Fragmentation will happen no matter the boundary condition adopted to specify the internal coherence of the valid biclusters, though the degree of fragmentation will be associated with the noise level. Aiming at reverting the fragmentation, we propose here two approaches for properly aggregating a set of biclusters exhibiting a high degree of overlap: one based on single linkage and the other directly exploring the overlap rate. A pruning step is then employed to filter out intruder objects and/or attributes that were added as a side effect of aggregation. Both proposals were compared with each other and also with the current state of the art in several experiments, including real and artificial datasets. The two newly conceived aggregation mechanisms not only significantly reduced the number of biclusters, essentially defragmenting true biclusters, but also consistently increased the quality of the whole solution, measured in terms of Precision and Recall when the composition of the dataset is known a priori / Mestrado / Engenharia de Computação / Mestre em Engenharia Elétrica Aprendizado de máquina Análise por agrupamento Mineração de dados (Computação) Valores estranhos (Estatistica) Problemas de enumeração combinatória Machine learning Cluster analysis Data mining and knowledge discovery Outliers (statistics) Combinatorial enumeration problems
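The overlap-based aggregation mentioned in this abstract can be illustrated with a small greedy sketch. The abstract does not define the overlap rate or the merge rule, so both are assumptions here: the overlap rate is taken as the shared area divided by the area of the smaller bicluster, and two biclusters are merged (union of rows, union of columns) whenever the rate reaches a hypothetical `threshold`. The pruning step that follows aggregation in the dissertation is not shown.

```python
def area(bicluster):
    """Number of cells covered by a (rows, columns) bicluster."""
    rows, cols = bicluster
    return len(rows) * len(cols)


def overlap_rate(a, b):
    """Assumed overlap rate: shared cells relative to the smaller bicluster."""
    shared = len(a[0] & b[0]) * len(a[1] & b[1])
    smaller = min(area(a), area(b))
    return shared / smaller if smaller else 0.0


def aggregate(biclusters, threshold=0.7):
    """Greedily merge pairs of biclusters whose overlap rate reaches the threshold."""
    merged = [(frozenset(r), frozenset(c)) for r, c in biclusters]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlap_rate(merged[i], merged[j]) >= threshold:
                    a, b = merged[i], merged[j]
                    # The union may add cells neither fragment covered;
                    # a pruning step would be needed to filter those out.
                    merged[i] = (a[0] | b[0], a[1] | b[1])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


fragments = [
    ({1, 2, 3}, {"a", "b"}),
    ({2, 3, 4}, {"a", "b"}),
    ({9}, {"z"}),
]
result = aggregate(fragments, threshold=0.6)
```

The first two fragments share four of their six cells (rate about 0.67) and are merged into one bicluster over rows 1 to 4; the third has no overlap and is left alone.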
156	Statistická analýza rozsáhlých dat z průmyslu / Statistical analysis of big industrial data Zamazal, Petr January 2021 This thesis deals with the processing of real data on waste collection. It describes selected parts of the fields of statistical tests, identification of outliers, correlation analysis and linear regression. This theoretical basis is applied using the Python programming language to process the data into a form suitable for creating linear models. The final models explain between 70% and 85% of the variability. Finally, the information obtained through this analysis is used to specify recommendations for the waste management company.
157	Parcimonie, diversité morphologique et séparation robuste de sources / Sparse modeling, morphological diversity and robust source separation Chenot, Cécile 29 September 2017 Cette thèse porte sur le problème de Séparation Aveugle de Sources (SAS) en présence de données aberrantes. La plupart des méthodes de SAS sont faussées par la présence de déviations structurées par rapport au modèle de mélange linéaire classique : des évènements physiques inattendus ou des dysfonctionnements de capteurs en sont des exemples fréquents. Nous proposons un nouveau modèle prenant en compte explicitement les données aberrantes. Le problème de séparation en résultant, mal posé, est adressé grâce à la parcimonie. L'utilisation de cette dernière est particulièrement intéressante en SAS robuste car elle permet simultanément de démélanger les sources et de séparer les différentes contributions. Ces travaux sont étendus pour l'estimation de variabilité spectrale pour l'imagerie hyperspectrale terrestre. Des comparaisons avec des méthodes de l'état-de-l'art montrent la robustesse et la fiabilité des algorithmes associés pour un large éventail de configurations, incluant le cas déterminé. / This manuscript addresses the Blind Source Separation (BSS) problem in the presence of outliers. Most BSS techniques are hampered by the presence of structured deviations from the standard linear mixing model, such as unexpected physical events or malfunctions of sensors. We propose a new data model that explicitly takes the deviations into account. The resulting joint estimation of the components is an ill-posed problem, tackled using sparse modeling. The latter is particularly efficient for solving robust BSS since it allows for a robust unmixing of the sources jointly with a precise separation of the components. This work is then extended to the estimation of spectral variability in the framework of terrestrial hyperspectral imaging. 
Numerical experiments highlight the robustness and reliability of the proposed algorithms in a wide range of settings, including the full-rank regime. Separation aveugle de sources Robustesse Parcimonie Factorisation de matrices Données aberrantes Diversité morphologique Blind source separation Robustness Sparse Modeling Matrix factorization Outliers Morphological diversity
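To make the role of sparsity concrete, here is a minimal sketch of soft-thresholding, the proximal operator of the ℓ1 norm, which is a common building block of sparse estimation. Applying it to a residual to isolate sparse outliers is an assumption made for illustration: the abstract does not describe the thesis's algorithms, and `soft_threshold`, `estimate_outliers` and the threshold value are hypothetical.

```python
def soft_threshold(x, lam):
    """Proximal operator of lam * |x|: shrink x towards zero by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def estimate_outliers(residual, lam):
    """Keep only residual entries large enough to survive the shrinkage.

    'residual' stands for the part of the observations not explained by
    the mixed sources (an assumed setup, not the thesis's model).
    """
    return [soft_threshold(r, lam) for r in residual]


residual = [0.1, -0.2, 5.0, 0.05, -4.0]
outliers = estimate_outliers(residual, lam=1.0)
```

Small entries, which are plausibly noise, are set exactly to zero, while the two large deviations survive with reduced magnitude: `outliers` is `[0.0, 0.0, 4.0, 0.0, -3.0]`.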
158	Comparing unsupervised clustering algorithms to locate uncommon user behavior in public travel data : A comparison between the K-Means and Gaussian Mixture Model algorithms Andrésen, Anton, Håkansson, Adam January 2020 Clustering machine learning algorithms have existed for a long time and there is a multitude of variations of them available to implement. Each of them has its advantages and disadvantages, which makes it challenging to select one for a particular problem and application. This study focuses on comparing two algorithms, K-Means and the Gaussian Mixture Model, for outlier detection within public travel data from the travel planning mobile application MobiTime [1]. The purpose of this study was to compare the two algorithms against each other, to identify differences between their outlier detection results. The comparisons were mainly done by comparing the number of outliers located by each model, with respect to outlier threshold and number of clusters. The study found that the algorithms differ substantially in their ability to detect outliers. These differences depend heavily on the type of data that is used, but one major difference found was that K-Means was more restrictive than the Gaussian Mixture Model when it comes to classifying data points as outliers. The result of this study could help people determine which algorithm to implement for their specific application and use case. Machine learning clustering K-Means Gaussian Mixture Model expectation-maximum data analysis public transport silhouette analysis outliers outlier detection data algorithms experiment Computer and Information Sciences Data- och informationsvetenskap
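The two outlier criteria being compared can be sketched in a few lines. This is an illustration under assumptions, not the study's implementation: the models are taken as already fitted on one-dimensional data, a point is a K-Means outlier when it lies farther than a distance threshold from its nearest centroid, and a Gaussian Mixture Model outlier when its mixture density falls below a density threshold. All parameter values and thresholds are arbitrary.

```python
import math


def kmeans_outliers(points, centers, max_distance):
    """Flag points farther than max_distance from their nearest centroid."""
    return [p for p in points
            if min(abs(p - c) for c in centers) > max_distance]


def gmm_density(x, weights, means, stds):
    """Density of a one-dimensional Gaussian mixture at x."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )


def gmm_outliers(points, weights, means, stds, min_density):
    """Flag points whose mixture density is below min_density."""
    return [p for p in points
            if gmm_density(p, weights, means, stds) < min_density]


points = [0.0, 0.5, 10.0, 10.4, 13.0, 25.0]
by_kmeans = kmeans_outliers(points, centers=[0.0, 10.0], max_distance=3.5)
by_gmm = gmm_outliers(points, weights=[0.5, 0.5], means=[0.0, 10.0],
                      stds=[1.0, 1.0], min_density=0.01)
```

With these arbitrary settings the distance rule flags only 25.0, while the density rule also flags 13.0. Which method flags more points depends entirely on the thresholds chosen, so the sketch shows the mechanics of the comparison rather than its outcome.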
159	Robust gamma generalized linear models with applications in actuarial science Wang, Yuxi 09 1900 Les modèles linéaires généralisés (GLMs) constituent l’une des classes de modèles les plus populaires en statistique. Cette classe contient une grande variété de modèles de régression fréquemment utilisés, tels que la régression linéaire normale, la régression logistique et les gamma GLMs. Dans les GLMs, la distribution de la variable de réponse définit une famille exponentielle. Un désavantage de ces modèles est qu’ils ne sont pas robustes par rapport aux valeurs aberrantes. Pour les modèles comme la régression linéaire normale et les gamma GLMs, la non-robustesse est une conséquence des ailes exponentielles des densités. La différence entre les tendances de l’ensemble des données et celles des valeurs aberrantes donne lieu à des inférences et des prédictions biaisées. A notre connaissance, il n’existe pas d’approche bayésienne robuste spécifique pour les GLMs. La méthode la plus populaire est fréquentiste ; c’est celle de Cantoni and Ronchetti (2001). Leur approche consiste à adapter les M-estimateurs robustes pour la régression linéaire au contexte des GLMs. Cependant, leur estimateur est dérivé d’une modification de la dérivée de la log-vraisemblance, au lieu d’une modification de la vraisemblance (comme avec les M-estimateurs robustes pour la régression linéaire). Par conséquent, il n’est pas possible d’établir une correspondance claire entre la fonction modifiée à optimiser et un modèle. Le fait de proposer un modèle robuste présente deux avantages. Premièrement, il permet de comprendre et d’interpréter la modélisation. Deuxièmement, il permet l’analyse fréquentiste et bayésienne. La méthode que nous proposons s’inspire des idées de la régression linéaire robuste bayésienne. Nous adaptons l’approche proposée par Gagnon et al. (2020), qui consiste à utiliser une distribution normale modifiée avec des ailes plus relevées pour le terme d’erreur. 
Dans notre contexte, la distribution de la variable de réponse est une version modifiée où la partie centrale de la densité est conservée telle quelle, tandis que les extrémités sont remplacées par des ailes log-Pareto, se comportant comme $(1/|x|)(1/\log|x|)^{\lambda}$. Ce mémoire se concentre sur les gamma GLMs. La performance est mesurée à la fois théoriquement et empiriquement, avec une analyse des données sur les coûts hospitaliers. / Generalized linear models (GLMs) form one of the most popular classes of models in statistics. This class contains a large variety of commonly used regression models, such as normal linear regression, logistic regression and gamma GLMs. In GLMs, the response variable distribution defines an exponential family. A drawback of these models is that they are non-robust against outliers. For models like the normal linear regression and gamma GLMs, the non-robustness is a consequence of the exponential tails of the densities. The difference in trends in the bulk of the data and the outliers yields skewed inference and prediction. To our knowledge, there is no Bayesian robust approach specifically for GLMs. The most popular method is frequentist; it is that of Cantoni and Ronchetti (2001). Their approach is to adapt the robust M-estimators for linear regression to the context of GLMs. However, their estimator is derived from a modification of the derivative of the log-likelihood, instead of from a modification of the likelihood (as with robust M-estimators for linear regression). As a consequence, it is not possible to establish a clear correspondence between the modified function to optimize and a model. Having a robust model has two advantages. First, it allows for an understanding and an interpretation of the modelling. Second, it allows for both frequentist and Bayesian analysis. The method we propose is based on ideas from Bayesian robust linear regression. We adapt the approach proposed by Gagnon et al. 
(2020), which consists of using a modified normal distribution with heavier tails for the error term. In our context, the distribution of the response variable is a modified version where the central part of the density is kept as is, while the extremities are replaced by log-Pareto tails, behaving like $(1/|x|)(1/\log|x|)^{\lambda}$. The focus of this thesis is on gamma GLMs. The performance is measured both theoretically and empirically, with an analysis of hospital costs data. Bayesian statistics heavy-tailed distributions outlier detection outliers Pearson residuals statistiques bayésiennes distributions à ailes relevées détection des valeurs aberrantes valeurs aberrantes résidus de Pearson Statistics / Statistiques (UMI : 0463)
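The idea of keeping the centre of a density and replacing its extremities with log-Pareto tails can be sketched as follows. This is a simplified illustration under assumptions: it uses a standard normal centre (as in the linear-regression approach the abstract cites, not the gamma case studied in the thesis), a hypothetical cut-off `tau` and exponent `lam`, and it leaves the function unnormalized, so it is not a proper density.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def log_pareto_tailed(x, tau=1.96, lam=2.0):
    """Normal centre on [-tau, tau]; beyond it, tails decaying like
    (1/|x|) * (1/log|x|)**lam, scaled to be continuous at tau.

    tau must exceed 1 so that log(tau) > 0. Unnormalized sketch only.
    """
    ax = abs(x)
    if ax <= tau:
        return normal_pdf(ax)
    return normal_pdf(tau) * (tau / ax) * (math.log(tau) / math.log(ax)) ** lam


centre = log_pareto_tailed(0.5)   # identical to the normal density
far_tail = log_pareto_tailed(10.0)  # far heavier than the normal tail
```

At `x = 10` the normal density is of order 1e-23, while the modified function is of order 1e-3. That slow decay is what lets such a model treat an extreme observation as an outlier instead of letting it dominate the fit.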
160	Scalable Architecture for Automating Machine Learning Model Monitoring de la Rúa Martínez, Javier January 2020 In recent years, due to the advent of more sophisticated tools for exploratory data analysis, data management, Machine Learning (ML) model training and model serving in production, the concept of MLOps has gained popularity. As an effort to bring DevOps processes to the ML lifecycle, MLOps aims at more automation in the execution of diverse and repetitive tasks along the cycle and at smoother interoperability between the teams and tools involved. In this context, the main cloud providers have built their own ML platforms [4, 34, 61], offered as services in their cloud solutions. Moreover, multiple frameworks have emerged to solve concrete problems such as data testing, data labelling, distributed training or prediction interpretability, and new monitoring approaches have been proposed [32, 33, 65]. Among all the stages in the ML lifecycle, one of the most commonly overlooked, although relevant, is model monitoring. Recently, cloud providers have presented their own tools to use within their platforms [4, 61] while work is ongoing to integrate existing frameworks [72] into open-source model serving solutions [38]. Most of these frameworks are either built as an extension of an existing platform (i.e., lack portability), follow a scheduled batch processing approach at a minimum rate of hours, or present limitations for certain outlier and drift algorithms due to the design of the platform architecture in which they are integrated. In this work, a scalable automated cloud-native architecture is designed and evaluated for ML model monitoring in a streaming approach. 
An experiment conducted on a 7-node cluster with 250,000 requests at different concurrency rates shows maximum latencies of 5.9, 29.92 and 30.86 seconds after request time for 75% of distance-based outlier detection, windowed statistics and distribution-based data drift detection, respectively, using windows of 15 seconds in length and a watermark delay of 6 seconds. / Under de senaste åren har konceptet MLOps blivit alltmer populärt på grund av tillkomsten av mer sofistikerade verktyg för explorativ dataanalys, datahantering, modell-träning och model serving som tjänstgör i produktion. Som ett försök att föra DevOps processer till Machine Learning (ML)-livscykeln, siktar MLOps på mer automatisering i utförandet av mångfaldiga och repetitiva uppgifter längs cykeln samt på smidigare interoperabilitet mellan team och verktyg inblandade. I det här sammanhanget har de största molnleverantörerna byggt sina egna ML-plattformar [4, 34, 61], vilka erbjuds som tjänster i deras molnlösningar. Dessutom har flera ramar tagits fram för att lösa konkreta problem såsom datatestning, datamärkning, distribuerad träning eller tolkning av förutsägelse, och nya övervakningsmetoder har föreslagits [32, 33, 65]. Av alla stadier i ML-livscykeln förbises ofta modellövervakning trots att det är relevant. På senare tid har molnleverantörer presenterat sina egna verktyg att kunna användas inom sina plattformar [4, 61] medan arbetet pågår för att integrera befintliga ramverk [72] med lösningar för modellplatformer med öppen källkod [38]. De flesta av dessa ramverk är antingen byggda som ett tillägg till en befintlig plattform (dvs. saknar portabilitet), följer en schemalagd batchbearbetningsmetod med en lägsta hastighet av ett antal timmar, eller innebär begränsningar för vissa extremvärden och drivalgoritmer på grund av plattformsarkitekturens design där de är integrerade. 
I det här arbetet utformas och utvärderas en skalbar automatiserad molnbaserad arkitektur för ML-modellövervakning i en streaming-metod. Ett experiment som utförts på ett 7-nodskluster med 250.000 förfrågningar vid olika samtidigheter visar maximala latenser på 5,9, 29,92 respektive 30,86 sekunder efter tid för förfrågningen för 75% av avståndsbaserad detektering av extremvärden, windowed statistics och distributionsbaserad datadriftdetektering, med hjälp av windows med 15 sekunders längd och 6 sekunders fördröjning av vattenstämpel. Model Monitoring Streaming Scalability Cloud-native Data Drift Outliers Machine Learning Modellövervakning Streaming-metod Skalbarhet Molnbaserad Dataskift Outlierupptäckt Maskininlärning Computer and Information Sciences Data- och informationsvetenskap
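The windowed statistics with a watermark delay mentioned in this abstract can be sketched in plain Python. The thesis's actual streaming engine and code are not described in the abstract, so this is only an assumed model of the semantics: tumbling windows of 15 seconds, a watermark that trails the largest event time seen by 6 seconds, a window emitted once the watermark passes its end, and events for already emitted windows dropped. The function name and the choice of the mean as the statistic are hypothetical.

```python
def windowed_means(events, window=15, delay=6):
    """Compute per-window means over (event_time_seconds, value) pairs.

    Events are processed in arrival order. A window [start, start + window)
    is emitted once the watermark (max event time seen minus delay) reaches
    its end; events arriving for an emitted window are dropped.
    Windows still open when the input ends are not flushed.
    """
    open_windows = {}
    results = []
    watermark = float("-inf")
    for t, value in events:
        start = (t // window) * window
        if start + window <= watermark:
            continue  # too late: this window was already emitted
        open_windows.setdefault(start, []).append(value)
        watermark = max(watermark, t - delay)
        for s in sorted(open_windows):
            if s + window <= watermark:
                values = open_windows.pop(s)
                results.append((s, sum(values) / len(values)))
    return results


events = [
    (1, 2.0), (14, 4.0), (16, 10.0),
    (13, 6.0),    # out of order, but within the watermark delay: accepted
    (22, 20.0),   # watermark reaches 16, so window [0, 15) is emitted
    (5, 100.0),   # arrives after its window was emitted: dropped
    (40, 1.0),    # watermark reaches 34, so window [15, 30) is emitted
]
means = windowed_means(events)
```

The event at time 13 arrives after the one at time 16 but is still counted, because the watermark has not yet passed the end of its window. The event at time 5 arrives too late and is ignored, so `means` is `[(0, 4.0), (15, 15.0)]`.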
