201

Contributions to decision tree based learning / Contributions à l’apprentissage de l’arbre des décisions

Qureshi, Taimur 08 July 2010 (has links)
Advances in data collection methods, storage and processing technology are providing a unique challenge and opportunity for automated data learning techniques, which aim at producing high-level information, or models, from data. A typical knowledge discovery process consists of data selection, data preparation, data transformation, data mining and interpretation/validation of the results. Thus, we develop automatic learning techniques which contribute to the data preparation, transformation and mining tasks of knowledge discovery. In doing so, we try to improve the prediction accuracy of the overall learning process. Our work focuses on decision tree based learning, and we introduce various preprocessing and transformation techniques, such as discretization, fuzzy partitioning and dimensionality reduction, to improve this type of learning. These techniques can also be used in other learning methods; discretization, for example, can be applied to naive Bayes classifiers. The data preparation step represents almost 80 percent of the effort and is both time consuming and critical for the quality of modeling. Discretization of continuous features is an important problem that affects the accuracy, complexity, variance and understandability of the induced models. In this thesis, we propose and develop resampling-based aggregation techniques that improve the quality of discretization. We then validate them by comparison with other discretization techniques and with an optimal partitioning method on 10 benchmark data sets. The second part of the thesis concerns automatic fuzzy partitioning for soft decision tree induction. A soft, or fuzzy, decision tree extends classical crisp tree induction by embedding fuzzy logic into the induction process, yielding more accurate models with reduced variance while remaining interpretable and autonomous. We modify the above resampling-based partitioning method to generate fuzzy partitions.
In addition, we propose, develop and validate another fuzzy partitioning method that improves the accuracy of the decision tree. Finally, we adopt a topological learning scheme and perform non-linear dimensionality reduction. We modify an existing manifold-learning-based technique and examine whether it can enhance the predictive power and interpretability of classification.
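The resampling-based aggregation idea can be illustrated with a short sketch: draw bootstrap resamples of a continuous feature, compute quantile cut points on each resample, and average them into a single discretization. This is a hedged illustration of the general principle only, not the author's actual algorithm; function names and parameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)  # one continuous feature

def bootstrap_cuts(x, bins=4, resamples=50):
    # aggregate quantile cut points over bootstrap resamples of the feature
    qs = np.linspace(0, 1, bins + 1)[1:-1]
    cuts = [np.quantile(rng.choice(x, size=x.size, replace=True), qs)
            for _ in range(resamples)]
    return np.mean(cuts, axis=0)

cuts = bootstrap_cuts(x)
codes = np.digitize(x, cuts)  # discretized feature taking values 0..3
```

Averaging over resamples stabilizes the cut points against sampling noise, which is the variance-reduction effect the thesis exploits.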
202

Análise da influência de funções de distância para o processamento de consultas por similaridade em recuperação de imagens por conteúdo / Analysis of the influence of distance functions to answer similarity queries in content-based image retrieval.

Pedro Henrique Bugatti 16 April 2008 (has links)
Content-based image retrieval (CBIR) relies on a feature extractor, which should provide the most meaningful intrinsic characteristics (features) of the data, and on a distance function, which quantifies the similarity between them. A central challenge when answering similarity queries is how to best integrate these two key aspects. There is plenty of research on algorithms for feature extraction from images, yet little attention has been paid to the importance of pairing a well-suited distance function with a given feature extractor. This Master's dissertation was conceived to fill that gap. We investigated the behavior of different distance functions with respect to distinct feature vector types, evaluating the three main types of image features: color distribution, texture and shape. We also proposed two new techniques for feature selection over the feature vectors, in order to improve precision when answering similarity queries. The first technique employs statistical association rules and achieved a gain of up to 38% in precision, while the second, based on Shannon entropy, achieved a gain of approximately 71% while significantly reducing the dimensionality of the feature vectors. This work also shows that proper use of distance functions effectively improves similarity query results, opening new ways to enhance the design of CBIR systems.
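The central point above — that the choice of distance function matters for comparing feature vectors — can be sketched with a few standard metrics applied to toy color histograms. This is an illustration with invented data, not the dissertation's evaluation; the histograms and metric set are assumptions for the example.

```python
import numpy as np

def l1(p, q):
    return np.abs(p - q).sum()

def l2(p, q):
    return np.sqrt(((p - q) ** 2).sum())

def canberra(p, q):
    denom = np.abs(p) + np.abs(q)
    mask = denom > 0
    return (np.abs(p - q)[mask] / denom[mask]).sum()

# toy 8-bin normalized color histograms: b resembles a, c does not
a = np.array([.30, .20, .15, .10, .10, .05, .05, .05])
b = np.array([.28, .22, .14, .11, .09, .06, .05, .05])
c = a[::-1].copy()

dists = {name: (d(a, b), d(a, c)) for name, d in
         [("L1", l1), ("L2", l2), ("Canberra", canberra)]}
```

All three metrics rank b closer to a than c here, but their absolute scales differ greatly, which is why pairing the metric to the feature type affects query precision.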
203

Analýza kvality ovzduší v kancelářských a obytných prostorech / Air Quality Analysis in Office and Residential Areas

Tisovčík, Peter January 2019 (has links)
The goal of this thesis was to study indoor air quality measurement, focusing on the concentration of carbon dioxide. The theoretical part introduces data mining, including basic classification methods and approaches to dimensionality reduction, together with the principles of the system developed within the IoTCloud project and the available options for measuring the necessary quantities. In the practical part, suitable sensors were selected for the given rooms and long-term measurements were performed. The measured data were used to create a system for window-opening detection and to design an appropriate way of regulating air exchange in a room, with the aim of improving air quality through natural ventilation.
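A crude version of window-opening detection from CO2 readings can be sketched as a drop detector over a sliding window: an opened window causes a rapid fall in concentration. The thresholds and data below are invented for illustration and are not the thesis's trained detector.

```python
import numpy as np

def window_open_events(co2_ppm, drop_ppm=80, window=5):
    # flag indices where CO2 falls by more than drop_ppm
    # over `window` consecutive samples (a crude opening detector)
    co2 = np.asarray(co2_ppm, dtype=float)
    diffs = co2[window:] - co2[:-window]
    return np.flatnonzero(diffs < -drop_ppm) + window

# toy trace: concentration builds up, then a window is opened
co2 = [800, 820, 850, 870, 900, 910, 780, 650, 560, 520, 510]
events = window_open_events(co2)
```

A learned classifier, as developed in the thesis, would replace the fixed threshold with a model fitted to the long-term measurements.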
204

Investigating the Correlation Between Marketing Emails and Receivers Using Unsupervised Machine Learning on Limited Data : A comprehensive study using state of the art methods for text clustering and natural language processing / Undersökning av samband mellan marknadsföringsemail och dess mottagare med hjälp av oövervakad maskininlärning på begränsad data

Pettersson, Christoffer January 2016 (has links)
The goal of this project is to investigate any correlation between marketing emails and their receivers using machine learning and only a limited amount of initial data. The data consists of roughly 1,200 emails and 98,000 receivers. Initially, the emails are grouped by content using text clustering. They contain no prior labeling or categorization, which calls for an unsupervised learning approach using solely the raw text content as data. The project applies established concepts such as bag-of-words for calculating term importance and the gap statistic for determining an optimal number of clusters. The data is vectorized using term frequency-inverse document frequency (tf-idf) to determine the importance of terms relative to each document and to the corpus as a whole. An inherent problem of this approach is high dimensionality, which is reduced using latent semantic analysis in conjunction with singular value decomposition. Once the resulting clusters have been obtained, the most frequently occurring terms in each cluster are analyzed and compared. Due to the absence of initial labeling, an alternative approach is required to evaluate the clusters' validity. To do this, the receivers in each cluster who actively opened an email are collected and investigated. Each receiver has different attributes regarding their purpose in using the service, along with some personal information. From this analysis, the conclusion is that it is possible to find distinguishable connections between the resulting email clusters and their receivers, but only to a limited extent. Receivers from the same cluster showed similar attributes that distinguished them from receivers in other clusters. Hence, the resulting email clusters and their receivers are specific enough to distinguish themselves from each other, but too general to capture more detailed information.
With more data, this could become a useful tool for determining which users of a service should receive a particular email, increasing the conversion rate and thereby reaching more relevant people based on previous trends.
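The vectorization-and-reduction pipeline described above — tf-idf followed by latent semantic analysis via SVD — can be sketched end to end on a toy corpus. The documents below are invented; a real run would use the email bodies and feed the low-dimensional embedding into a clustering step.

```python
import numpy as np

docs = [["cheap", "offer", "sale"], ["sale", "today", "offer"],
        ["invoice", "account", "update"], ["account", "update", "login"]]
vocab = sorted({t for d in docs for t in d})

# term frequency - inverse document frequency
tf = np.array([[d.count(t) / len(d) for t in vocab] for d in docs])
df = (tf > 0).sum(axis=0)
idf = np.log(len(docs) / df)
tfidf = tf * idf

# latent semantic analysis: rank-2 reduction via singular value decomposition
u, s, vt = np.linalg.svd(tfidf, full_matrices=False)
embedding = u[:, :2] * s[:2]
```

Even at rank 2, the two sales-themed documents land near each other and away from the two account-themed ones, which is the structure a clustering algorithm would then pick up.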
205

Linear and Nonlinear Dimensionality-Reduction-Based Surrogate Models for Real-Time Design Space Exploration of Structural Responses

Bird, Gregory David 03 August 2020 (has links)
Design space exploration (DSE) is a tool used to evaluate and compare designs as part of the design selection process. While evaluating every possible design in a design space is infeasible, understanding design behavior and response throughout the design space may be accomplished by evaluating a subset of designs and interpolating between them using surrogate models. Surrogate modeling is a technique that uses low-cost calculations to approximate the outcome of more computationally expensive calculations or analyses, such as finite element analysis (FEA). While surrogates make quick predictions, accuracy is not guaranteed and must be considered. This research addressed the need to improve the accuracy of surrogate predictions in order to improve DSE of structural responses. This was accomplished by performing comparative analyses of linear and nonlinear dimensionality-reduction-based radial basis function (RBF) surrogate models for emulating various FEA nodal results. A total of four dimensionality reduction methods were investigated, namely principal component analysis (PCA), kernel principal component analysis (KPCA), isometric feature mapping (ISOMAP), and locally linear embedding (LLE). These methods were used in conjunction with surrogate modeling to predict nodal stresses and coordinates of a compressor blade. The research showed that using an ISOMAP-based dual-RBF surrogate model for predicting nodal stresses decreased the estimated mean error of the surrogate by 35.7% compared to PCA. Using nonlinear dimensionality-reduction-based surrogates did not reduce surrogate error for predicting nodal coordinates. A new metric, the manifold distance ratio (MDR), was introduced to measure the nonlinearity of the data manifolds. When applied to the stress and coordinate data, the stress space was found to be more nonlinear than the coordinate space for this application. 
The upfront training cost of the nonlinear dimensionality-reduction-based surrogates was larger than that of their linear counterparts, but small enough to remain feasible. After training, all the dual-RBF surrogates were capable of making real-time predictions. The same process was repeated for a separate application involving the nodal displacements of mode shapes obtained from an FEA modal analysis. The modal assurance criterion (MAC) was used to compare the predicted mode shapes, as well as their corresponding true mode shapes obtained from FEA, to a set of reference modes. The research showed that two nonlinear techniques, LLE and KPCA, resulted in lower surrogate error in the more complex design spaces. Using an RBF kernel, KPCA achieved the largest average reduction in error, 13.57%. The results also showed that surrogate error was greatly affected by mode shape reversal. Four approaches to identifying reversed mode shapes were explored, all of which resulted in varying amounts of surrogate error. Together, the methods explored in this research were shown to decrease surrogate error when performing DSE of a turbomachine compressor blade. As surrogate accuracy increases, so does the ability to make correct engineering decisions and judgements throughout the design process. Ultimately, this will help engineers design better turbomachines.
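The linear (PCA-based) end of the surrogate family described above can be sketched in a few lines: project FEA-like snapshots onto their top PCA modes, fit an RBF interpolant from design parameters to the reduced coordinates, and lift predictions back to nodal space. Everything here — the toy "solver", kernel width, mode count — is an invented illustration of the dual-RBF idea, not the thesis's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
x_nodes = np.linspace(0, 1, 200)

def fea_field(p):
    # stand-in for an expensive FEA solve: 200 nodal values from 2 design params
    return np.sin(2 * np.pi * p[0] * x_nodes) + p[1] * x_nodes

P = rng.uniform(0.5, 1.5, size=(25, 2))       # sampled designs
Y = np.array([fea_field(p) for p in P])       # snapshot matrix (25 x 200)

# linear dimensionality reduction: project snapshots onto the top PCA modes
mean = Y.mean(axis=0)
_, _, vt = np.linalg.svd(Y - mean, full_matrices=False)
modes = vt[:5]
Z = (Y - mean) @ modes.T                      # reduced coordinates (25 x 5)

def rbf(A, B, eps=5.0):
    # Gaussian radial basis kernel between two point sets
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-eps * d2)

# fit RBF weights mapping design parameters to reduced coordinates
W = np.linalg.solve(rbf(P, P) + 1e-8 * np.eye(len(P)), Z)

def predict(p):
    # real-time surrogate: evaluate RBFs, then lift back to nodal space
    return mean + rbf(np.atleast_2d(p), P) @ W @ modes

pred = predict(np.array([1.0, 1.0]))
```

Swapping the SVD projection for ISOMAP, KPCA or LLE — as the thesis does — changes only the reduction step; prediction cost stays low, which is what enables real-time design space exploration.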
206

Evaluation of Archetypal Analysis and Manifold Learning for Phenotyping of Acute Kidney Injury

Dylan M Rodriquez (10695618) 07 May 2021 (has links)
Disease subtyping has been a critical aim of precision and personalized medicine. With the potential to improve patient outcomes, unsupervised and semi-supervised methods for determining phenotypes of subtypes have emerged, with a recent focus on matrix and tensor factorization. However, the interpretability of the proposed models is debatable. Principal component analysis (PCA), a traditional method of dimensionality reduction, does not impose non-negativity constraints, so the coefficients of the principal components are in some cases difficult to translate into real physical units. Non-negative matrix factorization (NMF) constrains the factorization to positive numbers, such that the representative types resulting from the factorization are additive. Archetypal analysis (AA) extends this idea and seeks to identify pure types, archetypes, at the extremes of the data, from which all other data can be expressed as a convex combination, or by proportion, of the archetypes. Using AA, this study sought to evaluate the sufficiency of AKI staging criteria through unsupervised subtyping. Archetypal analysis failed to find a direct 1:1 mapping of archetypes to physician staging and did not provide additional insight into patient outcomes. Several factors, such as the quality of the data source and the difficulty of selecting features, contributed to this outcome. Additionally, after performing feature selection with lasso across data subsets, it was determined that the current staging criteria are sufficient to determine patient phenotype, with serum creatinine at the time of diagnosis as a necessary factor.
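The convex-combination property that defines archetypal analysis can be demonstrated directly: if a sample lies inside the simplex spanned by the archetypes, its mixture proportions can be recovered exactly. The archetypes and "patient profile" below are synthetic stand-ins; fitting the archetypes themselves (the hard part of AA) is not shown.

```python
import numpy as np

rng = np.random.default_rng(2)

# three hypothetical "pure type" archetypes in a 5-feature lab space
archetypes = rng.uniform(0, 1, size=(3, 5))

# a patient profile built as a 40/40/20 convex combination of the archetypes
w_true = np.array([0.4, 0.4, 0.2])
x = w_true @ archetypes

# recover the proportions: least squares with a sum-to-one row appended
A = np.vstack([archetypes.T, np.ones(3)])
b = np.append(x, 1.0)
w, *_ = np.linalg.lstsq(A, b, rcond=None)
```

Because the weights are proportions of interpretable extremes, a patient "40% archetype 1, 20% archetype 3" reads more naturally than signed PCA coefficients — the interpretability argument made in the abstract.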
207

Deep Scenario Generation of Financial Markets / Djup scenario generering av finansiella marknader

Carlsson, Filip, Lindgren, Philip January 2020 (has links)
The goal of this thesis is to explore a new clustering algorithm, VAE-clustering, and to examine whether it can be applied to find differences in the distribution of stock returns, augment the distribution of a current stock portfolio, and assess how it performs under different market conditions. VAE-clustering is, as mentioned, a newly introduced method and not widely tested, especially not on time series. The first step is therefore to see if and how well the clustering works. We first apply the algorithm to a dataset containing monthly time series of power demand in Italy; the purpose of this part is to assess how well the method works technically. Once the model performs well and generates proper results on the Italian power demand data, we apply it to stock return data. In the latter application we are unable to find meaningful clusters and are therefore unable to progress towards the goal of the thesis. The results show that the VAE-clustering method is applicable to time series: power demand has clear differences from season to season, and the model successfully identifies those differences. For the financial data, we hoped the model would be able to find different market regimes based on time periods, but it is not able to distinguish time periods from each other. We therefore conclude that the VAE-clustering method is applicable to time series data, but that the structure and setting of the financial data in this thesis make it too hard to find meaningful clusters. The major finding is that the VAE-clustering method can be applied to time series. We encourage further research into whether the method can be successfully used on financial data in settings other than those tested in this thesis.
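Why the power demand data clusters cleanly while returns do not can be illustrated without a VAE at all: seasonal profiles are strongly separated in feature space, so even plain k-means (used here as a deliberately simple stand-in for VAE-clustering, which is out of scope for a short sketch) recovers the regimes. The synthetic "summer"/"winter" profiles are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 2 * np.pi, 24)

# two synthetic demand regimes: a "summer" and a "winter" daily profile
summer = np.sin(t) + 0.1 * rng.normal(size=(20, 24))
winter = 2.0 + 0.5 * np.cos(t) + 0.1 * rng.normal(size=(20, 24))
X = np.vstack([summer, winter])

def kmeans(X, k=2, iters=50):
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # guard against an empty cluster
                centers[j] = members.mean(axis=0)
    return labels

labels = kmeans(X)
```

Stock returns lack this kind of strong, stable between-regime separation, which is consistent with the thesis's finding that no meaningful clusters emerge on the financial data.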
208

Monitoring Vehicle Suspension Elements Using Machine Learning Techniques / Tillståndsövervakning av komponenter i fordonsfjädringssystem genom maskininlärningstekniker

Karlsson, Henrik January 2019 (has links)
Condition monitoring (CM) is widely used in industry, and there is a growing interest in applying CM to rail vehicle systems. Condition-based maintenance has the potential to increase system safety and availability while at the same time reducing total maintenance costs. This thesis investigates the feasibility of condition monitoring of suspension components, in this case dampers, in rail vehicles. Different methods are used to detect degradations, ranging from mathematical modelling of the system to purely "knowledge-based" methods that use only large amounts of data to detect patterns on a larger scale. This thesis explores the latter approach, where acceleration signals are evaluated at several places on the axleboxes, bogie frames and carbody of a rail vehicle simulation model. These signals are picked up close to the dampers monitored in this study, and frequency response functions (FRF) are computed between axleboxes and bogie frames as well as between bogie frames and carbody. The idea is that the FRF will change as the condition of the dampers changes, and thus act as fault indicators. The FRF are then fed to different classification algorithms, which are trained and tested to distinguish between the different damper faults. The thesis further investigates which classification algorithms show promising results for the problem, and which perform best in terms of classification accuracy as well as two other measures. Another aspect explored is the possibility of applying dimensionality reduction to the extracted indicators (features). The thesis also examines how the three performance measures are affected by typical variations in operational conditions for a rail vehicle, such as varying excitation and carbody mass. The Linear Support Vector Machine classifier using the whole feature space, and the Linear Discriminant Analysis classifier combined with Principal Component Analysis dimensionality reduction on the feature space, both show promising results for the task of correctly classifying upcoming damper degradations.
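The FRF features at the heart of this pipeline are commonly estimated with the H1 estimator: the segment-averaged cross-spectrum between input and output divided by the averaged auto-spectrum of the input. The sketch below shows the estimator on a trivially proportional system; the thesis's exact spectral processing (windowing, overlap, resolution) may differ.

```python
import numpy as np

rng = np.random.default_rng(4)

def h1_frf(x, y, nseg=8):
    # H1 estimator: averaged cross-spectrum divided by averaged auto-spectrum
    L = len(x) // nseg
    Pxx = np.zeros(L // 2 + 1)
    Pxy = np.zeros(L // 2 + 1, dtype=complex)
    for i in range(nseg):
        X = np.fft.rfft(x[i * L:(i + 1) * L])
        Y = np.fft.rfft(y[i * L:(i + 1) * L])
        Pxx += (X.conj() * X).real
        Pxy += X.conj() * Y
    return Pxy / Pxx

x = rng.normal(size=4096)   # broadband "axlebox" excitation
y = 0.5 * x                 # a purely proportional "bogie frame" response
H = h1_frf(x, y)            # flat FRF of 0.5 at every frequency
```

A degrading damper would change the shape of H across frequency, and it is these shape changes that the classifiers in the thesis learn to separate.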
209

Towards Scalable Machine Learning with Privacy Protection

Fay, Dominik January 2023 (has links)
The increasing size and complexity of datasets have accelerated the development of machine learning models and exposed the need for more scalable solutions. This thesis explores challenges associated with large-scale machine learning under data privacy constraints. With the growth of machine learning models, traditional privacy methods such as data anonymization are becoming insufficient. Thus, we delve into alternative approaches, such as differential privacy. Our research addresses the following core areas in the context of scalable privacy-preserving machine learning: First, we examine the implications of data dimensionality on privacy for the application of medical image analysis. We extend the classification algorithm Private Aggregation of Teacher Ensembles (PATE) to deal with high-dimensional labels, and demonstrate that dimensionality reduction can be used to improve privacy. Second, we consider the impact of hyperparameter selection on privacy. Here, we propose a novel adaptive technique for hyperparameter selection in differentially private gradient-based optimization. Third, we investigate sampling-based solutions to scale differentially private machine learning to datasets with a large number of records. We study the privacy-enhancing properties of importance sampling, highlighting that it can outperform uniform sub-sampling not only in terms of sample efficiency but also in terms of privacy. The three techniques developed in this thesis improve the scalability of machine learning while ensuring robust privacy protection, and aim to offer solutions for the effective and safe application of machine learning to large datasets.
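As background to the differential privacy framework the thesis builds on, the classical Gaussian mechanism releases a statistic with noise calibrated to its sensitivity and the privacy budget (epsilon, delta). This sketch shows the textbook mechanism applied to a bounded mean; it is illustrative background, not one of the thesis's contributions.

```python
import numpy as np

rng = np.random.default_rng(5)

def gaussian_mechanism(value, sensitivity, epsilon, delta):
    # classical calibration: sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return value + rng.normal(0, sigma, size=np.shape(value))

# privately release the mean of a feature bounded in [0, 1];
# one record can move the mean by at most 1/n, so sensitivity = 1/n
x = rng.uniform(0, 1, size=1000)
private_mean = gaussian_mechanism(x.mean(), sensitivity=1 / len(x),
                                  epsilon=1.0, delta=1e-5)
```

Sub-sampling (uniform or, as the thesis studies, importance-based) amplifies this guarantee, since the mechanism only ever sees a random fraction of the records.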
210

Heart- and Sapwood Segmentation on Hyperspectral Images using Deep Learning

Hallin, Samuel, Samnegård, Simon January 2023 (has links)
For manufacturers in the wood industry, an important way to make production more efficient is to automate the detection of defects and other attributes on boards. One important attribute on most boards is the distinction between heartwood and sapwood. This thesis project was conducted at the company MiCROTEC and aims to investigate methods for classifying heartwood and sapwood on boards. The dataset used in this project consisted of oak boards. To increase the amount of information retrieved from the boards, hyperspectral imaging was used instead of conventional RGB cameras. Based on this data, deep learning models with U-Net and U-within-U-Net architectures, together with different spectral dimensionality reduction methods, were developed to segment boards into heartwood and sapwood. The performance of these deep learning models was compared to PLS-DA and SVM; PLS-DA was already in use at MiCROTEC and served as the baseline. The results showed that the deep learning approach increased the F1-score from 0.730 for the baseline PLS-DA classifier to 0.918, and that the different spectral reduction methods had only a small impact on the result. The increase in F1-score was mainly due to an increase in precision, since PLS-DA had a recall similar to that of the deep learning models.
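The F1-score used to compare the models above combines precision and recall over the segmented pixels. The sketch below computes it for a toy board mask where the model over-segments heartwood by one column; the masks are invented, not the thesis's data.

```python
import numpy as np

def f1_score(pred, truth):
    # pixelwise F1 from true/false positives and false negatives
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# toy 8x8 board: ground truth says the left half is heartwood;
# the model over-segments by one extra column (perfect recall, lower precision)
truth = np.zeros((8, 8), dtype=bool); truth[:, :4] = True
pred = np.zeros((8, 8), dtype=bool);  pred[:, :5] = True
f1 = f1_score(pred, truth)
```

This also illustrates the abstract's closing observation: with recall fixed, the F1 gap between models is driven entirely by precision.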
