21
DistilLaBSE: Task-agnostic distillation of multilingual sentence embeddings : Exploring deep self-attention distillation with Switch Transformers
Bubla, Boris, January 2021 (has links)
The recent development of massive multilingual transformer networks has resulted in drastic improvements in model performance. These models, however, are so large that they suffer from high inference latency and consume vast computing resources, which hinders their widespread adoption in industry and in some academic settings. There is therefore growing research into reducing their parameter counts and increasing their inference speed, with significant interest in knowledge distillation techniques. This thesis uses the existing approach of deep self-attention distillation to develop a task-agnostic distillation of the language-agnostic BERT sentence embedding model (LaBSE). It also explores the use of the Switch Transformer architecture in distillation contexts. The result is DistilLaBSE, a task-agnostic distillation of LaBSE that runs roughly 10 times faster while retaining over 99% cosine similarity of its sentence embeddings on a held-out test set from the same domain as the training samples, namely the OpenSubtitles dataset. DistilLaBSE is also shown to achieve similar scores when embedding data from two other domains, namely English tweets and customer-support banking data. This faster version of LaBSE allows industry practitioners and resource-limited academic groups to apply a more convenient version of LaBSE to their various applications and research tasks.
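The 99% figure above refers to the cosine similarity between the teacher's and student's embeddings of the same sentence. A minimal sketch of that metric, using random stand-in vectors rather than actual LaBSE outputs (the 768-dimensional size and the noise scale are illustrative assumptions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical teacher (LaBSE) and student (DistilLaBSE) embeddings of the
# same sentence; a well-distilled student differs only by a small error term.
rng = np.random.default_rng(0)
teacher = rng.normal(size=768)
student = teacher + rng.normal(scale=0.05, size=768)  # small distillation error

print(cosine_similarity(teacher, student))  # close to 1.0, i.e. > 0.99
```

Averaging this score over a held-out test set is what the thesis reports as embedding retention.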
22
Using Satellite Images and Deep Learning to Detect Water Hidden Under the Vegetation : A cross-modal knowledge distillation-based method to reduce manual annotation work
Cristofoli, Ezio, January 2024 (has links)
Detecting water under vegetation is critical to tracking the status of geological ecosystems such as wetlands. Researchers use different methods to estimate water presence while avoiding costly on-site measurements. Optical satellite imagery allows the automatic delineation of water using the Normalised Difference Water Index (NDWI). Optical imagery, however, is subject to visibility conditions and cannot detect water under vegetation, a typical situation for wetlands. Synthetic Aperture Radar (SAR) imagery works under all visibility conditions and can detect water under vegetation, but it requires deep network algorithms to segment water presence, and manual annotation work is required to train the deep models. This project uses DEEPAQUA, a cross-modal knowledge distillation method, to eliminate the manual annotation needed to extract water presence from SAR imagery with deep neural networks. In this method, a deep student model (e.g., UNET) is trained to segment water in SAR imagery, using the NDWI algorithm as a non-parametric, cross-modal teacher. The key prerequisite is that NDWI operates on optical imagery taken at the same location and time as the SAR imagery. Three deep architectures are tested in this project: UNET, SegNet, and UNET++, with the Otsu method as the baseline. Experiments on imagery of Swedish wetlands from 2020-2022 show that cross-modal distillation consistently achieved better segmentation performance than the baseline across all architectures. Additionally, the UNET family of algorithms performed better than SegNet at the 95% confidence level. The UNET++ model achieved the highest Intersection over Union (IoU), but there was no statistical evidence, at the 95% confidence level, that UNET++ performs better than UNET.
In conclusion, this project shows that cross-modal knowledge distillation works well across architectures and removes tedious, expensive manual annotation hours when detecting water from SAR imagery. Further research could evaluate performance on other datasets and student architectures.
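The NDWI teacher described above is a simple band ratio, (Green − NIR) / (Green + NIR), thresholded to produce pseudo-labels for the student. A hedged sketch with toy reflectance values (the band arrays and the zero threshold are illustrative, not values from the thesis):

```python
import numpy as np

def ndwi(green: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """Normalised Difference Water Index: (Green - NIR) / (Green + NIR)."""
    return (green - nir) / (green + nir + 1e-9)  # epsilon avoids division by zero

def water_mask(green: np.ndarray, nir: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Pseudo-labels for the student: pixels whose NDWI exceeds the threshold."""
    return ndwi(green, nir) > threshold

# Toy 2x2 scene: open water reflects green strongly and absorbs near-infrared,
# so NDWI is positive over water and negative over vegetation or bare soil.
green = np.array([[0.30, 0.05], [0.28, 0.04]])
nir   = np.array([[0.05, 0.30], [0.06, 0.35]])
print(water_mask(green, nir))  # water in the left column only
```

These masks, computed from co-located optical imagery, stand in for manual annotations when training the SAR student model.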
23
Learning Adaptive Computation Strategies for Deep Neural Networks (Apprentissage de stratégies de calcul adaptatives pour les réseaux neuronaux profonds)
Kamanda, Aton, 07 1900 (has links)
The dual-process theory states that human cognition operates in two distinct modes: one for rapid, habitual, and associative processing, commonly referred to as "system 1", and a second, slower, deliberate, and controlled mode, called "system 2". This distinction points to an important underlying feature of human cognition: the ability to switch adaptively between different computational strategies depending on the situation. This ability has long been studied in various fields, and many hypothetical benefits seem to be linked to it. However, deep neural networks are often built without the ability to manage their computational resources optimally. This limitation of current models is all the more concerning as recent work increasingly shows a linear relationship between the computation used and model performance during evaluation. To address this problem, this thesis proposes different approaches and studies their impact on models. First, we study a deep reinforcement learning agent that can allocate more computation to more difficult situations. Our approach allows the agent to adapt its computational resources to the demands of the situation at hand, which not only improves computation time but also enhances transfer between related tasks and generalization. The central idea common to all our approaches is based on cost-of-effort theories from the cognitive control literature, which hold that by making the use of cognitive resources costly for the agent, and allowing it to allocate them when making decisions, the agent will itself learn to deploy its computational capacity optimally.
We then study variations of the method on a reference deep learning task to analyze precisely how the model behaves and what the benefits of such an approach are. We also create our own task, "Stroop MNIST", inspired by the Stroop test used in psychology, to validate certain hypotheses about the behavior of neural networks employing our method. We then highlight the strong links between dual-process learning and knowledge distillation methods; a distinguishing feature of our approach is that it saves computational resources at inference time. Finally, we approach the problem with energy-based models: by learning an energy landscape during training, the model can, at inference time, spend an amount of computation that depends on the difficulty of the example it faces, rather than a fixed forward pass with systematically the same computational cost. Despite unsuccessful experimental results, we analyze the promise of such an approach and speculate on potential improvements. With our contributions, we hope to pave the way for algorithms that make better use of their computational resources, becoming more efficient in terms of cost and performance, and to provide a more intimate understanding of the links between certain machine learning methods and dual-process theory.
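The cost-of-effort idea above can be sketched as an objective that charges the agent a fixed price for each extra computation step it chooses to take. The numbers and the per-step price are illustrative assumptions, not values from the thesis:

```python
def effort_penalised_objective(task_reward: float, steps_used: int,
                               step_cost: float = 0.01) -> float:
    """Task reward minus a price per computation step, so the agent
    learns to spend extra compute only where it pays off."""
    return task_reward - step_cost * steps_used

# On a hard input, eight "pondering" steps raise the reward enough to cover
# their cost; on an easy input, the same extra steps are a net loss.
hard_1 = effort_penalised_objective(0.60, 1)   # 0.59
hard_8 = effort_penalised_objective(0.95, 8)   # 0.87: worth pondering
easy_1 = effort_penalised_objective(0.94, 1)   # 0.93: already near-optimal
easy_8 = effort_penalised_objective(0.95, 8)   # 0.87: pondering wasted
print(hard_8 > hard_1, easy_1 > easy_8)
```

Maximizing such an objective pushes the agent toward exactly the adaptive behavior described: more computation on difficult inputs, less on easy ones.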