1 |
Towards Communication-Efficient Federated Learning Through Particle Swarm Optimization and Knowledge Distillation
Zaman, Saika 01 May 2024
The widespread popularity of Federated Learning (FL) has led researchers to delve into its various facets, primarily focusing on personalization, fair resource allocation, privacy, and global optimization, with less attention paid to the crucial aspect of ensuring efficient and cost-optimized communication between the FL server and its agents. A major challenge in achieving successful model training and inference on distributed edge devices lies in optimizing communication costs amid resource constraints, such as limited bandwidth, and selecting efficient agents. In resource-limited FL scenarios, where agents often rely on unstable networks, the transmission of large model weights can substantially degrade model accuracy and increase communication latency between the FL server and agents. Addressing this challenge, we propose a novel strategy that integrates a knowledge distillation technique with a Particle Swarm Optimization (PSO)-based FL method. This approach focuses on transmitting model scores instead of weights, significantly reducing communication overhead and enhancing model accuracy in unstable environments. Our method, with potential applications in smart city services and industrial IoT, marks a significant step forward in reducing network communication costs and mitigating accuracy loss, thereby optimizing the communication efficiency between the FL server and its agents.
|
2 |
REFT: Resource-Efficient Federated Training Framework for Heterogeneous and Resource-Constrained Environments
Desai, Humaid Ahmed Habibullah 22 November 2023
Federated Learning (FL) is a sub-domain of machine learning (ML) that enforces privacy by allowing the user's local data to reside on their device. Instead of having users send their personal data to a server where the model resides, FL flips the paradigm and brings the model to the user's device for training. Existing works share model parameters or use distillation principles to address the challenges of data heterogeneity. However, these methods ignore some of the other fundamental challenges in FL: device heterogeneity and communication efficiency. In practice, client devices in FL differ greatly in their computational power and communication resources. This is exacerbated by unbalanced data distribution, resulting in an overall increase in training times and the consumption of more bandwidth. In this work, we present a novel approach for resource-efficient FL called REFT, which uses variable pruning and knowledge distillation techniques to address the computational and communication challenges faced by resource-constrained devices.
Our variable pruning technique is designed to reduce computational overhead and increase resource utilization for clients by adapting the pruning process to their individual computational capabilities. Furthermore, to minimize bandwidth consumption and reduce the number of back-and-forth communications between the clients and the server, we leverage knowledge distillation to create an ensemble of client models and distill their collective knowledge to the server. Our experimental results on image classification tasks demonstrate the effectiveness of our approach in conducting FL in a resource-constrained environment. We achieve this by training Deep Neural Network (DNN) models while optimizing resource utilization at each client. Additionally, our method allows for minimal bandwidth consumption and a diverse range of client architectures while maintaining performance and data privacy. / Master of Science / In a world driven by data, preserving privacy while leveraging the power of machine learning (ML) is a critical challenge. Traditional approaches often require sharing personal data with central servers, raising concerns about data privacy. Federated Learning (FL) is a cutting-edge solution that turns this paradigm on its head. FL brings the machine learning model to your device, allowing it to learn from your data without the data ever leaving your device. While FL holds great promise, it faces its own set of challenges. Existing research has largely focused on making FL work with different types of data, but there are still other issues to be resolved. Our work introduces a novel approach called REFT that addresses two critical challenges in FL: making it work smoothly on devices with varying levels of computing power and reducing the amount of data that needs to be transferred during the learning process. Imagine your smartphone and your laptop. They have different levels of computing power.
REFT adapts the learning process to each device's capabilities using a proposed technique called Variable Pruning. Think of it as a personalized fitness trainer, tailoring the workout to your specific fitness level. Additionally, we've adopted a technique called knowledge distillation. It's like a student learning from a teacher, where the teacher shares only the most critical information. In our case, this reduces the amount of data that needs to be sent across the internet, saving bandwidth and making FL more efficient. Our experiments, which involved training machines to recognize images, demonstrate that REFT works well, even on devices with limited resources. It's a step forward in ensuring your data stays private while still making machine learning smarter and more accessible.
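The idea of adapting pruning to each device can be illustrated with a minimal sketch. It assumes simple magnitude pruning over a flat list of weights; the capability score, the `max_ratio` cap, and both function names are hypothetical and are not taken from the thesis, which does not describe its pruning criterion here.

```python
def prune_by_magnitude(weights, prune_ratio):
    """Zero out the smallest-magnitude fraction of a flat weight list."""
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    n_prune = int(len(weights) * prune_ratio)
    # Indices of the n_prune weights with the smallest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


def ratio_for_client(compute_score, max_ratio=0.8):
    """Map a hypothetical capability score in [0, 1] to a pruning ratio.

    Weaker devices (lower score) receive a more heavily pruned model.
    This linear mapping is an assumption made for illustration only.
    """
    score = min(max(compute_score, 0.0), 1.0)
    return max_ratio * (1.0 - score)


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.3, 0.08, 0.6, -0.15]
for name, score in [("phone", 0.0), ("laptop", 0.5)]:
    ratio = ratio_for_client(score)
    pruned = prune_by_magnitude(weights, ratio)
    kept = sum(1 for w in pruned if w != 0.0)
    print(f"{name}: ratio={ratio:.2f}, weights kept={kept}/{len(weights)}")
```

In this sketch the weaker device trains a sparser copy of the same model, which is one plausible reading of "tailoring the workout" to each device.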
|
3 |
Efficient Recycling Of Non-Ferrous Materials Using Cross-Modal Knowledge Distillation
Brundin, Sebastian, Gräns, Adam January 2021
This thesis investigates the possibility of utilizing data from multiple modalities to enable an automated recycling system to separate ferrous from non-ferrous debris. Two methods, sensor fusion and hallucinogenic sensor fusion, were implemented in a four-step approach using deep CNNs. Sensor fusion implies that multiple modalities are run simultaneously during the operation of the system. The individual outputs are then fused, and the joint performance is expected to be superior to that of any single sensor. In hallucinogenic sensor fusion, the goal is to achieve the benefits of sensor fusion with respect to cost and complexity even when one of the modalities is removed from the system. This is achieved by leveraging data from a more complex modality onto a simpler one in a student/teacher approach. As a result, the teacher modality trains the student sensor to hallucinate features beyond its visual spectrum. Based on the results of a prestudy involving multiple types of modalities, a hyperspectral sensor was deployed as the teacher to complement a simple RGB camera. Three studies involving differently composed datasets were then conducted to evaluate the effectiveness of the methods. The results show that the joint performance of a hyperspectral sensor and an RGB camera is superior to that of either sensor deployed individually. It can also be concluded that training a network with hyperspectral images can improve the classification accuracy when operating with only RGB data. However, the addition of a hyperspectral sensor might be considered superfluous, as this report shows that the standardized shapes of industrial debris enable a single RGB camera to achieve an accuracy above 90%. The material used in this thesis can also be concluded to be suboptimal for hyperspectral analysis. Compared to vegetation scenes, only a limited amount of additional data could be obtained by including wavelengths besides the ones representing red, green and blue.
|
4 |
Enhancing Object Detection Methods by Knowledge Distillation for Automotive Driving in Real-World Settings
Kian, Setareh 07 August 2023
No description available.
|
5 |
Compressing Deep Learning models for Natural Language Understanding
Ait Lahmouch, Nadir January 2022
Pre-trained language models such as BERT have proven particularly effective for natural language processing (NLP) tasks. However, the enormous computational resources required to train such models make them difficult to use in the real world. To overcome this problem, compression methods have emerged in recent years. In this project, some of these neural network compression approaches for text processing are studied, implemented and tested. In our case, the most efficient method was Knowledge Distillation, which consists of transferring knowledge from a large neural network, called the teacher, to a small neural network, called the student.
There are several variants of this approach, which differ in their complexity. We examine two of them in this project. The first allows knowledge transfer between any neural network and a smaller bidirectional LSTM, using only the output of the larger model. The second, more complex approach encourages the student model to also learn from the intermediate layers of the teacher model for incremental knowledge extraction. The ultimate goal of this project is to provide the company’s data scientists with ready-to-use compression methods for their future projects requiring the use of deep neural networks for NLP.
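The first variant, where the student learns only from the teacher's outputs, can be sketched as a loss between softened output distributions. This is a minimal illustration assuming the widely used temperature-scaled KL formulation; the abstract does not state which loss the thesis uses, and the function names and the default temperature are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened outputs.

    The T^2 factor is a common convention that keeps gradient magnitudes
    comparable across temperatures; it is assumed here, not taken from
    the thesis.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


teacher = [1.0, 2.0, 3.0]
print(distillation_loss(teacher, teacher))          # student matches teacher
print(distillation_loss([3.0, 2.0, 1.0], teacher))  # student disagrees
```

A higher temperature flattens both distributions, so the student also receives signal from the classes the teacher considers unlikely.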
|
6 |
Energy-efficient Neuromorphic Computing for Resource-constrained Internet of Things Devices
Liu, Shiya 03 November 2023
Due to the limited computation and storage resources of Internet of Things (IoT) devices, many emerging intelligent applications based on deep learning techniques heavily depend on cloud computing for computation and storage. However, cloud computing faces technical issues with long latency, poor reliability, and weak privacy, resulting in the need for on-device computation and storage. Also, on-device computation is essential for many time-critical applications, which require real-time data processing and energy efficiency. Furthermore, the escalating requirements for on-device processing are driven by network bandwidth limitations and consumer expectations concerning data privacy and user experience. In the realm of computing, there is a growing interest in exploring novel technologies that can facilitate ongoing advancements in performance. Of the various prospective avenues, the field of neuromorphic computing has garnered significant recognition as a crucial means to achieve fast and energy-efficient machine intelligence applications for IoT devices. The programming of neuromorphic computing hardware typically involves the construction of a spiking neural network (SNN) capable of being deployed onto the designated neuromorphic hardware. This dissertation presents a range of methodologies aimed at enhancing the precision and energy efficiency of SNNs. To be more precise, these advancements are achieved by incorporating four essential methods. The first method is the quantization of neural networks through knowledge distillation. This work introduces a quantization technique that effectively reduces the computational and storage resource requirements of a model while minimizing the loss of accuracy.
To further reduce quantization errors, the second method introduces a novel quantization-aware training algorithm specifically designed for training quantized spiking neural network (SNN) models intended for execution on the Loihi chip, a specialized neuromorphic computing chip. SNNs generally exhibit lower accuracy than deep neural networks (DNNs). The third approach introduces a DNN-SNN co-learning algorithm, which enhances the performance of SNN models by leveraging knowledge obtained from DNN models. The design of the neural architecture plays a vital role in enhancing the accuracy and energy efficiency of an SNN model. The fourth method presents a novel neural architecture search algorithm specifically tailored for SNNs on the Loihi chip. The method selects an optimal architecture based on gradients induced by the architecture at initialization across different data samples without the need for training the architecture. To demonstrate the effectiveness and performance across diverse machine intelligence applications, our methods are evaluated through (i) image classification, (ii) spectrum sensing, and (iii) modulation symbol detection. / Doctor of Philosophy / In the emerging Internet of Things (IoT), our everyday devices, from smart home gadgets to wearables, can autonomously make intelligent decisions. However, due to their limited computing power and storage, many IoT devices heavily depend on cloud computing, which brings along issues like slow response times, privacy concerns, and unreliable connections. Neuromorphic computing is a recognized and crucial approach for achieving fast and energy-efficient machine intelligence applications in IoT devices. Inspired by the human brain's neural networks, this cutting-edge approach allows devices to perform complex tasks efficiently and in real time. The programming of this neuromorphic hardware involves creating spiking neural networks (SNNs).
This dissertation presents several innovative methods to improve the precision and energy efficiency of these SNNs. Firstly, a technique called "quantization" reduces the computational and storage requirements of models without sacrificing accuracy. Secondly, a unique training algorithm is designed to enhance the performance of SNN models. Thirdly, a clever co-learning algorithm allows SNN models to learn from traditional deep neural networks (DNNs), further improving their accuracy. Lastly, a novel neural architecture search algorithm finds the best architecture for SNNs on the designated neuromorphic chip, without the need for extensive training. By making IoT devices smarter and more efficient, neuromorphic computing brings us closer to a world where our gadgets can perform intelligent tasks independently, enhancing convenience and privacy for users across the globe.
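To make the notion of quantization concrete, the following is a minimal sketch of uniform symmetric quantization of a list of weights. It shows only the basic rounding step; the dissertation's distillation-guided and Loihi-specific procedures are not reproduced, and the function names and the 8-bit default are assumptions.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    # A scale of 1.0 is used for an all-zero input to avoid dividing by zero.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from integers and the shared scale."""
    return [qi * scale for qi in q]


weights = [1.0, -0.5, 0.25, 0.0]
q, scale = quantize(weights, num_bits=8)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as a small integer plus one shared floating-point scale, which is where the storage saving comes from; the rounding error per weight is bounded by about half the scale.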
|
7 |
The Student Becomes The Teacher: Training High-Performance Language Models More Sample-Efficiently From Small Models Via Superstilling
Gundry, Chaz Allen 14 August 2023
Recent advances including the Transformer architecture have revolutionized the Natural Language Processing community by providing immense performance improvements across many tasks, including the development of Large Language Models (LLMs). LLMs show enormous promise as few-shot learners, common-sense knowledge repositories, conversational agents, writing assistants, and coding tools, and are gaining widespread traction in commercial industry. However, LLMs are expensive and time-consuming to train, requiring many passes over terabytes of data for the largest models. In this paper, we present Superstilling, a method for reducing the sample complexity of language model training by distilling the knowledge from a previously trained model (the teacher) into a new, larger model (the student). This method does not require conformity between the architectures of the two models, and can be applied even when the weights and training data of the teacher model are not available, for example in federated learning scenarios. We apply Superstilling to train models of various sizes and show this method can decrease sample complexity by more than 10% on models with over 160M parameters. We also show that in certain scenarios, Superstilling can be used to speed up training despite the need to run the teacher and student models simultaneously.
|
8 |
Advancing Learned Lossy Image Compression through Knowledge Distillation and Contextual Clustering
Yichi Zhang (19960344) 29 October 2024
In recent decades, the rapid growth of internet traffic, particularly driven by high-definition images/videos, has highlighted the critical need for effective image compression to reduce bit rates and enable efficient data transmission. Learned lossy image compression (LIC), which uses end-to-end deep neural networks, has emerged as a highly promising method, even outperforming traditional methods such as the intra-coding of the versatile video coding (VVC) standard. This thesis contributes to the field of LIC in two ways. First, we present a theoretical bound-guided knowledge distillation technique, which utilizes estimated bound information rate-distortion (R-D) functions to guide the training of LIC models. Implemented with a modified hierarchical variational autoencoder (VAE), this method demonstrates superior rate-distortion performance with reduced computational complexity. Next, we introduce a token mixer neural architecture, referred to as contextual clustering, which serves as an alternative to conventional convolutional layers or self-attention mechanisms in transformer architectures. Contextual clustering groups pixels based on their cosine similarity and uses linear layers to aggregate features within each cluster. By integrating with current LIC methods, we not only improve coding performance but also reduce computational load.
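The grouping step described for contextual clustering can be illustrated with a minimal sketch. It assumes each pixel is a small feature vector assigned to the cluster center with the highest cosine similarity, and it uses a plain mean in place of the learned linear aggregation layers; the centers, the mean aggregation, and the function names are illustrative assumptions, not the thesis's architecture.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def cluster_and_aggregate(pixels, centers):
    """Assign each pixel to its most similar center, then average per cluster."""
    assignments = [
        max(range(len(centers)), key=lambda k: cosine(p, centers[k]))
        for p in pixels
    ]
    dim = len(pixels[0])
    aggregated = []
    for k in range(len(centers)):
        members = [p for p, a in zip(pixels, assignments) if a == k]
        if members:
            # The mean stands in for a learned linear aggregation.
            aggregated.append(
                [sum(m[d] for m in members) / len(members) for d in range(dim)]
            )
        else:
            # An empty cluster keeps its center as a placeholder feature.
            aggregated.append(list(centers[k]))
    return assignments, aggregated


pixels = [[1.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
centers = [[1.0, 0.0], [0.0, 1.0]]
print(cluster_and_aggregate(pixels, centers))
```

Because cosine similarity ignores vector length, pixels are grouped by the direction of their features rather than their magnitude.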
|
9 |
Exploration of Knowledge Distillation Methods on Transformer Language Models for Sentiment Analysis / Utforskning av metoder för kunskapsdestillation på transformatoriska språkmodeller för analys av känslor
Liu, Haonan January 2022
Despite the outstanding performance of large Transformer-based language models, it remains a challenge to compress these models and deploy them in industrial environments. This degree project explores model compression methods known as knowledge distillation for the sentiment classification task on Transformer models. Transformers are neural models built from stacks of identical layers. In knowledge distillation for Transformers, a student model with fewer layers learns to mimic intermediate-layer vectors from a teacher model with more layers by minimizing a designed loss. We implement a framework to compare three knowledge distillation methods: MiniLM, TinyBERT, and Patient-KD. Student models produced by the three methods are evaluated by accuracy on the SST-2 and SemEval sentiment classification datasets. The student models’ attention matrices are also compared with those of the teacher model to find the student model that best captures dependencies in the input sentences. The comparison results show that the distillation method focusing on the attention mechanism can produce student models with better performance and less variance. We also discover an over-fitting issue in knowledge distillation and propose a Two-Step Knowledge Distillation with Transformer Layer and Prediction Layer distillation to alleviate the problem. The experimental results show that our method can produce robust, effective, and compact student models without introducing extra data. In the future, we would like to extend our framework to support more distillation methods on Transformer models and compare performance in tasks other than sentiment classification.
|
10 |
Deep Ensembles for Self-Training in NLP / Djupa Ensembler för Självträninig inom Datalingvistik
Alness Borg, Axel January 2022
With the development of deep learning methods, the requirement of having access to large amounts of data has increased. In this study, we have looked at methods for leveraging unlabeled data while only having access to small amounts of labeled data, which is common in real-world scenarios. We have investigated a method called self-training for leveraging the unlabeled data when training a model. It works by training a teacher model on the labeled data; the teacher then labels the unlabeled data for a student model to train on. A popular method in machine learning is ensembling, which is a way of improving on a single model by combining multiple models. With previous studies mainly focusing on self-training with image data and showing that ensembles can successfully be used for images, we wanted to see if the same applies to text data. We mainly focused on investigating how ensembles can be used as teachers for training a single student model. This was done by creating different ensemble models and comparing them against the individual members of the ensemble. The results showed that ensembles do not necessarily improve the accuracy of the student model over a single model, but in certain cases, when used correctly, they can provide benefits. We found that, depending on the dataset, bagging BERT models can perform the same as or better than a larger BERT model, and this translates to the student model. Bagging multiple smaller models also has the benefit of being easier to scale and more computationally efficient to train in comparison to scaling a single model.
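The use of an ensemble as the teacher in self-training can be sketched as follows. The example assumes each ensemble member outputs class probabilities for every unlabeled example, that the ensemble prediction is the plain average of those probabilities, and that low-confidence examples may be skipped; the confidence threshold and the function name are hypothetical and are not described in the thesis.

```python
def ensemble_pseudo_labels(member_probs, threshold=0.0):
    """Average member probabilities and return one pseudo-label per example.

    member_probs[m][i] is member m's list of class probabilities for
    unlabeled example i. Examples whose averaged top probability falls
    below the (assumed) threshold receive None and would be left out of
    the student's training set.
    """
    n_members = len(member_probs)
    n_examples = len(member_probs[0])
    labels = []
    for i in range(n_examples):
        n_classes = len(member_probs[0][i])
        avg = [
            sum(member_probs[m][i][c] for m in range(n_members)) / n_members
            for c in range(n_classes)
        ]
        best = max(range(n_classes), key=lambda c: avg[c])
        labels.append(best if avg[best] >= threshold else None)
    return labels


# Three hypothetical teachers, two unlabeled examples, two classes.
member_probs = [
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.7, 0.3], [0.2, 0.8]],
    [[0.2, 0.8], [0.6, 0.4]],
]
print(ensemble_pseudo_labels(member_probs))
print(ensemble_pseudo_labels(member_probs, threshold=0.9))
```

The student model would then be trained on the unlabeled examples paired with these labels, in addition to the original labeled data.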
|