1 |
Towards Communication-Efficient Federated Learning Through Particle Swarm Optimization and Knowledge Distillation
Zaman, Saika 01 May 2024 (has links) (PDF)
The widespread popularity of Federated Learning (FL) has led researchers to delve into its various facets, primarily focusing on personalization, fair resource allocation, privacy, and global optimization, with less attention paid to the crucial aspect of ensuring efficient and cost-optimized communication between the FL server and its agents. A major challenge in achieving successful model training and inference on distributed edge devices lies in optimizing communication costs amid resource constraints, such as limited bandwidth, and selecting efficient agents. In resource-limited FL scenarios, where agents often rely on unstable networks, the transmission of large model weights can substantially degrade model accuracy and increase communication latency between the FL server and agents. Addressing this challenge, we propose a novel strategy that integrates a knowledge distillation technique with a Particle Swarm Optimization (PSO)-based FL method. This approach focuses on transmitting model scores instead of weights, significantly reducing communication overhead and enhancing model accuracy in unstable environments. Our method, with potential applications in smart city services and industrial IoT, marks a significant step forward in reducing network communication costs and mitigating accuracy loss, thereby optimizing the communication efficiency between the FL server and its agents.
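To make the communication argument concrete, the following is a minimal accounting sketch, not the thesis's implementation. It assumes one possible reading of "transmitting model scores instead of weights": every agent uploads a single scalar score, and only the best-scoring agent then uploads its weights. The function names, the 4-byte value size, and the "lowest score wins" rule are all hypothetical.

```python
def upload_bytes(num_agents, num_weights, bytes_per_value=4, scores_only=False):
    """Bytes uploaded to the server in one round (illustrative accounting only)."""
    if scores_only:
        # Assumption: each agent sends one scalar score, then only the
        # selected agent sends its full weight vector.
        return num_agents * bytes_per_value + num_weights * bytes_per_value
    # Baseline: every agent sends its full weight vector.
    return num_agents * num_weights * bytes_per_value


def select_best_agent(scores):
    """Index of the agent with the lowest reported score (e.g. a local loss)."""
    return min(range(len(scores)), key=lambda i: scores[i])


if __name__ == "__main__":
    print(upload_bytes(10, 1000))                    # all agents upload weights
    print(upload_bytes(10, 1000, scores_only=True))  # scores first, one upload
    print(select_best_agent([0.9, 0.3, 0.5]))
```

Under these assumptions the per-round upload grows with the number of agents only through the scalar scores, which is why the saving increases as more agents participate.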
|
2 |
REFT: Resource-Efficient Federated Training Framework for Heterogeneous and Resource-Constrained Environments
Desai, Humaid Ahmed Habibullah 22 November 2023 (has links)
Federated Learning (FL) is a sub-domain of machine learning (ML) that enforces privacy by allowing the user's local data to reside on their device. Instead of having users send their personal data to a server where the model resides, FL flips the paradigm and brings the model to the user's device for training. Existing works share model parameters or use distillation principles to address the challenges of data heterogeneity. However, these methods ignore some of the other fundamental challenges in FL: device heterogeneity and communication efficiency. In practice, client devices in FL differ greatly in their computational power and communication resources. This is exacerbated by unbalanced data distribution, resulting in an overall increase in training times and the consumption of more bandwidth. In this work, we present a novel approach for resource-efficient FL called REFT with variable pruning and knowledge distillation techniques to address the computational and communication challenges faced by resource-constrained devices.
Our variable pruning technique is designed to reduce computational overhead and increase resource utilization for clients by adapting the pruning process to their individual computational capabilities. Furthermore, to minimize bandwidth consumption and reduce the number of back-and-forth communications between the clients and the server, we leverage knowledge distillation to create an ensemble of client models and distill their collective knowledge to the server. Our experimental results on image classification tasks demonstrate the effectiveness of our approach in conducting FL in a resource-constrained environment. We achieve this by training Deep Neural Network (DNN) models while optimizing resource utilization at each client. Additionally, our method allows for minimal bandwidth consumption and a diverse range of client architectures while maintaining performance and data privacy. / Master of Science / In a world driven by data, preserving privacy while leveraging the power of machine learning (ML) is a critical challenge. Traditional approaches often require sharing personal data with central servers, raising concerns about data privacy. Federated Learning (FL) is a cutting-edge solution that turns this paradigm on its head. FL brings the machine learning model to your device, allowing it to learn from your data without that data ever leaving your device. While FL holds great promise, it faces its own set of challenges. Existing research has largely focused on making FL work with different types of data, but there are still other issues to be resolved. Our work introduces a novel approach called REFT that addresses two critical challenges in FL: making it work smoothly on devices with varying levels of computing power and reducing the amount of data that needs to be transferred during the learning process. Imagine your smartphone and your laptop. They have different levels of computing power.
REFT adapts the learning process to each device's capabilities using a proposed technique called Variable Pruning. Think of it as a personalized fitness trainer, tailoring the workout to your specific fitness level. Additionally, we've adopted a technique called knowledge distillation. It's like a student learning from a teacher, where the teacher shares only the most critical information. In our case, this reduces the amount of data that needs to be sent across the internet, saving bandwidth and making FL more efficient. Our experiments, which involved training machines to recognize images, demonstrate that REFT works well, even on devices with limited resources. It's a step forward in ensuring your data stays private while still making machine learning smarter and more accessible.
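The idea of adapting pruning to each client's capability can be sketched as follows. This is an illustrative stand-in, not REFT's actual algorithm: it uses plain magnitude pruning, and `ratio_for_client` with its `capability` score in [0, 1] and `max_ratio` cap is a hypothetical rule introduced only for this example.

```python
def prune_by_magnitude(weights, prune_ratio):
    """Zero out the smallest-magnitude fraction `prune_ratio` of the weights."""
    if not 0.0 <= prune_ratio <= 1.0:
        raise ValueError("prune_ratio must be in [0, 1]")
    k = int(len(weights) * prune_ratio)
    # Indices of the k weights with the smallest absolute value.
    drop = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def ratio_for_client(capability, max_ratio=0.8):
    """Hypothetical rule: weaker clients (capability near 0) prune more."""
    capability = min(max(capability, 0.0), 1.0)
    return max_ratio * (1.0 - capability)


if __name__ == "__main__":
    weights = [0.5, -0.1, 0.3, -0.9]
    for capability in (1.0, 0.5, 0.0):
        ratio = ratio_for_client(capability)
        print(capability, ratio, prune_by_magnitude(weights, ratio))
```

A weaker device ends up training a sparser model, which is the intuition behind the "personalized fitness trainer" analogy above.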
|
3 |
Efficient Recycling Of Non-Ferrous Materials Using Cross-Modal Knowledge Distillation
Brundin, Sebastian, Gräns, Adam January 2021 (has links)
This thesis investigates the possibility of utilizing data from multiple modalities to enable an automated recycling system to separate ferrous from non-ferrous debris. Two methods, sensor fusion and hallucinogenic sensor fusion, were implemented in a four-step approach of deep CNNs. Sensor fusion implies that multiple modalities are run simultaneously during the operation of the system. The individual outputs are then fused, and the joint performance is expected to be superior to that of any single sensor. In hallucinogenic sensor fusion, the goal is to achieve the benefits of sensor fusion with respect to cost and complexity even when one of the modalities is removed from the system. This is achieved by leveraging data from a more complex modality onto a simpler one in a student/teacher approach. As a result, the teacher modality trains the student sensor to hallucinate features beyond its visual spectrum. Based on the results of a prestudy involving multiple types of modalities, a hyperspectral sensor was deployed as the teacher to complement a simple RGB camera. Three studies involving differently composed datasets were then conducted to evaluate the effectiveness of the methods. The results show that the joint performance of a hyperspectral sensor and an RGB camera is superior to that of either sensor deployed individually. It can also be concluded that training a network with hyperspectral images can improve the classification accuracy when operating with only RGB data. However, the addition of a hyperspectral sensor might be considered superfluous, as this report shows that the standardized shapes of industrial debris enable a single RGB camera to achieve an accuracy above 90%. The material used in this thesis can also be concluded to be suboptimal for hyperspectral analysis. Compared to vegetation scenes, only a limited amount of additional data could be obtained by including wavelengths besides the ones representing red, green and blue.
|
4 |
Enhancing Object Detection Methods by Knowledge Distillation for Automotive Driving in Real-World Settings
Kian, Setareh 07 August 2023 (has links)
No description available.
|
5 |
Compressing Deep Learning models for Natural Language Understanding
Ait Lahmouch, Nadir January 2022 (has links)
Pre-trained language models such as BERT have proven particularly effective for natural language processing (NLP) tasks in recent years. However, the enormous demand on computational resources required to train such models makes their use in the real world difficult. To overcome this problem, compression methods have emerged in recent years. In this project, some of these neural network compression approaches for text processing are studied, implemented and tested. In our case, the most efficient method was Knowledge Distillation, which consists of transferring knowledge from a large neural network, called the teacher, to a small neural network, called the student.
There are several variants of this approach, which differ in their complexity. We examine two of them in this project. The first allows knowledge transfer between any neural network and a smaller bidirectional LSTM, using only the output of the larger model. The second, more complex approach encourages the student model to also learn from the intermediate layers of the teacher model for incremental knowledge extraction. The ultimate goal of this project is to provide the company’s data scientists with ready-to-use compression methods for their future projects requiring the use of deep neural networks for NLP.
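As an illustration of output-only distillation, the sketch below computes the widely used soft-target loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The abstract does not state which loss the thesis uses, so this formulation, the function names, and the default temperature are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    print(distillation_loss([2.0, 1.0, 0.1], teacher))  # student matches teacher
    print(distillation_loss([0.1, 1.0, 2.0], teacher))  # student disagrees
```

Only the teacher's output logits are needed here, so the student architecture (for example a bidirectional LSTM) is free to differ from the teacher's.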
|
6 |
Knowledge Distillation of DNABERT for Prediction of Genomic Elements / Kunskapsdestillation av DNABERT för prediktion av genetiska attribut
Palés Huix, Joana January 2022 (has links)
Understanding the information encoded in the human genome and the influence of each part of the DNA sequence is a fundamental problem of our society that can be key to unveiling the mechanisms of common diseases. With the latest technological developments in the genomics field, many research institutes have the tools to collect massive amounts of genomic data. Nevertheless, there is a lack of tools that can be used to process and analyse these datasets in a biologically reliable and efficient manner. Many deep learning solutions have been proposed to solve current genomic tasks, but most of the time the main research interest is in the underlying biological mechanisms rather than high scores of the predictive metrics themselves. Recently, the state of the art in deep learning has shifted towards large transformer models, which use an attention mechanism that can be leveraged for interpretability. The main drawbacks of these large models are that they require a lot of memory space and have high inference time, which may make their use infeasible in practical applications. In this work, we test the appropriateness of knowledge distillation to obtain more usable and equally performing models that genomic researchers can easily fine-tune to solve their scientific problems. DNABERT, a transformer model pre-trained on DNA data, is distilled following two strategies: DistilBERT and MiniLM. Four student models with different sizes are obtained and fine-tuned for promoter identification. They are evaluated in three key aspects: classification performance, usability and biological relevance of the predictions. The latter is assessed by visually inspecting the attention maps of TATA-promoter predictions, which are expected to have a peak of attention at the well-known TATA motif present in these sequences.
Results show that it is indeed possible to obtain significantly smaller models that are equally performant in the promoter identification task without any major differences between the two techniques tested. The smallest distilled model experiences less than a 1% decrease in all performance metrics evaluated (accuracy, F1 score and Matthews Correlation Coefficient) and a 7.3x increase in inference speed, while having only 15% of the parameters of DNABERT. The attention maps for the student models show that they successfully learn to mimic the general understanding of the DNA that DNABERT possesses.
|
7 |
Energy-efficient Neuromorphic Computing for Resource-constrained Internet of Things Devices
Liu, Shiya 03 November 2023 (has links)
Due to the limited computation and storage resources of Internet of Things (IoT) devices, many emerging intelligent applications based on deep learning techniques heavily depend on cloud computing for computation and storage. However, cloud computing faces technical issues with long latency, poor reliability, and weak privacy, resulting in the need for on-device computation and storage. Also, on-device computation is essential for many time-critical applications, which require real-time data processing and energy efficiency. Furthermore, the escalating requirements for on-device processing are driven by network bandwidth limitations and consumer expectations concerning data privacy and user experience. In the realm of computing, there is a growing interest in exploring novel technologies that can facilitate ongoing advancements in performance. Of the various prospective avenues, the field of neuromorphic computing has garnered significant recognition as a crucial means to achieve fast and energy-efficient machine intelligence applications for IoT devices. The programming of neuromorphic computing hardware typically involves the construction of a spiking neural network (SNN) capable of being deployed onto the designated neuromorphic hardware. This dissertation presents a range of methodologies aimed at enhancing the precision and energy efficiency of SNNs. To be more precise, these advancements are achieved by incorporating four essential methods. The first method is the quantization of neural networks through knowledge distillation. This work introduces a quantization technique that effectively reduces the computational and storage resource requirements of a model while minimizing the loss of accuracy.
To further reduce quantization errors, the second method introduces a novel quantization-aware training algorithm specifically designed for training quantized spiking neural network (SNN) models intended for execution on the Loihi chip, a specialized neuromorphic computing chip. SNNs generally exhibit lower accuracy than deep neural networks (DNNs). The third approach introduces a DNN-SNN co-learning algorithm, which enhances the performance of SNN models by leveraging knowledge obtained from DNN models. The design of the neural architecture plays a vital role in enhancing the accuracy and energy efficiency of an SNN model. The fourth method presents a novel neural architecture search algorithm specifically tailored for SNNs on the Loihi chip. The method selects an optimal architecture based on gradients induced by the architecture at initialization across different data samples without the need for training the architecture. To demonstrate the effectiveness and performance across diverse machine intelligence applications, our methods are evaluated through (i) image classification, (ii) spectrum sensing, and (iii) modulation symbol detection. / Doctor of Philosophy / In the emerging Internet of Things (IoT), our everyday devices, from smart home gadgets to wearables, can autonomously make intelligent decisions. However, due to their limited computing power and storage, many IoT devices heavily depend on cloud computing, which brings along issues like slow response times, privacy concerns, and unreliable connections. Neuromorphic computing is a recognized and crucial approach for achieving fast and energy-efficient machine intelligence applications in IoT devices. Inspired by the human brain's neural networks, this cutting-edge approach allows devices to perform complex tasks efficiently and in real time. The programming of this neuromorphic hardware involves creating spiking neural networks (SNNs).
This dissertation presents several innovative methods to improve the precision and energy efficiency of these SNNs. Firstly, a technique called "quantization" reduces the computational and storage requirements of models without sacrificing accuracy. Secondly, a unique training algorithm is designed to enhance the performance of SNN models. Thirdly, a clever co-learning algorithm allows SNN models to learn from traditional deep neural networks (DNNs), further improving their accuracy. Lastly, a novel neural architecture search algorithm finds the best architecture for SNNs on the designated neuromorphic chip, without the need for extensive training. By making IoT devices smarter and more efficient, neuromorphic computing brings us closer to a world where our gadgets can perform intelligent tasks independently, enhancing convenience and privacy for users across the globe.
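To show how quantization trades precision for storage, here is a minimal sketch of uniform symmetric quantization to signed integers. It is a generic textbook scheme, not the dissertation's distillation-based or Loihi-specific method, and the function names and the 8-bit default are assumptions.

```python
def quantize(values, num_bits=8):
    """Uniform symmetric quantization; returns (integer codes, scale)."""
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.3, -0.7, 0.12]
    codes, scale = quantize(weights, num_bits=8)
    restored = dequantize(codes, scale)
    print(codes, scale)
    print([abs(w - r) for w, r in zip(weights, restored)])  # per-value error
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the error that quantization-aware training tries to make the model tolerate.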
|
8 |
The Student Becomes The Teacher: Training High-Performance Language Models More Sample-Efficiently From Small Models Via Superstilling
Gundry, Chaz Allen 14 August 2023 (has links) (PDF)
Recent advances including the Transformer architecture have revolutionized the Natural Language Processing community by providing immense performance improvements across many tasks, including the development of Large Language Models (LLMs). LLMs show enormous promise as few-shot learners, common-sense knowledge repositories, conversational agents, writing assistants, and coding tools, and are gaining widespread traction in commercial industry. However, LLMs are expensive and time-consuming to train, requiring many passes over terabytes of data for the largest models. In this paper, we present Superstilling, a method for reducing the sample complexity of language model training by distilling the knowledge from a previously trained model (the teacher) into a new, larger model (the student). This method does not require conformity between the architectures of the two models, and can be applied even when the weights and training data of the teacher model are not available, for example in federated learning scenarios. We apply Superstilling to train models of various sizes and show this method can decrease sample complexity by more than 10% on models with over 160M parameters. We also show that in certain scenarios, Superstilling can be used to speed up training despite the need to run the teacher and student models simultaneously.
|
9 |
Advancing Learned Lossy Image Compression through Knowledge Distillation and Contextual Clustering
Yichi Zhang (19960344) 29 October 2024 (has links)
In recent decades, the rapid growth of internet traffic, particularly driven by high-definition images/videos, has highlighted the critical need for effective image compression to reduce bit rates and enable efficient data transmission. Learned lossy image compression (LIC), which uses end-to-end deep neural networks, has emerged as a highly promising method, even outperforming traditional methods such as the intra-coding of the versatile video coding (VVC) standard. This thesis contributes to the field of LIC in two ways. First, we present a theoretical bound-guided knowledge distillation technique, which utilizes estimated bound information rate-distortion (R-D) functions to guide the training of LIC models. Implemented with a modified hierarchical variational autoencoder (VAE), this method demonstrates superior rate-distortion performance with reduced computational complexity. Next, we introduce a token mixer neural architecture, referred to as contextual clustering, which serves as an alternative to conventional convolutional layers or self-attention mechanisms in transformer architectures. Contextual clustering groups pixels based on their cosine similarity and uses linear layers to aggregate features within each cluster. By integrating with current LIC methods, we not only improve coding performance but also reduce computational load.
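The grouping step of contextual clustering can be sketched roughly as below: each pixel feature is assigned to the cluster center with the highest cosine similarity, and features are then aggregated per cluster. This is a simplified illustration under stated assumptions: the centers are given rather than learned, and a plain mean stands in for the linear layers the abstract mentions. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def assign_clusters(features, centers):
    """For each feature, the index of the most cosine-similar center."""
    return [max(range(len(centers)), key=lambda c: cosine(f, centers[c]))
            for f in features]


def aggregate(features, assignment, num_clusters):
    """Mean feature per cluster (a stand-in for learned linear layers)."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for f, c in zip(features, assignment):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += f[d]
    # Empty clusters keep an all-zero vector.
    return [[s / counts[c] for s in sums[c]] if counts[c] else sums[c]
            for c in range(num_clusters)]


if __name__ == "__main__":
    pixels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    centers = [[1.0, 0.0], [0.0, 1.0]]
    labels = assign_clusters(pixels, centers)
    print(labels)
    print(aggregate(pixels, labels, len(centers)))
```

Because the mixing happens within clusters of similar pixels rather than over fixed spatial windows or all pixel pairs, the cost depends on cluster sizes, which is one plausible source of the reduced computational load.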
|
10 |
Knowledge Distillation for Semantic Segmentation and Autonomous Driving: A study on the influence of hyperparameters, initialization of a student network and the distillation method on the semantic segmentation of urban scenes.
Sanchez Nieto, Juan January 2022 (has links)
Reducing the size of a neural network whilst maintaining comparable performance is an important problem to be solved, since the resource constraints of small devices make it impossible to deploy large models in numerous real-life scenarios. A prominent example is autonomous driving, where computer vision tasks such as object detection and semantic segmentation need to be performed in real time by mobile devices. In this thesis, the knowledge distillation and spherical knowledge distillation techniques are utilized to train a small model (PSPNet50) under the supervision of a large model (PSPNet101) in order to perform semantic segmentation of urban scenes. The importance of the distillation hyperparameters is studied first, namely the influence of the temperature and the weights of the loss function on the performance of the distilled model, showing no decisive advantage over the individual training of the student. Thereafter, distillation is performed utilizing a pretrained student, revealing a good improvement in performance. Contrary to expectations, the pretrained student benefits from a high learning rate when training resumes under distillation, especially in the spherical knowledge distillation case, displaying a superior and more stable performance when compared to the regular knowledge distillation setting. These findings are validated by several experiments conducted using the Cityscapes dataset. The best distilled model achieves 87.287% pixel accuracy and a 42.0% mean Intersection-Over-Union value (mIoU) on the validation set, higher than the 86.356% pixel accuracy and 39.6% mIoU obtained by the baseline student. On the test set, the official evaluation obtained by submission to the Cityscapes website yields 42.213% mIoU for the distilled model and 41.085% for the baseline student.
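For readers unfamiliar with the two reported metrics, the sketch below computes pixel accuracy and mean Intersection-over-Union (mIoU) on flat lists of per-pixel class labels. It is a simplified illustration: the official Cityscapes evaluation accumulates counts over the whole dataset and ignores void labels, which this example does not model, and the function names are hypothetical.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class equals the ground truth."""
    correct = sum(1 for p, t in zip(pred, target) if p == t)
    return correct / len(target)


def mean_iou(pred, target, num_classes):
    """Mean over classes of |pred ∩ target| / |pred ∪ target|."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious)


if __name__ == "__main__":
    pred = [0, 0, 1, 1]
    target = [0, 1, 1, 1]
    print(pixel_accuracy(pred, target))
    print(mean_iou(pred, target, num_classes=2))
```

The toy example also shows why the two numbers can diverge: a single wrong pixel costs 25 points of accuracy here but pulls the IoU of both affected classes down, which is why mIoU values are typically much lower than pixel accuracy.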
|