1 |
Efficient Utilization of Video Embeddings from Video-Language Models
Lindgren, Felix (January 2023)
In a digital age where video content is abundant, this thesis investigates how to efficiently adapt an existing video-language model (VLM) to new data. The research builds on CLIP, a robust vision-language model, for video-related tasks including video retrieval. The study explores using pre-trained VLMs to extract video embeddings without extensive retraining. The effectiveness of a smaller model that uses aggregation is compared with that of larger models, and logistic regression is examined as a linear probe for few-shot learning on video embeddings. Aggregation was performed both without learning, through mean-pooling, and with a transformer. The video-retrieval models were evaluated on the ActivityNet Captions dataset, which contains long videos with dense descriptions, while the linear probes were evaluated on ActivityNet200, a video classification dataset. The findings suggest that most models performed better when additional frames were incorporated through aggregation. A model trained with fewer frames was able to surpass models trained with two or four times as many frames by using aggregation instead. Patch dropout and freezing the embeddings proved advantageous, improving performance while conserving training resources. Furthermore, the linear probe showed that the extracted features were of high quality, requiring only 2-4 samples per class to match zero-shot performance.
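The non-learned aggregation described above can be illustrated with a minimal sketch. It assumes per-frame embeddings are already available as plain lists of floats; the function names (`mean_pool`, `cosine`, `rank_videos`) and the toy vectors are hypothetical and are not taken from the thesis.

```python
import math

def mean_pool(frame_embeddings):
    """Average per-frame embedding vectors into a single video embedding."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[i] for frame in frame_embeddings) / n for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_videos(text_embedding, videos):
    """Return video ids sorted by similarity to the text query, best first."""
    scores = {
        video_id: cosine(text_embedding, mean_pool(frames))
        for video_id, frames in videos.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy 3-dimensional embeddings (illustrative values only, not real CLIP outputs).
videos = {
    "cooking": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "surfing": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]],
}
query = [0.9, 0.1, 0.0]
ranking = rank_videos(query, videos)
```

Because mean-pooling has no trainable parameters, more frames can be added at inference time without retraining, which is the setting the abstract compares against models trained on more frames.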
|
2 |
Optimizing Accuracy-Efficiency Tradeoffs in Emerging Neural Workloads
Amrit Nagarajan (17593524), 11 December 2023
<p>Deep Neural Networks (DNNs) are constantly evolving, enabling the power of deep learning to be applied to an ever-growing range of applications, such as Natural Language Processing (NLP), recommendation systems, and graph processing. However, these emerging neural workloads present large computational demands for both training and inference. In this dissertation, we propose optimizations that take advantage of the unique characteristics of different emerging workloads to simultaneously improve accuracy and computational efficiency.</p>
<p><br></p>
<p>First, we consider Language Models (LMs) used in NLP. We observe that the design process of LMs (pre-train a foundation model, and subsequently fine-tune it for different downstream tasks) leads to models that are highly over-parameterized for the downstream tasks. We propose AxFormer, a systematic framework that applies accuracy-driven approximations to create accurate and efficient LMs for a given downstream task. AxFormer eliminates task-irrelevant knowledge, and helps the model focus only on the relevant parts of the input.</p>
<p><br></p>
<p>Second, we find that during fine-tuning of LMs, the presence of variable-length input sequences necessitates the use of padding tokens when batching sequences, leading to ineffectual computations. It is also well known that LMs over-fit to the small task-specific training datasets used during fine-tuning, despite the use of known regularization techniques. Based on these insights, we present TokenDrop + BucketSampler, a framework that synergistically combines a new regularizer that drops a random subset of insignificant words in each sequence in every epoch, and a length-aware batching method to simultaneously reduce padding and address the overfitting issue.</p>
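<p>The padding overhead that motivates length-aware batching can be shown with a small sketch. This is a generic sort-by-length bucketing, not the dissertation's BucketSampler; the helper names and the sequence lengths are hypothetical.</p>

```python
def make_batches(lengths, batch_size, sort_by_length):
    """Split sequence lengths into batches, optionally grouping similar lengths."""
    order = sorted(lengths) if sort_by_length else list(lengths)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def padding_tokens(batches):
    """Count padding tokens when every batch is padded to its longest sequence."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)

# Hypothetical token counts for eight training sequences.
lengths = [5, 48, 7, 50, 6, 49, 8, 47]

naive = make_batches(lengths, batch_size=4, sort_by_length=False)
bucketed = make_batches(lengths, batch_size=4, sort_by_length=True)

naive_padding = padding_tokens(naive)        # short and long sequences mixed
bucketed_padding = padding_tokens(bucketed)  # similar lengths grouped together
```

<p>In this toy case, grouping by length cuts padding from 176 tokens to 12. A practical sampler would typically also shuffle the order of batches each epoch so that training does not always see sequences sorted by length.</p>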
<p><br></p>
<p>Next, we address the computational challenges of Transformers used for processing inputs of several important modalities, such as text, images, audio and videos. We present Input Compression with Positional Consistency (ICPC), a new data augmentation method that applies varying levels of compression to each training sample in every epoch, thereby simultaneously reducing over-fitting and improving training efficiency. ICPC also enables efficient variable-effort inference, where easy samples are inferred at high compression levels and hard samples at low compression levels.</p>
<p><br></p>
<p>Finally, we focus on optimizing Graph Neural Networks (GNNs), which are commonly used for learning on non-Euclidean data. Few-shot learning with GNNs is an important challenge, since real-world graph data is often sparsely labeled. Self-training, wherein the GNN is trained in stages by augmenting the training data with a subset of the unlabeled data and their pseudolabels, has emerged as a promising approach. However, self-training significantly increases the computational demands of training. We propose FASTRAIN-GNN, a framework for efficient and accurate self-training of GNNs with few labeled nodes. FASTRAIN-GNN optimizes the GNN architecture, training data, training parameters, and the graph topology during self-training.</p>
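<p>The staged pseudolabeling loop behind self-training can be sketched as follows. This is not FASTRAIN-GNN: to stay self-contained, the sketch swaps the GNN for a simple neighbor-vote classifier, and the function name, threshold, and toy graph are all hypothetical.</p>

```python
def self_train(adjacency, labels, threshold=0.5, max_stages=10):
    """Grow the labeled set in stages using confident pseudolabels.

    adjacency: dict mapping each node to a list of its neighbors
    labels:    dict mapping the few initially labeled nodes to their class
    """
    labels = dict(labels)  # copy so the caller's dict is not mutated
    for _ in range(max_stages):
        new_labels = {}
        for node, neighbors in adjacency.items():
            if node in labels:
                continue
            # Stand-in "model": count the classes of already-labeled neighbors.
            votes = {}
            for neighbor in neighbors:
                if neighbor in labels:
                    cls = labels[neighbor]
                    votes[cls] = votes.get(cls, 0) + 1
            if not votes:
                continue
            best = max(votes, key=votes.get)
            confidence = votes[best] / len(neighbors)
            # Only confident predictions become pseudolabels for the next stage.
            if confidence >= threshold:
                new_labels[node] = best
        if not new_labels:
            break  # no confident predictions left, stop early
        labels.update(new_labels)
    return labels

# Toy graph: two chains (a-b-c and x-y-z) joined by the edge c-z.
adjacency = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "z"],
    "x": ["y"], "y": ["x", "z"], "z": ["y", "c"],
}
result = self_train(adjacency, {"a": 0, "x": 1})
```

<p>Each stage requires a fresh pass over all unlabeled nodes, which hints at why self-training multiplies training cost when the per-stage model is a full GNN rather than a vote count.</p>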
<p><br></p>
<p>At inference time, we find that ensemble GNNs are significantly more accurate and robust than single-model GNNs, but suffer from high latency and storage requirements. To address this challenge, we propose GNN Ensembles through Error Node Isolation (GEENI). The key concept in GEENI is to identify nodes that are likely to be incorrectly classified (error nodes) and suppress their outgoing messages, leading to simultaneous accuracy and efficiency improvements.</p>
<p><br></p>
|