
Building A More Efficient Mobile Vision System Through Adaptive Video Analytics

Junpeng Guo (20349582) 17 December 2024 (has links)
<p dir="ltr">Mobile vision is becoming the norm, transforming our daily lives. It powers numerous applications, enabling seamless interactions between the digital and physical worlds, such as augmented reality, real-time object detection, and many others. The popularity of mobile vision has spurred advancements from both computer vision (CV) and mobile edge computing (MEC) communities. The former focuses on improving analytics accuracy through the use of proper deep neural networks (DNNs), while the latter addresses the resource limitations of mobile environments by coordinating tasks between mobile and edge devices, determining which data to transmit and process to enable real-time performance. </p><p dir="ltr"> Despite recent advancements, existing approaches typically integrate the functionalities of the two camps at a basic task level. They rely on a uniform on-device processing scheme that streams the same type of data and uses the same DNN model for identical CV tasks, regardless of the analytical complexity of the current input, input size, or latency requirements. This lack of adaptability to dynamic contexts limits their ability to achieve optimal efficiency in scenarios involving diverse source data, varying computational resources, and differing application requirements. </p><p dir="ltr">Our approach seeks to move beyond task-level adaptation by emphasizing customized optimizations tailored to dynamic use scenarios. This involves three key adaptive strategies: dynamically compressing source data based on contextual information, selecting the appropriate computing model (e.g., DNN or sub-DNN) for the vision task, and establishing a feedback mechanism for context-aware runtime tuning. Additionally, for scenarios involving movable cameras, the feedback mechanism guides the data capture process to further enhance performance. 
These innovations are explored across three use cases categorized by the capture device: one stationary camera, one moving camera, and cross-camera analytics. </p><p dir="ltr">My dissertation begins with a stationary camera scenario, where we improve efficiency by adapting to the use context on both the device and edge sides. On the device side, we explore a broader compression space and implement adaptive compression based on data context. Specifically, we leverage changes in confidence scores as feedback to guide on-device compression, progressively reducing data volume while preserving the accuracy of visual analytics. On the edge side, instead of training a specialized DNN for each deployment scenario, we adaptively select the best-fit sub-network for the given context. A shallow sub-network is used to “test the waters”, accelerating the search for a deep sub-network that maximizes analytical accuracy while meeting latency requirements.</p><p dir="ltr"> Next, we explore scenarios involving a moving camera, such as those mounted on drones. These introduce new challenges, including increased data encoding demands due to camera movement and degraded analytics performance (e.g., tracking) caused by changing perspectives. To address these issues, we leverage drone-specific domain knowledge to optimize compression for object detection by applying global motion compensation and assigning different resolutions at a tile-granularity level based on the far-near effect. Furthermore, we tackle the more complex task of object tracking and following, where the analytics results directly influence the drone’s navigation. To enable effective target following with minimal processing overhead, we design an adaptive frame rate tracking mechanism that dynamically adjusts based on changing contexts.</p><p dir="ltr"> Last but not least, we extend the work to cross-camera analytics, focusing on coordination between one stationary ground-based camera and one moving aerial camera. 
The primary challenge lies in addressing significant misalignments (e.g., scale, rotation, and lighting variations) between the two perspectives. To overcome these issues, we propose a multi-exit matching mechanism that prioritizes local feature matching while incorporating global features and additional cues, such as color and location, to refine matches as needed. This approach ensures accurate identification of the same target across viewpoints while minimizing computational overhead by dynamically adapting to the complexity of the matching task. </p><p dir="ltr">While the current work primarily addresses ideal conditions, assuming favorable weather, optimal lighting, and reliable network performance, it establishes a solid foundation for future innovations in adaptive video processing under more challenging conditions. Future efforts will focus on enhancing robustness against adversarial factors, such as sensing data drift and transmission losses. Additionally, we plan to explore multi-camera coordination and multimodal data integration, leveraging the growing potential of large language models to further advance this field.</p>
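The confidence-feedback loop described for the stationary-camera case can be sketched in a few lines. This is a minimal illustration only; the function name, thresholds, and quality steps below are assumptions for exposition, not the dissertation's actual parameters or algorithm.

```python
# Hypothetical sketch: confidence-feedback-guided adaptive compression.
# Quality is lowered while detector confidence holds steady, and raised
# as soon as confidence drops noticeably. All constants are illustrative.

def adapt_quality(quality: int, prev_conf: float, curr_conf: float,
                  drop_tol: float = 0.05, q_min: int = 10, q_max: int = 95) -> int:
    """Progressively reduce compression quality while accuracy holds;
    back off when the confidence score drops by more than drop_tol."""
    if prev_conf - curr_conf > drop_tol:
        # Accuracy is degrading: spend more bits on the next frame.
        return min(q_max, quality + 10)
    # Confidence stable: try compressing more aggressively.
    return max(q_min, quality - 5)

# Simulate a stream where confidence stays flat, then dips.
quality, prev = 80, 0.90
for conf in [0.90, 0.89, 0.88, 0.70]:
    quality = adapt_quality(quality, prev, conf)
    prev = conf
```

The same shape of controller generalizes to the drone case, where the feedback additionally steers frame rate rather than only compression quality.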

Video Analytics for Agricultural Applications

Shengtai Ju (19180429) 20 July 2024 (has links)
<p dir="ltr">Agricultural applications often require human experts with domain knowledge to ensure compliance and improve productivity, which can be costly and inefficient. To tackle this problem, automated video systems can be deployed for agricultural tasks thanks to the ubiquity of cameras. In this thesis, we focus on designing and implementing video analytics systems for real applications in agriculture by combining traditional image processing with recent advancements in computer vision. Existing research and available methods have focused heavily on obtaining the best performance on large-scale benchmarking datasets while neglecting applications to real-world problems. Our goal is to bridge the gap between state-of-the-art methods and real agricultural applications. More specifically, we design video systems for two tasks: monitoring turkey behavior for turkey welfare and recognizing handwashing actions for improved food safety. For monitoring turkeys, we implement a turkey detector, a turkey tracker, and a turkey head tracker by combining object detection and multi-object tracking. Furthermore, we detect turkey activities by incorporating motion information. For recognizing handwashing activities, we combine a hand extraction method that focuses on the hand regions with a neural network to build a hand image classifier. In addition, we apply a two-stream network with RGB and hand streams to further improve performance and robustness.</p><p dir="ltr">Besides designing a robust hand classifier, we explore how dataset attributes and distribution shifts can impact system performance. In particular, distribution shifts caused by changes in hand pose and shadow can cause a classifier’s performance to degrade sharply or break down beyond a certain point. To better explore the impact of hand poses and shadow, and to mitigate the induced breakdown points, we generate synthetic data with desired variations to introduce controlled distribution shift. Experimental results show that the breakdown points are heavily impacted by pose and shadow conditions. In addition, we demonstrate mitigation strategies for significant performance degradation by using selective additional training data and adding synthetic shadow to images. By incorporating domain knowledge and understanding the applications, we can effectively design video analytics systems and apply advanced techniques in agricultural scenarios.</p>
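The two-stream (RGB + hand) design mentioned above amounts to late fusion of two classifiers' outputs. Below is a minimal sketch of one common fusion scheme (averaged softmax probabilities); the function names, weights, and fusion rule are assumptions for illustration, not necessarily the thesis's exact architecture.

```python
import math

# Hypothetical sketch of late fusion for a two-stream (RGB + hand)
# handwashing-action classifier. Weights and fusion rule are illustrative.

def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_streams(rgb_logits, hand_logits, w_rgb=0.5):
    """Average the per-class probabilities of the two streams and
    return the index of the winning action class."""
    p_rgb = softmax(rgb_logits)
    p_hand = softmax(hand_logits)
    fused = [w_rgb * a + (1 - w_rgb) * b for a, b in zip(p_rgb, p_hand)]
    return max(range(len(fused)), key=fused.__getitem__)
```

A confident hand stream can overturn an uncertain RGB stream, which is the robustness benefit the abstract describes.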

IMPROVING QOE OF 5G APPLICATIONS (VR AND VIDEO ANALYTICS APPLICATION) ON EDGE DEVICES

Sibendu Paul (14270921) 17 May 2024 (has links)
<p>Recent advancements in deep learning (DL) and high-bandwidth access networks such as 5G enable applications that require intelligence and fast computation at the edge with low power consumption. In this thesis, we study how to improve the Quality-of-Experience (QoE) of these emerging 5G applications, e.g., virtual reality (VR) and video analytics on edge devices. These applications either require high-quality visual effects with stringent latency requirements (for VR) or high analytics accuracy (for video analytics) while maintaining frame-rate requirements under dynamic conditions. </p> <p>In part 1, we study how to support high-quality untethered immersive multiplayer VR on commodity mobile devices. Simply replicating the prior art for single-user VR would result in a linear increase in network bandwidth that exceeds the capacity of WiFi (802.11ac). We propose a novel technique, <em>Coterie</em>, that splits the rendering of background environment (BE) frames between the mobile device and the edge server, drastically increasing the similarity of BE frames and reducing network load via frame caching and reuse. Coterie reduces the per-player network requirement by over 10x and easily supports 4 players on a Pixel 2 over 802.11ac while maintaining the QoE constraints of 4K VR.</p> <p>In part 2, we study how to achieve high analytics accuracy in video analytics pipelines (VAPs). We observe that the frames captured by the surveillance cameras powering a variety of 24x7 analytics applications are not always pristine -- they can be distorted due to environmental changes, lighting issues, sensor noise, compression, etc. Such distortions not only deteriorate the accuracy of deep learning applications but also waste the edge server resources used to run these computationally expensive DL models. First, we study how to dynamically filter out low-quality captured frames. We propose a lightweight DL-based quality estimator, <em>AQuA</em>, that can filter out low-quality frames that would lead to high-confidence errors (false positives) if fed into the analytic units (AUs) in the VAP. The AQuA filter reduces false positives by 17% and compute and network usage by up to 27% when used in a face-recognition VAP. Second, we study how to reduce poor-quality frame captures at the camera itself. We propose <em>CamTuner</em>, a system that automatically and dynamically adapts complex camera settings to changing environmental conditions, guided by analytical quality estimation, to enhance the accuracy of video analytics. In a real customer deployment, <em>CamTuner</em> improves VAP accuracy by detecting 15.9% additional persons and 2.6%–4.2% additional cars (without any false positives) compared to the default camera settings. While <em>CamTuner</em> focuses on improving the accuracy of a single AU running on a camera stream, we next present <em>Elixir</em>, a system that enhances video stream quality for multiple analytics tasks on the same stream by jointly optimizing the objectives of different AUs. In a real-world deployment, <em>Elixir</em> correctly detects 7.1% (22,068) and 5.0% (15,731) more cars, 94% (551) and 72% (478) more faces, and 670.4% (4,975) and 158.6% (3,507) more persons than the default-camera-setting and time-sharing approaches, respectively.</p>
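The quality-gating idea behind AQuA can be sketched as a simple filter in front of the analytic unit. The quality proxy below (mean brightness) is a deliberate stand-in for the learned DL-based estimator described in the thesis; the function names and threshold are assumptions.

```python
# Hypothetical sketch of AQuA-style quality gating: run the expensive
# analytic unit (AU) only on frames a lightweight quality estimator
# deems usable. The estimator here is a toy brightness proxy, not the
# learned quality model described in the thesis.

def quality_score(frame):
    """Toy quality proxy: penalize very dark frames.
    `frame` is a flat list of pixel intensities in [0, 255]."""
    mean = sum(frame) / len(frame)
    return mean / 255.0

def filter_frames(frames, threshold=0.2):
    """Keep only frames whose quality score clears the threshold,
    so low-quality inputs never reach the analytic unit."""
    return [f for f in frames if quality_score(f) >= threshold]

good = [200, 180, 220]      # well-lit frame
dark = [10, 5, 12]          # near-black frame
kept = filter_frames([good, dark])
```

Dropping frames before inference is what yields the compute and network savings quoted above: the AU simply never sees inputs likely to produce high-confidence errors.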

Designing a Prototype for Visual Exploration of Narrative Patterns in News Videos

Liebl, Bernhard, Burghardt, Manuel 04 July 2024 (has links)
News videos play an important role in shaping our everyday communication. At the same time, news videos use narrative patterns to keep people entertained. Understanding how these patterns work and how they are applied in news videos is crucial for understanding how they may affect a video’s ideological message, which is an important dimension in times of fake news and disinformation campaigns. We present Zoetrope, a web-based tool that supports the discovery of narrative patterns in news videos by means of a visual exploration approach. Zoetrope integrates a number of multimodal information extraction frameworks into an interactive visualization to allow for efficient exploratory access to large collections of news videos.

Video Analytics with Spatio-Temporal Characteristics of Activities

Cheng, Guangchun 05 1900 (has links)
As video capturing devices become more ubiquitous, from surveillance cameras to smartphones, the demand for automated video analysis is greater than ever. One obstacle in this process is to efficiently locate where a human operator’s attention should be, and another is to determine the specific types of activities or actions without ambiguity. It is the special interest of this dissertation to locate spatial and temporal regions of interest in videos and to develop a better action representation for video-based activity analysis. This dissertation follows the scheme of “locating then recognizing” activities of interest in videos, i.e., the locations of potentially interesting activities are estimated before performing in-depth analysis. Theoretical properties of regions of interest in videos are first exploited, based on which a unifying framework is proposed to locate both spatial and temporal regions of interest with the same parameter settings. The approach estimates the distribution of motion based on 3D structure tensors and locates regions of interest according to persistent occurrences of low probability. Two further contributions are made to better represent actions. The first is a unifying model of spatio-temporal relationships between reusable mid-level actions that bridge low-level pixels and high-level activities. Dense trajectories are clustered to construct mid-level actionlets, and the temporal relationships between actionlets are modeled as Action Graphs based on Allen interval predicates. The second is a novel and efficient representation of action graphs based on a sparse coding framework. Action graphs are first represented using Laplacian matrices and then decomposed as a linear combination of primitive dictionary items following a sparse coding scheme. The optimization is formulated and solved as a determinant maximization problem, and 1-nearest-neighbor classification is used for action recognition. Experiments show better results than existing approaches for region-of-interest detection and action recognition.
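The "persistent occurrences of low probability" criterion can be illustrated with an empirical motion histogram: cells whose observed motion is rare under the scene's motion distribution are flagged as regions of interest. This is a heavily simplified stand-in for the dissertation's 3D structure-tensor formulation; the grid, quantization, and threshold below are assumptions.

```python
from collections import Counter

# Hypothetical sketch of "locate then recognize": flag grid cells whose
# quantized motion magnitude is improbable under the scene's empirical
# motion distribution. Constants and data layout are illustrative only.

def motion_probability(history):
    """Empirical probability of each quantized motion magnitude."""
    counts = Counter(history)
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()}

def interesting_cells(grid_motions, history, p_thresh=0.1):
    """Return indices of grid cells whose current motion value is
    rare (low probability) under the observed distribution."""
    prob = motion_probability(history)
    return [i for i, m in enumerate(grid_motions)
            if prob.get(m, 0.0) < p_thresh]

# Mostly-static scene (motion 0) with one cell showing rare fast motion.
history = [0] * 95 + [3] * 5
roi = interesting_cells([0, 0, 3, 0], history)
```

Only the flagged cells would then be passed to the more expensive action-representation stage, which is the point of locating before recognizing.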

FakeNarratives – First Forays in Understanding Narratives of Disinformation in Public and Alternative News Videos

Tseng, Chiao-I, Liebl, Bernhard, Burghardt, Manuel, Bateman, John 04 July 2024 (has links)
No description available.

TASK-AWARE VIDEO COMPRESSION AND QUALITY ESTIMATION IN PRACTICAL VIDEO ANALYTICS SYSTEMS

Praneet Singh (20797433) 28 February 2025 (has links)
<p dir="ltr">Practical video analytics systems that perform computer vision tasks are widely used in critical real-world scenarios such as autonomous driving and public safety. These end-to-end systems perform tasks like object detection, segmentation, and recognition sequentially, such that the performance of each analytics task depends on how well the previous tasks are performed. Typically, these systems are deployed in resource- and bandwidth-constrained environments, so video compression algorithms like HEVC are necessary to minimize transmission bandwidth at the expense of input quality. Furthermore, to optimize resource utilization, the analytics tasks should be executed only on inputs that may provide valuable insights into task performance. Hence, it is essential to understand the impact of compression and input data quality on the overall performance of end-to-end video analytics systems, using meaningfully curated datasets and interpretable evaluation procedures. This information is crucial for the overall improvement of system performance. Thus, in this thesis we focus on:</p><ol><li>Understanding the effects of compression on the performance of video analytics systems that perform tasks such as pedestrian detection, face detection, and face recognition. With this, we develop a task-aware video encoding strategy for HEVC that improves system performance under compression.</li><li>Designing methodologies for a meaningful and interpretable evaluation of an end-to-end system that sequentially performs face detection, alignment, and recognition. This involves balancing datasets, creating consistent ground truths, and capturing the performance interdependence between the various tasks of the system.</li><li>Estimating how image quality is linked to task performance in end-to-end face analytics systems. Here, we design novel task-aware image Quality Estimators (QEs) that determine the suitability of images for face detection. We also propose systematic evaluation protocols to showcase the efficacy of our novel face detection QEs and existing face recognition QEs.</li></ol>

Zoetrope – Interactive Feature Exploration in News Videos

Liebl, Bernhard, Burghardt, Manuel 11 July 2024 (has links)
No description available.
