Global ETD Search

21	Learning Structured Representations for Understanding Visual and Multimedia Data Zareian, Alireza January 2021 Recent advances in Deep Learning (DL) have achieved impressive performance in a variety of Computer Vision (CV) tasks, leading to an exciting wave of academic and industrial efforts to develop Artificial Intelligence (AI) facilities for every aspect of human life. Nevertheless, there are inherent limitations in the understanding ability of DL models, which limit the potential of AI in real-world applications, especially in the face of complex, multimedia input. Despite tremendous progress in solving basic CV tasks, such as object detection and action recognition, state-of-the-art CV models can merely extract a partial summary of visual content, which lacks a comprehensive understanding of what happens in the scene. This is partly due to the oversimplified definition of CV tasks, which often ignore the compositional nature of semantics and scene structure. It is even less studied how to understand the content of multiple modalities, which requires processing visual and textual information in a holistic and coordinated manner, and extracting interconnected structures despite the semantic gap between the two modalities. In this thesis, we argue that a key to improving the understanding capacity of DL models in visual and multimedia domains is to use structured, graph-based representations, to extract and convey semantic information more comprehensively. To this end, we explore a variety of ideas to define more realistic DL tasks in both visual and multimedia domains, and propose novel methods to solve those tasks by addressing several fundamental challenges, such as weak supervision, discovery and incorporation of commonsense knowledge, and scaling up vocabulary. 
More specifically, inspired by the rich literature of semantic graphs in Natural Language Processing (NLP), we explore innovative scene understanding tasks and methods that describe images using semantic graphs, which reflect the scene structure and interactions between objects. In the first part of this thesis, we present progress towards such graph-based scene understanding solutions, which are more accurate, need less supervision, and have more human-like common sense compared to the state of the art. In the second part of this thesis, we extend our results on graph-based scene understanding to the multimedia domain, by incorporating the recent advances in NLP and CV, and developing a new task and method from the ground up, specialized for joint information extraction in the multimedia domain. We address the inherent semantic gap between visual content and text by creating high-level graph-based representations of images, and developing a multitask learning framework to establish a common, structured semantic space for representing both modalities. In the third part of this thesis, we explore another extension of our scene understanding methodology, to open-vocabulary settings, in order to make scene understanding methods more scalable and versatile. We develop visually grounded language models that use naturally supervised data to learn the meaning of all words, and transfer that knowledge to CV tasks such as object detection with little supervision. Collectively, the proposed solutions and empirical results set a new state of the art for the semantic comprehension of visual and multimedia content in a structured way, in terms of accuracy, efficiency, scalability, and robustness. Artificial intelligence Computer vision Computer vision--Mathematical models Semantics
22	Development of an algorithmic method for the recognition of biological objects Bernier, Thomas. January 1997 No description available. Computer vision -- Mathematical models. Computer algorithms.
23	A General Framework for Model Adaptation to Meet Practical Constraints in Computer Vision Huang, Shiyuan January 2024 Recent advances in deep learning models have shown impressive capabilities in various computer vision tasks, which encourages the integration of these models into real-world vision systems such as smart devices. This integration presents new challenges as models need to meet complex real-world requirements. This thesis is dedicated to building practical deep learning models, where we focus on two main challenges in vision systems: data efficiency and variability. We address these issues by providing a general model adaptation framework that extends models with practical capabilities. In the first part of the thesis, we explore model adaptation approaches for efficient representation. We illustrate the benefits of different types of efficient data representations, including compressed video modalities from video codecs, low-bit features and sparsified frames and texts. By using such efficient representations, system costs such as data storage, processing and computation can be greatly reduced. We systematically study various methods to extract, learn and utilize these representations, presenting new methods to adapt machine learning models for them. The proposed methods include a compressed-domain video recognition model with a coarse-to-fine distillation training strategy, a task-specific feature compression framework for low-bit video-and-language understanding, and a learnable token sparsification approach for sparsifying human-interpretable video inputs. We demonstrate new perspectives on representing vision data in a more practical and efficient way in various applications. The second part of the thesis focuses on open environment challenges, where we explore model adaptation for new, unseen classes and domains. 
We examine the practical limitations in current recognition models, and introduce various methods to empower models in addressing open recognition scenarios. This includes a negative envisioning framework for managing new classes and outliers, and a multi-domain translation approach for dealing with unseen domain data. Our study shows a promising trajectory towards models exhibiting the capability to navigate through diverse data environments in real-world applications. Computer science Deep learning (Machine learning) Computer vision--Mathematical models Machine learning--Mathematical models Video compression
24	Temporal encoding in the visual system Almagor, Maier January 1977 A new model for temporal and spatial encoding in the visual system is developed and presented. The model indicates that spatial information is encoded in a manner similar to the encoding of temporal information. Experimental evidence related to this model is presented and analyzed. The temporal part of the model has been further developed. The model is based on two integrators in series with a temporal differentiator. The outputs of a varying number of similar, surrounding parallel cells can be pooled together in spatial integration. The length of the integration time as well as the number of cells spatially pooled together are controlled by the amount of spatial and temporal integrated light falling in and around every point on the retina. Three series of experiments were conducted to validate the model. The experiments used (1) a TV display of random, dynamic noise and (2) a specially developed stimulus generator which is able to produce very large homogeneous visual fields which can be easily modulated to reproduce a large variety of temporal waveforms having a rise time longer than 1 msec. The obtained results support the proposed model. The principal findings are: (1) Time integration of the eye is locally controlled and set across the retina and has very fast dynamics. (2) The obtained CFF curves suggest a correlation between the frequency at which maximum sensitivity is obtained and the sensitivity itself. (3) As predicted by the model, temporal bands are developed in the visual system for stimuli showing temporal discontinuity points. The width of the temporal bands was measured and a strong correlation was found between the temporal band width and the integration time. The width of the temporal bands is a function of the luminance level at which they are produced; it is not dependent on the stimulus slope. 
The apparent brightness of the temporal band is, however, dependent on the slope of the stimulus. The present findings about the temporal and spatial integration of the eye-brain system suggest that they work as a fast adaptation mechanism and that they play a central role in visual perception, explaining homogeneously such disparate phenomena as the spatial Mach bands and their asymmetry, the Broca-Sulzer effect, and backward masking. Suggestions about further research are offered. / Ph. D. LD5655.V856 1977.A45 Visual perception Vision -- Mathematical models Vision -- Research
25	Bayesian and information-theoretic tools for neuroscience Endres, Dominik M. January 2006 The overarching purpose of the studies presented in this report is the exploration of the uses of information theory and Bayesian inference applied to neural codes. Two approaches were taken. First, starting from first principles, a coding mechanism is proposed and its results are compared to a biological neural code. Second, tools from information theory are used to measure the information contained in a biological neural code. Chapter 3: The REC model proposed by Harpur and Prager codes inputs into a sparse, factorial representation, maintaining reconstruction accuracy. Here I propose a modification of the REC model to determine the optimal network dimensionality. The resulting code for unfiltered natural images is accurate and highly sparse, and a large fraction of the code elements show localized features. Furthermore, I propose an activation algorithm for the network that is faster and more accurate than a gradient descent based activation method. Moreover, it is demonstrated that asymmetric noise promotes sparseness. Chapter 4: A fast, exact alternative to Bayesian classification is introduced. Computational time is quadratic in both the number of observed data points and the number of degrees of freedom of the underlying model. As an example application, responses of single neurons from high-level visual cortex (area STSa) to rapid sequences of complex visual stimuli are analyzed. Chapter 5: I present an exact Bayesian treatment of a simple, yet sufficiently general probability distribution model. The model complexity, exact values of the expectations of entropies and their variances can be computed with polynomial effort given the data. The expectation of the mutual information thus becomes available as well, along with a strict upper bound on its variance. The resulting algorithm is first tested on artificial data. 
To that end, an information theoretic similarity measure is derived. Second, the algorithm is demonstrated to be useful in neuroscience by studying the information content of the neural responses analyzed in the previous chapter. It is shown that the information throughput of STS neurons is maximized for stimulus durations of approx. 60ms. 610.21
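26	Calculating degenerate structures via convex optimization with applications in computer vision and pattern recognition. / CUHK electronic theses & dissertations collection January 2012 In many computer vision and pattern recognition problems, the captured image and video data are usually high-dimensional. Computing directly on such high-dimensional data often runs into difficulties with computational feasibility and stability. However, real-world data are usually generated by a small number of physical factors and therefore intrinsically possess degenerate structures. For example, they can be described by models such as a subspace, a union of subspaces, a manifold, or a stratified manifold. Computing and exploiting these intrinsic degenerate structures not only helps in understanding the nature of the problem in depth, but also helps solve difficulties in practical applications. / With the development of convex optimization theory and applications in recent years, some NP-hard problems such as low-rank matrix computation and sparse representation now have nearly perfect and efficient solution methods. This thesis studies how to apply these techniques to compute degenerate structures in high-dimensional data, focusing on two structures, subspaces and unions of subspaces, and on their significance in real applications. These applications include face image alignment, background subtraction, and automatic plant identification. / In the face image alignment problem, facial images of the same face under different illumination should, after pixel-wise alignment, lie in a low-dimensional subspace. Based on this assumption, we propose a new image alignment method that jointly aligns multiple images of an unknown face under different illumination, expressions and poses, so that the pixels of each facial image match a pre-trained generic face model. The basic idea is to pursue an affine subspace that is low-dimensional and lies near the generic face subspace. In contrast to conventional appearance-model-based alignment methods (such as the active appearance model), which depend on an accurate appearance model, the proposed method needs only a generic face model to jointly align multiple images of the unknown face well, even when that face differs greatly from the samples used to train the model. Experimental results show that the alignment accuracy of the method in some cases approaches the ideal case, namely the accuracy that conventional methods can achieve when a model of the target face is known in advance. / In a wide range of computer vision and pattern recognition problems, the captured images and videos often live in high-dimensional observation spaces. Directly computing them may suffer from computational infeasibility and numerical instability. On the other hand, the data in the real world are often generated by a limited number of physical causes, and thus embed degenerate structures by nature. For instance, they can be modeled by a low-dimensional subspace, a union of subspaces, a manifold or even a manifold stratification. Discovering and harnessing such intrinsic structures not only brings semantic insight into the problems at hand, but also provides critical information to overcome challenges encountered in practice. / Recent years have witnessed great development in both the theory and application of convex optimization. Efficient and elegant solutions have been found for NP-hard problems such as low-rank matrix recovery and sparse representation. In this thesis, we study the problem of discovering degenerate structures of high-dimensional inputs using these techniques. 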
In particular, we focus on low-dimensional subspaces and their unions, and address their application in overcoming the challenges encountered under three practical scenarios: face image alignment, background subtraction and automatic plant identification. / In facial image alignment, we propose a method that jointly brings multiple images of an unseen face into alignment with a pre-trained generic appearance model despite different poses, expressions and illumination conditions of the face in the images. The idea is to pursue an intrinsic affine subspace of the target face that is low-dimensional while at the same time lying close to the generic subspace. Compared with conventional appearance-based methods that rely on accurate appearance models, ours works well with only a generic one and performs much better on unseen faces even if they significantly differ from those used for training the generic model. The result is approximately as good as that in an idealistic case where a specific model for the target face is provided. / For background subtraction, we propose a background model that captures the changes caused by the background switching among a few configurations, such as traffic-light states. The background is modeled as a union of low-dimensional subspaces, each characterizing one configuration of the background, and the proposed algorithm automatically switches among them and identifies violating elements as foreground pixels. Moreover, we propose a robust learning approach that can work with foreground-present training samples at the background modeling stage: it builds a correct background model with outlying foreground pixels automatically pruned out. This is practically important when foreground-free training samples are difficult to obtain in scenarios such as traffic monitoring. / For automatic plant identification, we propose a novel and practical method that recognizes plants based on leaf shapes extracted from photographs. 
Different from existing studies that are mostly focused on simple leaves, the proposed method is designed to recognize both simple and compound leaves. The key to that is, instead of either measuring geometric features or matching shape features as in conventional methods, we describe leaves by counting on them the numbers of certain shape patterns. The patterns are learned in a way that they form a degenerate polytope (a special union of affine subspaces) in the feature space, and can simulate, to some extent, the "keys" used by botanists: each pattern reflects a common feature of several different species and all the patterns together can form a discriminative rule for recognition. Experiments conducted on a variety of datasets show that our algorithm significantly outperforms the state-of-the-art methods in terms of recognition accuracy, efficiency and storage, and thus shows good promise for practical use. / In conclusion, our studies show that: 1) visual data with semantic meaning are often not random; although they can be high-dimensional, they typically embed degenerate structures in the observation space. 2) With appropriate assumptions made and clever computational tools developed, these structures can be efficiently and stably calculated. 3) The employment of these intrinsic structures helps overcome practical challenges and is critical for computer vision and pattern recognition algorithms to achieve good performance. / Detailed summary in vernacular field only. 
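/ In the background subtraction problem, the background of a static scene under different illumination conditions can be described as a linear subspace. In practice, however, local and sudden changes of the background may violate this assumption, especially when the background switches among several states, for example when traffic lights switch among different combinations of states. To address this problem, this thesis proposes a new background model that describes the background as a union of subspaces, each corresponding to one background state. We cast background subtraction as a sparse approximation problem, so the algorithm can automatically switch among multiple states and successfully detect foreground objects. In addition, this thesis proposes a robust dictionary learning method. When training the background model, it can handle images containing foreground objects and automatically removes the foreground during training. This is a clear advantage in applications where complete background-only training samples are difficult to collect (such as traffic monitoring). / In the automatic plant species identification problem, this thesis proposes a new and effective method that recognizes and classifies plants by extracting and comparing the contours of their leaves. Unlike conventional methods based on measuring geometric features or matching shape features, we propose to represent a leaf by the number of occurrences of certain shape patterns on it. These patterns form a degenerate polytope structure (a special union of affine spaces) in the feature space, and can to some extent emulate the identification keys used in botany: each pattern reflects a feature shared by several different plants, such as a certain kind of margin, a certain shape, or a certain arrangement of leaflets, while all patterns combined form a highly discriminative classification rule. Tests of the algorithm on four databases show that the proposed method significantly improves on current mainstream methods in recognition accuracy as well as in efficiency and storage, and therefore has good applicability. / In summary, the series of studies we carried out shows that: (1) meaningful visual data are usually intrinsically correlated; although their dimensionality may be high, they usually possess some degenerate structure. (2) With reasonable assumptions and the use of computational tools, these structures can be discovered efficiently and robustly. (3) Exploiting these structures helps solve difficulties in practical applications and enables computer vision and pattern recognition algorithms to achieve good performance. / Zhao, Cong. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2012. / Includes bibliographical references (leaves 107-121). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese. 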
/ Dedication / Acknowledgements / Abstract / Abstract (in Chinese) / Publication List / Nomenclature / Contents / List of Figures / Chapter 1 --- Introduction / Chapter 1.1 --- Motivation / Chapter 1.2 --- Background / Chapter 1.2.1 --- Subspaces / Chapter 1.2.2 --- Unions of Subspaces / Chapter 1.2.3 --- Manifolds and Stratifications / Chapter 1.3 --- Thesis Outline / Chapter 2 --- Joint Face Image Alignment / Chapter 2.1 --- Introduction / Chapter 2.2 --- Related Works / Chapter 2.3 --- Background / Chapter 2.3.1 --- Active Appearance Model / Chapter 2.3.2 --- Multi-Image Alignment using AAM / Chapter 2.3.3 --- Limitations in Practice / Chapter 2.4 --- The Proposed Method / Chapter 2.4.1 --- Two Important Assumptions / Chapter 2.4.2 --- The Subspace Pursuit Problem / Chapter 2.4.3 --- Reformulation / Chapter 2.4.4 --- Efficient Solution / Chapter 2.4.5 --- Discussions / Chapter 2.5 --- Experiments / Chapter 2.5.1 --- Settings / Chapter 2.5.2 --- Results and Discussions / Chapter 2.6 --- Summary / Chapter 3 --- Background Subtraction / Chapter 3.1 --- Introduction / Chapter 3.2 --- Related Works / Chapter 3.3 --- The Proposed Method / Chapter 3.3.1 --- Background Modeling / Chapter 3.3.2 --- Background Subtraction / Chapter 3.3.3 --- Foreground Object Detection / Chapter 3.3.4 --- Background Modeling by Dictionary Learning / Chapter 3.4 --- Robust Dictionary Learning / Chapter 3.4.1 --- Robust Sparse Coding / Chapter 3.4.2 --- Robust Dictionary Update / Chapter 3.5 --- Experimentation / Chapter 3.5.1 --- Local and Sudden Changes / Chapter 3.5.2 --- Non-structured High-frequency Changes / Chapter 3.5.3 --- Discussions / Chapter 3.6 --- Summary / Chapter 4 --- Plant Identification using Leaves / Chapter 4.1 --- Introduction / Chapter 4.2 --- Related Works / Chapter 4.3 --- Review of IDSC Feature / Chapter 4.4 --- The Proposed Method / Chapter 4.4.1 --- Independent-IDSC Feature / Chapter 4.4.2 --- Common Shape Patterns / Chapter 4.4.3 --- Leaf Representation by Counts / Chapter 4.4.4 --- Leaf Recognition by NN Classifier / Chapter 4.5 --- Experiments / Chapter 4.5.1 --- Settings / Chapter 4.5.2 --- Performance / Chapter 4.5.3 --- Shared Dictionaries v.s. Shared Features / Chapter 4.5.4 --- Pooling / Chapter 4.6 --- Discussions / Chapter 4.6.1 --- Time Complexity / Chapter 4.6.2 --- Space Complexity / Chapter 4.6.3 --- System Description / Chapter 4.7 --- Summary / Chapter 4.8 --- Acknowledgement / Chapter 5 --- Conclusion and Future Work / Chapter 5.1 --- Thesis Contributions / Chapter 5.2 --- Future Work / Chapter 5.2.1 --- Theory Side / Chapter 5.2.2 --- Practice Side / Appendix-I --- Joint Face Alignment Results / Bibliography Computer vision--Mathematical models Pattern recognition systems Optical pattern recognition Mathematical optimization Convex functions Plants--Identification--Data processing
27	Feature based object rendering from sparse views. / CUHK electronic theses & dissertations collection January 2011 The first part of this thesis presents a convenient and flexible calibration method to estimate the relative rotation and translation among multiple cameras. A simple planar pattern is used for accurate calibration and is not required to be simultaneously observed by all cameras. Thus the method is especially suitable for widely spaced camera arrays. In order to fairly evaluate the calibration results for different camera setups, a novel accuracy metric is introduced based on the deflection angles of projection rays, which is insensitive to a number of setup factors. / The objective of this thesis is to develop a multiview system that can synthesize photorealistic novel views of the scene captured by sparse cameras distributed in a wide area. The system cost is largely reduced due to the small number of required cameras, and the image capture is greatly facilitated because the cameras are allowed to be widely spaced and flexibly placed. The key techniques to achieve this goal are investigated in this thesis. / Cui, Chunhui. / "November 2010." / Adviser: Ngan King Ngi. / Source: Dissertation Abstracts International, Volume: 73-04, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2011. / Includes bibliographical references (leaves 140-155). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese. Cameras--Calibration Computer vision--Mathematical models Image processing--Digital techniques Rendering (Computer graphics) Sparse matrices--Data processing
28	Interpretable Machine Learning and Sparse Coding for Computer Vision Landecker, Will 01 August 2014 (has links) Machine learning offers many powerful tools for prediction. One of these tools, the binary classifier, is often considered a black box. Although its predictions may be accurate, we might never know why the classifier made a particular prediction. In the first half of this dissertation, I review the state of the art of interpretable methods (methods for explaining why); after noting where the existing methods fall short, I propose a new method for a particular type of black box called additive networks. I offer a proof of trustworthiness for this new method (meaning a proof that my method does not "make up" the logic of the black box when generating an explanation), and verify that its explanations are sound empirically. Sparse coding is part of a family of methods that are believed, by many researchers, to not be black boxes. In the second half of this dissertation, I review sparse coding and its application to the binary classifier. Despite the fact that the goal of sparse coding is to reconstruct data (an entirely different goal than classification), many researchers note that it improves classification accuracy. I investigate this phenomenon, challenging a common assumption in the literature. I show empirically that sparse reconstruction is not necessarily the right intermediate goal, when our ultimate goal is classification. Along the way, I introduce a new sparse coding algorithm that outperforms competing, state-of-the-art algorithms for a variety of important tasks. Machine learning -- Mathematical models Computer vision -- Mathematical models Compressed sensing (Telecommunication) Artificial Intelligence and Robotics
29	Representations and matching techniques for 3D free-form object and face recognition Mian, Ajmal Saeed January 2007 [Truncated abstract] The aim of visual recognition is to identify objects in a scene and estimate their pose. Object recognition from 2D images is sensitive to illumination, pose, clutter and occlusions. Object recognition from range data on the other hand does not suffer from these limitations. An important paradigm of recognition is model-based whereby 3D models of objects are constructed offline and saved in a database, using a suitable representation. During online recognition, a similar representation of a scene is matched with the database for recognizing objects present in the scene . . . The tensor representation is extended to automatic and pose invariant 3D face recognition. As the face is a non-rigid object, expressions can significantly change its 3D shape. Therefore, the last part of this thesis investigates representations and matching techniques for automatic 3D face recognition which are robust to facial expressions. A number of novelties are proposed in this area along with their extensive experimental validation using the largest available 3D face database. These novelties include a region-based matching algorithm for 3D face recognition, a 2D and 3D multimodal hybrid face recognition algorithm, fully automatic 3D nose ridge detection, fully automatic normalization of 3D and 2D faces, a low cost rejection classifier based on a novel Spherical Face Representation, and finally, automatic segmentation of the expression insensitive regions of a face. Computer vision -- Mathematical models 3D object recognition 3D shape representation 3D face recognition 3D modeling
30	Image Based Attitude And Position Estimation Using Moment Functions Mukundan, R 07 1900 No description available. Computer Vision - Mathematical Models Image Processing - Mathematical Models Image Moments Moment Functions Global Image Features Moment Invariants Computer Science
