651

The analytic edge - image reconstruction from edge data via the Cauchy Integral

Hay, Todd 08 April 2016 (has links)
A novel image reconstruction algorithm from edges (image gradients) follows from the Sokhotski-Plemelj Theorem of complex analysis, an elaboration of the standard Cauchy (Singular) Integral. This algorithm demonstrates the application of Singular Integral Equation methods to image processing, extending the more common use of Partial Differential Equations (e.g. based on variants of the Diffusion or Poisson equations). The Cauchy Integral approach has a deep connection to and sheds light on the (linear and non-linear) diffusion equation, the retinex algorithm and energy-based image regularization. It extends the commonly understood local definition of an edge to a global, complex analytic structure - the analytic edge - the contrast-weighted kernel of the Cauchy Integral. Superposition of the set of analytic edges provides a "filled-in" image which is the piecewise analytic image corresponding to the edge (gradient) data supplied. This is a fully parallel operation which avoids the time penalty associated with iterative solutions and thus is compatible with the short time (about 150 milliseconds) that is biologically available for the brain to construct a perceptual image from edge data. Although this algorithm produces an exact reconstruction of a filled-in image from the gradients of that image, slight modifications of it produce images which correspond to perceptual reports of human observers when presented with a wide range of "visual contrast illusion" images.
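For reference, the Sokhotski-Plemelj theorem invoked above relates the boundary values of a Cauchy-type integral to a principal-value integral plus a jump term; a standard statement (notation chosen here for illustration, not taken from the thesis) is:

```latex
% Sokhotski-Plemelj formulas for a density \varphi on a contour \Gamma
\lim_{\varepsilon \to 0^{+}} \int_{\Gamma} \frac{\varphi(t)}{t - x \mp i\varepsilon}\,dt
  \;=\; \pm\, i\pi\,\varphi(x) \;+\; \mathrm{P.V.}\!\int_{\Gamma} \frac{\varphi(t)}{t - x}\,dt ,
\qquad
F(z) \;=\; \frac{1}{2\pi i}\int_{\Gamma} \frac{\varphi(t)}{t - z}\,dt ,
\qquad
F^{+}(x) - F^{-}(x) \;=\; \varphi(x).
```

The jump relation F+ - F- = φ is what allows a contrast-weighted edge density on a contour to determine a piecewise analytic "filled-in" image on either side of that contour.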
652

Gaze estimation with graphics

Wood, Erroll William January 2017 (has links)
Gaze estimation systems determine where someone is looking. Gaze is used for a wide range of applications including market research, usability studies, and gaze-based interfaces. Traditional gaze estimation equipment uses special hardware. To bring gaze estimation mainstream, researchers are exploring approaches that use commodity hardware alone. My work addresses two outstanding problems in this field: 1) it is hard to collect good ground truth eye images for machine learning, and 2) gaze estimation systems do not generalize well -- once they are trained with images from one scenario, they do not work in another scenario. In this dissertation I address these problems in two different ways: learning-by-synthesis and analysis-by-synthesis. Learning-by-synthesis is the process of training a machine learning system with synthetic data, i.e. data that has been rendered with graphics rather than collected by hand. Analysis-by-synthesis is a computer vision strategy that couples a generative model of image formation (synthesis) with a perceptive model of scene comparison (analysis). The goal is to synthesize an image that best matches an observed image. In this dissertation I present three main contributions. First, I present a new method for training gaze estimation systems that use machine learning: learning-by-synthesis using 3D head scans and photorealistic rendering. Second, I present a new morphable model of the eye region. I show how this model can be used to generate large amounts of varied data for learning-by-synthesis. Third, I present a new method for gaze estimation: analysis-by-synthesis. I demonstrate how analysis-by-synthesis can generalize to different scenarios, estimating gaze in a device- and person-independent manner.
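As a rough sketch of the analysis-by-synthesis loop described above (not the author's implementation; `render`, `features`, the parameter set and the finite-difference fitting are hypothetical placeholders), the idea of synthesizing the image that best matches an observation can be written as:

```python
import numpy as np

def analysis_by_synthesis(observed_img, render, features, init_params,
                          n_iters=200, step=1e-2, eps=1e-4):
    """Fit generative model parameters so the rendered image matches the observation.

    render(params)  -> synthetic image (hypothetical generative eye-region model)
    features(img)   -> feature vector used to compare images (hypothetical)
    """
    params = np.asarray(init_params, dtype=float)
    target = features(observed_img)

    def loss(p):
        # Discrepancy between the synthesized and observed image features.
        return np.sum((features(render(p)) - target) ** 2)

    for _ in range(n_iters):
        # Numerical gradient of the matching loss (central finite differences).
        grad = np.array([
            (loss(params + eps * e) - loss(params - eps * e)) / (2 * eps)
            for e in np.eye(len(params))
        ])
        params -= step * grad  # gradient descent on the synthesis parameters
    return params  # pose/gaze parameters are read off the fitted model
```

In practice the fitted parameter vector would include the eyeball orientation, from which the gaze direction is read directly.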
653

Semantic spaces for video analysis of behaviour

Xu, Xun January 2016 (has links)
There is ever-growing interest in the computer vision community in human behaviour analysis based on visual sensors. This interest generally includes: (1) behaviour recognition - given a video clip or a specific spatio-temporal volume of interest, classify it into one or more of a set of pre-defined categories; (2) behaviour retrieval - given a video or textual description as a query, search for video clips with related behaviour; (3) behaviour summarisation - given a number of video clips, summarise the representative and distinct behaviours. Although countless efforts have been dedicated to the problems mentioned above, few works have attempted to analyse human behaviours in a semantic space. In this thesis, we define semantic spaces as a collection of high-dimensional Euclidean spaces in which semantically meaningful events, e.g. individual words, phrases and visual events, can be represented as vectors or distributions, referred to as semantic representations. Within a semantic space, texts and visual events can be quantitatively compared by inner product, distance and divergence. The introduction of semantic spaces brings many benefits for visual analysis. For example, discovering semantic representations for visual data can facilitate semantically meaningful video summarisation, retrieval and anomaly detection. A semantic space can also seamlessly bridge categories and datasets that are conventionally treated as independent. This encourages the sharing of data and knowledge across categories, and even datasets, to improve recognition performance and reduce labelling effort. Moreover, a semantic space makes it possible to generalise a learned model beyond known classes, which is usually referred to as zero-shot learning. Nevertheless, discovering such a semantic space is non-trivial because (1) a semantic space is hard to define manually: humans have a good sense of the semantic relatedness between visual and textual instances, but a measurable and finite semantic space is difficult to construct with limited manual supervision, so we instead construct the semantic space from data in an unsupervised manner; and (2) it is hard to build a universal semantic space, i.e. the space is always context dependent, so it is important to build the semantic space on selected data such that it remains meaningful within that context. Even with a well-constructed semantic space, challenges remain, including (3) how to represent visual instances in the semantic space, and (4) how to mitigate the misalignment of visual feature and semantic spaces across categories and even datasets when knowledge/data are generalised. This thesis tackles the above challenges by exploiting data from different sources and building a contextual semantic space with which data and knowledge can be transferred and shared to facilitate general video behaviour analysis. To demonstrate the efficacy of semantic spaces for behaviour analysis, we focus on real-world problems including surveillance behaviour analysis, zero-shot human action recognition and zero-shot crowd behaviour recognition, with techniques specifically tailored to the nature of each problem. Firstly, for video surveillance scenes, we propose to discover semantic representations from the visual data in an unsupervised manner, owing to the large amount of unlabelled visual data available in surveillance systems.
By representing visual instances in the semantic space, data and annotations can be generalised to new events and even new surveillance scenes. Specifically, to detect abnormal events this thesis studies a geometrical alignment between semantic representations of events across scenes. Semantic actions can thus be transferred to new scenes and abnormal events can be detected in an unsupervised way. To model multiple surveillance scenes simultaneously, we show how to learn a shared semantic representation across a group of semantically related scenes through a multi-layer clustering of scenes. With multi-scene modelling we show how to improve surveillance tasks including scene activity profiling/understanding, cross-scene query-by-example, behaviour classification, and video summarisation. Secondly, to avoid extremely costly and ambiguous video annotation, we investigate how to generalise recognition models learned from known categories to novel ones, which is often termed zero-shot learning. To exploit the limited human supervision available, e.g. category names, we construct the semantic space via a word-vector representation trained on a large textual corpus in an unsupervised manner. The representation of a visual instance in the semantic space is obtained by learning a visual-to-semantic mapping. We notice that blindly applying the mapping learned from known categories to novel categories can introduce bias and deteriorate performance, a problem termed domain shift. To solve this problem we employ techniques including semi-supervised learning, self-training, hubness correction, multi-task learning and domain adaptation. In combination, these methods achieve state-of-the-art performance on the zero-shot human action recognition task. Finally, we study the possibility of re-using known and manually labelled semantic crowd attributes to recognise rare and unknown crowd behaviours, a task termed zero-shot crowd behaviour recognition. Crucially, we point out that, given the multi-labelled nature of semantic crowd attributes, zero-shot recognition can be improved by exploiting the co-occurrence between attributes. To summarise, this thesis studies methods for analysing video behaviours and demonstrates that exploring semantic spaces for video analysis is advantageous and, more importantly, enables multi-scene analysis and zero-shot learning beyond conventional learning strategies.
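As a minimal sketch of the zero-shot recipe above - learn a visual-to-semantic mapping on known classes, then match projected test features to word vectors of unseen class names - the following assumes pre-computed visual features and word vectors; ridge regression and cosine matching are illustrative stand-ins for the thesis' models:

```python
import numpy as np

def fit_visual_to_semantic(X_train, S_train, lam=1.0):
    """Ridge regression from visual features X (n x d) to word vectors S (n x k)."""
    d = X_train.shape[1]
    W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(d), X_train.T @ S_train)
    return W  # (d x k) visual-to-semantic mapping

def zero_shot_predict(X_test, W, class_vectors, class_names):
    """Assign each test sample to the unseen class whose word vector is closest."""
    S_pred = X_test @ W
    # Cosine similarity between predicted embeddings and unseen-class prototypes.
    S_pred = S_pred / (np.linalg.norm(S_pred, axis=1, keepdims=True) + 1e-12)
    C = class_vectors / (np.linalg.norm(class_vectors, axis=1, keepdims=True) + 1e-12)
    idx = np.argmax(S_pred @ C.T, axis=1)
    return [class_names[i] for i in idx]
```

The domain-shift corrections discussed above (self-training, hubness correction, etc.) would then adjust either the mapping or the matching step.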
654

Computer vision-based tracking and feature extraction for lingual ultrasound

Al-Hammuri, Khalid 30 April 2019 (has links)
Lingual ultrasound is emerging as an important tool for providing visual feedback to second-language learners. In this study, ultrasound videos were recorded in the sagittal plane, as it provides an image of the full tongue surface in one scan, unlike the transverse plane, which captures information for only a small portion of the tongue in a single scan. The data were collected from five Arabic speakers as they pronounced fourteen Arabic sounds in three different vowel contexts. The sounds were repeated three times to form 630 ultrasound videos. The thesis algorithm is characterized by four steps: first, denoising the ultrasound image using a combined curvelet transform and shock filter; second, automatic selection of the tongue contour area; third, tongue contour approximation and missing-data estimation; fourth, tongue contour transformation from image space to a full concatenated signal, and feature extraction. The automatic tongue tracking results were validated by measuring the mean sum of distances between automatic and manual tongue contour tracking, giving an accuracy of 0.9558 mm. The validation of the feature extraction showed that the average mean squared error between the extracted tongue signatures for different sound repetitions was 0.000858 mm, which means that the algorithm could extract a unique signature for each sound, across different vowel contexts, with a high degree of similarity. Unlike other related works, the algorithm provides an efficient and robust approach that can extract the tongue contour and the significant features of the dynamic tongue movement over the full set of video frames, not just over a single significant static video frame as in the conventional method. The algorithm does not need any training data and has no limitation on video size or frame count. The algorithm did not fail during tongue extraction and did not need any manual re-initialization. Even when the ultrasound recordings missed some tongue contour information, the approach could estimate the missing data with a high degree of accuracy. The approach is useful because it can help linguistic researchers replace manual tongue tracking with automated tracking to save time, and then extract dynamic features of the full speech behavior, giving a better understanding of tongue movement during speech and supporting the development of a language learning tool for second-language learners. / Graduate
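The contour-tracking validation quoted above (mean sum of distances between automatic and manual contours) can be computed along these lines; the symmetric nearest-neighbour form used here is an assumption, since the abstract does not spell out the exact formula:

```python
import numpy as np

def mean_sum_of_distances(auto_pts, manual_pts):
    """Mean nearest-neighbour distance between two tongue contours.

    auto_pts, manual_pts: (n, 2) and (m, 2) arrays of (x, y) points in mm.
    Returns the symmetric mean distance in mm.
    """
    auto_pts = np.asarray(auto_pts, dtype=float)
    manual_pts = np.asarray(manual_pts, dtype=float)
    # Pairwise distances between every automatic and every manual point.
    d = np.linalg.norm(auto_pts[:, None, :] - manual_pts[None, :, :], axis=2)
    a_to_m = d.min(axis=1).mean()  # each automatic point -> nearest manual point
    m_to_a = d.min(axis=0).mean()  # each manual point -> nearest automatic point
    return 0.5 * (a_to_m + m_to_a)
```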
655

Geometric modeling with primitives

Angles, Baptiste 29 April 2019 (has links)
Both man-made and natural objects contain repeated geometric elements that can be interpreted as primitive shapes. Plants, trees, living organisms and even crystals showcase primitives that repeat themselves. Primitives are also commonly found in man-made environments because architects tend to reuse the same patterns across a building and typically employ simple shapes, such as rectangular windows and doors. During my PhD I studied geometric primitives from three points of view: their composition, simulation and autonomous discovery. In the first part I present a method to reverse-engineer the function by which some primitives are combined. Our system is based on a composition function template that is represented by a parametric surface. The parametric surface is deformed via a non-rigid alignment and, once converged, represents the desired operator. This enables the interactive modeling of operators via a simple sketch, solving a major usability gap of composition modeling. In the second part I introduce a novel primitive for real-time physics simulations. This primitive can efficiently model volume-preserving deformations of rods as well as more complex structures such as muscles. One of the core advantages of our approach is that our primitive can serve as a unified representation for collision detection, simulation, and surface skinning. In the third part I present an unsupervised deep learning framework to learn and detect primitives. In a signal containing a repetition of elements, the method is able to automatically identify the structure of these elements (i.e. primitives) with minimal supervision. Because the network contains a non-differentiable operation, a novel multi-step training process is presented. / Graduate
656

Pattern Mining and Concept Discovery for Multimodal Content Analysis

Li, Hongzhi January 2016 (has links)
With recent advances in computer vision, researchers have been able to demonstrate impressive, near-human-level performance in difficult tasks such as image recognition. For example, for images taken under typical conditions, computer vision systems now have the ability to recognize whether a dog, cat, or car appears in an image. These advances are made possible by utilizing massive image datasets and label annotations, which include category labels and sometimes bounding boxes around the objects of interest within the image. However, one major limitation of current solutions is that when users apply recognition models to new domains, they need to manually define the target classes and label the training data in order to prepare the labeled annotations required for training the recognition models. Manually identifying the target classes and constructing the concept ontology for a new domain are time-consuming tasks, as they require the users to be familiar with the content of the image collection, and the manual process of defining target classes is difficult to scale up to a large number of classes. In addition, there has been significant interest in developing knowledge bases to improve content analysis and information retrieval. A knowledge base is an object model (ontology) with classes, subclasses, attributes, instances, and relations among them. The knowledge base generation problem is to identify the (sub)classes and their structured relations for a given domain of interest. Similar to ontology construction, a knowledge base is usually generated manually by human experts, which is a time-consuming and difficult task. Thus, it is important to find a way to discover the semantic concepts and their structural relations that matter for a target data collection or domain of interest, so that we can construct an ontology or knowledge base for visual data or multimodal content automatically or semi-automatically. Visual patterns are the discriminative and representative image content found in objects or local image regions seen in an image collection. Visual patterns can also be used to summarize the major visual concepts in an image collection. Therefore, automatic discovery of visual patterns can help users understand the content and structure of a data collection and, in turn, help users construct the ontology and knowledge base mentioned earlier. In this dissertation, we aim to answer the following question: given a new target domain and associated data corpora, how do we rapidly discover nameable content patterns that are semantically coherent, visually consistent, and can be automatically named with semantic concepts related to the events of interest in the target domain? We develop pattern discovery methods that focus on visual content as well as multimodal data including text and visual content. Traditional visual pattern mining methods focus only on analysis of the visual content and do not have the ability to automatically name the patterns. To address this, we propose a new multimodal visual pattern mining and naming method that specifically addresses this shortcoming. The named visual patterns can be used as discovered semantic concepts relevant to the target data corpora. By combining information from multiple modalities, we can ensure that the discovered patterns are not only visually similar, but also have consistent meaning.
The capability of accurately naming the visual patterns is also important for finding relevant classes or attributes in the knowledge base construction process mentioned earlier. Our framework contains a visual model and a text model to jointly represent the text and visual content. We use the joint multimodal representation and the association rule mining technique to discover semantically coherent and visually consistent visual patterns. To discover better visual patterns, we further improve the visual model in the multimodal visual pattern mining pipeline by developing a convolutional neural network (CNN) architecture that allows for the discovery of scale-invariant patterns. In this dissertation, we use news as an example domain and image-caption pairs as example multimodal corpora to demonstrate the effectiveness of the proposed methods. However, the overall framework is general and can be easily extended to other domains. The problem of concept discovery is made more challenging if the target application domain involves fine-grained object categories (e.g., highly related dog categories or consumer product categories). In such cases, the content of different classes can be quite similar, making automatic separation of classes difficult. In the proposed multimodal pattern mining framework, the representation models for visual and text data play an important role, as they shape the pool of candidates that are fed to the pattern mining process. General models, like the CNN models trained on ImageNet, though shown to be generalizable to various domains, are unable to capture the small differences in fine-grained datasets. To address this problem, we propose a new representation model that uses an end-to-end artificial neural network architecture to discover visual patterns. This model can be fine-tuned on a fine-grained dataset so that its convolutional layers are optimized to capture the features and patterns of the fine-grained images, giving it the ability to discover visual patterns from fine-grained image datasets. Finally, to demonstrate the advantage of the proposed multimodal visual pattern mining and naming framework, we apply the proposed technique to two applications. In the first application, we use the visual pattern mining technique to find visual anchors to summarize video news events. In the second application, we use the visual patterns as important cues to link video news events to social media events. The contributions of this dissertation can be summarized as follows: (1) We develop a novel multimodal mining framework for discovering visual patterns and nameable concepts from a collection of multimodal data and automatically naming the discovered patterns, producing a large pool of semantic concepts specifically relevant to a high-level event. The framework combines a visual representation based on CNNs and a text representation based on embeddings. The named visual patterns can be used to construct the event schema needed in the knowledge base construction process. (2) We propose a scale-invariant visual pattern mining model to improve the multimodal visual pattern mining framework. The improved visual model leads to better overall performance in discovering and naming concepts.
To localize the visual patterns discovered in this framework, we propose a deconvolutional neural network model that localizes the visual patterns within the image. (3) To learn directly from data in the target domain, we propose a novel end-to-end neural network architecture called PatternNet for finding high-quality visual patterns even for datasets that consist of fine-grained classes. (4) We demonstrate visual pattern mining in two novel applications: video news event summarization and video news event linking.
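A much-simplified sketch of the multimodal pattern mining and naming step: each image-caption pair becomes a "transaction" pairing a visual cluster id with its caption words, and word-cluster associations with sufficient support and confidence are kept as candidate names. The clustering granularity and thresholds below are illustrative assumptions, not the dissertation's actual pipeline:

```python
from collections import Counter

def mine_named_patterns(visual_cluster_ids, captions, min_support=5, min_conf=0.3):
    """Associate visual clusters with caption words that co-occur often.

    visual_cluster_ids: one cluster id per image (e.g. from k-means on CNN features)
    captions:           one list of caption tokens per image
    Returns {cluster_id: [words]} as candidate names for each visual pattern.
    """
    cluster_count = Counter(visual_cluster_ids)
    pair_count = Counter()
    for cid, words in zip(visual_cluster_ids, captions):
        for w in set(words):
            pair_count[(cid, w)] += 1

    names = {}
    for (cid, w), n in pair_count.items():
        support = n                              # images where cluster and word co-occur
        confidence = n / cluster_count[cid]      # P(word | visual cluster)
        if support >= min_support and confidence >= min_conf:
            names.setdefault(cid, []).append(w)
    return names
```

In the dissertation's framework this co-occurrence counting is replaced by association rule mining over joint CNN and word-embedding representations, but the intuition - a pattern is kept only if it is both visually consistent and consistently described - is the same.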
657

Autonomous robotic vehicle localization system based on computer vision through distinctive features

Couto, Leandro Nogueira 18 May 2012 (has links)
Integration of Computer Vision and Mobile Robotics systems is a field of great research interest. This work demonstrates a method of global localization for Autonomous Mobile Robots based on the creation of a visual memory map, through detection and description of reference points from captured images using the SURF method, associated with odometry data, in an indoor environment. The proposed procedure, coupled with specific knowledge of the environment, allows localization to be achieved at a later stage through pairing of these memorized features with the scene being observed by the robot in real time. Experiments are conducted to show the effectiveness of the method for the localization of mobile robots in indoor environments. Improvements aimed at difficult situations such as traversing doors are presented. Results are analyzed, and navigation alternatives and possible future refinements are discussed.
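A minimal sketch of the memory-matching step described above, using OpenCV's SURF implementation (available via opencv-contrib-python); the memory layout and thresholds are illustrative assumptions, not the thesis' actual system:

```python
import cv2

def best_memory_match(current_gray, memory):
    """Match the current camera frame against memorized keyframes.

    memory: list of (frame_id, keypoints, descriptors) built offline with the same detector,
            each keyframe associated with odometry data elsewhere.
    Returns the id of the memorized frame with the most good SURF matches (or None).
    """
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    _, des = surf.detectAndCompute(current_gray, None)
    if des is None:
        return None

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    best_id, best_score = None, 0
    for frame_id, _, mem_des in memory:
        matches = matcher.knnMatch(des, mem_des, k=2)
        # Lowe's ratio test to keep only distinctive correspondences.
        good = [p[0] for p in matches if len(p) == 2 and p[0].distance < 0.7 * p[1].distance]
        if len(good) > best_score:
            best_id, best_score = frame_id, len(good)
    return best_id
```

The localization estimate would then be read from the odometry data stored with the best-matching keyframe.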
658

3D Object Understanding from RGB-D Data

Feng, Jie January 2017 (has links)
Understanding 3D objects and being able to interact with them in the physical world are essential for building intelligent computer vision systems. This has tremendous potential for applications ranging from augmented reality and 3D printing to robotics. While it might seem simple for humans to look at and make sense of the visual world, it is a complicated process for machines to accomplish similar tasks. Generally, such a system involves a series of processes: identify and segment a target object, estimate its 3D shape, and predict its pose in an open scene where the target objects may not have been seen before. Although considerable research has been devoted to these problems, they remain very challenging due to a few key issues: 1) most methods rely solely on color images for interpreting the 3D properties of an object; 2) large sets of labeled color images are expensive to obtain for tasks like pose estimation, limiting the ability to train powerful prediction models; 3) training data for the target object is typically required for 3D shape estimation and pose prediction, making these methods hard to scale and generalize to unseen objects. Recently, several technological changes have created interesting opportunities for solving these fundamental vision problems. First, low-cost depth sensors have become widely available, providing an additional sensory input in the form of a depth map, which is very useful for extracting 3D information about the object and scene. Second, with the ease of 3D object scanning with depth sensors and open access to large-scale 3D model databases like 3D Warehouse and ShapeNet, it is possible to leverage such data to build powerful learning models. Third, machine learning algorithms like deep learning have become so powerful that they start to surpass state-of-the-art or even human performance on challenging tasks like object recognition. It is now feasible to learn rich information from large datasets in a single model. The objective of this thesis is to leverage such emerging tools and data to solve the above-mentioned challenging problems of understanding 3D objects from a new perspective, by designing machine learning algorithms that utilize RGB-D data. Instead of depending solely on color images, we combine color and depth images to achieve significantly higher performance for object segmentation. We use a large collection of 3D object models to provide high-quality training data and retrieve visually similar 3D CAD models from low-quality captured depth images, which enables knowledge transfer from database objects to the target object in an observed scene. By using content-based 3D shape retrieval, we also significantly improve pose estimation via similar proxy models, without the need to create the exact 3D model as a reference.
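The content-based CAD retrieval step described above can be sketched as a nearest-neighbour search over features of depth views rendered from the model database; the feature extractor and database layout are placeholders, not the thesis' actual components:

```python
import numpy as np

def retrieve_similar_cad_models(query_depth_feat, db_feats, db_model_ids, k=5):
    """Return ids of the k CAD models whose rendered-depth-view features are
    closest (cosine similarity) to the feature of the observed depth image.

    query_depth_feat: (d,) feature of the captured depth image
    db_feats:         (n, d) features of depth views rendered from CAD models
    db_model_ids:     length-n list mapping each view to its CAD model id
    """
    q = query_depth_feat / (np.linalg.norm(query_depth_feat) + 1e-12)
    D = db_feats / (np.linalg.norm(db_feats, axis=1, keepdims=True) + 1e-12)
    order = np.argsort(-(D @ q))
    # Keep the best-scoring view per model, preserving ranking order.
    seen, ranked = set(), []
    for i in order:
        mid = db_model_ids[i]
        if mid not in seen:
            seen.add(mid)
            ranked.append(mid)
        if len(ranked) == k:
            break
    return ranked
```

The retrieved proxy models can then stand in for the exact object during shape estimation and pose refinement, as described in the abstract.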
659

Deep Learning for Action Understanding in Video

Shou, Zheng January 2019 (has links)
Action understanding is key to automatically analyzing video content and thus is important for many real-world applications such as autonomous driving, robot-assisted care, etc. Therefore, in the computer vision field, action understanding has been one of the fundamental research topics. Most conventional methods for action understanding are based on hand-crafted features. As with the recent advances seen in image classification, object detection, image captioning, etc., deep learning has become a popular approach for action understanding in video. However, there remain several important research challenges in developing deep learning based methods for understanding actions. This thesis focuses on the development of effective deep learning methods for solving three major challenges. Action detection at fine granularities in time: Previous work in deep learning based action understanding mainly focuses on exploring various backbone networks that are designed for the video-level action classification task. These do not explore fine-grained temporal characteristics and thus fail to produce temporally precise estimates of action boundaries. In order to understand actions more comprehensively, it is important to detect actions at finer granularities in time. In Part I, we study both segment-level action detection and frame-level action detection. Segment-level action detection is usually formulated as the temporal action localization task, which requires not only recognizing action categories for the whole video but also localizing the start time and end time of each action instance. To this end, we propose an effective multi-stage framework called Segment-CNN consisting of three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments that may contain actions; (2) a classification network learns a one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes the learned classification network to localize each action instance. In another approach, frame-level action detection is effectively formulated as the per-frame action labeling task. We combine two reverse operations (i.e. convolution and deconvolution) into a joint Convolutional-De-Convolutional (CDC) filter, which simultaneously conducts downsampling in space and upsampling in time to jointly model both high-level semantics and temporal dynamics. We design a novel CDC network to predict actions at frame level, and the frame-level predictions can be further used to detect precise segment boundaries for the temporal action localization task. Our method not only improves the state-of-the-art mean Average Precision (mAP) result on THUMOS'14 from 41.3% to 44.4% for the per-frame labeling task, but also improves mAP for the temporal action localization task from 19.0% to 23.3% on THUMOS'14 and from 16.4% to 23.8% on ActivityNet v1.3. Action detection in constrained scenarios: The usual training process of deep learning models requires supervision and data, which are not always available in reality. In Part II, we consider the scenarios of incomplete supervision and incomplete data. For incomplete supervision, we focus on the weakly-supervised temporal action localization task and propose AutoLoc, the first framework that can directly predict the temporal boundary of each action instance with only video-level annotations available during training.
To enable the training of such a boundary prediction model, we design a novel Outer-Inner-Contrastive (OIC) loss to help discover segment-level supervision, and we prove that the OIC loss is differentiable with respect to the underlying boundary prediction model. Our method significantly improves mAP on THUMOS'14 from 13.7% to 21.2% and mAP on ActivityNet from 7.4% to 27.3%. For the scenario of incomplete data, we formulate a novel task called Online Detection of Action Start (ODAS) in streaming videos, to enable detecting the action start time on the fly when a live video action is just starting. ODAS is important in many applications such as early alert generation to allow timely security or emergency response. Specifically, we propose three novel methods to address the challenges in training ODAS models: (1) hard negative sample generation based on a Generative Adversarial Network (GAN) to distinguish ambiguous background, (2) explicitly modeling the temporal consistency between data around the action start and data succeeding the action start, and (3) an adaptive sampling strategy to handle the scarcity of training data. Action understanding in the compressed domain: Mainstream action understanding methods, including the aforementioned techniques developed by us, require first decoding the compressed video into RGB image frames. This may result in significant cost in terms of storage and computation. Recently, researchers have started to investigate how to perform action understanding directly in the compressed domain in order to achieve high efficiency while maintaining state-of-the-art action detection accuracy. The key research challenge is developing effective backbone networks that can directly take data in the compressed domain as input. Our baseline is to take models developed for action understanding in the decoded domain and adapt them to attack the same tasks in the compressed domain. In Part III, we address two important issues in developing backbone networks that operate exclusively in the compressed domain. First, compressed videos may be produced by different encoders or encoding parameters, but it is impractical to train a different compressed-domain action understanding model for each format. We experimentally analyze the effect of video encoder variation and develop a simple yet effective training data preparation method to alleviate the sensitivity to encoder variation. Second, motion cues have been shown to be important for action understanding, but the motion vectors in compressed video are often very noisy and not discriminative enough for directly performing accurate action understanding. We develop a novel and highly efficient framework called DMC-Net that can learn to predict discriminative motion cues based on noisy motion vectors and residual errors in the compressed video streams. On three action recognition benchmarks, namely HMDB-51, UCF101 and a subset of Kinetics, we demonstrate that our DMC-Net can significantly shorten the performance gap between state-of-the-art compressed-video-based methods with and without optical flow, while being two orders of magnitude faster than the methods that use optical flow. By addressing the three major challenges mentioned above, we are able to develop more robust models for video action understanding and improve performance in various dimensions, such as (1) temporal precision, (2) required levels of supervision, (3) live video analysis ability, and finally (4) efficiency in processing compressed video.
Our research has contributed significantly to advancing the state of the art of video action understanding and expanding the foundation for comprehensive semantic understanding of video content.
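As a rough illustration of the Outer-Inner-Contrastive idea mentioned above - contrasting activations just outside a predicted segment against those inside - the following numpy sketch conveys the intuition only; the inflation ratio is an assumed hyper-parameter, and unlike this sketch the actual AutoLoc loss is differentiable with respect to the predicted boundaries:

```python
import numpy as np

def oic_style_loss(scores, start, end, inflation=0.25):
    """Outer-Inner-Contrastive-style loss for one predicted temporal segment (sketch).

    scores: (T,) per-snippet class activation scores for one action class.
    start, end: predicted segment boundaries as snippet indices, start < end.
    Lower is better: high activation inside the segment, low activation in the
    inflated 'outer' margins flanking it.
    """
    scores = np.asarray(scores, dtype=float)
    T = len(scores)
    margin = max(1, int(round(inflation * (end - start))))  # assumed inflation ratio
    outer_lo, outer_hi = max(0, start - margin), min(T, end + margin)

    inner = scores[start:end]
    outer = np.concatenate([scores[outer_lo:start], scores[end:outer_hi]])
    outer_mean = outer.mean() if outer.size else 0.0
    return outer_mean - inner.mean()
```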
660

Orientation and recognition of both noisy and partially occluded 3-D objects from single 2-D images

Illing, Diane Patricia January 1990 (has links)
This work is concerned with the problem of 3-D object recognition and orientation determination from single 2-D image frames in which objects may be noisy, partially occluded or both. Global descriptors of shape such as moments and Fourier descriptors rely on the whole shape being present. If part of a shape is missing then all of the descriptors will be affected. Consequently, such approaches are not suitable when objects are partially occluded, as results presented here show. Local methods of describing shape, where distortion of part of the object affects only the descriptors associated with that particular region, and nowhere else, are more likely to provide a successful solution to the problem. One such method is to locate points of maximum curvature on object boundaries. These are commonly believed to be the most perceptually significant points on digital curves. However, results presented in this thesis will show that estimators of point curvature become highly unreliable in the presence of noise. Rather than attempting to locate such high curvature points directly, an approach is presented which searches for boundary segments which exhibit significant linearity; curvature discontinuities are then assigned to the junctions between boundary segments. The resulting object descriptions are more stable in the presence of noise. Object orientation and recognition is achieved through a directed search and comparison to a database of similar 2-D model descriptions stored at various object orientations. Each comparison of sensed and model data is realised through a 2-D pose-clustering procedure, solving for the coordinate transformation which maps model features onto image features. Object features are used both to control the amount of computation and to direct the search of the database. In conditions of noise and occlusion objects can be recognised and their orientation determined to within less than 7 degrees of arc, on average.
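The 2-D pose-clustering step can be illustrated with a brute-force sketch: every pairing of two model junction points with two image junction points votes for a candidate similarity transform, and votes accumulate in a coarse 4-D grid whose most popular bin gives the consistent pose. The bin sizes and the two-point transform estimate below are illustrative assumptions, not the thesis' exact procedure:

```python
import math
from collections import Counter
from itertools import combinations

def similarity_from_two_pairs(m1, m2, i1, i2):
    """Rotation, scale and translation mapping model points (m1, m2) onto image points (i1, i2)."""
    dmx, dmy = m2[0] - m1[0], m2[1] - m1[1]
    dix, diy = i2[0] - i1[0], i2[1] - i1[1]
    theta = math.atan2(diy, dix) - math.atan2(dmy, dmx)
    scale = math.hypot(dix, diy) / (math.hypot(dmx, dmy) + 1e-12)
    c, s = math.cos(theta), math.sin(theta)
    tx = i1[0] - scale * (c * m1[0] - s * m1[1])
    ty = i1[1] - scale * (s * m1[0] + c * m1[1])
    return theta, scale, tx, ty

def pose_cluster(model_pts, image_pts, ang_bin=math.radians(5), scale_bin=0.1, t_bin=10.0):
    """Vote candidate poses into a coarse 4-D accumulator; return the most popular bin and its count."""
    votes = Counter()
    for m1, m2 in combinations(model_pts, 2):
        for i1, i2 in combinations(image_pts, 2):
            theta, scale, tx, ty = similarity_from_two_pairs(m1, m2, i1, i2)
            key = (round(theta / ang_bin), round(scale / scale_bin),
                   round(tx / t_bin), round(ty / t_bin))
            votes[key] += 1
    return votes.most_common(1)[0] if votes else None
```

Because occlusion only removes some correspondences rather than corrupting all of them, the correct pose bin still gathers many votes even when part of the boundary is missing, which is consistent with the robustness reported above.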
