721 |
Probabilistic Darwin Machines: A new approach to develop Evolutionary Object Detection Systems. Baró i Solé, Xavier. 03 April 2009.
Ever since computers were invented, we have wondered whether they might perform some of people's everyday tasks. One of the most studied and still least understood problems is the capacity to learn from our experiences and to generalize the knowledge we acquire. One of the tasks that people perform unconsciously, and that has attracted scientific interest from the beginning, is what is known as pattern recognition. The creation of models that represent the world around us helps us to recognize objects in our environment, predict situations, identify behaviors, and so on. All this information allows us to adapt to and interact with our environment.
The capacity of individuals to adapt to their environment has even been related to the number of patterns they are capable of identifying. When we speak about pattern recognition in the field of Computer Vision, we refer to the ability to identify objects using the information contained in one or more images. Despite the progress of recent years, and the fact that we can now obtain "useful" results in real environments, we are still very far from a system with the same capacity for abstraction and robustness as the human visual system. In this thesis, the face detector of Viola & Jones is studied as the paradigmatic and most widespread approach to the object detection problem. First, we analyze how objects are described using comparisons of illumination values in adjacent zones of the images, and how this information is later organized to create more complex structures. As a result of this study, two weak points are identified in this family of methods: the first concerns the description of the objects, and the second is a limitation of the learning algorithm that hampers the use of better descriptors. Describing objects using Haar-like features limits the extracted information to connected regions of the object. To compare distant zones, large contiguous features must be used, which makes the obtained values depend more on the average illumination of the object than on the regions being compared. In order to use this kind of non-local information, we introduce Dissociated Dipoles into the object detection scheme. The problem with this type of descriptor is that the large cardinality of the feature set makes the use of Adaboost as the learning algorithm unfeasible.
The reason is that during learning an exhaustive search is made over the space of hypotheses, and since this space is enormous, the time needed for learning becomes prohibitive. Although we studied this phenomenon in the Viola & Jones approach, it is a general problem for most approaches, where the learning method limits the descriptors that can be used and therefore the quality of the object description. To remove this limitation, we introduce evolutionary methods into the Adaboost algorithm and study the effect of this modification on the learning ability. Our experiments show that not only does the algorithm remain able to learn, but its convergence speed is not significantly altered. This new Adaboost with evolutionary strategies opens the door to feature sets of arbitrary cardinality, which allows us to investigate new ways of describing our objects, such as the Dissociated Dipoles. We first compare the learning ability of this evolutionary Adaboost using Haar-like features and Dissociated Dipoles; the results show that both types of descriptors have similar representational power, with one adapting slightly better than the other depending on the problem. With the aim of obtaining a descriptor that shares the strong points of both, we propose a new type of feature, the Weighted Dissociated Dipoles, which combines the robust structure detectors of the Haar-like features with the ability of the Dissociated Dipoles to use non-local information.
In the experiments we carried out, this new feature set obtains better results on all the problems tested, compared with Haar-like features and Dissociated Dipoles. To evaluate and compare the methods, we use a set of public databases covering face detection, text detection, pedestrian detection, and car detection. In addition, our methods are tested on a traffic sign detection problem, over large databases containing both road and urban scenes.
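As a hedged illustration of the two descriptor families this abstract contrasts, the sketch below computes a two-rectangle Haar-like feature (contiguous regions) and a dissociated dipole (two regions that need not touch) from an integral image. The function names, window layout, and region coordinates are our own choices for illustration, not taken from the thesis:

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]
            ii[y][x] = row + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in the inclusive rectangle, in O(1) via the table."""
    a = ii[bottom][right]
    b = ii[top - 1][right] if top > 0 else 0
    c = ii[bottom][left - 1] if left > 0 else 0
    d = ii[top - 1][left - 1] if top > 0 and left > 0 else 0
    return a - b - c + d

def haar_two_rect(ii, top, left, h, w):
    """Classic two-rectangle Haar-like feature: left half minus right
    half of ONE contiguous window."""
    mid = left + w // 2
    return (rect_sum(ii, top, left, top + h - 1, mid - 1)
            - rect_sum(ii, top, mid, top + h - 1, left + w - 1))

def dissociated_dipole(ii, excitatory, inhibitory):
    """Dipole: difference between two rectangles that need NOT be
    adjacent. Each argument is a (top, left, bottom, right) tuple."""
    return rect_sum(ii, *excitatory) - rect_sum(ii, *inhibitory)
```

Both descriptors reduce to constant-time rectangle sums once the integral image is built, which is what makes such large feature sets cheap to evaluate, even when the set is too large to search exhaustively during learning.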
|
722 |
3-D Reconstruction from Single Projections, with Applications to Astronomical Images. Cormier, Michael. January 2013.
A variety of techniques exist for three-dimensional reconstruction when multiple views are available, but less attention has been given to reconstruction when only a single view is available. Such a situation is normal in astronomy, where a galaxy (for example) is so distant that it is impossible to obtain views from significantly different angles. In this thesis I examine the problem of reconstructing the three-dimensional structure of a galaxy from this single viewpoint. I accomplish this by taking advantage of the image formation process, symmetry relationships, and other structural assumptions that may be made about galaxies.
Most galaxies are approximately symmetric in some way. Frequently, this symmetry corresponds to symmetry about an axis of rotation, which allows strong statements to be made about the relationships between the luminosities at different points in the galaxy. It is through these relationships that the number of unknown values needed to describe the structure of the galaxy can be reduced to the number of constraints provided by the image, so that the optimal reconstruction is well defined. Other structural properties can also be described within this framework.
I provide a mathematical framework and analyses that prove the uniqueness of solutions under certain conditions and show how uncertainty may be precisely and explicitly expressed. Empirical results are shown using real and synthetic data. I also show a comparison to a state-of-the-art two-dimensional modelling technique to demonstrate the contrasts between the two frameworks and the important advantages of the three-dimensional approach. In combination, the theoretical and experimental aspects of this thesis demonstrate that the proposed framework is versatile, practical, and novel: a contribution to both computer science and astronomy.
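As a generic illustration of reconstruction from a single projection, the sketch below integrates a symmetric luminosity model along the line of sight. For simplicity it assumes spherical rather than the thesis's axial symmetry, and the function names and midpoint-rule discretization are our own:

```python
import math

def project(density, R, x, n=1000):
    """Line-of-sight integral I(x) = integral of density(r) along the
    viewing direction s, with r = sqrt(x^2 + s^2), over s in [-R, R],
    approximated by the midpoint rule with n slices."""
    ds = 2.0 * R / n
    total = 0.0
    for i in range(n):
        s = -R + (i + 0.5) * ds
        total += density(math.hypot(x, s)) * ds
    return total

# A uniform sphere of unit radius: the projected intensity at offset x
# is the chord length 2 * sqrt(1 - x^2).
uniform = lambda r: 1.0 if r <= 1.0 else 0.0
```

Under the symmetry assumption, the observed profile I(x) constrains the radial density profile, which is the sense in which the unknowns describing the structure reduce to the constraints the image provides.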
|
723 |
Multiple feature temporal models for the characterization of semantic video contents. Sánchez Secades, Juan María. 11 December 2003.
The high-level structure of a video can be obtained once we have knowledge about the domain plus a representation of the contents that provides semantic information. In this context, intermediate-level semantic representations are defined in terms of low-level features and the information they convey about the contents of the video. Intermediate-level representations allow us to obtain semantically meaningful clusterings of shots, which are then used together with high-level domain-specific knowledge to obtain the structure of the video. Intermediate-level representations are usually domain-dependent as well: the descriptors involved are specifically tailored to the application, taking into account the requirements of the domain and the knowledge we have about it. This thesis proposes an intermediate-level representation of video contents that allows us to obtain semantically meaningful clusterings of shots. This representation does not depend on the domain, yet it provides enough information to obtain the high-level structure of the video by combining the contributions of different low-level image features to the intermediate-level semantics. Intermediate-level semantics are implicitly supplied by low-level features, given that a specific semantic concept generates a particular combination of feature values. The problem is to bridge the gap between the observed low-level features and their corresponding hidden intermediate-level semantic concepts. Computer vision and image processing techniques are used to establish relationships between them.
Other disciplines such as filmmaking and semiotics also provide important clues about how low-level features are used to create semantic concepts. A proper descriptor of low-level features can provide a representation of their corresponding semantic contents. In particular, color summarized as a histogram is used to represent the appearance of objects; when the object is the background, color provides information about location. In the same way, this thesis analyzes the semantics conveyed by a description of motion: motion features summarized as a temporal co-occurrence matrix provide information about camera operation and the type of shot, in terms of the relative distance of the camera to the subject matter. The main contribution of this thesis is a representation of visual contents in video based on summarizing the dynamic behavior of low-level features as temporal processes described by Markov chains (MCs). The states of the MC are given by the values of an observed low-level feature. Unlike keyframe-based representations of shots, the MC model takes information from all the frames into account. Natural similarity measures such as likelihood ratios and the Kullback-Leibler divergence are used to compare MCs, and thus the contents of the shots they represent. In this framework, multiple image features can be combined in the same representation by coupling their corresponding MCs. Different ways of coupling MCs are presented, in particular the one called Coupled Markov Chains (CMC). A method to find the optimal coupling structure, in terms of minimal cost and minimal loss of information, is also detailed in this dissertation; the loss of information is directly related to the loss of accuracy of the coupled structure in representing video contents.
During the same process of computing shot representations, the boundaries between shots are detected using the same model of contents and the same similarity measures. When color and motion features are combined, the CMC representation provides an intermediate-level semantic descriptor that implicitly contains information about objects (their identities, sizes and motion patterns), camera operation, location, type of shot, temporal relationships between elements of the scene, and global activity understood as the amount of action. More complex semantic concepts emerge from the combination of these intermediate-level descriptors, such as a "talking head", which combines a close-up with the skin color of a face. Adding the location component in the News domain, talking heads can be further classified into "anchors" (located in the studio) and "correspondents" (located outdoors). These and many other semantically meaningful categories are discovered when shots represented with the CMC model are clustered in an unsupervised way. Well-defined concepts correspond to compact clusters, which can be identified by a measure of their density. High-level domain knowledge can then be expressed as simple rules over these salient concepts, which establish boundaries in the semantic structure of the video. The CMC modeling of video shots thus unifies the first steps of the video analysis process, providing a semantically meaningful intermediate-level representation of contents without prior shot boundary detection.
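A minimal sketch of the shot representation described above, under our own simplifying assumptions (features already quantized into discrete states, additive smoothing of the counts, and a uniform state weighting in place of the chain's stationary distribution):

```python
import math

def markov_chain(symbols, n_states, eps=1e-6):
    """Transition matrix estimated from a quantized feature sequence
    (one shot), with a small additive smoothing term so that no
    probability is exactly zero."""
    counts = [[eps] * n_states for _ in range(n_states)]
    for a, b in zip(symbols, symbols[1:]):
        counts[a][b] += 1.0
    return [[c / sum(row) for c in row] for row in counts]

def kl_rate(P, Q):
    """Kullback-Leibler-style dissimilarity between two chains,
    weighting the states uniformly."""
    n = len(P)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n)) / n
```

Shots with similar temporal dynamics yield nearby transition matrices, so the divergence acts as a content dissimilarity without having to select keyframes.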
|
724 |
Generative Models for Video Analysis and 3D Range Data Applications. Orriols Majoral, Xavier. 27 February 2004.
The majority of problems in Computer Vision do not contain a direct relation between the stimuli provided by a general-purpose sensor and the corresponding perceptual category; a complex learning task is required to provide such a connection. In fact, the basic forms of energy, and their possible combinations, are few compared to the infinite perceptual categories corresponding to objects, actions, relations among objects, and so on. Two main factors determine the level of difficulty of a specific problem: i) the different levels of information that are employed, and ii) the complexity of the model that is intended to explain the observations. The choice of an appropriate representation for the data becomes significant when dealing with invariances, since these usually imply that the number of intrinsic degrees of freedom in the data distribution is lower than the number of coordinates used to represent it. Therefore, the decomposition into basic units (model parameters) and the change of representation allow a complex problem to be transformed into a manageable one. This simplification of the estimation problem has to rely on a proper mechanism for combining those primitives in order to give an optimal description of the global complex model. This thesis shows how Latent Variable Models reduce dimensionality while taking into account the internal symmetries of a problem, provide a way of dealing with missing data, and make it possible to predict new observations. The lines of research of this thesis are directed to the management of multiple data sources. More specifically, this thesis presents a set of new algorithms applied to two different areas in Computer Vision: i) video analysis and summarization, and ii) 3D range data.
Both areas have been approached through the Generative Models framework, where similar protocols for representing data have been employed.
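As an illustrative sketch of the dimensionality reduction performed by Latent Variable Models: the maximum-likelihood axis of a one-factor probabilistic PCA model coincides with the first principal direction of the data, which can be found by power iteration on the sample covariance. This is a generic textbook construction, not code from the thesis:

```python
def leading_component(data, iters=200):
    """Power iteration for the first principal direction of the data,
    i.e. the maximum-likelihood latent axis of a one-factor PPCA model."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    X = [[row[j] - means[j] for j in range(d)] for row in data]
    # sample covariance C = X^T X / n
    C = [[sum(X[k][i] * X[k][j] for k in range(n)) / n for j in range(d)]
         for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        # repeatedly apply C and renormalize; converges to the top eigenvector
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```

Projecting the data onto a few such latent axes replaces the original coordinates with a much smaller set, which is the dimensionality reduction the abstract refers to.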
|
725 |
Multiple Object Tracking with Occlusion Handling. Safri, Murtaza. 16 February 2010.
Object tracking is an important problem with wide-ranging applications. The purpose is to detect object contours and track their motion in a video. The issues of concern are to map objects correctly between two frames and to track through occlusion. This thesis discusses a novel framework for object tracking inspired by image registration and segmentation models. Occlusion of objects is also detected and handled appropriately in this framework.
The main idea of our tracking framework is to reconstruct the sequence of images in the video. The process involves deforming all the objects in a given image frame, called the initial frame. Regularization terms are used to govern the deformation of the shape of the objects; we use an elastic model and a viscous fluid model as regularizers. The reconstructed frame is formed by combining the deformed objects with respect to their depth ordering. The correct reconstruction is selected by the parameters that minimize the difference between the reconstruction and the consecutive frame, called the target frame. These parameters provide the required tracking information, such as the contours of the objects in the target frame, including the occluded regions. The regularization term restricts the deformation of the object shape in the occluded region and thus gives an estimate of the object shape there. A second idea is to use a segmentation model as a measure in place of the frame difference measure. This is separate from the image segmentation procedure, since we use the segmentation model in a tracking framework to capture object deformation. Numerical examples are presented to demonstrate tracking in simple and complex scenes, along with the occlusion handling capability of our model. The segmentation measure is shown to be more robust with regard to the accumulation of tracking error.
|
726 |
Visual-inertial tracking using Optical Flow measurements. Larsson, Olof. January 2010.
Visual-inertial tracking is a well-known technique for tracking the combination of a camera and an inertial measurement unit (IMU). An issue with the straightforward approach is the need for known 3D points. To bypass this, 2D information can be used, without recovering depth, to estimate the position and orientation (pose) of the camera. This Master's thesis investigates the feasibility of using Optical Flow (OF) measurements and indicates the benefits of this approach. The 2D information is added using OF measurements. OF describes the visual flow of interest points in the image plane. Without the need to estimate the depth of these points, the computational complexity is reduced. With the increased 2D information, the amount of 3D information required for the pose estimate decreases. The use of 2D points for pose estimation has been verified with experimental data gathered by a real camera/IMU system. Several data sequences containing different trajectories are used to estimate the pose. It is shown that OF measurements can be used to improve visual-inertial tracking with a reduced need for 3D-point registration.
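The following sketch is a deliberately simplified scalar stand-in for visual-inertial fusion: an inertial measurement drives the state prediction, and an optical-flow-derived velocity corrects it. The one-dimensional state, the noise values, and the filter form are our own assumptions, not the thesis's estimator:

```python
def vi_filter(accels, of_vels, dt=0.1, q=0.01, r=0.05):
    """Toy 1-D visual-inertial filter: accelerometer readings drive the
    prediction; an optical-flow-derived velocity measurement corrects it
    (scalar Kalman update on the velocity alone, for brevity)."""
    x, v = 0.0, 0.0   # state: position, velocity
    p = 1.0           # velocity variance
    for a, z in zip(accels, of_vels):
        # predict using the inertial measurement
        x += v * dt + 0.5 * a * dt * dt
        v += a * dt
        p += q
        # correct the velocity with the optical-flow measurement
        k = p / (p + r)
        v += k * (z - v)
        p *= 1.0 - k
    return x, v
```

The appeal of flow measurements shows even in this toy: the correction uses only image-plane velocity, with no 3D point depth anywhere in the update.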
|
727 |
A Probabilistic Approach to Image Feature Extraction, Segmentation and Interpretation. Pal, Chris. January 2000.
This thesis describes a probabilistic approach to image segmentation and interpretation. The focus of the investigation is the development of a systematic way of combining color, brightness, texture and geometric features extracted from an image to arrive at a consistent interpretation for each pixel in the image. The contribution of this thesis is thus the presentation of a novel framework for the fusion of extracted image features, producing a segmentation of an image into relevant regions. Further, a solution to the sub-pixel mixing problem is presented, based on solving a probabilistic linear program. This work is specifically aimed at interpreting and digitizing multi-spectral aerial imagery of the Earth's surface. The features of interest for extraction are those relevant to environmental management, monitoring and protection. The presented algorithms are suitable for use within a larger interpretive system. Some results are presented and contrasted with other techniques. The integration of these algorithms into a larger system is based firmly on a probabilistic methodology and on the use of statistical decision theory to accomplish uncertain inference within the visual formalism of a graphical probability model.
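A minimal sketch of per-pixel probabilistic fusion in the spirit described above, assuming conditionally independent features (a naive-Bayes combination); the class names, feature names, and linear likelihood functions below are invented for illustration:

```python
import math

def classify_pixel(features, likelihoods, priors):
    """Fuse per-feature likelihoods P(feature | class) into a posterior
    over classes for one pixel, assuming conditional independence of the
    features given the class."""
    log_post = {}
    for c in priors:
        lp = math.log(priors[c])
        for name, value in features.items():
            lp += math.log(likelihoods[c][name](value))
        log_post[c] = lp
    # normalize in a numerically safe way
    m = max(log_post.values())
    unnorm = {c: math.exp(lp - m) for c, lp in log_post.items()}
    total = sum(unnorm.values())
    return {c: p / total for c, p in unnorm.items()}

# Invented two-class example: crude linear likelihoods for a "greenness"
# and a "brightness" feature, each in [0, 1].
likelihoods = {
    "water":  {"green": lambda v: 0.8 - 0.6 * v, "bright": lambda v: 0.9 - 0.5 * v},
    "forest": {"green": lambda v: 0.2 + 0.6 * v, "bright": lambda v: 0.4 + 0.2 * v},
}
post = classify_pixel({"green": 0.9, "bright": 0.2}, likelihoods,
                      {"water": 0.5, "forest": 0.5})
```

A full system of the kind the abstract describes would replace the independence assumption with a graphical probability model over the features, but the per-pixel posterior computation has the same shape.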
|
728 |
Multiple Object Tracking with Occlusion HandlingSafri, Murtaza 16 February 2010 (has links)
Object tracking is an important problem with wide ranging applications. The purpose is to detect object contours and track their motion in a video. Issues of concern are to be able to map objects correctly between two frames, and to be able to track through occlusion. This thesis discusses a novel framework for the purpose of object tracking which is inspired from image registration and segmentation models. Occlusion of objects is also detected and handled in this framework in an appropriate manner.
The main idea of our tracking framework is to reconstruct the sequence of images in the video. The process involves deforming all the objects in a given image frame, called the initial frame. Regularization terms govern the deformation of the objects' shapes; we use elastic and viscous-fluid models as regularizers. The reconstructed frame is formed by combining the deformed objects with respect to their depth ordering. The correct reconstruction is selected by the parameters that minimize the difference between the reconstruction and the consecutive frame, called the target frame. These parameters provide the required tracking information, such as the contours of the objects in the target frame, including the occluded regions. The regularization term restricts the deformation of the object shape in the occluded region and thus gives an estimate of the object shape there. A second idea is to use a segmentation model as a measure in place of the frame-difference measure. This is distinct from an image segmentation procedure, since we use the segmentation model within a tracking framework to capture object deformation. Numerical examples demonstrate tracking in simple and complex scenes, along with the occlusion-handling capability of our model. The segmentation measure is shown to be more robust with regard to accumulation of tracking error.
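The reconstruct-and-compare idea can be sketched in one dimension (an illustration, not the thesis's method): deform the object from the initial frame, here by integer translation only instead of rich elastic/fluid deformations, and pick the parameter minimizing the frame difference plus a regularization penalty on the deformation magnitude:

```python
import numpy as np

def track_by_reconstruction(initial, target, shifts, lam=0.01):
    """Pick the deformation (here: a 1D shift) whose reconstruction
    best matches the target frame, with a quadratic regularization
    penalty on the deformation magnitude."""
    best_shift, best_cost = None, np.inf
    for s in shifts:
        reconstruction = np.roll(initial, s)
        cost = np.sum((reconstruction - target) ** 2) + lam * s ** 2
        if cost < best_cost:
            best_shift, best_cost = s, cost
    return best_shift

frame0 = np.zeros(50); frame0[10:20] = 1.0   # object in initial frame
frame1 = np.zeros(50); frame1[13:23] = 1.0   # same object, moved +3
shift = track_by_reconstruction(frame0, frame1, range(-5, 6))
```

The recovered shift is 3; in the full framework the search over a scalar shift becomes an optimization over dense deformation fields, with the regularizer also constraining the unobserved, occluded part of the contour.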
|
729 |
Recursive Estimation of Structure and Motion from Monocular ImagesFakih, Adel January 2010 (has links)
The determination of the 3D motion of a camera and the 3D structure of the scene in which the camera is moving, known as the Structure from Motion (SFM) problem, is a central problem in computer vision. Recursive (online) estimation in particular is of major interest for robotics applications such as navigation and mapping. Several problems still hinder the deployment of SFM in real-life applications, namely: (1) robustness to noise, outliers and ambiguous motions; (2) numerical tractability with a large number of features; and (3) rapidly varying camera velocities. Towards solving these problems, this research presents the following four contributions, which can be used individually, together, or combined with other approaches.
A motion-only filter is devised by capitalizing on algebraic threading constraints. This filter efficiently integrates information over multiple frames, achieving performance comparable to the best state-of-the-art filters. Unlike other filter-based approaches, however, it is not affected by large baselines (displacements between camera centers).
An approach is introduced to incorporate, with only a small computational overhead, a large number of frame-to-frame features (i.e., features that are matched only in pairs of consecutive frames) in any analytic filter. The computational overhead grows linearly with the number of added frame-to-frame features and the experimental results show an increased accuracy and consistency.
A novel filtering approach scalable to a large number of features is proposed. This approach matches both the scalability of the most scalable state-of-the-art filter and the accuracy of the most accurate one.
A solution to the problem of prediction over large baselines in monocular Bayesian filters is presented. The problem arises because a simple prediction, using a constant-velocity model for example, is not suitable for large baselines, and the projections of the 3D points in the state vector cannot be used in the prediction without violating the statistical independence of the prediction and update steps.
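To make the constant-velocity prediction concrete, here is a minimal 1D Kalman filter sketch. The state layout, noise values and measurements are illustrative assumptions, not the thesis's filter; it only shows the predict/update cycle whose prediction step the last contribution addresses:

```python
import numpy as np

dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity motion model
H = np.array([[1.0, 0.0]])              # we observe position only
Q = 0.01 * np.eye(2)                    # process noise (assumed)
R = np.array([[0.1]])                   # measurement noise (assumed)

x = np.array([0.0, 1.0])                # state: [position, velocity]
P = np.eye(2)                           # state covariance
for z in [1.05, 1.98, 3.02, 4.01]:      # noisy position measurements
    x, P = F @ x, F @ P @ F.T + Q       # predict
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ (np.array([z]) - H @ x) # update
    P = (np.eye(2) - K @ H) @ P
```

Over small steps the filter tracks position and velocity closely; over a large baseline the constant-velocity extrapolation degrades, which is the failure mode the thesis's prediction scheme is designed to avoid.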
|
730 |
Monocular Vision-Based Obstacle Detection for Unmanned SystemsWang, Carlos January 2011 (has links)
Many potential indoor applications exist for autonomous vehicles, such as automated surveillance, inspection, and document delivery. A key requirement for autonomous operation is that the vehicles be able to detect and map obstacles in order to avoid collisions. This work develops a comprehensive 3D scene reconstruction algorithm, based on known vehicle motion and vision data, that is specifically tailored to the indoor environment. Visible-light cameras are one of the many sensors available for capturing information from the environment; their key advantages over other sensors are that they are lightweight, power-efficient and cost-effective, and that they provide abundant information about the scene. The emphasis on 3D indoor mapping enables the assumption that a large majority of the area to be mapped is composed of planar surfaces such as floors, walls and ceilings, which can be exploited to simplify the complex task of dense reconstruction of the environment from monocular vision data.
In this thesis, the Planar Surface Reconstruction (PSR) algorithm is presented. It extracts surface information from images and combines it with 3D point estimates in order to generate a reliable and complete environment map. It was designed for single cameras, with the primary assumptions that the objects in the environment are flat, static and chromatically unique. The algorithm finds and tracks Scale Invariant Feature Transform (SIFT) features across a sequence of images to calculate 3D point estimates. The individual surface information is extracted using a combination of the Kuwahara filter and mean shift segmentation, which is then coupled with the 3D point estimates to fit these surfaces into the environment map. The resultant map consists of both surfaces and points that are assumed to represent obstacles in the scene. A ground vehicle platform was developed for the real-time implementation of the algorithm, and experiments were conducted to assess the PSR algorithm. Both clean and cluttered scenarios were used to evaluate the quality of the surfaces generated by the algorithm. The clean scenario satisfies the primary assumptions underlying the PSR algorithm and as a result produced accurate surface details of the scene, while the cluttered scenario generated lower-quality, but still promising, results. The significance of these findings is that incorporating object surface recognition into dense 3D reconstruction can significantly improve the overall quality of the environment map.
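The Kuwahara filter mentioned above is an edge-preserving smoother, which is what makes it useful before segmentation. A minimal grayscale sketch (an illustration, not the PSR implementation): for each pixel, the mean of the least-variable of four overlapping quadrant windows replaces the pixel value, smoothing flat regions while keeping edges crisp:

```python
import numpy as np

def kuwahara(img, r=2):
    """Grayscale Kuwahara filter: replace each interior pixel with the
    mean of whichever of its four (r+1)x(r+1) quadrant windows has the
    lowest variance.  Border pixels are left unchanged in this sketch."""
    h, w = img.shape
    out = img.copy()
    for y in range(r, h - r):
        for x in range(r, w - r):
            quads = [img[y - r:y + 1, x - r:x + 1],   # top-left
                     img[y - r:y + 1, x:x + r + 1],   # top-right
                     img[y:y + r + 1, x - r:x + 1],   # bottom-left
                     img[y:y + r + 1, x:x + r + 1]]   # bottom-right
            out[y, x] = min(quads, key=np.var).mean()
    return out

# A sharp step edge survives filtering unchanged: on each side of the
# edge, some quadrant is perfectly uniform and wins the variance test.
step = np.zeros((10, 10)); step[:, 5:] = 1.0
filtered = kuwahara(step)
```

A linear smoother (e.g. a Gaussian) would blur this step; the Kuwahara filter returns it exactly, which keeps region boundaries sharp for the subsequent mean shift segmentation stage.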
|