22 August 2006
Image/video segmentation is a basic but important step in image processing. In basic image processing tasks such as video analysis and video object recognition, as well as in high-level applications such as military surveillance and content-based video retrieval, all frames must first be segmented into meaningful parts before those parts can be processed further. The MPEG-4 multimedia communication standard enables content-based functionality by using the video object plane as its basic coding element. From the perspective of the human visual system, video segmentation extracts meaningful parts of the video stream that match what humans perceive: when we view a scene with the naked eye, we see a collection of objects, not individual pixels. In this thesis, we focus on image/video segmentation and its applications. One of our goals is to design and implement an image/video segmentation system based on existing methods that are widely used today. We decompose the system into several stages, each of which performs a specific task; based on the output of each stage, we can then refine that stage's algorithms to obtain a better result. In this way we can retrieve regions of the image data that more accurately match what the human visual system perceives; in other words, we separate the moving part, the foreground, from the static background. After obtaining the segmentation results, a compression algorithm such as MPEG-4 can be used to compress the retrieved regions, which is referred to as content-based coding. Other image processing applications can also be developed on this basis; for example, a remote surveillance and monitoring system can detect moving objects using the segmentation algorithms described in this thesis.
People often have difficulty finding specific information in video because of its linear and unstructured nature. Segmenting long videos into small clips by topic and providing browsing and search functionality makes information seeking easier. However, manual segmentation is labor intensive, and existing automated segmentation methods are not effective for the large volume of amateur-made, unedited lecture videos. The objectives of this dissertation are to develop 1) automated segmentation algorithms that extract the topic structure of a lecture video, and 2) retrieval algorithms that identify the video segments relevant to user queries. Based on an extensive literature review, existing segmentation features and approaches are summarized, and research challenges and questions are presented. Manual segmentation studies are conducted to understand the content structure of a lecture video, and a set of potential segmentation features and methods is extracted to guide the design of the automated approaches. Two static algorithms are developed to segment a lecture video into a list of topics. The segmentation algorithms use features from multiple modalities and various knowledge sources (e.g., electronic slides). A dynamic segmentation method is also developed to retrieve relevant video segments of appropriate size based on the questions users ask. A series of evaluation studies is conducted, and the results demonstrate the effectiveness and usefulness of the automated segmentation approaches.
Modern computer vision has recently seen significant progress in learning visual concepts from examples. This progress has been fuelled by recent models of visual appearance as well as recently collected large-scale datasets of manually annotated still images. Video is a promising alternative, as it inherently contains much richer information than still images. For instance, in video we can observe an object move, which allows us to differentiate it from its surroundings, or we can observe a smooth transition between different viewpoints of the same object instance. This richness of information allows us to effectively tackle tasks that would otherwise be very difficult with still images alone, or even to address tasks that are video-specific. Our first contribution is a computationally efficient technique for video object segmentation. Our method relies solely on motion to rapidly create a rough initial estimate of the foreground object. This rough initial estimate is then refined through an energy formulation to be spatio-temporally smooth. The method can handle rapidly moving backgrounds and objects, as well as non-rigid deformations and articulations, without prior knowledge of the object's appearance, size, or location. In addition to this class-agnostic method, we present a class-specific method that incorporates additional class-specific appearance cues when the class of the foreground object is known in advance (e.g., a video of a car). For our second contribution, we propose a novel model for temporal video alignment with regard to the viewpoint of the foreground object (i.e., a pair of aligned frames shows the same object viewpoint). Our work relies on our video object segmentation technique to automatically localise the foreground objects and to extract appearance measurements solely from them rather than from the background.
Our model is able to temporally align realistic videos, where events may occur in a different order or occur in only one of the videos. This is in contrast to previous works, which typically assume that the videos show a scripted sequence of events and can simply be aligned by stretching or compressing one of the videos. As a final contribution, we once again use our video object segmentation technique as a basis for automatic visual aspect discovery from videos of an object class. Compared to previous works, we use a broader definition of an aspect that considers four factors of variation: viewpoint, articulated pose, occlusions, and cropping by the image border. We pose the aspect discovery task as a clustering problem and provide an extensive experimental exploration of the benefits of object segmentation for this task.
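The motion-only initialisation described above can be sketched in a few lines: pixels whose flow magnitude deviates strongly from the dominant (background) motion are marked as a rough foreground estimate. The synthetic flow field, the median/MAD statistics, and the constant k below are illustrative assumptions, not the thesis's actual estimator:

```python
# Rough foreground initialisation from motion alone: flag pixels whose
# optical-flow magnitude differs strongly from the dominant motion.
import math

def rough_foreground(flow, k=2.0):
    """flow: 2-D grid of (dx, dy) vectors; returns a boolean mask."""
    mags = [[math.hypot(dx, dy) for dx, dy in row] for row in flow]
    flat = sorted(m for row in mags for m in row)
    median = flat[len(flat) // 2]          # dominant (background) motion
    mad = sorted(abs(m - median) for row in mags for m in row)[len(flat) // 2]
    thresh = median + k * max(mad, 1e-6)   # robust deviation threshold
    return [[m > thresh for m in row] for row in mags]

# Static background (zero flow) with a small moving patch.
flow = [[(0.0, 0.0)] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        flow[r][c] = (3.0, 1.0)
mask = rough_foreground(flow)
print(sum(v for row in mask for v in row))  # 4: only the moving patch
```

In the thesis this kind of rough estimate is only a starting point; it is subsequently refined by a spatio-temporal smoothness energy.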
Chenaoua, Kamal S.
Lee, Yong Jae, 1984-
12 July 2012
The current trend in visual recognition research is to place a strict division between the supervised and unsupervised learning paradigms, which is problematic for two main reasons. On the one hand, supervised methods require training data for each and every category that the system learns; training data may not always be available and is expensive to obtain. On the other hand, unsupervised methods must determine the optimal visual cues and distance metrics that distinguish one category from another to group images into semantically meaningful categories; however, for unlabeled data, these are unknown a priori. I propose a visual category discovery framework that transcends the two paradigms and learns accurate models with few labeled exemplars. The main insight is to automatically focus on the prevalent objects in images and videos, and learn models from them for category grouping, segmentation, and summarization. To implement this idea, I first present a context-aware category discovery framework that discovers novel categories by leveraging context from previously learned categories. I devise a novel object-graph descriptor to model the interaction between a set of known categories and the unknown to-be-discovered categories, and group regions that have similar appearance and similar object-graphs. I then present a collective segmentation framework that simultaneously discovers the segmentations and groupings of objects by leveraging the shared patterns in the unlabeled image collection. It discovers an ensemble of representative instances for each unknown category, and builds top-down models from them to refine the segmentation of the remaining instances. Finally, building on these techniques, I show how to produce compact visual summaries for first-person egocentric videos that focus on the important people and objects. 
The system leverages novel egocentric and high-level saliency features to predict important regions in the video, and produces a concise visual summary that is driven by those regions. I compare against existing state-of-the-art methods for category discovery and segmentation on several challenging benchmark datasets. I demonstrate that we can discover visual concepts more accurately by focusing on the prevalent objects in images and videos, and show clear advantages of departing from the status quo division between the supervised and unsupervised learning paradigms. The main impact of my thesis is that it lays the groundwork for building large-scale visual discovery systems that can automatically discover visual concepts with minimal human supervision.
Saliency Cut: an Automatic Approach for Video Object Segmentation Based on Saliency Energy Minimization
January 2013
Video object segmentation (VOS) is an important task in computer vision with many applications, e.g., video editing, object tracking, and object-based encoding. Unlike image object segmentation, video object segmentation must consider both the spatial and the temporal coherence of the object. Despite extensive previous work, the problem remains challenging. Usually the foreground object in a video draws more human attention, i.e., it is salient. In this thesis we tackle the problem from the perspective of saliency, where saliency means the subset of visual information selected by a visual system (human or machine). We present a novel unsupervised method for video object segmentation that considers both low-level vision cues and high-level motion cues. In our model, video object segmentation is formulated as a unified energy minimization problem and solved in polynomial time with the min-cut algorithm. Specifically, our energy function comprises a unary term and a pair-wise interaction term, where the unary term measures region saliency and the interaction term smooths the mutual effects between object saliency and motion saliency. Object saliency is computed in the spatial domain from each discrete frame using multi-scale context features, e.g., color histograms, gradients, and graph-based manifold ranking. Motion saliency is calculated in the temporal domain by extracting phase information from the video. In the experimental section of this thesis, the proposed method is evaluated on several benchmark datasets. On the MSRA 1000 dataset, the results demonstrate that our spatial object saliency detection is superior to state-of-the-art methods. Moreover, our temporal motion saliency detector achieves better performance than existing motion detection approaches on the UCF Sports action analysis dataset and the Weizmann dataset.
Finally, we show attractive empirical results and a quantitative evaluation of our approach on two benchmark video object segmentation datasets. M.S. Computer Science, 2013.
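The unary-plus-pairwise energy described in this abstract can be illustrated on a toy example. The sketch below brute-forces a five-pixel strip instead of running min-cut (which finds the same optimum in polynomial time for this submodular energy), and the saliency values and smoothness weight LAMBDA are invented for illustration:

```python
# Toy saliency energy: a unary cost for labelling against the saliency
# evidence, plus a pairwise penalty on label changes between neighbours.
from itertools import product

saliency = [0.9, 0.8, 0.2, 0.1, 0.85]   # illustrative per-pixel saliency
LAMBDA = 0.5                             # smoothness weight (assumed)

def energy(labels):
    # Unary: labelling a salient pixel background costs s; foreground 1-s.
    unary = sum(s if l == 0 else 1 - s for s, l in zip(saliency, labels))
    # Pairwise: penalise each label change between neighbouring pixels.
    pairwise = LAMBDA * sum(a != b for a, b in zip(labels, labels[1:]))
    return unary + pairwise

best = min(product([0, 1], repeat=len(saliency)), key=energy)
print(best)  # (1, 1, 0, 0, 1): salient pixels are labelled foreground
```

Raising LAMBDA would eventually merge the strip into a single label, which is exactly the smoothing role the interaction term plays in the thesis's formulation.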
Intelligent Rotoscoping: A Semi-Automated Interactive Boundary Tracking Approach to Video Segmentation
Holladay, Seth R.
13 June 2007
Video segmentation is an application of computer vision aimed at automating the extraction of an object from a series of video frames. It is a difficult problem, however, especially at real-time, interactive rates. Although general application to video is difficult because of the wide range of image scenarios, user interaction can reduce the problem space and speed up the computation. This thesis presents a fast object-tracking tool that selects an object from a series of frames based on minimal user input. Our Intelligent Rotoscoping tool aims for greater speed and accuracy than other video segmentation tools while maintaining reproducibility of results. For speed, the tool stays ahead of the user in selecting frames and responding to feedback. For accuracy, it interprets user input so that the user does not have to edit every frame. For reproducibility, it maintains results across multiple iterations. These goals are realized as follows: after selecting the object in a single frame, the user watches a rapid propagation of the initial selection, applying minor nudges wherever the selection misses its mark. This lets the user “mold” the selection in certain frames while the tool propagates the fixes to neighboring frames. The tool has a simple interface, minimal preprocessing, and minimal user input. It accepts any sort of footage and exploits the spatio-temporal coherence of the object to be segmented, allowing artistic freedom without demanding intensive sequential processing. This thesis includes three specific extensions to Intelligent Scissors for application to video: 1. Leapfrogging, a robust method for propagating a user's single-frame selection over multiple frames by snapping each selection to its neighboring frame. 2. Histogram snapping, a method for training each frame's cost map from previous user selections by measuring proximity to pixels in a training set and snapping to the most similar pixel's cost. 3. A real-time feedback and correction loop that provides an intuitive interface for the user to watch and control the selection propagation; the algorithm updates its training data from this input.
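A rough, hypothetical sketch of the histogram-snapping idea follows. Greyscale intensities stand in for colours, and the frequency-based cost rule is our assumption, not the thesis's exact formula; the point is only that colours frequent in the user's earlier boundary selections become cheap for the boundary-tracking search:

```python
# Re-train a frame's edge-cost map from user-selected boundary pixels:
# each query pixel snaps to the most similar trained colour's cost.
from collections import Counter

def train_costs(training_pixels):
    """Map each training colour to a cost inversely related to frequency."""
    hist = Counter(training_pixels)
    peak = max(hist.values())
    return {c: 1.0 - n / peak for c, n in hist.items()}

def snapped_cost(pixel, costs):
    """Snap to the nearest trained colour (greyscale distance here)."""
    nearest = min(costs, key=lambda c: abs(c - pixel))
    return costs[nearest]

costs = train_costs([200, 200, 200, 180, 60])   # boundary mostly bright
print(snapped_cost(199, costs))   # near the frequent 200 -> cheap
print(snapped_cost(70, costs))    # near the rare 60 -> expensive
```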
Price, Brian L.
10 August 2010
Video segmentation, the process of selecting an object out of a video sequence, is a fundamentally important process for video editing and special effects. However, it remains an unsolved problem due to many difficulties such as large or rapid motions, motion blur, lighting and shadow changes, complex textures, similar colors in the foreground and background, and many others. While the human vision system relies on multiple visual cues and higher-order understanding of the objects involved in order to perceive the segmentation, current algorithms usually depend on a small amount of information to assist a user in selecting a desired object. As a result, current methods often fail in common cases. Because of this, industry still largely relies on humans to trace the object in each frame, a tedious and expensive process. This dissertation investigates methods of segmenting video by propagating the segmentation from frame to frame using multiple cues to maximize the amount of information gained from each user interaction. New and existing methods are incorporated to propagate as much information as possible to a new frame, leveraging multiple cues such as object colors or mixes of colors, color relationships, temporal and spatial coherence, motion, shape, and identifiable points. The cues are weighted and applied on a local basis depending on the reliability of the cue in each region of the image. The reliability of the cues is learned from any corrections the user makes. In this framework, every action of the user is examined and leveraged in an attempt to provide as much information as possible to guarantee a correct segmentation. Propagating segmentation information from frame to frame using multiple cues and learning from the user interaction allows users to more quickly and accurately extract objects from video while exerting less effort.
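The idea of weighting cues by a reliability learned from user corrections can be sketched as follows. The cue names, the weighted-average combination rule, and the multiplicative update with rate eta are illustrative assumptions, not the dissertation's actual learning rule:

```python
# Combine per-cue foreground probabilities with reliability weights, and
# shrink the weight of any cue that disagrees with a user correction.

def combine(cue_probs, weights):
    total = sum(weights.values())
    return sum(weights[c] * p for c, p in cue_probs.items()) / total

def learn_from_correction(cue_probs, weights, true_label, eta=0.5):
    """Down-weight each cue in proportion to its error on the correction."""
    for cue, p in cue_probs.items():
        error = abs(true_label - p)
        weights[cue] *= (1.0 - eta * error)
    return weights

weights = {"color": 1.0, "motion": 1.0, "shape": 1.0}
cues = {"color": 0.9, "motion": 0.2, "shape": 0.8}   # motion cue disagrees
before = combine(cues, weights)
weights = learn_from_correction(cues, weights, true_label=1)
after = combine(cues, weights)
print(before, after)   # the unreliable motion cue is down-weighted,
                       # so the combined estimate moves toward foreground
```

In the dissertation this weighting is local (per image region) rather than global as in this toy, which is what lets one cue dominate where it is reliable and defer elsewhere.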
Scene segmentation in news programs: a multimodal approach (original title: Segmentação de cenas em telejornais: uma abordagem multimodal)
Coimbra, Danilo Barbosa
11 April 2011
This work aims to develop a scene segmentation method for digital video that handles semantically complex segments. As proof of concept, we present a multimodal approach that uses a more general definition of TV news scenes, covering both scenes in which anchors appear and scenes in which no anchor appears. The results of the multimodal technique were significantly better than those of the monomodal techniques applied separately. The tests were performed on four groups of Brazilian news programs obtained from two different TV stations, each containing five editions, for a total of twenty newscasts.
Vector Flow Model in Video Estimation and Effects of Network Congestion in Low Bit-Rate Compression Standards
Ramadoss, Balaji
16 October 2003
The use of digitized information is rapidly gaining acceptance in bio-medical applications. Video compression plays an important role in the archiving and transmission of different digital diagnostic modalities. The present scheme of video compression for low bit-rate networks is not suitable for medical video sequences: the instability results from block artifacts introduced by block-based DCT coefficient quantization. The possibility of applying deformable motion estimation techniques to make the video compression standard (H.263) more adaptable to bio-medical applications was studied in detail. A study of network characteristics and the behavior of various congestion control mechanisms was used to analyze the complete characteristics of existing low bit-rate video compression algorithms. The study was conducted in three phases. The first phase involved the implementation and study of the present H.263 compression standard and its limitations. The second phase dealt with the analysis of an external force for active contours, which was used to obtain estimates for deformable objects. The external force, termed Gradient Vector Flow (GVF), was computed as a diffusion of the gradient vectors associated with a gray-level or binary edge map derived from the image. The mathematical aspects of a multi-scale framework based on a medial representation for the segmentation and shape characterization of anatomical objects in medical imagery were derived in detail. The medial representations were based on a hierarchy of linked figural models such as protrusions, indentations, neighboring figures, and included figures, which represented solid regions and their boundaries. The third phase dealt with the parameters vital for effective video streaming over the internet, in particular the bottleneck bandwidth, which gives the upper limit on the speed of data delivery from one end point to the other in a network.
If a codec attempts to send data beyond this limit, all packets above the limit will be lost; on the other hand, sending under this limit will clearly result in suboptimal video quality. During this phase, the packet-drop-rate (PDR) performance of TCP(1/2) was investigated in conjunction with a few representative TCP-friendly congestion control protocols (CCPs). The CCPs were TCP(1/256), SQRT(1/256), and TFRC(256), with and without self-clocking. The CCPs were studied when subjected to an abrupt reduction in the available bandwidth. Additionally, the investigation studied the effect on the drop rates of TCP-compatible algorithms of changing the queuing scheme from Random Early Detection (RED) to DropTail.
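The Gradient Vector Flow diffusion mentioned in this abstract can be sketched in one dimension. The update v ← v + mu·v″ − (v − f′)·(f′)² follows the standard GVF formulation (diffuse the edge-map gradient f′ while staying faithful to it where the gradient is strong), but the grid size, mu, and iteration count below are illustrative choices:

```python
# 1-D Gradient Vector Flow: iteratively diffuse the edge-map gradient so
# that edge forces extend into flat regions far from the edge itself.

def gvf_1d(edge_map, mu=0.2, iters=200):
    n = len(edge_map)
    # Central-difference gradient of the edge map (f').
    grad = [0.0] * n
    for i in range(1, n - 1):
        grad[i] = (edge_map[i + 1] - edge_map[i - 1]) / 2.0
    v = grad[:]                       # initialise the flow with the gradient
    for _ in range(iters):
        nxt = v[:]
        for i in range(1, n - 1):
            lap = v[i - 1] - 2 * v[i] + v[i + 1]          # discrete v''
            nxt[i] = v[i] + mu * lap - (v[i] - grad[i]) * grad[i] ** 2
        v = nxt
    return v

# A single step edge in the middle of a flat intensity profile.
edge_map = [0.0] * 5 + [1.0] * 5
v = gvf_1d(edge_map)
print(v)   # non-zero flow now extends well beyond the edge location
```

The data term (v − f′)·(f′)² vanishes where the gradient is zero, so flat regions are governed purely by diffusion; this is what gives active contours initialised far from an object a force pulling them toward its boundary.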