1 |
Per-exemplar analysis with MFoM fusion learning for multimedia retrieval and recounting
Kim, Ilseo (27 August 2014)
As large volumes of digital video data become available, along with revolutionary advances in multimedia technologies, demand for efficiently retrieving and recounting multimedia data has grown. However, the inherent complexity of representing and recognizing multimedia data, especially large-scale and unconstrained consumer videos, poses significant challenges. In particular, the following challenges are the major concerns of the proposed research.
One challenge is that consumer-video data (e.g., videos on YouTube) are mostly unstructured; therefore, evidence for a targeted semantic category is often sparsely located across time. To address this issue, a segmental multi-way local feature pooling method based on scene-concept analysis is proposed. The method utilizes scene concepts that are pre-constructed by clustering video segments into categories in an unsupervised manner. A video is then represented by multiple feature descriptors, one per scene concept. Finally, multiple kernels are constructed from the feature descriptors and combined into a final kernel that improves the discriminative power for multimedia event detection.
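The final fusion step above can be sketched in a few lines of Python. This is a minimal illustration only: the abstract does not specify how the combination weights are chosen, so the uniform weights and the convex-combination constraint here are assumptions (in practice they would be learned, e.g., by multiple kernel learning).

```python
def combine_kernels(kernels, weights):
    """Weighted sum of same-sized Gram matrices (given as lists of lists)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "use a convex combination"
    n = len(kernels[0])
    fused = [[0.0] * n for _ in range(n)]
    for K, w in zip(kernels, weights):
        for i in range(n):
            for j in range(n):
                fused[i][j] += w * K[i][j]
    return fused

# Hypothetical per-descriptor kernels for two videos.
K_motion = [[1.0, 0.2], [0.2, 1.0]]
K_scene  = [[1.0, 0.8], [0.8, 1.0]]
K = combine_kernels([K_motion, K_scene], [0.5, 0.5])
# K[0][1] is the average of the two off-diagonal similarities, 0.5
```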
Another challenge is that most semantic categories used for multimedia retrieval exhibit inherent within-class diversity, which can be dramatic and raises the question of whether conventional approaches remain successful and scalable. To handle such variability and further improve recounting capabilities, a per-exemplar learning scheme is proposed, with a focus on fusing multiple types of heterogeneous features for video retrieval. Whereas the conventional approach to multimedia retrieval learns a single classifier per category, the proposed scheme learns multiple detection models, one for each training exemplar. In particular, a local distance function is defined as a linear combination of element distances, one measured by each feature type. A weight vector for the local distance function is then learned discriminatively, taking only the samples neighboring an exemplar as training samples. In this way, the retrieval problem is recast as an association problem: test samples are retrieved by association-based rules.
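A minimal sketch of the per-exemplar local distance function and the association step described above: each exemplar carries its own weight vector over per-feature distances, and a test sample is associated with the exemplar giving the smallest local distance. The weights and distance values below are illustrative placeholders, not values from the thesis.

```python
def local_distance(w, dists):
    """d(x, exemplar) = sum_k w_k * d_k(x, exemplar): a linear
    combination of element distances, one per feature type."""
    return sum(wi * di for wi, di in zip(w, dists))

# Hypothetical learned weight vectors over two feature types.
exemplars = {
    "exemplar_a": [0.7, 0.3],
    "exemplar_b": [0.2, 0.8],
}
# Hypothetical per-feature distances of one test sample to each exemplar.
dists_to = {
    "exemplar_a": [0.1, 0.9],
    "exemplar_b": [0.5, 0.5],
}
# Association-based retrieval: pick the exemplar with the smallest
# local distance to the test sample.
best = min(exemplars, key=lambda e: local_distance(exemplars[e], dists_to[e]))
```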
In addition, the quality of a multimedia-retrieval system is often evaluated by domain-specific performance metrics that serve sophisticated user needs. To address such criteria, novel algorithms are proposed within the maximal figure-of-merit (MFoM) learning framework to explicitly optimize two challenging metrics: average precision (AP) and a weighted sum of the probabilities of false alarms and missed detections at a target error ratio. Most conventional learning schemes optimize their own learning criteria rather than domain-specific performance measures. To address this discrepancy, the proposed learning scheme approximates the given performance measure, which is discrete and therefore hard to optimize with conventional schemes, by a continuous and differentiable loss function that can be optimized directly. A generalized probabilistic descent (GPD) algorithm is then applied to optimize this loss function.
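The core smoothing idea can be illustrated as follows: each discrete miss or false-alarm indicator is replaced by a sigmoid of the score margin, so the weighted error count becomes differentiable. This is a generic sketch of the smoothing trick, not the thesis's actual MFoM objective; the sigmoid slope, threshold, and costs are assumed values.

```python
import math

def sigmoid(x, alpha=5.0):
    """Smooth step function; alpha controls how sharply it approximates 0/1."""
    return 1.0 / (1.0 + math.exp(-alpha * x))

def smooth_error(scores, labels, threshold, c_miss=1.0, c_fa=1.0):
    """Differentiable surrogate for the weighted sum of misses and
    false alarms at a decision threshold."""
    loss = 0.0
    for s, y in zip(scores, labels):
        if y == 1:                       # positive: penalize score below threshold
            loss += c_miss * sigmoid(threshold - s)
        else:                            # negative: penalize score above threshold
            loss += c_fa * sigmoid(s - threshold)
    return loss

good = smooth_error([0.9, 0.1], [1, 0], threshold=0.5)  # well-separated scores
bad  = smooth_error([0.1, 0.9], [1, 0], threshold=0.5)  # swapped scores
```

Because `smooth_error` is differentiable in the scores, a gradient-based scheme (the GPD algorithm in the abstract) can drive the underlying classifier parameters to reduce it directly.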
|
2 |
Video retrieval based on fractal orthogonal bases and temporal graph
Chang, Min-luen (26 January 2010)
In this paper, we present a structural video representation for video retrieval based on fractal orthogonal bases, composed of five steps: video summarization (extracting key-frames from the video), normalized graph cuts (classifying the key-frames), temporal graph construction (ordering key-frames by their time in the video), transformation of the directed graph into a string (a one-to-one mapping), and string-similarity comparison (covering both string architecture and content), to establish a framework for the video contents. With this information, the structure of the video and its complementary knowledge can be built up in terms of a main line and branch lines. Users can therefore not only browse the video efficiently but also focus on the structural parts they are interested in.
To construct the fundamental system, we employ a distortion metric to extract key-frames from the video and classify them using normalized graph cuts, so that shots are linked together based on their content. After constructing the relation graph, the graph is transformed into a string with enriched structure. The resulting clusters form a directed graph, and a shortest-path algorithm is proposed to find the main structure of the video. String similarity is divided into string architecture and string content. For architecture, we adopt edit distance on the main structure and the recursive branch lines. After the architectural comparison, the highest-similarity strings are compared in content against the fractal orthogonal bases, which guarantee that similar indices correspond to similar images, together with support vector clustering. The results demonstrate that our system achieves better performance and information coverage.
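The architectural comparison above relies on edit distance between structure strings. A standard Levenshtein implementation suffices as an illustration; the thesis does not give its exact cost scheme, so unit insertion, deletion, and substitution costs are assumed here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two structure strings, computed
    with the classic dynamic-programming table."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                       # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

# e.g. two hypothetical main-structure strings over shot-cluster labels
dist = edit_distance("ABCBD", "ABCD")  # one deletion -> distance 1
```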
|
3 |
Interactive video retrieval using implicit user feedback
Vrochidis, Stefanos (January 2013)
In recent years, the rapid development of digital technologies and the low cost of recording media have greatly increased the availability of multimedia content worldwide. This availability creates demand for advanced search engines. Traditionally, manual annotation of video was a common practice to support retrieval; however, the vast amount of multimedia content makes such practices very expensive in terms of human effort. At the same time, the availability of low-cost wearable sensors delivers a plethora of user-machine interaction data. An important challenge is therefore to exploit implicit user feedback (such as navigation patterns and eye movements) gathered during interactive multimedia retrieval sessions, with a view to improving video search engines. In this thesis, we focus on automatically annotating video content by exploiting the aggregated implicit feedback of past users, expressed as click-through data and gaze movements. Towards this goal, we conducted interactive video retrieval experiments to collect click-through and eye-movement data in environments that were not strictly controlled. First, we generate semantic relations between multimedia items by proposing a graph representation of aggregated past interaction data, and exploit these relations to generate recommendations as well as to improve content-based search. Then, we investigate the role of gaze movements in interactive video retrieval and propose a methodology for inferring user interest using support vector machines with gaze-movement-based features. Finally, we propose an automatic video annotation framework that combines query clustering into topics, by constructing gaze-movement-driven random forests and temporally enhanced dominant sets, with video-shot classification for predicting the relevance of viewed items to a topic.
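To make the gaze-based feature idea concrete, here is a hedged sketch of the kind of per-item features that could feed an SVM interest classifier. The exact feature set, fixation model, and units are assumptions for illustration; the thesis's actual features are not specified in the abstract.

```python
def gaze_features(fixations):
    """Summarize one viewed item's gaze record into a feature vector.

    fixations: list of (x, y, duration_ms) tuples; the chosen features
    (fixation count, total dwell time, longest fixation) are illustrative.
    """
    if not fixations:
        return [0, 0.0, 0.0]
    total = sum(d for _, _, d in fixations)
    longest = max(d for _, _, d in fixations)
    return [len(fixations), total, longest]

# Hypothetical gaze record for one video result viewed by a user.
feats = gaze_features([(120, 80, 250), (300, 90, 400), (310, 95, 180)])
# -> [3, 830, 400]: fixation count, total dwell time, longest fixation
```

Vectors like `feats`, labeled by explicit relevance judgements, would then train a standard SVM to predict interest for future viewed items.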
The results show that exploiting heterogeneous implicit feedback from past users is of added value for future users of interactive video retrieval systems.
|
4 |
50,000 Tiny Videos: A Large Dataset for Non-parametric Content-based Retrieval and Recognition
Karpenko, Alexandre (22 September 2009)
This work extends the tiny image data-mining techniques developed by Torralba et al. to videos. A large dataset of over 50,000 videos was collected from YouTube. This is the largest user-labeled research database of videos available to date. We demonstrate that a large dataset of tiny videos achieves high classification precision in a variety of content-based retrieval and recognition tasks using very simple similarity metrics. Content-based copy detection (CBCD) is evaluated on a standardized dataset, and the results are applied to related video retrieval within tiny videos. We use our similarity metrics to improve text-only video retrieval results. Finally, we apply our large labeled video dataset to various classification tasks. We show that tiny videos are better suited for classifying activities than tiny images. Furthermore, we demonstrate that classification can be improved by combining the tiny images and tiny videos datasets.
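The "very simple similarity metrics" in the non-parametric setting above can be illustrated with a sum-of-squared-differences nearest-neighbour lookup over flattened tiny-video descriptors. The descriptor construction and the toy database below are assumptions for illustration, not the paper's exact pipeline.

```python
def ssd(a, b):
    """Sum of squared differences between two flattened descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query, database):
    """Return the label of the database item closest to the query."""
    return min(database, key=lambda item: ssd(query, database[item]))

# Hypothetical averaged tiny-frame vectors for two labeled videos.
db = {
    "beach":  [0.9, 0.8, 0.1],
    "soccer": [0.1, 0.7, 0.9],
}
match = nearest([0.85, 0.75, 0.2], db)
# the query vector is far closer to "beach" under SSD
```

With a large enough labeled collection, even this naive metric supports classification by transferring the labels of the nearest neighbours, which is the non-parametric premise of the tiny-videos approach.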
|
6 |
Video analytics system for surveillance videos
Bai, Yannan (3 July 2018)
Developing an intelligent inspection system that can enhance public safety is challenging. An efficient video analytics system can help monitor unusual events and mitigate possible damage or loss. This thesis aims to analyze surveillance video data, report abnormal activities, and retrieve the corresponding video clips. The surveillance video dataset used in this thesis is derived from the ALERT Dataset, a collection of surveillance videos recorded at airport security checkpoints.
The video analytics system in this thesis can be thought of as a pipelined process. The system takes surveillance video as input and passes it through a series of processing stages: object detection, multi-object tracking, person-bin association, and re-identification. In the end, we obtain trajectories of passengers and baggage in the surveillance videos. Abnormal events, such as taking away others' belongings, are detected and automatically trigger an alarm. The system can also retrieve the corresponding video clips based on a user-defined query.
|
7 |
Semantic Video Retrieval Using High Level Context
Aytar, Yusuf (1 January 2008)
Video retrieval - searching and retrieving videos relevant to a user-defined query - is one of the most popular topics in both real-life applications and multimedia research. This thesis employs concepts from natural language understanding in solving the video retrieval problem. Our main contribution is the utilization of semantic word similarity measures for video retrieval, through trained concept detectors and the visual co-occurrence relations between such concepts. We propose two methods for content-based retrieval of videos: (1) an unsupervised method for retrieving a new concept (a concept unknown to the system, for which no annotation is available) using semantic word similarity and visual co-occurrence, and (2) a method for retrieving videos based on their relevance to a user-defined text query, using semantic word similarity and the visual content of the videos. For evaluation, we mainly used the automatic search and high-level feature extraction test sets of the TRECVID'06 and TRECVID'07 benchmarks. These two data sets consist of 250 hours of multilingual news video captured from American, Arabic, German, and Chinese TV channels. Although our method for retrieving a new concept is unsupervised, it outperforms the trained (supervised) concept detectors on 7 out of 20 test concepts, and overall it performs very close to the trained detectors. Moreover, our visual-content-based semantic retrieval method performs more than 100% better than the text-based retrieval method, showing that visual content alone can yield strong retrieval results.
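One way to read the new-concept method above is as a similarity-weighted combination of existing detector outputs. The sketch below illustrates that reading; the combination rule, the similarity values, and the detector scores are all illustrative assumptions rather than the thesis's exact formulation.

```python
def new_concept_score(detector_scores, similarities):
    """Score a shot for an unseen concept as the semantic-similarity-weighted
    average of the trained detectors' scores for known concepts."""
    num = sum(similarities[c] * s for c, s in detector_scores.items())
    den = sum(similarities[c] for c in detector_scores)
    return num / den if den else 0.0

# Hypothetical semantic word similarities of known concepts to "truck".
sims = {"car": 0.9, "boat": 0.3}
# Hypothetical trained-detector outputs for one video shot.
scores = {"car": 0.8, "boat": 0.1}
score = new_concept_score(scores, sims)
# weighted average: (0.9*0.8 + 0.3*0.1) / (0.9 + 0.3) = 0.625
```

Shots would then be ranked by this score to retrieve the unseen concept without any annotated training data for it.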
|
8 |
Feature based dynamic intra-video indexing
Asghar, Muhammad Nabeel (January 2014)
With the advent of digital imagery and its widespread application in all walks of life, video has become an important component of the world of communication. Video content, ranging from broadcast news, sports, personal videos, surveillance, movies, and entertainment to similar domains, is increasing exponentially in quantity, and retrieving content of interest from such corpora is becoming a challenge. This has led to increased interest among researchers in video structure analysis, feature extraction, content annotation, tagging, video indexing, querying, and retrieval. However, most previous work is confined to specific domains and constrained by quality, processing, and storage capabilities. This thesis presents a novel framework agglomerating established approaches, from feature extraction to browsing, in one content-based video retrieval system. The proposed framework fills the identified gap while satisfying the imposed constraints on processing, storage, quality, and retrieval time. The output entails a framework, methodology, and prototype application that allow the user to efficiently and effectively retrieve content of interest, such as age, gender, and activity, by specifying a relevant query. Experiments have shown plausible results, with an average precision and recall of 0.91 and 0.92 respectively for face detection using a Haar-wavelet-based approach. Precision for age estimation ranges from 0.82 to 0.91 and recall from 0.78 to 0.84. Gender recognition gives better precision for males (0.89) than for females, while recall is higher for females (0.92). Subject activity is detected using the Hough transform and classified using a hidden Markov model. A comprehensive dataset to support similar studies has also been developed as part of the research.
A graphical user interface (GUI) providing a friendly and intuitive front end has been integrated into the developed system to facilitate the retrieval process. Intraclass correlation coefficient (ICC) comparisons show that the system's performance closely resembles that of a human annotator. The performance has been optimised for time and error rate.
|
9 |
Content based video retrieval via spatial-temporal information discovery
Wang, Lei (January 2013)
Content-based video retrieval (CBVR) has been strongly motivated by a variety of real-world applications. Most state-of-the-art CBVR systems are built on the Bag-of-Visual-Words (BovW) framework for visual resource representation and access. The framework, however, ignores the spatial and temporal information contained in videos, which plays a fundamental role in unveiling semantic meaning. This information includes not only the spatial layout of visual content on a still frame (image), but also temporal changes across sequential frames. Specifically, spatially and temporally co-occurring visual words, extracted under the BovW framework, often tend to collaboratively represent objects, scenes, or events in the videos. Discovering this spatial and temporal information would help advance CBVR technology. In this thesis, we propose to explore and analyse spatial and temporal information from a new perspective: (i) co-occurrence of the visual words is formulated as a correlation matrix; (ii) spatial proximity and temporal coherence are analytically and empirically studied to refine this correlation. Following this, a quantitative spatial and temporal correlation (STC) model is defined. The STC discovered from either the query example (denoted QC) or the data collection (denoted DC) is assumed to determine the specificity of the visual words in the retrieval model, i.e., selected words-of-interest are found to be more important for certain topics. Based on this hypothesis, we utilize the STC matrix to establish a novel visual-content similarity measurement method and a query reformulation scheme for the retrieval model. Additionally, the STC also characterizes the context of the visual words, and accordingly an STC-based context similarity measurement is proposed to detect synonymous visual words. The method partially resolves an inherent error of the visual vocabulary under the BovW framework.
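The first step of the perspective above, formulating visual-word co-occurrence as a correlation matrix, can be sketched directly. This counts raw per-frame co-occurrence only; the thesis's spatial-proximity and temporal-coherence refinements are not modeled here, and the frame data are illustrative.

```python
from collections import defaultdict

def cooccurrence_matrix(frames):
    """frames: list of sets of visual-word ids appearing on each frame.
    Returns C where C[w1][w2] counts the frames on which w1 and w2 co-occur;
    refining these counts by spatial proximity and temporal coherence would
    yield the STC model described in the abstract."""
    C = defaultdict(lambda: defaultdict(int))
    for words in frames:
        for w1 in words:
            for w2 in words:
                if w1 != w2:
                    C[w1][w2] += 1
    return C

# Three hypothetical frames, each described by its set of visual words.
frames = [{1, 2, 3}, {1, 2}, {2, 3}]
C = cooccurrence_matrix(frames)
# words 1 and 2 co-occur on two frames; words 1 and 3 on one
```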
Systematic experimental evaluations on the public TRECVID and CC_WEB_VIDEO video collections demonstrate that the proposed STC-based methods can substantially improve the retrieval effectiveness of the BovW framework. The retrieval model based on STC outperforms state-of-the-art CBVR methods on these data collections without additional storage or computational expense. Furthermore, the visual vocabulary rebuilt in this thesis is more compact and effective. The above methods can be incorporated together for effective and efficient CBVR system implementation. Based on the experimental results, it is concluded that the spatial-temporal correlation effectively approximates semantic correlation. This correlation approximation can be utilized for both visual-content representation and similarity measurement, which are key issues for the development of CBVR technology.
|
10 |
Semantics of Video Shots for Content-based Retrieval
Volkmer, Timo, timovolkmer@gmx.net (January 2007)
Content-based video retrieval research combines expertise from many different areas, such as signal processing, machine learning, pattern recognition, and computer vision. As video extends into both the spatial and the temporal domain, we require techniques for the temporal decomposition of footage so that specific content can be accessed. This content may then be semantically classified - ideally in an automated process - to enable filtering, browsing, and searching. An important consideration is that pictorial representation of information may be interpreted differently by individual users because it is less specific than its textual representation. In this thesis, we address several fundamental issues of content-based video retrieval for effective handling of digital footage. Temporal segmentation, the common first step in handling digital video, is the decomposition of video streams into smaller, semantically coherent entities. This is usually performed by detecting the transitions that separate single camera takes. While abrupt transitions - cuts - can be detected relatively well with existing techniques, effective detection of gradual transitions remains difficult. We present our approach to temporal video segmentation, proposing a novel algorithm that evaluates sets of frames using a relatively simple histogram feature. Our technique has been shown to rank among the best existing shot segmentation algorithms in large-scale evaluations. The next step is semantic classification of each video segment to generate an index for content-based retrieval in video databases. Machine learning techniques can be applied effectively to classify video content. However, these techniques require manually classified examples for training before automatic classification of unseen content can be carried out. Manually classifying training examples is not trivial because of the implied ambiguity of visual content.
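The histogram-based cut detection mentioned above can be illustrated with a minimal pairwise-difference detector. This is the textbook abrupt-transition baseline, not the thesis's set-of-frames algorithm; the bin count, threshold, and synthetic frames are assumptions for illustration.

```python
def histogram(frame, bins=8):
    """Coarse grey-level histogram of a frame given as a flat pixel list
    with values in 0..255."""
    h = [0] * bins
    for p in frame:
        h[min(p * bins // 256, bins - 1)] += 1
    return h

def cut_positions(frames, threshold):
    """Flag a cut wherever the L1 histogram difference between consecutive
    frames exceeds the threshold (detects abrupt transitions only; gradual
    transitions need the windowed evaluation the thesis proposes)."""
    cuts = []
    for i in range(1, len(frames)):
        h1, h2 = histogram(frames[i - 1]), histogram(frames[i])
        if sum(abs(a - b) for a, b in zip(h1, h2)) > threshold:
            cuts.append(i)
    return cuts

# Two synthetic 16-pixel frames: a dark shot followed by a bright shot.
dark, bright = [10] * 16, [240] * 16
cuts = cut_positions([dark, dark, bright, bright], threshold=8)
# the only abrupt change is between frame index 1 and 2
```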
We propose an unsupervised learning approach based on latent class modelling in which we obtain multiple judgements per video shot and model the users' response behaviour over a large collection of shots. This technique yields a more generic classification of the visual content. Moreover, it enables the quality assessment of the classification, and maximises the number of training examples by resolving disagreement. We apply this approach to data from a large-scale, collaborative annotation effort and present ways to improve the effectiveness for manual annotation of visual content by better design and specification of the process. Automatic speech recognition techniques along with semantic classification of video content can be used to implement video search using textual queries. This requires the application of text search techniques to video and the combination of different information sources. We explore several text-based query expansion techniques for speech-based video retrieval, and propose a fusion method to improve overall effectiveness. To combine both text and visual search approaches, we explore a fusion technique that combines spoken information and visual information using semantic keywords automatically assigned to the footage based on the visual content. The techniques that we propose help to facilitate effective content-based video retrieval and highlight the importance of considering different user interpretations of visual content. This allows better understanding of video content and a more holistic approach to multimedia retrieval in the future.
|