
Supervised and Unsupervised Learning for Semantics Distillation in Multimedia Processing

In linguistics, "semantics" refers to the intended meaning of natural language, such as words, phrases, and sentences. In this dissertation, the concept of semantics is defined more generally: the intended meaning of information in all multimedia forms. These forms include text in the language domain as well as stationary images and dynamic videos in the vision domain. Specifically, semantics in multimedia is the cognitive information, knowledge, and ideas carried by media content, which can be represented in text, images, and video clips. A narrative story, for example, can be a semantic summary of a novel, or of the movie adapted from that novel. Semantics is thus high-level, abstract knowledge that is independent of multimedia form.

Because multimedia forms differ in expressive power, the same semantics can be represented either redundantly or concisely. The process by which redundantly represented semantics evolves into a concise representation is called "semantic distillation," and it can take place either between different multimedia forms or within a single form.

The booming growth of unorganized and unfiltered information confronts people with information overload, creating high demand for semantic distillation techniques. Opportunities, however, arrive alongside challenges: machine learning and artificial intelligence (AI) are far more advanced today than in the past, and the wide variety of available learning methods has turned countless once-impossible tasks into reality. This dissertation therefore applies machine learning, both supervised and unsupervised, to solve semantic distillation problems.

Despite this promising outlook and the power of modern machine learning, the heterogeneous forms of multimedia, spanning many domains, still pose challenges to semantic distillation. A major challenge is that the definition of "semantics" and the associated processing techniques can differ entirely from one problem to another: different types of multimedia resources introduce different domain-specific limitations and constraints, so obtaining semantics also becomes domain-specific. Taking language and vision as the two major domains, this dissertation therefore addresses four problems covering all combinations of the two:

• Language to Vision Domain: Presentation Storytelling is formulated as the problem of suggesting the most appropriate images from online sources for storytelling, given a text query. We approach the problem with a two-step semantic processing method: the semantics of a simple query is first expanded into a diverse semantic graph, and then distilled from a large number of retrieved web photos down to a few representative ones. This two-step method is powered by a Conditional Random Field (CRF) model and learned in a supervised manner from human-labeled examples (see the first sketch after this list).

• Vision to Language Domain: The second study, Visual Storytelling, formulates the problem of generating a coherent paragraph from a photo stream. Visual storytelling runs in the opposite direction of presentation storytelling: the semantics extracted from a handful of photos is distilled into text. We address this problem by uncovering semantic relationships in the visual domain and distilling them into the language domain with a newly designed Bidirectional Attention Recurrent Neural Network (BARNN) model. In particular, an attention model is embedded in the RNN so that coherence is preserved in the language domain and the output reads like a human-written story (see the second sketch below). The model is trained with supervised deep learning on public datasets.

• Dedicated Vision Domain: To tackle information overload directly in the vision domain, Image Semantic Extraction formulates the problem of selecting a representative subset of a user's photo albums. In the literature, this problem has mostly been approached with unsupervised learning; in this dissertation, we instead develop a novel supervised method to attack the same problem. We treat visual semantics as a quantifiable, measurable variable, and build an encoding-decoding pipeline with Long Short-Term Memory (LSTM) networks to model this quantization. The intuition behind the pipeline is to imitate how a human would read, think, and retell: an LSTM encoder first scans all photos to "read" the semantics they contain, an LSTM decoder then selects the most representative photos to "think" through the gist, and a dedicated residual layer finally revisits the unselected photos to "verify" that the extracted semantics is complete (see the third sketch below).

• Dedicated Language Domain: Distinct from the above studies, this part introduces a different genre of machine learning: unsupervised learning. We address a semantic distillation problem in the language domain, Text Semantic Extraction, in which the semantics of a letter sequence is extracted from printed images. (Abstract shortened by ProQuest.)
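To make the CRF step above concrete, here is a minimal, hypothetical sketch of how representative-photo selection could be framed as sequence labeling with a linear-chain CRF, using the sklearn-crfsuite library. The feature names (query_similarity, near_duplicate, graph_coverage), the keep/skip labels, and the toy data are illustrative assumptions, not the dissertation's actual features or pipeline.

    # Hypothetical framing: label each candidate photo in a ranked result
    # list as "keep" or "skip" with a linear-chain CRF (sklearn-crfsuite).
    import sklearn_crfsuite

    def photo_features(photo):
        # Illustrative per-photo features; not the dissertation's feature set.
        return {
            'query_similarity': photo['sim'],  # text-to-image relevance score
            'near_duplicate': photo['dup'],    # overlaps an earlier photo?
            'graph_coverage': photo['cov'],    # semantic-graph nodes covered
        }

    # One training example: the photo list retrieved for one query,
    # with human labels marking the representative photos.
    X_train = [[photo_features(p) for p in (
        {'sim': 0.91, 'dup': False, 'cov': 0.6},
        {'sim': 0.88, 'dup': True,  'cov': 0.1},
        {'sim': 0.40, 'dup': False, 'cov': 0.3},
    )]]
    y_train = [['keep', 'skip', 'skip']]

    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X_train, y_train)
    print(crf.predict(X_train))  # e.g. [['keep', 'skip', 'skip']]

The chain structure lets each photo's label depend on its neighbors, which is one plausible way to discourage selecting several near-duplicates in a row.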
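For the visual storytelling study, the sketch below shows the general shape of an attention-equipped recurrent decoder over photo features, written in PyTorch. It is a generic bidirectional encoder plus soft attention under assumed dimensions, not a reproduction of BARNN itself.

    # A minimal attention-over-photos story decoder (illustrative, not BARNN).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttnStoryDecoder(nn.Module):
        def __init__(self, feat_dim=512, hid_dim=256, vocab_size=10000):
            super().__init__()
            # A bidirectional GRU gives each photo a context-aware encoding.
            self.encoder = nn.GRU(feat_dim, hid_dim, bidirectional=True, batch_first=True)
            self.attn_score = nn.Linear(3 * hid_dim, 1)  # [photo enc; decoder state] -> score
            self.cell = nn.GRUCell(2 * hid_dim, hid_dim)
            self.to_vocab = nn.Linear(hid_dim, vocab_size)

        def forward(self, photo_feats, max_steps=20):
            # photo_feats: (batch, n_photos, feat_dim) precomputed CNN features
            enc, _ = self.encoder(photo_feats)           # (B, N, 2H)
            h = enc.new_zeros(enc.size(0), self.cell.hidden_size)
            logits = []
            for _ in range(max_steps):
                # Attend over photos conditioned on the current decoder state.
                scores = self.attn_score(torch.cat(
                    [enc, h.unsqueeze(1).expand(-1, enc.size(1), -1)], dim=-1))
                weights = F.softmax(scores, dim=1)       # (B, N, 1)
                context = (weights * enc).sum(dim=1)     # (B, 2H)
                h = self.cell(context, h)
                logits.append(self.to_vocab(h))
            return torch.stack(logits, dim=1)            # (B, max_steps, vocab_size)

    feats = torch.randn(2, 5, 512)            # two photo streams of five photos each
    print(AttnStoryDecoder()(feats).shape)    # torch.Size([2, 20, 10000])

Re-attending over the whole photo stream at every word step is what keeps the generated paragraph tied to the photos; the dissertation's BARNN builds on this kind of mechanism.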
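Finally, a minimal sketch of the read-think-and-retell intuition behind the image semantic extraction pipeline: an LSTM "reader" summarizes the album, a second LSTM revisits each photo conditioned on that summary to score it, and a simple residual path over the raw features stands in for the dissertation's "verify" layer. The dimensions and the scoring head are assumptions made for illustration.

    # A minimal read-think-retell selector (illustrative, not the exact model).
    import torch
    import torch.nn as nn

    class ReadThinkRetell(nn.Module):
        def __init__(self, feat_dim=512, hid_dim=256):
            super().__init__()
            self.reader = nn.LSTM(feat_dim, hid_dim, batch_first=True)    # "read" the whole album
            self.reteller = nn.LSTM(feat_dim, hid_dim, batch_first=True)  # "think" with the album summary
            self.score = nn.Linear(hid_dim + feat_dim, 1)                 # per-photo keep score
            self.verify = nn.Linear(feat_dim, 1)                          # residual "verify" path

        def forward(self, album):
            # album: (batch, n_photos, feat_dim) precomputed CNN features
            _, summary = self.reader(album)          # album-level (h, c) summary state
            dec, _ = self.reteller(album, summary)   # revisit each photo given the summary
            keep = self.score(torch.cat([dec, album], dim=-1)).squeeze(-1)
            # Residual pass over raw features, loosely mimicking the layer that
            # rechecks unselected photos for missed semantics.
            return keep + self.verify(album).squeeze(-1)

    album = torch.randn(1, 8, 512)                        # one album of eight photos
    keep_prob = torch.sigmoid(ReadThinkRetell()(album))   # e.g. pick the top-k photos
    print(keep_prob.shape)                                # torch.Size([1, 8])

Trained against human-selected subsets (with binary cross-entropy on the keep scores, for instance), such a pipeline turns subset selection into the supervised problem the dissertation describes.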

Identifier: oai:union.ndltd.org:PROQUEST/oai:pqdtoai.proquest.com:10932367
Date: 19 October 2018
Creators: Liu, Yu
Publisher: State University of New York at Buffalo
Source Sets: ProQuest.com
Language: English
Detected Language: English
Type: thesis
