Return to search

Deep Learning for Unstructured Data by Leveraging Domain Knowledge

Unstructured data such as texts, strings, images, audios, videos are everywhere due to the social interaction on the Internet and the high-throughput technology in sciences, e.g., chemistry and biology. However, for traditional machine learning algorithms, classifying a text document is far more difficult than classifying a data entry in a spreadsheet. We have to convert the unstructured data into some numeric vectors which can then be understood by machine learning algorithms. For example, a sentence is first converted to a vector of word counts, and then fed into a classification algorithm such as logistic regression and support vector machine. The creation of such numerical vectors is very challenging and difficult. Recent progress in deep learning provides us a new way to jointly learn features and train classifiers for unstructured data. For example, recurrent neural networks proved successful at learning from a sequence of word indices; convolutional neural networks are effective to learn from videos, which are sequences of pixel matrices. Our research focuses on developing novel deep learning approaches for text and graph data. Breakthroughs using deep learning have been made during the last few years for many core tasks in natural language processing, such as machine translation, POS tagging, named entity recognition, etc. However, when it comes to informal and noisy text data, such as tweets, HTMLs, OCR, there are two major issues with modern deep learning technologies. First, deep learning requires large amount of labeled data to train an effective model; second, neural network architectures that work with natural language are not proper with informal text. In this thesis, we address the two important issues and develop new deep learning approaches in four supervised and unsupervised tasks with noisy text. We first present a deep feature engineering approach for informative tweets discovery during the emerging disasters. We propose to use unlabeled microblogs to cluster words into a limited number of clusters and use the word clusters as features for tweets discovery. Our results indicate that when the number of labeled tweets is 100 or less, the proposed approach is superior to the standard classification based on the bag or words feature representation. We then introduce a human-in-the-loop (HIL) framework for entity identification from noisy web text. Our work explores ways to combine the expressive power of REs, ability of deep learning to learn from large data into a new integrated framework for entity identification from web data. The evaluation on several entity identification problems shows that the proposed framework achieves very high accuracy while requiring only a modest human involvement. We further extend the framework of entity identification to an iterative HIL framework that addresses the entity recognition problem. We particularly investigate how human invest their time when a user is allowed to choose between regex construction and manual labeling. Finally, we address a fundamental problem in the text mining domain, i.e, embedding of rare and out-of-vocabulary (OOV) words, by refining word embedding models and character embedding models in an iterative way. We illustrate the simplicity but effectiveness of our method when applying it to online professional profiles allowing noisy user input. Graph neural networks have been shown great success in the domain of drug design and material sciences, where organic molecules and crystal structures of materials are represented as attributed graphs. A deep learning architecture that is capable of learning from graph nodes and graph edges is crucial for property estimation of molecules. In this dissertation, We propose a simple graph representation for molecules and three neural network architectures that is able to directly learn predictive functions from graphs. We discover that, it is true graph networks are superior than feature-driven algorithms for formation energy prediction. However, the superiority can not be reproduced on band gap prediction. We also discovered that our proposed simple shallow neural networks perform comparably with the state-of-the-art deep neural networks. / Computer and Information Science

Identiferoai:union.ndltd.org:TEMPLE/oai:scholarshare.temple.edu:20.500.12613/3931
Date January 2019
CreatorsZhang, Shanshan
ContributorsVucetic, Slobodan, Kant, Krishna, Dragut, Eduard Constantin, Yan, Qimin
PublisherTemple University. Libraries
Source SetsTemple University
LanguageEnglish
Detected LanguageEnglish
TypeThesis/Dissertation, Text
Format113 pages
RightsIN COPYRIGHT- This Rights Statement can be used for an Item that is in copyright. Using this statement implies that the organization making this Item available has determined that the Item is in copyright and either is the rights-holder, has obtained permission from the rights-holder(s) to make their Work(s) available, or makes the Item available under an exception or limitation to copyright (including Fair Use) that entitles it to make the Item available., http://rightsstatements.org/vocab/InC/1.0/
Relationhttp://dx.doi.org/10.34944/dspace/3913, Theses and Dissertations

Page generated in 0.0021 seconds