541 |
Forced Attention for Image Captioning. Hemanth Devarapalli (5930603), 17 January 2019.
Automatic generation of captions for a given image is an active research area in Artificial Intelligence. Architectures have evolved from classical machine learning applied to image metadata to neural networks. Two styles of architecture have emerged in the neural network space for image captioning: the Encoder-Attention-Decoder architecture and the Transformer architecture. This study modifies the attention mechanism to allow any object to be specified. An archetypical Encoder-Attention-Decoder architecture (Show, Attend, and Tell (Xu et al., 2015)) is employed as a baseline, and a modification of the Show, Attend, and Tell architecture is proposed. Both architectures are evaluated on the MSCOCO (Lin et al., 2014) dataset with seven metrics: BLEU-1, 2, 3, 4 (Papineni, Roukos, Ward & Zhu, 2002), METEOR (Banerjee & Lavie, 2005), ROUGE-L (Lin, 2004), and CIDEr (Vedantam, Zitnick & Parikh, 2015). Finally, the statistical significance of the results is evaluated by performing paired t-tests.
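To make the proposed modification concrete, the sketch below shows Show-Attend-Tell-style soft attention over CNN annotation vectors, plus a hypothetical "forced" variant that masks the attention weights to a user-specified object region and renormalizes them. The weight matrices, the masking scheme, and all dimensions are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def soft_attention(features, hidden, Wa, Ua, w):
    """Show-Attend-Tell style soft attention over L annotation vectors.
    features: (L, D) CNN annotation vectors; hidden: (H,) decoder state."""
    scores = np.tanh(features @ Wa + hidden @ Ua) @ w   # (L,) alignment scores
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                # attention weights
    return alpha @ features, alpha                      # context vector z_t, weights

def forced_attention(features, hidden, Wa, Ua, w, region_mask):
    """Hypothetical 'forced' variant: zero the weights outside a
    user-specified object region, then renormalize."""
    _, alpha = soft_attention(features, hidden, Wa, Ua, w)
    alpha = alpha * region_mask                         # keep only the chosen region
    alpha /= alpha.sum() + 1e-12
    return alpha @ features, alpha

# toy example: 196 spatial locations (14x14 grid), 512-d features
L, D, H = 196, 512, 256
rng = np.random.default_rng(0)
feats, h = rng.standard_normal((L, D)), rng.standard_normal(H)
Wa, Ua, w = rng.standard_normal((D, H)), rng.standard_normal((H, H)), rng.standard_normal(H)
mask = np.zeros(L); mask[:49] = 1.0                     # pretend the object covers these cells
z, a = forced_attention(feats, h, Wa, Ua, w, mask)
```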
|
542 |
Model-Based Iterative Reconstruction and Direct Deep Learning for One-Sided Ultrasonic Non-Destructive Evaluation. Hani A. Almansouri (5929469), 16 January 2019.
One-sided ultrasonic non-destructive evaluation (UNDE) is extensively used to inspect and maintain structures, detecting defects and flaws that could affect the performance of power plants, such as nuclear power plants. Most UNDE systems send acoustic pulses into the structure of interest, measure the received waveform, and use an algorithm to reconstruct the quantity of interest. The most widely used algorithm in UNDE systems is the synthetic aperture focusing technique (SAFT) because it produces acceptable results in real time. A few regularized inversion techniques with linear models have been proposed that can improve on SAFT, but they tend to make simplifying assumptions that introduce artifacts and do not address how to obtain reconstructions from large real data sets. In this thesis, we present two studies. The first covers the model-based iterative reconstruction (MBIR) technique, which resolves some of the issues in SAFT and current linear regularized inversion techniques; the second covers the direct deep learning (DDL) technique, which further resolves issues related to non-linear interactions between the ultrasound signal and the specimen.
In the first study, we propose a model-based iterative
reconstruction (MBIR) algorithm designed for scanning UNDE systems. MBIR
reconstructs the image by optimizing a cost function that contains two terms:
the forward model that models the measurements and the prior model that models
the object. To further reduce some of the artifacts in the results, we enhance
the forward model of MBIR to account for the direct arrival artifacts and the
isotropic artifacts. The direct arrival signals are the signals received
directly from the transmitter without being reflected. These signals contain no
useful information about the specimen and produce high amplitude artifacts in
regions close to the transducers. We resolve this issue by modeling these direct
arrival signals in the forward model to reduce their artifacts while
maintaining information from reflections of other objects. Next, the isotropic
artifacts appear when the transmitted signal is assumed to propagate in all
directions equally. Therefore, we modify our forward model to resolve this issue
by modeling the anisotropic propagation. Next, because of the significant
attenuation of the transmitted signal as it propagates through deeper regions,
the reconstruction of deeper regions tends to be much dimmer than that of closer regions. Therefore, we combine the forward model with a spatially variant prior
model to account for the attenuation by reducing the regularization as the
pixel gets deeper. Next, for scanning large structures, multiple scans are
required to cover the whole field of view. Typically, these scans are performed
in raster order, so adjacent scans share useful correlations. Reconstructing each scan individually and stitching the results conventionally is inefficient because it can produce stitching artifacts and ignores extra information from adjacent scans. We present an algorithm to
jointly reconstruct measurements from large data sets that reduces the
stitching artifacts and exploits useful information from adjacent scans. Next,
using simulated and extensive experimental data, we show MBIR results and
demonstrate how we can improve over SAFT as well as existing regularized
inversion techniques. However, even with this improvement, MBIR still results
in some artifacts caused by the inherent non-linearity of the interaction
between the ultrasound signal and the specimen.
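For reference, MBIR reconstructions of this kind are commonly written as the minimizer of a two-term cost; a generic form, with notation assumed here rather than taken from the thesis, is

$$\hat{x} \;=\; \underset{x}{\arg\min}\;\Big\{\tfrac{1}{2}\,\lVert y - Ax \rVert_{\Lambda}^{2} \;+\; R(x)\Big\},$$

where y is the measured data, A is the linear forward operator, Λ is a noise weighting, and R(x) is the prior model. On this reading, the direct-arrival and anisotropy enhancements described above enter through A, while the spatially variant regularization enters through R.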
In the second study, we propose DDL, a non-iterative model-based
reconstruction method for inverting measurements that are based on non-linear
forward models for ultrasound imaging. Our approach involves obtaining an
approximate estimate of the reconstruction using a simple linear back-projection
and training a deep neural network to refine this to the actual reconstruction.
While the proposed technique shows significant enhancement over current techniques on simulated data, its performance on experimental data suffers from a modeling mismatch between the simulated training data and the real data. We
propose an effective solution that can reduce the effect of this modeling
mismatch by adding noise to the simulation input of the training set before
simulation. This solution trains the neural network on the general features of
the system rather than specific features of the simulator and can act as a
regularization to the neural network. Another issue, similar to the one in MBIR, is caused by the attenuation of deeper reflections. Therefore, we
propose a spatially variant amplification technique applied to the
back-projection to amplify deeper regions. Next, to reconstruct from a large
field of view that requires multiple scans, we propose a joint deep neural
network technique to jointly reconstruct an image from these multiple scans.
Finally, we apply DDL to simulated and experimental ultrasound data to
demonstrate significant improvements in image quality compared to the
delay-and-sum approach and the linear model-based reconstruction approach.
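A minimal sketch of the DDL training loop as described above, including the noise-injection trick for reducing the simulation-to-experiment mismatch; the network architecture and the `simulate`/`back_project` interfaces are assumptions for illustration:

```python
import torch
import torch.nn as nn

class RefinerCNN(nn.Module):
    """Small CNN mapping a crude linear back-projection to a refined image.
    The architecture is illustrative; the thesis's DDL network may differ."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

def train_step(model, opt, phantom, simulate, back_project, noise_std=0.05):
    """One DDL-style training step: perturb the simulator input with noise
    (the mismatch-reduction trick), simulate, back-project, then refine."""
    noisy_phantom = phantom + noise_std * torch.randn_like(phantom)
    measurements = simulate(noisy_phantom)   # non-linear forward model (assumed given)
    crude = back_project(measurements)       # simple linear back-projection
    loss = nn.functional.mse_loss(model(crude), phantom)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```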
|
543 |
Low-Cost and Scalable Visual Drone Detection System Based on Distributed Convolutional Neural Network. Hyun Hwang (5930672), 20 December 2018.
Recently, with the advancement in drone technology, more and more hobby drones are being manufactured and sold across the world. However, these drones can be repurposed for illicit activities such as hostile-load delivery, and at the moment few systems are readily available for detecting and intercepting hostile drones. Although researchers at Purdue University built a working prototype of a drone interceptor system, it was not ready for the general public because of its proof-of-concept nature and the high price of the military-grade radar it used. Substituting such high-cost elements with low-cost ones is essential to make a drone interception system affordable enough for large-scale deployment.

This study aims to provide an affordable alternative to an expensive, high-precision radar system: a Convolutional Neural Network based drone detection system that can be built from multiple low-cost single-board computers. The experiment examines the feasibility of the proposed system and evaluates the accuracy of drone detection in a controlled environment.
|
544 |
Deep learning based approaches for imitation learning. Hussein, Ahmed. January 2018.
Imitation learning refers to an agent's ability to mimic a desired behaviour by learning from observations. The field is rapidly gaining attention due to recent advances in computational and communication capabilities as well as rising demand for intelligent applications. The goal of imitation learning is to describe the desired behaviour by providing demonstrations rather than instructions. This enables agents to learn complex behaviours with general learning methods that require minimal task-specific information. However, imitation learning faces many challenges. The objective of this thesis is to advance the state of the art in imitation learning by adopting deep learning methods to address two major challenges of learning from demonstrations.

The first challenge is representing the demonstrations in a manner that is adequate for learning. We propose novel Convolutional Neural Network (CNN) based methods to automatically extract feature representations from raw visual demonstrations and learn to replicate the demonstrated behaviour. This alleviates the need for task-specific feature extraction and provides a general learning process that is adequate for multiple problems. The second challenge is generalizing a policy over situations unseen in the training demonstrations. This is a common problem because demonstrations typically show the best way to perform a task and do not offer any information about recovering from suboptimal actions. Several methods are investigated to improve the agent's generalization ability based on its initial performance. Our contributions in this area are threefold. Firstly, we propose an active data aggregation method that queries the demonstrator in situations of low confidence. Secondly, we investigate combining learning from demonstrations and reinforcement learning: a deep reward shaping method is proposed that learns a potential reward function from demonstrations. Finally, memory architectures in deep neural networks are investigated to provide context to the agent when taking actions; using recurrent neural networks addresses the dependencies within the state-action sequences taken by the agent.

The experiments are conducted in simulated environments on 2D and 3D navigation tasks that are learned from raw visual data, as well as a 2D soccer simulator. The proposed methods are compared to state-of-the-art deep reinforcement learning methods. The results show that deep learning architectures can learn suitable representations from raw visual data and effectively map them to atomic actions. The proposed methods for addressing generalization show improvements over using supervised learning and reinforcement learning alone. The results are thoroughly analysed to identify the benefits of each approach and the situations in which it is most suitable.
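As a rough illustration of the first contribution, the sketch below shows a DAgger-style aggregation loop in which the demonstrator is queried only in low-confidence states; all interfaces here (`policy.predict` returning a confidence, the `env` and `expert` callables) are assumptions, not the thesis's API:

```python
def active_aggregation(policy, expert, env, threshold=0.8, episodes=10):
    """DAgger-style loop sketch: the agent acts, and states where the
    policy's confidence falls below a threshold are sent to the
    demonstrator for action labels."""
    new_states, new_labels = [], []
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action, confidence = policy.predict(state)
            if confidence < threshold:        # uncertain: query the expert
                new_states.append(state)
                new_labels.append(expert(state))
            state, done = env.step(action)
    return new_states, new_labels             # aggregate into the training set
```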
|
545 |
Towards Developing Computer Vision Algorithms and Architectures for Real-world Applications. January 2018.
Computer vision technology automatically extracts high-level, meaningful information from visual data such as images or videos, and object recognition and detection algorithms are essential in most computer vision applications. In this dissertation, we focus on algorithms for real-life computer vision applications, presenting innovative algorithms for object segmentation and feature extraction for object and action recognition in video data, sparse feature selection algorithms for medical image analysis, and automated feature extraction using convolutional neural networks for blood cancer grading.
To detect and classify objects in video, the objects must first be separated from the background, and discriminant features are then extracted from the region of interest before being fed to a classifier. Effective object segmentation and feature extraction are often application specific and pose major challenges for object detection and classification tasks. In this dissertation, we present an optical flow based ROI generation algorithm for segmenting moving objects in video data, applicable to surveillance and self-driving vehicles. Optical flow can also serve as a feature for human action recognition, and we use optical flow features in a pre-trained convolutional neural network to improve the performance of human action recognition algorithms. Both algorithms outperformed the state of the art at the time.
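One common way to turn optical flow into CNN-ready features, shown here as an illustration rather than the dissertation's exact encoding, is to compute dense flow and map direction and magnitude to an image:

```python
import cv2
import numpy as np

def flow_feature(prev_gray, next_gray):
    """Dense optical flow (Farneback) converted to a 3-channel image
    that a pre-trained image CNN can consume."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*prev_gray.shape, 3), np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2                         # direction -> hue
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)                 # feed this to the CNN
```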
Medical images and videos pose unique challenges for image understanding, mainly because tissues and cells are often irregularly shaped, colored, and textured, and hand-selecting the most discriminant features is difficult; thus an automated feature selection method is desired. Sparse learning is a technique to extract the most discriminant and representative features from raw visual data. However, sparse learning with L1 regularization only takes sparsity in the feature dimension into consideration; we improve the algorithm so it selects the type of features as well, and less important or noisy feature types are entirely removed from the feature set. We demonstrate this algorithm on endoscopy images to detect unhealthy abnormalities in the esophagus and stomach, such as ulcers and cancer. Besides the sparsity constraint, other application-specific constraints and prior knowledge may also need to be incorporated in the loss function of sparse learning to obtain the desired results. We demonstrate how to incorporate a similar-inhibition constraint and gaze and attention priors in sparse dictionary selection for gastroscopic video summarization, enabling intelligent key frame extraction from gastroscopic video data. With recent advancements in multi-layer neural networks, automatic end-to-end feature learning has become feasible. The convolutional neural network mimics the mammalian visual cortex and can extract the most discriminant features automatically from training samples. We use a convolutional neural network with a hierarchical classifier to grade the severity of follicular lymphoma, a type of blood cancer, reaching 91% accuracy, on par with analysis by expert pathologists.
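One standard way to express the feature-type selection described above is a group-sparsity (group lasso) penalty, with the notation assumed here:

$$\min_{w}\; \mathcal{L}(w) \;+\; \lambda \sum_{g=1}^{G} \lVert w_{g} \rVert_{2},$$

where w_g collects the coefficients of feature type g. The L2 norm within each group, combined with the L1-like sum across groups, drives entire groups to zero, removing whole feature types rather than individual features.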
Developing real-world computer vision applications is more than developing core vision algorithms to extract and understand information from visual data; it is also subject to many practical requirements and constraints, such as hardware and computing infrastructure, cost, robustness to lighting changes and deformation, and ease of use and deployment. The general processing pipeline and system architecture for computer vision based applications share many design principles. We developed common processing components and a generic framework for computer vision applications, as well as a versatile scale-adaptive template matching algorithm for object detection. We demonstrate the design principles and best practices by developing and deploying a complete computer vision application in real life, building a multi-channel water level monitoring system, where the techniques and design methodology can be generalized to other real-life applications. General software engineering principles, such as modularity, abstraction, robustness to requirement changes, and generality, are all demonstrated in this research. Doctoral Dissertation, Computer Science, 2018.
|
546 |
Algorithm and Hardware Design for Efficient Deep Learning Inference. January 2018.
Deep learning (DL) has proved itself to be one of the most important developments to date, with far-reaching impacts in numerous fields such as robotics, computer vision, surveillance, speech processing, machine translation, and finance. Deep networks are now widely used for countless applications because of their ability to generalize to real-world data, their robustness to noise in previously unseen data, and their high inference accuracy. With the ability to learn useful features from raw sensor data, deep learning algorithms have outperformed traditional AI algorithms and pushed the boundaries of what can be achieved with AI. In this work, we demonstrate the power of deep learning by developing a neural network to automatically detect cough instances from audio recorded in unconstrained environments. For this, 24-hour-long recordings from 9 different patients were collected and carefully labeled by medical personnel. A pre-processing algorithm is proposed to convert the event-based cough dataset into a more informative dataset with start and end times of coughs, and to introduce data augmentation for regularizing the training procedure. The proposed neural network achieves 92.3% leave-one-out accuracy on data captured in the real world.
Deep neural networks are composed of multiple layers that are compute- and memory-intensive, which makes it difficult to execute these algorithms in real time with low power consumption on existing general-purpose computers. In this work, we propose hardware accelerators for a traditional AI algorithm based on random forest trees and for two representative deep convolutional neural networks (AlexNet and VGG). With the proposed acceleration techniques, a ~30x performance improvement over a CPU was achieved for random forest trees. For deep CNNs, we demonstrate that much higher performance can be achieved through architecture space exploration, using optimization algorithms that take system-level performance and area models of hardware primitives as inputs and minimize latency under given resource constraints. With this method, ~30 GOPS was achieved on Stratix V FPGA boards.
Hardware acceleration of DL algorithms alone is not always the most efficient way, nor sufficient, to achieve the desired performance; there is huge headroom for performance improvement when algorithms are designed with hardware limitations and bottlenecks in mind. This work achieves hardware-software co-optimization for the Non-Maximum Suppression (NMS) algorithm using the proposed algorithmic changes and hardware architecture.
With CMOS scaling coming to an end and memory bandwidth bottlenecks increasing, CMOS-based systems might not scale enough to accommodate the requirements of more complicated and deeper neural networks in the future. In this work, we explore RRAM crossbars and arrays as a compact, high-performing, and energy-efficient alternative to CMOS accelerators for deep learning training and inference. We propose and implement RRAM periphery read and write circuits and achieve a ~3000x performance improvement in online dictionary learning compared to a CPU.
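The appeal of the crossbar is that it computes a matrix-vector product in a single analog step: applying voltages on the rows and sensing the column currents yields i = G^T v by Ohm's and Kirchhoff's laws. A toy numerical sketch, with a simple log-normal conductance-variation model standing in for device non-idealities (an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.uniform(0, 1, size=(64, 32))   # target weights mapped to conductances G
V = rng.uniform(0, 0.2, size=64)       # input voltages applied to the rows

I_ideal = V @ W                        # ideal column currents: i_j = sum_i v_i * g_ij

# crude non-ideality model: log-normal conductance variation per cell
G_real = W * rng.lognormal(mean=0.0, sigma=0.1, size=W.shape)
I_real = V @ G_real

print("relative error:", np.linalg.norm(I_real - I_ideal) / np.linalg.norm(I_ideal))
```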
This work also examines realistic RRAM devices and their non-idealities. We perform an in-depth study of the effects of RRAM non-idealities on inference accuracy when a pretrained model is mapped to RRAM based accelerators. To mitigate this issue, we propose Random Sparse Adaptation (RSA), a novel scheme that tunes the model to compensate for the faults of the RRAM array onto which it is mapped. Our proposed method achieves inference accuracy much higher than the traditional Read-Verify-Write (R-V-W) method, and RSA can also recover lost inference accuracy 100x to 1000x faster than R-V-W. Using 32-bit high-precision RSA cells, we achieved ~10% higher accuracy on faulty RRAM arrays than can be achieved by mapping a deep network to a 32-level RRAM array with no variations. Doctoral Dissertation, Electrical Engineering, 2018.
|
547 |
Learning space-time structures for action recognition and localization. Ma, Shugao. 12 August 2016.
In this thesis the problem of automatic human action recognition and localization in videos is studied. The goal is to recognize the category of the human action happening in the video and to localize the action in space and/or time. This problem is challenging due to the complexity of human actions, large intra-class variations, and the distraction of backgrounds. Human actions are inherently structured patterns of body movements, but past works are inadequate in learning the space-time structures of human actions and exploiting them for better recognition and localization.

In this thesis new methods are proposed that exploit such space-time structures for effective human action recognition and localization in videos, including sports videos, YouTube videos, TV programs, and movies. A new local space-time video representation, hierarchical Space-Time Segments, is first proposed. Using this representation, ensembles of hierarchical spatio-temporal trees, discovered directly from the training videos, are constructed to model the hierarchical, spatial, and temporal structures of human actions. This approach achieves promising performance in action recognition and localization on challenging benchmark datasets. Moreover, the discovered trees show good cross-dataset generalizability: trees learned on one dataset can be used to recognize and localize similar actions in another dataset.

To handle large-scale data, a deep model is explored that learns the temporal progression of actions using Long Short-Term Memory (LSTM), a type of Recurrent Neural Network (RNN). Two novel ranking losses are proposed to train the model to better capture the temporal structures of actions for accurate action recognition and temporal localization. This model achieves state-of-the-art performance on a large-scale video dataset. A deep model usually employs a Convolutional Neural Network (CNN) to learn visual features from video frames. The problem of utilizing web action images for training such a CNN is also studied: training CNNs typically requires a large number of training videos, but the findings of this study show that web action images can be utilized as additional training data to significantly reduce the burden of video training data collection.
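As a sketch of the deep model described above (the architecture, dimensions, and the particular ranking loss are assumptions, not the thesis's exact design), an LSTM over per-frame features can be trained with an auxiliary loss that encourages the true class's detection score to be non-decreasing as the action progresses:

```python
import torch
import torch.nn as nn

class ActionLSTM(nn.Module):
    """Illustrative LSTM over per-frame CNN features for action recognition."""
    def __init__(self, feat_dim=2048, hidden=512, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)
    def forward(self, frame_feats):            # (batch, time, feat_dim)
        out, _ = self.lstm(frame_feats)
        return self.head(out)                  # per-frame scores: (batch, time, classes)

def monotonic_ranking_loss(scores, label):
    """Ranking-style auxiliary loss in the spirit described above (an assumption):
    penalize any drop in the true class's score over time."""
    s = scores[:, :, label]                    # (batch, time) scores of the true class
    return torch.relu(s[:, :-1] - s[:, 1:]).mean()

model = ActionLSTM()
scores = model(torch.randn(2, 30, 2048))       # 2 clips, 30 frames each
loss = monotonic_ranking_loss(scores, label=5)
```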
|
548 |
Cost-Sensitive Selective Classification and its Applications to Online Fraud Management. January 2019.
Fraud is defined as the use of deception for illegal gain by hiding the true nature of the activity. Organizations lose around $3.7 trillion in revenue to financial crimes and fraud worldwide, and these crimes significantly affect all levels of society. In this dissertation, I focus on credit card fraud in online transactions. Every online transaction carries a fraud risk, and it is the merchant's liability to detect and stop fraudulent transactions. Merchants utilize various mechanisms to prevent and manage fraud, such as automated fraud detection systems and manual transaction reviews by expert fraud analysts. Many proposed solutions focus on fraud detection accuracy while ignoring financial considerations, and the highly effective manual review process is overlooked.

First, I propose the Profit Optimizing Neural Risk Manager (PONRM), a selective classifier that (a) constitutes optimal collaboration between machine learning models and human expertise under industrial constraints and (b) is cost- and profit-sensitive. I suggest directions on how to characterize fraudulent behavior and assess the risk of a transaction, and I show that my framework outperforms cost-sensitive and cost-insensitive baselines on three real-world merchant datasets.

While PONRM can work with many supervised learners and obtain convincing results, using probability outputs directly from the trained model can pose problems, especially in deep learning, as the softmax output is not a true uncertainty measure. This phenomenon, together with the wide and rapid adoption of deep learning by practitioners, has brought unintended consequences in many situations, such as the infamous case of Google Photos' racist image recognition algorithm, and has necessitated the use of quantified uncertainty for each prediction. There have been recent efforts toward quantifying uncertainty in conventional deep learning methods (e.g., dropout as Bayesian approximation); however, their optimal use in decision making is often overlooked and understudied. Thus, I present a mixed-integer programming framework for selective classification, called MIPSC, that investigates and combines model uncertainty and the predictive mean to identify optimal classification and rejection regions. I also extend this framework to cost-sensitive settings (MIPCSC) and apply it to the critical real-world problem of online fraud management, showing that my approach significantly outperforms industry-standard methods in real-world settings. Doctoral Dissertation, Computer Science, 2019.
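To illustrate the flavor of cost-sensitive selective classification (not PONRM's or MIPSC's actual objective), the toy router below compares the expected cost of auto-approving, auto-declining, or sending each transaction to a human analyst, and picks the cheapest option:

```python
import numpy as np

def route_transactions(probs, review_cost, chargeback_amounts, margin_rate=0.05):
    """Toy cost-sensitive selective classifier. The cost structure
    (flat review cost, fixed margin rate) is an illustrative assumption."""
    p_fraud = np.asarray(probs)
    amounts = np.asarray(chargeback_amounts)
    cost_approve = p_fraud * amounts                      # expected chargeback loss
    cost_decline = (1 - p_fraud) * margin_rate * amounts  # lost profit on good orders
    cost_review = np.full_like(p_fraud, review_cost)      # analyst time
    return np.argmin(np.stack([cost_approve, cost_decline, cost_review]), axis=0)

# 0 = approve, 1 = decline, 2 = send to manual review
actions = route_transactions([0.01, 0.6, 0.15], review_cost=2.0,
                             chargeback_amounts=[50.0, 400.0, 120.0])
print(actions)  # low-risk cheap order approved; risky or ambiguous ones reviewed
```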
|
549 |
GeoAI-enhanced Techniques to Support Geographical Knowledge Discovery from Big Geospatial Data. January 2019.
Big data containing geo-referenced attributes have significantly transformed the way geospatial data are processed and analyzed. Compared with the benefits expected in a data-rich environment, more data have not always contributed to more accurate analysis: "big but valueless" has become a critical concern in the GIScience and data-driven geography community. As a highly utilized GeoAI technique, deep learning models designed for processing geospatial data integrate powerful computing hardware and deep neural networks across various dimensions of geography to effectively discover representations of the data. However, limitations of these deep learning models have also been reported; for instance, people may have to spend much time preparing training data before implementing a deep learning model. The objective of this dissertation research is to promote state-of-the-art deep learning models in discovering the representation, value, and hidden knowledge of GIS and remote sensing data, through three research approaches. The first methodological framework uses convolutional neural network (CNN) powered shape classification to unify multifarious shadow shapes into a limited number of representative shadow patterns for efficient shadow-based building height estimation. The second integrates semantic analysis into a framework of various state-of-the-art CNNs to support human-level understanding of map content. The final research approach focuses on normalizing geospatial domain knowledge to promote the transferability of a CNN model to land-use/land-cover classification; it reports a method designed to discover detailed land-use/land-cover types that might be challenging for a state-of-the-art CNN model that previously performed well only on land-cover classification. Doctoral Dissertation, Geography, 2019.
|
550 |
An Investigation into Modern Facial Expressions Recognition by a Computer. January 2019.
Facial expression recognition using convolutional neural networks has been actively researched over the last decade due to its many applications in the human-computer interaction domain. Because convolutional neural networks have an exceptional ability to learn, they outperform methods using handcrafted features. Though state-of-the-art models achieve high accuracy on lab-controlled images, they still struggle with in-the-wild expressions, which are captured in real-world settings and are natural; wild databases present many challenges such as occlusion, variations in lighting conditions, and head poses. In this work, I address these challenges and propose a new model containing a hybrid convolutional neural network with a fusion layer. The fusion layer utilizes a combination of the knowledge obtained from two different domains for enhanced feature extraction from in-the-wild images. I tested my network on two publicly available in-the-wild datasets, RAF-DB and AffectNet, and then tested the trained model on the CK+ dataset for a cross-database evaluation study. My model achieves results comparable to state-of-the-art methods; I argue that it performs well on such datasets because it learns features from two different domains rather than a single domain. Lastly, I present a real-time facial expression recognition system as part of this work, where images are captured in real time using a laptop camera and passed to the model to obtain a facial expression label, indicating that the proposed model has low processing time and can produce output almost instantly. Masters Thesis, Computer Science, 2019.
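A minimal sketch of the fusion idea, assuming concatenation of two domain-specific feature vectors followed by a dense classifier; the dimensions, dropout, and seven expression classes are illustrative choices, not the thesis's exact fusion layer:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Combines feature vectors learned in two different domains
    (e.g., two backbone CNNs) via concatenation plus a dense layer."""
    def __init__(self, dim_a=512, dim_b=512, num_classes=7):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(dim_a + dim_b, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )
    def forward(self, feat_a, feat_b):
        return self.fuse(torch.cat([feat_a, feat_b], dim=1))

# batch of 4 images, features from two domain-specific backbones
logits = FusionHead()(torch.randn(4, 512), torch.randn(4, 512))  # (4, 7)
```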
|