1.
Modeling Language, Social, and Behavioral Abstractions for Microblog Political Discourse Classification
Kristen M Johnson (7047653), 14 August 2019
Politicians are increasingly using social media platforms, specifically the microblog Twitter, to interact with the public and express their stances on current policy issues. Given this nearly one-on-one communication between politician and citizen, it is imperative to develop automatic tools for analyzing how politicians express their stances and frame issues, in order to understand how they influence the public. Prior to my work, researchers had focused on supervised, linguistics-based approaches for predicting the stance or agreement expressed in the content of tweets and for classifying the frames and moral foundations used in a single tweet. The generalizability of these approaches, however, is limited by the need for direct supervision, dependence on current language, and lack of use of the social and behavioral context available on Twitter. My work is among the first to study these general political strategies specifically for politicians on Twitter. Doing so requires techniques capable of abstracting over the textual content of multiple tweets in order to generalize across politicians, specific policy issues, and time. In this dissertation, I propose breaking from traditional linguistic baselines to leverage the rich social and behavioral features present in tweets and the Twitter network as a form of weak supervision for studying political discourse strategies on microblogs. My approach designs weakly supervised models for the identification, extraction, and modeling of the relevant linguistic, social, and behavioral patterns of Twitter. These models help shed light on the interconnection of ideological stances, framing strategies, and moral viewpoints, which underlies the relationship between a politician's behavior on social media and in the real world.
2.
APPLICATIONS OF DATA MINING IN HEALTHCARE
Bo Peng (6618929), 10 June 2019
With increases in the quantity and quality of healthcare-related data, data mining tools have the potential to improve people's standard of living through personalized and predictive medicine. In this thesis we improve the state of the art in data mining for several problems in the healthcare domain. In problems such as drug-drug interaction prediction and Alzheimer's Disease (AD) biomarker discovery and prioritization, current methods either require tedious feature engineering or have unsatisfactory performance. New, effective computational tools are needed to tackle these complex problems.

In this dissertation, we develop new algorithms for two healthcare problems: high-order drug-drug interaction prediction and amyloid imaging biomarker prioritization in Alzheimer's Disease. Drug-drug interactions (DDIs) and their associated adverse drug reactions (ADRs) represent a significant detriment to public health. Existing research on DDIs primarily focuses on pairwise DDI detection and prediction; effective computational methods for high-order DDI prediction are still needed. In this dissertation, I present D3I, a deep-learning-based model for cardinality-invariant and order-invariant high-order DDI prediction. The proposed models achieve an F1 score of 0.740 and an AUC of 0.847 on high-order DDI prediction, and outperform classical methods on order-2 DDI prediction. These results demonstrate the strong potential of D3I and deep-learning-based models for predicting high-order DDIs and their induced ADRs.

The second problem I consider in this thesis is amyloid imaging biomarker discovery, for which I propose an innovative machine learning paradigm enabling precision medicine in this domain. The paradigm tailors the imaging biomarker discovery process to the individual characteristics of a given patient. I implement this paradigm using a newly developed learning-to-rank method, PLTR. The PLTR model seamlessly integrates two objectives for joint optimization: pushing up relevant biomarkers and ranking among relevant biomarkers. An empirical study of PLTR conducted on the ADNI data yields promising results in identifying and prioritizing individual-specific amyloid imaging biomarkers based on the individual's structural MRI data. The resulting top-ranked imaging biomarkers have the potential to aid personalized diagnosis and disease subtyping.
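The abstract does not spell out D3I's architecture, but the two invariances it names are easy to illustrate. The sketch below is a generic, hypothetical set encoder in the spirit of Deep Sets, not the D3I model itself; all names and dimensions are invented. A shared per-drug transform followed by symmetric sum-pooling makes the score independent of drug order and defined for any set size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and random weights; D3I's real architecture is not given here.
EMB, HID = 16, 8
W_phi = rng.normal(size=(EMB, HID))   # shared per-drug encoder weights
w_rho = rng.normal(size=HID)          # set-level scoring weights

def interaction_score(drug_embeddings: np.ndarray) -> float:
    """Score a candidate drug set of any cardinality.

    A shared per-drug transform followed by sum-pooling makes the score
    order-invariant (summation is symmetric) and cardinality-invariant
    (defined for any number of rows)."""
    h = np.tanh(drug_embeddings @ W_phi)             # encode each drug alone
    pooled = h.sum(axis=0)                           # symmetric pooling
    return float(1 / (1 + np.exp(-pooled @ w_rho)))  # sigmoid score

pair = rng.normal(size=(2, EMB))    # an order-2 candidate
triple = rng.normal(size=(3, EMB))  # a high-order (order-3) candidate
assert np.isclose(interaction_score(pair), interaction_score(pair[::-1]))
print(interaction_score(pair), interaction_score(triple))
```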
3.
EFFECTIVE AND EFFICIENT COMPUTATION SYSTEM PROVENANCE TRACKING
Shiqing Ma (7036475), 02 August 2019
Provenance collection and analysis is one of the most important techniques for analyzing the behavior of computation systems. For forensic analysis in enterprise environments, existing provenance systems are limited. On one hand, they tend to log many redundant and irrelevant events, causing high runtime and space overhead as well as long investigation times. On the other hand, they lack application-specific provenance data, leading to an ineffective investigation process. Moreover, emerging machine learning systems, especially deep-learning-based artificial intelligence systems, are hard to interpret and vulnerable to adversarial attacks. Using provenance information to analyze such systems and defend against adversarial attacks is potentially very promising but not yet well studied.

In this dissertation, I address the aforementioned challenges. I present ProTracer, an effective and efficient operating-system-level provenance data collector. It alternates between logging and tainting to perform on-the-fly log filtering and reduction, achieving low runtime and storage overhead: tainting is used to track the dependence relationships between system call events, and logging is performed only when useful dependencies are detected. I also develop MPI, an LLVM-based analysis and instrumentation framework that automatically transforms existing applications to be provenance-aware. It requires the programmer to annotate the data structures used for partitioning, and then instruments the program to actively emit application-specific semantics to provenance collectors, which can be used for multiple-perspective attack investigation. Finally, I propose NIC, a provenance collection and analysis technique for deep learning systems. It analyzes the internal variables of a deep learning system to generate system invariants as provenance, which can then be used as a general way to detect adversarial attacks.
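As a rough, hypothetical illustration of the logging/tainting alternation described above (not ProTracer's actual implementation, and with an invented event format), the following sketch propagates taint through system-call events and logs only the events that carry a useful dependence:

```python
# Hypothetical event stream of (syscall, source, sink) triples; this invented
# format and filtering policy are illustrative, not ProTracer's design.
events = [
    ("read",  "/etc/passwd", "proc:42"),
    ("read",  "/tmp/cache",  "proc:99"),
    ("write", "proc:42",     "/home/u/out.txt"),
    ("write", "proc:99",     "/tmp/scratch"),
]

tainted = {"/etc/passwd"}   # objects whose provenance we care about (assumed)
log = []

for syscall, src, dst in events:
    if src in tainted:
        # Tainting phase: propagate the dependence instead of logging it.
        tainted.add(dst)
        if syscall == "write":
            # Logging phase: a useful dependence reached persistent state.
            log.append((syscall, src, dst))

print(log)   # only the dependence-carrying write survives the filter
```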
4.
AN ITERATIVE METHOD OF SENTIMENT ANALYSIS FOR RELIABLE USER EVALUATION
Jingyi Hui (7023500), 16 August 2019
Thanks to booming social networks, reading posts from other users over the internet has become one of the most common ways for people to take in information. One may also have noticed that we tend to focus on users who provide well-founded analysis rather than on those who merely vent their emotions. This thesis aims at finding a simple and efficient way to recognize reliable information sources among countless internet users by examining the sentiment of their past posts.

To achieve this goal, the research utilizes a dataset of tweets about Apple's stock price retrieved from Twitter. Key features studied include the post date, the user name, the number of followers of that user, and the sentiment of each tweet. Before making further use of the dataset, tweets from users who do not have sufficient posts are filtered out. To compare user sentiment with the derivative of Apple's stock price, we use the Pearson correlation between them to describe how well each user performs. As we iteratively increase the weights of reliable users and lower the weights of untrustworthy users, the correlation between the overall sentiment and the derivative of the stock price converges. The final correlations for individual users are their performance scores. Because real-world data is noisy, manual segmentation via data visualization is also proposed as a denoising step to improve performance. Besides our method, other metrics, such as the number of followers of each user, can also serve as a user trust index. Experiments are conducted to show that our method outperforms these alternatives. With simple input, this method can be applied to a wide range of topics including elections, the economy, and the job market.
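The iteration above is concrete enough to sketch. The code below is a minimal, hypothetical rendering; the multiplicative exponential update rule, the learning rate, and all names (`iterate_user_weights`, `lr`) are assumptions, not the thesis's exact scheme. Each user is scored by the Pearson correlation between their sentiment series and the derivative of the stock price, and weights are updated until they stabilize.

```python
import numpy as np

def iterate_user_weights(S: np.ndarray, dprice: np.ndarray,
                         n_iter: int = 20, lr: float = 0.5) -> np.ndarray:
    """S[u, t] = sentiment of user u at time t; dprice[t] = price derivative.

    A hypothetical multiplicative reweighting: users whose sentiment
    correlates with the price derivative gain weight, others lose it."""
    n_users = S.shape[0]
    w = np.ones(n_users) / n_users
    for _ in range(n_iter):
        # Per-user Pearson correlation with the price derivative.
        r = np.array([np.corrcoef(S[u], dprice)[0, 1] for u in range(n_users)])
        w *= np.exp(lr * np.nan_to_num(r))  # boost reliable, damp untrustworthy
        w /= w.sum()
    return w  # final weights double as per-user performance scores

rng = np.random.default_rng(1)
dprice = rng.normal(size=100)
S = np.vstack([dprice + 0.3 * rng.normal(size=100),   # a reliable user
               rng.normal(size=100)])                 # a noise-only user
print(iterate_user_weights(S, dprice))  # the first weight should dominate
```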
5.
Statistical Steganalysis of Images
Min Huang (7036661), 13 August 2019
Steganalysis is the study of detecting secret information hidden via steganography in objects such as images, videos, texts, time series, and games. Among these objects, the image is the most widely used carrier for secret messages. Detecting secret information hidden in images has attracted a great deal of attention over the past ten years. People may conduct covert communications by exchanging images in whose bits secret messages are embedded. One of the main advantages of steganography over cryptography is that the former makes this communication imperceptible to human beings, so statistical methods and tools are needed to help distinguish cover images from stego images.

In this thesis, we start with a discussion of image steganography. Different kinds of embedding schemes for hiding secret information in images are investigated. We also propose a hiding scheme that uses a reference matrix to lower the distortion caused by embedding. As a result, we obtain Peak Signal-to-Noise Ratios (PSNRs) of stego images that are higher than those given by a Sudoku-based embedding scheme. Next, we consider statistical steganalysis of images in two different frameworks. We first study steganalysis in the framework of statistical hypothesis testing; that is, we cast the cover/stego image detection problem as a hypothesis testing problem. For this purpose, we employ different statistical models for cover images and simulate the effects of secret information embedding operations on cover images. The steganalysis can then be characterized as a hypothesis testing problem in terms of the embedding rate, and Rao's score statistic is used to help make a decision. The main advantage of using Rao's score test for this problem is that it eliminates an assumption used in previous work, where approximated log-likelihood ratio (LR) statistics were commonly employed for the hypothesis testing problems.

We also investigate steganalysis in the deep learning framework. Motivated by neural network architectures applied in computer vision and other tasks, we propose a carefully designed deep convolutional neural network architecture to classify cover and stego images. We empirically show that the proposed neural network outperforms the state-of-the-art ensemble classifier using a rich model, and is comparable to other convolutional neural network architectures used for steganalysis.

The image databases used in the thesis are available on the websites cited in the thesis. The stego images are generated from these image databases using source code from http://dde.binghamton.edu/download/.
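For reference, the hypothesis-testing framing can be written compactly. The formulation below is the textbook form of Rao's score test applied to an embedding rate R; it is a generic statement, not the thesis's exact model.

```latex
% Cover/stego detection as a test on the embedding rate R:
%   H_0: R = 0 (cover image)   versus   H_1: R > 0 (stego image).
% Rao's score statistic requires fitting the model only under H_0:
\[
  T = U(R_0)^{\top}\, I(R_0)^{-1}\, U(R_0),
  \qquad
  U(R) = \frac{\partial}{\partial R}\,\log L(R \mid \mathbf{x}),
  \qquad R_0 = 0,
\]
% where L is the likelihood of the observed image x under the chosen
% cover-image model and I(R_0) is the Fisher information at R_0.
% Under H_0, T is asymptotically chi-squared; unlike an approximated LR
% test, no fit under the alternative hypothesis is required.
```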
6.
Using a Scalable Feature Selection Approach For Big Data Regressions
Qingdong Cheng (6922766), 13 August 2019
Logistic regression is a widely used statistical method in data analysis and machine learning. When the volume of data is large, it is time-consuming and sometimes infeasible to train models using the traditional approach, so it is crucial to find an efficient way to evaluate feature combinations and update learning models. With the approach proposed by Yang, Wang, Xu, and Zhang (2018), a system can be represented using matrices small enough to be hosted in memory. These working sufficient statistics matrices can be applied to update logistic regression models. This study applies the working sufficient statistics approach to logistic regression learning and examines how the new method improves performance, comparing it against the traditional approach in Spark's machine learning package. The experiments show that the working sufficient statistics method improves the performance of training logistic regression models when the input size is large.
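The cited algorithm is not reproduced here, but the general idea of summarizing large data with small in-memory matrices can be illustrated with the Newton/IRLS view of logistic regression: each update needs only a d-by-d matrix and a d-vector, which can be accumulated block by block. The sketch below is a simplified assumption of mine, not the method of Yang et al. (2018).

```python
import numpy as np

def logistic_newton_step(chunks, beta):
    """One Newton update from streamed data blocks.

    Only the d x d matrix H and d-vector g are held in memory; the raw
    data can be arbitrarily large and is read block by block."""
    d = beta.shape[0]
    H = np.zeros((d, d))   # accumulated X^T W X, with W = diag(p(1-p))
    g = np.zeros(d)        # accumulated X^T (y - p)
    for X, y in chunks:    # each chunk is a small block of rows
        p = 1 / (1 + np.exp(-X @ beta))
        H += X.T @ (X * (p * (1 - p))[:, None])
        g += X.T @ (y - p)
    return beta + np.linalg.solve(H, g)

rng = np.random.default_rng(2)
X = rng.normal(size=(10_000, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = (rng.random(10_000) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(5)
for _ in range(8):  # a few Newton iterations, re-streaming the blocks
    beta = logistic_newton_step(zip(np.array_split(X, 10),
                                    np.array_split(y, 10)), beta)
print(beta)  # should approach beta_true
```

In this view, the accumulated H and g play the role of the working sufficient statistics: only they, never the full dataset, need to fit in memory.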
7.
IMPROVING PERFORMANCE OF DATA-CENTRIC SYSTEMS THROUGH FINE-GRAINED CODE GENERATION
Gregory M Essertel (8158032), 20 December 2019
The availability of modern hardware with large amounts of memory has created a shift in the development of data-centric software, from optimizing I/O operations to optimizing computation. As a result, the main challenge has become using the memory hierarchy (cache, RAM, distributed storage, etc.) efficiently. To overcome this difficulty, programmers of data-centric programs have had to manually optimize their software using low-level APIs such as Pthreads or MPI, despite the intrinsic difficulty and low productivity of these APIs. Data-centric systems such as Apache Spark are therefore becoming more and more popular. These systems offer a much simpler interface and allow programmers and scientists to write in a few lines what would have been thousands of lines of low-level MPI code. The core benefit of these systems comes from the introduction of deferred APIs: the code written by the programmer actually builds a graph representation of the computation to be executed. This graph can then be optimized and compiled to achieve higher performance.

In this dissertation, we analyze the limitations of current data-centric systems such as Apache Spark on relational and heterogeneous workloads that interact with machine learning frameworks. We show that the compilation of queries in multiple stages and the interfacing with external systems are key impediments to performance because of the inability to optimize across code boundaries. We present Flare, an accelerator for data-centric software, which provides performance comparable to state-of-the-art relational systems while keeping the expressiveness of high-level deferred APIs. Flare delivers order-of-magnitude speedups on programs combining relational processing with machine learning frameworks such as TensorFlow. We examine the impact of compilation on short-running jobs and propose an on-stack-replacement mechanism for generative programming that decreases the overhead introduced by the compilation step. We show that this mechanism can also be used in a more generic way within source-to-source compilers. Finally, we develop a new kind of static analysis that enables the reverse engineering of legacy code in order to optimize it with Flare. The analysis is also useful for more general problems such as formal verification of programs using dynamic allocation. We have implemented a prototype that successfully verifies programs from the SV-COMP benchmark suite.
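As a toy illustration of the deferred-API idea described above (a hypothetical miniature, not Flare's or Spark's actual API), operations on the objects below build an expression graph instead of executing eagerly; execution, and therefore optimization over the whole visible graph, happens only when the graph is evaluated:

```python
class Node:
    """A deferred expression: operators build a graph instead of running."""
    def __init__(self, op, *children, value=None):
        self.op, self.children, self.value = op, children, value

    def __add__(self, other):
        return Node("add", self, other)

    def __mul__(self, other):
        return Node("mul", self, other)

def evaluate(node):
    """A trivial 'compiler': here we simply interpret the graph, but
    because the whole graph is visible before execution, a real system
    can first optimize across operation boundaries (fusion, codegen)."""
    if node.op == "leaf":
        return node.value
    a, b = (evaluate(c) for c in node.children)
    return a + b if node.op == "add" else a * b

x = Node("leaf", value=3)
y = Node("leaf", value=4)
expr = (x + y) * x      # no computation happens here, only graph building
print(evaluate(expr))   # 21 -- work happens at 'compile/run' time
```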
8.
Performance Models For Distributed Memory HPC Systems And Deep Neural Networks
David William Cardwell (8037125), 26 November 2019
Performance models are useful as mathematical models to reason about the behavior of different computer systems while running various applications. In this thesis, we aim to provide two distinct performance models: one for distributed-memory high-performance computing systems with network communication, and one for deep neural networks. Our main goal for the first model is insight and simplicity, while for the second we aim for accuracy in prediction. The first model is generalized for networked multi-core computer systems, while the second is specific to deep neural networks on a shared-memory system.
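The thesis's models are not reproduced in the abstract; as a flavor of the kind of simple, insight-oriented model used for networked systems, the sketch below implements the textbook latency-bandwidth (alpha-beta) cost estimate with illustrative, made-up parameter values:

```python
# Textbook alpha-beta model: time to send an n-byte message is
#   T(n) = alpha + beta * n
# where alpha is per-message latency and beta is per-byte cost (1/bandwidth).
# The parameter values below are illustrative, not measured.
ALPHA = 2e-6          # 2 microseconds startup latency
BETA = 1 / 10e9       # 10 GB/s link bandwidth

def send_time(n_bytes: float) -> float:
    return ALPHA + BETA * n_bytes

# Small messages are latency-bound, large ones bandwidth-bound:
for n in (1e2, 1e5, 1e8):
    print(f"{n:>12.0f} bytes: {send_time(n):.2e} s")
```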
9.
A 4/3-approximation for Minimum Weight Edge Cover
Steven Alec Gallagher (8708778), 17 April 2020
This paper addresses the minimum weight edge cover problem (MEC), stated as follows: given a graph G = (V, E) with edge weights w, find a set of edges S ⊆ E such that ∑_{e∈S} w(e) ≤ ∑_{e∈Q} w(e) for every edge cover Q, where an edge cover P is a set of edges such that every v ∈ V is incident to at least one edge in P. An efficient implementation of a 4/3-approximation for MEC is provided. Empirical results obtained from practical data sets are reported and compared against various other approximation algorithms for MEC.
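For contrast with the paper's 4/3 guarantee, the classic baseline is simple to state and implement: letting every vertex keep its cheapest incident edge yields an edge cover of weight at most twice the optimum. The sketch below implements that baseline on a toy graph; it is not the paper's 4/3-approximation algorithm.

```python
def cheapest_incident_cover(vertices, edges):
    """Classic 2-approximation baseline for minimum weight edge cover:
    every vertex keeps its cheapest incident edge. (This is NOT the
    paper's 4/3-approximation; it only illustrates the problem.)"""
    cover = set()
    for v in vertices:
        incident = [e for e in edges if v in e]
        # Assumes no isolated vertices; otherwise no edge cover exists.
        cover.add(min(incident, key=edges.get))
    return cover

# Toy weighted graph: edges as frozensets of endpoints, mapped to weights.
edges = {frozenset({"a", "b"}): 1.0,
         frozenset({"b", "c"}): 2.0,
         frozenset({"c", "d"}): 1.5}
cover = cheapest_incident_cover({"a", "b", "c", "d"}, edges)
print([sorted(e) for e in cover], sum(edges[e] for e in cover))
# Picks {a,b} and {c,d}: every vertex is covered with total weight 2.5.
```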
10.
Privacy-Preserving Facial Recognition Using Biometric-Capsules
Tyler Stephen Phillips (8782193), 04 May 2020
In recent years, developers have used the proliferation of biometric sensors in smart devices, along with recent advances in deep learning, to implement an array of biometrics-based recognition systems. Though these systems demonstrate remarkable performance and have seen wide acceptance, they present unique and pressing security and privacy concerns. One proposed method which addresses these concerns is the elegant, fusion-based Biometric-Capsule (BC) scheme. The BC scheme is provably secure, privacy-preserving, cancellable, and interoperable in its secure feature-fusion design.

In this work, we demonstrate that the BC scheme is uniquely fit to secure state-of-the-art facial verification, authentication, and identification systems. We compare the performance of the unsecured underlying biometric systems to that of the BC-embedded systems in order to directly demonstrate the minimal effect of the privacy-preserving BC scheme on underlying system performance. Notably, we demonstrate that, when seamlessly embedded into state-of-the-art FaceNet and ArcFace verification systems, which achieve accuracies of 97.18% and 99.75% on the benchmark LFW dataset, the BC-embedded systems achieve accuracies of 95.13% and 99.13%, respectively. Furthermore, we demonstrate that the BC scheme outperforms or performs as well as several other proposed secure biometric methods.