101. Identifying and Resolving Entities in Text
Durrett, Gregory Christopher, 02 February 2017
When automated systems attempt to deal with unstructured text, a key subproblem is identifying the relevant actors in that text---answering the "who" of the narrative being presented. This thesis is concerned with developing tools to solve this NLP subproblem, which we call entity analysis. We focus on two tasks in particular: first, coreference resolution, which consists of within-document identification of entities, and second, entity linking, which involves identifying each of those entities with an entry in a knowledge base like Wikipedia.
One of the challenges of coreference is that it requires dealing with many different linguistic phenomena: constraints in reference resolution arise from syntax, semantics, discourse, and pragmatics. This diversity of effects makes it difficult to build effective learning-based coreference resolution systems without relying on handcrafted features. We show that a set of simple features inspecting surface lexical properties of a document is sufficient to capture a range of these effects, and that these can power an efficient, high-performing coreference system.
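As a rough illustration of the kind of surface lexical cues such a system can exploit (a hypothetical sketch, not the thesis's actual feature set), a mention-pair feature extractor might look like this:

```python
# Hypothetical sketch of surface lexical features for a candidate
# antecedent/anaphor mention pair; not the feature set from the thesis.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "his", "its", "their"}

def mention_pair_features(antecedent: str, anaphor: str, sentence_distance: int) -> dict:
    """Return a few simple surface features for one mention pair."""
    ant_tokens = antecedent.lower().split()
    ana_tokens = anaphor.lower().split()
    return {
        "exact_match": antecedent.lower() == anaphor.lower(),
        # Crude head approximation: the last token of the mention.
        "head_match": ant_tokens[-1] == ana_tokens[-1],
        "anaphor_is_pronoun": len(ana_tokens) == 1 and ana_tokens[0] in PRONOUNS,
        # Bucketed sentence distance, capped at 5.
        "distance_bucket": min(sentence_distance, 5),
    }

print(mention_pair_features("President Obama", "Obama", 2))
```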
Our analysis of our base coreference system shows that some examples can only be resolved successfully by exploiting world knowledge or deeper knowledge of semantics. Therefore, we turn to the task of entity linking and tackle it not in isolation, but instead jointly with coreference. By doing so, our coreference module can draw upon knowledge from a resource like Wikipedia, and our entity linking module can draw on information from multiple mentions of the entity we are attempting to resolve. Our joint model of these tasks, which additionally models semantic types of entities, gives strong performance across the board and shows that effectively exploiting these interactions is a natural way to build better NLP systems.
Having developed these tools, we show that they can be useful for a downstream NLP task, namely automatic summarization. We develop an extractive and compressive automatic summarization system, and argue that one deficiency it has is its inability to use pronouns coherently in generated summaries, as we may have deleted content that contained a pronoun's antecedent. Our entity analysis machinery allows us to place constraints on summarization that guarantee pronoun interpretability: each pronoun must have a valid antecedent included in the summary or it must be expanded into a reference that makes sense in isolation. We see improvements in our system's ability to produce summaries with coherent pronouns, which suggests that deeper integration of various parts of the NLP stack promises to yield better systems for text understanding.
102. Efficient genetic k-means clustering algorithm and its application to data mining on different domains
Alsayat, Ahmed Mosa, 02 February 2017
Because of the massive increase in data streams available and being produced, the areas of data mining and machine learning have become increasingly popular, as companies, organizations and industries seek out optimal methods and techniques for processing these large data sets. Machine learning is a branch of artificial intelligence that involves creating programs that autonomously perform data mining techniques when exposed to data streams. This study examines two very different domains in an effort to provide a better, more broadly applicable clustering method than those currently in use. We examine the use of data mining in healthcare, as well as in the social media domain. Testing the proposed technique on these two drastically different domains offers valuable insights into its performance across domains.

This study reviews existing clustering methods and presents an enhanced k-means clustering algorithm using a novel method called Optimize Cluster Distance (OCD), applied to the social media domain. The OCD method maximizes the distance between clusters by pair-wise re-clustering to enhance cluster quality. For the healthcare domain, k-means was applied along with a Self-Organizing Map (SOM) to obtain an optimal number of clusters. The possibility of poor centroid positions in k-means was addressed by applying a genetic algorithm to k-means in both the social media and healthcare domains. The OCD was applied again to enhance the quality of the produced clusters. In both domains, the analysis shows that the proposed k-means is accurate and achieves better clustering performance than conventional k-means, along with valuable insights for each cluster. The approach is unsupervised, scalable and can be applied to various domains.
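For orientation, the sketch below shows a baseline k-means run together with the mean pairwise inter-centroid distance, the kind of separation quantity an OCD-style re-clustering step would aim to maximize; the actual OCD procedure and genetic-algorithm initialization from the study are not reproduced here.

```python
# Illustrative baseline only: standard k-means plus the mean pairwise
# distance between cluster centroids (larger values = better-separated
# clusters). The OCD and genetic-algorithm components are not shown.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))          # placeholder feature vectors

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
separation = pdist(km.cluster_centers_)
print("mean inter-cluster centroid distance:", separation.mean())
```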
103. Probabilistic and Deep Learning Algorithms for the Analysis of Imagery Data
Basu, Saikat, 23 August 2016
Accurate object classification is a challenging problem for imagery data across a range of resolutions, from low to high. This applies to both natural and synthetic image datasets. However, each object recognition dataset poses its own distinct set of domain-specific problems. In order to address these issues, we need to devise intelligent learning algorithms which require a deep understanding and careful analysis of the feature space. In this thesis, we introduce three new learning frameworks for the analysis of both airborne images (the NAIP dataset) and handwritten digit datasets without and with noise (MNIST and n-MNIST, respectively).
First, we propose a probabilistic framework for the analysis of the NAIP dataset which includes (1) an unsupervised segmentation module based on the Statistical Region Merging algorithm, (2) a feature extraction module that extracts a set of standard hand-crafted texture features from the images, (3) a supervised classification algorithm based on Feedforward Backpropagation Neural Networks, and (4) a structured prediction framework using Conditional Random Fields that integrates the results of the segmentation and classification modules into a single composite model to generate the final class labels.
Next, we introduce two new datasets, SAT-4 and SAT-6, sampled from the NAIP imagery, and use them to evaluate a multitude of Deep Learning algorithms including Deep Belief Networks (DBN), Convolutional Neural Networks (CNN) and Stacked Autoencoders (SAE) for generating class labels. Finally, we propose a learning framework that integrates hand-crafted texture features with a DBN. A DBN uses an unsupervised pre-training phase to initialize the parameters of a Feedforward Backpropagation Neural Network to a global error basin, which can then be improved with a round of supervised fine-tuning; the resulting networks can subsequently be used for classification. In the following discussion, we show that the integration of hand-crafted features with a DBN yields a significant improvement in performance compared to traditional DBN models that take raw image pixels as input. We also investigate why this integration proves to be particularly useful for aerial datasets using a statistical analysis based on the Distribution Separability Criterion.
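A minimal sketch of the feature-integration idea follows, assuming a scikit-learn MLP as a stand-in for the DBN-initialized network; the hand-crafted descriptors here are toy placeholders for the texture features used in the thesis.

```python
# Minimal sketch of the feature-integration idea: hand-crafted
# descriptors are concatenated with raw pixel inputs before being fed
# to the network. A scikit-learn MLP stands in for the DBN-initialized
# network described in the thesis.
import numpy as np
from sklearn.neural_network import MLPClassifier

def handcrafted_features(img):
    """Toy stand-in for texture descriptors (mean, std, edge energy)."""
    gx = np.diff(img, axis=1)
    return np.array([img.mean(), img.std(), np.abs(gx).mean()])

rng = np.random.default_rng(0)
images = rng.random((200, 28, 28))              # placeholder image patches
labels = rng.integers(0, 4, size=200)           # placeholder class labels

X = np.stack([np.concatenate([img.ravel(), handcrafted_features(img)])
              for img in images])
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200).fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```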
Then we introduce a new dataset called noisy-MNIST (n-MNIST), created by adding (1) additive white Gaussian noise (AWGN), (2) motion blur, and (3) reduced contrast combined with AWGN to the MNIST dataset, and present a learning algorithm that combines probabilistic quadtrees and Deep Belief Networks. This dynamic integration of the Deep Belief Network with the probabilistic quadtrees provides a significant improvement over traditional DBN models on both the MNIST and the n-MNIST datasets.
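The three corruption types can be sketched as follows; the exact noise parameters used to build the published n-MNIST dataset may differ.

```python
# Sketch of the three corruption types used to build n-MNIST; the
# exact parameters of the published dataset may differ.
import numpy as np
from scipy.ndimage import convolve

def awgn(img, sigma=0.1):
    return np.clip(img + np.random.normal(0, sigma, img.shape), 0.0, 1.0)

def motion_blur(img, length=5):
    # Horizontal motion blur: convolve with a normalized line kernel.
    kernel = np.zeros((length, length))
    kernel[length // 2, :] = 1.0 / length
    return convolve(img, kernel, mode="nearest")

def reduced_contrast_awgn(img, contrast=0.5, sigma=0.1):
    low_contrast = 0.5 + contrast * (img - 0.5)   # shrink toward mid-gray
    return awgn(low_contrast, sigma)

digit = np.random.rand(28, 28)                    # placeholder for an MNIST digit
noisy = [awgn(digit), motion_blur(digit), reduced_contrast_awgn(digit)]
```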
Finally, we extend our experiments on aerial imagery to the class of general texture images and present a theoretical analysis of Deep Neural Networks applied to texture classification. We derive the size of the feature space of textural features and also derive the Vapnik-Chervonenkis dimension of certain classes of Neural Networks. We also derive some useful results on intrinsic dimension and relative contrast of texture datasets and use these to highlight the differences between texture datasets and general object recognition datasets.
104. Scalable Unsupervised Dense Objects Discovery, Detection, Tracking and Reconstruction
Ma, Lu, 03 November 2016
This dissertation proposes a novel scalable framework that unifies unsupervised object discovery, detection, tracking and reconstruction (DDTR) by using dense visual simultaneous localization and mapping (SLAM) approaches. Related applications for both indoor and outdoor environments are presented.

The dissertation starts by presenting the indoor scenario (Chapter 3), where DDTR simultaneously localizes a moving time-of-flight camera and discovers a set of shape and appearance models for multiple objects, including the scene background. The proposed framework represents object models with both a 2D and a 3D level-set, which are used to improve detection, 2D tracking, 3D registration and, importantly, subsequent updates to the level-set itself. An example of the proposed framework performing simultaneous appearance-based DDTR using the time-of-flight camera and a robot manipulator is also presented (Chapter 4).

After presenting the indoor experiments, we extend DDTR to outdoor environments. Chapter 5 presents a dense visual-inertial SLAM framework, in which inertial measurements are combined with dense stereo vision for pose tracking. A rolling grid scheme is used for large-scale mapping. Chapter 6 proposes a scalable dense mapping pipeline that uses range data from various sensors (e.g., time-of-flight cameras, stereo cameras and multiple lasers) to generate a very high resolution, dense citywide map in real time (700 Hz on average).

Finally, Chapter 7 presents the application of DDTR to autonomous driving, including city-wide dense SLAM, truncated signed distance function (TSDF) based six-degree-of-freedom vehicle localization and object discovery, and the simultaneous tracking and reconstruction of vehicles. The results demonstrate a scalable and unsupervised framework for object discovery, detection, tracking and reconstruction that can be used for both indoor and outdoor applications.
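For reference, the sketch below shows the textbook weighted-average update for a truncated signed distance function volume, the kind of dense representation such reconstruction pipelines maintain; it is a generic formulation, not the dissertation's implementation.

```python
# Generic TSDF fusion step (weighted running average per voxel), a
# standard building block of dense reconstruction; not the
# dissertation's exact pipeline.
import numpy as np

def tsdf_update(tsdf, weight, sdf_obs, trunc=0.05, max_weight=100.0):
    """Fuse one new signed-distance observation into the volume."""
    d = np.clip(sdf_obs / trunc, -1.0, 1.0)        # truncate and normalize
    new_tsdf = (tsdf * weight + d) / (weight + 1.0)
    new_weight = np.minimum(weight + 1.0, max_weight)
    return new_tsdf, new_weight

voxels = np.zeros((64, 64, 64))                    # TSDF volume
weights = np.zeros_like(voxels)
observation = np.random.uniform(-0.1, 0.1, voxels.shape)  # fake per-voxel SDF
voxels, weights = tsdf_update(voxels, weights, observation)
```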
105. Secure storage via information dispersal across network overlays
Johnston, Reece G., 09 November 2016
In this paper, we describe a secure distributed storage model intended especially for untrusted devices, most notably cloud storage devices. The model provides this security through a peer-to-peer overlay and storage protocol designed to run on existing networked systems. We utilize a structured overlay that is organized in a layered, hierarchical manner based on the underlying network structure. These layers are used as storage sites for pieces of data near the layer at which that data is needed. The data pieces are generated and distributed via a technique called an information dispersal algorithm (IDA), which utilizes an erasure code such as Cauchy Reed-Solomon (RS). Through the use of this IDA, the data pieces are organized across neighboring layers to maximize locality and prevent a compromise within one layer from compromising the data of that layer. Specifically, for a single datum to become compromised, a minimum of two layers would have to be compromised. As a result, security, survivability, and availability of the data are improved compared to other distributed storage systems. We present significant background in this area followed by an analysis of similar distributed storage systems. Then, an overview of our proposed model is given along with an in-depth analysis, including both experimental results and theoretical analysis. The recorded overhead (encoding/decoding times and associated data sizes) shows that the scheme can be used with little increase in overall latency, making the proposed model a practical choice for distributed storage needs.
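As a simplified illustration of information dispersal, the sketch below splits a datum into k pieces plus one XOR parity piece so that any k of the k+1 pieces can rebuild it; a production IDA, like the one in this paper, would use a true erasure code such as Cauchy Reed-Solomon, which tolerates more losses.

```python
# Simplified (k+1, k) dispersal using a single XOR parity piece.
# A stand-in for a real erasure code, not the paper's Cauchy RS IDA.
def disperse(data: bytes, k: int = 4):
    size = -(-len(data) // k) * k          # round length up to a multiple of k
    padded = data.ljust(size, b"\0")
    chunk = size // k
    pieces = [padded[i * chunk:(i + 1) * chunk] for i in range(k)]
    parity = bytearray(chunk)
    for piece in pieces:
        for i, b in enumerate(piece):
            parity[i] ^= b
    return pieces + [bytes(parity)]

def recover_missing(pieces, missing_index):
    """Rebuild one lost piece (data or parity) by XOR-ing the survivors."""
    out = bytearray(len(pieces[0] if pieces[0] is not None else pieces[-1]))
    for idx, piece in enumerate(pieces):
        if idx == missing_index or piece is None:
            continue
        for i, b in enumerate(piece):
            out[i] ^= b
    return bytes(out)

shares = disperse(b"secret configuration blob", k=4)
shares[1] = None                               # simulate a lost piece
print(recover_missing(shares, 1))
```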
106. Frequency based advertisement blocking on Android mobile devices using a local VPN server
Ozsoy, Metehan, 23 December 2016
Ads (advertisements) are the main economic source for most free web content. Modern ads are not only a nuisance to end users but often also a violation of their privacy via tracking methods, and there has been a rise in the use of ad blocking software. The major problem with this ad blocking software is that it relies on manually generated blacklists: humans need to detect ad URLs (Uniform Resource Locators) and add them to a blacklist so that it can be used by the ad blocking software. The purpose of this project is to design and implement automated ad blocking software for Android mobile devices that does not rely on manually generated blacklists. The hypothesis used to automate the generation of the blacklist is that URLs which are not present in a given comprehensive whitelist and are visited more than a certain threshold number of times are likely to be ad URLs.

In order to test the hypothesis, an Android mobile application is developed that does not require root access on the device. The mobile application uses a local VPN (Virtual Private Network) server to capture the entire network traffic of the mobile device. The application is installed on a number of Android devices to collect the data needed to test the hypothesis.

The experiments illustrate that a false positive rate of 0.1% and a false negative rate of 0.26% can be achieved with an optimal frequency threshold. It is concluded that URLs that are not present in a given comprehensive whitelist and that are visited with higher frequencies are more likely to be ad URLs.
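The core blacklist-generation rule can be sketched in a few lines; the whitelist entries, hostnames and threshold below are made-up placeholders, and the real project applies this logic to traffic captured through the local VPN service on the device.

```python
# Sketch of the stated hypothesis: any host that is absent from the
# whitelist and requested more often than a threshold is flagged as a
# likely ad URL. Hostnames and the threshold are illustrative only.
from collections import Counter
from urllib.parse import urlparse

WHITELIST = {"example.com", "news.example.org"}     # assumed whitelist
THRESHOLD = 20                                      # assumed frequency cutoff

def build_blacklist(requested_urls):
    hosts = Counter(urlparse(u).netloc for u in requested_urls)
    return {host for host, count in hosts.items()
            if host not in WHITELIST and count > THRESHOLD}

traffic_log = ["http://ads.tracker.net/img"] * 25 + ["http://example.com/"] * 50
print(build_blacklist(traffic_log))                 # {'ads.tracker.net'}
```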
107. Hybrid database: Dynamic selection of database infrastructure to improve query performance
Williams, Michael, 23 December 2016
Distributed file systems have enabled storage and parsing of arbitrarily large datasets, with performance scaling linearly with hardware resources; however, the latency incurred by small queries over large datasets becomes untenable in a production environment. By storing data on both a distributed file system and a traditional relational database, this product achieves low-latency data service for users while maintaining a complete archive.

The software stack uses the Apache Hadoop Distributed File System (HDFS) for distributed storage. Apache Hive is used for queries against the distributed file system. A MySQL database backend provides the traditional database service. A J2EE web application serves as the user interface.

The decision about which data service can provide the requested data with the lowest latency is made by evaluating the query.
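A hedged sketch of the routing decision is shown below: queries touching only recent, "hot" data go to MySQL, while anything reaching into the archive falls back to Hive over HDFS. The retention window and the routing criterion are assumptions for illustration, not the project's actual policy.

```python
# Illustrative query router; the retention window and the date-range
# criterion are assumed placeholders, not the project's actual policy.
from datetime import date, timedelta

MYSQL_RETENTION_DAYS = 30          # assumed size of the low-latency store

def route_query(start_date: date, end_date: date) -> str:
    cutoff = date.today() - timedelta(days=MYSQL_RETENTION_DAYS)
    if start_date >= cutoff:
        return "mysql"             # small, recent slice: low-latency path
    return "hive"                  # archival scan: distributed path

print(route_query(date.today() - timedelta(days=3), date.today()))   # mysql
print(route_query(date(2015, 1, 1), date.today()))                   # hive
```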
108. Mining discriminating patterns in data with confidence
Kamra, Varun, 28 December 2016
There are many pattern mining algorithms available for classifying data. The main drawback of most of these algorithms is that they focus on mining frequent patterns, which may not be discriminative enough for classification. There can exist patterns that are not frequent but are effective discriminators, and in such cases these algorithms may not perform well. This project proposes the MDP algorithm, which searches for patterns that are good at discriminating between classes rather than for frequent patterns. The algorithm ensures that there is at least one most discriminative pattern (MDP) per record. The purpose of the project is to investigate how a structural approach to classification compares to a functional approach. The project has been developed in the Java programming language.
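For intuition only (this is not the MDP algorithm itself), a pattern's discriminative power can be scored by the gap between its class-conditional supports, as in the sketch below.

```python
# Toy scoring of how well an itemset discriminates between two classes;
# an illustration of the idea, not the project's MDP algorithm.
def support(pattern, records):
    return sum(pattern <= r for r in records) / len(records)

def discrimination_score(pattern, class_a, class_b):
    """Absolute difference in support between the two classes."""
    return abs(support(pattern, class_a) - support(pattern, class_b))

class_a = [{"x", "y"}, {"x", "z"}, {"x", "y", "z"}]
class_b = [{"y"}, {"y", "z"}, {"z"}]
candidates = [frozenset({"x"}), frozenset({"y"}), frozenset({"z"})]
best = max(candidates, key=lambda p: discrimination_score(p, class_a, class_b))
print(best, discrimination_score(best, class_a, class_b))   # {'x'} is most discriminative
```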
109. Characterizing Broadband Services in a Broader Context: Vantage Points, Measurements, and Experimentation
Bischof, Zachary Scott, 29 December 2016
Broadband networks are one of the most economically significant and fastest growing sectors of the Internet. Recent studies have shown that providing broadband Internet access is instrumental for social and economic development. Several governments, as well as the UN, have gone so far as to label broadband access a basic human right, similar to education and water. Motivated by the increased importance of broadband access, recent efforts are shedding light on the availability of broadband services. However, these works tend to focus on measuring service capacity. As a result, we still lack an understanding of how factors such as a link's capacity, quality, dependability, or cost affect user behavior and network demand.

We believe that realizing the full benefits of broadband access requires an understanding of how these services are being used by subscribers. The thesis of this work is that broadband service characterization must take a user-centric perspective, understanding how different aspects of the service impact its users' experiences, and thus should be done in a broader context. It should include an analysis of factors such as link quality, service dependability, and market factors (e.g., monthly income and cost of broadband access) and an understanding of how each affects user behavior.

To achieve this, we need to look beyond controlled experiments and regression analysis, two techniques commonly used in the field of networking. Controlled experiments, where subjects in the study are assigned randomly to "treated" and "untreated" groups for comparisons, are not feasible for studying the effect of complex treatments such as market and economic factors at scale. On the other hand, regression analysis is insufficient for causal inference. A key contribution of this work is the application of natural experiments and related experiment designs, techniques common in fields such as epidemiology and the social sciences, in the context of broadband services.

In this dissertation, we study broadband services in this broader context. We present the results of our empirical study on the relationship between service characteristics (capacity, latency, loss rates, and reliability), price, time and user demand. We find a strong correlation between capacity and user demand, but note a law of diminishing returns with lower increases in relative demand as service capacity increases. We also find that subscribing to unreliable broadband services tends to result in users generating less network traffic, even during periods of normal operation. These findings suggest that service dependability is becoming more important to subscribers as service capacities increase globally.

We include a characterization of broadband services in terms of bandwidth, latency, and loss. For bandwidth, we find that a number of providers struggle to consistently meet their advertised throughput rates and identify multiple instances where service throughput is correlated with the time of day. We also show that access latencies can vary widely by region, even within the same provider. In terms of service reliability, we find that broadband service providers in the US are able to provide at most two nines (99%) of availability.

Motivated by our findings on both the importance and current state of service reliability, we present an approach to improving service reliability using broadband multihoming and describe our prototype implementation.
Our evaluation shows that in many cases, users could add up to two nines to service availability (from 99% to 99.99%) by multihoming with a neighbor's connection. Because an individual subscriber may experience a wide range of performance, we then explore the possibility of adopting broadband service level agreements (SLAs). We argue that the use of broadband SLAs could help service providers better differentiate their retail services from competitors and better inform both customers and policymakers of the broadband services offered in their communities. Using four years of data collected from residential gateways, we show that many ISPs could offer meaningful service level agreements immediately at little to no cost.
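The two-nines claim follows from a simple independence calculation, sketched below under the assumption that the two access links fail independently.

```python
# Back-of-the-envelope check of the multihoming claim, assuming the
# primary and neighbor links fail independently.
primary = 0.99
neighbor = 0.99
combined = 1 - (1 - primary) * (1 - neighbor)
print(f"combined availability: {combined:.4%}")          # 99.9900%

hours_down_per_year = (1 - combined) * 365 * 24
print(f"expected downtime: {hours_down_per_year:.2f} hours/year")  # ~0.88
```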
110. A web application based on the MVC architecture using the Spring Framework
Panchal, Hardikkumar B., 04 January 2017
In the field of software engineering, the Model-View-Controller (MVC) architecture was a major breakthrough introduced in the 1970s. It is an architectural pattern that is useful for structuring software applications. MVC divides an application into three components: 1) Model, 2) View, and 3) Controller. This separation provides great modularity, which results in ease of maintenance for systems employing this architecture. A software framework is a collection of modules and connectors that embody a given software architecture. In the spirit of software reuse, the Spring Framework was adopted to implement the MVC architecture.

In this document, a web application is presented that demonstrates the features and characteristics of the MVC architecture and the Spring Framework. All development technologies used in the web application are discussed, along with the object-oriented analysis and design methodology.
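The separation of concerns described above can be illustrated language-agnostically; the project itself is built in Java with Spring MVC, so the Python sketch below only shows how the three roles divide responsibilities.

```python
# Language-agnostic illustration of the MVC split; not Spring code.
class UserModel:                       # Model: data and business rules
    def __init__(self):
        self._users = {1: "Alice"}
    def find(self, user_id):
        return self._users.get(user_id)

class UserView:                        # View: presentation only
    def render(self, name):
        return f"<h1>Profile: {name}</h1>" if name else "<h1>Not found</h1>"

class UserController:                  # Controller: handles a request, wires model to view
    def __init__(self, model, view):
        self.model, self.view = model, view
    def show(self, user_id):
        return self.view.render(self.model.find(user_id))

print(UserController(UserModel(), UserView()).show(1))
```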