91

Scaling Analytics via Approximate and Distributed Computing

Chakrabarti, Aniket 12 December 2017 (has links)
No description available.
92

Approximate Clustering Algorithms for High Dimensional Streaming and Distributed Data

Carraher, Lee A. 22 May 2018 (has links)
No description available.
93

A Parallel Algorithm for Query Adaptive, Locality Sensitive Hash Search

Carraher, Lee A. 17 September 2012 (has links)
No description available.
94

Region-Based Geometric Active Contour for Classification Using Hyperspectral Remote Sensing Images

Yan, Lin 20 October 2011 (has links)
No description available.
95

Tackling the current limitations of bacterial taxonomy with genome-based classification and identification on a crowdsourcing Web service

Tian, Long 25 October 2019 (has links)
Bacterial taxonomy is the science of classifying, naming, and identifying bacteria. The scope and practice of taxonomy have evolved through history with our understanding of life and our growing and changing needs in research, medicine, and industry. As in animal and plant taxonomy, the species is the fundamental unit of taxonomy, but the genetic and phenotypic diversity that exists within a single bacterial species is substantially higher than within an animal or plant species. Therefore, the current "type"-centered classification scheme that describes a species based on a single type strain is not sufficient to classify bacterial diversity, in particular with regard to human, animal, and plant pathogens, for which it is necessary to trace disease outbreaks back to their source. Here we discuss the current needs and limitations of classic bacterial taxonomy and introduce LINbase, a Web service that not only implements current species-based bacterial taxonomy but also compensates for its limitations by providing a new framework for genome sequence-based classification and identification independent of the type-centric species. LINbase uses a sequence-similarity-based framework to cluster bacteria into hierarchical taxa, which we call LINgroups, at multiple levels of relatedness, and crowdsources users' expertise by encouraging them to circumscribe these groups as taxa from the genus level to the intraspecies level. Circumscribing a group of bacteria as a LINgroup, adding a phenotypic description, and giving the LINgroup a name using the LINbase Web interface allows users to instantly share new taxa and complements the lengthy and laborious process of publishing a named species. Furthermore, unknown isolates can be identified immediately as members of a newly described LINgroup with fast and precise algorithms based on their genome sequences, allowing species- and intraspecies-level identification. The employed algorithms are based on a combination of the alignment-based algorithm BLASTN and the alignment-free method Sourmash, which is based on k-mers and the MinHash algorithm. The potential of LINbase is demonstrated using examples of plant-pathogenic bacteria. / Doctor of Philosophy / Life is always easier when people talk to each other in the same language. Taxonomy is the language that biologists use to communicate about life by (1) classifying organisms into groups, (2) giving names to these groups, and (3) identifying individuals as members of these named groups. When most scientists and the general public think of taxonomy, they think of the hierarchical structure of “Life”, “Domain”, “Kingdom”, “Phylum”, “Class”, “Order”, “Family”, “Genus”, and “Species”. However, the basic goal of taxonomy is to allow the identification of an organism as a member of a group that is predictive of its characteristics and to provide a name to communicate about that group with other scientists and the public. In the world of micro-organisms, taxonomy is extremely important, since there are an estimated 10,000,000 to 1,000,000,000 different bacterial species. Moreover, microbiologists and pathologists need to consider differences among bacterial isolates even within the same species, a level that the current taxonomic system does not even cover. Therefore, we developed a Web service, LINbase, which uses genome sequences to classify individual microbial isolates.
The database at the backend of LINbase assigns Life Identification Numbers (LINs) that express how individual microbial isolates are related to each other above, at, and below the species level. The LINbase Web service is designed to be an interactive web-based encyclopedia of microorganisms where users can share everything they know about micro-organisms, be it individual isolates or groups of isolates, for professional and scientific purposes. To develop LINbase, efficient computer programs were developed and implemented. To show how LINbase can be used, several groups of bacteria that cause plant diseases were classified and described.
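For readers unfamiliar with MinHash, the following is a minimal sketch (not LINbase's actual implementation) of how a bottom-k MinHash over genomic k-mers yields a fast Jaccard similarity estimate between two genomes; the k-mer size and sketch size are illustrative choices only:

```python
import hashlib

def kmer_hashes(seq, k=21):
    """Hash every overlapping k-mer of a DNA sequence to a 64-bit int."""
    return {
        int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(seq) - k + 1)
    }

def bottom_k_sketch(seq, k=21, sketch_size=128):
    """Keep the sketch_size smallest k-mer hashes as the genome's sketch."""
    return set(sorted(kmer_hashes(seq, k))[:sketch_size])

def jaccard_estimate(sketch_a, sketch_b, sketch_size=128):
    """Estimate Jaccard similarity from the bottom-k of the sketch union."""
    union_bottom = sorted(sketch_a | sketch_b)[:sketch_size]
    shared = sum(1 for h in union_bottom if h in sketch_a and h in sketch_b)
    return shared / len(union_bottom)

# Toy usage with short sequences and a small k (real genomes use larger k):
a = bottom_k_sketch("ACGTACGTGACCTGAAGTCA" * 5, k=5, sketch_size=32)
b = bottom_k_sketch("ACGTACGTGACCTGAAGTCT" * 5, k=5, sketch_size=32)
print(round(jaccard_estimate(a, b, sketch_size=32), 2))
```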
96

NoSQL Databases: Scalability and High Availability through Design Patterns

Antiñanco, Matías Javier 09 June 2014 (has links)
This work presents a catalog of techniques and design patterns currently applied in NoSQL databases. The proposed approach consists of a presentation of the state of the art of NoSQL databases, an exposition of the related key concepts, and a subsequent survey of a set of techniques and design patterns aimed at scalability and high availability. To that end:
• The main characteristics of NoSQL databases are briefly described, along with the factors that motivated their emergence and their differences from their relational counterparts; the CAP theorem is presented and the ACID properties are contrasted with BASE.
• The problems that motivate the techniques and design patterns to be described are introduced.
• Techniques and design patterns that address those problems are presented.
• Finally, the work concludes with an integrative analysis and indicates other pertinent research topics.
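By way of illustration of the kind of scalability pattern such a catalog covers (this sketch is illustrative and not taken from the thesis): consistent hashing, the partitioning pattern behind Dynamo-style NoSQL stores, spreads keys evenly across nodes while ensuring that adding or removing a node moves only a small fraction of the keys.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring; virtual nodes smooth the distribution."""

    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, key):
        """Walk clockwise from the key's hash to the next virtual node."""
        idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # the same key always lands on the same node
```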
97

Space-efficient data sketching algorithms for network applications

Hua, Nan 06 July 2012 (has links)
Sketching techniques are widely adopted in network applications. Sketching algorithms “encode” data into succinct data structures that can later be accessed and “decoded” for various purposes, such as network measurement, accounting, and anomaly detection. Bloom filters and counter braids are two well-known representatives in this category. These sketching algorithms usually need to strike a tradeoff between performance (how much information can be revealed, and how fast) and cost (storage, transmission, and computation). This dissertation is dedicated to the research and development of several sketching techniques, including improved forms of stateful Bloom filters, statistical counter arrays, and error estimating codes. A Bloom filter is a space-efficient randomized data structure for approximately representing a set in order to support membership queries. The Bloom filter and its variants have found widespread use in many networking applications, where it is important to minimize the cost of storing and communicating network data. In this thesis, we propose a family of Bloom filter variants augmented by a rank-indexing method. We show that such augmentation brings a significant reduction in space and in the number of memory accesses, especially when deletions of set elements from the Bloom filter need to be supported. The exact active counter array is another important building block in many sketching algorithms, where the storage cost of the array is of paramount concern. Previous approaches reduce the storage costs while either losing accuracy or supporting only passive measurements. In this thesis, we propose an exact statistics counter array architecture that can support active measurements (real-time reads and writes). It also leverages the aforementioned rank-indexing method and exploits statistical multiplexing to minimize the storage costs of the counter array. Error estimating coding (EEC) has recently been established as an important tool to estimate bit error rates in the transmission of packets over wireless links. In essence, the EEC problem is also a sketching problem, since the EEC codes can be viewed as a sketch of the packet sent, which is decoded by the receiver to estimate the bit error rate. In this thesis, we first investigate the asymptotic bound of error estimating coding by viewing the problem from a two-party computation perspective, and then investigate its coding/decoding efficiency using Fisher information analysis. Further, we develop several sketching techniques, including the enhanced tug-of-war (EToW) sketch and the generalized EEC (gEEC) sketch family, which achieve around a 70% reduction in sketch size with similar estimation accuracies. For all of the solutions proposed above, we use theoretical tools such as information theory and communication complexity to investigate how far they are from the theoretical optimum. We show that the proposed techniques are asymptotically or empirically very close to the theoretical bounds.
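For context, the basic (non-augmented) Bloom filter that the abstract builds on can be written in a few lines; the sketch below derives its k probe positions from one digest via double hashing and is illustrative only, without the rank-indexing augmentation the thesis proposes:

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter: k hash probes into an m-bit array."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Derive k probe positions from one digest via double hashing.
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("10.0.0.1")
print("10.0.0.1" in bf, "10.0.0.2" in bf)  # True, (almost certainly) False
```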
98

Towards Efficient Delivery of Dynamic Web Content

Ramaswamy, Lakshmish Macheeri 26 August 2005 (has links)
The advantages of cache cooperation in edge cache networks serving dynamic web content were studied. The design of a cooperative edge cache grid, a large-scale cooperative edge cache network for delivering highly dynamic web content with varying server update frequencies, was presented. A cache-clouds-based architecture was proposed to promote low-cost cache cooperation in the cooperative edge cache grid. An Internet-landmarks-based scheme, called the selective landmarks-based server-distance-sensitive clustering scheme, for grouping edge caches into cooperative clouds was presented. A dynamic hashing technique for efficient, load-balanced, and reliable document lookups and updates was presented. A utility-based scheme for cooperative document placement in cache clouds was proposed. The proposed architecture and techniques were evaluated through trace-based simulations using both real-world and synthetic traces. Results showed that the proposed techniques provide significant performance benefits. A framework for automatically detecting cache-effective fragments in dynamic web pages was presented. Two types of fragments in web pages, namely shared fragments and lifetime-personalization fragments, were identified and formally defined. A hierarchical fragment-aware web page model, called the augmented-fragment tree model, was proposed. An efficient algorithm to detect maximal fragments that are shared among multiple documents was proposed. A practical algorithm for detecting fragments based on their lifetime and personalization characteristics was designed. The proposed framework and algorithms were evaluated through experiments on real web sites. The effect of adopting the detected fragments on web caches and origin servers was experimentally studied.
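The dissertation's own dynamic hashing scheme is not reproduced here, but as a hedged illustration of how load-balanced, coordination-free document lookup among cooperating caches can work, here is a rendezvous (highest-random-weight) hashing sketch; the cache names and URL are hypothetical:

```python
import hashlib

def _weight(cache_id, doc_url):
    """Pseudo-random weight for a (cache, document) pair."""
    digest = hashlib.sha256(f"{cache_id}|{doc_url}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def owner_cache(caches, doc_url):
    """Every cache independently agrees on a document's owner; if a
    cache leaves, only the documents it owned get reassigned."""
    return max(caches, key=lambda c: _weight(c, doc_url))

caches = ["edge-cache-1", "edge-cache-2", "edge-cache-3"]
print(owner_cache(caches, "https://example.com/news/front-page"))
```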
99

Scaling Software Security Analysis to Millions of Malicious Programs and Billions of Lines of Code

Jang, Jiyong 01 August 2013 (has links)
Software security is a big data problem. The volume of new software artifacts created far outpaces the current capacity of software analysis. This gap has brought an urgent challenge to our security community: scalability. If our techniques cannot cope with an ever-increasing volume of software, we will always be one step behind attackers. Thus, developing scalable analysis to bridge the gap is essential. In this dissertation, we argue that automatic code reuse detection enables an efficient data reduction of a high volume of incoming malware for downstream analysis and enhances software security by efficiently finding known vulnerabilities across large code bases. In order to demonstrate the benefits of automatic software similarity detection, we discuss two representative problems that are remedied by scalable analysis: malware triage and unpatched code clone detection. First, we tackle the onslaught of malware. Although over one million new malware samples are reported each day, existing research shows that most malware are not written from scratch; instead, they are automatically generated variants of existing malware. When groups of highly similar variants are clustered together, new malware more easily stands out. Unfortunately, current systems struggle with handling this high volume of malware. We scale clustering using feature hashing and perform semantic analysis using co-clustering. Our evaluation demonstrates that these techniques are an order of magnitude faster than previous systems and automatically discover highly correlated features and malware groups. Furthermore, we design algorithms to infer evolutionary relationships among malware, which helps analysts understand trends over time and make informed decisions about which malware to analyze first. Second, we address the problem of detecting unpatched code clones at scale. When buggy code gets copied from project to project, eventually all projects will need to be patched. We call clones of buggy code that have been fixed in only a subset of projects unpatched code clones. Unfortunately, code copying is usually ad hoc and is often not tracked, which makes it challenging to identify all unpatched vulnerabilities in code bases at the scale of entire OS distributions. We scale unpatched code clone detection to spot over 15,000 latent security vulnerabilities in 2.1 billion lines of code from the Linux kernel, all Debian and Ubuntu packages, and all C/C++ projects in SourceForge in three hours on a single machine. To the best of our knowledge, this is the largest set of bugs ever reported in a single paper.
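To make the feature hashing step concrete, here is a minimal sketch of the hashing trick applied to malware features such as byte or mnemonic n-grams; the dimensionality and signed-hash detail are illustrative choices, not necessarily those of the dissertation:

```python
import hashlib
from collections import Counter

def hashed_feature_vector(tokens, dim=2**12):
    """Map sparse string features into a fixed-size vector; no shared
    dictionary is needed, so samples can be vectorized independently."""
    vec = [0] * dim
    for tok, count in Counter(tokens).items():
        d = hashlib.md5(tok.encode()).digest()
        idx = int.from_bytes(d[:4], "big") % dim
        sign = 1 if d[4] & 1 else -1  # signed hashing cancels collision bias
        vec[idx] += sign * count
    return vec

# Toy usage: 4-grams over a sample's bytes stand in for real features.
sample = b"\x55\x8b\xec\x83\xec\x10\x53\x56\x57"
ngrams = [sample[i:i + 4].hex() for i in range(len(sample) - 3)]
vec = hashed_feature_vector(ngrams)
print(sum(1 for v in vec if v))  # number of non-zero buckets
```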
100

Development of a technique to identify advertisements in a video signal / Ruan Moolman

Moolman, Ruan January 2012 (has links)
In recent years Content Based Information Retrieval (CBIR) has received a lot of research attention, starting with audio, followed by images and video. Video fingerprinting is a CBIR technique that creates a digital descriptor, also known as a fingerprint, for a video based on its content. These fingerprints are then saved to a database and used to detect unknown videos by comparing the unknown video's fingerprint to the fingerprints in the database to get a match. Many techniques have already been proposed with various levels of success, but most of the existing techniques focus mainly on robustness and neglect the speed of implementation. In this dissertation a novel video fingerprinting technique will be developed with the main focus on detecting advertisements in a television broadcast. The system must therefore be able to process the incoming video stream in real time and detect all the advertisements that are present. Even though the algorithm has to be fast, it still has to be robust enough to handle a moderate amount of distortion. Video fingerprinting still holds many challenges, as it involves effectively characterizing videos, which are made up of sequences of images. This means the algorithm must somehow imitate the inherent ability of humans to recognize a video almost instantly. The technique uses the content of the video to derive a fingerprint, so the features used by the fingerprinting algorithm should be robust to distortions that don't affect content as perceived by humans. / Thesis (MIng (Computer and Electronic Engineering))--North-West University, Potchefstroom Campus, 2013
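The dissertation's specific fingerprint is not described in the abstract; as a generic illustration of the idea, the sketch below derives a coarse binary fingerprint per grayscale frame and matches clips by Hamming distance (numpy arrays stand in for decoded frames):

```python
import numpy as np

def frame_fingerprint(frame, grid=8):
    """Binary fingerprint: threshold a grid of mean luminances against
    the frame's overall mean (robust to mild distortions)."""
    h, w = frame.shape
    cells = frame[:h - h % grid, :w - w % grid].reshape(
        grid, h // grid, grid, w // grid).mean(axis=(1, 3))
    return (cells > cells.mean()).flatten()

def clip_distance(fps_a, fps_b):
    """Total Hamming distance between two equal-length fingerprint lists."""
    return sum(int(np.count_nonzero(a != b)) for a, b in zip(fps_a, fps_b))

# Toy usage: match a noisy query clip against a two-entry ad database.
rng = np.random.default_rng(0)
ad = [rng.random((120, 160)) for _ in range(10)]
other = [rng.random((120, 160)) for _ in range(10)]
db = {"ad-1": [frame_fingerprint(f) for f in ad],
      "other": [frame_fingerprint(f) for f in other]}
query = [frame_fingerprint(f + rng.normal(0, 0.05, f.shape)) for f in ad]
print(min(db, key=lambda name: clip_distance(query, db[name])))  # ad-1
```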
