  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Performance of IR Models on Duplicate Bug Report Detection: A Comparative Study

Kaushik, Nilam 23 December 2011
Open source projects incorporate bug triagers to help with the task of assigning bug reports to developers. One of the tasks of a triager is to identify whether an incoming bug report is a duplicate of a pre-existing report. To detect duplicate bug reports, a triager relies either on memory and experience or on the search capabilities of the bug repository. Both approaches can be time-consuming and may lead to the misidentification of duplicates. It has also been suggested that duplicate bug reports are not necessarily harmful; instead, they can complement each other and provide additional information for developers investigating the defect at hand. This motivates the need for automated or semi-automated techniques for duplicate bug detection. In the literature, two main approaches have been proposed to solve this problem. The first prevents duplicate reports from reaching developers by automatically filtering them, while the second provides the triager with a list of the top-N most similar bug reports, allowing the triager to compare the incoming report with those in the list. Previous work has tried to improve the quality of the suggested lists, but the approaches either suffered from a poor recall rate or incurred additional runtime overhead, making the deployment of a retrieval system impractical. To the best of our knowledge, little work has been done to exhaustively compare the performance of different information retrieval models (especially more recent techniques such as topic modeling) on this problem or to understand the effectiveness of different heuristics across application domains. In this thesis, we compare the performance of word-based models (derivatives of the Vector Space Model) such as TF-IDF and Log-Entropy with that of topic-based models such as Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and Random Indexing (RI). We leverage heuristics that incorporate exception stack frames, surface features, and the summary and long description from the free-form text of the bug report. We perform experiments on subsets of bug reports from Eclipse and Firefox and achieve recall rates of 60% and 58%, respectively. We find that word-based models, in particular a Log-Entropy-based weighting scheme, outperform topic-based ones such as LSI and LDA. Using historical bug data from Eclipse and NetBeans, we determine the optimal time frame for a desired level of duplicate bug report coverage. We realize an Online Duplicate Detection Framework that uses a sliding window of a constant time frame as a first step towards simulating incoming bug reports and recommending duplicates to the end user.
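
To make the retrieval step concrete, below is a minimal sketch of top-N duplicate candidate retrieval using a TF-IDF weighted vector space model, one of the word-based models compared in this abstract. The bug report texts and the use of scikit-learn are illustrative assumptions, not the thesis implementation.

```python
# A minimal sketch of top-N duplicate candidate retrieval with a TF-IDF
# weighted vector space model. The bug report texts are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

existing_reports = [
    "NullPointerException when opening editor on startup",
    "Crash while rendering large SVG files",
    "Editor freezes when opening a file at startup",
]
incoming_report = "Opening the editor at startup throws NullPointerException"

vectorizer = TfidfVectorizer(stop_words="english")
report_matrix = vectorizer.fit_transform(existing_reports)  # index the repository
query_vector = vectorizer.transform([incoming_report])      # vectorize the new report

# Rank existing reports by cosine similarity and present the top-N to the triager.
similarities = cosine_similarity(query_vector, report_matrix).ravel()
top_n = similarities.argsort()[::-1][:2]
for rank, idx in enumerate(top_n, start=1):
    print(f"{rank}. score={similarities[idx]:.2f}  {existing_reports[idx]}")
```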
2

Efficient External-Memory Graph Search for Model Checking

Lamborn, Peter C 17 May 2014
Model checking problems suffer from state space explosion: the number of states in the graph increases exponentially with the number of variables in the state description. Searching the large graphs required in model checking calls for an efficient algorithm. This dissertation explores several methods to improve an external-memory search algorithm for model checking problems. A tool implementing these methods is built on top of the Murphi model checker. One improvement is a state cache for immediate duplicate detection that leverages the properties of state locality. A novel type of locality, intralayer locality, is explained and shown to exist in a variety of search spaces. Another improvement, partial delayed duplicate detection, exploits interlayer locality to reduce search times. An automatic partitioning function is described that allows hash-based delayed duplicate detection to be used without domain knowledge of the state space. A phased delayed duplicate detection algorithm combining features of hash-based and sorting-based delayed duplicate detection is explained and compared to the other methods.
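
As an illustration of the delayed duplicate detection idea mentioned above, here is a simplified, in-memory sketch of hash-based delayed duplicate detection in a breadth-first search. The toy transition relation and the in-memory buckets are assumptions standing in for an external-memory implementation that would write partitions to disk.

```python
# Simplified sketch of hash-based delayed duplicate detection: successors are
# streamed into partitions chosen by a hash function, and duplicate elimination
# is delayed until each partition is processed as a whole.
from collections import defaultdict

def successors(state):
    # Toy transition relation: increment or double a counter, capped at 20.
    return {s for s in (state + 1, state * 2) if s <= 20}

def partition(state, num_partitions=4):
    return hash(state) % num_partitions

def external_bfs(start, num_partitions=4):
    visited = {start}
    frontier = {start}
    while frontier:
        buckets = defaultdict(list)
        for state in frontier:
            for succ in successors(state):
                buckets[partition(succ, num_partitions)].append(succ)
        next_frontier = set()
        # Delayed duplicate detection: each bucket is deduplicated in one pass.
        for bucket in buckets.values():
            for succ in set(bucket):
                if succ not in visited:
                    visited.add(succ)
                    next_frontier.add(succ)
        frontier = next_frontier
    return visited

print(sorted(external_bfs(1)))
```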
3

The Impact of Near-Duplicate Documents on Information Retrieval Evaluation

Khoshdel Nikkhoo, Hani 18 January 2011
Near-duplicate documents can adversely affect the efficiency and effectiveness of search engines. Due to the pairwise nature of the comparisons required for near-duplicate detection, this process is extremely costly in terms of the time and processing power it requires. Despite the ubiquitous presence of near-duplicate detection algorithms in commercial search engines, their application and impact in research environments are not fully explored. The implementation of near-duplicate detection algorithms forces trade-offs between efficiency and effectiveness, entailing careful testing and measurement to ensure acceptable performance. In this thesis, we describe and evaluate a scalable implementation of a near-duplicate detection algorithm, based on standard shingling techniques, running under a MapReduce framework. We explore two different shingle sampling techniques and analyze their impact on the near-duplicate document detection process. In addition, we investigate the prevalence of near-duplicate documents in the runs submitted to the ad hoc task of the TREC 2009 Web Track.
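
For illustration, the following is a minimal sketch of shingle-based near-duplicate scoring with a simple hash-mod sampling of shingles. The sampling rule and the example documents are assumptions and not necessarily the specific techniques evaluated in the thesis.

```python
# Documents are reduced to sets of word k-shingles and compared by Jaccard
# similarity; a 0-mod-p hash rule is one simple way to sample shingles.
def shingles(text, k=3):
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def sample(shingle_set, p=4):
    # Keep only shingles whose hash is 0 modulo p. Note that Python string
    # hashing is randomized across runs, so the sampled set varies per run.
    return {s for s in shingle_set if hash(s) % p == 0}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaped over the lazy dog"
print(jaccard(shingles(doc_a), shingles(doc_b)))                   # full shingle sets
print(jaccard(sample(shingles(doc_a)), sample(shingles(doc_b))))   # sampled sets
```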
4

Near-Duplicate Detection Using Instance Level Constraints

Patel, Vishal 08 1900
For the task of near-duplicate document detection, comparison approaches based on bag-of-words representations, as used in the information retrieval community, are not sufficiently accurate. This work presents a novel approach for the setting in which instance-level constraints are given for documents and near-duplicates must be retrieved for a new query document. The framework incorporates the instance-level constraints and clusters documents into groups using a novel clustering approach, Grouped Latent Dirichlet Allocation (gLDA). A distance metric is then learned for each cluster using the large margin nearest neighbor algorithm, and documents are finally ranked for a given new, unseen document using the learnt distance metrics. Experimental results on various datasets demonstrate that our clustering method (gLDA with side constraints) performs better than other clustering methods and that the overall approach outperforms other near-duplicate detection algorithms.
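
As a schematic illustration of the final ranking step described above, the sketch below ranks candidate documents by a per-cluster Mahalanobis distance to the query. The document vectors, cluster assignments, and metric matrices are hypothetical stand-ins for gLDA and large-margin-nearest-neighbor outputs.

```python
# Rank candidates by distance to the query under the learned metric of the
# cluster each candidate belongs to. All inputs here are hypothetical.
import numpy as np

def mahalanobis(x, y, M):
    d = x - y
    return float(np.sqrt(d @ M @ d))

# Hypothetical document vectors, their cluster ids, and per-cluster metrics.
docs = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9]])
cluster_of = [0, 0, 1]
metrics = {0: np.array([[2.0, 0.0], [0.0, 0.5]]),
           1: np.eye(2)}

query = np.array([0.85, 0.2])
ranked = sorted(range(len(docs)),
                key=lambda i: mahalanobis(query, docs[i], metrics[cluster_of[i]]))
print(ranked)  # indices of candidate near-duplicates, closest first
```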
5

Adaptive division of feature space for rapid detection of near-duplicate video segments

Ide, Ichiro, Suzuki, Shugo, Takahashi, Tomokazu, Murase, Hiroshi 28 June 2009
No description available.
6

Computational analyses of small silencing RNAs

Fu, Yu 11 December 2018
High-throughput sequencing is a powerful tool for studying diverse aspects of biology and applies to genome, transcriptome, and small RNA profiling. Ever-increasing sequencing throughput and more specialized sequencing assays demand more sophisticated bioinformatics approaches. In this thesis, I present four studies for which I developed computational methods to handle high-throughput sequencing data and gain insights into biology. The first study describes the genome of High Five (Hi5) cells, originally derived from Trichoplusia ni eggs. The chromosome-level assembly (scaffold N50 = 14.2 Mb) contains 14,037 predicted protein-coding genes. Examination and curation of multiple gene families, pathways, and small RNA-producing loci reveal species- and order-specific features. The availability of the genome sequence, together with genome editing and single-cell cloning protocols, enables Hi5 cells to serve as a new tool for studying small RNAs. The second study focuses on pachytene piRNAs, a class of piRNAs produced at the pachytene stage of mammalian spermatogenesis. Despite their abundance, pachytene piRNAs are poorly understood. I find that pachytene piRNAs cleave transcripts of protein-coding genes and further target transcripts from other pachytene piRNA loci. Subsequently, a systematic investigation of piRNA targeting, integrating different types of sequencing data, uncovers the piRNA targeting rule. The third study describes computational procedures to map splicing branchpoints (BPs) using high-throughput sequencing data. Screening more than 1.2 trillion RNA-seq reads determines more than 140,000 BPs for both human and mouse. These branchpoints are compiled into BPDB (BranchPoint DataBase) to provide a comprehensive branchpoint catalog. The final study combines novel experimental and computational procedures to handle the PCR duplicates that are prevalent in high-throughput sequencing data. Incorporating unique molecular identifiers (UMIs) to tag each read enables unambiguous identification of PCR duplicates. Both simulated and experimental datasets demonstrate that UMI incorporation increases the reproducibility of RNA-seq and small RNA-seq. Surveying seven common variables in high-throughput sequencing reveals that the amount of starting material and the sequencing depth, but not the number of PCR cycles, determine the PCR duplicate frequency. Finally, I show that removing PCR duplicates without UMIs introduces substantial bias into data analysis.
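
As an illustration of the UMI-based deduplication step described above, the following sketch collapses reads that share both a mapping position and a UMI. The read tuples are hypothetical; a real pipeline would parse alignments from a BAM file.

```python
# Reads sharing both a mapping position and a UMI are treated as PCR
# duplicates and collapsed to a single representative.
from collections import defaultdict

# (read_id, chromosome, position, umi) -- hypothetical example reads
reads = [
    ("r1", "chr1", 1000, "ACGT"),
    ("r2", "chr1", 1000, "ACGT"),   # PCR duplicate of r1 (same position + UMI)
    ("r3", "chr1", 1000, "TTAG"),   # same position, different UMI -> kept
    ("r4", "chr2", 5000, "ACGT"),
]

groups = defaultdict(list)
for read_id, chrom, pos, umi in reads:
    groups[(chrom, pos, umi)].append(read_id)

deduplicated = [ids[0] for ids in groups.values()]  # keep one read per group
print(sorted(deduplicated))  # ['r1', 'r3', 'r4']
```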
7

Classification of Near-Duplicate Video Segments Based on their Appearance Patterns

Murase, Hiroshi, Takahashi, Tomokazu, Deguchi, Daisuke, Shamoto, Yuji, Ide, Ichiro January 2010
No description available.
8

Using Anycast to Improve Fast Handover Performance

Chu, Kuang-ning 09 September 2006
There are two critical issues involved when a mobile node moves across two different network sub-domains: minimizing possible packet loss and shortening the handover time. Fast handover is a remedy to these problems: it minimizes packet loss by making use of buffers and speeds up the handover procedure through L2 triggering. Two components contribute to the handover delay, namely the L2 handover delay and the L3 handover delay. The L3 handover delay consists of the movement detection delay, the duplicate address detection delay, and the registration delay. With fast handover, the movement detection delay can be lowered by using an L2 trigger, and the registration delay can be decreased by buffering and tunneling. However, the problem of out-of-order packets still exists. A novel handover scheme incorporating anycast technology is developed and presented in this thesis. With a refined buffer control scheme and switching between unicast and anycast addressing, the proposed approach greatly improves handover performance.
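
As a rough illustration of the delay decomposition described above, the sketch below sums hypothetical delay components for a standard handover and for a fast handover in which the movement detection and registration delays are reduced. All millisecond values are invented for illustration only.

```python
# Toy illustration of the L3 handover delay decomposition from the abstract.
# All delay values are hypothetical placeholders.
def l3_handover_delay(movement_detection_ms, dad_ms, registration_ms):
    # L3 delay = movement detection + duplicate address detection + registration
    return movement_detection_ms + dad_ms + registration_ms

standard = l3_handover_delay(movement_detection_ms=100, dad_ms=1000, registration_ms=200)
# Fast handover: an L2 trigger lowers movement detection, and buffering/tunneling
# reduces the effective registration delay (values are made up).
fast = l3_handover_delay(movement_detection_ms=10, dad_ms=1000, registration_ms=50)
print(f"standard L3 delay: {standard} ms, fast handover: {fast} ms")
```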
