1 |
Performance of IR Models on Duplicate Bug Report Detection: A Comparative Study
Kaushik, Nilam (23 December 2011)
Open source projects incorporate bug triagers to help with the task of bug report
assignment to developers. One of the tasks of a triager is to identify whether an incoming
bug report is a duplicate of a pre-existing report. In order to detect duplicate bug reports,
a triager either relies on memory and experience or on the search capabilities of the bug
repository. Both of these approaches can be time-consuming for the triager and may also
lead to the misidentification of duplicates. It has also been suggested that duplicate bug
reports are not necessarily harmful; instead, they can complement each other to provide
additional information for developers to investigate the defect at hand. This motivates the
need for automated or semi-automated techniques for duplicate bug detection.
In the literature, two main approaches have been proposed to solve this problem. The
first approach is to prevent duplicate reports from reaching developers by automatically
filtering them while the second approach deals with providing the triager a list of top-N
similar bug reports, allowing the triager to compare the incoming bug report with the ones
provided in the list. Previous works have tried to enhance the quality of the suggested
lists, but the approaches either suffered from a poor recall rate or incurred additional
runtime overhead, making the deployment of a retrieval system impractical. To the best
of our knowledge, there has been little work on an exhaustive comparison of
the performance of different information retrieval models (especially more recent
techniques such as topic modeling) on this problem, or on understanding the effectiveness of
different heuristics across various application domains.
In this thesis, we compare the performance of word-based models (derivatives of the
Vector Space Model) such as TF-IDF and Log-Entropy with that of topic-based models such as
Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and Random Indexing
(RI). We leverage heuristics that incorporate exception stack frames, surface features, and
the summary and long description from the free-form text in the bug report. We perform
experiments on subsets of bug reports from Eclipse and Firefox and achieve recall rates of
60% and 58%, respectively. We find that word-based models, in particular a Log-Entropy-based
weighting scheme, outperform topic-based ones such as LSI and LDA.
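As a rough illustration of the word-based retrieval setting compared here, the following minimal sketch ranks existing bug reports against an incoming one using TF-IDF weighting and cosine similarity; the sample reports, the use of scikit-learn, and the top-N cutoff are assumptions for illustration, not the thesis implementation.

    # Hypothetical sketch: rank candidate duplicates of an incoming bug report by
    # TF-IDF cosine similarity (not the thesis implementation).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    existing_reports = [
        "NullPointerException when opening the editor preferences page",
        "Crash on startup after updating to the latest nightly build",
        "Preferences dialog throws an NPE as soon as it is opened",
    ]
    incoming_report = "Opening the preferences page causes a NullPointerException"

    vectorizer = TfidfVectorizer(stop_words="english")
    corpus = vectorizer.fit_transform(existing_reports)   # one row per existing report
    query = vectorizer.transform([incoming_report])

    scores = cosine_similarity(query, corpus).ravel()
    top_n = scores.argsort()[::-1][:2]                    # top-2 most similar reports
    for idx in top_n:
        print(f"{scores[idx]:.3f}  {existing_reports[idx]}")

A Log-Entropy scheme, which the thesis reports to perform better, would replace these term weights with a logarithmic local weight scaled by a global entropy-based weight.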
Using historical bug data from Eclipse and NetBeans, we determine the optimal time
frame for a desired level of duplicate bug report coverage. We realize an Online Duplicate
Detection Framework that uses a sliding window of a constant time frame as a first step
towards simulating incoming bug reports and recommending duplicates to the end user.
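A minimal sketch of the sliding-window idea is given below; the window length, the report fields, and the similarity function are placeholders, since the thesis determines the time frame empirically from historical Eclipse and NetBeans data.

    # Hypothetical sketch of online duplicate recommendation over a sliding time window.
    from datetime import timedelta

    WINDOW = timedelta(days=180)   # placeholder; the thesis derives the time frame empirically

    def recommend_duplicates(incoming, past_reports, similarity, now, top_n=5):
        # Keep only reports filed within the constant time window.
        window = [r for r in past_reports if now - r["created"] <= WINDOW]
        # Rank windowed reports by similarity to the incoming report, best first.
        ranked = sorted(window, key=lambda r: similarity(incoming["text"], r["text"]), reverse=True)
        return ranked[:top_n]      # top-N candidate duplicates shown to the triager

Here, similarity could be the TF-IDF cosine function sketched earlier, and past_reports is assumed to be a list of records with "text" and "created" fields.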
|
2 |
Efficient External-Memory Graph Search for Model Checking
Lamborn, Peter C (17 May 2014)
Model checking problems suffer from state space explosion: the number of states in the graph increases exponentially with the number of variables in the state description. Searching the large graphs required in model checking requires an efficient algorithm. This dissertation explores several methods to improve an external-memory search algorithm for model checking problems. A tool implementing these methods is built on top of the Murphi model checker. One improvement is a state cache for immediate duplicate detection, leveraging the properties of state locality. A novel type of locality, intralayer locality, is explained and shown to exist in a variety of search spaces. Another improvement, partial delayed duplicate detection, exploits interlayer locality to reduce search times. An automatic partitioning function is described that allows hash-based delayed duplicate detection to be used without domain knowledge of the state space. A phased delayed duplicate detection algorithm combining features of hash-based delayed duplicate detection and sorting-based delayed duplicate detection is explained and compared to the other methods.
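The hash-based delayed duplicate detection mentioned above can be sketched roughly as follows; the string-based states and the simple on-disk partition files are illustrative assumptions and do not reflect the Murphi-based tool built in the dissertation.

    # Rough sketch: successors are streamed to disk partitions chosen by a hash of the
    # state, and duplicate detection is delayed until each partition is processed on its own.
    import hashlib

    NUM_PARTITIONS = 4

    def partition_of(state: str) -> int:
        # Automatic, hash-based partitioning: no domain knowledge of the state space needed.
        return int(hashlib.sha1(state.encode()).hexdigest(), 16) % NUM_PARTITIONS

    def expand_layer(frontier, successors):
        # Phase 1: write successors to partition files without any duplicate checks.
        files = [open(f"layer_part_{p}.tmp", "w") for p in range(NUM_PARTITIONS)]
        for state in frontier:
            for succ in successors(state):
                files[partition_of(succ)].write(succ + "\n")
        for f in files:
            f.close()

    def dedup_layer(visited):
        # Phase 2: delayed duplicate detection, loading one partition into memory at a time.
        # visited is one set of already-seen states per partition.
        next_frontier = []
        for p in range(NUM_PARTITIONS):
            with open(f"layer_part_{p}.tmp") as f:
                candidates = {line.strip() for line in f}
            new_states = candidates - visited[p]
            visited[p] |= new_states
            next_frontier.extend(new_states)
        return next_frontier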
|
3 |
The Impact of Near-Duplicate Documents on Information Retrieval Evaluation
Khoshdel Nikkhoo, Hani (18 January 2011)
Near-duplicate documents can adversely affect the efficiency and
effectiveness of search engines.
Due to the pairwise nature of the comparisons required for near-duplicate
detection, this process is extremely costly in terms of the time and
processing power it requires.
Despite the ubiquitous presence of near-duplicate detection algorithms
in commercial search engines, their application and impact in research
environments are not fully explored.
The implementation of near-duplicate detection algorithms forces trade-offs
between efficiency and effectiveness, entailing careful testing and
measurement to ensure acceptable performance.
In this thesis, we describe and evaluate a scalable implementation of a
near-duplicate detection algorithm, based on standard shingling techniques,
running under a MapReduce framework.
We explore two different shingle sampling techniques and analyze
their impact on the near-duplicate document detection process.
In addition, we investigate the prevalence of near-duplicate documents
in the runs submitted to the adhoc task of the TREC 2009 Web Track.
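A bare-bones version of the shingling comparison described above is sketched below; the shingle width, the example documents, and the decision threshold are illustrative, and the shingle sampling and MapReduce parallelisation evaluated in the thesis are omitted.

    # Hypothetical sketch: score two documents for near-duplication using word 4-gram
    # shingles and Jaccard similarity.
    def shingles(text, w=4):
        words = text.lower().split()
        return {" ".join(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0

    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumped over the lazy dog near the river bank"
    score = jaccard(shingles(doc_a), shingles(doc_b))
    print(f"near-duplicate score: {score:.2f}")   # flag as near-duplicates above some chosen threshold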
|
4 |
Near-Duplicate Detection Using Instance Level Constraints
Patel, Vishal (08 1900)
For the task of near-duplicate document detection, comparison approaches based on the bag-of-words representation used in the information retrieval community are not sufficiently accurate. This work presents a novel approach for the setting in which instance-level constraints are given for documents and, given a new query document, near-duplicates of it must be retrieved. The framework incorporates the instance-level constraints and clusters documents into groups using a novel clustering approach, Grouped Latent Dirichlet Allocation (gLDA). A distance metric is then learned for each cluster using the large margin nearest neighbor algorithm, and finally documents are ranked for a given new, unknown document using the learned distance metrics. A variety of experimental results on various datasets demonstrate that our clustering method (gLDA with side constraints) performs better than other clustering methods and that the overall approach outperforms other near-duplicate detection algorithms.
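The retrieval step of this approach can be sketched roughly as follows; the cluster centroids and per-cluster metric matrices are treated as already learned (via gLDA and large margin nearest neighbor in the thesis), and the NumPy-based document vectors are an assumption for illustration.

    # Rough sketch: assign a query document to a cluster, then rank that cluster's
    # documents with the cluster's learned (Mahalanobis-style) distance metric.
    import numpy as np

    def rank_near_duplicates(query_vec, centroids, members, metrics):
        # Pick the cluster whose centroid is closest to the query (Euclidean, for simplicity).
        cluster = int(np.argmin([np.linalg.norm(query_vec - c) for c in centroids]))
        M = metrics[cluster]                     # learned metric matrix for this cluster
        scored = []
        for doc_id, doc_vec in members[cluster]:
            diff = query_vec - doc_vec
            scored.append((float(diff @ M @ diff), doc_id))   # d(q, x) = (q - x)^T M (q - x)
        return [doc_id for _, doc_id in sorted(scored)]       # most likely near-duplicates first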
|
5 |
Adaptive division of feature space for rapid detection of near-duplicate video segments
Ide, Ichiro; Suzuki, Shugo; Takahashi, Tomokazu; Murase, Hiroshi (28 June 2009)
No description available.
|
6 |
Computational analyses of small silencing RNAs
Fu, Yu (11 December 2018)
High-throughput sequencing is a powerful tool to study diverse aspects of biology and applies to genome, transcriptome, and small RNA profiling. Ever-increasing sequencing throughput and more specialized sequencing assays demand more sophisticated bioinformatics approaches. In this thesis, I present four studies for which I developed computational methods to handle high-throughput sequencing data to gain insights into biology.
The first study describes the genome of High Five (Hi5) cells, originally derived from Trichoplusia ni eggs. The chromosome-level assembly (scaffold N50 = 14.2 Mb) contains 14,037 predicted protein-coding genes. Examination and curation of multiple gene families, pathways, and small RNA-producing loci reveal species- and order-specific features. The availability of the genome sequence, together with genome editing and single-cell cloning protocols, enables Hi5 cells as a new tool for studying small RNAs.
The second study focuses on just one type of piRNAs that are produced at the pachytene stage of mammalian spermatogenesis. Despite their abundance, pachytene piRNAs are poorly understood. I find that pachytene piRNAs cleave transcripts of protein-coding genes and further target transcripts from other pachytene piRNA loci. Subsequently, systematic investigation of piRNA targeting by integrating different types of sequencing data uncovers the piRNA targeting rule.
The third study describes computational procedures to map splicing branchpoints using high-throughput sequencing data. Screening >1.2 trillion RNA-seq reads determines >140,000 BPs for both human and mouse. Such branchpoints are compiled into BPDB (BranchPoint DataBase) to provide a comprehensive branchpoint catalog.
The final study combines novel experimental and computational procedures to handle PCR duplicates that are prevalent in high-throughput sequencing data. Incorporation of unique molecular identifiers (UMIs) to tag each read enables unambiguous identification of PCR duplicates. Both simulated and experimental datasets demonstrate that UMI incorporation increases the reproducibility of RNA-seq and small RNA-seq. Surveying seven common variables in high-throughput sequencing reveals that the amount of starting material and the sequencing depth, but not the number of PCR cycles, determine the PCR duplicate frequency. Finally, I show that removing PCR duplicates without UMIs introduces substantial bias into data analysis.
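The UMI-based duplicate removal in the final study can be sketched as follows; the tuple-based read records are a simplification of what a real alignment-file pipeline would use.

    # Rough sketch: collapse PCR duplicates by treating reads that share a mapping
    # position and a unique molecular identifier (UMI) as copies of one molecule.
    def collapse_pcr_duplicates(reads):
        # Each read is a (chrom, position, strand, umi, sequence) tuple.
        seen, unique = set(), []
        for chrom, pos, strand, umi, seq in reads:
            key = (chrom, pos, strand, umi)
            if key not in seen:          # keep only the first read for each molecule
                seen.add(key)
                unique.append((chrom, pos, strand, umi, seq))
        return unique

    reads = [
        ("chr1", 1000, "+", "ACGT", "TTAGCAAC"),
        ("chr1", 1000, "+", "ACGT", "TTAGCAAC"),   # PCR duplicate: same position and UMI
        ("chr1", 1000, "+", "GGCA", "TTAGCAAC"),   # same position, different UMI: kept
    ]
    print(len(collapse_pcr_duplicates(reads)))     # 2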
|
7 |
Classification of Near-Duplicate Video Segments Based on their Appearance Patterns
Murase, Hiroshi; Takahashi, Tomokazu; Deguchi, Daisuke; Shamoto, Yuji; Ide, Ichiro (January 2010)
No description available.
|
8 |
Using Anycast to Improve Fast Handover Performance
Chu, Kuang-ning (09 September 2006)
There are two critical issues involved as a mobile node moves across two different network sub-domains. One of them is to minimize the possible packet loss, and the other is to shorten the handover time. Fast handover is a remedy to these problems: it minimizes packet loss by making use of buffers and speeds up the handover procedure through L2 triggering. There are two components contributing to the handover delay, namely the L2 handover delay and the L3 handover delay. The L3 handover delay consists of the movement detection delay, the duplicate address detection delay, and the registration delay. With fast handover, the movement detection delay can be lowered by using the L2 trigger, and the registration delay can be decreased by buffering and tunneling. However, the problem of out-of-order packets still exists. A novel handover scheme incorporating anycast technology is developed and presented in this thesis. With a refined buffer control scheme and switching between unicast and anycast addressing, the handover performance can be greatly improved by the proposed approach.
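A highly simplified model of the buffering behaviour described above is sketched below; it only illustrates holding packets between the L2 trigger and the completion of the L3 handover, and leaves out the anycast addressing and tunneling that the thesis actually proposes.

    # Simplified sketch: queue packets from the L2 trigger until the handover completes,
    # then flush them in arrival order so nothing is lost or reordered.
    from collections import deque

    class HandoverBuffer:
        def __init__(self):
            self.in_handover = False
            self.queue = deque()

        def on_l2_trigger(self):
            # Link-layer trigger: start buffering before the L3 handover finishes.
            self.in_handover = True

        def deliver(self, packet, forward):
            if self.in_handover:
                self.queue.append(packet)    # hold the packet instead of dropping it
            else:
                forward(packet)

        def on_handover_complete(self, forward):
            # Flush buffered packets in order to the mobile node's new point of attachment.
            while self.queue:
                forward(self.queue.popleft())
            self.in_handover = False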
|