• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 9
  • 1
  • Tagged with
  • 11
  • 11
  • 8
  • 6
  • 5
  • 5
  • 4
  • 4
  • 4
  • 4
  • 3
  • 3
  • 3
  • 3
  • 3
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Efficient External-Memory Graph Search for Model Checking

Lamborn, Peter C 17 May 2014 (has links)
Model checking problems suffer from state space explosion. State space explosion is the number of states in the graph increases exponentially with the number of variables in the state description. Searching the large graphs required in model checking requires an efficient algorithm. This dissertation explores several methods to improve an externalmemory search algorithm for model checking problems. A tool implementing these methods is built on top of the Murphi model checker. One improvement is a state cache for immediate detection leveraging the properties of state locality. A novel type of locality, intralayer locality is explained and shown to exist in a variety of search spaces. Another improvement, partial delayed duplicate detection, exploits interlayer locality to reduce search times. An automatic partitioning function is described that allows hash-based delayed duplicate detection to be used without domain knowledge of the state space. A phased delayed duplicate detection algorithm combining features of hash-based delayed duplicate detection and sorting-based delayed duplicate detection is explained and compared to the other methods.
2

The Impact of Near-Duplicate Documents on Information Retrieval Evaluation

Khoshdel Nikkhoo, Hani 18 January 2011 (has links)
Near-duplicate documents can adversely affect the efficiency and effectiveness of search engines. Due to the pairwise nature of the comparisons required for near-duplicate detection, this process is extremely costly in terms of the time and processing power it requires. Despite the ubiquitous presence of near-duplicate detection algorithms in commercial search engines, their application and impact in research environments is not fully explored. The implementation of near-duplicate detection algorithms forces trade-offs between efficiency and effectiveness, entailing careful testing and measurement to ensure acceptable performance. In this thesis, we describe and evaluate a scalable implementation of a near-duplicate detection algorithm, based on standard shingling techniques, running under a MapReduce framework. We explore two different shingle sampling techniques and analyze their impact on the near-duplicate document detection process. In addition, we investigate the prevalence of near-duplicate documents in the runs submitted to the adhoc task of TREC 2009 web track.
3

The Impact of Near-Duplicate Documents on Information Retrieval Evaluation

Khoshdel Nikkhoo, Hani 18 January 2011 (has links)
Near-duplicate documents can adversely affect the efficiency and effectiveness of search engines. Due to the pairwise nature of the comparisons required for near-duplicate detection, this process is extremely costly in terms of the time and processing power it requires. Despite the ubiquitous presence of near-duplicate detection algorithms in commercial search engines, their application and impact in research environments is not fully explored. The implementation of near-duplicate detection algorithms forces trade-offs between efficiency and effectiveness, entailing careful testing and measurement to ensure acceptable performance. In this thesis, we describe and evaluate a scalable implementation of a near-duplicate detection algorithm, based on standard shingling techniques, running under a MapReduce framework. We explore two different shingle sampling techniques and analyze their impact on the near-duplicate document detection process. In addition, we investigate the prevalence of near-duplicate documents in the runs submitted to the adhoc task of TREC 2009 web track.
4

Adaptive windows for duplicate detection

Draisbach, Uwe, Naumann, Felix, Szott, Sascha, Wonneberg, Oliver January 2012 (has links)
Duplicate detection is the task of identifying all groups of records within a data set that represent the same real-world entity, respectively. This task is difficult, because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records and (ii) data sets might have a high volume making a pair-wise comparison of all records infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare all record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data comparing only records that appear within the same window. We propose several variations of SNM that have in common a varying window size and advancement. The general intuition of such adaptive windows is that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. We propose and thoroughly evaluate several adaption strategies, some of which are provably better than the original SNM in terms of efficiency (same results with fewer comparisons). / Duplikaterkennung beschreibt das Auffinden von mehreren Datensätzen, die das gleiche Realwelt-Objekt repräsentieren. Diese Aufgabe ist nicht trivial, da sich (i) die Datensätze geringfügig unterscheiden können, so dass Ähnlichkeitsmaße für einen paarweisen Vergleich benötigt werden, und (ii) aufgrund der Datenmenge ein vollständiger, paarweiser Vergleich nicht möglich ist. Zur Lösung des zweiten Problems existieren verschiedene Algorithmen, die die Datenmenge partitionieren und nur noch innerhalb der Partitionen Vergleiche durchführen. Einer dieser Algorithmen ist die Sorted-Neighborhood-Methode (SNM), welche Daten anhand eines Schlüssels sortiert und dann ein Fenster über die sortierten Daten schiebt. Vergleiche werden nur innerhalb dieses Fensters durchgeführt. Wir beschreiben verschiedene Variationen der Sorted-Neighborhood-Methode, die auf variierenden Fenstergrößen basieren. Diese Ansätze basieren auf der Intuition, dass Bereiche mit größerer und geringerer Ähnlichkeiten innerhalb der sortierten Datensätze existieren, für die entsprechend größere bzw. kleinere Fenstergrößen sinnvoll sind. Wir beschreiben und evaluieren verschiedene Adaptierungs-Strategien, von denen nachweislich einige bezüglich Effizienz besser sind als die originale Sorted-Neighborhood-Methode (gleiches Ergebnis bei weniger Vergleichen).
5

Automatic Identification of Duplicates in Literature in Multiple Languages

Klasson Svensson, Emil January 2018 (has links)
As the the amount of books available online the sizes of each these collections are at the same pace growing larger and more commonly in multiple languages. Many of these cor- pora contain duplicates in form of various editions or translations of books. The task of finding these duplicates is usually done manually but with the growing sizes making it time consuming and demanding. The thesis set out to find a method in the field of Text Mining and Natural Language Processing that can automatize the process of manually identifying these duplicates in a corpora mainly consisting of fiction in multiple languages provided by Storytel. The problem was approached using three different methods to compute distance measures between books. The first approach was comparing titles of the books using the Levenstein- distance. The second approach used extracting entities from each book using Named En- tity Recognition and represented them using tf-idf and cosine dissimilarity to compute distances. The third approach was using a Polylingual Topic Model to estimate the books distribution of topics and compare them using Jensen Shannon Distance. In order to es- timate the parameters of the Polylingual Topic Model 8000 books were translated from Swedish to English using Apache Joshua a statistical machine translation system. For each method every book written by an author was pairwise tested using a hypothesis test where the null hypothesis was that the two books compared is not an edition or translation of the others. Since there is no known distribution to assume as the null distribution for each book a null distribution was estimated using distance measures of books not written by the author. The methods were evaluated on two different sets of manually labeled data made by the author of the thesis. One randomly sampled using one-stage cluster sampling and one consisting of books from authors that the corpus provider prior to the thesis be considered more difficult to label using automated techniques. Of the three methods the Title Matching was the method that performed best in terms of accuracy and precision based of the sampled data. The entity matching approach was the method with the lowest accuracy and precision but with a almost constant recall at around 50 %. It was concluded that there seems to be a set of duplicates that are clearly distin- guished from the estimated null-distributions, with a higher significance level a better pre- cision and accuracy could have been made with a similar recall for the specific method. For topic matching the result was worse than the title matching and when studied the es- timated model was not able to create quality topics the cause of multiple factors. It was concluded that further research is needed for the topic matching approach. None of the three methods were deemed be complete solutions to automatize detection of book duplicates.
6

Near-Duplicate Detection Using Instance Level Constraints

Patel, Vishal 08 1900 (has links) (PDF)
For the task of near-duplicate document detection, comparison approaches based on bag-of-words used in information retrieval community are not sufficiently accurate. This work presents novel approach when instance-level constraints are given for documents and it is needed to retrieve them, given new query document for near-duplicate detection. The framework incorporates instance-level constraints and clusters documents into groups using novel clustering approach Grouped Latent Dirichlet Allocation (gLDA). Then distance metric is learned for each cluster using large margin nearest neighbor algorithm and finally ranked documents for given new unknown document using learnt distance metrics. The variety of experimental results on various datasets demonstrate that our clustering method (gLDA with side constraints) performs better than other clustering methods and the overall approach outperforms other near-duplicate detection algorithms.
7

Grouping Similar Bug Reports from Crash Dumps with Unsupervised Learning / Gruppering av liknande felrapporter med oövervakat lärande av kraschdumpar

Vestergren, Sara January 2021 (has links)
Quality software usually means high reliability, which in turn has two main components; the software should provide correctness, which means it should perform the specified task, and robustness in the sense that it should be able to manage unexpected situations. In other words, reliable systems are systems without bugs. Because of this, testing and debugging are recurrent and resource expensive tasks in software development, notably in large software systems. This thesis investigate the potential of using unsupervised machine learning on Ericsson bug reports to avoid unnecessary debugging by identifying duplicate bug reports. The bug report data that is considered are crash dumps from software crashes. The data is clustered using the clustering algorithms k-modes, k-prototypes and expectation maximization where-after the generated clusters are used to assign new incoming bug reports to the previously generated clusters, thus indicating whether an old bug report is similar to the newly submitted one. Due to the dataset only being partially labeled both internal and external validity indices are applied to evaluate the clustering. The results indicate that many, small clusters can be identified using the applied method. However, for the results to have high validity the methods could be applied on a larger data set. / Mjukvara av hög kvalitet innebär ofta hög tillförlitlighet, vilket i sin tur har två huvudkomponenter; mjukvaran bör vara korrekt, den ska alltså uppfylla dom specifierade kraven, och dessutom robust vilket innebär att den ska kunna hantera oväntade situationer. Med andra ord, tillförlitliga system är system utan buggar. På grund av detta är testning och felsökning återkommande och resurskrävande uppgifter inom mjukvaruutveckling, i synnerhet för stora mjukvarusystem. Detta arbete utforskar vilken potential oövervakad maskininlärning på Ericssons felrapporter har för att undvika onödig felsökning genom att identifiera felrapporter som är dubletter. Felrapporterna som används i detta arbete innehåller data som sparats i minnet vid en mjukvarukrasch. Data klustras sedan med klustringsalgoritmerna k-modes, k-prototypes och expectation maximization varpå dom genererade klustren används för att tilldela nya inkommande felrapporter till de tidigare generade klustren, för att på så sätt kunna identifiera om en gammal felrapport är lik en ny felrapport. Då de felrapporter som behandlas endast till viss del redan är märkta som dubletter används både externa och interna valideringsmått för att utvärdera klustringen. Resultaten tyder på att många, små kluster kunde identifieras med de använda metoderna. Dock skulle metoderna behöva appliceras på ett dataset med större antal felrapporter för att resultaten ska få hög validitet.
8

Cluster Analysis with Meaning : Detecting Texts that Convey the Same Message / Klusteranalys med mening : Detektering av texter som uttrycker samma sak

Öhrström, Fredrik January 2018 (has links)
Textual duplicates can be hard to detect as they differ in words but have similar semantic meaning. At Etteplan, a technical documentation company, they have many writers that accidentally re-write existing instructions explaining procedures. These "duplicates" clutter the database. This is not desired because it is duplicate work. The condition of the database will only deteriorate as the company expands. This thesis attempts to map where the problem is worst, and also how to calculate how many duplicates there are. The corpus is small, but written in a controlled natural language called Simplified Technical English. The method uses document embeddings from doc2vec and clustering by use of HDBSCAN* and validation using Density-Based Clustering Validation index (DBCV), to chart the problems. A survey was sent out to try to determine a threshold value of when documents stop being duplicates, and then using this value, a theoretical duplicate count was calculated.
9

Free-text Informed Duplicate Detection of COVID-19 Vaccine Adverse Event Reports

Turesson, Erik January 2022 (has links)
To increase medicine safety, researchers use adverse event reports to assess causal relationships between drugs and suspected adverse reactions. VigiBase, the world's largest database of such reports, collects data from numerous sources, introducing the risk of several records referring to the same case. These duplicates negatively affect the quality of data and its analysis. Thus, efforts should be made to detect and clean them automatically.  Today, VigiBase holds more than 3.8 million COVID-19 vaccine adverse event reports, making deduplication a challenging problem for existing solutions employed in VigiBase. This thesis project explores methods for this task, explicitly focusing on records with a COVID-19 vaccine. We implement Jaccard similarity, TF-IDF, and BERT to leverage the abundance of information contained in the free-text narratives of the reports. Mean-pooling is applied to create sentence embeddings from word embeddings produced by a pre-trained SapBERT model fine-tuned to maximise the cosine similarity between narratives of duplicate reports. Narrative similarity is quantified by the cosine similarity between sentence embeddings.  We apply a Gradient Boosted Decision Tree (GBDT) model for classifying report pairs as duplicates or non-duplicates. For a more calibrated model, logistic regression fine-tunes the leaf values of the GBDT. In addition, the model successfully implements a ruleset to find reports whose narratives mention a unique identifier of its duplicate. The best performing model achieves 73.3% recall and zero false positives on a controlled testing dataset for an F1-score of 84.6%, vastly outperforming VigiBase’s previously implemented model's F1-score of 60.1%. Further, when manually annotated by three reviewers, it reached an average 87% precision when fully deduplicating 11756 reports amongst records relating to hearing disorders.
10

Duplicate detection of multimodal and domain-specific trouble reports when having few samples : An evaluation of models using natural language processing, machine learning, and Siamese networks pre-trained on automatically labeled data / Dublettdetektering av multimodala och domänspecifika buggrapporter med få träningsexempel : En utvärdering av modeller med naturlig språkbehandling, maskininlärning, och siamesiska nätverk förtränade på automatiskt märkt data

Karlstrand, Viktor January 2022 (has links)
Trouble and bug reports are essential in software maintenance and for identifying faults—a challenging and time-consuming task. In cases when the fault and reports are similar or identical to previous and already resolved ones, the effort can be reduced significantly making the prospect of automatically detecting duplicates very compelling. In this work, common methods and techniques in the literature are evaluated and compared on domain-specific and multimodal trouble reports from Ericsson software. The number of samples is few, which is a case not so well-studied in the area. On this basis, both traditional and more recent techniques based on deep learning are considered with the goal of accurately detecting duplicates. Firstly, the more traditional approach based on natural language processing and machine learning is evaluated using different vectorization techniques and similarity measures adapted and customized to the domain-specific trouble reports. The multimodality and many fields of the trouble reports call for a wide range of techniques, including term frequency-inverse document frequency, BM25, and latent semantic analysis. A pipeline processing each data field of the trouble reports independently and automatically weighing the importance of each data field is proposed. The best performing model achieves a recall rate of 89% for a duplicate candidate list size of 10. Further, obtaining knowledge on which types of data are most important for duplicate detection is explored through what is known as Shapley values. Results indicate that utilizing all types of data indeed improve performance, and that date and code parameters are strong indicators. Secondly, a Siamese network based on Transformer-encoders is evaluated on data fields believed to have some underlying representation of the semantic meaning or sequentially important information, which a deep model can capture. To alleviate the issues when having few samples, pre-training through automatic data labeling is studied. Results show an increase in performance compared to not pre-training the Siamese network. However, compared to the more traditional model it performs on par, indicating that traditional models may perform equally well when having few samples besides also being simpler, more robust, and faster. / Buggrapporter är kritiska för underhåll av mjukvara och för att identifiera fel — en utmanande och tidskrävande uppgift. I de fall då felet och rapporterna liknar eller är identiska med tidigare och redan lösta ärenden, kan tiden som krävs minskas avsevärt, vilket gör automatiskt detektering av dubbletter mycket önskvärd. I detta arbete utvärderas och jämförs vanliga metoder och tekniker i litteraturen på domänspecifika och multimodala buggrapporter från Ericssons mjukvara. Antalet tillgängliga träningsexempel är få, vilket inte är ett så välstuderat fall. Utifrån detta utvärderas både traditionella samt nyare tekniker baserade på djupinlärning med målet att detektera dubbletter så bra som möjligt. Först utvärderas det mer traditionella tillvägagångssättet baserat på naturlig språkbearbetning och maskininlärning med hjälp av olika vektoriseringstekniker och likhetsmått specialanpassade till buggrapporterna. Multimodaliteten och de många datafälten i buggrapporterna kräver en rad av tekniker, så som termfrekvens-invers dokumentfrekvens, BM25 och latent semantisk analys. I detta arbete föreslås en modell som behandlar varje datafält i buggrapporterna separat och automatiskt sammanväger varje datafälts betydelse. Den bäst presterande modellen uppnår en återkallningsfrekvens på 89% för en lista med 10 dubblettkandidater. Vidare undersöks vilka datafält som är mest viktiga för dubblettdetektering genom Shapley-värden. Resultaten tyder på att utnyttja alla tillgängliga datafält förbättrar prestandan, och att datum och kodparametrar är starka indikatorer. Sedan utvärderas ett siamesiskt nätverk baserat på Transformator-kodare på datafält som tros ha en underliggande representation av semantisk betydelse eller sekventiellt viktig information, vilket en djup modell kan utnyttja. För att lindra de problem som uppstår med få träningssexempel, studeras det hur den djupa modellen kan förtränas genom automatisk datamärkning. Resultaten visar på en ökning i prestanda jämfört med att inte förträna det siamesiska nätverket. Men jämfört med den mer traditionella modellen presterar den likvärdigt, vilket indikerar att mer traditionella modeller kan prestera lika bra när antalet träningsexempel är få, förutom att också vara enklare, mer robusta, och snabbare.

Page generated in 0.1015 seconds