1

Landskapsförändringar och deras påverkan på dagens biodiversitetsmönster hos kärlväxter i ett skånskt jordbrukslandskap / Landscape changes and their impact on present-day biodiversity patterns of vascular plants in an agricultural landscape in Skåne

Biederstädt, Jana January 2014 (has links)
This study examined landscape changes, in particular different types of grassland habitat with respect to species composition and richness, and how present-day diversity patterns have been shaped by historical land use. The study area covers roughly 143 ha and is located in north-eastern Skåne. The results showed that the extent of the area's semi-natural grasslands was greatest around 1783/98 and smallest in 1926-23. The extent of semi-natural grassland has nearly halved from the late 18th century to today. However, the pasture that has been under continuous management from the late 18th century until 2014 makes up no more than 11 % of the total pasture area. Road verges and ditch banks had the largest proportion of species in common with the semi-natural pastures; the proportion was smallest in cultivated grassland. Species richness was at least as high in road verges, ditch banks and forest edges as in the semi-natural pastures, while it was lower in cultivated grassland; it was highest on the road verges.
2

Evaluation of intra-set clustering techniques for redundant social media content

Jubinville, Jason 19 December 2018 (has links)
This thesis evaluates various techniques for intra-set clustering of social media data from an industry perspective. The research goal was to establish methods for reducing the amount of redundant information an end user must review from a standard social media search. The research evaluated both clustering algorithms and string similarity measures for their effectiveness in clustering a selection of real-world topic- and location-based social media searches. In addition, the algorithms and similarity measures were tested in scenarios based on industry constraints such as rate limits. The results were evaluated using several practical measures to determine which techniques were effective. / Graduate
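As an illustration of the kind of intra-set clustering evaluated here, the sketch below groups near-duplicate posts with a string-similarity threshold. The greedy single-pass strategy, the difflib ratio measure, and the threshold value are assumptions chosen for illustration, not the thesis's actual algorithms.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1] (difflib's ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def greedy_cluster(posts, threshold=0.8):
    """Single-pass greedy clustering: each post joins the first cluster
    whose representative (first member) is similar enough; otherwise it
    starts a new cluster. Returns a list of clusters (lists of posts)."""
    clusters = []
    for post in posts:
        for cluster in clusters:
            if similarity(post, cluster[0]) >= threshold:
                cluster.append(post)
                break
        else:
            clusters.append([post])
    return clusters

if __name__ == "__main__":
    posts = [
        "Road closed downtown due to flooding",
        "road closed downtown because of flooding!",
        "Concert tickets on sale tomorrow",
        "Flooding has closed the downtown road",
    ]
    for i, cluster in enumerate(greedy_cluster(posts, threshold=0.55)):
        print(f"cluster {i}: {cluster}")
```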
3

Acceleration of Jaccard's Index Algorithm for Training to Tag Damage on Post-earthquake Images

Mulligan, Kyle John 01 June 2018 (has links) (PDF)
There are currently different efforts to use supervised neural networks (NNs) to automatically label damage on images of above-ground infrastructure (buildings made of concrete) taken after an earthquake. The goal of the supervised NN is to classify raw input data according to the patterns learned from an input training set. This training data set is usually supplied by experts in the field; in the case of this project, structural engineers carefully and mostly manually label these images for different types of damage. The level of expertise of the professionals labeling the training set varies widely, and some data sets contain pictures that different people have labeled in different ways when in reality the label should have been the same. Therefore, we need several experts to evaluate the same data set; the bigger the ground-truth/training set, the more accurate the NN classifier will be. To evaluate these variations among experts, which amounts to evaluating the quality of each expert, we first need to implement a tool able to compare images classified by different experts and, using probability theory, attach a certainty level to the experts' tagged labels. This master's thesis implements this comparison tool. We also decided to implement the tool using parallel programming paradigms, since we foresee that it will be used to train multiple young engineering students/professionals or even novice citizen volunteers (“trainees”) during post-earthquake meetings and workshops. The implementation of this software tool involves selecting around 200 photographs tagged by an expert with proven accuracy (“ground truth”) and comparing them to files tagged by the trainees. The trainees are then provided with instantaneous feedback on the accuracy of their damage assessment. The aforementioned problem of evaluating trainee results against the expert is not as simple as comparing and finding differences between two sets of image files. We anticipate challenges in that each trainee will select a slightly different-sized area for the same occurrence of damage, and some damage-structure pairs are more difficult to recognize and tag. Results show that we can compare 500 files in 1.5 seconds, roughly twice as fast as the sequential implementation.
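A minimal sketch of the core comparison, assuming each tag is stored as a set of pixel coordinates: the Jaccard index between an expert's and a trainee's tagged regions is computed per image, and image pairs are scored in parallel with Python's multiprocessing. The data layout and function names are illustrative, not the thesis's implementation.

```python
from multiprocessing import Pool

def jaccard(expert_pixels: set, trainee_pixels: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two tagged pixel sets."""
    if not expert_pixels and not trainee_pixels:
        return 1.0  # both untagged: perfect agreement by convention
    return len(expert_pixels & trainee_pixels) / len(expert_pixels | trainee_pixels)

def score_pair(pair):
    """Score one (expert, trainee) pair of tag sets for the same image."""
    expert, trainee = pair
    return jaccard(expert, trainee)

if __name__ == "__main__":
    # Toy data: pixel coordinates tagged as damaged concrete in three images.
    expert_tags = [
        {(1, 1), (1, 2), (2, 1), (2, 2)},
        {(5, 5), (5, 6)},
        {(0, 0)},
    ]
    trainee_tags = [
        {(1, 1), (1, 2), (2, 2)},        # close match
        {(5, 5), (6, 6), (7, 7)},        # partial match
        {(9, 9)},                        # miss
    ]
    with Pool() as pool:  # compare all image pairs in parallel
        scores = pool.map(score_pair, zip(expert_tags, trainee_tags))
    for i, s in enumerate(scores):
        print(f"image {i}: Jaccard = {s:.2f}")
```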
4

Efficient Graph Summarization of Large Networks

Hajiabadi, Mahdi 24 June 2022 (has links)
In this thesis, we study the notion of graph summarization, which is the fundamental task of finding a compact representation of the original graph, called the summary. Graph summarization can be used for reducing the footprint of the input graph, better visualization, anonymizing the identity of users, and query answering. We consider two different frameworks of graph summarization in this thesis: the utility-based framework and the correction set-based framework. In the utility-based framework, the input graph is summarized as much as possible without violating a utility threshold. In the correction set-based framework, a set of correction edges is produced along with the summary graph. In this thesis we propose two algorithms for the utility-based framework and one for the correction set-based framework. All three algorithms are for static graphs (i.e., graphs that do not change over time). We then propose two more utility-based algorithms for fully dynamic graphs (i.e., graphs with edge insertions and deletions). Algorithms for graph summarization can be lossless (summarizing the input graph without losing any information) or lossy (losing some information about the input graph in order to summarize it further). Some of our algorithms are lossless and some lossy, but with controlled utility loss. Our first utility-driven graph summarization algorithm, G-SCIS, is based on a clique and independent set decomposition that produces optimal compression with zero loss of utility. The compression provided is significantly better than the state of the art in lossless graph summarization, while the runtime is two orders of magnitude lower. Our second algorithm is T-BUDS, a highly scalable, utility-driven algorithm for fully controlled lossy summarization. It achieves high scalability by combining memory reduction using a maximum spanning tree with a novel binary search procedure. T-BUDS outperforms the state of the art drastically in terms of summarization quality and is about two orders of magnitude faster. In contrast to the competition, we are able to handle web-scale graphs on a single machine without performance impediment as the utility threshold (and the size of the summary) decreases. We also show that our graph summaries can be used as-is to answer several important classes of queries, such as triangle enumeration, PageRank and shortest paths. We then propose LDME, a correction set-based graph summarization algorithm that produces compact output representations in a fast and scalable manner. To achieve this, we introduce (1) weighted locality sensitive hashing to drastically reduce the number of comparisons required to find good node merges, (2) an efficient way to compute the best-quality merges that produces more compact outputs, and (3) a new sort-based encoding algorithm that is faster and more robust. More interestingly, our algorithm provides performance tuning settings that allow trading compression for running time. On high-compression settings, LDME achieves compression equal to or better than the state of the art with up to 53x speedup in running time. On high-speed settings, LDME achieves up to two orders of magnitude speedup with only slightly lower compression. We also present two lossless summarization algorithms, Optimal and Scalable, for summarizing fully dynamic graphs. More concretely, we follow the framework of G-SCIS, which produces summaries that can be used as-is in several graph analytics tasks. Different from G-SCIS, which is a batch algorithm, Optimal and Scalable are fully dynamic and can respond rapidly to each change in the graph. Not only are Optimal and Scalable able to outperform G-SCIS and other batch algorithms by several orders of magnitude, they also significantly outperform MoSSo, the state of the art in lossless dynamic graph summarization. While Optimal always produces the optimal summary, Scalable can trade the amount of node reduction for extra scalability. For reasonable values of the parameter $K$, Scalable outperforms Optimal by an order of magnitude in speed while keeping its rate of node reduction close to that of Optimal. An interesting fact we observed experimentally is that even running a batch algorithm such as G-SCIS once per large batch of changes would still be much slower than Scalable. For instance, if 1 million changes occur in a graph, Scalable is two orders of magnitude faster than running G-SCIS just once at the end of the 1-million-edge sequence. / Graduate
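To make the correction set-based framework concrete, the following is a minimal sketch (not LDME or G-SCIS themselves): given an assumed node-to-supernode assignment, it stores superedges plus a small set of edge corrections, and reconstructs a node's neighborhood losslessly from the summary alone. For brevity the sketch assumes no edges inside a supernode.

```python
from itertools import combinations, product

def build_summary(edges, supernode_of):
    """For each supernode pair, either store a superedge plus '-' corrections
    for the missing original edges, or no superedge plus '+' corrections for
    the present ones, whichever needs fewer corrections."""
    edges = {frozenset(e) for e in edges}
    members = {}
    for v, s in supernode_of.items():
        members.setdefault(s, set()).add(v)

    superedges, corrections = set(), []          # corrections: (sign, u, v)
    for a, b in combinations(sorted(members), 2):
        possible = {frozenset(p) for p in product(members[a], members[b])}
        present = possible & edges
        if len(present) > len(possible) - len(present):   # mostly connected
            superedges.add((a, b))
            corrections += [('-', *sorted(e)) for e in possible - present]
        else:
            corrections += [('+', *sorted(e)) for e in present]
    return members, superedges, corrections

def neighbors(v, members, superedges, corrections, supernode_of):
    """Reconstruct v's neighbors from the summary alone."""
    out = set()
    for a, b in superedges:                       # expand superedges first
        if supernode_of[v] == a:
            out |= members[b]
        elif supernode_of[v] == b:
            out |= members[a]
    for sign, x, y in corrections:                # then apply corrections
        if v in (x, y):
            other = y if v == x else x
            if sign == '+':
                out.add(other)
            else:
                out.discard(other)
    return out

if __name__ == "__main__":
    edges = [(1, 3), (1, 4), (2, 3), (2, 4), (3, 5)]
    supernode_of = {1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'C'}
    members, superedges, corrections = build_summary(edges, supernode_of)
    print("superedges:", superedges, "corrections:", corrections)
    print("neighbors of 3:", neighbors(3, members, superedges, corrections, supernode_of))
```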
5

Ask.com, Web Wombat och Yahoo : En studie av två globala och en lokal sökmotor. / Ask.com, Web Wombat and Yahoo : A study of two global and one local search engines.

Ekstein, Jonas, Runesson, Christian January 2007 (has links)
This thesis focuses on how global and local search engines retrieve information from the local domain. The three search engines tested are the global search engines Yahoo and Ask.com and the local search engine Web Wombat. The questions we examined were: which search engine has the best retrieval effectiveness, and could there be reasons other than retrieval effectiveness for choosing a local search engine? For our test we constructed 20 questions related to Australia. We chose to divide the questions into topics such as nature, sports and culture. For all questions we evaluated the relevance of the first 20 hits. We used the following measures in our test: Jaccard's index, precision and average precision. We also looked at factors such as duplicates and error pages, because we regard these as important aspects to consider when looking at the relevance of the first 20 hits. The results of our study showed that Yahoo had the best performance for precision. Web Wombat had poor precision, but results from Jaccard's index revealed that Web Wombat returned many unique documents. Web Wombat had the best average precision on one of our questions. In spite of Web Wombat's poor precision, we think that Web Wombat serves a purpose as an alternative to global search engines. / Thesis level: D
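A brief sketch of the evaluation measures named above, precision at a cutoff, average precision, and Jaccard's index between result sets, applied to two hypothetical engines on one query; the documents and relevance judgments are invented for illustration.

```python
def precision_at_k(results, relevant, k=20):
    """Fraction of the first k results judged relevant."""
    return sum(1 for r in results[:k] if r in relevant) / k

def average_precision(results, relevant, k=20):
    """Mean of the precision values taken at each relevant hit in the top k."""
    hits, precisions = 0, []
    for i, r in enumerate(results[:k], start=1):
        if r in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

def jaccard(a, b):
    """Jaccard's index between two sets of retrieved documents."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

if __name__ == "__main__":
    engine_a = ["d1", "d2", "d3", "d4", "d5"]
    engine_b = ["d3", "d9", "d1", "d7", "d8"]
    relevant = {"d1", "d3", "d5"}
    print("P@5  A:", precision_at_k(engine_a, relevant, k=5))
    print("AP@5 A:", average_precision(engine_a, relevant, k=5))
    print("overlap A/B:", jaccard(engine_a, engine_b))
```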
6

品種重複的無母數估計 / Nonparametric Estimation of Species Overlap

林逢章, Lin, Feng-Chang Unknown Date (has links)
In describing the similarity between communities A and B, species overlap is one kind of measure. In ecology and biology the Jaccard index (Gower, 1985) for species overlap is widely used, and it serves as one estimator in our research. The index is computed from species counts as c / (a + b - c), where a and b are the total numbers of species observed at A and B and c is the number of species common to both; it is simply the proportion of shared species among all species observed in either community. However, this index ignores species proportion information, assigning equal weight to all species. We propose a new index, N, which incorporates proportion information and is estimated by a nonparametric maximum likelihood estimator (NPMLE). Smith, Solow and Preston (1996) proposed a delta-beta-binomial model to correct the underestimation of the Jaccard index, which gives a third estimator. In our Monte Carlo simulations, we design 6 balanced populations, in which every species has an equal proportion, and 12 unbalanced populations, in which species proportions follow a geometric distribution, and judge the estimators by the mean and standard deviation over 500 samples. We found that the Jaccard and delta-beta-binomial estimators are accurate for balanced populations but overestimate or underestimate the true value for some unbalanced populations, whereas the NPMLE is robust for both balanced and unbalanced populations. In addition to the simulation results, we derive the expectation and variance of the NPMLE and prove that it is an asymptotically unbiased estimator of the N index. As a real-data example, we analyze records of wild bird communities observed at two wetland locations in north-western Taiwan; via bootstrapping, the NPMLE shows the smallest standard error of the three estimators.
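A minimal sketch of the species-based Jaccard overlap and a bootstrap standard error, computed from toy abundance data; the thesis's abundance-based N index and its NPMLE are not reproduced here, and the species counts are invented.

```python
import random

def jaccard_overlap(counts_a, counts_b):
    """Classic Jaccard index c / (a + b - c) from species abundance
    dictionaries; abundances are reduced to presence/absence here."""
    species_a = {s for s, n in counts_a.items() if n > 0}
    species_b = {s for s, n in counts_b.items() if n > 0}
    shared = species_a & species_b
    return len(shared) / (len(species_a) + len(species_b) - len(shared))

def bootstrap_se(counts_a, counts_b, reps=500, seed=1):
    """Bootstrap standard error of the Jaccard estimate, resampling
    individual observations within each site."""
    rng = random.Random(seed)
    pool_a = [s for s, n in counts_a.items() for _ in range(n)]
    pool_b = [s for s, n in counts_b.items() for _ in range(n)]
    values = []
    for _ in range(reps):
        ra = rng.choices(pool_a, k=len(pool_a))
        rb = rng.choices(pool_b, k=len(pool_b))
        va = {s: ra.count(s) for s in set(ra)}
        vb = {s: rb.count(s) for s in set(rb)}
        values.append(jaccard_overlap(va, vb))
    mean = sum(values) / reps
    return (sum((v - mean) ** 2 for v in values) / (reps - 1)) ** 0.5

if __name__ == "__main__":
    site_a = {"egret": 12, "heron": 3, "plover": 5}
    site_b = {"egret": 7, "plover": 2, "sandpiper": 9}
    print("Jaccard overlap:", round(jaccard_overlap(site_a, site_b), 3))
    print("bootstrap SE   :", round(bootstrap_se(site_a, site_b), 3))
```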
7

Determining the Biomechanical Behavior of the Liver Using Medical Image Analysis and Evolutionary Computation

Martínez Martínez, Francisco 03 September 2014 (has links)
Modeling liver deformation forms the basis for the development of new clinical applications that improve diagnosis, planning and guidance in liver surgery. However, patient-specific modeling of this organ and its validation are still a challenge in biomechanics. The reason is the difficulty of measuring the mechanical response of in vivo liver tissue. The current approach consists of performing minimally invasive or open surgery aimed at estimating the elastic constants of the proposed biomechanical models. This dissertation presents how the use of medical image analysis and evolutionary computation allows the characterization of the biomechanical behavior of the liver, avoiding the use of these invasive techniques. In particular, the use of similarity coefficients commonly employed in medical image analysis has permitted, on the one hand, estimating the patient-specific biomechanical model of the liver while avoiding invasive measurement of its mechanical response. On the other hand, these coefficients have also permitted validating the proposed biomechanical models. The Jaccard coefficient and the Hausdorff distance have been used to validate the models proposed to simulate the behavior of ex vivo lamb livers, calculating the error between the volume of the experimentally deformed liver samples and the volume obtained from biomechanical simulations of these deformations. These coefficients provide information such as the shape of the samples and the error distribution throughout their volume. For this reason, both coefficients have also been used to formulate a novel function, the Geometric Similarity Function (GSF). This function has made it possible to establish a methodology for estimating the elastic constants of the models proposed for the human liver using evolutionary computation. Several optimization strategies, using GSF as the cost function, have been developed to estimate the patient-specific elastic constants of the biomechanical models proposed for the human liver. Finally, this methodology has been used to define and validate a biomechanical model proposed for an in vitro human liver. / Martínez Martínez, F. (2014). Determining the Biomechanical Behavior of the Liver Using Medical Image Analysis and Evolutionary Computation [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/39337 / TESIS
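The two validation measures mentioned, the Jaccard coefficient between volumes and the Hausdorff distance between their surfaces, can be sketched with NumPy and SciPy on toy binary voxel volumes as follows; the proposed Geometric Similarity Function (GSF) itself is not reproduced, and the volumes are invented.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def jaccard_volumes(vol_a: np.ndarray, vol_b: np.ndarray) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two binary voxel volumes."""
    inter = np.logical_and(vol_a, vol_b).sum()
    union = np.logical_or(vol_a, vol_b).sum()
    return float(inter) / float(union) if union else 1.0

def hausdorff_volumes(vol_a: np.ndarray, vol_b: np.ndarray) -> float:
    """Symmetric Hausdorff distance between the occupied voxel coordinates
    of two binary volumes (in voxel units)."""
    pts_a = np.argwhere(vol_a)
    pts_b = np.argwhere(vol_b)
    return max(directed_hausdorff(pts_a, pts_b)[0],
               directed_hausdorff(pts_b, pts_a)[0])

if __name__ == "__main__":
    # Toy volumes: a 'measured' deformed sample vs. a simulated deformation.
    measured = np.zeros((20, 20, 20), dtype=bool)
    simulated = np.zeros((20, 20, 20), dtype=bool)
    measured[5:15, 5:15, 5:15] = True
    simulated[6:16, 5:15, 5:15] = True   # shifted by one voxel along x
    print("Jaccard coefficient:", round(jaccard_volumes(measured, simulated), 3))
    print("Hausdorff distance :", hausdorff_volumes(measured, simulated))
```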
8

Quelques propositions pour la comparaison de partitions non strictes / Some proposals for comparison of soft partitions

Quéré, Romain 06 December 2012 (has links)
This thesis is dedicated to the problem of comparing two soft (fuzzy/probabilistic, possibilistic) partitions of the same set of individuals into several clusters. Its solution rests on the formal definition of concordance measures based on the principles of the historical measures developed for comparing strict partitions, and it can be applied in various fields such as biology, image processing and clustering. Depending on whether they focus on the relations between the individuals described by each partition or on quantifying the similarities between the clusters composing those partitions, we distinguish two main families of measures for which the very notion of concordance between partitions differs, and we propose to characterize their representatives according to a common set of formal and informal properties. From that point of view, the measures are also qualified according to the nature of the compared partitions. A study of the multiple constructions on which the measures in the literature are built completes our taxonomy. We propose three new soft comparison measures that take advantage of the state of the art. The first is an extension of a strict approach, while the two others rely on so-called native approaches, one individual-wise oriented, the other cluster-wise oriented, both specifically designed to compare soft partitions. Our propositions are compared to the existing measures of the literature according to a set of experiments chosen to cover the various aspects of the problem. The results presented show the relevance of our propositions to the research topic of partition comparison. Finally, we open new perspectives by proposing the premises of a framework unifying most of the individual-wise oriented measures.
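To give a concrete flavor of an individual-wise comparison of soft partitions, the sketch below scores the agreement of two fuzzy partitions through their pairwise co-membership degrees; this is a generic fuzzy Rand-like construction chosen for illustration, not necessarily one of the three measures proposed in the thesis.

```python
import numpy as np

def co_membership(u: np.ndarray) -> np.ndarray:
    """Pairwise co-membership degrees of a fuzzy partition.
    u has shape (n_individuals, n_clusters), each row summing to 1;
    entry (i, j) is the degree to which i and j share a cluster."""
    return u @ u.T

def soft_agreement(u1: np.ndarray, u2: np.ndarray) -> float:
    """Individual-wise concordance between two fuzzy partitions:
    1 minus the mean absolute difference of co-membership degrees
    over all pairs of distinct individuals."""
    c1, c2 = co_membership(u1), co_membership(u2)
    iu = np.triu_indices(c1.shape[0], k=1)        # pairs i < j
    return 1.0 - float(np.mean(np.abs(c1[iu] - c2[iu])))

if __name__ == "__main__":
    # Two fuzzy partitions of 4 individuals into 2 clusters.
    u1 = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
    u2 = np.array([[0.7, 0.3], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
    print("agreement:", round(soft_agreement(u1, u2), 3))
    # A strict partition is the special case of 0/1 memberships:
    strict = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
    print("vs strict:", round(soft_agreement(u1, strict), 3))
```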
9

Analysis of Test Coverage Data on a Large-Scale Industrial System

Vasconcelos Jansson, Erik Sven January 2016 (has links)
Software testing verifies a program's functional behavior and is an important process when engineering critical software. The degree of testing is measured with code coverage, which describes the amount of production code exercised by tests. Both concepts are extensively used for industrial systems. Previous research has shown that gathering and analyzing test coverage becomes problematic on large-scale systems. Here, development experience, implementation feasibility, coverage measurements and an analysis method are explored, providing potential solutions and insights into these issues. Outlined are methods for constructing and integrating such a gathering and analysis system in a large-scale project, along with the problems encountered and the remedies applied. Instrumentation for gathering coverage information affects performance negatively; these measurements are provided. Since large-scale test suite measurements are quite lacking, the line, branch, and function criteria are presented here. Finally, an analysis method is proposed that uses coverage set operations and Jaccard indices to find test similarities. The gathered results imply that execution time was significantly affected when gathering coverage: [2.656, 2.911] hours for the instrumented software versus [2.075, 2.260] hours originally on the system under test (at alpha = 5% and n = 4), while results for both processor and memory usage were inconclusive. The measured line, branch, and function coverages were (59.3, 70.7, 24.6)% for these suites. The analysis method reveals potential areas of test redundancy.
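A minimal sketch of the proposed analysis step, comparing per-test coverage sets with set operations and Jaccard indices to surface likely-redundant tests; the test names, covered-line identifiers and the similarity threshold are assumptions for illustration.

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard index between two sets of covered lines."""
    return len(a & b) / len(a | b) if a | b else 1.0

def similar_test_pairs(coverage: dict, threshold: float = 0.9):
    """Yield test pairs whose covered-line sets overlap at least `threshold`,
    i.e. candidates for redundant tests worth reviewing."""
    for (t1, c1), (t2, c2) in combinations(coverage.items(), 2):
        score = jaccard(c1, c2)
        if score >= threshold:
            yield t1, t2, score

if __name__ == "__main__":
    # Covered line identifiers ("file:line") per test case (toy data).
    coverage = {
        "test_login_ok":    {"auth.c:10", "auth.c:11", "auth.c:12", "session.c:4"},
        "test_login_retry": {"auth.c:10", "auth.c:11", "auth.c:12", "session.c:4", "auth.c:20"},
        "test_logout":      {"session.c:4", "session.c:9"},
    }
    for t1, t2, score in similar_test_pairs(coverage, threshold=0.7):
        print(f"{t1} ~ {t2}: Jaccard = {score:.2f}")
```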
10

A Method for Recommending Computer-Security Training for Software Developers

Nadeem, Muhammad 12 August 2016 (has links)
Vulnerable code may cause security breaches in software systems, resulting in financial and reputation losses for organizations in addition to the loss of their customers' confidential data. Delivering proper software security training to software developers is key to preventing such breaches. Conventional training methods do not take into account the code written by the developers over time, which makes these training sessions less effective. We propose a method for recommending computer-security training that helps identify focused and narrow areas in which developers need training. The proposed method leverages the power of static analysis techniques, using the vulnerabilities flagged in the source code as its basis, to suggest the most appropriate training topics to different software developers. Moreover, it utilizes public vulnerability repositories as its knowledge base to suggest community-accepted solutions to different security problems. Such mitigation strategies are platform independent, giving further strength to the utility of the system. This research discusses the proposed architecture of the recommender system, case studies to validate the system architecture, tailored algorithms to improve the performance of the system, and a human-subject evaluation conducted to determine the usefulness of the system. Our evaluation suggests that the proposed system successfully retrieves relevant training articles from the public vulnerability repository. The human subjects found these articles suitable for training, and they found the proposed recommender system as effective as a commercial tool.
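A hedged sketch of the recommendation idea, mapping vulnerabilities flagged by a static analyzer to training topics and ranking them by frequency; the CWE-to-topic mapping and the findings below are invented and do not represent the thesis's actual knowledge base or repository.

```python
from collections import Counter

# Hypothetical mapping from CWE identifiers to training topics; a real system
# would derive this from a public vulnerability repository.
TRAINING_TOPICS = {
    "CWE-89":  "SQL injection prevention and parameterized queries",
    "CWE-79":  "Cross-site scripting and output encoding",
    "CWE-327": "Choosing and using cryptographic algorithms",
    "CWE-798": "Managing credentials and secrets",
}

def recommend(findings, top_n=3):
    """Rank training topics by how often the static analyzer flagged the
    corresponding weakness in a developer's code."""
    counts = Counter(f["cwe"] for f in findings if f["cwe"] in TRAINING_TOPICS)
    return [(TRAINING_TOPICS[cwe], n) for cwe, n in counts.most_common(top_n)]

if __name__ == "__main__":
    # Illustrative static-analysis output for one developer.
    findings = [
        {"file": "orders.py", "line": 42, "cwe": "CWE-89"},
        {"file": "orders.py", "line": 77, "cwe": "CWE-89"},
        {"file": "views.py",  "line": 10, "cwe": "CWE-79"},
        {"file": "utils.py",  "line": 5,  "cwe": "CWE-798"},
    ]
    for topic, n in recommend(findings):
        print(f"{n} finding(s) -> recommend training: {topic}")
```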
