1 |
Landskapsförändringar och deras påverkan på dagens biodiversitetsmönster hos kärlväxter i ett skånskt jordbrukslandskap / Landscape changes and their impact on present-day biodiversity patterns of vascular plants in an agricultural landscape in Skåne. Biederstädt, Jana, January 2014 (has links)
This study examined landscape changes, in particular different types of grassland habitat with respect to species composition and richness, and how present-day diversity patterns have been shaped by historical land use. The study area covers about 143 ha and is located in north-eastern Skåne. The results showed that the area of natural grassland was greatest around 1783/98 and smallest in 1926-23. The extent of natural grassland has nearly halved from the end of the eighteenth century to the present day. However, pasture that has been under continuous management from the end of the eighteenth century until 2014 makes up no more than 11 % of the total pasture area. Road verges and ditch banks had the largest proportion of species shared with the semi-natural pastures; the proportion was smallest in cultivated grassland. Species richness in road verges, ditch banks and forest edges was at least as high as in the semi-natural pastures, whereas it was lower in cultivated grassland. Species richness was highest on the road verges.
2 |
Evaluation of intra-set clustering techniques for redundant social media content. Jubinville, Jason, 19 December 2018 (has links)
This thesis evaluates various techniques for intra-set clustering of social media data from an industry perspective. The research goal was to establish methods for reducing the amount of redundant information an end user must review from a standard social media search. The research evaluated both clustering algorithms and string similarity measures for their effectiveness in clustering a selection of real-world topic- and location-based social media searches. In addition, the algorithms and similarity measures were tested in scenarios based on industry constraints such as rate limits. The results were evaluated using several practical measures to determine which techniques were effective. / Graduate
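To make the clustering setting concrete, the following is a minimal sketch, assuming word-token Jaccard similarity as the string measure and a simple greedy threshold pass as the clustering step; the threshold value and tokenization are illustrative assumptions, not the configurations evaluated in the thesis.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word-token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def greedy_cluster(posts, threshold=0.6):
    """Assign each post to the first cluster whose representative is at least
    `threshold` similar; otherwise start a new cluster."""
    clusters = []  # each cluster is a list of posts; clusters[i][0] is its representative
    for post in posts:
        for cluster in clusters:
            if jaccard(post, cluster[0]) >= threshold:
                cluster.append(post)
                break
        else:
            clusters.append([post])
    return clusters

posts = [
    "Flooding reported downtown near the river",
    "flooding near the river downtown reported",
    "Concert tickets on sale this weekend",
]
for cluster in greedy_cluster(posts):
    print(cluster)
```

The first two posts share all tokens and collapse into one cluster, so a reviewer would see one representative instead of two near-duplicates.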
3 |
Acceleration of Jaccard's Index Algorithm for Training to Tag Damage on Post-earthquake Images. Mulligan, Kyle John, 01 June 2018 (PDF)
There are currently various efforts to use supervised neural networks (NNs) to automatically label damage on images of above-ground infrastructure (concrete buildings) taken after an earthquake. The goal of the supervised NN is to classify raw input data according to the patterns learned from an input training set. This training data set is usually supplied by experts in the field; in the case of this project, structural engineers carefully and mostly manually label these images for different types of damage. The level of expertise of the professionals labeling the training set varies widely, and some data sets contain pictures that different people have labeled in different ways when in reality the label should have been the same. Therefore, we need several experts to evaluate the same data set; the bigger the ground-truth/training set, the more accurate the NN classifier will be. To evaluate these variations among experts, which amounts to evaluating the quality of each expert using probabilistic theory, we first need a tool able to compare images classified by different experts and assign a certainty level to the experts' tagged labels. This master's thesis implements this comparative tool. We also decided to implement the comparative tool using parallel programming paradigms, since we foresee that it will be used to train multiple young engineering students/professionals or even novice citizen volunteers ("trainees") during after-earthquake meetings and workshops. The implementation of this software tool involves selecting around 200 photographs tagged by an expert with proven accuracy ("ground truth") and comparing them to files tagged by the trainees. The trainees are then provided with instantaneous feedback on the accuracy of their damage assessment. The problem of evaluating trainee results against the expert is not as simple as comparing and finding differences between two sets of image files; we anticipate that each trainee will select a slightly different-sized area for the same occurrence of damage, and that some damage-structure pairs are more difficult to recognize and tag. Results show that we can compare 500 files in 1.5 seconds, a 2x speedup over the sequential implementation.
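For reference, a minimal sequential sketch of the underlying comparison, assuming the expert's and a trainee's tags are represented as boolean pixel masks; the mask shapes and regions are illustrative, and the thesis's accelerated, parallel implementation is not reproduced here.

```python
import numpy as np

def jaccard_index(expert_mask: np.ndarray, trainee_mask: np.ndarray) -> float:
    """Jaccard index (intersection over union) of two boolean damage masks."""
    intersection = np.logical_and(expert_mask, trainee_mask).sum()
    union = np.logical_or(expert_mask, trainee_mask).sum()
    return 1.0 if union == 0 else intersection / union

# Two 100x100 masks whose tagged regions overlap partially.
expert = np.zeros((100, 100), dtype=bool)
trainee = np.zeros((100, 100), dtype=bool)
expert[20:60, 20:80] = True    # expert tags a 40x60 damage area
trainee[30:60, 30:80] = True   # trainee tags a smaller, shifted area
print(f"Jaccard index: {jaccard_index(expert, trainee):.3f}")
```

A score near 1.0 would indicate that the trainee outlined essentially the same region as the expert, while a low score flags a tagging discrepancy worth feedback.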
4 |
Efficient Graph Summarization of Large Networks. Hajiabadi, Mahdi, 24 June 2022 (has links)
In this thesis, we study the notion of graph summarization,
which is a fundamental task of finding a compact representation of the original graph called the summary.
Graph summarization can be used for reducing the footprint of the input graph, better visualization, anonymizing the identity of users, and query answering.
There are two frameworks of graph summarization that we consider in this thesis: the utility-based framework and the correction set-based framework.
In the utility-based framework, the input graph is summarized as much as possible without violating a utility threshold.
In the correction set-based framework, a set of correction edges is produced along with the summary graph.
In this thesis we propose two algorithms for the utility-based framework and one for the correction set-based framework; all three are for static graphs (i.e., graphs that do not change over time).
Then, we propose two more utility-based algorithms for fully dynamic graphs (i.e. graphs with edge insertions and deletions).
Algorithms for graph summarization can be lossless (summarizing the input graph without losing any information) or lossy (losing some information about the input graph in order to summarize it more).
Some of our algorithms are lossless and some lossy, but with controlled utility loss.
Our first utility-driven graph summarization algorithm, G-SCIS, is based on a clique and independent set decomposition that produces optimal compression with zero loss of utility. The compression provided is significantly better than the state of the art in lossless graph summarization, while the runtime is two orders of magnitude lower.
Our second algorithm is T-BUDS, a highly scalable, utility-driven algorithm for fully controlled lossy summarization.
It achieves high scalability by combining memory reduction using a Maximum Spanning Tree with a novel binary search procedure. T-BUDS drastically outperforms the state of the art in the quality of summarization and is about two orders of magnitude faster. In contrast to the competition, we are able to handle web-scale graphs on a single machine without performance impediment as the utility threshold (and the size of the summary) decreases. We also show that our graph summaries can be used as-is to answer several important classes of queries, such as triangle enumeration, PageRank and shortest paths.
We then propose algorithm LDME, a correction set-based graph summarization algorithm that produces compact output representations in a fast and scalable manner. To achieve this, we introduce (1) weighted locality sensitive hashing to drastically reduce the number of comparisons required to find good node merges, (2) an efficient way to compute the best quality merges that produces more compact outputs, and (3) a new sort-based encoding algorithm that is faster and more robust. More interestingly, our algorithm provides performance tuning settings to allow the option of trading compression for running
time. On high compression settings, LDME achieves compression equal to or better than the state of the art with up to 53x speedup in running time. On high speed settings, LDME achieves up to two orders of magnitude speedup with only slightly lower compression.
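A simplified, unweighted sketch of the idea behind step (1), assuming each node's neighborhood is summarized with a MinHash signature so that nodes with near-identical neighborhoods (high Jaccard similarity) land in the same bucket and only those are considered as merge candidates; LDME itself uses a weighted variant with banding, and the signature length and toy graph below are illustrative assumptions.

```python
import random
from collections import defaultdict

def minhash_signature(neighbors: set, hash_seeds, universe: int) -> tuple:
    """One MinHash value per seed: the minimum of a seeded linear hash over the neighbor set."""
    return tuple(
        min((a * n + b) % universe for n in neighbors)
        for a, b in hash_seeds
    )

def candidate_merge_pairs(adjacency: dict, num_hashes: int = 8, universe: int = 2**31 - 1):
    """Group nodes by the MinHash signature of their neighborhoods; nodes sharing a
    bucket are likely to have high neighborhood Jaccard similarity."""
    rng = random.Random(42)
    seeds = [(rng.randrange(1, universe), rng.randrange(0, universe)) for _ in range(num_hashes)]
    buckets = defaultdict(list)
    for node, neighbors in adjacency.items():
        if neighbors:
            buckets[minhash_signature(neighbors, seeds, universe)].append(node)
    return [bucket for bucket in buckets.values() if len(bucket) > 1]

# Toy graph: nodes 1 and 2 have identical neighborhoods, so they share a bucket.
graph = {1: {3, 4, 5}, 2: {3, 4, 5}, 3: {1, 2}, 4: {1, 2}, 5: {1, 2}}
print(candidate_merge_pairs(graph))
```

Only nodes within the same bucket need to be compared in detail, which is what cuts the number of pairwise merge evaluations.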
We also present two lossless summarization algorithms, Optimal and Scalable, for summarizing fully dynamic graphs.
More concretely, we follow the framework of G-SCIS, which produces summaries that can be used as-is in several graph analytics tasks. Different from G-SCIS, which is a batch algorithm, Optimal and Scalable are fully dynamic and can respond rapidly to each change in the graph.
Not only are Optimal and Scalable able to outperform G-SCIS and other batch algorithms by several orders of magnitude, but they also significantly outperform MoSSo, the state-of-the-art in lossless dynamic graph summarization.
While Optimal always produces the optimal summary, Scalable is able to trade the amount of node reduction for extra scalability.
For reasonable values of the parameter $K$, Scalable is able to outperform Optimal by an order of magnitude in speed, while keeping the rate of node reduction close to that of Optimal.
An interesting fact that we observed experimentally is that even if we were to run a batch algorithm, such as G-SCIS, once for every large batch of changes, it would still be much slower than Scalable. For instance, if 1 million changes occur in a graph, Scalable is two orders of magnitude faster than running G-SCIS just once at the end of the 1 million-edge sequence. / Graduate
5 |
Ask.com, Web Wombat och Yahoo : En studie av två globala och en lokal sökmotor. / Ask.com, Web Wombat and Yahoo : A study of two global and one local search engines. Ekstein, Jonas; Runesson, Christian, January 2007 (has links)
This thesis focuses on how global and local search engines retrieve information from the local domain. The three search engines tested are the global engines Yahoo and Ask.com and the local engine Web Wombat. The questions we examined were: which search engine has the best retrieval effectiveness, and could there be reasons other than retrieval effectiveness for choosing a local search engine? For our test we constructed 20 questions related to Australia, divided into topics such as nature, sports and culture. For all questions we evaluated the relevance of the first 20 hits. We used the following measures in our test: Jaccard's index, precision and average precision. We also looked at factors such as duplicates and error pages, since these are important aspects when judging the relevance of the first 20 hits. The results of our study showed that Yahoo performed best on precision. Web Wombat had poor precision, but the Jaccard's index results revealed that it retrieved many unique documents. Web Wombat had the best average precision on one of our questions. In spite of Web Wombat's poor precision, we think that it serves a purpose as an alternative to global search engines. / Uppsatsnivå: D
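A minimal sketch of two of the measures used in the study, assuming result lists are sets of document identifiers and relevance judgments are available; the identifiers, judgments and cut-off below are invented for illustration.

```python
def jaccard_index(results_a, results_b) -> float:
    """Overlap between two engines' result lists, treated as sets of document IDs."""
    a, b = set(results_a), set(results_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def precision_at_k(results, relevant, k=20) -> float:
    """Fraction of the first k results judged relevant."""
    return sum(1 for doc in results[:k] if doc in relevant) / k

yahoo      = ["d1", "d2", "d3", "d7", "d9"]
web_wombat = ["d2", "d5", "d6", "d7", "d8"]
relevant   = {"d1", "d2", "d7"}

print(f"Jaccard overlap: {jaccard_index(yahoo, web_wombat):.2f}")
print(f"Yahoo P@5:       {precision_at_k(yahoo, relevant, k=5):.2f}")
print(f"Web Wombat P@5:  {precision_at_k(web_wombat, relevant, k=5):.2f}")
```

A low Jaccard overlap combined with a lower precision is exactly the pattern the study reports for Web Wombat: weaker precision, but many documents the global engines do not return.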
6 |
品種重複的無母數估計 / Nonparametric Estimation of Species Overlap. Lin, Feng-Chang (林逢章), Unknown Date (has links)
In describing the similarity between communities A and B, species overlap is one kind of measure. In ecology and biology, the Jaccard index (Gower, 1985) for species overlap, computed as the number of species shared by the two communities divided by the total number of distinct species observed in them, is widely used, and we treat it as one estimator in our research. The Jaccard index, however, is based only on species presence: it ignores species proportion information, assigning equal weight to all species. We therefore propose a new index, N, which incorporates proportion information and is estimated with a nonparametric maximum likelihood estimator (NPMLE). In addition, Smith, Solow and Preston (1996) proposed a delta-beta-binomial model to correct the underestimation of the Jaccard index, giving a third estimator of species overlap. In our Monte-Carlo simulations, we design 6 balanced populations in which every species has an equal proportion and 12 unbalanced populations in which species proportions follow a geometric distribution, and judge each estimator by its mean and standard deviation over 500 samples. We found that the Jaccard and delta-beta-binomial estimators are accurate for balanced populations but overestimate or underestimate the true value for some unbalanced populations, whereas the NPMLE is robust for both balanced and unbalanced populations. In addition to the simulation results, we give theoretical results: we derive the expectation and variance of the NPMLE and prove that it is an asymptotically unbiased estimator of the N index. As a real example, we apply the three estimators to wild-bird abundance records from two wetland locations in north-western Taiwan; via bootstrapping, the NPMLE has a smaller estimated standard error than the other two estimators.
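A minimal sketch, not the thesis's NPMLE, showing the count-based Jaccard overlap and a bootstrap standard error computed from individual-level observations at two sites; the species labels and resample count are illustrative assumptions.

```python
import random

def jaccard_overlap(species_a: set, species_b: set) -> float:
    """Jaccard species-overlap index: shared species over total distinct species."""
    shared = len(species_a & species_b)
    return shared / (len(species_a) + len(species_b) - shared)

def bootstrap_se(obs_a, obs_b, n_boot=500, seed=0) -> float:
    """Standard error of the Jaccard index, resampling individuals within each site."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample_a = set(rng.choices(obs_a, k=len(obs_a)))
        resample_b = set(rng.choices(obs_b, k=len(obs_b)))
        estimates.append(jaccard_overlap(resample_a, resample_b))
    mean = sum(estimates) / n_boot
    return (sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)) ** 0.5

# Individual bird observations (one species label per individual) at two wetlands.
site_a = ["egret"] * 30 + ["heron"] * 10 + ["plover"] * 2
site_b = ["egret"] * 25 + ["plover"] * 5 + ["sandpiper"] * 8

print(f"Jaccard index: {jaccard_overlap(set(site_a), set(site_b)):.3f}")
print(f"Bootstrap SE:  {bootstrap_se(site_a, site_b):.3f}")
```

Note how the presence-only Jaccard index treats the two shared species equally even though their abundances differ, which is the information loss the proposed N index is meant to address.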
7 |
Determining the Biomechanical Behavior of the Liver Using Medical Image Analysis and Evolutionary Computation. Martínez Martínez, Francisco, 03 September 2014 (has links)
Modeling liver deformation forms the basis for the development of new clinical applications that improve diagnosis, planning and guidance in liver surgery. However, patient-specific modeling of this organ and its validation are still a challenge in biomechanics. The reason is the difficulty of measuring the mechanical response of in vivo liver tissue: the current approach consists of performing minimally invasive or open surgery aimed at estimating the elastic constants of the proposed biomechanical models. This dissertation presents how the use of medical image analysis and evolutionary computation allows the biomechanical behavior of the liver to be characterized while avoiding these invasive techniques. In particular, the use of similarity coefficients commonly employed in medical image analysis has made it possible, on the one hand, to estimate the patient-specific biomechanical model of the liver without invasive measurement of its mechanical response and, on the other hand, to validate the proposed biomechanical models. The Jaccard coefficient and the Hausdorff distance have been used to validate the models proposed to simulate the behavior of ex vivo lamb livers, by calculating the error between the volume of the experimentally deformed liver samples and the volume obtained from biomechanical simulations of these deformations. These coefficients provide information such as the shape of the samples and the error distribution over their volume. For this reason, both coefficients have also been used to formulate a novel function, the Geometric Similarity Function (GSF). This function has made it possible to establish a methodology for estimating the elastic constants of the models proposed for the human liver using evolutionary computation. Several optimization strategies using GSF as the cost function have been developed to estimate the patient-specific elastic constants of the biomechanical models proposed for the human liver. Finally, this methodology has been used to define and validate a biomechanical model proposed for an in vitro human liver. / Martínez Martínez, F. (2014). Determining the Biomechanical Behavior of the Liver Using Medical Image Analysis and Evolutionary Computation [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/39337
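A minimal sketch of the two validation metrics named above, assuming the measured and simulated volumes are given as boolean voxel masks; the toy arrays below stand in for the segmented medical images used in the dissertation.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def jaccard_coefficient(vol_a: np.ndarray, vol_b: np.ndarray) -> float:
    """Volume overlap between two boolean voxel masks."""
    intersection = np.logical_and(vol_a, vol_b).sum()
    union = np.logical_or(vol_a, vol_b).sum()
    return intersection / union if union else 1.0

def hausdorff_distance(vol_a: np.ndarray, vol_b: np.ndarray) -> float:
    """Symmetric Hausdorff distance between the voxel coordinates of two masks."""
    pts_a = np.argwhere(vol_a)
    pts_b = np.argwhere(vol_b)
    return max(directed_hausdorff(pts_a, pts_b)[0],
               directed_hausdorff(pts_b, pts_a)[0])

# Toy 3D masks: the "simulated" deformation is shifted one voxel along the first axis.
measured = np.zeros((20, 20, 20), dtype=bool)
simulated = np.zeros((20, 20, 20), dtype=bool)
measured[5:15, 5:15, 5:15] = True
simulated[6:16, 5:15, 5:15] = True

print(f"Jaccard coefficient: {jaccard_coefficient(measured, simulated):.3f}")
print(f"Hausdorff distance:  {hausdorff_distance(measured, simulated):.3f} voxels")
```

The Jaccard coefficient summarizes overall volume overlap, while the Hausdorff distance captures the worst-case surface deviation; a fitness function along the lines of the GSF could combine both signals when tuning elastic constants.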
8 |
Quelques propositions pour la comparaison de partitions non strictes / Some proposals for comparison of soft partitions. Quéré, Romain, 06 December 2012 (has links)
This thesis is dedicated to the problem of comparing two soft (fuzzy/probabilistic, possibilistic) partitions of the same set of individuals into several clusters. Its solution rests on the formal definition of concordance measures based on the principles of the historical measures developed for comparing strict partitions, and it can be applied in various fields such as biology, image processing and clustering. Depending on whether they focus on the relations between the individuals described by each partition or on quantifying the similarities between the clusters composing those partitions, we distinguish two main families of measures for which the very notion of concordance between partitions differs, and we propose to characterize their representatives according to a common set of formal and informal properties. From that point of view, the measures are also qualified according to the nature of the compared partitions. A study of the multiple constructions on which the measures of the literature rely completes our taxonomy. We propose three new soft comparison measures that take advantage of the state of the art. The first is an extension of a strict approach, while the other two rely on native approaches, one individual-oriented and the other cluster-oriented, both specifically designed for comparing soft partitions. Our propositions are compared to the existing measures of the literature on a set of experiments chosen to cover the various aspects of the problem. The results clearly show the relevance of our measures for the research topic of partition comparison.
Finally, we open new perspectives by proposing the premises of a new framework unifying most of the individual-oriented soft measures.
9 |
Analysis of Test Coverage Data on a Large-Scale Industrial System. Vasconcelos Jansson, Erik Sven, January 2016 (has links)
Software testing verifies a program's functional behavior and is an important process when engineering critical software. The degree of testing is measured with code coverage, which describes the amount of production code exercised by the tests. Both concepts are used extensively for industrial systems. Previous research has shown that gathering and analyzing test coverage becomes problematic on large-scale systems. Here, development experience, implementation feasibility, coverage measurements and an analysis method are explored, providing potential solutions and insights into these issues. Methods for constructing and integrating such a gathering and analysis system in a large-scale project are outlined, along with the problems encountered and the remedies applied. Instrumentation for gathering coverage information affects performance negatively, and these measurements are provided. Since large-scale test-suite measurements are quite lacking, the line, branch and function coverage criteria are presented here. Finally, an analysis method using coverage set operations and Jaccard indices is proposed to find test similarities. The results imply that execution time was significantly affected when gathering coverage, [2.656, 2.911] hours for the instrumented software compared with [2.075, 2.260] hours originally on the system under test (alpha = 5%, n = 4), while the effects on processor and memory usage were inconclusive. The measured criteria were (59.3, 70.7, 24.6)% for these suites. The analysis method reveals potential areas of test redundancy.
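A minimal sketch of the proposed analysis step, assuming each test's coverage is available as a set of covered source lines and pairs are flagged when their Jaccard index reaches a threshold; the coverage sets and threshold are illustrative assumptions.

```python
from itertools import combinations

def jaccard_index(cov_a: set, cov_b: set) -> float:
    """Similarity of two tests' coverage sets (covered source lines)."""
    union = cov_a | cov_b
    return len(cov_a & cov_b) / len(union) if union else 0.0

def redundant_pairs(coverage: dict, threshold: float = 0.9):
    """Pairs of tests whose coverage sets overlap at or above the threshold."""
    pairs = []
    for t1, t2 in combinations(coverage, 2):
        score = jaccard_index(coverage[t1], coverage[t2])
        if score >= threshold:
            pairs.append((t1, t2, score))
    return pairs

coverage = {
    "test_parse_ok":    {"parser.c:10", "parser.c:11", "parser.c:12", "util.c:3"},
    "test_parse_again": {"parser.c:10", "parser.c:11", "parser.c:12", "util.c:3"},
    "test_error_path":  {"parser.c:10", "errors.c:7", "errors.c:8"},
}
for t1, t2, score in redundant_pairs(coverage):
    print(f"{t1} ~ {t2}: Jaccard = {score:.2f}")
```

Pairs with a Jaccard index close to 1 exercise nearly the same code and are candidates for review as potentially redundant tests.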
10 |
A Method for Recommending Computer-Security Training for Software Developers. Nadeem, Muhammad, 12 August 2016 (has links)
Vulnerable code may cause security breaches in software systems, resulting in financial and reputation losses for organizations in addition to the loss of their customers' confidential data. Delivering proper software security training to software developers is key to preventing such breaches. Conventional training methods do not take into account the code written by the developers over time, which makes these training sessions less effective. We propose a method for recommending computer-security training that helps identify focused and narrow areas in which developers need training. The proposed method leverages the power of static analysis techniques, using the vulnerabilities flagged in the source code as its basis, to suggest the most appropriate training topics to different software developers. Moreover, it utilizes public vulnerability repositories as its knowledge base to suggest community-accepted solutions to different security problems. Such mitigation strategies are platform independent, giving further strength to the utility of the system. This research discusses the proposed architecture of the recommender system, case studies to validate the system architecture, tailored algorithms to improve the performance of the system, and a human-subject evaluation conducted to determine the usefulness of the system. Our evaluation suggests that the proposed system successfully retrieves relevant training articles from the public vulnerability repository. The human subjects found these articles to be suitable for training, and they found the proposed recommender system as effective as a commercial tool.
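The abstract does not spell out the matching step, so the following is only a hypothetical sketch: it assumes that static-analysis findings and repository articles are both tagged with weakness categories (CWE-style identifiers) and that articles are ranked by Jaccard overlap with a developer's flagged categories; the tags, article titles and ranking rule are invented for illustration and are not claimed to be the thesis's actual method.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two sets of weakness-category tags."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend_training(flagged_categories: set, articles: dict, top_n: int = 2):
    """Rank training articles by tag overlap with the developer's flagged findings."""
    ranked = sorted(
        articles.items(),
        key=lambda item: jaccard(flagged_categories, item[1]),
        reverse=True,
    )
    return [(title, jaccard(flagged_categories, tags)) for title, tags in ranked[:top_n]]

# Hypothetical static-analysis findings for one developer, as CWE-style tags.
flagged = {"CWE-89", "CWE-79"}
articles = {
    "Preventing SQL injection":    {"CWE-89"},
    "Output encoding and XSS":     {"CWE-79", "CWE-80"},
    "Safe memory management in C": {"CWE-119", "CWE-416"},
}
for title, score in recommend_training(flagged, articles):
    print(f"{score:.2f}  {title}")
```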