851 |
Analysis and Evaluation of Visuospatial Complexity Models. Hammami, Bashar; Afram, Mjed. January 2022 (has links)
Visuospatial complexity refers to the level of detail or intricacy present within a scene, taking into account both the spatial and visual properties of the dynamic scene or place (e.g. moving images, everyday driving, video games and other immersive media). There have been several studies on measuring visual complexity from various viewpoints, e.g. marketing, psychology, computer vision and cognitive science. This research project aims at analysing and evaluating different models and tools that have been developed to measure low-level features of visuospatial complexity, such as the Structural Similarity Index measure, the Feature Congestion measure of clutter and the Subband Entropy measure of clutter. We use two datasets, one focusing on (reflectional) symmetry in static images, and another that consists of real-world driving videos. The results of the evaluation show different correlations between the implemented models, indicating that the nature of the scene plays a significant role.
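As a hedged illustration of the first of these measures, the sketch below computes SSIM between two frames with scikit-image; the synthetic frames are stand-ins for consecutive frames of a driving video.

```python
# A minimal sketch: Structural Similarity Index between two grayscale frames
# using scikit-image. The frames are synthetic placeholders.
import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
frame_a = rng.random((240, 320))                  # grayscale frame, values in [0, 1]
frame_b = np.clip(frame_a + rng.normal(0, 0.05, frame_a.shape), 0.0, 1.0)

# data_range must be given for float images; full=True also returns the
# per-pixel similarity map, useful for localising visually complex regions.
score, ssim_map = structural_similarity(frame_a, frame_b, data_range=1.0, full=True)
print(f"Mean SSIM: {score:.3f}")                  # 1.0 would mean identical frames
```

The Feature Congestion and Subband Entropy clutter measures (Rosenholtz et al.) are more involved and are not reproduced here.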
|
852 |
Fuzzer Test Log Analysis Using Machine Learning: Framework to analyze logs and provide feedback to guide the fuzzer. Yadav, Jyoti. January 2018 (has links)
In the modern world, machine learning and deep learning have become popular choices for analysing and identifying patterns in large volumes of data. The focus of the thesis work has been on the design of alternative strategies using machine learning to guide the fuzzer in selecting the most promising test cases. The thesis work mainly focuses on the analysis of the data by using machine learning techniques. A detailed analysis study and work is carried out in multiple phases. The first phase is targeted at converting the data into a suitable format (pre-processing) so that the necessary features can be extracted and fed as input to the unsupervised machine learning algorithms. Machine learning algorithms accept the input data in the form of matrices which represent the dimensionality of the extracted features. Several experiments and run-time benchmarks have been conducted to choose the most efficient algorithm based on execution time and accuracy of results. Finally, the best choice has been implemented to get the desired result. The second phase of the work deals with applying supervised learning over the clustering results. The final phase describes how an incremental learning model is built to score the test case logs and return their score in near real time, which can act as feedback to guide the fuzzer.
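A minimal sketch of how such a pipeline might look is given below. The vectorizer, the clustering algorithm and the example logs are illustrative assumptions, not the thesis's benchmarked configuration.

```python
# Phase 1: pre-process raw fuzzer logs into a fixed-dimensional feature matrix.
# HashingVectorizer is stateless, so new logs can be encoded on the fly.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

raw_logs = [
    "ASSERT fail in parse_header length=-1",          # hypothetical log lines
    "timeout waiting for response in tls_handshake",
    "heap-buffer-overflow READ of size 4 in decode_frame",
    "ASSERT fail in parse_header length=65536",
]
vectorizer = HashingVectorizer(n_features=2 ** 12, alternate_sign=False)
X = vectorizer.transform(raw_logs)

# Phase 2: unsupervised clustering. MiniBatchKMeans supports partial_fit,
# which enables the incremental, near-real-time scoring the abstract mentions.
model = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)
model.partial_fit(X)

# Score a new log by its distance to the nearest cluster centre: an unusually
# distant log suggests novel behaviour worth prioritising in the fuzzer.
new_log = vectorizer.transform(["SEGV on unknown address in decode_frame"])
novelty = model.transform(new_log).min()
print(f"novelty score: {novelty:.3f}")
```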
|
853 |
Travel Diary Semantics Enrichment of Trajectories based on Trajectory Similarity Measures. LIU, RUI. January 2018 (has links)
Trajectory data is playing an increasingly important role in our daily lives, as well as in commercial applications and scientific research. With the rapid development and popularity of GPS, people can locate themselves in real time. Therefore, users' behavior information can be collected by analyzing their GPS trajectory data, so as to predict their new trajectories' destinations, ways of travelling and even the transportation mode they use, which forms a complete personal travel diary. The task in this thesis is to implement travel diary semantics enrichment of a user's trajectories based on the user's historical labeled data and trajectory similarity measures. Specifically, this dissertation studies the following tasks. Firstly, trip segmentation concerns detecting the trips in a trajectory, which is an unbounded sequence of timestamped locations of the user; this means detecting the stops, the moves, and the trips of the user between two consecutive stops. In this thesis, a heuristic rule is used to identify the stops. Secondly, tripleg segmentation concerns identifying the location/time instances between two triplegs where/when a user changes between transport modes in the user's trajectory (makes transport mode transitions). Finally, mode inference concerns identifying the travel mode of each tripleg. Steps 2 and 3 are both based on the same trajectory similarity measure and project the information from the matched similar trip trajectory onto the unlabeled trip trajectory. The empirical evaluation of these three tasks is based on a real-world dataset (containing 4240 trips and 5451 triplegs with 14 travel modes for 206 users over a one-week study period). The experimental performance (including trends, coverage and accuracy) is evaluated: accuracy is around 25% for trip segmentation, varies between 50% and 55% for tripleg segmentation, and is between 55% and 60% for mode inference. Moreover, accuracy is higher for longer trips than for shorter trips, probably because people have more mode choices on short-distance trips (like moped, bus and car), which makes the measure more confused. Accuracy can be increased by nearly 10% with the help of reverse trip identification, because it gives a trip more similar historical trips and increases the probability that a new unlabeled trip can be matched based on its historical trips.
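The stop-detection heuristic could, for instance, flag a stop whenever the user remains within a small radius for a minimum dwell time. The sketch below illustrates this idea with assumed thresholds; the thesis's actual rule and parameters may differ.

```python
# A sketch of a dwell-time stop-detection heuristic over a GPS track.
# Thresholds (50 m, 5 minutes) are illustrative assumptions.
from math import radians, sin, cos, asin, sqrt

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def detect_stops(track, radius_m=50.0, min_dwell_s=300.0):
    """track: list of (timestamp_s, lat, lon), time-ordered.
    Returns (start_index, end_index) pairs of detected stops."""
    stops, i = [], 0
    while i < len(track):
        j = i
        # Extend the window while the user stays within radius_m of the anchor.
        while j + 1 < len(track) and haversine_m(track[i][1:], track[j + 1][1:]) <= radius_m:
            j += 1
        if track[j][0] - track[i][0] >= min_dwell_s:
            stops.append((i, j))
        i = j + 1
    return stops

track = [(0, 59.3300, 18.0600), (200, 59.3301, 18.0601),
         (420, 59.3302, 18.0602), (600, 59.3600, 18.1000)]
print(detect_stops(track))  # [(0, 2)] with these illustrative thresholds
```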
|
854 |
Learning Techniques For Information Retrieval And Mining In High-dimensional Databases. Cheng, Hao. 01 January 2009 (has links)
The main focus of my research is to design effective learning techniques for information retrieval and mining in high-dimensional databases. There are two main aspects in retrieval and mining research: accuracy and efficiency. The accuracy problem is how to return results which better match the ground truth, and the efficiency problem is how to evaluate users' requests and execute learning algorithms as fast as possible. However, these problems are non-trivial because of the complexity of high-level semantic concepts, the heterogeneous nature of the feature space, the high dimensionality of data representations and the size of the databases. My dissertation is dedicated to addressing these issues. Specifically, my work has five main contributions, as follows. The first contribution is a novel manifold learning algorithm, Local and Global Structures Preserving Projection (LGSPP), which defines salient low-dimensional representations for the high-dimensional data. A small number of projection directions are sought in order to properly preserve the local and global structures of the original data. Specifically, two groups of points are extracted for each individual point in the dataset: the first group contains the nearest neighbors of the point, and the other contains a few sampled points far away from the point. These two point sets respectively characterize the local and global structures with regard to the data point. The objective of the embedding is to minimize the distances of the points in each local neighborhood and also to disperse the points far away from their respective remote points in the original space. In this way, the relationships between the data in the original space are well preserved with little distortion. The second contribution is a new constrained clustering algorithm. Conventionally, clustering is an unsupervised learning problem, which systematically partitions a dataset into a small set of clusters such that data in each cluster appear similar to each other compared with those in other clusters. In this proposal, partial human knowledge is exploited to find better clustering results. Two kinds of constraints are integrated into the clustering algorithm. One is the must-link constraint, indicating that the two points involved belong to the same cluster; the cannot-link constraint denotes that two points are not within the same cluster. Given the input constraints, data points are arranged into small groups and a graph is constructed to preserve the semantic relations between these groups. The assignment procedure makes a best effort to assign each group to a feasible cluster without violating the constraints. The theoretical analysis reveals that the probability of data points being assigned to the true clusters is much higher with the new proposal than with conventional methods. In general, the new scheme can produce clusters which better match the ground truth and respect the semantic relations between points inferred from the constraints. The third contribution is a unified framework for partition-based dimension reduction techniques, which allows efficient similarity retrieval in the high-dimensional data space. Recent similarity search techniques, such as Piecewise Aggregate Approximation (PAA), Segmented Means (SMEAN) and Mean-Standard deviation (MS), prove to be very effective in reducing data dimensionality by partitioning dimensions into subsets and extracting aggregate values from each dimension subset. These partition-based techniques have many advantages, including very efficient multi-phased pruning, while being simple to implement. They are, however, not adaptive to the different characteristics of data in diverse applications. In this study, a unified framework for these partition-based techniques is proposed and the issue of dimension partitioning is examined within this framework. An investigation of the relationship between query selectivity and dimension partition schemes reveals indicators which can predict the performance of a partitioning setting. Accordingly, a greedy algorithm is designed to effectively determine a good partitioning of the data dimensions so that the performance of the reduction technique is robust with regard to different datasets. The fourth contribution is an effective similarity search technique for databases of point sets. In the conventional model, an object corresponds to a single vector. In the proposed study, an object is represented by a set of points. In general, this new representation can be used in many real-world applications and carries much more local information, but the retrieval and learning problems become very challenging. The Hausdorff distance is the common distance function for measuring the similarity between two point sets; however, this metric is sensitive to outliers in the data. To address this issue, a novel similarity function is defined to better capture the proximity of two objects, in which a one-to-one mapping is established between the vectors of the two objects. The optimal mapping minimizes the sum of distances between paired points. The overall distance of the optimal matching is robust and achieves high retrieval accuracy. The computation of the new distance function is formulated as the classical assignment problem. Lower-bounding techniques and an early-stop mechanism are also proposed to significantly accelerate the expensive similarity search process. The classification problem over point-set data is called Multiple Instance Learning (MIL) in the machine learning community, in which a vector is an instance and an object is a bag of instances. The fifth contribution is to convert the MIL problem into standard supervised learning in the conventional vector space. Specifically, feature vectors of bags are grouped into clusters. Each object is then denoted as a bag of cluster labels, and common patterns of each category are discovered, each of which is further reconstructed into a bag of features. Accordingly, a bag is effectively mapped into a feature space defined by the distances from this bag to all the derived patterns. Standard supervised learning algorithms can then be applied to classify objects into pre-defined categories. The results demonstrate that the proposal has better classification accuracy than other state-of-the-art techniques. In the future, I will continue to explore my research in large-scale data analysis algorithms, applications and system developments. In particular, I am interested in applications that analyze massive volumes of online data.
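As a concrete illustration of the assignment-problem formulation in the fourth contribution, the sketch below computes a one-to-one matching distance between two small point sets with SciPy; the lower-bounding and early-stop accelerations are not shown, and the data are illustrative.

```python
# A sketch of the point-set distance described above: an optimal one-to-one
# mapping between the vectors of two objects, obtained by solving the
# classical assignment problem. Example point sets are illustrative.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def matching_distance(A, B):
    """A, B: arrays of shape (n_points, dim). Returns the sum of distances
    under the optimal one-to-one assignment, which is less sensitive to
    outliers than the Hausdorff distance."""
    cost = cdist(A, B)                        # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)  # Hungarian-style matching
    return cost[rows, cols].sum()

A = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
B = np.array([[0.1, 0.0], [2.1, 0.1], [1.0, 0.9]])
print(f"matching distance: {matching_distance(A, B):.3f}")
```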
|
855 |
[pt] METODOS DE BUSCA POR SIMILARIDADE EM SEQUÊNCIAS TEMPORAIS DE VETORES COM UMA APLICAÇÃO À RECUPERAÇÃO DE ANÚNCIOS CLASSIFICADOS / [en] STAGED VECTOR STREAM SIMILARITY SEARCH METHODS WITH AN APPLICATION TO CLASSIFIED AD RETRIEVAL. BRUNO FRANCISCO MARTINS DA SILVA. 22 February 2024 (has links)
[en] A vector stream can be modeled as a sequence of pairs ((v1, t1), ..., (vn, tn)), where vk is a vector and tk is a timestamp, such that all vectors are of the same dimension and tk < tk+1. The vector stream similarity search problem is defined as: given a (high-dimensional) vector q and a time interval T, find a ranked list of vectors, retrieved from a vector stream, that are similar to q and that were received in the time interval T. This dissertation first introduces a family of vector stream similarity search methods that do not depend on having the full set of vectors available beforehand but adapt to the vector stream as vectors are added. The methods generate a sequence of indices that are used to implement approximate nearest neighbor search over the vector stream. Then, the dissertation describes an implementation of a method in the family based on Hierarchical Navigable Small World graphs. Based on this implementation, the dissertation presents a Classified Ad Retrieval tool that supports classified ad retrieval as new ads are continuously submitted. The tool is structured into a main module and three auxiliary modules, where the main module is responsible for coordinating the auxiliary modules and for providing a user interface, and the auxiliary modules are responsible for text and image encoding, vector stream indexing, and data storage. To evaluate the tool, the dissertation uses a dataset with approximately 1 million records with descriptions of classified ads and their respective images. The results showed that the tool reached an average precision of 98 percent and an average recall of 97 percent.
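A hedged sketch of the core indexing idea is shown below, using the hnswlib implementation of Hierarchical Navigable Small World graphs. The time-interval filter via over-fetching is an illustrative simplification, not the dissertation's staged-index design.

```python
# Approximate nearest-neighbour search over a growing vector stream with
# hnswlib. Dimension, capacity and the post-hoc time filter are assumptions.
import numpy as np
import hnswlib

dim, capacity = 128, 10_000
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=capacity, ef_construction=200, M=16)
index.set_ef(100)                      # search breadth; must exceed query k

timestamps = {}                        # label -> arrival time, kept outside the index

def add_vector(label, vector, t):
    index.add_items(vector.reshape(1, -1), np.array([label]))
    timestamps[label] = t

def search(query, t_start, t_end, k=10):
    # Over-fetch, then keep only hits whose timestamp lies in [t_start, t_end].
    labels, dists = index.knn_query(query.reshape(1, -1), k=min(4 * k, len(timestamps)))
    hits = [(int(l), float(d)) for l, d in zip(labels[0], dists[0])
            if t_start <= timestamps[int(l)] <= t_end]
    return hits[:k]

rng = np.random.default_rng(0)
for i in range(1000):                  # simulate the incoming vector stream
    add_vector(i, rng.standard_normal(dim).astype(np.float32), t=float(i))
print(search(rng.standard_normal(dim).astype(np.float32), t_start=100.0, t_end=500.0))
```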
|
856 |
A framework for comparing heterogeneous objects: on the similarity measurements for fuzzy, numerical and categorical attributes. Bashon, Yasmina M.; Neagu, Daniel; Ridley, Mick J. 09 1900 (has links)
Real-world data collections are often heterogeneous (represented by a set of mixed attribute data types: numerical, categorical and fuzzy); since most available similarity measures can only be applied to one type of data, it becomes essential to construct an appropriate similarity measure for comparing such complex data. In this paper, a framework of new and unified similarity measures is proposed for comparing heterogeneous objects described by numerical, categorical and fuzzy attributes. Examples are used to illustrate, compare and discuss the applications and efficiency of the proposed approach to heterogeneous data comparison and clustering.
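The sketch below illustrates the general shape of such a unified measure: a per-attribute similarity for each data type, combined across attributes. The individual measures are simple stand-ins and may well differ from those proposed in the paper.

```python
# A hedged sketch of a mixed-type object similarity. Fuzzy values are
# modelled as triangular fuzzy numbers (left, peak, right); all measures
# here are illustrative stand-ins for the paper's definitions.
def sim_numerical(a, b, value_range):
    return 1.0 - abs(a - b) / value_range            # normalised distance

def sim_categorical(a, b):
    return 1.0 if a == b else 0.0                    # simple match/mismatch

def sim_fuzzy(a, b, value_range):
    # Compare peaks and spreads of two triangular fuzzy numbers: a crude
    # stand-in for a proper fuzzy similarity measure.
    peak = 1.0 - abs(a[1] - b[1]) / value_range
    spread = 1.0 - abs((a[2] - a[0]) - (b[2] - b[0])) / value_range
    return 0.5 * (peak + spread)

def object_similarity(x, y, schema):
    """schema: per-attribute list of ('num'|'cat'|'fuz', value_range_or_None)."""
    sims = []
    for xi, yi, (kind, rng) in zip(x, y, schema):
        if kind == "num":
            sims.append(sim_numerical(xi, yi, rng))
        elif kind == "cat":
            sims.append(sim_categorical(xi, yi))
        else:
            sims.append(sim_fuzzy(xi, yi, rng))
    return sum(sims) / len(sims)                     # unweighted average

x = (170.0, "blue", (20.0, 25.0, 30.0))              # height, colour, fuzzy age
y = (180.0, "blue", (22.0, 24.0, 28.0))
schema = [("num", 50.0), ("cat", None), ("fuz", 30.0)]
print(f"similarity: {object_similarity(x, y, schema):.3f}")
```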
|
857 |
Using Pareto points for model identification in predictive toxicology. Palczewska, Anna Maria; Neagu, Daniel; Ridley, Mick J. January 2013 (has links)
Predictive toxicology is concerned with the development of models that are able to predict the toxicity of chemicals. A reliable prediction of the toxic effects of chemicals in living systems is highly desirable in cosmetics, drug design and food protection, to speed up the process of chemical compound discovery while reducing the need for lab tests. There is an extensive literature associated with the best practice of model generation and data integration, but the management and automated identification of relevant models from available collections of models is still an open problem. Currently, the decision on which model should be used for a new chemical compound is left to users. This paper intends to initiate the discussion on automated model identification. We present an algorithm, based on Pareto optimality, which mines model collections and identifies a model that offers a reliable prediction for a new chemical compound. The performance of this new approach is verified for two endpoints: IGC50 and LogP. The results show great potential for automated model identification methods in predictive toxicology.
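The sketch below illustrates the idea of Pareto-based model identification on two hypothetical selection criteria: a model's estimated reliability for the query compound, and the compound's distance to the model's training domain. The criteria, scores and model names are assumptions, not the paper's actual setup.

```python
# Keep the non-dominated (Pareto-optimal) models from a collection scored on
# (reliability: higher is better, domain_distance: lower is better).
def dominates(a, b):
    """True if a is at least as good as b on both criteria and strictly
    better on at least one."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def pareto_front(models):
    """models: dict name -> (reliability, domain_distance)."""
    return {
        name: score for name, score in models.items()
        if not any(dominates(other, score) for other in models.values())
    }

candidates = {                        # hypothetical IGC50 model collection
    "RF_IGC50_v1": (0.84, 0.30),
    "SVM_IGC50_v2": (0.80, 0.10),
    "kNN_IGC50_v3": (0.78, 0.40),     # dominated by both models above
}
print(pareto_front(candidates))       # RF and SVM survive; kNN drops out
```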
|
858 |
Impact of eWOM Source Characteristics on The Purchasing Intention. Shabsogh, Nisrein Mohammad Ahmad. January 2013 (has links)
The use of e-mail communication between consumers has been growing, and companies are seeking to increase their understanding of this type of private communication medium. The privacy and cost-effectiveness characteristics of e-mail make it an important communication medium for consumers. Consumers use e-mail to exchange a variety of information, including electronic word of mouth (eWOM) about products, services and organisations. The travel industry, the context of this study, is increasingly being delivered online. Understanding what influences consumers and how consumers evaluate eWOM will increase the travel industry's knowledge of its consumer base.
This study aims to contribute to existing knowledge on the impact of eWOM on consumer purchase intention. Its focus is on an interpersonal context where eWOM is sent from the source to the receiver in an e-mail about a holiday destination. The study, which was undertaken from a positivist perspective, used qualitative and quantitative research techniques to better understand the influence of eWOM on purchase intention. The literature on word of mouth (WOM) and eWOM was initially examined to identify the major factors that have an influence on the receiver of eWOM.
Consistent with previous studies, both perceived expertise and similarity were identified as source characteristics that have an influence on the receiver’s purchase intention. The literature also indicated that trustworthiness belief would have a key effect on the influence of eWOM on the attitude of the receiver. Consequently, this study examined each trustworthiness dimension – ability, benevolence, and integrity – with respect to its role in the influence of eWOM on purchase intention.
The literature review also revealed that certain receiver characteristics were important in the process of influence, especially consumer susceptibility to interpersonal influence. The relationships between the variables identified were further developed into the research model, which has its roots in the theory of reasoned action (Fishbein and Ajzen, 1975) and the dual process theory of influence (Deutsch and Gerard, 1955).
Methodologically, a scenario-building approach to developing authentic e-mail was used. The qualitative data gathered from eight focus group discussions were analysed using “framework analysis” (Ritchie and Spencer, 1994) to develop eight scenarios. This was then used to manipulate the moderating variables in the scenario. Three manipulations, each with two levels, were included: eWOM direction “positive and negative”; source characteristic of “expert/non-expert”; and source characteristic of “similar/non-similar”. These scenarios formed part of a questionnaire.
The questionnaire was used to collect data from a sample of University of Bradford students. The final number of usable questionnaires was 477. Structural equation modelling was used to determine the validity of the conceptual model and test the hypotheses. In particular, multiple group analysis was used to assess both the measurement and structural models, and to identify the impact of the eWOM direction. The theoretical model that describes the relationships between the exogenous variables (source’s and receiver’s characteristics) and the endogenous variables (trustworthiness dimensions, interpersonal influence and purchase intention) was accepted. The research findings provided empirical evidence on the difference in the impact of positive and negative eWOM on purchase intention. The source’s and receiver’s characteristics and related trustworthiness beliefs, (i.e. ability, benevolence, and integrity) are influenced by the direction of eWOM.
The findings show that positive and negative eWOM differ with respect to how they impact on consumers’ attitudes and intentions. For instance, consumers have more belief in the credibility of a source who provides negative eWOM. However, the overall influence of the source’s characteristics tends to be stronger with positive than with negative eWOM. The findings of this study provide insights for both academics and practitioners to understand the potential of eWOM. This might be tailored to help develop more private relationships with customers through e-mail marketing strategies that incorporate eWOM. Negative eWOM is more credible but less directly useful to marketers. Nevertheless, it is important for marketers to realise the significance of managing dissatisfaction and to harness the power of negative eWOM. Similarly, positive eWOM is effective especially when the source is both expert and similar. This might be translated into online marketing campaigns that use consumer-to-consumer discussions in addition to viral marketing. Future research might test the model in different contexts, (e.g. financial services), to provide a more comprehensive picture of the influence of eWOM on purchase intention.
|
859 |
A Novel Approach for Continuous Speech Tracking and Dynamic Time Warping. Adaptive Framing Based Continuous Speech Similarity Measure and Dynamic Time Warping using Kalman Filter and Dynamic State Model. Khan, Wasiq. January 2014 (has links)
Dynamic speech properties such as time warping, silence removal and background noise interference are the most challenging issues in continuous speech signal matching. Among them, time-warped speech signal matching is of great interest and has been a tough challenge for researchers. An adaptive framing based continuous speech tracking and similarity measurement approach is introduced in this work, following comprehensive research conducted in diverse areas of speech processing. A dynamic state model is introduced, based on a system of linear motion equations, which models the input (test) speech signal frame as a unidirectional moving object along the template speech signal. The most similar corresponding frame position in the template speech is estimated, and this estimate is fused with a feature-based similarity observation and the noise variances using a Kalman filter. The Kalman filter provides the final estimated frame position in the template speech at the current time, which is further used to predict a new frame size for the next step. In addition, a keyword spotting approach is proposed, introducing a wavelet decomposition based dynamic noise filter and a combination of beliefs. Dempster's theory of belief combination is deployed for the first time in relation to the keyword spotting task. The performance of both the speech tracking and keyword spotting approaches is evaluated using statistical metrics and gold standards for binary classification. Experimental results proved the superiority of the proposed approaches over existing methods. / The appendix files are not available online.
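A minimal sketch of the prediction-correction step described above: a constant-velocity state model predicts the matching frame position in the template speech, and a similarity-based position observation corrects it through a Kalman filter. All noise variances and observations here are illustrative assumptions.

```python
# One-dimensional position/velocity Kalman filter tracking the matched
# frame position in the template speech signal.
import numpy as np

F = np.array([[1.0, 1.0],
              [0.0, 1.0]])            # position advances by velocity each step
H = np.array([[1.0, 0.0]])            # only position is observed
Q = np.diag([0.01, 0.01])             # process noise (assumed)
R = np.array([[4.0]])                 # observation noise (assumed)

x = np.array([0.0, 1.0])              # initial [frame position, velocity]
P = np.eye(2)

def kalman_step(x, P, z):
    # Predict where the matching template frame should now be...
    x_pred, P_pred = F @ x, F @ P @ F.T + Q
    # ...then correct with the similarity-based observed position z.
    y = z - H @ x_pred
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + (K @ y).ravel()
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

for z in [1.2, 2.1, 2.9, 4.2]:        # noisy positions from feature matching
    x, P = kalman_step(x, P, np.array([z]))
    print(f"estimated position: {x[0]:.2f}, velocity: {x[1]:.2f}")
```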
|
860 |
Similarity metric for crowd trajectory evaluation on a per-agent basis: An approach based on the sum of absolute differences / Likhetsmetrik för folkmassautvärdering ur ett per-agent perspektiv: En metod baserad på summan av absoluta skillnader. Brunnberg, Karl. January 2023 (has links)
Simulation models that replicate realistic crowd behaviours and dynamics are of great societal use in a variety of fields of research and entertainment. In order to evaluate the accuracy of such models, there is a demand for metrics and evaluation solutions that measure how well they simulate the dynamics of real crowds. A crowd similarity metric is a performance indicator which quantifies the similarity of crowd trajectories. Similarity metrics may be used to evaluate the validity of simulation models by comparing the content they produce to real-world crowd trajectory data. This thesis presents and evaluates a similarity metric which employs an approach based on the Sum of Absolute Differences to compare two-dimensional crowd trajectories. The metric encapsulates the similarity of crowd trajectories by iteratively summing time-wise positional differences on a per-agent basis. The resulting metric is simple, highly reproducible and simulator-independent. Its accuracy in quantifying similarity is evaluated by means of a user study investigating the correlation between metric values and human perception of similarity for real and simulated crowd scenarios of varying density, trajectory, speed, and presence of environmental obstacles. The user study explores different aspects of crowd perception by dividing similarity ratings on a five-point Likert scale into four categories: overall, in terms of trajectories, speeds, and positions. Scenarios and rating categories that indicate high and low degrees of correlation between metric values and perceived similarity are identified and discussed. Furthermore, the findings are compared to previous research on crowd trajectory similarity metrics. The results indicate that the metric shows promising potential for accurate similarity measurement in simple and sparse scenarios across all rated categories. Moreover, the metric is strongly correlated with the trajectory ratings of crowd motion similarity. However, it appears not to correlate well with the perception of overall similarity for large and dense crowds.
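A minimal sketch of such a per-agent SAD metric is given below, assuming both crowds contain the same agents sampled at matched time steps; the thesis's exact aggregation and normalisation may differ.

```python
# Sum of Absolute Differences between two crowds on a per-agent basis.
# Input shapes are an assumption: (agents, timesteps, 2) with matched agents.
import numpy as np

def sad_crowd_similarity(crowd_a, crowd_b):
    """Lower values mean more similar trajectories."""
    per_agent = np.abs(crowd_a - crowd_b).sum(axis=(1, 2))  # SAD per agent
    return per_agent.mean()                                 # aggregate over agents

rng = np.random.default_rng(1)
real = rng.uniform(0, 10, size=(5, 100, 2))        # 5 agents, 100 time steps
simulated = real + rng.normal(0, 0.1, size=real.shape)
print(f"mean per-agent SAD: {sad_crowd_similarity(real, simulated):.2f}")
```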
|