11

Similarity-based recommendation of OLAP sessions

Aligon, Julien 13 December 2013 (has links)
OLAP (On-Line Analytical Processing) is the main paradigm for accessing multidimensional data in data warehouses. To obtain high querying expressiveness despite a small query formulation effort, OLAP provides a set of operations (such as drill-down and slice-and-dice) that transform one multidimensional query into another, so that OLAP queries are normally formulated as sequences called OLAP sessions. During an OLAP session the user analyzes the results of a query and, depending on the specific data she sees, applies one operation to determine a new query that will give her a better understanding of the information. The resulting sequences of queries are strongly related to the issuing user, to the analyzed phenomenon, and to the current data. While it is universally recognized that OLAP tools play a key role in supporting flexible and effective exploration of multidimensional cubes in data warehouses, it is also commonly agreed that the huge number of possible aggregations and selections that can be applied to the data may leave the user disoriented.
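An OLAP session, as described above, is simply a chain of queries produced by successive operations. A minimal Python sketch of that idea, assuming a toy query representation (group-by attributes, selections, measures); the Jaccard-based measure below is an illustrative assumption, not the similarity measures the thesis actually develops:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OLAPQuery:
    """A multidimensional query: group-by attributes, selection predicates, measures."""
    group_by: frozenset
    selections: frozenset
    measures: frozenset

def drill_down(q: OLAPQuery, attr: str) -> OLAPQuery:
    """One OLAP operation: refine the aggregation level by adding a group-by attribute."""
    return OLAPQuery(q.group_by | {attr}, q.selections, q.measures)

def query_similarity(a: OLAPQuery, b: OLAPQuery) -> float:
    """Jaccard similarity averaged over the three query components
    (an illustrative assumption, not the thesis's actual measure)."""
    def jac(x, y):
        return len(x & y) / len(x | y) if (x | y) else 1.0
    return (jac(a.group_by, b.group_by)
            + jac(a.selections, b.selections)
            + jac(a.measures, b.measures)) / 3

# An OLAP session is the sequence of queries produced by successive operations.
q0 = OLAPQuery(frozenset({"year"}), frozenset({("country", "=", "FR")}), frozenset({"sales"}))
session = [q0, drill_down(q0, "month"), drill_down(q0, "product")]
print(query_similarity(session[0], session[1]))  # ~0.83: one drill-down apart
```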
12

Similarity models for atlas-based segmentation of whole-body MRI volumes

Axberg, Elin, Klerstad, Ida January 2020 (has links)
In order to analyse the body composition of MRI (Magnetic Resonance Imaging) volumes, atlas-based segmentation is often used to retrieve information from specific organs or anatomical regions. The idea behind this technique is to use an already segmented image volume, an atlas, to segment a target image volume by registering the volumes to each other. During this registration a deformation field is calculated, which is applied to a segmented part of the atlas, resulting in the same anatomical segmentation in the target. The drawback of this method is that the quality of the segmentation depends heavily on the similarity between the target and the atlas, which means that many atlases are needed to obtain good segmentation results in large sets of MRI volumes. One potential solution to this problem is to construct the deformation field between a target and an atlas as a sequence of small deformations between more similar bodies.

In this master's thesis a new method for atlas-based segmentation has been developed, with the aim of obtaining good segmentation results regardless of the level of similarity between the target and the atlas. To do so, 4000 MRI volumes were used to create a manifold of human bodies representing a large variety of body types. These MRI volumes were compared to each other and the calculated similarities were saved in matrices called similarity models. Three different similarity measures were used to create the models, resulting in three versions of the model. To test the hypothesis that good segmentation results can be achieved when the deformation field is constructed as a sequence of small deformations, the similarity models were used to find the shortest path (the path with the least dissimilarity) between a target and an atlas in the manifold.

To evaluate the constructed similarity models, three MRI volumes were chosen as atlases and 100 MRI volumes were randomly picked as targets. The shortest paths between these volumes were used to create the deformation fields as sequences of small deformations. The created fields were then used to segment the anatomical regions ASAT (abdominal subcutaneous adipose tissue), LPT (left posterior thigh) and VAT (visceral adipose tissue). Segmentation performance was measured with the Dice index, with segmentations produced at AMRA Medical AB used as ground truth. To put the results in relation to another segmentation method, direct deformation fields between the targets and the atlases were also created, and their segmentation results were compared to the ground truth with the Dice index. Two types of transformation, one non-parametric and one affine, were used to create the deformation fields. The evaluation showed that good segmentation results can be achieved for the segmentation of VAT with one of the constructed similarity models, when a non-parametric registration method was used to create the deformation fields. Achieving similar results for an affine registration, and improving the segmentation of other anatomical regions, requires further investigation.
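Two ingredients named in the abstract are standard and easy to sketch: the Dice index for scoring a segmentation, and the shortest (least-dissimilar) path through a similarity model, along which the small deformations are composed. A minimal sketch assuming binary masks and a dense dissimilarity matrix; this is not the actual AMRA pipeline:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def dice(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice index between two binary segmentation masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * inter / (mask_a.sum() + mask_b.sum())

a = np.zeros((4, 4), bool); a[1:3, 1:3] = True
b = np.zeros((4, 4), bool); b[1:4, 1:3] = True
print(dice(a, b))  # 0.8

# Similarity model: an n-by-n matrix of pairwise dissimilarities between bodies.
rng = np.random.default_rng(0)
d = rng.random((5, 5)); d = (d + d.T) / 2; np.fill_diagonal(d, 0.0)

# Least-dissimilar path from atlas (node 0) to target (node 4); the thesis
# composes one small deformation per edge along such a path.
dist, pred = shortest_path(d, return_predecessors=True)
path, node = [], 4
while node != -9999:            # scipy marks the source's predecessor as -9999
    path.append(node); node = pred[0, node]
print(path[::-1])               # e.g. [0, 2, 4]: atlas -> intermediate body -> target
```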
13

A Recommendation System Based on Multiple Databases.

Goyal, Vivek 11 October 2013 (has links)
No description available.
14

RESEARCH-PYRAMID BASED SEARCH TOOLS FOR ONLINE DIGITAL LIBRARIES

Bani-Ahmad, Sulieman Ahmad 03 April 2008 (has links)
No description available.
15

Analysis of Rank Distance for Malware Classification

Subramanian, Nandita January 2016 (has links)
No description available.
16

Large scale similarity-based time series mining

Silva, Diego Furtado 25 September 2017 (has links)
Time series are ubiquitous in everyday human life. A diversity of application domains generate data arranged in time, such as medicine, biology, economics, and signal processing. Due to the great interest in time series, a large variety of methods for mining temporal data have been proposed in recent decades. Several of these methods have one characteristic in common: at their core, there is a (dis)similarity function used to compare the time series. Dynamic Time Warping (DTW) is arguably the most relevant, studied and applied distance measure for time series analysis. The main drawback of DTW is its computational complexity. At the same time, a significant number of data mining tasks, such as motif discovery, require a quadratic number of distance computations. These tasks are time-intensive even for less expensive distance measures, like the Euclidean distance. This thesis focuses on developing fast algorithms that allow large-scale analysis of temporal data, using similarity-based methods for time series data mining. The contributions of this work have implications for several data mining tasks, such as classification, clustering and motif discovery. Specifically, the main contributions of this thesis are the following: (i) an algorithm to speed up the exact DTW calculation and its embedding into the similarity search procedure; (ii) a novel DTW-based distance that is invariant to spurious prefixes and suffixes; (iii) a music similarity representation with implications for several music mining tasks, together with a fast algorithm to compute it; and (iv) an efficient, anytime method to find motifs and discords under the proposed prefix- and suffix-invariant DTW.
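For reference, DTW itself is a short dynamic program. A sketch with the common Sakoe-Chiba band constraint; this is the textbook algorithm, not the accelerated exact computation the thesis contributes:

```python
import numpy as np

def dtw(x: np.ndarray, y: np.ndarray, band: int = 10) -> float:
    """Dynamic Time Warping distance with a Sakoe-Chiba band of width `band`.
    Classic O(n*band) dynamic program over squared pointwise differences."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = max(1, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(np.sqrt(cost[n, m]))

t1 = np.sin(np.linspace(0, 2 * np.pi, 100))
t2 = np.sin(np.linspace(0.3, 2 * np.pi + 0.3, 100))  # phase-shifted copy
print(dtw(t1, t2))  # small, despite the shift that would inflate Euclidean distance
```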
17

Similarity algorithms for Heterogeneous Information Networks

Ribeiro, Angélica Abadia Paulista 28 January 2019 (has links)
Most real systems can be represented as a graph of multi-typed components with a large number of interactions. Heterogeneous Information Networks (HIN) are interconnected structures with data of multiple types, which support the rich semantic meaning of structural types of nodes and edges. In a HIN, different information can be presented using different types and forms of data, but may carry the same or complementary information, so there is knowledge to be discovered. Terminology Knowledge Structures (TKS), as terminology products, can be sources of linguistic representations and knowledge used to enrich the HIN and to create a similarity measure that retrieves documents similar to each other even when they are of different types (for example, finding medical articles that are in some way related to medical records). In this sense, this work presents the NetworkCreator algorithm, which builds a Heterogeneous Information Network of related documents from medical records and scientific articles, using classical similarity measures, terminology products, and document attributes. A second algorithm, HeteSimTKSQuery, was created to calculate similarity measures between documents of different types within the HIN; terminology products combined with meta-paths were also explored. The results were promising, reaching on average 89% accuracy in some cases. It is worth noting that all HIN presented in the surveyed literature were constructed from a single type of data coming from a single source. The results show that the algorithms are feasible for solving the problems of HIN construction and similarity search, but they still need improvement: future work includes detecting the appropriate node granularity of these networks and reducing the network construction runtime.
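Meta-path similarity in a HIN reduces to products of typed adjacency matrices. HeteSimTKSQuery is not specified in this abstract, so as an illustration here is PathSim, a standard meta-path measure (Sun et al., VLDB 2011), on a toy author-paper network with made-up data:

```python
import numpy as np

# Typed adjacency: A[i, j] = 1 if author i wrote paper j (toy data).
A = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 0]])

# Commuting matrix for the meta-path Author-Paper-Author (APA):
# M[i, j] counts meta-path instances between authors i and j.
M = A @ A.T

def pathsim(M: np.ndarray, i: int, j: int) -> float:
    """PathSim: meta-path count normalized by the self-path counts."""
    return 2.0 * M[i, j] / (M[i, i] + M[j, j])

print(pathsim(M, 0, 1))  # authors 0 and 1 share paper 1 -> similarity 0.5
```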
18

Effective and efficient similarity search in databases

Lange, Dustin January 2013 (has links)
Given a large set of records in a database and a query record, similarity search aims to find all records sufficiently similar to the query record. To solve this problem, two main aspects need to be considered. First, to perform effective search, the set of relevant records is defined using a similarity measure. Second, an efficient access method must be found that performs only a few database accesses and comparisons using the similarity measure. This thesis addresses both aspects, with an emphasis on the latter.

In the first part of this thesis, a frequency-aware similarity measure is introduced. Compared record pairs are partitioned according to the frequencies of their attribute values, and for each partition a different similarity measure is created: machine learning techniques combine a set of base similarity measures into an overall similarity measure. After that, a similarity index for string attributes is proposed, the State Set Index (SSI), which is based on a trie (prefix tree) interpreted as a nondeterministic finite automaton. For processing range queries, the notion of query plans is introduced to describe which similarity indexes to access and which thresholds to apply; the query result should be as complete as possible under some cost threshold. Two query planning variants are introduced: (1) static planning selects a plan at compile time that is used for all queries; (2) query-specific planning selects a different plan for each query. For answering top-k queries, the Bulk Sorted Access Algorithm (BSA) is introduced, which retrieves large chunks of records from the similarity indexes using fixed thresholds and focuses its efforts on records that are ranked high in more than one attribute and are thus promising candidates.

The described components form a complete similarity search system. Based on prototypical implementations, this thesis shows comparative evaluation results for all proposed approaches on different real-world data sets, one of which is a large person data set from a German credit rating agency.
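The State Set Index combines a trie with a nondeterministic-automaton view for approximate string matching. A sketch of that underlying idea using the textbook trie walk with one edit-distance DP row per node, where the set of live rows plays the role of the NFA state set; SSI's actual integer state encoding is not reproduced here:

```python
from collections import defaultdict

class Trie:
    def __init__(self):
        self.children = defaultdict(Trie)
        self.word = None

    def insert(self, word: str):
        node = self
        for ch in word:
            node = node.children[ch]
        node.word = word

def range_query(trie: Trie, query: str, k: int) -> list:
    """All indexed strings within edit distance k of `query`."""
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node: Trie, ch: str, prev_row: list):
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            sub = prev_row[i - 1] + (query[i - 1] != ch)
            row.append(min(row[i - 1] + 1, prev_row[i] + 1, sub))
        if node.word is not None and row[-1] <= k:
            results.append((node.word, row[-1]))
        if min(row) <= k:  # prune: no extension can re-enter the threshold
            for c, child in node.children.items():
                walk(child, c, row)

    for c, child in trie.children.items():
        walk(child, c, first_row)
    return results

t = Trie()
for w in ["meier", "meyer", "mayer", "schmidt"]:
    t.insert(w)
print(range_query(t, "maier", 1))  # [('meier', 1), ('mayer', 1)]
```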
19

Similarity measures in clustering: an application in text mining

Παπαστεργίου, Θωμάς 17 May 2007 (has links)
Development of a dissimilarity measure for categorical data and its application to text clustering and to the authorship attribution problem.
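The abstract does not spell the measure out, so the following is only a generic illustration of a frequency-weighted dissimilarity for categorical vectors, not the measure the thesis develops:

```python
from collections import Counter

def categorical_dissimilarity(x, y, rel_freq):
    """Per-attribute mismatch penalty, weighted by relative value frequency
    (a generic illustration only; not the thesis's actual definition)."""
    total = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if a != b:
            # illustrative choice: mismatches on rare values are penalized less
            total += min(rel_freq[i][a], rel_freq[i][b])
    return total / len(x)

docs = [("noun", "past", "long"), ("noun", "present", "long"), ("verb", "past", "short")]
counts = [Counter(d[i] for d in docs) for i in range(3)]            # value counts per attribute
rel_freq = [{v: c / len(docs) for v, c in f.items()} for f in counts]
print(categorical_dissimilarity(docs[0], docs[1], rel_freq))        # mismatch only on attribute 1
```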
