11

Vertex Weighted Spectral Clustering

Masum, Mohammad 01 August 2017 (has links)
Spectral clustering is often used to partition a data set into a specified number of clusters. Both the unweighted and the vertex-weighted approaches use eigenvectors of the Laplacian matrix of a graph. Our focus is on using vertex-weighted methods to refine the clustering of observations. An eigenvector corresponding to the second smallest eigenvalue of the Laplacian matrix of a graph is called a Fiedler vector. The coefficients of a Fiedler vector are used to partition the vertices of a given graph into two clusters. A vertex is classified as unassociated if its Fiedler coefficient is close to zero relative to the largest Fiedler coefficient of the graph. We propose a vertex-weighted spectral clustering algorithm that incorporates a vector of weights for each vertex of a given graph to form a vertex-weighted graph. The proposed algorithm predicts the association of equidistant or nearly equidistant data points with one of the two clusters, whereas unweighted clustering provides no such association. Finally, we implemented both the unweighted and the vertex-weighted spectral clustering algorithms on several data sets to show that the proposed algorithm works in general.
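For the unweighted case described above, a minimal Python sketch of the Fiedler-vector bipartition might look as follows; the `tol` cutoff used to flag near-zero coefficients as unassociated is an illustrative assumption, not the thesis's exact criterion.

```python
import numpy as np

def fiedler_bipartition(adjacency, tol=0.05):
    """Partition a graph into two clusters using the Fiedler vector.

    adjacency: symmetric (n, n) weight matrix of the graph.
    tol: vertices whose Fiedler coefficient is within tol * max|coefficient|
         of zero are reported as 'unassociated' (illustrative threshold).
    """
    degrees = adjacency.sum(axis=1)
    laplacian = np.diag(degrees) - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                        # eigenvector of the 2nd smallest eigenvalue
    labels = np.where(fiedler >= 0, 0, 1)          # sign-based two-way split
    unassociated = np.abs(fiedler) < tol * np.abs(fiedler).max()
    return labels, unassociated
```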
12

Webová aplikace doporučovacího systému / Web Application of Recommender System

Koníček, Igor January 2015 (has links)
This master's thesis describes the creation of a recommender system deployed on the real-world server cbdb.cz. A fully operational recommender system was developed using collaborative and content-based filtering techniques. Thanks to extensive user feedback, we were able to evaluate users' opinions; many of the recommended books were marked as desirable. The thesis extends the current functionality of cbdb.cz with a recommender system that draws on the site's extensive database of ratings, users, and books.
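The abstract does not detail the implementation; as a hedged sketch of the collaborative-filtering side of such a system, user-based recommendation over a ratings matrix could look like this (function and parameter names are illustrative):

```python
import numpy as np

def recommend_user_based(ratings, user, k=5, n_items=3):
    """User-based collaborative filtering on a dense (users x books) ratings
    matrix, with 0 meaning 'not rated'. Returns indices of recommended books."""
    target = ratings[user]
    norms = np.linalg.norm(ratings, axis=1) * np.linalg.norm(target) + 1e-9
    sims = ratings @ target / norms                 # cosine similarity to every user
    sims[user] = -1.0                               # exclude the user themself
    neighbours = np.argsort(sims)[-k:]              # k most similar users
    scores = sims[neighbours] @ ratings[neighbours] # similarity-weighted ratings
    scores = scores.astype(float)
    scores[target > 0] = -np.inf                    # do not re-recommend rated books
    return np.argsort(scores)[-n_items:][::-1]
```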
13

Webová aplikace doporučovacího systému / Web Application of Recommender System

Hlaváček, Pavel January 2013 (has links)
This thesis deals with recommender systems and their use in web applications. It summarizes three main data mining techniques and the individual approaches to recommendation. The main part of the thesis is the design and implementation of a web application for recommending dishes from restaurants. The recommendation algorithm, designed and implemented in this work, addresses the problem of frequently changing items. It uses a hybrid filtering technique based on content and knowledge, with cosine vector similarity as its core computation.
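A minimal sketch of the content-based cosine-similarity step, assuming dishes are represented by numeric feature vectors (the feature representation here is a hypothetical choice, not the thesis's actual one):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rank_dishes(profile, dish_features):
    """Rank dishes by similarity between a user's taste profile and each dish's
    content features (e.g. cuisine, ingredients, price band)."""
    scores = [cosine_similarity(profile, f) for f in dish_features]
    return np.argsort(scores)[::-1]   # most similar dishes first
```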
14

Automatická detekce témat, segmentace a vizualizace on-line kurzů / Automatic Topic Detection, Segmentation and Visualization of On-Line Courses

Řídký, Josef January 2016 (has links)
The aim of this work is to create a web application for automatic topic detection and segmentation of on-line courses. During playback of a processed recording, the application should be able to offer recordings from thematically consistent on-line courses. This document contains the problem description, a list of the tools used, a description of the implementation, the principle of operation, and a description of the final user interface.
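The abstract does not specify the topic-segmentation algorithm; as an illustrative assumption, a TextTiling-style segmentation of a course transcript based on the lexical similarity of adjacent sentence windows could look like this:

```python
import numpy as np
from collections import Counter

def segment_transcript(sentences, window=5, depth=0.3):
    """TextTiling-style segmentation: compare adjacent windows of sentences with
    cosine similarity over word counts and cut where the similarity dips."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}

    def vectorize(block):
        counts = Counter(w for s in block for w in s.lower().split())
        v = np.zeros(len(vocab))
        for w, c in counts.items():
            v[index[w]] = c
        return v

    sims = []
    for i in range(window, len(sentences) - window):
        left = vectorize(sentences[i - window:i])
        right = vectorize(sentences[i:i + window])
        denom = np.linalg.norm(left) * np.linalg.norm(right) + 1e-9
        sims.append(left @ right / denom)
    sims = np.array(sims)

    # topic boundaries where similarity drops sharply below both neighbours
    return [i + window for i in range(1, len(sims) - 1)
            if sims[i] < sims[i - 1] - depth and sims[i] < sims[i + 1] - depth]
```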
15

Text and Speech Alignment Methods for Speech Translation Corpora Creation : Augmenting English LibriVox Recordings with Italian Textual Translations

Della Corte, Giuseppe January 2020 (has links)
The recent rise of end-to-end speech translation models requires a new generation of parallel corpora, composed of a large number of source-language speech utterances aligned with their target-language textual translations. We hereby show a pipeline and a set of methods to collect hundreds of hours of English audio-book recordings and align them with their Italian textual translations, using exclusively public domain resources gathered semi-automatically from the web. The pipeline consists of three main areas: text collection, bilingual text alignment, and forced alignment. For the text collection task, we show how to automatically find e-book titles in a target language by using machine translation, web information retrieval, and named entity recognition and translation techniques. For the bilingual text alignment task, we investigated three methods: the Gale–Church algorithm in conjunction with a small hand-crafted bilingual dictionary, the Gale–Church algorithm in conjunction with a larger bilingual dictionary automatically inferred through statistical machine translation, and bilingual text alignment by computing the vector similarity of multilingual embeddings of concatenations of consecutive sentences. Our findings seem to indicate that the consecutive-sentence-embedding similarity approach improves the alignment of difficult sentences by indirectly performing sentence re-segmentation. For the forced alignment task, we give a theoretical overview of the preferred method depending on the properties of the text to be aligned with the audio, suggesting and using a TTS-DTW (text-to-speech and dynamic time warping) based approach in our pipeline. The result of our experiments is a publicly available multi-modal corpus composed of about 130 hours of English speech aligned with its Italian textual translation and split into 60,561 triplets of English audio, English transcript, and Italian textual translation. We also post-processed the corpus to extract 40-MFCC features from the audio segments and released them as a dataset.
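The TTS-DTW step aligns synthesized speech with the recorded audio; a minimal dynamic time warping sketch over generic feature sequences (for example MFCC frames, whose extraction is omitted here) might look like the following. This is a plain textbook DTW, not the thesis's exact implementation.

```python
import numpy as np

def dtw_path(ref, query):
    """Dynamic time warping between two feature sequences of shapes (n, d) and
    (m, d). Returns the accumulated-cost matrix and the optimal warping path."""
    n, m = len(ref), len(query)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(ref[i - 1] - query[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])

    # backtrack from (n, m) to recover the frame-to-frame alignment
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[1:, 1:], path[::-1]
```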
16

Stratified-medium sound speed profiling for CPWC ultrasound imaging

D'Souza, Derrell 13 July 2020 (has links)
Coherent plane-wave compounding (CPWC) ultrasound is an important modality enabling ultrafast biomedical imaging. To perform CPWC image reconstruction for a stratified (horizontally layered) medium, one needs to know how the speed of sound (SOS) varies with the propagation depth. Incorrect sound speed and layer thickness assumptions can cause focusing errors, degraded spatial resolution, and significant geometrical distortions, resulting in poor image reconstruction. We aim to determine the speed of sound and thickness values for each horizontal layer so as to accurately place the recorded reflection events at their true locations within the medium. Our CPWC image reconstruction process is based on phase-shift migration (PSM), which requires the user to specify the speed of sound and thickness of each layer in advance. Prior to performing phase-shift migration (one layer at a time, starting from the surface), we first estimate the speed of sound value of a given layer using a cosine similarity metric, based on the data obtained by a multi-element transducer array for two different plane-wave emission angles. Then, we use our speed estimate to identify the layer thickness via end-of-layer boundary detection. A low-cost alternative that obtains reconstructed images with fewer phase shifts (i.e., fewer complex multiplications) using a spectral energy threshold is also proposed in this thesis. Our evaluation results, based on the CPWC imaging simulation of a three-layer medium, show that our sound speed and layer thickness estimates are within 4% of their true values (i.e., those used to generate the simulated data). We have also confirmed the accuracy of our speed and layer thickness estimation separately, using two experimental datasets representing two special cases. For speed estimation, we used a CPWC imaging dataset for a constant-speed (i.e., single-layer) medium, yielding estimates within 1% of their true values. For layer thickness estimation, we used a monostatic (i.e., single-element) synthetic-aperture (SA) imaging dataset of the three-layer medium, also yielding estimates within 1% of their true values. Our evaluation results for the low-cost alternative showed a 93% reduction in complex multiplications for the three-layer CPWC imaging dataset and 76% for the three-layer monostatic SA imaging dataset, producing images nearly identical to those obtained using the original PSM methods.
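A heavily simplified sketch of the speed-selection idea: reconstruct the current layer from two plane-wave emission angles at each candidate speed and keep the speed that maximizes the cosine similarity of the two results. The `migrate` routine and its signature are placeholders standing in for the phase-shift migration step, not the thesis's implementation.

```python
import numpy as np

def estimate_layer_speed(data_a, data_b, migrate, candidate_speeds):
    """Pick the sound speed whose two migrated images (from two plane-wave
    emission angles) agree best under a cosine similarity metric."""
    best_speed, best_score = None, -np.inf
    for c in candidate_speeds:
        img_a = migrate(data_a, speed=c).ravel()   # placeholder migration routine
        img_b = migrate(data_b, speed=c).ravel()
        score = img_a @ img_b / (np.linalg.norm(img_a) * np.linalg.norm(img_b) + 1e-9)
        if score > best_score:
            best_speed, best_score = c, score
    return best_speed
```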
17

PDF document search within a very large database

Wang, Lizhong January 2017 (has links)
A digital search engine, which takes a search request from the user and returns a corresponding result, is indispensable for modern Internet users. At the same time, the PDF document format has become widely accepted and used due to its convenience and effectiveness, and the traditional library has already started to be replaced by the digital one. Combining these two factors, a document-based search engine that can query a digital document database with an input file is urgently needed. This thesis is a software development project that aims to design and implement a prototype of such a search engine and to propose potential optimization methods for Loredge. The research is divided into two parts: prototype development and optimization analysis. It involves an analysis of sample documents provided by Loredge and a multi-perspective performance analysis. The prototype comprises reading, preprocessing, and similarity measurement. The reading stage loads a PDF file using the Java library Apache PDFBox. The preprocessing stage processes the loaded document and generates a document fingerprint. The similarity measurement is the final stage, which compares the input fingerprint with all document fingerprints in the database. The optimization analysis balances resource consumption in terms of response time, accuracy, and memory usage. According to the performance analysis, the shorter the document fingerprint, the better the search program performs. Moreover, a permanent feature database and a similarity-based filtration mechanism are proposed to further optimize the program. The project lays a solid foundation for further study of document-based search engines by providing a feasible prototype and sufficient relevant experimental data. It concludes that follow-up work should mainly focus on improving the effectiveness of database access, which involves data entry labeling and search algorithm optimization.
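The prototype itself is written in Java with Apache PDFBox; purely as an illustration of the fingerprint-and-compare stage (the actual fingerprint format used by the prototype is not specified here), a hashed bag-of-words fingerprint with cosine matching could look like this:

```python
import hashlib
import numpy as np

def fingerprint(text, dims=256):
    """Hashed bag-of-words fingerprint: each term increments one of `dims`
    buckets chosen by a hash of the term (an illustrative fingerprint form)."""
    v = np.zeros(dims)
    for term in text.lower().split():
        bucket = int(hashlib.md5(term.encode()).hexdigest(), 16) % dims
        v[bucket] += 1.0
    return v

def best_match(query_fp, db_fps):
    """Return the index of the stored fingerprint most similar to the query."""
    sims = [query_fp @ fp / (np.linalg.norm(query_fp) * np.linalg.norm(fp) + 1e-9)
            for fp in db_fps]
    return int(np.argmax(sims))
```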
18

Nuoxus - um modelo de caching proativo de conteúdo multimídia para Fog Radio Access Networks (F-RANs) / Nuoxus - A Proactive Multimedia Content Caching Model for Fog Radio Access Networks (F-RANs)

Costa, Felipe Rabuske 28 February 2018 (has links)
It is estimated that by the year 2020, about 50 billion mobile devices will be connected to wireless networks and that 78% of the data traffic generated by such devices will be multimedia content. These estimates foster the development of the fifth generation of mobile networks (5G). One of the most recently proposed architectures, named Fog Radio Access Networks (F-RAN), gives the components located at the edge of the network the processing power and storage capacity to address network activities. One of the main problems of this architecture is the intense data traffic on its centralized communication channel, named the fronthaul, which is used to connect the antennas (F-APs) to the external network.
Given this context, we propose Nuoxus, a multimedia content caching model for F-RANs that aims to mitigate this problem. By storing content in the nodes closest to the user, the number of concurrent accesses to the fronthaul is reduced, which decreases the communication latency of the network. Nuoxus can run on any network node that has storage and processing capacity, becoming responsible for managing the cache of that node. Its content replacement policy uses the similarity of requests between the child nodes and the rest of the network as a factor to decide the relevance of storing the requested content in the cache. Furthermore, using this same process, Nuoxus proactively suggests that child nodes with a high degree of similarity cache the content, assuming they will access it at a future time. The state-of-the-art analysis shows that there is no other work that explores the history of requests to cache content in multi-layer architectures for wireless networks in a proactive manner, without using a centralized component for coordination and prediction of caching. To demonstrate the efficiency of the model, a prototype was developed using the ns-3 simulator. The results obtained show that Nuoxus reduced network latency by about 29.75%. In addition, when compared with other caching strategies, the number of cache hits on the network components increased by 53.16% relative to the strategy that obtained the second-best result.
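A hedged sketch of the similarity-driven suggestion step, assuming each parent keeps a per-child request-count vector over the content catalogue (the data structures and the similarity threshold are illustrative assumptions, not Nuoxus's actual design):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def proactive_cache_targets(request_counts, requester, threshold=0.8):
    """request_counts: (n_children, n_contents) request-history matrix kept by a
    parent node. When one child requests a content item, the same item is also
    suggested for caching to every other child whose history is similar enough."""
    return [child for child in range(len(request_counts))
            if child != requester
            and cosine(request_counts[child], request_counts[requester]) >= threshold]
```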
19

An Efficient Classification Model for Analyzing Skewed Data to Detect Frauds in the Financial Sector / Un modèle de classification efficace pour l'analyse des données déséquilibrées pour détecter les fraudes dans le secteur financier

Makki, Sara 16 December 2019 (has links)
There are different types of risk in the financial domain, such as terrorist financing, money laundering, credit card fraud, and insurance fraud, that may result in catastrophic consequences for entities such as banks or insurance companies. These financial risks are usually detected using classification algorithms. In classification problems, the skewed distribution of classes, also known as class imbalance, is a very common challenge in financial fraud detection; special data mining approaches are used along with traditional classification algorithms to tackle this issue. The class imbalance problem occurs when one of the classes has many more instances than the other, and it becomes even more acute in a big data context: the datasets used to build and train the models contain an extremely small portion of the minority group, known as the positives, compared with the majority class, known as the negatives. In most cases, it is more delicate and crucial to correctly classify the minority group, as in fraud detection or disease diagnosis: fraud and disease are the minority groups, and missing a fraud record has far more dangerous consequences than misclassifying a normal one. These class proportions make it very difficult for a machine learning classifier to learn the characteristics and patterns of the minority group; classifiers are biased towards the majority group because of its many examples in the dataset and learn to classify it much faster than the other group. After a thorough study of the challenges faced in class imbalance cases, we found that we still cannot reach an acceptable sensitivity (i.e., good classification of the minority group) without a significant decrease in accuracy. This leads to another challenge, the choice of performance measures used to evaluate models: in these cases the choice is not straightforward, and accuracy or sensitivity alone are misleading, so we use other measures such as the precision-recall curve or the F1-score to evaluate the trade-off between accuracy and sensitivity. Our objective is to build an imbalanced classification model that accounts for extreme class imbalance and false alarms in a big data framework. We developed two approaches: a Cost-Sensitive Cosine Similarity K-Nearest Neighbor (CoSKNN) single classifier, and a K-modes Imbalance Classification Hybrid Approach (K-MICHA) as an ensemble learning methodology. In CoSKNN, our aim is to tackle the imbalance problem by using cosine similarity as a distance metric and by introducing a cost-sensitive score for classification with the KNN algorithm. We conducted a comparative validation experiment in which we prove the effectiveness of CoSKNN in terms of accuracy and fraud detection. The aim of K-MICHA, on the other hand, is to cluster data points that are similar in terms of the classifiers' outputs, and then to calculate the fraud probabilities in the obtained clusters in order to use them for detecting fraud in new transactions.
This approach can be used for the detection of any type of financial fraud where labelled data are available. Finally, we applied K-MICHA to credit card, mobile payment, and auto insurance fraud datasets. In all three case studies, we compare K-MICHA with stacking using voting, weighted voting, logistic regression, and CART, and also with AdaBoost and random forest. We prove the efficiency of K-MICHA based on these experiments. We also applied K-MICHA in a big data framework using H2O and R, which allowed us to process and analyze larger datasets in very little time.
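A rough sketch of the CoSKNN idea under stated assumptions: cosine similarity replaces the usual distance in KNN, and fraud neighbours are up-weighted by a cost factor. The exact cost-sensitive score of the thesis is not reproduced; the weighting below is illustrative only.

```python
import numpy as np

def cosknn_predict(X_train, y_train, x, k=5, fraud_cost=5.0):
    """Cosine-similarity KNN with a cost-sensitive vote: similarities of
    minority-class (fraud, label 1) neighbours are up-weighted by `fraud_cost`
    (an illustrative weighting, not the thesis's exact score)."""
    norms = np.linalg.norm(X_train, axis=1) * np.linalg.norm(x) + 1e-9
    sims = X_train @ x / norms               # cosine similarity to each training point
    neighbours = np.argsort(sims)[-k:]       # k most similar training points
    is_fraud = y_train[neighbours] == 1
    fraud_score = fraud_cost * np.sum(sims[neighbours][is_fraud])
    normal_score = np.sum(sims[neighbours][~is_fraud])
    return int(fraud_score > normal_score)   # 1 = predicted fraud
```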
20

Algoritmy pro detekci anomálií v datech z klinických studií a zdravotnických registrů / Algorithms for anomaly detection in data from clinical trials and health registries

Bondarenko, Maxim January 2018 (has links)
This master's thesis deals with anomaly detection in data from clinical trials and medical registries. The purpose of the work is to review the literature on data quality in clinical trials and to design a custom algorithm, based on machine learning methods, for detecting anomalous records in real clinical data from current or completed clinical trials and medical registries. The practical part describes the implemented detection algorithm, which consists of several parts: importing data from the information system, preprocessing and transforming the imported records, whose variables have different data types, into numerical vectors, applying well-known statistical methods for outlier detection, and evaluating the quality and accuracy of the algorithm. The result is a vector of parameters containing anomalies, which is intended to make the work of the data manager easier. The algorithm is designed to extend the set of functions of the information system (CLADE-IS) with automatic monitoring of data quality by detecting anomalous records.
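The abstract names only "well-known statistical methods" for the outlier step; as an illustrative assumption, a robust z-score check over the numerical record vectors could look like this:

```python
import numpy as np

def flag_anomalous_records(records, threshold=3.5):
    """Flag records whose robust z-score (based on median and MAD) exceeds the
    threshold in any variable. records: (n_records, n_variables) numeric matrix."""
    median = np.median(records, axis=0)
    mad = np.median(np.abs(records - median), axis=0) + 1e-9   # median absolute deviation
    robust_z = 0.6745 * (records - median) / mad               # standard MAD scaling
    return np.any(np.abs(robust_z) > threshold, axis=1)        # True = anomalous record
```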
