Global ETD Search

31	Identifying Early Usage Patterns That Increase User Retention Rates In A Mobile Web Browser / Att identifiera tidiga användarmönster som ökar användares återvändningsfrekvens Persson, Pontus January 2017 (has links) One of the major challenges for modern technology companies is user retention management. This work focuses on identifying early usage patterns that signify increased retention rates in a mobile web browser. This is done using a targeted parallel implementation of the association rule mining algorithm FP-Growth. Different item subset selection techniques, including clustering and other statistical methods, have been used in order to reduce the mining time and allow for lower support thresholds. Many interesting rules have been mined. The best retention-wise rule implies a retention rate of 99.5%. The majority of the rules analyzed in this work imply a retention rate increase between 150% and 200%. data mining retention churn association analysis association rule clustering DBSCAN Other Computer and Information Science Annan data- och informationsvetenskap
32	Automated error matching system using machine learning and data clustering : Evaluating unsupervised learning methods for categorizing error types, capturing bugs, and detecting outliers. Bjurenfalk, Jonatan, Johnson, August January 2021 (has links) For large and complex software systems, manually inspecting the error logs produced by their test suites is a time-consuming process. Whether the goal is identifying abnormal faults or finding bugs, it is a process that limits development progress and requires experience. An automated solution for such processes could potentially lead to efficient fault identification and bug reporting, while also enabling developers to spend more time on improving system functionality. Three unsupervised clustering algorithms are evaluated for the task: HDBSCAN, DBSCAN, and X-Means. In addition, HDBSCAN, DBSCAN and an LSTM-based autoencoder are evaluated for outlier detection. The dataset consists of error logs produced by a robotic test system. These logs are cleaned and pre-processed using stopword removal, stemming, term frequency-inverse document frequency (tf-idf) and singular value decomposition (SVD). Two domain experts are tasked with evaluating the results produced by clustering and outlier detection. Results indicate that X-Means outperforms the other clustering algorithms when tasked with automatically categorizing error types and capturing bugs. Furthermore, none of the outlier detection methods yielded sufficient results. However, it was found that X-Means clusters containing a single data point gave an accurate representation of the outliers occurring in the error log dataset. In conclusion, the domain experts deemed X-Means to be a helpful tool for categorizing error types, capturing bugs, and detecting outliers. Unsupervised learning machine learning clustering DBSCAN HDBSCAN X-Means outlier detection error log clustering Computer Sciences Datavetenskap (datalogi)
33	Analýza dat na sociálních sítích s využitím dolování dat / Analysis of Data on Social Networks Based on Data Mining Fešar, Marek January 2014 (has links) The thesis presents general principles of data mining and also focuses on the specific needs of social networks. Certain social networks, chosen with respect to popularity and availability to Czech users, are discussed from various points of view. The benefits and drawbacks of each are also mentioned. Afterwards, one suitable API is selected for further analysis. The project explains harvesting data via the Twitter API and the process of mining data from this particular network. The design of a mining algorithm inspired by density-based clustering methods is described. The implementation is explained in its own chapter, preceded by a thorough explanation of the MVC architectural pattern. Finally, some examples of how the gathered knowledge can be used are shown, as well as possible future extensions.
34	Nástroj pro shlukovou analýzu / Cluster Analysis Tool Hezoučký, Ladislav January 2010 (has links) The master's thesis deals with cluster analysis of data. It explains basic concepts and methods from this domain. The result of the thesis is a cluster analysis tool that implements the K-Medoids and DBSCAN methods. Adjusted results on real data are compared with those from the programs Rapid Miner and SAS Enterprise Miner.
35	Clustering and Summarization of Chat Dialogues : To understand a company’s customer base / Klustring och Summering av Chatt-Dialoger Hidén, Oskar, Björelind, David January 2021 (has links) The Customer Success department at Visma handles about 200 000 customer chats each year; the chat dialogues are stored and contain both questions and answers. In order to get an idea of what customers ask about, the Customer Success department has to read a random sample of the chat dialogues manually. This thesis develops and investigates an analysis tool for the chat data, using the approach of clustering and summarization. The approach aims to decrease the time spent on the analysis and increase its quality. Models for clustering (K-means, DBSCAN and HDBSCAN) and extractive summarization (K-means, LSA and TextRank) are compared. Each algorithm is combined with three different text representations (TFIDF, S-BERT and FastText) to create models for evaluation. These models are evaluated against a test set created for the purpose of this thesis. Silhouette Index and Adjusted Rand Index are used to evaluate the clustering models. The ROUGE measure together with a qualitative evaluation is used to evaluate the extractive summarization models. In addition, the best clustering model is further evaluated to understand how different data sizes impact performance. TFIDF Unigram together with HDBSCAN or K-means obtained the best results for clustering, whereas FastText together with TextRank obtained the best results for extractive summarization. This thesis applies known models to a textual domain of customer chat dialogues, something that, to our knowledge, has not previously been done in the literature. Machine Learning NLP Text Representations Clustering Extractive summarization TFIDF S-BERT FastText K-means DBSCAN HDBSCAN LSA TextRank Word Mover's Distance (WMD) Computer Engineering Datorteknik
36	Guardrail detection for landmark-based localization Gumaelius, Nils January 2022 (has links) A requirement for safe autonomous driving is to have an accurate global localization of the ego vehicle. Methods based on Global Navigation Satellite System (GNSS) are the most common but are not precise enough in areas without good satellite signals. Instead, methods like landmark-based localization (LBL) can be used. In LBL, sensors onboard the vehicle detect landmarks near the vehicle. With these detections, the vehicle’s position is deduced by looking up matching landmarks on a high-definition map. Commonly found along roads, stretching for long distances, guardrails are a great landmark that can be used for LBL. In this thesis, two different methods are proposed to detect and vectorize guardrails from vehicle sensor data to enable future map matching for LBL. The first method uses semantically labeled LiDAR data with pre-classified guardrail LiDAR points as input data. The method is based on the DBSCAN clustering algorithm to cluster and filter out false positives from the pre-classified LiDAR points. The second algorithm uses raw LiDAR data as input. The algorithm finds guardrail candidate points by segmenting high-density areas and matching these with thresholds taken from the geometry of guardrails. Similar to the first method, these are then clustered into guardrail clusters. The clusters are then vectorized into the wanted output of a 2D vector, corresponding to points inside the guardrail with a specific interval. To evaluate the performance of the proposed algorithms, simulations from real-life data are analyzed in both a quantitative and qualitative way. The qualitative experiments showcase that both methods perform well even in difficult scenarios. Timings of the simulations show that both methods are fast enough to be applicable in real-time use cases.
The defined performance measures show that the method using raw LiDAR data is more robust and manages to detect more and longer parts of the guardrails. computer vision LiDAR object detection localization landmark-based localization guardrail detection dbscan autonomous driving
37	[en] REAL-TIME RISKS DETERMINATION OF TRANSMISSION LINES OUTAGE BY LIGHTNINGS / [pt] DETERMINAÇÃO EM TEMPO REAL DOS RISCOS DE DESLIGAMENTOS EM LINHAS DE TRANSMISSÃO DEVIDO A DESCARGAS ATMOSFÉRICAS MARCELO CASCARDO CARDOSO 12 February 2019 (has links) [pt] As descargas atmosféricas são de grande importância para o setor elétrico, sendo frequentemente responsáveis por desligamentos de linhas de transmissão, que podem desencadear uma sequência de eventos que levem o sistema elétrico interligado ao colapso. As longas extensões de linhas de transmissão, expostas a intemperes climáticas, determinam uma probabilidade significativa de incidência direta de descargas atmosféricas nestes equipamentos. Devido ao caráter estratégico das linhas para o fornecimento de energia e a constatação de que descargas atmosféricas estão entre as principais causas de desligamentos, torna-se importante o estudo do comportamento das descargas atmosféricas, antes do instante da ocorrência do desligamento das linhas de transmissão, para compreender os padrões característicos potenciais causadores destes desligamentos. Os estudos encontrados atualmente estão orientados na eficiência das redes de detecção de descargas atmosféricas e na identificação de condições climáticas que indiquem a ocorrência de raios de forma preditiva, sem correlação a ocorrências em linhas de transmissão. Assim, essa dissertação consiste na determinação do risco de desligamentos de linhas de transmissão por descargas atmosféricas, visando fornecer informações antecipadas e possibilitar ações operativas para manter a segurança do sistema elétrico. O modelo desenvolvido nesse estudo, denominado Risco de Desligamentos de Linhas de Transmissão por Raios (RDLR), é composto de dois módulos principais, sendo o primeiro o agrupamento do conjunto amostral de descargas atmosféricas, realizado através de um método baseado em densidade. 
Nesse módulo, os ruídos são eliminados de forma eficiente e são formados grupos representativos de descargas atmosféricas. O segundo módulo consiste em uma etapa classificatória, baseado em redes neurais artificiais para identificar padrões de grupos de descargas que representem riscos de desligamentos de linhas de transmissão. Visando a otimização do modelo, foi aplicado um método de seleção das variáveis, através de componentes principais, para determinar aquelas que mais contribuem na caracterização desses eventos. O modelo RDLR foi testado com dados reais dos registros de desligamentos de linhas de transmissão, associado a outro banco com dados reais contendo milhões de registros de descargas atmosféricas oriundos das redes de detecção de raios, sendo obtidos excelentes resultados na determinação dos riscos de desligamentos de linhas de transmissão por descargas atmosféricas. / [en] Atmospheric discharges are of great importance to power systems and are often responsible for outages of transmission lines, which can trigger a sequence of events that leads to a system collapse. The long extensions of transmission lines, exposed to climatic conditions, create a significant probability of direct incidence of atmospheric discharges on this equipment. Due to the strategic nature of power supply lines and the fact that atmospheric discharges are among the main causes of outages, it is important to study the characteristics of atmospheric discharges before the failure of transmission lines and to understand the patterns that are responsible for interruptions. Current studies focus on the efficiency of lightning detection networks and on the identification of climatic conditions that indicate lightning occurrence in a predictive approach, without any correlation with transmission line outages. Therefore, this thesis consists of real-time risk determination of transmission line outages caused by lightning, providing early information to enable operational procedures for power system safety.
The proposed model, named Transmission Lines Outage Risk by Lightning (TLORL), is composed of two main modules: Atmospheric Discharge Data Clustering and Classification. In the atmospheric discharge data-clustering module, performed by a density-based method, the noise is efficiently eliminated and representative groups of atmospheric discharges are formed. The second module consists of a classification step, based on artificial neural networks, to identify patterns of discharge groups that represent a risk of causing transmission line outages. To improve the proposed model, principal component analysis (PCA) was applied to determine the input variables that contribute most to the characterization of these events. The TLORL model was tested with real transmission line outage records, combined with another database containing millions of lightning records from the detection networks, producing excellent results in determining the risk of transmission line outages caused by atmospheric discharges. [pt] APRENDIZADO DE MAQUINA [en] MACHINE LEARNING [pt] ANALISE DE COMPONENTES PRINCIPAIS [en] PRINCIPAL COMPONENT ANALYSIS [pt] BIG DATA [en] BIG DATA [pt] DBSCAN [en] DBSCAN [pt] PERTURBACOES NO SISTEMA ELETRICO [en] POWER SYSTEM DISTURBANCE [pt] INTERRUPCAO DE ENERGIA [en] POWER INTERRUPTION [pt] OPERACAO DO SISTEMA INTERLIGADO [en] POWER INTERRUPTION [pt] RAIOS [en] LIGHTNING [pt] DESCARGAS ATMOSFERICAS [en] ATMOSPHERIC DISCHARGES [en] TRANSMISSION LINES [pt] SISTEMAS INTELIGENTES [en] INTELLIGENT SYSTEMS [pt] REDES NEURAIS MLP [en] MLP
38	Использование машинного обучения для автоматической интерпретации данных из систем веб-аналитики : магистерская диссертация / Using machine learning to automatically interpret data from web analytics systems Цинцов, Н. В., Tsintsov, N. V. January 2023 (has links) В данной работе был разработан и реализован комплексный подход к анализу и интерпретации пользовательских данных, собранных в рамках системы веб-аналитики. Применяя методы машинного обучения и аналитики данных, были исследованы и выявлены ключевые события пользователей, влияющие на определенные бизнес-метрики. Начальные этапы проекта включали сбор и предварительную обработку данных, с последующей кластеризацией для выявления скрытых взаимосвязей и структур. Использовались или тестировались различные библиотеки для объяснимости работы моделей машинного обучении, такие как Eli5 и SHAP. Для решения задачи тестировались кластеризации, включая K-средних, DBSCAN, спектральную кластеризацию и OPTICS. В качестве алгоритмов применялась логистическая регрессия, случайны лес и CatBoost. Применялась нейронная сеть. Для определения значимости признаков использовались методы Permutation Importance, с применением моделей логистической регрессии, случайного леса и нейронной сети. Основным результатом стала разработка скрипта, осуществляющего автоматический сбор, обработку данных и определение наиболее значимых событий. Полученный инструментарий значительно облегчает задачу аналитиков, помогая определять ключевые аспекты поведения пользователей и строить более эффективные стратегии взаимодействия. Применение полученных результатов имеет высокий потенциал для улучшения бизнес–решений и оптимизации работы с пользовательской аудиторией. / In this work, an integrated approach to the analysis and interpretation of user data collected within the framework of a web analytics system was developed and implemented. 
Using machine learning and data analytics methods, key user events that impact certain business metrics were investigated and identified. The initial stages of the project included data collection and pre-processing, followed by clustering to identify hidden relationships and structures. Various libraries for explaining machine learning models, such as Eli5 and SHAP, were used or tested. Clustering methods including K-means, DBSCAN, spectral clustering, and OPTICS were tested to solve the problem. The algorithms used were logistic regression, random forest and CatBoost. A neural network was also used. To determine the significance of features, Permutation Importance methods were applied with logistic regression, random forest and neural network models. The main result was the development of a script that automatically collects and processes data and determines the most significant events. The resulting tools greatly facilitate the task of analysts, helping to identify key aspects of user behavior and build more effective interaction strategies. The application of the results obtained has high potential for improving business decisions and optimizing work with the user audience. СИСТЕМЫ ВЕБ-АНАЛИТИКИ БИЗНЕС-МЕТРИКИ ELI5 SHAP K-СРЕДНИХ DBSCAN OPTICS CATBOOST PERMUTATION IMPORTANCE СЛУЧАЙНЫЙ ЛЕС MASTER'S THESIS WEB ANALYTICS SYSTEMS BUSINESS METRICS ELI5 SHAP K-MEANS DBSCAN SPECTRAL CLUSTERING OPTICS CATBOOST PERMUTATION IMPORTANCE LOGISTIC REGRESSION RANDOM FOREST
39	Recomendação pedagógica para melhoria da aprendizagem em redações. / Pedagogical recommendation to improve learning in essays. SANTOS, Danilo Abreu. 02 May 2018 (has links) Previous issue date: 2015-08-24 / A modalidade de educação online tem crescido significativamente nas últimas décadas em todo o mundo, transformando-se em uma opção viável tanto àqueles que não dispõem de tempo para trabalhar a sua formação acadêmica na forma presencial quanto àqueles que desejam complementá-la. Há também os que buscam ingressar no ensino superior por meio do Exame Nacional do Ensino Médio (ENEM) e utilizam esta modalidade de ensino para complementar os estudos, objetivando sanar lacunas deixadas pela formação escolar. O ENEM é composto por questões objetivas (subdivididas em 4 grandes áreas: Linguagens e Códigos; Matemática; Ciências Humanas; e Ciências Naturais) e a questão subjetiva (redação). Segundo dados do Ministério da Educação (MEC), mais de 50% dos candidatos que fizeram a prova do ENEM em 2014 obtiveram desempenho abaixo de 500 pontos na redação. Esta pesquisa utilizará recomendações pedagógicas baseadas no gênero textual utilizado pelo ENEM, visando prover uma melhoria na escrita da redação dissertativa. Para tanto, foi utilizado, como ferramenta experimental, o ambiente online de aprendizagem MeuTutor.
O ambiente possui um módulo de escrita de redação, no qual é utilizada para correção dos textos elaborados pelos alunos, a metodologia de avaliação por pares, cujo pesquisas mostram que os resultados avaliativos são significativos e bastante similares aos obtidos por professores especialistas. Entretanto, apenas apresentar a pontuação da redação por si só, não garante a melhora da produção textual do aluno avaliado. Desta forma, visando um ganho em performance na produção da redação, foi adicionado ao MeuTutor um módulo de recomendação pedagógica baseado em 19 perfis resultados do uso de algoritmos de mineração de dados (DBScan e Kmeans) nos microdados do ENEM 2012 disponibilizado pelo MEC. Estes perfis foram agrupados em 6 blocos que possuíam um conjunto de tarefas nas áreas de escrita, gramática e coerências e concordância textual. A validação destas recomendações foi feita em um experimento de 3 ciclos, onde em cada ciclo o aluno: escreve a redação; avalia os seus pares; realiza a recomendação pedagógica que foi recebida. A partir da análise estatística destes dados, foi possível constatar que o modelo estratégico de recomendação utilizado nesta pesquisa, possibilitou um ganho mensurável na qualidade da produção textual. / Online education has grown significantly in recent years throughout the world, becoming a viable option for those who don’t have the time to pursue traditional technical training or an academic degree. In Brazil, people seek to enter higher education through the National Secondary Education Examination (ENEM) and use online education to complement their studies, aiming to remedy gaps in their schooling. The ENEM consists of objective questions (divided into 4 main areas: Languages and Codes; Mathematics; Social Sciences; and Natural Sciences) and a subjective question (the essay).
According to the Brazilian Ministry of Education (MEC), more than 50% of the candidates who took the ENEM in 2014 scored below 500 points (out of a maximum of 1000) on their essays. This research uses educational recommendations based on the five official correction criteria for the ENEM essays to improve writing. To that end, it used an online learning environment called MeuTutor as an experimental tool. This learning environment has an essay writing and correction module. The correction module uses peer evaluation techniques, for which research shows that the results are significant and quite similar to those obtained from correction by specialists. However, simply displaying the scores for the criteria does not guarantee an improvement in students’ writing. Thus, to promote that improvement, an educational recommendation module was added to MeuTutor. It is based on 19 profiles obtained by mining data from the 2012 ENEM with the DBSCAN and K-Means algorithms. The profiles were grouped into six blocks, each associated with a set of tasks in the areas of writing, grammar, and textual coherence and agreement. The validation of these recommendations was carried out in an experiment with three cycles, in each of which students: (1) write the essay; (2) evaluate their peers; (3) perform the pedagogical recommendations received. From the statistical analysis of these data, it was found that the strategic recommendation model used in this study enabled a measurable gain in the quality of textual production. Ciências Exatas e da Terra. Redação Recomendação Pedagógica Education Data Mining Ambiente Virtual de Aprendizagem Avaliação por pares Redação dissertativa Meu Tutor - Ambiente Online Algoritmos de mineração de dados DBScan e Kmeans Análise estatística de dados
40	[en] TIME SERIES ANALYSIS USING SINGULAR SPECTRUM ANALYSIS (SSA) AND BASED DENSITY CLUSTERING OF THE COMPONENTS / [pt] ANÁLISE DE SÉRIES TEMPORAIS USANDO ANÁLISE ESPECTRAL SINGULAR (SSA) E CLUSTERIZAÇÃO DE SUAS COMPONENTES BASEADA EM DENSIDADE KEILA MARA CASSIANO 19 June 2015 (has links) [pt] Esta tese propõe a utilização do DBSCAN (Density Based Spatial Clustering of Applications with Noise) para separar os componentes de ruído na fase de agrupamento das autotriplas da Análise Singular Espectral (SSA) de Séries Temporais. O DBSCAN é um método moderno de clusterização (revisto em 2013) e especialista em identificar ruído através de regiões de menor densidade. O método de agrupamento hierárquico até então é a última inovação na separação de ruído na abordagem SSA, implementado no pacote R- SSA. No entanto, o método de agrupamento hierárquico é muito sensível a ruído, não é capaz de separá-lo corretamente, não deve ser usado em conjuntos com diferentes densidades e não funciona bem no agrupamento de séries temporais de diferentes tendências, ao contrário dos métodos de aglomeração à base de densidade que são eficazes para separar o ruído a partir dos dados e dedicados para trabalhar bem em dados a partir de diferentes densidades. Este trabalho mostra uma melhor eficiência de DBSCAN sobre os outros métodos já utilizados nesta etapa do SSA, garantindo considerável redução de ruídos e proporcionando melhores previsões. O resultado é apoiado por avaliações experimentais realizadas para séries simuladas de modelos estacionários e não estacionários. A combinação de metodologias proposta também foi aplicada com sucesso na previsão de uma série real de velocidade do vento. / [en] This thesis proposes using DBSCAN (Density Based Spatial Clustering of Applications with Noise) to separate the noise components of eigentriples in the grouping stage of the Singular Spectrum Analysis (SSA) of Time Series. 
DBSCAN is a modern clustering method (revised in 2013) that specializes in identifying noise through regions of lower density. The hierarchical clustering method was until now the latest innovation in noise separation in the SSA approach, implemented in the R-SSA package. However, the literature repeatedly notes that the hierarchical clustering method is very sensitive to noise, is unable to separate it correctly, should not be used on clusters with varying densities, and does not work well when clustering time series with different trends. In contrast, density-based clustering methods are effective in separating noise from the data and are designed to work well on data with different densities. This work shows that DBSCAN is more efficient than the other methods already used in this stage of SSA, because it allows a considerable reduction of noise and provides better forecasts. The result is supported by experimental evaluations carried out on simulated stationary and non-stationary series. The proposed combination of methodologies was also applied successfully to forecasting a real wind speed series. [pt] MINERACAO DE DADOS [en] DATA MINING [pt] SERIES TEMPORAIS [en] TIME SERIES [pt] PREVISAO [en] FORECASTING [pt] ENERGIA EOLICA [en] WIND ENERGY [pt] MODELOS ARIMA [pt] ANALISE SINGULAR ESPECTRAL [pt] CLUSTERIZACAO BASEADA EM DENSIDADE [pt] DBSCAN [pt] PREVISAO SSA
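
All of the entries above are indexed under DBSCAN, and several of them rely on it specifically to set noise aside (lightning records, LiDAR false positives, SSA components). As a rough illustration of that idea, here is a minimal pure-Python sketch of the textbook algorithm. It is not taken from any of the listed theses; the function names, the toy points, and the `eps` and `min_pts` values are assumptions chosen only for the example.

```python
from math import dist

NOISE = -1  # label given to points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point; it may still be claimed later as a border point.
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Toy data (hypothetical): two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

With these assumed parameters the two squares come out as clusters 0 and 1 and the isolated point is labelled as noise. The brute-force neighbor search is quadratic in the number of points, so a real application on millions of records would need a spatial index or a library implementation.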

Search results