1 |
G2P-DBSCAN: Estratégia de Particionamento de Dados e de Processamento Distribuído do DBSCAN com MapReduce. / G2P-DBSCAN: Data Partitioning Strategy and Distributed Processing of DBSCAN with MapReduce. Araújo Neto, Antônio Cavalcante. January 2016 (has links)
ARAÚJO NETO, Antônio Cavalcante. G2P-DBSCAN: Estratégia de Particionamento de Dados e de Processamento Distribuído do DBSCAN com MapReduce. 2016. 63 f. Dissertação (Mestrado em Ciência da Computação) - Universidade Federal do Ceará, Fortaleza-CE, 2016.
Clustering is a data mining technique that groups the elements of a data set so that elements in the same group are more similar to each other than to elements in other groups. This thesis studies the problem of running the density-based clustering algorithm DBSCAN in a distributed fashion using the MapReduce paradigm. In distributed processing it is important that the partitions to be processed have approximately equal sizes, since the total processing time is bounded by the time the node holding the largest amount of data takes to finish computing the data assigned to it. For this reason we also propose a data partitioning strategy, called G2P, which seeks to distribute the data set across partitions in a balanced manner and which takes the characteristics of the DBSCAN algorithm into account. More specifically, G2P uses grid and graph structures to help divide the space along low-density regions. The distributed processing of DBSCAN itself consists of two MapReduce phases and an intermediate phase that identifies clusters that may have been split across more than one partition, called merge candidates. The first MapReduce phase applies DBSCAN to each data partition individually, and the second verifies and, if necessary, corrects the merge-candidate clusters. Experiments on real data sets show that G2P-DBSCAN outperforms the baseline in all scenarios considered, both in running time and in the quality of the partitions obtained.
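A greatly simplified, single-machine sketch of the two-phase scheme described above is given below; a plain split of the x-axis into strips with an eps-wide halo stands in for the G2P grid-and-graph partitioner, and the merge step simply unions clusters that share halo points, so this illustrates the bookkeeping rather than the dissertation's implementation.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def distributed_dbscan_sketch(X, eps=0.3, min_samples=5, n_partitions=4):
        """Phase 1: DBSCAN per partition (x-axis strips padded by eps).
        Intermediate step: clusters that share a point in the padded border are merge candidates.
        Phase 2: union the merge candidates into global clusters."""
        edges = np.linspace(X[:, 0].min(), X[:, 0].max(), n_partitions + 1)
        parent = {}        # union-find over (partition, local_label) keys
        point_keys = {}    # point index -> local cluster keys it received

        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a

        def union(a, b):
            parent[find(a)] = find(b)

        for p in range(n_partitions):
            lo, hi = edges[p] - eps, edges[p + 1] + eps          # eps-wide halo on both sides
            idx = np.where((X[:, 0] >= lo) & (X[:, 0] <= hi))[0]
            if len(idx) < min_samples:
                continue
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])
            for i, lab in zip(idx, labels):
                if lab == -1:                                    # local noise
                    continue
                key = (p, lab)
                parent.setdefault(key, key)
                point_keys.setdefault(i, []).append(key)

        for keys in point_keys.values():     # a halo point clustered in two partitions
            for k in keys[1:]:               # ties those local clusters together
                union(keys[0], k)

        final = np.full(len(X), -1)          # -1 = noise in every partition
        global_ids = {}
        for i, keys in point_keys.items():
            root = find(keys[0])
            final[i] = global_ids.setdefault(root, len(global_ids))
        return final

In the dissertation the two phases are MapReduce jobs and the partitioning is produced by G2P; here everything runs in a single process purely to make the merge-candidate handling concrete.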
|
2 |
G2P-DBSCAN: Data Partitioning Strategy and Distributed Processing of DBSCAN with MapReduce. / G2P-DBSCAN: Estratégia de Particionamento de Dados e de Processamento Distribuído do DBSCAN com MapReduce. Antônio Cavalcante Araújo Neto. 17 August 2015 (has links)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
|
3 |
Identifiering av områden med förhöjd olycksrisk för cyklister baserad på cykelhjälmsdata / Identification of areas with elevated accident risk for cyclists based on bicycle helmet data. Roos, Johannes, Lindqvist, Sven. January 2020 (has links)
The number of cyclists in Sweden is expected to increase in the coming years, but despite major efforts in road safety, the number of serious bicycle accidents does not decrease at the same rate as car accidents. This study has looked at data collected from the customers of the bicycle helmet manufacturer Hövding. The helmet acts as an airbag that is triggered by the strong head movement that occurs in an accident. The data consists of GPS positions together with a Support Vector Machine (SVM)-generated value that indicates how close the helmet is to registering an accident and thus being triggered. The purpose of the study was to analyse this data from cyclists in Malmö to see whether it is possible to identify places that are over-represented in the number of elevated SVM levels, and whether these places reflect real, potentially dangerous traffic situations. Density-based spatial clustering of applications with noise (DBSCAN), an unsupervised clustering algorithm widely used on spatial data containing noise, was used to identify clusters of elevated SVM levels. For each cluster, the number of unique cycle trips that generated an elevated SVM level in the cluster was counted, as well as the total number of cycle trips that passed through the cluster. 405 clusters were identified and sorted by the highest number of unique bike rides that generated an elevated SVM level, whereupon the top 30 were selected for further analysis. To validate the clusters against registered bicycle accidents, data were obtained from the Swedish Traffic Accident Data Acquisition (STRADA), the national accident database in Sweden. The thirty selected clusters had 0.082% cycling accidents per unique cycle trip, while the figure for the remaining 375 clusters was 0.041%. The number of accidents per cluster was 0.46 in the selected thirty clusters and 0.064 in the other clusters. The top thirty clusters were then sorted into three categories. Clusters with a possible explanation for elevated SVM levels, such as speed bumps and cobblestones, were given category 1; Hövding has communicated that such road-surface features can generate a lower degree of elevated SVM level. Category 2 contained clusters that had a construction site within the cluster. Category 3 contained clusters that could not be explained by either of the other two categories. The proportion of accidents per unique cycle trip was 0.068% for category 1 clusters, 0.071% for category 2 and 0.106% for category 3. The results indicate that this data is useful for identifying places with an increased risk of accidents for cyclists. The data processed in this study has a number of weaknesses, so the results should be interpreted with caution. For example, the data covers a short period of time, about 6 months, so seasonal cycling behaviour is not represented in the data set. The data set is also assumed to contain some noise, which may have affected the results. But there is potential in this type of data: in the future, when more data has been collected, it could be used to identify accident-prone places for cyclists with greater accuracy.
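A minimal sketch of the clustering-and-counting step described above, assuming a table of triggered events with a trip identifier and GPS coordinates; the column names and the eps/min_samples values are illustrative assumptions, not the study's parameters.

    import numpy as np
    import pandas as pd
    from sklearn.cluster import DBSCAN

    EARTH_RADIUS_M = 6_371_000  # mean Earth radius, used to convert metres to radians below

    def cluster_elevated_events(events: pd.DataFrame, eps_m: float = 25.0,
                                min_samples: int = 5) -> pd.DataFrame:
        """Cluster elevated-SVM events by position and count unique trips per cluster.
        `events` is assumed to have columns: trip_id, lat, lon (degrees)."""
        coords = np.radians(events[["lat", "lon"]].to_numpy())
        labels = DBSCAN(eps=eps_m / EARTH_RADIUS_M, min_samples=min_samples,
                        metric="haversine").fit_predict(coords)      # haversine expects radians
        clustered = events.assign(cluster=labels).query("cluster != -1")  # drop DBSCAN noise
        summary = clustered.groupby("cluster").agg(
            unique_trips=("trip_id", "nunique"),   # unique cycle trips with an elevated level
            events=("trip_id", "size"))            # all elevated-level events in the cluster
        return summary.sort_values("unique_trips", ascending=False)

Sorting by unique_trips mirrors the ranking used to pick the top 30 clusters; computing the total number of trips passing through each cluster would additionally require the full trip trajectories, which are omitted here.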
|
4 |
An Evaluation of Clustering and Classification Algorithms in Life-Logging Devices. Amlinger, Anton. January 2015 (has links)
Using life-logging devices and wearables is a growing trend in today's society. These devices yield vast amounts of information, data that is not directly overseeable or graspable at a glance because of its size. Gathering a qualitative, comprehensible overview of this quantitative information is essential for life-logging services to serve their purpose. This thesis provides a comparison of CLARANS, DBSCAN and SLINK, representing different branches of clustering algorithms, as tools for activity detection in geo-spatial data sets. The detected activities are then classified using a simple model whose parameters are learned via Bayesian inference, as a demonstration of a different branch of clustering. Results are provided using Silhouettes as the evaluation measure for the geo-spatial clustering and a user study for the final classification. The results are promising as an outline for a framework for classification and activity detection, and they shed light on various pitfalls that might be encountered when implementing such a service.
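As an illustration of the kind of comparison described above, the sketch below runs DBSCAN and single-linkage agglomerative clustering (the clustering that SLINK computes) on synthetic 2-D positions and scores both with Silhouettes; CLARANS is left out because it is not available in scikit-learn, and all parameter values are assumptions rather than the thesis's settings.

    import numpy as np
    from sklearn.cluster import DBSCAN, AgglomerativeClustering
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    # Synthetic stand-in for geo-spatial stay points: three activity locations plus uniform noise.
    points = np.vstack([rng.normal(loc=c, scale=0.05, size=(100, 2))
                        for c in [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]]
                       + [rng.uniform(-0.5, 2.5, size=(30, 2))])

    candidates = {
        "DBSCAN": DBSCAN(eps=0.1, min_samples=5),
        "single linkage (SLINK)": AgglomerativeClustering(n_clusters=3, linkage="single"),
    }
    for name, model in candidates.items():
        labels = model.fit_predict(points)
        kept = labels != -1                  # Silhouettes are undefined for DBSCAN noise points
        n_clusters = len(set(labels[kept]))
        score = silhouette_score(points[kept], labels[kept]) if n_clusters > 1 else float("nan")
        print(f"{name}: {n_clusters} clusters, silhouette = {score:.3f}")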
|
5 |
Article identification for inventory list in a warehouse environment. Gao, Yang. January 2014 (has links)
In this thesis, an object recognition system that uses local image features has been developed. The system can recognise multiple classes of objects in an image and is divided into two parts: object detection and object identification. Object detection is based on SIFT features, which are invariant to image illumination, scaling and rotation. SIFT features extracted from a test image are matched reliably against a database of SIFT features from known object images. DBSCAN clustering is used for multiple-object detection, and RANSAC is used to reduce the number of false detections. Object identification is based on the Bag-of-Words (BoW) model, a method based on vector quantisation of SIFT descriptors of image patches. In this model, K-means clustering and Support Vector Machine (SVM) classification are applied.
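A rough sketch of the detection stage described above, assuming OpenCV with SIFT support and a single known object image; the ratio-test threshold and the DBSCAN parameters are illustrative assumptions, not the values used in the thesis.

    import cv2
    import numpy as np
    from sklearn.cluster import DBSCAN

    def detect_object_instances(known_img_path: str, test_img_path: str):
        """Match SIFT features of a known object against a test image and group the
        matches spatially with DBSCAN, one cluster per candidate object instance."""
        sift = cv2.SIFT_create()
        known = cv2.imread(known_img_path, cv2.IMREAD_GRAYSCALE)
        test = cv2.imread(test_img_path, cv2.IMREAD_GRAYSCALE)
        _, des_known = sift.detectAndCompute(known, None)
        kp_test, des_test = sift.detectAndCompute(test, None)

        # Lowe's ratio test keeps only distinctive matches.
        matcher = cv2.BFMatcher(cv2.NORM_L2)
        good = [m for m, n in matcher.knnMatch(des_known, des_test, k=2)
                if m.distance < 0.75 * n.distance]
        if not good:
            return []

        # Each dense cluster of matched keypoint locations is treated as one detected
        # instance; sparse stray matches end up as DBSCAN noise and are discarded.
        pts = np.array([kp_test[m.trainIdx].pt for m in good])
        labels = DBSCAN(eps=40, min_samples=8).fit_predict(pts)
        centres = [pts[labels == lab].mean(axis=0) for lab in set(labels) - {-1}]
        # A per-cluster cv2.findHomography(..., cv2.RANSAC) fit could then prune false
        # detections, which is the role RANSAC plays in the thesis.
        return centres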
|
6 |
Diseño de procesos para la segmentación de clientes según su comportamiento de compra y hábito de consumo en una empresa de consumo masivo / Process design for segmenting customers by purchasing behaviour and consumption habits in a mass-consumption company. Rojas Araya, Javier Orlando. January 2017 (has links)
Magíster en Ingeniería de Negocios con Tecnologías de Información / The fast-moving consumer food industry has evolved over time. The first sales channels for reaching end customers were neighbourhood stores, which came under heavy threat from the proliferation of large supermarket chains. The arrival of the internet also created a new channel that lets end customers order products, pay for them through mobile applications and receive them at home. Despite this evolution in channels, neighbourhood stores refuse to disappear: many customers still prefer their friendly, personalised service, together with a wide range of products and attractive prices.
The company is no stranger to this reality and also sells its products to end customers through the supermarket and neighbourhood-store channels. In the store channel it serves roughly 25,000 customers per month nationwide, with the greatest concentration in the central region of the country. Segmenting these customers to understand their purchasing behaviour and consumption habits has become the centrepiece of this channel's strategy; analysing sales reports is no longer enough to improve the performance of the Commercial Area.
The objective of this project is to group the company's store-channel customers in terms of purchasing behaviour and consumption habits and to characterise the resulting segments. To reach this goal the Business Engineering methodology is used, which starts from the definition of the strategic positioning and proceeds through the business model, the process architecture, the detailed process design, the design of the technological support for the processes and, finally, the construction and deployment of the solution. Algorithms suited to this kind of task, such as DBSCAN and K-Means, are also used.
The results obtained segment the customers into seven groups for purchasing behaviour and seven for consumption habits, which makes it possible to answer when, how much and what the channel's customers buy.
The benefit of the project translates into increased sales, through actions that recover customers who are in the process of churning and through an increase in the average ticket of customers who buy frequently but with very low invoice amounts. / 07/04/2022
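A minimal sketch of the segmentation step, assuming a per-customer transaction table from which when/how-much/what features are derived; the feature set, the choice of seven K-Means segments and the DBSCAN parameters are illustrative assumptions rather than the project's actual design.

    import pandas as pd
    from sklearn.cluster import DBSCAN, KMeans
    from sklearn.preprocessing import StandardScaler

    def segment_customers(tx: pd.DataFrame) -> pd.DataFrame:
        """`tx` is assumed to have columns: customer_id, date (datetime), amount, category."""
        snapshot = tx["date"].max()
        features = tx.groupby("customer_id").agg(
            recency_days=("date", lambda d: (snapshot - d.max()).days),  # when they last bought
            frequency=("date", "count"),                                 # how often they buy
            monetary=("amount", "sum"),                                  # how much they spend
            distinct_products=("category", "nunique"))                   # what they buy
        X = StandardScaler().fit_transform(features)

        # K-Means assigns every customer to one of seven behaviour segments, while
        # DBSCAN finds dense behaviour groups and flags atypical customers as noise (-1).
        features["kmeans_segment"] = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X)
        features["dbscan_segment"] = DBSCAN(eps=0.8, min_samples=20).fit_predict(X)
        return features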
|
7 |
Density and partition based clustering on massive threshold bounded data sets. Kannamareddy, Aruna Sai. January 1900 (has links)
Master of Science / Department of Computing and Information Sciences / William H. Hsu / The project explores the possibility of increasing the efficiency of clusters formed from massive data sets that have first been grouped with a threshold blocking algorithm; the clusters thus formed are denser and of higher quality. Clusters produced by individual clustering algorithms alone do not necessarily eliminate outliers, and the clusters generated can be complex or poorly distributed over the data set. The threshold blocking algorithm, from recent work by Michael Higgins of the Statistics Department, on the other hand performs better than existing algorithms at forming dense, distinctive units under a predefined threshold. Developing a hybridized algorithm that applies existing clustering algorithms to re-cluster the units so formed is part of this project.
Clustering the seeds produced by the threshold blocking algorithm eases the task for the existing algorithm by removing the overhead of handling outliers, and the clusters generated are more representative of the whole. Moreover, since the threshold blocking algorithm is proven to be fast and efficient, many more decisions can be drawn from large data sets in less time. Predicting similar songs from the Million Song Dataset with such a hybridized algorithm is the task considered for evaluating this goal.
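The sketch below illustrates the hybrid idea under stated assumptions: a simple greedy radius-threshold blocking stands in for Higgins' threshold blocking algorithm (whose actual procedure is not reproduced here), and the resulting block seeds are then re-clustered with DBSCAN.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def greedy_threshold_blocks(X: np.ndarray, threshold: float) -> tuple:
        """Assign each point to the first seed within `threshold`, otherwise start a new block.
        A simplified stand-in for a threshold blocking step, not Higgins' algorithm."""
        seeds, assignment = [], np.empty(len(X), dtype=int)
        for i, x in enumerate(X):
            for b, s in enumerate(seeds):
                if np.linalg.norm(x - s) <= threshold:
                    assignment[i] = b
                    break
            else:
                seeds.append(x)
                assignment[i] = len(seeds) - 1
        return assignment, np.array(seeds)

    def hybrid_cluster(X: np.ndarray, threshold: float = 0.5,
                       eps: float = 1.0, min_samples: int = 3) -> np.ndarray:
        """Block first, then re-cluster the block seeds; each point inherits its block's cluster."""
        block_of_point, seeds = greedy_threshold_blocks(X, threshold)
        cluster_of_block = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(seeds)
        return cluster_of_block[block_of_point]

Because the second stage only sees one seed per block, the re-clustering operates on far fewer points than the raw data set, which is the efficiency argument made above.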
|
8 |
Product categorisation using machine learning / Produktkategorisering med hjälp av maskininlärning. Vasic, Stefan, Lindgren, Nicklas. January 2017 (has links)
Machine learning is a method in data science for analysing large data sets and extracting hidden patterns and common characteristics in the data. Corporations often have access to databases containing great amounts of data that could hold valuable information. Navetti AB wants to investigate the possibility of automating its product categorisation by evaluating different types of machine learning algorithms, which could increase both time and cost efficiency. This work resulted in three prototypes, each using a different machine learning algorithm capable of categorising products automatically. The prototypes were tested and evaluated based on their ability to categorise products and on their performance in terms of speed. Different techniques for preprocessing the data were also tested and evaluated. An analysis of the tests shows that, given a suitable algorithm and enough data, it is possible to automate the manual categorisation.
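As a sketch of what one such prototype might look like (the abstract does not name the algorithms evaluated, so the TF-IDF plus linear SVM pipeline and the column names below are assumptions made for illustration):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def train_categoriser(products: pd.DataFrame):
        """`products` is assumed to have a free-text `description` column and a `category` label."""
        X_train, X_test, y_train, y_test = train_test_split(
            products["description"], products["category"], test_size=0.2, random_state=0)
        model = make_pipeline(
            TfidfVectorizer(lowercase=True, ngram_range=(1, 2), min_df=2),  # text preprocessing
            LinearSVC())                                                    # linear classifier
        model.fit(X_train, y_train)
        print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
        return model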
|
9 |
Detecting Self-Correlation of Nonlinear, Lognormal, Time-Series Data via DBSCAN Clustering Method, Using Stock Price Data as Example. Huo, Shiyin. 15 December 2011 (has links)
No description available.
|
10 |
Offline Direction Clustering of Overlapping Radar Pulses from Homogeneous Emitters / Fristående riktningsklustring av överlappande radarpulser från homogena emittrar. Bedoire, Sofia. January 2022 (has links)
Within the defence industry, it is essential to be aware of threats in the environment. A potential threat can be detected by identifying certain types of emitters in the surroundings that are typically used in the enemy's systems. An emitter's type can be identified by having a receiver measure radar pulses in the environment and analysing the pulses transmitted from that specific emitter. As several emitters usually transmit pulses in an environment, the receiver measures pulses from all of them. In order to analyse the pulses from only one emitter, the pulses must first be sorted into groups based on which emitter transmitted them. This sorting can, for instance, be performed by considering similarities and differences in the pulses' features. This thesis investigates whether the change in the pulses' Angle of Arrival (AOA) over time can be used for sorting the pulses. Such an approach can be useful in scenarios where signals from homogeneous emitters, which are similar in their features, need to be distinguished. In addition, by taking the change in AOA into consideration rather than relying on the AOA itself, the approach has the potential to separate signals from emitters whose AOA values overlap at some time step. A multiple-step clustering algorithm adapted from Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is used for the pulse sorting. The algorithm is primarily evaluated in test scenarios that include homogeneous emitters whose pulses overlap with respect to the AOA at some time step. The goal is to divide the pulses into groups depending on which emitter transmitted them. Pulses involved in an overlap are typically not distinguishable and should therefore not be assigned to any cluster. Signals received before and after an overlap are allowed to belong to different clusters even if they come from the same emitter. The algorithm was able to cluster signals properly and identify the overlapping signals in test scenarios where the emitters were placed in specific patterns. The performance worsened as the emitters were allowed to have arbitrary positions and the number of emitters increased, which may imply that the algorithm performs poorly when the emitters are located close together. In order to determine whether, or to what extent, this approach is suitable for pulse sorting, the algorithm should be evaluated in further test scenarios.
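A much-simplified sketch of the underlying idea, clustering pulses on the local rate of change of AOA over time rather than on the AOA alone, is given below; it is a single DBSCAN pass over (time, AOA, AOA-rate) features rather than the thesis's multiple-step algorithm, and all parameter values are assumptions.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler

    def sort_pulses_by_aoa_rate(toa: np.ndarray, aoa: np.ndarray, window: int = 5,
                                eps: float = 0.3, min_samples: int = 10) -> np.ndarray:
        """toa: pulse times of arrival (s); aoa: angles of arrival (degrees).
        Returns one label per pulse; -1 marks pulses left unassigned, e.g. inside an overlap."""
        order = np.argsort(toa)
        t, a = toa[order], aoa[order]
        # Local AOA slope estimated with a centred sliding least-squares fit.
        rate = np.zeros(len(a))
        for i in range(len(a)):
            lo, hi = max(0, i - window), min(len(a), i + window + 1)
            rate[i] = np.polyfit(t[lo:hi], a[lo:hi], 1)[0]
        features = StandardScaler().fit_transform(np.column_stack([t, a, rate]))
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
        out = np.empty_like(labels)
        out[order] = labels             # map labels back to the original pulse order
        return out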
|