11 |
Modelos de mistura para dados com distribuições Poisson truncadas no zero / Mixture models for data with zero truncated Poisson distributions
Gigante, Andressa do Carmo, 22 September 2017
Modelo de mistura de distribuições tem sido utilizado desde longa data, mas ganhou maior atenção recentemente devido ao desenvolvimento de métodos de estimação mais eficientes. Nesta dissertação, o modelo de mistura foi utilizado como uma forma de agrupar ou segmentar dados para as distribuições Poisson e Poisson truncada no zero. Para solucionar o problema do truncamento foram estudadas duas abordagens. Na primeira, foi considerado o truncamento em cada componente da mistura, ou seja, a distribuição Poisson truncada no zero. E, alternativamente, o truncamento na resultante do modelo de mistura utilizando a distribuição Poisson usual. As estimativas dos parâmetros de interesse do modelo de mistura foram calculadas via metodologia de máxima verossimilhança, sendo necessária a utilização de um método iterativo. Dado isso, implementamos o algoritmo EM para estimar os parâmetros do modelo de mistura para as duas abordagens em estudo. Para analisar a performance dos algoritmos construídos, elaboramos um estudo de simulação, no qual os algoritmos apresentaram estimativas próximas dos verdadeiros valores dos parâmetros de interesse. Aplicamos os algoritmos a uma base de dados real de uma determinada loja eletrônica e, para determinar a escolha do melhor modelo, utilizamos os critérios de seleção de modelos AIC e BIC. O truncamento no zero parece afetar mais a metodologia na qual aplicamos o truncamento em cada componente da mistura, tornando fortemente viesadas algumas estimativas para a distribuição Poisson truncada no zero. Ao passo que, na abordagem em que empregamos o truncamento no zero diretamente no modelo, as estimativas apontaram menor viés. / Mixture models have long been used, but they have recently attracted more attention due to the development of more efficient estimation methods. In this dissertation, mixture models are used as a way to cluster or segment data under the Poisson and zero-truncated Poisson distributions. To handle the zero truncation, two approaches are studied. In the first, the truncation is applied to each mixture component, that is, each component follows a zero-truncated Poisson distribution. Alternatively, the truncation is applied to the resulting mixture model built from the usual Poisson distribution. The parameters of interest of the mixture model are estimated by maximum likelihood, which requires an iterative method; we therefore implemented the EM algorithm for both approaches under study. To assess the performance of the algorithms, we carried out a simulation study, in which the estimates were close to the true parameter values. We applied the algorithms to a real data set from an electronics store and used the AIC and BIC model selection criteria to choose the best model. Zero truncation appears to affect more strongly the approach in which each mixture component is truncated, making some estimates for the zero-truncated Poisson distribution strongly biased, whereas the approach that applies the truncation directly to the mixture model yielded estimates with less bias.
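The EM iteration at the heart of both approaches can be sketched for the simpler untruncated case. The snippet below is an illustrative two-component Poisson mixture fit, not the dissertation's implementation; the zero-truncated variant would additionally require a numerical solve in the M-step, since the truncated mean is lambda / (1 - exp(-lambda)).

```python
import numpy as np
from scipy.stats import poisson

def em_poisson_mixture(x, k=2, n_iter=200, seed=0):
    """EM for a k-component Poisson mixture (untruncated, for illustration)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    lam = rng.uniform(x.mean() * 0.5, x.mean() * 1.5, size=k)  # initial rates
    pi = np.full(k, 1.0 / k)                                   # mixing weights
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        log_r = np.log(pi) + poisson.logpmf(x[:, None], lam)
        log_r -= log_r.max(axis=1, keepdims=True)              # for stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights and rates from responsibility-weighted counts
        nk = r.sum(axis=0)
        pi = nk / len(x)
        lam = (r * x[:, None]).sum(axis=0) / nk
    return pi, lam
```

On data simulated from two Poisson components with well-separated rates, the returned `lam` estimates land close to the true rates, mirroring the simulation study described above.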
|
12 |
An investigation on automatic systems for fault diagnosis in chemical processes
Monroy Chora, Isaac, 03 February 2012
Plant safety is the most important concern of chemical industries. Process faults can cause economic losses as well as human and environmental damage. Most operational faults are normally considered in the process design phase by applying methodologies such as Hazard and Operability Analysis (HAZOP). However, it should be expected that failures may still occur in an operating plant. For this reason, it is of paramount importance that plant operators can promptly detect and diagnose such faults in order to take the appropriate corrective actions. In addition, preventive maintenance needs to be considered in order to increase plant safety.
Fault diagnosis has been approached with both analytical and data-based models, using several techniques and algorithms. However, there is not yet a general fault diagnosis framework that combines the detection and diagnosis of faults, whether or not they are registered in historical records. Moreover, few efforts have focused on automating the reported approaches and implementing them in real practice.
Against this background, this thesis proposes a general framework for data-driven Fault Detection and Diagnosis (FDD), applicable, and amenable to automation, in any industrial scenario in order to maintain plant safety. The main requirement for constructing this system is the existence of historical process data. In this sense, promising methods imported from the Machine Learning field are introduced as fault diagnosis methods. The learning algorithms, used as diagnosis methods, have proved capable of diagnosing not only the modeled faults but also novel faults. Furthermore, Risk-Based Maintenance (RBM) techniques, widely used in the petrochemical industry, are proposed for application as part of preventive maintenance in all industry sectors. The proposed FDD system, together with an appropriate preventive maintenance program, would represent a potential plant safety program to be implemented.
Chapter one presents a general introduction to the thesis topic, as well as its motivation and scope. Chapter two then reviews the state of the art of the related fields. Fault detection and diagnosis methods found in the literature are reviewed, and a taxonomy that joins both the Artificial Intelligence (AI) and Process Systems Engineering (PSE) classifications is proposed. The assessment of fault diagnosis with performance indices is also reviewed. Moreover, the chapter presents the state of the art of Risk Analysis (RA) as a tool for taking corrective actions against faults, and of Maintenance Management for preventive actions. Finally, the benchmark case studies against which FDD research is commonly validated are examined.
The second part of the thesis, comprising chapters three to six, addresses the methods applied during the research work. Chapter three deals with data pre-processing, chapter four with the feature processing stage and chapter five with the diagnosis algorithms, while chapter six introduces the Risk-Based Maintenance techniques for addressing plant preventive maintenance. The third part consists of chapter seven, which constitutes the core of the thesis. In this chapter the proposed general FDD system is outlined, divided into three steps: diagnosis model construction, model validation and on-line application. This scheme includes a fault detection module and an Anomaly Detection (AD) methodology for the detection of novel faults. Furthermore, several approaches are derived from this general scheme for continuous and batch processes. The fourth part of the thesis presents the validation of the approaches: chapter eight presents the validation of the proposed approaches in continuous processes, and chapter nine the validation of the batch process approaches. Chapter ten applies the AD methodology to real-scale batch processes; it is first applied to a lab heat exchanger and then to a Photo-Fenton pilot plant, which corroborates its potential and success in real practice. Finally, the fifth part, comprising chapter eleven, presents the final conclusions and the main contributions of the thesis. The scientific production achieved during the research period is also listed, and prospects for further work are outlined. / La seguridad de planta es el problema más inquietante para las industrias químicas. Un fallo en planta puede causar pérdidas económicas y daños humanos y al medio ambiente. La mayoría de los fallos operacionales son previstos en la etapa de diseño de un proceso mediante la aplicación de técnicas de Análisis de Riesgos y de Operabilidad (HAZOP). Sin embargo, existe la probabilidad de que pueda originarse un fallo en una planta en operación.
Por esta razón, es de suma importancia que una planta pueda detectar y diagnosticar fallos en el proceso y tomar las medidas correctoras adecuadas para mitigar los efectos del fallo y evitar lamentables consecuencias. Es entonces también importante el mantenimiento preventivo para aumentar la seguridad y prevenir la ocurrencia de fallos.
La diagnosis de fallos ha sido abordada tanto con modelos analíticos como con modelos basados en datos y usando varios tipos de técnicas y algoritmos. Sin embargo, hasta ahora no existe la propuesta de un sistema general de seguridad en planta que combine detección y diagnosis de fallos ya sea registrados o no registrados anteriormente. Menos aún se han reportado metodologías que puedan ser automatizadas e implementadas en la práctica real.
Con la finalidad de abordar el problema de la seguridad en plantas químicas, esta tesis propone un sistema general para la detección y diagnosis de fallos capaz de implementarse de forma automatizada en cualquier industria. El principal requerimiento para la construcción de este sistema es la existencia de datos históricos de planta sin previo filtrado. En este sentido, diferentes métodos basados en datos son aplicados como métodos de diagnosis de fallos, principalmente aquellos importados del campo de “Aprendizaje Automático”. Estas técnicas de aprendizaje han resultado ser capaces de detectar y diagnosticar no sólo los fallos modelados o “aprendidos”, sino también nuevos fallos no incluidos en los modelos de diagnosis. Aunado a esto, algunas técnicas de mantenimiento basadas en riesgo (RBM) que son ampliamente usadas en la industria petroquímica, son también propuestas para su aplicación en el resto de sectores industriales como parte del mantenimiento preventivo. En conclusión, se propone implementar en un futuro no lejano un programa general de seguridad de planta que incluya el sistema de detección y diagnosis de fallos propuesto junto con un adecuado programa de mantenimiento preventivo.
Desglosando el contenido de la tesis, el capítulo uno presenta una introducción general al tema de esta tesis, así como también la motivación generada para su desarrollo y el alcance delimitado. El capítulo dos expone el estado del arte de las áreas relacionadas al tema de tesis. De esta forma, los métodos de detección y diagnosis de fallos encontrados en la literatura son examinados en este capítulo. Asimismo, se propone una
taxonomía de los métodos de diagnosis que unifica las clasificaciones propuestas en el área de Inteligencia Artificial y de Ingeniería de procesos. En consecuencia, se examina también la evaluación del performance de los métodos de diagnosis en la literatura. Además, en este capítulo se revisa y reporta el estado del arte correspondiente al “Análisis de Riesgos” y a la “Gestión del Mantenimiento” como técnicas complementarias para la toma de medidas correctoras y preventivas. Por último se abordan los casos de estudio considerados como puntos de referencia en el campo de investigación para la aplicación del sistema propuesto. La tercera parte incluye el capítulo siete, el cual constituye el corazón de la tesis. En este capítulo se presenta el esquema o sistema general de diagnosis de fallos propuesto. El sistema es dividido en tres partes: construcción de los modelos de diagnosis, validación de los modelos y aplicación on-line. Además incluye un modulo de detección de fallos previo a la diagnosis y una metodología de detección de anomalías para la detección de nuevos fallos. Por último, de este sistema se desglosan varias metodologías para procesos continuos y por lote. La cuarta parte de esta tesis presenta la validación de las metodologías propuestas. Específicamente, el capítulo ocho presenta la validación de las metodologías propuestas para su aplicación en procesos continuos y el capítulo nueve presenta la validación de las metodologías correspondientes a los procesos por lote. El capítulo diez valida la metodología de detección de anomalías en procesos por lote reales. Primero es aplicada a un intercambiador de calor escala laboratorio y después su aplicación es escalada a un proceso Foto-Fenton de planta piloto, lo cual corrobora el potencial y éxito de la metodología en la práctica real. 
Finalmente, la quinta parte de esta tesis, compuesta por el capítulo once, es dedicada a presentar y reafirmar las conclusiones finales y las principales contribuciones de la tesis. Además, se plantean las líneas de investigación futuras y se lista el trabajo desarrollado y presentado durante el periodo de investigación.
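The detect-then-diagnose loop described in this thesis (diagnose registered faults, route unseen patterns to anomaly detection) can be illustrated with a toy sketch. The nearest-class-mean rule and the distance-envelope threshold below are simplifying assumptions for illustration only, not the machine-learning classifiers used in the thesis:

```python
import numpy as np

def fit_fdd(X_train, y_train):
    """Store per-class means and distance thresholds learned from
    historical process data with labeled (registered) faults."""
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    # envelope: largest training distance to the own class mean, per class
    thr = np.array([np.linalg.norm(X_train[y_train == c] - m, axis=1).max()
                    for c, m in zip(classes, means)])
    return classes, means, thr

def diagnose(x, model):
    """Diagnose a sample; samples beyond every registered-fault envelope
    are routed to anomaly detection as a potential novel fault."""
    classes, means, thr = model
    d = np.linalg.norm(means - x, axis=1)
    i = d.argmin()
    return "novel" if d[i] > thr[i] else classes[i]
```

A sample close to a registered fault signature is diagnosed with that label, while a sample far from all learned classes is flagged as novel, mirroring the role of the AD module in the general scheme.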
|
13 |
Classificação de anomalias e redução de falsos positivos em sistemas de detecção de intrusão baseados em rede utilizando métodos de agrupamento / Anomalies classification and false positives reduction in network intrusion detection systems using clustering methods
Ferreira, Vinícius Oliveira [UNESP], 27 April 2016
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / Os Sistemas de Detecção de Intrusão baseados em rede (NIDS) são tradicionalmente divididos em dois tipos de acordo com os métodos de detecção que empregam, a saber: (i) detecção por abuso e (ii) detecção por anomalia. Aqueles que funcionam a partir da detecção de anomalias têm como principal vantagem a capacidade de detectar novos ataques, no entanto, é possível elencar algumas dificuldades com o uso desta metodologia. Na detecção por anomalia, a análise das anomalias detectadas pode se tornar dispendiosa, uma vez que estas geralmente não apresentam informações claras sobre os eventos maliciosos que representam; ainda, NIDSs que se utilizam desta metodologia sofrem com a detecção de altas taxas de falsos positivos. Neste contexto, este trabalho apresenta um modelo para a classificação automatizada das anomalias detectadas por um NIDS. O principal objetivo é a classificação das anomalias detectadas em classes conhecidas de ataques. Com essa classificação pretende-se, além da clara identificação das anomalias, a identificação dos falsos positivos detectados erroneamente pelos NIDSs. Portanto, ao abordar os principais problemas envolvendo a detecção por anomalias, espera-se equipar os analistas de segurança com melhores recursos para suas análises. / Network Intrusion Detection Systems (NIDS) are traditionally divided into two types according to the detection methods they employ, namely (i) misuse detection and (ii) anomaly detection. The main advantage of anomaly detection is its ability to detect new attacks. However, this methodology has some downsides. In anomaly detection, the analysis of the detected anomalies is expensive, since they often carry no clear information about the malicious events they represent; anomaly-based NIDSs also suffer from high false positive rates. In this context, this work presents a model for the automated classification of anomalies detected by an anomaly-based NIDS. Our main goal is the classification of the detected anomalies into well-known classes of attacks. By these means, we intend both the clear identification of anomalies and the identification of false positives erroneously detected by NIDSs. Therefore, by addressing the key issues surrounding anomaly-based detection, we aim to equip security analysts with better resources for their analyses.
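The idea of mapping detected anomalies to known attack classes via clustering can be sketched as follows. The feature encoding, the plain k-means step and the majority-label naming rule are illustrative assumptions, not the model proposed in the dissertation; clusters whose majority label is benign correspond to false positives:

```python
import numpy as np
from collections import Counter

def kmeans(X, k, n_iter=50):
    """Plain k-means with deterministic farthest-first initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])          # farthest point becomes a seed
    centers = np.array(centers)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        centers = np.array([X[assign == c].mean(axis=0) if (assign == c).any()
                            else centers[c] for c in range(k)])
    return assign

def classify_anomalies(X, known_idx, known_labels, n_clusters=3):
    """Cluster anomaly feature vectors, then name each cluster after the
    majority label among its few labeled members."""
    assign = kmeans(X, n_clusters)
    names = {}
    for c in range(n_clusters):
        labels = [lab for i, lab in zip(known_idx, known_labels) if assign[i] == c]
        names[c] = Counter(labels).most_common(1)[0][0] if labels else "unknown"
    return [names[c] for c in assign]
```

With a handful of labeled anomalies per attack class, every other anomaly in the same cluster inherits that class name, giving the analyst a readable verdict instead of a raw alert.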
|
15 |
Uma nova arquitetura para combinação de aglomerados espaciais e aplicação em epidemiologia / A new architecture for combining spatial clusters with an application to epidemiology
Holmes, Danielly Cristina de Souza Costa, 16 December 2015
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / The combination of classifiers aims to produce more accurate results for the decision-making process. Accordingly, this study proposes a new architecture based on the combination of spatial clustering methods, together with a more detailed voting map showing the number of votes each geo-object received, applied to epidemiology. Spatial clustering methods, in general, aim to identify the significant and non-significant spatial clusters in the study area; they are combined by means of combination rules. In this work, the following rules were used: majority voting and neural networks. The proposed architecture was applied to dengue data from the state of Paraíba for the period from 2009 to 2011. According to the World Health Organization, dengue annually accounts for an average of 50 to 100 million cases worldwide, placing a large financial burden on the health sector. The combination of spatial clustering methods was applied in three case studies, and in all three the new architecture identified more precisely the priority and non-priority municipalities in Paraíba with regard to dengue. In case study 1 the combination rule was majority voting; in case study 2 it was neural networks; and in case study 3 a new detailed voting map was proposed, identifying the number of votes each municipality received. Analyzing the results from a spatial point of view, the mesoregion called Sertão in the state of Paraíba had the largest number of priority municipalities, while the coastal mesoregion of Paraíba had the smallest. From the epidemiological point of view, the diagnostic test results (sensitivity, specificity, positive predictive value and negative predictive value) and the Kappa statistic showed that the combined models produced satisfactory results. Finally, from the point of view of combining spatial clustering methods, the new architecture presented satisfactory results using the combination rules. These results can assist managers in the decision-making process by identifying more precisely the regions that deserve special attention in combating the disease. / A combinação de classificadores tem por objetivo produzir resultados mais precisos para o
processo de tomada de decisão. Com isso, este estudo teve por objetivo propor uma nova arquitetura baseada na combinação dos métodos de aglomeração espacial e um mapa de votação mais detalhado sobre a quantidade de votos que cada geo-objeto recebeu, aplicados à epidemiologia. Os métodos de aglomerados espaciais, de forma geral, têm por objetivo a identificação dos conglomerados espaciais significativos e não significativos de acordo com a região de estudo. Eles são combinados por regras de combinação. Neste trabalho foram utilizadas as seguintes regras: votação por maioria e redes neurais. A nova arquitetura proposta foi aplicada a dados do dengue no estado da Paraíba, no período de 2009 a 2011. Segundo a Organização Mundial da Saúde, o dengue é uma doença que registra anualmente uma média de 50 a 100 milhões de casos em todo o mundo, gerando grandes encargos financeiros para o setor da saúde. A combinação dos métodos de aglomeração espacial foi aplicada em três estudos de caso. Em todos os três estudos de caso a nova arquitetura identificou com maior precisão os municípios prioritários e não prioritários do dengue na Paraíba. No estudo de caso 1 a regra de combinação foi a votação por maioria; no estudo de caso 2, a das redes neurais; e no estudo de caso 3 foi proposto um novo mapa de votação detalhado, identificando a quantidade de votos que cada município recebeu. Analisando os resultados do ponto de vista espacial, observou-se que a mesorregião do Sertão Paraibano apresentou a maior quantidade de municípios prioritários; e a mesorregião do Litoral Paraibano, o menor número de municípios prioritários. Em relação à pesquisa do ponto de vista epidemiológico, foi possível verificar, a partir dos resultados dos testes diagnósticos (sensibilidade, especificidade, valores preditivos positivos e valores preditivos negativos) e da estatística Kappa, que os modelos de combinação produziram resultados satisfatórios. Finalizando a análise do ponto de vista da combinação dos métodos de aglomerados espaciais, foi possível observar que a nova arquitetura apresentou resultados satisfatórios a partir da combinação das regras de combinação. Estes resultados, do ponto de vista epidemiológico, podem auxiliar os gestores no processo de tomada de decisão, verificando com mais precisão as regiões que realmente merecem atenção especial no combate à doença.
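The majority-voting combination rule and the detailed voting map described above can be sketched as follows (a minimal illustrative version; the architecture in the dissertation also combines methods with neural networks):

```python
import numpy as np

def combine_by_majority(votes):
    """votes: (n_methods, n_areas) boolean array, True where a spatial
    clustering method flags the geo-object (e.g., a municipality) as part
    of a significant cluster. Returns the strict-majority decision and
    the per-area vote count (the 'detailed voting map')."""
    votes = np.asarray(votes, dtype=bool)
    counts = votes.sum(axis=0)                # votes received by each area
    majority = counts > votes.shape[0] / 2    # strict majority of methods
    return majority, counts
```

The vote counts are what the detailed voting map displays: an area flagged by all methods is a stronger priority candidate than one flagged by a bare majority.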
|
16 |
Klasifikace elektronických dokumentů s využitím shlukové analýzy / Classification of electronic documents using cluster analysis
Ševčík, Radim, January 2009
The current age is characterised by unprecedented information growth, whether in amount or complexity. Most of this information is available in digital form, so we can analyze it using cluster analysis. We have tried to classify the documents of the 20 Newsgroups collection in terms of their content only. The aim was to assess the available clustering methods in a variety of applications. After transformation into a binary vector representation, we performed several experiments and measured entropy, purity and execution time in the CLUTO application. For a small number of clusters the direct method (an essentially hierarchical method) offered the best results, but for larger numbers repeated bisection (a divisive method) performed better. The agglomerative method proved unsuitable. Using simulation, we estimated the optimal number of clusters to be 10. For this solution we described the features of each cluster in detail, using the repeated bisection method and the i2 criterion function. In the future, the focus should be on implementing binary clustering in programming languages such as Perl or C++. The results of this work might be of interest to web search engine developers and electronic catalogue administrators.
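The entropy and purity used above to compare clustering solutions can be computed as follows, a minimal sketch of the standard size-weighted definitions (as reported by tools such as CLUTO):

```python
import numpy as np
from collections import Counter

def purity_entropy(clusters, classes):
    """Size-weighted purity and class entropy of a clustering.
    clusters: cluster id per document; classes: true class per document."""
    n = len(clusters)
    purity, entropy = 0.0, 0.0
    for c in set(clusters):
        members = [cls for k, cls in zip(clusters, classes) if k == c]
        counts = np.array(list(Counter(members).values()), dtype=float)
        p = counts / counts.sum()                       # class mix in cluster
        purity += (counts.sum() / n) * p.max()          # dominant-class share
        entropy += (counts.sum() / n) * -(p * np.log2(p)).sum()
    return purity, entropy
```

Higher purity and lower entropy indicate clusters that align better with the true newsgroup classes.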
|
17 |
Analýza vlastností shlukovacích algoritmů / Analysis of Clustering Methods
Lipták, Šimon, January 2019
The aim of this master's thesis was to become acquainted with cluster analysis, clustering methods and their theoretical properties. It was necessary to select clustering algorithms whose properties would be analyzed, and to find and select data sets on which these algorithms would be run. A further goal was to design and implement an application that evaluates and displays the clustering results in an appropriate manner. The last step was to analyze the results and compare them with the theoretical assumptions.
|
18 |
Improving Recommender Engines for Video Streaming Platforms with RNNs and Multivariate Data / Förbättring av Rekommendationsmotorer för Videoströmningsplattformar med RNN och Multivariata Data
Pérez Felipe, Daniel, January 2022
For over 4 years now, there has been a fierce fight to stay ahead in the so-called ”Streaming War”. The Covid-19 pandemic and the ensuing confinement only worsened the situation. In a market where the user is faced with too many streaming video services to choose from, retaining customers becomes a necessity. Moreover, an extensive catalogue makes it even more difficult for the user to choose a movie. Recommender Systems try to ease this task by analyzing the users’ interactions with the platform and predicting which movies will, a priori, be watched next. Neural Networks have started to be used as the underlying technology in the development of Recommender Systems. Yet most streaming services suffer from a highly uneven movie distribution, in which a small fraction of their content is watched by most of their users while the rest of the catalogue receives a limited number of views. This long-tail problem makes for a difficult classification model. An RNN model was implemented to solve this problem. Following a multiple-experts classification strategy, where each classifier focuses only on a specific group of films, movies are clustered by popularity. These clusters were created with the Jenks natural breaks algorithm, which clusters movies by minimizing the within-group variance and maximizing the between-group variance. This implementation ended up outperforming other clustering methods, with the proposed Jenks movie clusters giving better results for the corresponding models. The model took as input an ordered stream of watched movies. An extra input variable, the date of the visualization, increased performance, most noticeably in the clusters with fewer movies and more views, i.e., the clusters other than the least popular ones. The addition of a further variable, the percentage of the movie watched, gave inconclusive results due to hardware limitations.
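The popularity clustering step can be sketched with a compact dynamic-programming version of the Jenks (Fisher) natural breaks criterion, which partitions sorted 1-D values into contiguous groups minimizing the within-group sum of squared deviations. This is an illustrative O(kn^2)-style implementation, not the one used in the thesis; optimized libraries (e.g., jenkspy) exist for real workloads:

```python
import numpy as np

def jenks_breaks(values, k):
    """Optimal partition of 1-D values into k contiguous groups minimizing
    total within-group sum of squared deviations. Returns the groups."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    # ssd[i, j]: within-group SSD of the segment x[i:j+1]
    ssd = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            seg = x[i:j + 1]
            ssd[i, j] = ((seg - seg.mean()) ** 2).sum()
    # cost[g, j]: best SSD splitting x[0:j+1] into g+1 groups
    cost = np.full((k, n), np.inf)
    back = np.zeros((k, n), dtype=int)
    cost[0] = ssd[0]
    for g in range(1, k):
        for j in range(g, n):
            for s in range(g, j + 1):            # group g starts at index s
                c = cost[g - 1, s - 1] + ssd[s, j]
                if c < cost[g, j]:
                    cost[g, j], back[g, j] = c, s
    # recover the groups by walking the split points backwards
    groups, j = [], n - 1
    for g in range(k - 1, 0, -1):
        s = back[g, j]
        groups.append(x[s:j + 1].tolist())
        j = s - 1
    groups.append(x[:j + 1].tolist())
    return groups[::-1]
```

Applied to per-movie view counts, the resulting groups give the popularity clusters on which the per-cluster expert classifiers are trained.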
/ For more than four years now, a fierce battle has been fought to stay at the forefront of the so-called "Streaming War". The Covid-19 pandemic and the lockdown that followed only made the situation worse. In a market like this, where the user faces too many streaming services to choose from, retaining customers becomes a necessity. An extensive catalogue, moreover, makes it even harder for the user to choose a film. Recommender systems try to ease this task by analysing users' interactions with the platform and predicting which films will be watched next. Neural networks have begun to be adopted as the underlying technology in the development of recommender systems. Most streaming services, however, suffer from a very uneven distribution of films: a small fraction of their content is watched by most of their users, while a large part of their catalogue receives a limited number of views. This so-called "long tail" problem makes it difficult to build a classification model. An RNN model was implemented to address this problem, following a multi-expert classification strategy in which each classifier focuses on a single specific group of films, grouped by popularity. These clusters were created with the Jenks natural breaks algorithm, which clusters films by minimizing within-group variance and maximizing between-group variance. This implementation ultimately outperformed other clustering methods, with the film clusters proposed by Jenks giving better results for the corresponding models. The model took as input an ordered stream of watched films. An extra input variable, the date of the viewing, yielded a performance increase that was most noticeable in the clusters with fewer films and more views, i.e. the clusters other than the least popular ones. The addition of a further variable, the percentage of the film watched, gave inconclusive results due to hardware limitations.
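The Jenks natural breaks algorithm mentioned in the abstract above is optimal one-dimensional partitioning: sorted values are split into contiguous classes so that total within-class variance is minimized (which, for a fixed total variance, maximizes between-class variance). A minimal dynamic-programming sketch of the idea follows; it is not the thesis's implementation, and the function name and interface are illustrative:

```python
def ssd(xs):
    # sum of squared deviations from the mean of xs
    m = sum(xs) / len(xs)
    return sum((v - m) ** 2 for v in xs)

def jenks_breaks(data, k):
    """Partition 1-D data into k contiguous classes minimizing
    total within-class variance (Jenks natural breaks)."""
    xs = sorted(data)
    n = len(xs)
    INF = float("inf")
    # cost[i][j]: best total SSD for xs[:i] split into j classes
    cost = [[INF] * (k + 1) for _ in range(n + 1)]
    cut = [[0] * (k + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for p in range(j - 1, i):  # last class is xs[p:i]
                c = cost[p][j - 1] + ssd(xs[p:i])
                if c < cost[i][j]:
                    cost[i][j] = c
                    cut[i][j] = p
    # walk the cut table backwards to recover the classes
    bounds, i = [], n
    for j in range(k, 0, -1):
        p = cut[i][j]
        bounds.append(xs[p:i])
        i = p
    return bounds[::-1]
```

In the thesis's setting, `data` would be a per-film popularity measure (e.g. view counts), and each resulting class would feed one expert classifier.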
|
19 |
n-TARP: A Random Projection based Method for Supervised and Unsupervised Machine Learning in High-dimensions with Application to Educational Data Analysis
Yellamraju Tarun (6630578) 11 June 2019 (has links)
Analyzing the structure of a dataset is a challenging problem in high dimensions, as the volume of the space increases at an exponential rate and data typically becomes sparse in this high-dimensional space. This poses a significant challenge to machine learning methods, which rely on exploiting structure underlying the data to make meaningful inferences. This dissertation proposes the <i>n</i>-TARP method as a building block for high-dimensional data analysis, in both supervised and unsupervised scenarios.<div><br></div><div>The basic element, <i>n</i>-TARP, consists of a random projection framework that transforms high-dimensional data to one-dimensional data in a manner that yields point separations in the projected space. The point separation can be tuned to reflect classes in supervised scenarios and clusters in unsupervised scenarios. The <i>n</i>-TARP method finds linear separations in high-dimensional data. This basic unit can be used repeatedly to find a variety of structures. It can be arranged in a hierarchical structure like a tree, which increases the model complexity, flexibility and discriminating power. Feature space extensions combined with <i>n</i>-TARP can also be used to investigate non-linear separations in high-dimensional data.<br></div><div><br></div><div>The application of <i>n</i>-TARP to both supervised and unsupervised problems is investigated in this dissertation. In the supervised scenario, a sequence of <i>n</i>-TARP based classifiers with increasing complexity is considered. The point separations are measured by classification metrics like accuracy, Gini impurity or entropy. The performance of these classifiers on image classification tasks is studied. This study provides an interesting insight into the working of classification methods. The sequence of <i>n</i>-TARP classifiers yields benchmark curves that put the accuracy and complexity of other classification methods for a given dataset in context. 
The benchmark curves are parameterized by classification error and computational cost to define a benchmarking plane. This framework splits this plane into regions of "positive-gain" and "negative-gain" which provide context for the performance and effectiveness of other classification methods. The asymptotes of benchmark curves are shown to be optimal (i.e. at Bayes Error) in some cases (Theorem 2.5.2).<br></div><div><br></div><div>In the unsupervised scenario, the <i>n</i>-TARP method highlights the existence of many different clustering structures in a dataset. However, not all structures present are statistically meaningful. This issue is amplified when the dataset is small, as random events may yield sample sets that exhibit separations that are not present in the distribution of the data. Thus, statistical validation is an important step in data analysis, especially in high-dimensions. However, in order to statistically validate results, often an exponentially increasing number of data samples are required as the dimensions increase. The proposed <i>n</i>-TARP method circumvents this challenge by evaluating statistical significance in the one-dimensional space of data projections. The <i>n</i>-TARP framework also results in several different statistically valid instances of point separation into clusters, as opposed to a unique "best" separation, which leads to a distribution of clusters induced by the random projection process.<br></div><div><br></div><div>The distributions of clusters resulting from <i>n</i>-TARP are studied. This dissertation focuses on small sample high-dimensional problems. A large number of distinct clusters are found, which are statistically validated. 
The distribution of clusters is studied as the dimensionality of the problem evolves through the extension of the feature space using monomial terms of increasing degree in the original features, which corresponds to investigating non-linear point separations in the projection space.<br></div><div><br></div><div>A statistical framework is introduced to detect patterns of dependence between the clusters formed with the features (predictors) and a chosen outcome (response) in the data that is not used by the clustering method. This framework is designed to detect the existence of a relationship between the predictors and response. This framework can also serve as an alternative cluster validation tool.<br></div><div><br></div><div>The concepts and methods developed in this dissertation are applied to a real world data analysis problem in Engineering Education. Specifically, engineering students' Habits of Mind are analyzed. The data at hand is qualitative, in the form of text, equations and figures. To use the <i>n</i>-TARP based analysis method, the source data must be transformed into quantitative data (vectors). This is done by modeling it as a random process based on the theoretical framework defined by a rubric. Since the number of students is small, this problem falls into the small sample high-dimensions scenario. The <i>n</i>-TARP clustering method is used to find groups within this data in a statistically valid manner. The resulting clusters are analyzed in the context of education to determine what is represented by the identified clusters. The dependence of student performance indicators like the course grade on the clusters formed with <i>n</i>-TARP are studied in the pattern dependence framework, and the observed effect is statistically validated. The data obtained suggests the presence of a large variety of different patterns of Habits of Mind among students, many of which are associated with significant grade differences. 
In particular, the course grade is found to be dependent on at least two Habits of Mind: "computation and estimation" and "values and attitudes."<br></div>
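The core n-TARP step described in the abstract above, projecting high-dimensional data onto a random direction and looking for a point separation in the resulting one-dimensional values, can be sketched as follows. The scoring criterion used here (largest relative gap in the projected values) is a stand-in assumption; the dissertation defines its own separation measures and statistical-validation procedure:

```python
import random

def project(X, w):
    # project each point onto direction w (dot product)
    return [sum(xi * wi for xi, wi in zip(x, w)) for x in X]

def split_quality(p):
    """Score a 1-D projection by its best two-way split:
    the largest gap between consecutive sorted values,
    relative to the total spread. Returns (score, threshold)."""
    p = sorted(p)
    best, best_t = 0.0, None
    spread = (p[-1] - p[0]) or 1.0
    for i in range(1, len(p)):
        score = (p[i] - p[i - 1]) / spread
        if score > best:
            best, best_t = score, (p[i - 1] + p[i]) / 2
    return best, best_t

def n_tarp(X, trials=200, seed=0):
    """Try `trials` random Gaussian projections and keep the one
    whose 1-D image shows the clearest separation (illustrative
    criterion, not the thesis's exact one)."""
    rng = random.Random(seed)
    d = len(X[0])
    best = (0.0, None, None)  # (score, direction, threshold)
    for _ in range(trials):
        w = [rng.gauss(0, 1) for _ in range(d)]
        score, t = split_quality(project(X, w))
        if score > best[0]:
            best = (score, w, t)
    return best
```

Splitting the data at the returned threshold yields two candidate clusters; repeating the procedure on each side gives the tree-structured, repeated use of the basic unit described above.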
|
20 |
Channel Probing for an Indoor Wireless Communications Channel
Hunter, Brandon 13 March 2003 (has links) (PDF)
The statistics of the amplitude, time and angle of arrival of multipaths in an indoor environment are all necessary components of multipath models used to simulate the performance of spatial diversity in receive antenna configurations. The model presented by Saleh and Valenzuela, later extended by Spencer et al., includes all three of these parameters for a 7 GHz channel. A system was built to measure these multipath parameters at 2.4 GHz for multiple locations in an indoor environment. Another system was built to measure the angle of transmission for a 6 GHz channel. The addition of this parameter allows spatial diversity at the transmitter, along with the receiver, to be simulated. The process of going from raw measurement data to discrete arrivals and then to clustered arrivals is analyzed. Many possible errors associated with discrete-arrival processing are discussed along with possible solutions. Four clustering methods are compared and their relative strengths and weaknesses are pointed out. The effects that errors in the clustering process have on parameter estimation and model performance are also simulated.
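The clustered-arrival structure referenced above (Saleh-Valenzuela, extended by Spencer et al.) can be illustrated with a toy generator: clusters arrive as one Poisson process, rays within each cluster as another, and mean power decays exponentially with both cluster delay and ray delay. All parameter values below are illustrative assumptions, not measurements from the thesis:

```python
import math
import random

def saleh_valenzuela(T=200.0, Lam=0.02, lam=0.5, Gamma=30.0, gamma=5.0, seed=1):
    """Generate (delay, mean power) multipath arrivals with the
    two-level Poisson structure of the Saleh-Valenzuela model:
    clusters arrive at rate Lam, rays within a cluster at rate lam,
    and mean power decays exponentially with cluster delay (Gamma)
    and with ray delay inside the cluster (gamma). Delays in ns;
    random per-ray fading is omitted for simplicity."""
    rng = random.Random(seed)
    arrivals = []
    Tl = 0.0  # cluster arrival time (first cluster at t = 0)
    while Tl < T:
        tau = 0.0  # ray delay relative to the cluster start
        while Tl + tau < T:
            power = math.exp(-Tl / Gamma) * math.exp(-tau / gamma)
            arrivals.append((Tl + tau, power))
            tau += rng.expovariate(lam)
            if power < 1e-4:  # cluster has decayed into the noise floor
                break
        Tl += rng.expovariate(Lam)
    return sorted(arrivals)
```

A real channel sounder works in the opposite direction, which is exactly the processing chain the thesis analyzes: raw measurements are reduced to discrete arrivals, and the arrivals are then grouped into clusters whose rates and decay constants are estimated.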
|