31 |
Nearest Neighbor Foreign Exchange Rate Forecasting with Mahalanobis Distance
Pathirana, Vindya Kumari, 01 January 2015
Foreign exchange (FX) rate forecasting has long been a challenging area of study. Various linear and nonlinear methods have been used to forecast FX rates. Because currency data are nonlinear and highly correlated, forecasting through nonlinear dynamical systems is becoming increasingly relevant. The nearest neighbor (NN) algorithm is one of the most commonly used nonlinear pattern recognition and forecasting methods, and it outperforms the available linear forecasting methods for high-frequency foreign exchange data. The basic idea behind the NN approach is to capture the local behavior of the data by selecting instances with similar dynamic behavior: only the k past histories most relevant to the present dynamical structure are used to predict the future. For this reason, the NN algorithm is also known as the k-nearest neighbor (k-NN) algorithm, where k is the number of chosen neighbors.
In the k-nearest neighbor forecasting procedure, similar instances are identified through a distance function. Since the forecasts depend entirely on the chosen nearest neighbors, the distance plays a key role in the k-NN algorithm, and choosing an appropriate distance can improve its performance significantly. The most commonly used distance for k-NN forecasting has been the Euclidean distance. Because of possible correlation among vectors at different time frames, distances that ignore such correlation, like the Euclidean distance, are not well suited to foreign exchange data. Since the Mahalanobis distance captures these correlations, we suggest using it in the selection of neighbors.
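A minimal sketch of this neighbor-selection step, assuming the series is delay-embedded into fixed-length windows and the forecast is the average of the chosen neighbors' successors; the embedding length, k, and the regularized covariance estimate are illustrative choices, not the thesis's exact settings.

```python
import numpy as np

def mahalanobis_knn_forecast(series, m=5, k=10):
    """One-step-ahead k-NN forecast: embed the series into windows of
    length m, find the k past windows closest to the latest one in
    Mahalanobis distance, and average their successors."""
    X = np.array([series[i:i + m] for i in range(len(series) - m + 1)])
    histories, query = X[:-1], X[-1]          # past windows vs. the current one
    successors = series[m:]                   # successors[i] follows histories[i]
    # Inverse covariance of the windows, regularized for numerical stability.
    VI = np.linalg.inv(np.cov(histories, rowvar=False) + 1e-8 * np.eye(m))
    diffs = histories - query
    d2 = np.einsum('ij,jk,ik->i', diffs, VI, diffs)   # squared Mahalanobis distances
    nearest = np.argsort(d2)[:k]
    return successors[nearest].mean()

rates = 100 + np.cumsum(np.random.randn(500))  # toy FX-like series
print(mahalanobis_knn_forecast(rates))
```

Replacing the Mahalanobis form with `np.linalg.norm(diffs, axis=1)` recovers the Euclidean variant, which is what makes the comparison between the two distances straightforward.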
In the present study, we used five of the most heavily traded foreign currencies to compare the performance of the k-NN algorithm under the traditional Euclidean and absolute distances with its performance under the proposed Mahalanobis distance. The comparisons were made in two ways: (i) forecast accuracy and (ii) the effectiveness of a technical trading rule built from the forecasts. The experiments used real FX trading data, and the results showed that the method introduced in this work outperforms the other popular methods.
Furthermore, we conducted a thorough investigation of optimal parameter choices under the different distance measures. We also adapted distance-based weighting to the NN algorithm and compared its performance with that of traditional unweighted NN forecasting.
Time series methods such as the autoregressive integrated moving average (ARIMA) process are widely used forecasting techniques in many areas. We compared the performance of the proposed Mahalanobis-distance-based k-NN forecasting procedure with that of a traditional ARIMA-based forecasting algorithm. Here, too, the forecasts were transformed into a technical trading strategy that generates buy and sell signals, and the two methods were evaluated on both forecasting accuracy and trading performance.
Multi-step-ahead forecasting is an important aspect of time series forecasting. Although many researchers claim that the k-nearest neighbor forecasting procedure outperforms linear forecasting methods for financial time series data, the available work in the literature supports this claim only for one-step-ahead forecasting. One of our goals in this work was to improve FX trading with multi-step-ahead forecasting. We adopted a popular multi-step-ahead strategy to obtain forecasts more than one day ahead, and performed a comparative study of single-step-ahead and multi-step-ahead trading strategies on the five foreign currency data sets using the Mahalanobis-distance-based k-nearest neighbor algorithm.
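The abstract names "a popular multi-step ahead forecasting strategy" without specifying it; one common choice is the iterated (recursive) strategy, sketched below on top of the one-step forecaster from the previous example. Whether this matches the thesis's exact strategy is an assumption.

```python
def multi_step_forecast(series, horizon=5, m=5, k=10):
    """Iterated multi-step forecasting: append each one-step forecast to
    the series and forecast again from the extended series."""
    extended = list(series)
    for _ in range(horizon):
        extended.append(mahalanobis_knn_forecast(np.array(extended), m=m, k=k))
    return extended[-horizon:]

print(multi_step_forecast(rates, horizon=3))  # three days ahead
```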
|
32 |
Using machine learning techniques to simplify mobile interfaces
Sigman, Matthew Stephen, 19 April 2013
This paper explores how known machine learning techniques can be applied in unique ways to simplify software and therefore dramatically increase its usability.
As software has increased in popularity, its complexity has increased in lockstep, to a point where it has become burdensome. By shifting the focus from the software to the user, great advances can be achieved by way of simplification.
The example problem used in this report is well known: suggest local dining choices tailored to a specific person based on known habits and those of similar people. By analyzing past choices and applying likely probabilities, assumptions can be made that reduce user interaction, allowing the user to realize the benefits of the software faster and more frequently. This is accomplished with Java Servlets, the Apache Mahout machine learning libraries, and various third-party resources that supply dimensions for each recommendation.
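The report's stack is Java Servlets with Apache Mahout; purely as an illustration of the underlying idea, here is a minimal user-based neighborhood recommender in Python. The rating matrix, the cosine similarity choice, and all parameters are invented for this sketch and are not taken from the report.

```python
import numpy as np

# Toy ratings: rows = users, columns = restaurants (0 = unrated). Invented data.
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

def recommend(user, k=2, n=1):
    """Score unrated items for `user` from the k most similar users
    (cosine similarity), mimicking a nearest-neighbor recommender."""
    sims = np.array([
        np.dot(ratings[user], r) /
        (np.linalg.norm(ratings[user]) * np.linalg.norm(r) + 1e-12)
        for r in ratings
    ])
    sims[user] = -np.inf                            # exclude the user themselves
    neighbors = np.argsort(sims)[-k:]
    scores = sims[neighbors] @ ratings[neighbors]   # similarity-weighted ratings
    scores[ratings[user] > 0] = -np.inf             # only recommend unrated items
    return np.argsort(scores)[-n:]

print(recommend(user=1))
```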
|
33 |
Classification of Genotype and Age by Spatial Aspects of RPE Cell Morphology
Boring, Michael, 12 August 2014
Age-related macular degeneration (AMD) is a public health concern in an aging society. The retinal pigment epithelium (RPE) layer of the eye is a principal site of pathogenesis for AMD, and morphological characteristics of the cells in the RPE layer can be used to discriminate the age and disease status of individuals. In this thesis, three genotypes of mice of various ages are used to study the predictive ability of these characteristics: the disease state is represented by two mutant genotypes and the healthy state by the wild type. Classification analysis is applied to RPE morphology from the different spatial regions of the RPE layer, with variable reduction accomplished by principal component analysis (PCA) and classification by the k-nearest neighbor (k-NN) algorithm. In this way the differential ability of the spatial regions to predict age and disease status from cellular variables is explored.
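A sketch of the pipeline described above, PCA for variable reduction followed by k-NN classification, using scikit-learn. The data, the number of retained components, and k are placeholders, not the thesis's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Placeholder data: rows = animals, columns = cell-morphology measures,
# y = genotype label (0 = wild type, 1 and 2 = mutants).
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 20))
y = rng.integers(0, 3, size=90)

model = make_pipeline(PCA(n_components=5), KNeighborsClassifier(n_neighbors=3))
print(cross_val_score(model, X, y, cv=5).mean())   # classification accuracy
```

Running the same pipeline separately on the variables from each spatial region would reproduce the thesis's region-by-region comparison.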
|
34 |
Método de mineração de dados para diagnóstico de câncer de mama baseado na seleção de variáveis / A data mining method for breast cancer diagnosis based on selected features
Holsbach, Nicole, January 2012
This dissertation presents a data mining method for breast cancer (BC) diagnosis based on selected features. Starting from a systematic literature review, we suggest a method for feature selection and for classifying observations (patients) into benign or malignant classes based on cytopathological measures of patients' breast tissue samples. The proposed method relies on four operational steps: (i) split the original dataset into training and testing sets, and apply principal component analysis (PCA) to the training set; (ii) generate feature importance indices based on the PCA weights and the percentage of variance explained by the retained components; (iii) classify the training set using k-nearest neighbor (KNN) or discriminant analysis (DA), then eliminate the feature with the lowest importance index, reclassify the dataset, and recompute the accuracy, iterating until a single feature remains; and (iv) choose the subset of features yielding the maximum classification accuracy, and classify the testing set using those features. When applied to the Wisconsin Breast Cancer Database (WBCD), the proposed method achieved an average classification accuracy of 97.77% while retaining an average of 5.8 features. A variation of the method is also proposed, in which four different types of polynomial kernels remap the original database before steps (i) to (iv) are applied. Applied to the WBCD, this modification increased the average accuracy to 98.09% while retaining an average of 17.24 of the 54 features generated by the recommended kernel. The proposed method can assist the physician in making the diagnosis, selecting a smaller number of decision variables while achieving the highest possible accuracy.
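A compact sketch of steps (i)-(iv) using scikit-learn; its built-in Wisconsin diagnostic dataset stands in for the WBCD, and the train/test split, k, and the exact form of the importance index are illustrative assumptions based on the abstract's description.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)        # stands in for the WBCD
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)   # step (i)

# Step (ii): importance index = |PCA loadings| weighted by explained variance.
pca = PCA().fit(Xtr)
importance = np.abs(pca.components_.T) @ pca.explained_variance_ratio_

# Step (iii): drop the least important feature at a time, tracking accuracy.
order = np.argsort(importance)                    # ascending importance
best_feats, best_acc = None, 0.0
for drop in range(len(order)):
    keep = order[drop:]                           # remaining features
    acc = KNeighborsClassifier(5).fit(Xtr[:, keep], ytr).score(Xtr[:, keep], ytr)
    if acc >= best_acc:                           # ties favor fewer features
        best_feats, best_acc = keep, acc

# Step (iv): classify the test set with the best subset.
print(KNeighborsClassifier(5).fit(Xtr[:, best_feats], ytr)
      .score(Xte[:, best_feats], yte))
```

The kernel variation amounts to remapping `X` with a polynomial kernel expansion before step (i) and running the same loop over the generated features.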
|
37 |
Metric space indexing for nearest neighbor search in multimedia context / Indexação de espaços métricos para busca de vizinho mais próximo em contexto multimídia
Silva, Eliezer de Souza da, 26 August 2018
Advisor: Eduardo Alves do Valle Junior. Master's thesis (2014), Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação.
The increasing availability of multimedia content poses a challenge for information retrieval researchers. Users want not only to have access to multimedia documents, but also to make sense of them: the ability to find specific content in extremely large collections of textual and non-textual documents is paramount. At such large scales, multimedia information retrieval systems must rely on the ability to perform search by similarity efficiently. However, multimedia documents are often represented by high-dimensional feature vectors, or by other complex representations in metric spaces, and providing efficient similarity search for that kind of data is extremely challenging. In this project, we explore one of the most cited families of solutions for similarity search, Locality-Sensitive Hashing (LSH), which is based on the creation of hash functions that assign, with higher probability, the same key to data that are similar. LSH is available only for a handful of distance functions, but, where available, it has been found to be extremely efficient for architectures with uniform access cost to the data. Most existing LSH functions are restricted to vector spaces.
We propose two novel LSH methods (VoronoiLSH and VoronoiPlex LSH) for generic metric spaces based on metric hyperplane partitioning (random centroids and k-medoids). We present a comparison with well-established LSH methods in vector spaces and with recent competing methods for metric spaces. We develop a theoretical probabilistic model of the behavior of the proposed algorithms and show some relations and bounds for the probability of hash collision; among the algorithms proposed for generalizing LSH to metric spaces, this theoretical development is new. Although the problem is very challenging, our results demonstrate that it can be successfully tackled. This dissertation presents the development of the methods, their theoretical formulation, and an experimental discussion of their performance.
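The core idea behind VoronoiLSH, as the abstract describes it, can be sketched as hashing each point to the index of its nearest centroid among a randomly chosen sample, so that nearby points tend to collide. The parameters, the Euclidean placeholder metric, and the multi-table query procedure below are illustrative assumptions, not the thesis's exact algorithm.

```python
import numpy as np

class VoronoiHash:
    """One hash function: key = index of the nearest of `c` randomly
    chosen centroids, under an arbitrary metric `dist`."""
    def __init__(self, data, c=8, dist=None, rng=None):
        rng = rng or np.random.default_rng()
        self.dist = dist or (lambda a, b: np.linalg.norm(a - b))
        self.centroids = data[rng.choice(len(data), size=c, replace=False)]

    def __call__(self, x):
        return int(np.argmin([self.dist(x, c) for c in self.centroids]))

# Index: several independent hash tables, as in standard LSH.
data = np.random.randn(1000, 16)
tables = []
for _ in range(4):
    h = VoronoiHash(data, c=8)
    buckets = {}
    for i, x in enumerate(data):
        buckets.setdefault(h(x), []).append(i)
    tables.append((h, buckets))

# Query: collect candidates from matching buckets, then check exact distances.
q = np.random.randn(16)
candidates = {i for h, b in tables for i in b.get(h(q), [])}
print(min(candidates, key=lambda i: np.linalg.norm(data[i] - q)))
```

Because the hash only ever calls `dist`, the same construction works for any metric space, which is what distinguishes this family from vector-space LSH.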
|
38 |
Exploração de dados multivariados de fontes e extratos de antocianinas utilizando análise de componentes principais e método do vizinho mais próximo / Exploring multivariate data of sources and extracts of anthocyanins using principal component analysis and the nearest neighbor method
Favaro, Martha Maria Andreotti, 20 August 2018
Advisor: Adriana Vitorino Rossi. Doctoral thesis (2012), Universidade Estadual de Campinas, Instituto de Química.
Anthocyanins (ACYS) are natural dyes responsible for the color of fruits, vegetables, flowers, and grains. New prospects for using anthocyanins in various industries motivate analytical studies to systematize the identification and classification of sources and extracts of these dyes. In this work, typical Brazilian fruits: mulberry (Morus nigra), blackberry (Rubus sp.), jaboticaba (Myrciaria cauliflora), jambolan (Syzygium cumini), jussara fruit (Euterpe edulis Mart.), strawberry (Fragaria x ananassa Duch), and grapes (Vitis vinifera and Vitis vinifera L. 'Brazil'); vegetables: red lettuce (Lactuca sativa), eggplant (Solanum melongena), purple onion (Allium cepa), radish (Raphanus sativus), and red cabbage (Brassica oleracea); and flowers: busy lizzie (Impatiens walleriana), geranium (Pelargonium hortorum and Pelargonium peltatum L.), hibiscus (Hibiscus sinensis and Hibiscus syriacus), and hydrangea (Hydrangea macrophylla) were used as sources of ACYS. The literature describes several techniques for analyzing ACYS in plants and their extracts, with emphasis on high-performance liquid chromatography (HPLC), mass spectrometry (MS), and UV-VIS spectrophotometry. All of these techniques were applied in this work, along with reflectance spectrophotometry and micellar electrokinetic chromatography (MEKC), a capillary electromigration technique.
The chemometric tools used for data handling were principal component analysis (PCA) and the k-nearest neighbor (KNN) method. The chemometric classification models obtained are robust, with prediction errors below 30%, and make it possible to identify the sources of ACYS, the extraction solvent, the age of the extracts, and their stability and storage conditions. The results show that data obtained from simple analytical techniques such as absorption spectrophotometry, and from techniques requiring no sample preparation such as diffuse reflectance in the visible region, are comparable to results obtained from sophisticated and expensive techniques such as HPLC and MEKC, and even surpass some of the information obtained by MS.
|
39 |
Découverte d'évènements par contenu visuel dans les médias sociaux / Visual-based event mining in social media
Trad, Riadh, 05 June 2013
The ease of publishing content on social media sites brings to the Web an ever-increasing amount of user-generated content captured during, and associated with, real-life events. Social media documents shared by users often reflect their personal experience of the event; hence, an event can be seen as a set of personal and local views recorded by different users. These event records are likely to exhibit similar facets of the event, but also specific aspects. By linking different records of the same event occurrence we can enable rich search and browsing of social media event content; in particular, linking all the occurrences of the same event would provide a general overview of it. In this dissertation we present a content-based approach for leveraging the wealth of social media documents available on the Web for event identification and characterization. To match event occurrences in social media, we develop a new visual-based method for retrieving events in huge photo collections, typically in the context of user-generated content. The main contributions of the thesis are the following: (1) a new visual-based method for retrieving events in photo collections; (2) a scalable and distributed framework for nearest neighbor graph construction for high-dimensional data; and (3) a collaborative content-based filtering technique for selecting relevant social media documents for a given event.
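Contribution (2) concerns k-nearest-neighbor graph construction. As a baseline definition only (the thesis's framework is scalable and distributed, which this brute-force version is not), the structure being built looks like this:

```python
import numpy as np

def knn_graph(X, k=3):
    """Brute-force k-NN graph: edge (i -> j) iff j is among the k nearest
    neighbors of i under Euclidean distance."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # no self-edges
    return {i: list(np.argsort(row)[:k]) for i, row in enumerate(d)}

features = np.random.randn(100, 64)              # e.g., visual descriptors of photos
print(knn_graph(features, k=3)[0])               # neighbors of photo 0
```

The quadratic cost of this definition is exactly why scalable, distributed approximations are needed for Web-sized photo collections.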
|
40 |
Data mining inom tillverkningsindustrin : En fallstudie om möjligheten att förutspå kvalitetsutfall i produktionslinjer (Data mining in the manufacturing industry: a case study on the possibility of predicting quality outcomes in production lines)
Janson, Lisa; Mathisson, Minna, January 2021
As the adaptation toward Industry 4.0 proceeds, the possibility of using machine learning as a tool for the further development of industrial production becomes increasingly relevant. In this paper, a case study was conducted at Volvo Group in Köping to investigate the possibility of predicting quality outcomes in the compression of hub and mainshaft. Three different machine learning models were implemented, compared against each other, and trained and evaluated on a dataset from Volvo's production site in Köping. The low evaluation scores obtained indicate that the quality outcome of the compression could not be predicted from the included variables alone. Therefore, a second dataset was constructed containing three additional variables with fabricated values and a known causality between two of the variables and the quality outcome. The purpose was to determine whether the poor evaluation metrics resulted from a non-existent pattern between the included variables and the quality outcome, or from the models being unable to find such a pattern.
When trained and evaluated on the fabricated dataset, the models were in fact able to find the pattern known to exist, so the poor initial result is attributed to the real assembly data rather than to the models. Support vector machine was the model that performed best, given the evaluation metrics chosen in this study. Consequently, if the traceability of the components were enhanced in the future and more machines in the production line transmitted data to a connected system, the study could be conducted again with additional variables and a larger dataset. The fact that the models succeeded in finding patterns when such patterns were known to exist motivates their use in future studies. Furthermore, with enhanced traceability and an increasingly connected factory, machine learning models could serve as components in larger business monitoring systems in order to achieve efficiency gains.
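A sketch of the validation trick described above: fabricate process variables, inject a known relationship between two of them and the quality label, and check that the models recover it. All variable names, data, and model choices here are invented; beyond the support vector machine, the abstract does not name the study's three models.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000
press_force = rng.normal(50, 5, n)      # fabricated process variables
alignment = rng.normal(0, 1, n)
noise_var = rng.normal(0, 1, n)         # deliberately unrelated to the label

# Synthetic causality: quality fails when force is low or misalignment is high.
quality_ok = ((press_force > 48) & (np.abs(alignment) < 1.2)).astype(int)
X = np.column_stack([press_force, alignment, noise_var])

# Any competent model should score well above chance on this data.
for model in (SVC(), KNeighborsClassifier(), LogisticRegression()):
    print(type(model).__name__, cross_val_score(model, X, quality_ok, cv=5).mean())
```

If the same models then score near chance on the real assembly data, the shortfall points to the data rather than the models, which is the study's conclusion.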
|