Global ETD Search

41	Interpreting Random Forest Classification Models Using a Feature Contribution Method Palczewska, Anna Maria, Palczewski, J., Marchese-Robinson, R.M., Neagu, Daniel 18 February 2014 Model interpretation is one of the key aspects of the model evaluation process. The explanation of the relationship between model variables and outputs is relatively easy for statistical models, such as linear regressions, thanks to the availability of model parameters and their statistical significance. For “black box” models, such as random forest, this information is hidden inside the model structure. This work presents an approach for computing feature contributions for random forest classification models. It allows the influence of each variable on the model prediction to be determined for an individual instance. By analysing feature contributions for a training dataset, the most significant variables can be determined and their typical contributions towards predictions made for individual classes, i.e., class-specific feature contribution “patterns”, can be discovered. These patterns represent a standard behaviour of the model and allow for an additional assessment of the model reliability for new data. Interpretation of feature contributions for two UCI benchmark datasets shows the potential of the proposed methodology. The robustness of results is demonstrated through an extensive analysis of feature contributions calculated for a large number of generated random forest models. Random forest Classification Variable importance Feature contribution Cluster analysis
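The feature contribution idea in entry 41 can be illustrated with a minimal sketch. It assumes a binary problem, that each node stores the fraction of positive-class training instances that reached it, and that a feature's contribution is the change in that fraction at every split on the feature, averaged over the trees. The nested-dict tree layout and the names `TREE_A`, `TREE_B`, `tree_contributions` and `forest_contributions` are hypothetical and are not taken from the paper.

```python
# Hypothetical hand-built trees: "value" is the assumed fraction of
# positive-class training instances that reached the node.
TREE_A = {
    "value": 0.5, "feature": "x1", "threshold": 2.0,
    "left": {"value": 0.2},
    "right": {
        "value": 0.8, "feature": "x2", "threshold": 1.0,
        "left": {"value": 0.6},
        "right": {"value": 1.0},
    },
}
TREE_B = {
    "value": 0.5, "feature": "x2", "threshold": 0.5,
    "left": {"value": 0.1},
    "right": {"value": 0.7},
}


def tree_contributions(tree, x):
    """Follow instance x down one tree.

    Returns the root value and a dict mapping each feature to the sum of
    (child value - parent value) over the splits made on that feature.
    """
    node = tree
    contributions = {}
    while "feature" in node:  # leaves carry no "feature" key
        feature = node["feature"]
        child = node["left"] if x[feature] <= node["threshold"] else node["right"]
        step = child["value"] - node["value"]
        contributions[feature] = contributions.get(feature, 0.0) + step
        node = child
    return tree["value"], contributions


def forest_contributions(trees, x):
    """Average the root values and per-feature contributions over all trees."""
    prior = 0.0
    totals = {}
    for tree in trees:
        root_value, contributions = tree_contributions(tree, x)
        prior += root_value / len(trees)
        for feature, value in contributions.items():
            totals[feature] = totals.get(feature, 0.0) + value / len(trees)
    return prior, totals


prior, contribs = forest_contributions([TREE_A, TREE_B], {"x1": 3.0, "x2": 2.0})
print(prior, contribs)
```

Under these assumptions the averaged root value plus the sum of the contributions equals the forest's averaged leaf value for the instance, which is what makes the decomposition readable per feature.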
42	Modelagem da produtividade da cultura da cana de açúcar por meio do uso de técnicas de mineração de dados / Modeling sugarcane yield through Data Mining techniques Hammer, Ralph Guenther 27 July 2016 Understanding the hierarchy of importance of the factors that influence sugarcane yield can support its modeling, thus contributing to the optimization of agricultural planning at the sector's production units and to better crop yield estimates. The objectives of this study were to rank the variables that determine sugarcane yield according to their relative importance and to develop mathematical models for predicting sugarcane yield. To this end, three data mining techniques were applied to databases from several sugar mills in the State of São Paulo, Brazil. Meteorological and crop management variables were analyzed with Random Forest, Boosting and Support Vector Machines, and the resulting models were tested against an independent data set using the correlation coefficient (r), the Willmott index (d), the Camargo confidence index (c), the mean absolute error (MAE) and the root mean square error (RMSE). Finally, the predictive performance of these models was compared with that of an agrometeorological model applied to the same data set. Of all the variables analyzed, the number of cuts was the most important factor in all data mining models. Comparing the observed yields with those estimated by the data mining techniques gave an RMSE ranging from 19.70 to 20.03 t ha⁻¹ in the most general approach, which covers all regions in the database. The predictive performance of the data mining algorithms was thus superior to that of the agrometeorological model, whose RMSE was ≈ 70% higher (≈ 34 t ha⁻¹). Agricultural planning Boosting Prediction Random forest Support vector machines
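The evaluation statistics named in entry 42 can be computed as in the sketch below. It uses the usual textbook definitions, which the thesis may or may not follow exactly: Pearson's r, Willmott's index of agreement d, and the Camargo confidence index taken as c = r × d. The function name `evaluation_metrics` and the yield values are made up for illustration.

```python
import math


def evaluation_metrics(observed, predicted):
    """Return r, Willmott d, Camargo c, MAE and RMSE for paired samples."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    errors = [p - o for o, p in zip(observed, predicted)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # Pearson correlation coefficient.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / math.sqrt(var_o * var_p)

    # Willmott index: 1 - squared error / potential error around the observed mean.
    potential = sum(
        (abs(p - mean_o) + abs(o - mean_o)) ** 2
        for o, p in zip(observed, predicted)
    )
    d = 1.0 - sum(e * e for e in errors) / potential

    # Camargo confidence index (assumed here to be the product r * d).
    c = r * d
    return {"r": r, "d": d, "c": c, "mae": mae, "rmse": rmse}


# Made-up yields in t/ha: every prediction is 10 t/ha too high.
observed_yield = [80.0, 90.0, 100.0, 110.0]
predicted_yield = [90.0, 100.0, 110.0, 120.0]
print(evaluation_metrics(observed_yield, predicted_yield))
```

In this toy case r stays at 1 even though every prediction is biased, while d drops below 1. A correlation coefficient on its own can hide systematic error, which is one reason studies report error measures alongside it.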
44	Análise e predição de bilheterias de filmes FLORÊNCIO, João Carlos Procópio 29 February 2016 Predicting the success of a movie and, consequently, its box office success is of great importance in the motion picture industry, from the pre-production phase, when investors want to know which movies are the most promising, to the weeks following release, when exhibitors want to predict the box office of the remaining weeks of exhibition. As a result, this area has been the subject of many studies that have used different prediction approaches, in both feature selection and learning methods, to improve the ability to predict a movie's success. In this master's thesis, the main features of movies (genre, MPAA rating, production budget, etc.) were investigated, with a focus on box office results and their relation to those features, in order to get a clearer view of how movie features can influence success, whether success is interpreted as profit or as box office revenue. Then, using a movie database extracted from Box-Office Mojo and IMDb, a new box office prediction model was proposed based on the data available in that database: movie metadata, keywords and box office data. Some of these features are hybridized to highlight the most important feature combinations. A feature selection process is also applied to exclude irrelevant features. The model uses Random Forest as its learning algorithm. The results obtained with the proposed method suggest that, besides being simpler than the models in previous studies, it achieves a hit rate above 90% when classification is measured with the 1-away metric (a sample counts as correct when it is classified within one class of the right class), and that it improves prediction quality relative to previous studies when tested on the available database. Box Office Prediction Recommendation Box-Office Movies Random Forest
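The 1-away metric mentioned in entry 44 can be sketched as follows, assuming the box office classes are ordinal and encoded as consecutive integers (for example revenue bins 0 to 8). The encoding and the name `one_away_accuracy` are assumptions, not details from the thesis.

```python
def one_away_accuracy(true_classes, predicted_classes):
    """Fraction of samples predicted within one class of the true class."""
    if len(true_classes) != len(predicted_classes):
        raise ValueError("inputs must have the same length")
    if not true_classes:
        raise ValueError("inputs must not be empty")
    hits = sum(
        1 for t, p in zip(true_classes, predicted_classes) if abs(t - p) <= 1
    )
    return hits / len(true_classes)


# Made-up labels: only the third sample misses by more than one class.
true_bins = [0, 1, 2, 3, 4]
predicted_bins = [0, 2, 4, 3, 3]
print(one_away_accuracy(true_bins, predicted_bins))
```

The metric is more lenient than exact accuracy, which would count only two of these five samples as correct, so the two figures should not be compared directly.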
45	Uso de random forests e redes biológicas na associação de poliformismos à doença de Alzheimer ARAÚJO, Gilderlanio Santana de 07 March 2013 FACEPE / The development of low-cost genotyping techniques (SNP arrays) and the annotation of thousands of single nucleotide polymorphisms (SNPs) in public databases have led to an increasing number of Genome-Wide Association Studies (GWAS). In these studies, a very large number of SNPs (hundreds of thousands) are evaluated with univariate statistical methods in order to find SNPs associated with a particular phenotype. Univariate tests are unable to capture high-order relationships among SNPs, which are common in complex genetic diseases, and are affected by the high correlation between SNPs in the same genomic region. Machine learning methods, such as Random Forest (RF), have been applied to GWAS data to predict disease risk and to capture the set of SNPs associated with it. Although RF is a method with recognized performance on high-dimensional data and the capacity to capture non-linear relationships, using all the SNPs present in GWAS data is computationally intractable. In this study we propose the use of biological networks for the initial selection of candidate SNPs to be used by RF. Starting from an initial set of genes already related to the disease in the literature, we use tools for constructing gene-gene interaction networks to find novel genes that might be associated with the disease. A reduced number of SNPs can then be extracted, which makes the application of RF feasible. The experiments conducted in this study focus on investigating which polymorphisms may influence susceptibility to Alzheimer’s disease (AD) and mild cognitive impairment (MCI). The outcome is an outline of a methodology for using RF to analyze GWAS data, together with a characterization of potential risk factors for AD.
46	Le rôle des facteurs environnementaux sur la concentration des métaux-traces dans les lacs urbains - Lac de Pampulha, Lac de Créteil et 49 lacs péri-urbains d’Ile de France / The role of environmental factors on trace-metal concentrations in urban lakes - Lake Pampulha, Lake Créteil and 49 lakes in the Ile-de-France region Tran khac, Viet 19 December 2016 Lakes have a particular influence on the water cycle in urban catchments. Thermal stratification and a long water residence time in the lake boost phytoplankton production. Most metals are naturally found in the environment in trace amounts. Trace metals are essential to the growth and reproduction of organisms. However, some are also well known for their toxic effects on animals and humans. Total metal concentrations do not reflect their ecotoxicity, which depends on their properties and speciation (particulate, dissolved: labile or bioavailable and inert fractions). Trace metals can be adsorbed to various components in aquatic systems, including inorganic and organic ligands. The ability of metals to bind to dissolved organic matter (DOM), in particular humic substances, has been widely studied. In urban lakes, phytoplankton development can produce autochthonous DOM, non-humic substances that can also bind metals. However, there are few studies of trace metal speciation in the lake water column. The main objectives of this thesis are (1) to obtain a consistent database of trace metal concentrations in the water column of representative urban lakes; (2) to assess their bioavailability through an adapted speciation technique; (3) to analyze the seasonal and spatial evolution of the metals and their speciation; (4) to study the potential impact of environmental variables, particularly of dissolved organic matter related to phytoplankton production, on metal bioavailability; and (5) to link the metal concentrations to land use in the lake watershed. Our methodology is based on a dense field survey of the water bodies, complemented by specific laboratory analyses. The research was conducted on three study sites: Lake Créteil (France), Lake Pampulha (Brazil) and a panel of 49 peri-urban lakes (Ile de France). Lake Créteil is an urban lake impacted by anthropogenic pollution. It is equipped with a large number of monitoring devices, which allowed us to collect part of the data set. In the Lake Pampulha catchment, the anthropogenic pressure is high. Lake Pampulha faces many point and non-point pollution sources. The climate and limnological characteristics of the two lakes are also very different. The panel of 49 lakes in Ile de France was sampled once during three successive summers (2011-2013); they provided us with a synoptic, representative data set of regional metal contamination in a densely anthropized region. To explain the role of the environmental variables in the metal concentrations, we applied the Random Forest model to the Lake Pampulha data set and to the 49 urban lakes data set with two specific objectives: (1) in Lake Pampulha, to understand the role of environmental variables in the labile concentration of trace metals, considered potentially bioavailable, and (2) in the 49 lakes, to understand the relationship between the environmental variables, particularly the watershed variables, and the dissolved metal concentrations. The analysis of the relationships between trace metal speciation and the environmental variables provided the following key results of this thesis. In Lake Pampulha, around 80% of the variance of labile cobalt is explained by a few limnological variables: Chl a, O2, pH and total phosphorus. For the other metals, the RF model did not succeed in explaining more than 50% of the relationships between the metals and the limnological variables. In the 49 urban lakes in Ile de France, the RF model gave a good result for Co (66% of explained variance) and a very satisfying one for Ni (86% of explained variance). For Ni, the best explanatory variables are land-use variables such as “activities” (facilities for water, sanitation and energy, logistical warehouses, shops, offices…) and “landfill”. This result fits with Lake Créteil, where the dissolved Ni concentration is particularly high and where the “activities” and “landfill” land-use categories are dominant. Urban lakes Trace metal Speciation Bioavailability Landuse Random Forest model
47	Contribution to automatic adjustments of vertebrae landmarks on x-ray images for 3D reconstruction and quantification of clinical indices / Contribution aux ajustements automatiques de points anatomiques des vertèbres pour la reconstruction 3D et la quantification d’indices cliniques Ebrahimi, Shahin 12 December 2017 Exploitation of spine radiographs, in particular for 3D spine shape reconstruction of scoliotic patients, is a prerequisite for personalized modelling. Current methods, even though robust enough to be used in clinical routine, still rely on tedious manual adjustments. In this context, this PhD thesis aims at automated detection of specific vertebrae landmarks in spine radiographs, enabling automated adjustments. In the first part, we developed an original Random Forest based framework for vertebrae corner localization that was applied to sagittal radiographs of both the cervical and lumbar spine regions. A rigorous evaluation confirms the robustness and high accuracy of the proposed method. In the second part, we developed an algorithm for the clinically important task of pedicle localization in the thoracolumbar region on frontal radiographs. The proposed algorithm compares favourably to similar methods from the literature while relying on less manual supervision. The last part of this PhD tackled the scarcely studied task of joint detection, identification and segmentation of the spinous processes of cervical vertebrae in sagittal radiographs, again with high precision. All three algorithmic solutions were designed around a generic framework exploiting dedicated visual feature descriptors and multi-class Random Forest classifiers, proposing a novel solution whose computational and manual supervision burdens are aimed at translation into clinical use. Overall, the presented frameworks show great potential for integration into the current 3D spine reconstruction frameworks that are used in daily clinical routine. Spine X-Ray Vertebrae landmarks Random Forest Visual features Contextual features
48	Ensemble Models for Trend Investing / Ensemble modeller för trendinvesteringar Book, Emil, Gnem, Emil January 2021 Portfolio strategies that focus on following the trend, so-called momentum-based strategies, have long been popular among investors and have been the subject of many academic studies, though with varying results. This study investigates different momentum trading signals, combines them in ensemble models such as Random Forest and the unique Dim Switch portfolio, and then compares them to set benchmarks. Only one of the benchmarks, the 100% equity portfolio, is found to have better returns than the constructed momentum-based strategies. The momentum-based strategies nevertheless show a lot of potential, with high risk-adjusted returns and good performance with regard to Expected Shortfall, Value at Risk and Maximum Drawdown. The most common momentum trading signal, the momentum rule with a 9-month lookback, was found to have the highest risk-adjusted returns compared to both the benchmarks and the ensemble models, but it was also found to have a slightly heavier left tail than the ensemble models. Momentum Machine Learning Random Forest Trend Investing Dim Switch Other Mathematics
49	Differentially Private Random Forests for Network Intrusion Detection in a Federated Learning Setting Frid, Alexander January 2023 (has links) För varje dag som går möter stora industrier en ökad mängd intrång i sina IT-system. De flesta befintliga verktyg som använder sig utav maskininlärning är starkt beroende av stora mängder data, vilket innebär risker under dataöverföringen. Därför har syftet med denna studie varit att undersöka om en decentraliserad integritetsbevarande strategi kan vara ett bra alternativ för att minska effektiviteten av dessa attacker. Mer specifikt skulle användningen av Random Forests, en av de mest populära algoritmerna för maskininlärning, kunna utökas med decentraliseringstekniken Federated Learning tillsammans med Differential Privacy, för att skapa en ideal metod för att upptäcka nätverksintrång? Med hjälp av befintliga kodbibliotek för maskininlärning och verklighetsbaserad data har detta projekt konstruerat olika modeller för att simulera hur väl olika decentraliserade och integritetsbevarande modeller kan jämföras med traditionella alternativ. De skapade modellerna innehåller antingen Federated Learning, Differential Privacy eller en kombination av båda. Huvuduppgiften för dessa modeller är att förbättra integriteten och samtidigt minimera minskningen av precision. Resultaten indikerar att båda teknikerna kommer med en liten minskning i noggrannhet jämfört med traditionella alternativ. Huruvida precisionsförlusten är acceptabel eller inte beror på det specifika användningsområdet. Det utvecklade kombinerade alternativet lyckades dock inte nå acceptabel precision vilket hindrar oss från att dra några slutsatser. / With each passing day, large industries face an increasing number of intrusions into their IT environments. Most existing machine learning countermeasures rely heavily on large amounts of data, which introduces risk during data transmission. 
Therefore, the objective of this study has been to investigate whether a decentralized privacy-preserving approach could be a sensible alternative to decrease the effectiveness of these attacks. More specifically, could the use of Random Forests, one of the most popular machine learning algorithms, be extended using the decentralization technique Federated Learning in combination with Differential Privacy in order to create an ideal approach for network intrusion detection? With the assistance of existing machine learning code libraries and real-life data, this thesis has constructed various experimental models to simulate how well different decentralized and privacy-preserving approaches compare to traditional ones. The models created incorporate either Federated Learning, Differential Privacy or a combination of both. The main task of these models is to enhance privacy while minimizing the decrease in accuracy. The results indicate that both techniques come with a small decrease in accuracy compared to traditional alternatives. Whether the accuracy loss is acceptable or not may depend on the specific scenario. The developed combined approach, however, failed to reach acceptable accuracy, which prevents us from drawing any conclusions. Machine Learning Random Forest Federated Learning Differential Privacy Maskininlärning Random Forest Federated Learning Differential Privacy Software Engineering Programvaruteknik
50	Churnprediktion baserat på kundens första köp / Churn prediction based on the customer's first purchase Ivarsson Orrelid, Christoffer, Pettersson, Oskar, Thornander, Jonathan January 2022 (has links) Många företag drabbas regelbundet av churn, ett tillstånd som innebär att existerande kunder slutar handla hos företaget eller använda företagets tjänster för att istället vända sig till konkurrenter. För att säkerställa lojalitet bland kunderna behöver företag därför etablera metoder för att tidigt vinna kundens tillit. Med hjälp av maskininlärning kan processen att identifiera churn automatiseras, så kallad churnprediktion. Mycket forskning finns kring churnprediktion, framförallt inom telekomsektorn och inom företag som erbjuder prenumerationstjänster. Majoriteten av tidigare exempel bygger dock på kunddata som samlats in från flera tidpunkter och syftar till att predicera churn inom en längre tidsperiod, vanligtvis inom ett år. Det finns färre exempel inom kontexten e-handeln, samt forskning om hur maskininlärning kan tillämpas för att enbart utifrån data från kundens första köp och inom en kortare tidsperiod identifiera churn. I denna studie har två maskininlärningsmodeller utvecklats baserat på Random Forest-algoritmen och Logistisk Regression-algoritmen. Syftet var att undersöka vilken algoritm som är bäst lämpad för att predicera om en given kund kommer handla igen eller inte inom en tremånadersperiod, enbart med data från kundens första köp. Undersökningen baserades på data från ett svenskt e-handelsföretag. Modellerna utvärderades med mått för klassificeringsproblem, bland annat Cohen’s kappa och AUC. Trots att Logistisk regression visar sig prestera något bättre tyder resultaten på att båda modellerna har generellt svårt att avgöra om kunden kommer utsätta företaget för churn eller ej. En möjlig förklaring anses vara datamängdens restriktivitet som endast innehåller data från kundens första köp. 
Däremot konstateras båda modellernas möjlighet att filtrera ut kunder som löper hög risk att utsätta företaget för churn, där Random Forest visar sig vara något bättre på detta. Slutligen konstaterades att modellerna inte påvisar kraftig förbättring jämfört med en naiv lösning där alla kunder antas utsätta företaget för churn, men eftersom även små förbättringar innebär att företaget kan spara pengar kan dock modellernas användbarhet motiveras. / Companies are continuously affected by churn, a condition where existing customers turn to competitors instead of using the company’s services. To ensure customer loyalty, it is vital for the company to establish methods to gain the customers’ trust early on. With the help of machine learning, the process of identifying churn can be automated, which is known as churn prediction. Research on churn prediction is abundant, especially concerning the telecom sector and subscription-based services. Most of these articles, however, are based on additional, historical data surrounding the customer, aiming to predict churn within a longer time frame, usually a year. Articles that focus on e-commerce and on how machine learning can be applied to identify churn within a short period, based solely on data from the customer’s first purchase, are scarce. Two machine learning models were developed based on the Random Forest algorithm and the Logistic Regression algorithm. These were tested to see which algorithm is best suited for predicting whether a given customer will buy again or not within a three-month period, using only data from the customer's first purchase at a Swedish e-commerce company. The models were then evaluated with classification metrics, including Cohen’s kappa and AUC. Despite the fact that Logistic Regression performed slightly better, the results showed that both models struggled with the churn prediction. A possible explanation is the restrictiveness of the data set. 
However, when the calibration points on the models’ confidence were changed so that customers with a greater chance of churning could be filtered out, both models performed better, with Random Forest being slightly superior. The models are considered a slight improvement over a naïve solution where all customers are treated as possible churn. They are also useful given the context, where even minor prevention of churn can lead to profit for the company. Random Forest Logistic Regression Churn Prediction E-commerce Random Forest Logistisk Regression Churnprediktion E-handel Information Systems

Search results