1 |
Improving detection of promising unrefined protein docking complexes
Rörbrink, Malin. January 2016
Understanding protein-protein interactions (PPIs) is important for understanding cellular processes. X-ray crystallography and mutagenesis are the most reliable methods for detecting PPIs, but they are expensive in both time and resources. Computational approaches could therefore reduce the time and resources spent on detecting PPIs. In this master's thesis, a method called cProQPred was created for scoring how realistic coarse PPI models are. cProQPred uses the machine learning method Random Forest, trained on previously calculated features from the programs ProQDock and InterPred. By combining some of ProQDock's features with the InterPred score, cProQPred achieved higher performance than both ProQDock and InterPred. This work also tried to predict the quality of the PPI model after refinement and the chance of a coarse PPI model succeeding at refinement. The results showed that the predicted quality of a coarse PPI model was also a relatively good prediction of the quality the model would reach after refinement. Predicting the chance of a coarse PPI model succeeding at refinement was, however, unsuccessful.
|
2 |
Automated application-specific optimisation of interconnects in multi-core systems
Almer, Oscar Erik Gabriel. January 2012
In embedded computer systems there are often tasks, implemented as stand-alone devices, that are both application-specific and compute-intensive. A recurring problem in this area is to design these application-specific embedded systems as close to the power and efficiency envelope as possible. Work has been done on optimising single-core systems and memory organisation, but current methods for achieving system design goals are proving limited as system capabilities and system size increase in the multi- and many-core era. To address this problem, this thesis investigates machine learning approaches to managing the design space presented in the interconnect design of embedded multi-core systems. The design space is large due to the system scale and level of interconnectivity, and it also features inter-dependent parameters, further complicating analysis. The results presented in this thesis demonstrate that machine learning approaches, particularly wkNN and random forest, work well in handling the complexity of the design space. The benefits of this approach lie in automation, saving time and effort in the system design phase as well as energy and execution time in the finished system.
|
3 |
Studying the ability of finding single and interaction effects with Random Forest, and its application in psychiatric genetics
Neira Gonzalez, Lara Andrea. January 2018
Psychotic disorders such as schizophrenia and bipolar disorder have a strong genetic component. The aetiology of psychoses is known to be complex, including additive effects from multiple susceptibility genes, interactions between genes, environmental risk factors, and gene-by-environment interactions. With the development of new technologies such as genome-wide association studies and imputation of ungenotyped variants, the amount of genomic data has increased dramatically, making the use of machine learning techniques necessary. Random Forest has been widely used to study the genetic factors underlying psychiatric disorders, such as epistasis and gene-gene interactions. Several authors have investigated the ability of this algorithm to find single and interaction effects, but have reported contradictory results. Therefore, to examine Random Forest's ability to detect single and interaction effects based on different variable importance measures, I conducted a simulation study assessing whether the algorithm was able to detect single and interaction models under different correlation conditions. The results suggest that the best variable importance measure to use in real situations under correlation is the unconditional unscaled permutation variable importance measure. Several studies have shown bias in one of the most popular variable importance measures, the Gini importance. Hence, in a second simulation study I examined whether the Gini variable importance is influenced by the variability of the predictors, the precision with which they are measured, and the variability of the error. Evidence of other biases in this variable importance measure was found. The results from the first simulation study were used to study whether genes related to 29 molecular biomarkers, which have been associated with schizophrenia, influence risk for schizophrenia in a case-control study of 26476 cases and 31804 controls from 39 different European ancestry cohorts.
Single effects from the ACAT2 and TNC genes were found to contribute to risk for schizophrenia. ACAT2 is a gene on chromosome 6 that is related to energy metabolism; transcriptional differences have been shown in schizophrenia brain tissue studies. TNC is expressed in the brain, where it is involved in the migration of neurons and axons. In addition, we used the simulation results to examine whether interactions between genes associated with abnormal emotion/affect behaviour influence risk for psychosis and cognition in humans, in a case-control study of 2049 cases and 1794 controls. Before correcting for multiple testing, significant interactions were observed in the abnormal fear/anxiety-related behaviour pathway between CRHR1 and ESR1, between MAPT and ESR1, among CRHR1, ESR1 and TOM1L2, and among MAPT, ESR1 and TOM1L2. There was no evidence for epistasis after Bonferroni correction.
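The unconditional, unscaled permutation importance named in this abstract can be illustrated with a minimal Python sketch. Everything below is an assumption for illustration, not the thesis's code: a fixed toy rule (`toy_model`) stands in for a fitted forest, the data are synthetic, and the importance is computed on the whole dataset rather than on each tree's out-of-bag samples as a Random Forest implementation would typically do.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    hits = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return hits / len(rows)

def permutation_importance(model, rows, labels, n_repeats=10, seed=0):
    """Unscaled, unconditional permutation importance for each predictor.

    Each column is shuffled across rows (a marginal permutation that ignores
    the other predictors) and the mean drop in accuracy is recorded. The drop
    is not divided by its standard error, hence "unscaled".
    """
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: the label depends only on predictor 0.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(row[0] > 0.5) for row in rows]

def toy_model(row):
    # Stand-in for a fitted forest: a fixed rule that uses predictor 0 only.
    return int(row[0] > 0.5)

importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

In this toy setting the second predictor is never used by the model, so permuting it cannot change any prediction and its importance is exactly zero, while permuting the first predictor causes a clear drop in accuracy.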
|
4 |
A Random Forest Based Method for Urban Land Cover Classification using LiDAR Data and Aerial Imagery
Jin, Jiao. 22 May 2012
Urban land cover classification has always been crucial due to its ability to link many elements of human and physical environments. Timely, accurate, and detailed knowledge of urban land cover derived from remote sensing data is increasingly required among a wide variety of communities. This surge of interest has been predominantly driven by recent innovations in data, technologies, and theories in urban remote sensing. The development of light detection and ranging (LiDAR) systems, especially when combined with a high-resolution camera component, has shown great potential for urban classification. However, the performance of traditional and widely used classification methods is limited in this context, due to the complexity of image interpretation. On the other hand, random forest (RF), a newly developed machine learning algorithm, is receiving considerable attention in the field of image classification and pattern recognition. Several studies have shown the advantages of RF in land cover classification. However, few have focused on urban areas using a fusion of LiDAR data and aerial images.
The performance of RF-based feature selection and classification methods for urban areas was explored and compared with other popular feature selection approaches and classifiers. Evaluation was based on several criteria: classification accuracy, the impact of different training sample sizes, and computational speed. LiDAR data and aerial imagery with 0.5-m resolution were used to classify four land categories in the study area, located in the City of Niagara Falls (ON, Canada). The results clearly demonstrate that the use of RF improved classification performance in terms of accuracy and speed. Support vector machine (SVM) based and RF-based classifiers showed similar accuracies. However, RF-based classifiers were much quicker than SVM-based methods. Based on the results of this work, it can be concluded that the RF-based method holds great potential for current and future urban land cover classification problems using LiDAR data and aerial images.
|
5 |
Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis
Wålinder, Andreas. January 2014
Model selection is an important part of classification. In this thesis we study two classification models, logistic regression and random forest. They are compared and evaluated based on prediction accuracy and metadata analysis. The models were trained on 25 diverse datasets. We calculated the prediction accuracy of both models using RapidMiner. We also collected metadata for the datasets: the number of observations, the number of predictor variables and the number of classes in the response variable. The performance of logistic regression and random forest is significantly correlated, with a correlation of 0.60 and confidence interval [0.29, 0.79]. The models appear to perform similarly across the datasets, with performance influenced more by the choice of dataset than by model selection. Random forest, with an average prediction accuracy of 81.66%, performed better on these datasets than logistic regression, with an average prediction accuracy of 73.07%. The difference is, however, not statistically significant, with a p-value of 0.088 for Student's t-test. Multiple linear regression analysis reveals that none of the analysed metadata have a significant linear relationship with logistic regression performance; the regression of logistic regression performance on metadata has a p-value of 0.66. We get similar results for random forest performance: the regression of random forest performance on metadata has a p-value of 0.89, and none of the analysed metadata have a significant linear relationship with random forest performance. We conclude that the prediction accuracies of logistic regression and random forest are correlated. Random forest performed slightly better on the studied datasets, but the difference is not statistically significant. The studied metadata do not appear to have a significant effect on the prediction accuracy of either model.
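As a worked illustration of how a confidence interval for a correlation of 0.60 over 25 datasets might be obtained, the following Python sketch applies the Fisher z-transformation. This is an assumption: the abstract does not say which method the thesis used, and the function name `fisher_ci` is hypothetical.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation.

    Uses the Fisher z-transformation: atanh(r) is treated as roughly normal
    with standard error 1 / sqrt(n - 3), and the interval is mapped back
    to the correlation scale with tanh.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

# Values taken from the abstract: correlation 0.60 across 25 datasets.
low, high = fisher_ci(0.60, 25)
print(round(low, 2), round(high, 2))
```

Under these assumptions the interval comes out at roughly 0.27 to 0.80, close to the reported [0.29, 0.79]. The small difference may reflect rounding of the reported correlation or a different interval method.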
|
7 |
Uma abordagem para a construção de uma única árvore a partir de uma Random Forest para classificação de bases de expressão gênica / An approach to the construction of a single tree from a Random Forest for the classification of gene expression databases
Oshiro, Thais Mayumi. 27 August 2013
Random Forest is a computationally efficient technique that can operate quickly over large datasets. It has been used in many recent research projects and real-world applications in several fields, including bioinformatics, since Random Forest can handle datasets with many attributes and few examples. However, it is difficult for human experts from various fields to understand. The master's research reported here aims to create a symbolic model, i.e. a single tree, from a Random Forest for the classification of gene expression datasets. The goal is to increase human experts' understanding of the process that classifies examples in the real world while trying to maintain good performance. Initial results obtained with the proposed algorithm are promising, since in some cases it performs better than another widely used algorithm (J48) and slightly worse than the Random Forest. Furthermore, the induced tree is, in general, smaller than the tree built by the J48 algorithm.
|
8 |
Mapeamento semiautomático por meio de padrão espectro-temporal de áreas agrícolas e alvos permanentes com evi/modis no Paraná / Semiautomatic mapping of agricultural areas and permanent targets using the spectro-temporal pattern of EVI/MODIS in Paraná
Verica, Weverton Rodrigo. 16 February 2018
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES
Knowledge of the location and extent of areas used for agriculture or for native or planted forests is relevant for public managers, so that their decisions can be based on reliable data. In addition, part of the ICMS revenue from the Municipal Participation Fund (FPM) depends on agricultural production data, the number of rural properties and the environmental factor. The objective of this research was to design an objective, semiautomatic methodology to map agricultural areas and permanent targets, and subsequently to identify areas of soybean, first- and second-crop corn, winter crops, semi-perennial agriculture, forests and other permanent targets in the state of Paraná for the crop years 2013/14 to 2016/17, using time series of EVI/MODIS vegetation indices. The proposed methodology follows the steps of the Knowledge Discovery in Databases (KDD) process: metrics were extracted from the spectro-temporal profile of each pixel, and the classification task was performed by the Random Forest algorithm. The mappings were validated with samples extracted from Landsat-8 images, giving global accuracy above 84.37% and a kappa index ranging from 0.63 to 0.98; the mappings are therefore considered to have good or excellent spatial accuracy. The municipal areas mapped for soybean, first-crop corn, second-crop corn and winter crops were compared with official statistics, giving linear correlation coefficients between 0.61 and 0.9, which indicates moderate or strong correlation with the official data. The proposed semiautomatic methodology thus succeeded in producing the mapping and in automating the computation of the metrics, resulting in a script for the R software that facilitates future mappings with low processing time.
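The global accuracy and kappa index reported for the validation can both be derived from a confusion matrix. The Python sketch below shows the standard Cohen's kappa calculation on a hypothetical two-class matrix; the numbers are invented for illustration and are not taken from the dissertation, which reports that its own processing was scripted in R.

```python
def overall_accuracy_and_kappa(matrix):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    Rows are reference classes and columns are mapped classes. Kappa compares
    the observed agreement with the agreement expected by chance from the row
    and column totals.
    """
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical validation sample: 100 points, two classes.
confusion = [[45, 5],
             [10, 40]]
acc, kappa = overall_accuracy_and_kappa(confusion)
print(acc, kappa)
```

For this invented matrix the overall accuracy is 0.85 and kappa is about 0.7, because half of the agreement would already be expected by chance given the row and column totals.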
|
10 |
Investigating the Performance of Random Forest Classification for Stock Trading
Nordfjell, Oscar; Ring, Gustav. January 2023
We show that with the implementation presented in this paper, the Random Forest Classification model was able to predict whether or not a stock was going to increase in value during the coming day with an accuracy higher than 50% for all stocks included in this study. Furthermore, we show that the active trading strategy presented in this paper generated higher returns and higher risk-adjusted returns than the passive investment in the stocks underlying the strategy. Therefore, we conclude (i) that a Random Forest Classification model can be used to provide valuable insight on publicly traded stocks, and (ii) that it is probably possible to create a profitable trading strategy based on a Random Forest Classifier, but that this requires a more sophisticated implementation than the one presented in this paper.
|