51 |
An investigation into applications of canonical polyadic decomposition & ensemble learning in forecasting thermal data streams in direct laser deposition processes
Storey, Jonathan 08 December 2023 (has links) (PDF)
Additive manufacturing (AM) is a process of creating objects from 3D model data by adding layers of material. AM technologies present several advantages over traditional manufacturing technologies, such as producing less material waste and being capable of producing parts with greater geometric complexity. However, deficiencies in the printing process due to high process uncertainty can affect the microstructural properties of a fabricated part, leading to defects. In metal AM, previous studies have linked part defects with melt pool temperature fluctuations, with the size of the melt pool and the scan pattern being key associated factors. Thus, being able to adjust certain process parameters during a part's fabrication, and knowing when to adjust them, is critical to producing reliable parts. Knowing when to adjust these parameters effectively requires models that can both identify when a defect has occurred and forecast the behavior of the process to predict whether a defect will occur. This study focuses on the development of accurate forecasting models of the melt pool temperature distribution. Researchers at Mississippi State University have collected in-situ pyrometer data of a direct laser deposition process, which captures the temperature distribution of the melt pool. The high dimensionality and noise of the data pose unique challenges in developing accurate forecasting models. To overcome these challenges, a tensor decomposition modeling framework is developed that can actively learn and adapt to new data. The framework is evaluated on two datasets, demonstrating its ability to generate accurate forecasts and adjust to new data.
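The abstract does not detail the decomposition itself; as background, a minimal sketch of canonical polyadic (CP) decomposition fitted by alternating least squares, the kind of low-rank tensor model such a framework could build on, might look like this (pure NumPy; the tensor shapes, rank, and iteration count are illustrative assumptions, not the thesis's actual configuration):

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move `mode` to the front, flatten the rest (C-order)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker product: row (i, j) of the result is A[i,:] * B[j,:]."""
    r = A.shape[1]
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, r)

def cp_als(T, rank, n_iter=200, seed=0):
    """Fit T[i,j,k] ~ sum_r A[i,r] * B[j,r] * C[k,r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((s, rank)) for s in T.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            # Design matrix whose row ordering matches the C-order unfolding above.
            design = khatri_rao(others[0], others[1])
            factors[mode] = np.linalg.lstsq(design, unfold(T, mode).T, rcond=None)[0].T
    return factors

def reconstruct(factors):
    """Rebuild the full tensor from its CP factors."""
    A, B, C = factors
    return np.einsum('ir,jr,kr->ijk', A, B, C)
```

In a forecasting setting, one would typically forecast the rows of the temporal factor (e.g. with an autoregressive model) and reconstruct the tensor from the predicted factor; that step is omitted here.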
|
52 |
Sublimation temperature prediction of OLED materials : using machine learning
Norinder, Niklas January 2023 (has links)
Organic light-emitting diodes (OLED) are, and have for some time been, regarded as the future of display technology. Looking back, display technology has moved from cathode-ray tube (CRT) displays to liquid crystal displays (LCDs). Whereas CRT displays were clunky and had quite high power consumption, LCDs were thinner, lighter and consumed less energy. This technological shift made it possible to create smaller and more portable screens, aiding the development of personal electronics. Currently, however, LCDs' place at the top of the display hierarchy is being challenged by OLED displays, which provide higher pixel density and overall higher performance. OLED displays consist of thin layers of organic semiconductors, and are instrumental in the development of folding displays; small displays for virtual reality and augmented reality applications; and energy-efficient displays. In the creation of OLED displays, the organic semiconducting material is vaporized and adhered to a thin film through vapor deposition techniques. One way of aiding the creation of organic electroluminescent (OEL) materials and OLEDs is through in silico analysis of sublimation temperatures using machine learning. This master's thesis occupies that space, aiming to create a deeper understanding of OEL materials through sublimation temperature prediction using ensemble learning (light gradient-boosting machine) and deep learning (convolutional neural network) methods. Through analysis of experimental OEL data, it is found that the sublimation temperatures of OLED materials can be predicted with machine learning regression using molecular descriptors, with an R2 score of ~0.86, a Mean Absolute Error of ~13°C, a Mean Absolute Percentage Error of ~3.1%, and a Normalized Mean Absolute Error of ~0.56.
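The four reported error measures follow standard definitions, except that "Normalized Mean Absolute Error" has several conventions; the sketch below assumes MAE normalized by the target range, which may differ from the thesis's choice:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R2, MAE, MAPE (%) and range-normalized MAE for a regression fit."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    mae = np.mean(np.abs(resid))
    return {
        'R2': 1.0 - ss_res / ss_tot,
        'MAE': mae,
        'MAPE': np.mean(np.abs(resid / y_true)) * 100.0,  # assumes no zero targets
        'NMAE': mae / (y_true.max() - y_true.min()),      # range normalization (assumption)
    }
```

With sublimation temperatures in °C well above zero, the MAPE denominator is safe; for targets near zero a different normalization would be needed.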
|
53 |
Housing Price Prediction over Countrywide Data : A comparison of XGBoost and Random Forest regressor models
Henriksson, Erik, Werlinder, Kristopher January 2021 (has links)
The aim of this research project is to investigate how an XGBoost regressor compares to a Random Forest regressor in terms of predictive performance on housing prices, with the help of two datasets. The comparison considers training time, inference time and the three evaluation metrics R2, RMSE and MAPE. The datasets are described in detail together with background on the regressor models used. The method involves substantial cleaning of the two datasets, hyperparameter tuning to find optimal parameters, and 5-fold cross-validation in order to achieve good performance estimates. The finding of this research project is that XGBoost performs better on both small and large datasets. While the Random Forest model can achieve results similar to the XGBoost model, it needs a much longer training time, between 2 and 50 times as long, and has a longer inference time, around 40 times as long. This makes XGBoost especially superior when used on larger datasets.
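A comparison of this kind reduces to timing fit/predict calls and computing error metrics. A hedged sketch of such a harness is shown below, with a trivial mean predictor standing in for the actual XGBoost and Random Forest regressors (which are not reproduced here); a real comparison would pass the fitted library models instead:

```python
import time
import math

def rmse(y, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def mape(y, p):
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, p)) / len(y)

def benchmark(model, X_train, y_train, X_test, y_test):
    """Time fit/predict and collect the error metrics used in the comparison."""
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - t0
    t0 = time.perf_counter()
    pred = model.predict(X_test)
    infer_time = time.perf_counter() - t0
    return {'train_s': train_time, 'infer_s': infer_time,
            'RMSE': rmse(y_test, pred), 'MAPE': mape(y_test, pred)}

class MeanPredictor:
    """Stand-in model with an sklearn-style fit/predict interface."""
    def fit(self, X, y):
        self.mean = sum(y) / len(y)
    def predict(self, X):
        return [self.mean for _ in X]
```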
|
54 |
OPTIMIZING DECISION TREE ENSEMBLES FOR GENE-GENE INTERACTION DETECTION
Assareh, Amin 27 November 2012 (has links)
No description available.
|
55 |
Naive semi-supervised deep learning med sammansättning av pseudo-klassificerare / Naive semi-supervised deep learning with an ensemble of pseudo-labelers
Karlsson, Erik, Nordhammar, Gilbert January 2019 (links)
A common problem in supervised learning is the lack of labeled training data. Naive semi-supervised deep learning (NSSDL) is a training technique that aims to mitigate this problem by generating pseudo-labeled data and then letting a neural network train on this together with a smaller amount of labeled data. This work investigates whether the technique can be improved through the use of voting. Several neural networks are trained using the proposed technique, naive semi-supervised deep learning, or supervised learning, and their accuracy is then evaluated. The results showed almost exclusively degraded performance when voting was used. However, the conditions for voting do not appear to have been particularly favorable, which makes it difficult to draw a firm conclusion about the effects of voting. Even though voting yielded no improvements, NSSDL proved to be very effective. There are several application areas where the technique could be used with good results in the future.
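As a rough illustration of the voting step studied here, the sketch below pseudo-labels an unlabeled pool by majority vote over an ensemble trained on stratified bootstrap resamples of the labeled data; a nearest-centroid model stands in for the neural networks, and every concrete detail is an assumption rather than the thesis's exact setup:

```python
import numpy as np

class NearestCentroid:
    """Tiny stand-in classifier for the neural networks used in the thesis."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.centroids = np.array([X[y == c].mean(axis=0) for c in self.classes])
        return self
    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
        return self.classes[d.argmin(axis=1)]

def pseudo_label_by_vote(X_lab, y_lab, X_unlab, n_models=5, seed=0):
    """Train an ensemble on stratified bootstrap resamples of the labeled data
    and pseudo-label the unlabeled pool by majority vote."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        # Stratified bootstrap: resample within each class so none disappears.
        idx = np.concatenate([rng.choice(np.where(y_lab == c)[0], (y_lab == c).sum())
                              for c in np.unique(y_lab)])
        votes.append(NearestCentroid().fit(X_lab[idx], y_lab[idx]).predict(X_unlab))
    votes = np.stack(votes)
    # Majority vote per unlabeled instance (labels assumed to be small ints).
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

The confident pseudo-labels produced this way would then be merged with the labeled set for a final round of training.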
|
56 |
Método baseado em rotação e projeção otimizadas para a construção de ensembles de modelos / Ensemble method based on optimized rotation and projection
Ferreira, Ednaldo José 31 May 2012 (links)
The development of new techniques capable of inducing predictive models with low generalization errors has been a constant in machine learning and related areas. In this context, the composition of an ensemble of models stands out for its theoretical and empirical potential to minimize the generalization error. Several methods for building ensembles are found in the literature. Among them, the rotation-based (RB) method has become known for outperforming other traditional methods. The RB method applies principal component analysis (PCA) for feature extraction as a rotation strategy to provide diversity and accuracy among base models. However, this strategy does not ensure that the resulting direction is appropriate for the chosen supervised learning technique (SLT). Moreover, the RB method is not suitable for rotation-invariant SLTs and has not been widely evaluated with stable ones, which makes it inappropriate for and/or restricted to some SLTs. This thesis proposes a new feature-extraction approach based on the concatenation of rotation and projection optimized for the SLT (called optimized roto-projection). The approach uses a metaheuristic to optimize the parameters of the roto-projection transformation, minimizing the error of the technique that directs the optimization. More emphatically, optimized roto-projection is proposed as a fundamental part of a new ensemble method, called the optimized roto-projection ensemble (ORPE). The results show that optimized roto-projection can reduce the dimensionality and the complexity of the data and of the model, and can increase the performance of the SLT subsequently applied. ORPE outperformed, with statistical significance, RB and other methods using stable and unstable SLTs on public- and private-domain classification and regression datasets. ORPE proved unrestricted and highly effective, holding the first position in every dominance ranking performed.
|
57 |
Adaptivni sistem za automatsku polu-nadgledanu klasifikaciju podataka / Adaptive System for Automated Semi-supervised Data ClassificationSlivka Jelena 23 December 2014 (has links)
Aim – The research presented in this thesis is aimed at the development of a system for automatic semi-supervised classification. The system is designed to be applicable to a broad spectrum of practical domains where automatic classification of data is needed but it is hard, or even impossible, to obtain a sufficiently large and diverse training set.

Methodology – The described models combine the co-training algorithm with ensemble learning in order to overcome the problem of applying co-training to datasets without a natural feature split. The first step is to create an ensemble of co-training classifiers of high diversity and quality. To this end, the models apply different configurations of the co-training algorithm to the same dataset. Compared to existing similar approaches, this approach requires a significantly smaller annotated initial training set. The ensemble of independently trained co-training classifiers is created by generating a predefined number of random feature splits of the initial dataset. Starting from the same initial training set, but using the different feature splits, a group of co-training classifiers is trained. The predictions of the independently trained classifiers must then be combined; the two models differ in how this is done.

The first approach is based on majority voting: each instance in the enlarged training sets resulting from co-training is classified by majority vote of the group of co-training classifiers. A genetic algorithm is then applied to select the most reliably classified instances of this set, and these are used to train a final classifier, which classifies new instances. This algorithm is called the Random Split Statistics Algorithm (RSSalg).

The second approach is based on the GMM-MAPML technique for estimating true class labels from multiple labels assigned by annotators of unknown quality. In this algorithm, called the Integration of Multiple Co-trained Classifiers (IMCC), each independently trained co-training classifier predicts a label for each instance to be classified. Each co-training classifier is treated as an annotator of unknown quality, so each instance receives multiple labels (one per classifier). Finally, the GMM-MAPML technique is applied to estimate the true hidden label of each instance from the multiple assigned labels.

Results – Two models are developed in the dissertation, the Integration of Multiple Co-trained Classifiers (IMCC) and the Random Split Statistics Algorithm (RSSalg), both based on the co-training algorithm, which solve the task of automatic classification when a sufficiently large annotated training corpus does not exist. The models are designed to enable the application of the co-training algorithm to datasets without a natural feature split, as well as to improve its performance. The models are compared to existing co-training alternatives on multiple datasets of varying size, dimensionality and feature redundancy, and are shown to achieve better performance than the tested alternatives.

Practical application – The developed models are applicable in all domains where data classification is needed but annotation is time-consuming and expensive. The dissertation also presents the application of the models in several concrete situations where they are of particular benefit: subjectivity detection, multi-category classification and recommender systems.

Value – The models can greatly reduce the human effort needed for the long and tedious annotation of large datasets. The conducted experiments show that the developed models outperform existing alternatives developed with the same goal of relaxing the problem of lengthy and laborious annotation of large datasets.
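Two core ingredients of RSSalg-style training, generating random two-view feature splits and combining member predictions by majority vote, can be sketched as follows (the co-training loop itself is omitted; the even split and the tie-breaking rule are assumptions):

```python
import numpy as np

def random_feature_splits(n_features, n_splits, seed=0):
    """Generate random two-view feature splits to initialize co-training runs."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_splits):
        perm = rng.permutation(n_features)
        half = n_features // 2
        # Each split is a pair of disjoint feature-index views covering all features.
        splits.append((np.sort(perm[:half]), np.sort(perm[half:])))
    return splits

def majority_vote(label_matrix):
    """Combine per-classifier predictions (rows = classifiers, columns = instances)
    by majority vote; ties break toward the smaller label (bincount/argmax)."""
    label_matrix = np.asarray(label_matrix)
    return np.array([np.bincount(col).argmax() for col in label_matrix.T])
```

Each split would seed one co-training run; the resulting classifiers' predictions are then stacked and passed to `majority_vote`.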
|
58 |
Investigation of training data issues in ensemble classification based on margin concept : application to land cover mapping / Investigation des problèmes des données d'apprentissage en classification ensembliste basée sur le concept de marge : application à la cartographie d'occupation du sol
Feng, Wei 19 July 2017 (links)
Classification has been widely studied in machine learning. Ensemble methods, which build a classification model by integrating multiple component learners, achieve higher performance than a single classifier. The classification accuracy of an ensemble is directly influenced by the quality of the training data used. However, real-world data often suffer from class noise and class imbalance problems. Ensemble margin is a key concept in ensemble learning. It has been applied to both the theoretical analysis and the design of machine learning algorithms. Several studies have shown that the generalization performance of an ensemble classifier is related to the distribution of its margins on the training examples. This work focuses on exploiting the margin concept to improve the quality of the training set, and therefore to increase the classification accuracy of noise-sensitive classifiers, and to design effective ensemble classifiers that can handle imbalanced datasets. A novel ensemble margin definition is proposed: an unsupervised version of a popular ensemble margin that does not involve the class labels. Mislabeled training data is a challenge to face in order to build a robust classifier, whether it is an ensemble or not. To handle the mislabeling problem, we propose an ensemble margin-based class noise identification and elimination method built on an existing margin-based class noise ordering. This method can achieve a high mislabeled instance detection rate while keeping the false detection rate as low as possible. It relies on the margin values of misclassified data, considering four different ensemble margins, including the newly proposed margin. The method is extended to tackle class noise correction, which is a more challenging issue. Instances with low margins are more important than safe samples, which have high margins, for building a reliable classifier. A novel bagging algorithm based on a data importance evaluation function, again relying on the ensemble margin, is proposed to deal with the class imbalance problem. In our algorithm, the emphasis is placed on the lowest-margin samples. This method is also evaluated using four different ensemble margins in addressing the imbalance problem, especially on multi-class imbalanced data. In remote sensing, where training data are typically ground-based, mislabeled training data is inevitable. Imbalanced training data is another problem frequently encountered in remote sensing. Both proposed ensemble methods, involving the best margin definition for handling these two major training data issues, are applied to the mapping of land covers.
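Assuming the standard definitions, the popular supervised ensemble margin (votes for the true class minus the maximum votes for any other class, over the total) and an unsupervised counterpart (top vote count minus the second, which needs no labels) can be computed as follows. These are textbook forms, not necessarily the thesis's exact formulations:

```python
from collections import Counter

def supervised_margin(votes, true_label):
    """(votes for the true class - max votes for any other class) / total votes.
    Negative values indicate the ensemble misclassifies the instance."""
    counts = Counter(votes)
    v_true = counts.get(true_label, 0)
    v_other = max((c for lab, c in counts.items() if lab != true_label), default=0)
    return (v_true - v_other) / len(votes)

def unsupervised_margin(votes):
    """(most-voted class - second most-voted class) / total votes; no labels needed."""
    top = Counter(votes).most_common(2)
    second = top[1][1] if len(top) > 1 else 0
    return (top[0][1] - second) / len(votes)
```

Low-margin instances under either definition are the ambiguous ones that noise filtering and the imbalance-aware bagging described above concentrate on.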
|
59 |
Cost-sensitive boosting : a unified approach
Nikolaou, Nikolaos January 2016 (links)
In this thesis we provide a unifying framework for two decades of work in an area of Machine Learning known as cost-sensitive Boosting algorithms. This area is concerned with the fact that most real-world prediction problems are asymmetric, in the sense that different types of errors incur different costs. Adaptive Boosting (AdaBoost) is one of the most well-studied and utilised algorithms in the field of Machine Learning, with a rich theoretical depth as well as practical uptake across numerous industries. However, its inability to handle asymmetric tasks has been the subject of much criticism. As a result, numerous cost-sensitive modifications of the original algorithm have been proposed. Each of these has its own motivations, and its own claims to superiority. With a thorough analysis of the literature 1997-2016, we find 15 distinct cost-sensitive Boosting variants, discounting minor variations. We critique the literature using four powerful theoretical frameworks: Bayesian decision theory, the functional gradient descent view, margin theory, and probabilistic modelling. From each framework, we derive a set of properties which must be obeyed by boosting algorithms. We find that only 3 of the published AdaBoost variants are consistent with the rules of all the frameworks, and even they require their outputs to be calibrated to achieve this. Experiments on 18 datasets, across 21 degrees of cost asymmetry, all support the hypothesis, showing that once calibrated, the three variants perform equivalently, outperforming all others. Our final recommendation, based on theoretical soundness, simplicity, flexibility and performance, is to use the original AdaBoost algorithm, albeit with a shifted decision threshold and calibrated probability estimates. The conclusion is that novel cost-sensitive boosting algorithms are unnecessary if proper calibration is applied to the original.
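The recommended threshold shift follows from Bayesian decision theory: predicting positive risks (1 - p) * c_FP while predicting negative risks p * c_FN, so the positive class is optimal exactly when the calibrated probability p exceeds c_FP / (c_FP + c_FN). A minimal sketch of that decision rule:

```python
def cost_sensitive_threshold(c_fp, c_fn):
    """Bayes-optimal threshold on a calibrated probability estimate:
    predict positive when P(y=1|x) exceeds c_fp / (c_fp + c_fn)."""
    return c_fp / (c_fp + c_fn)

def decide(p_positive, c_fp, c_fn):
    """Cost-sensitive decision from a calibrated probability estimate."""
    return 1 if p_positive > cost_sensitive_threshold(c_fp, c_fn) else 0
```

With symmetric costs the threshold is the familiar 0.5; making false negatives four times costlier lowers it to 0.2, so weaker positive evidence suffices.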
|