Global ETD Search

21	Um estudo de limpeza em base de dados desbalanceada e com sobreposição de classes
Machado, Emerson Lopes (04 1900)
Master's dissertation, Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2007.
This work investigates methods that improve the accuracy of classifiers built from imbalanced data sets. A data set is imbalanced when it has many more cases of one class than of the others, and therefore contains rare classes (class imbalance). Imbalance can also occur within a single class when the distribution of attribute values is highly skewed, which gives rise to rare cases (within-class imbalance). Standard classifier algorithms are strongly affected by both kinds of imbalance: the models they generate are biased toward the majority classes (or cases), to the detriment of the underrepresented ones. Models generated from data sets with rare classes generally suffer from low accuracy on those classes, which is a problem when one of them is the class of interest. Rare cases, in turn, are likely to be ignored by inducers, which is a problem when they belong to the class (or classes) of interest.
A new method is proposed in this work, Cluster-based Smote, which combines Cluster-based Oversampling (oversampling by replication of positive cases guided by clusters) and SMOTE (Synthetic Minority Oversampling Technique, oversampling by generation of synthetic cases). Cluster-based Oversampling aims to improve the learning of small disjuncts, which are generally related to rare cases, but it overfits the model to the training set. SMOTE generates new synthetic cases instead of replicating existing ones, which addresses the overfitting problem of random oversampling, but it does not emphasize rare cases. The combination of the two, Cluster-based Smote, gave better results than either method applied alone in eight of the nine data sets used.
Another approach proposed in this research aims to reduce the class overlap possibly caused by applying SMOTE. The main idea is to guide the SMOTE process with unsupervised learning (clustering). The method implemented under this approach, named C-clear, gave a significant improvement over SMOTE in three of the nine data sets tested and tied on the rest. A data cleaning method based on unsupervised learning was also proposed and incorporated into C-clear. The cleaning improved the results on only one data set, possibly because of a poor choice of cleaning parameters. Learning these parameters from the data is left as future work.
Mineração de dados (Computação); Desbalanceamento de classe; Sobreposição de classe; SMOTE; Cluster-based Oversampling; Cluster-based Smote; C-clear
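The interpolation step that the abstract attributes to SMOTE (generating synthetic cases rather than replicating existing ones) can be sketched as follows. This is a minimal illustration and not the dissertation's implementation: the function name `smote_sketch`, the neighbor count `k`, and the toy points are assumptions made for this example, and the cluster-guided variants (Cluster-based Smote, C-clear) are not reproduced because the abstract does not give their details.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each placed on the line segment between
    a randomly chosen minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # relative position along the segment base -> neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

# Hypothetical minority-class points (two features each).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0)]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point lies between two existing minority points, the new cases stay inside the region already covered by the minority class, which is also why they can drift into an overlapping majority region, the problem the second proposed approach targets.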
22	A Combined Approach to Handle Multi-class Imbalanced Data and to Adapt Concept Drifts using Machine Learning
Tumati, Saini (05 October 2021)
No description available.
Computer Science; Imbalanced datasets; Multi-class imbalanced datasets; Oversampling; Concept drifts; Machine learning; ensemble learning
23	The impact of missing data imputation on HCC survival prediction: Exploring the combination of missing data imputation with data-level methods such as clustering and oversampling
Abdul Jalil, Walid; Dalla Torre, Kevin (January 2018)
The area of data imputation, the process of replacing missing data with substituted values, has been covered quite extensively in recent years. The literature on the practical impact of data imputation, however, remains scarce. This thesis explores the impact of some state-of-the-art data imputation methods on HCC survival prediction and classification in combination with data-level methods such as oversampling. More specifically, it explores imputation methods for mixed-type datasets and their impact on a particular HCC dataset. Previous research has shown that the newer, more sophisticated imputation methods outperform simpler ones when evaluated with normalized root mean square error (NRMSE). Contrary to intuition, however, the results of this study show that when imputation is combined with other data-level methods such as clustering and oversampling, the differences in imputation performance do not always affect classification in any meaningful way. This might be explained by the noise that is introduced when synthetic data points are generated in the oversampling process. The results also show that one of the more sophisticated imputation methods, MICE, is highly dependent on prior assumptions about the underlying distributions of the dataset. When those assumptions are incorrect, the imputation method performs poorly and has a considerable negative impact on classification.
missing data imputation; HCC survival prediction; oversampling; Engineering and Technology; Teknik och teknologier
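The NRMSE criterion mentioned in the abstract can be illustrated with a short sketch. The abstract does not say which normalization the thesis uses, so the version below assumes one common convention from the imputation literature, dividing the mean squared error by the variance of the true values; the function name and sample numbers are made up for the example.

```python
import math

def nrmse(true_values, imputed_values):
    """Root of (mean squared imputation error / variance of the true values).
    Assumed normalization; other conventions divide by the range or the mean."""
    n = len(true_values)
    mse = sum((t - i) ** 2 for t, i in zip(true_values, imputed_values)) / n
    mean_true = sum(true_values) / n
    var_true = sum((t - mean_true) ** 2 for t in true_values) / n
    return math.sqrt(mse / var_true)

# Hypothetical values that were hidden and then imputed.
true_values = [2.0, 4.0, 6.0, 8.0]
perfect = nrmse(true_values, true_values)   # imputation recovers every value
mean_fill = nrmse(true_values, [5.0] * 4)   # every value replaced by the mean
```

Under this convention a perfect imputation scores 0 and plain mean imputation scores 1, so values below 1 indicate a method that does better than filling in the mean.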
24	Enhancing Telecom Churn Prediction: Adaboost with Oversampling and Recursive Feature Elimination Approach
Tran, Long Dinh (01 June 2023) (PDF)
Churn prediction is a critical task for businesses to retain their valuable customers. This paper presents a comprehensive study of churn prediction in the telecom sector using 15 approaches, including popular algorithms such as Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, and AdaBoost. The study is segmented into three sets of experiments, each focusing on a different approach to building the churn prediction model. The model is constructed using the original training set in the first set of experiments. The second set involves oversampling the training set to address the issue of imbalanced data. Lastly, the third set combines oversampling with recursive feature selection to enhance the model's performance further. The results demonstrate that the Adaptive Boost classifier, implemented with oversampling and recursive feature selection, outperforms the other 14 techniques. It achieves the highest rank in all three evaluation metrics: recall (0.841), f1-score (0.655), and roc_auc (0.793), further indicating that the proposed approach effectively predicts churn and provides valuable insights into customer behavior.
Churn Prediction; Unbalanced Datasets; Oversampling; SMOTE; Recursive Feature Selection; RFE; Machine Learning
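Two ingredients of the experiments described above, oversampling the training set and scoring with recall and F1, can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's pipeline: it uses random duplication as a simple stand-in for SMOTE (which the index terms list), assumes a binary problem, and all names and data are hypothetical. AdaBoost and recursive feature elimination are not reproduced here.

```python
import random

def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size.
    Intended for the training set only, so test data keeps its original class balance."""
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    n_majority = len(labels) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_majority - len(minority))]
    return rows + extra, labels + [minority_label] * len(extra)

def recall_and_f1(y_true, y_pred, positive=1):
    """Recall and F1 for the positive class (assumes at least one true
    positive, otherwise the divisions below are undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return recall, 2 * precision * recall / (precision + recall)

# Hypothetical training data: four non-churners (0) and two churners (1).
train_rows = [[0.1], [0.2], [0.3], [0.4], [0.9], [0.8]]
train_labels = [0, 0, 0, 0, 1, 1]
balanced_rows, balanced_labels = random_oversample(train_rows, train_labels, 1)

# Hypothetical predictions on a held-out set.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
recall, f1 = recall_and_f1(y_true, y_pred)
```

Recall is emphasized in churn settings because a missed churner (false negative) is usually the costlier error, which is consistent with the abstract reporting recall first.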
25	Predicting SNI Codes from Company Descriptions: A Machine Learning Solution
Lindholm, Erik; Nilsson, Jonas (January 2023)
This study aims to develop an automated solution for assigning area-of-industry codes to businesses based on the contents of their business descriptions. The Swedish standard industrial classification (SNI) is a system used by Statistics Sweden (SCB) for categorizing businesses for their statistics reports. Assignment of SNI codes has so far been done manually by the person registering a new company, but this is a far from optimal solution. Some of the 88 main group areas of industry are hard to tell apart from one another, and this often leads to incorrect assignments. Our approach to this problem was to train a machine learning model using the Naive Bayes and SVM classifier algorithms and conduct an experiment. In 2019, Dahlqvist and Strandlund had attempted this and reached an accuracy score of 52 percent by use of the gradient boosting classifier, but this was considered too low for real-world implementation. Our main goal was to achieve a higher accuracy than that of Dahlqvist and Strandlund, which we eventually succeeded in: our best-performing SVM model reached a score of 60.11 percent. Like Dahlqvist and Strandlund, we concluded that the low quality of the dataset was the main obstacle to achieving higher scores. The dataset we used was severely imbalanced, and much time was spent on investigating and applying oversampling and undersampling as strategies for mitigating this problem. However, we found during the testing phase that none of these strategies had any positive effect on the accuracy scores.
Machine learning; text classification; SNI; Naive Bayes; SVM; oversampling; undersampling; Computer Sciences; Datavetenskap (datalogi)
26	Návrh diskrétního delta-sigma modulátoru pro audio aplikace nízkého řádu s vysokým koeficientem převzorkování / Design of low order high OSR discrete time delta-sigma modulator for audio applications
Dohnal, Jaroslav (January 2020)
This master's thesis aims to introduce the reader to the basic concept and principles of single-loop delta-sigma modulators, that is, modulators with a single feedback loop. It covers the basic principles of oversampling in digital-to-analog converters and extends them with the theory of noise shaping. Based on this theory, three single-loop modulators running at an OSR of 1024 are designed as an alternative to the commonly used high-order modulators. The modulators are implemented in an FPGA together with a reconstruction filter and supporting blocks. Finally, a hardware prototype was built to evaluate the implementation of the designed DAC.
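The single-feedback-loop principle described in the abstract can be illustrated with a generic first-order delta-sigma loop. This is a textbook-style sketch, not one of the three modulators designed in the thesis: the function name and the constant test input are assumptions, and the value 1024 is borrowed from the abstract only as the number of output bits produced per input value.

```python
def delta_sigma_first_order(samples):
    """First-order loop: accumulate the difference between the input and the
    previous 1-bit output, then quantize the accumulator to +1.0 or -1.0."""
    integrator = 0.0
    feedback = 0.0  # previous quantizer output; zero before the first sample
    bits = []
    for x in samples:
        integrator += x - feedback
        feedback = 1.0 if integrator >= 0.0 else -1.0
        bits.append(feedback)
    return bits

osr = 1024        # output bits generated for one held input value
dc_input = 0.25   # hypothetical constant input, within the range [-1, 1]
bits = delta_sigma_first_order([dc_input] * osr)
average = sum(bits) / len(bits)
```

Averaging the 1-bit stream, a crude stand-in for the reconstruction filter, recovers a value close to the input. The feedback keeps the accumulated error bounded, so the averaging error shrinks as more bits are produced per input value, which is the motivation for running a low-order loop at a very high oversampling ratio.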
27	Zvukové rozhraní pro průmyslový počítač / Audio Interface for Embedded PC
Staroň, Martin (January 2011)
This master's thesis covers the design of a computer sound interface, including measurement of its audio performance. The work concerns the design of both the analog front ends and the digital support circuits. The concept includes sigma-delta analog-to-digital (ADC) and digital-to-analog (DAC) converters, which were built on two separate printed circuit boards. All signal paths in the circuitry use differential mode, which audio engineers refer to as balanced. The design uses modern circuit components, such as fully differential operational amplifiers, electronically gain-controlled preamplifiers, low-dropout linear regulators with low noise, DC component suppression circuits, and low-jitter active components. The theoretical part of the thesis defines selected audio terms and discusses loudness leveling of audio program material. It also gives criteria for selecting suitable active and passive components. Simulations of the fundamental circuit blocks are included as well. The practical part covers the complete layout of the printed circuit boards and the prototyping. The prototype device has a wide range of applications: it is intended for use not only in industrial computers but also in dedicated sound converters, measurement cards, mixing consoles, switching matrices, active loudspeakers, and embedded systems.
28	ALL-OPTICAL DELTA-SIGMA MODULATOR DESIGN AND IMPLEMENTATION
TAFAZOLI MEHRJERDI, MOHAMAD (01 December 2015) (PDF)
This research extends an approach to the design and implementation of an all-optical delta-sigma modulator (ODSM). The two main blocks of the modulator are a “leaky integrator” and a “bi-stable switch”, designed and implemented using an active element, a semiconductor optical amplifier (SOA), and passive elements such as optical filters, isolators, and couplers. All experiments were carried out on an optical table, and satisfactory results were achieved. The new bi-stable switch is designed and implemented using an “inverted bi-stable switch” and a “non-inverted bi-stable switch”, and is built from five ring lasers. Suitable wavelengths were chosen for each ring laser to achieve a novel characteristic called “Proteresis”. All control parameters of this switch were investigated. The major impact of this research is expected to be in communication systems, which need high resolution and fast modulation speed with less noise.
Asynchronous delta–sigma modulator; Binary delta–sigma modulator; Optical A/D converter; Optical delta–sigma modulator; Oversampling; Proteresis
29	The impact of missing data imputation on HCC survival prediction: Exploring the combination of missing data imputation with data-level methods such as clustering and oversampling
Dalla Torre, Kevin; Abdul Jalil, Walid (January 2018)
The area of data imputation, the process of replacing missing data with substituted values, has been covered quite extensively in recent years. The literature on the practical impact of data imputation methods on classification, however, remains scarce. This bachelor's thesis explores the impact of some state-of-the-art data imputation methods on HCC survival prediction and classification in combination with data-level methods such as oversampling. More specifically, it explores imputation methods for mixed-type datasets and their impact on a particular HCC dataset. Previous research has shown that the newer, more sophisticated imputation methods outperform simpler ones when evaluated with normalized root mean square error (NRMSE). Contrary to intuition, however, the results of this study show that when imputation is combined with other data-level methods such as clustering and oversampling, the differences in imputation performance do not always affect classification in any meaningful way. This might be explained by the noise that is introduced when synthetic data points are generated in the oversampling process. The results also show that one of the more sophisticated imputation methods, MICE, is highly dependent on prior assumptions about the underlying distributions of the dataset. When those assumptions are incorrect, the imputation method performs poorly and has a considerable negative impact on classification.
missing data imputation; HCC survival prediction; oversampling; saknade data imputation; HCC överlevnads klassificering; översampling; Engineering and Technology; Teknik och teknologier
30	[en] BLIND RECEPTION OF SEQUENCES EXPLORING OVERSAMPLING / [pt] RECEPÇÃO CEGA DE SEQUÊNCIAS EXPLORANDO SUPERAMOSTRAGEM
ERNESTO LEITE PINTO (12 December 2005)
This work proposes ways of exploiting oversampling of the received signal (sampling at multiples of the symbol rate) in blind receivers that decide on symbol sequences, in order to improve performance over fast, frequency-selective fading channels. The work focuses on MLSE/PSP (maximum-likelihood sequence estimation/per-survivor processing) receivers based on stochastic modeling of the channel. A state-space model for the oversampled received signal is developed from a generic continuous-time transmission system model, so that its parameters can be readily related to the transmission system. Two MLSE/PSP receiving schemes relying on this model are proposed, one of which is specially suited to dealing with the colored noise that is intrinsically associated with oversampling. Computer simulations were conducted to evaluate the performance of the proposed blind receivers. The results show that these schemes significantly outperform the synchronous MLSE/PSP receiver (for which the sampling rate is equal to the symbol rate).
[pt] DESVANECIMENTO SELETIVO; [pt] DESVANECIMENTO RAPIDO; [pt] RECEPCAO CEGA; [pt] SUPERAMOSTRAGEM; [en] SELECTIVE FADING; [en] FAST FADE-OUT; [en] BLIND RECEPTION; [en] OVERSAMPLING
