Global ETD Search

1	Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics Ding, Zejin 07 May 2011 (has links) In this dissertation, the problem of learning from highly imbalanced data is studied. Imbalance data learning is of great importance and challenge in many real applications. Dealing with a minority class normally needs new concepts, observations and solutions in order to fully understand the underlying complicated models. We try to systematically review and solve this special learning task in this dissertation.We propose a new ensemble learning framework—Diversified Ensemble Classifiers for Imbal-anced Data Learning (DECIDL), based on the advantages of existing ensemble imbalanced learning strategies. Our framework combines three learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by reversely data re-labeling. As a meta-learner, DECIDL utilizes general supervised learning algorithms as base learners to build an ensemble committee. We create a standard benchmark data pool, which contains 30 highly skewed sets with diverse characteristics from different domains, in order to facilitate future research on imbalance data learning. We use this benchmark pool to evaluate and compare our DECIDL framework with several ensemble learning methods, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost. Extensive experiments suggest that our DECIDL framework is comparable with other methods. The data sets, experiments and results provide a valuable knowledge base for future research on imbalance learning. We develop a simple but effective artificial example generation method for data balancing. Two new methods DBEG-ensemble and DECIDL-DBEG are then designed to improve the power of imbalance learning. Experiments show that these two methods are comparable to the state-of-the-art methods, e.g., GSVM-RU and SMOTE-bagging. Furthermore, we investigate learning on imbalanced data from a new angle—active learning. By combining active learning with the DECIDL framework, we show that the newly designed Active-DECIDL method is very effective for imbalance learning, suggesting the DECIDL framework is very robust and flexible.Lastly, we apply the proposed learning methods to a real-world bioinformatics problem—protein methylation prediction. Extensive computational results show that the DECIDL method does perform very well for the imbalanced data mining task. Importantly, the experimental results have confirmed our new contributions on this particular data learning problem. Machine learning Classification Imbalanced data learning Diversified ensemble Active learning Artificial data generation Bioinformatics Protein methylation
2	Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics DING, ZEJIN 07 May 2011 (has links) In this dissertation, the problem of learning from highly imbalanced data is studied. Imbalance data learning is of great importance and challenge in many real applications. Dealing with a minority class normally needs new concepts, observations and solutions in order to fully understand the underlying complicated models. We try to systematically review and solve this special learning task in this dissertation.We propose a new ensemble learning framework—Diversified Ensemble Classifiers for Imbal-anced Data Learning (DECIDL), based on the advantages of existing ensemble imbalanced learning strategies. Our framework combines three learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by reversely data re-labeling. As a meta-learner, DECIDL utilizes general supervised learning algorithms as base learners to build an ensemble committee. We create a standard benchmark data pool, which contains 30 highly skewed sets with diverse characteristics from different domains, in order to facilitate future research on imbalance data learning. We use this benchmark pool to evaluate and compare our DECIDL framework with several ensemble learning methods, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost. Extensive experiments suggest that our DECIDL framework is comparable with other methods. The data sets, experiments and results provide a valuable knowledge base for future research on imbalance learning. We develop a simple but effective artificial example generation method for data balancing. Two new methods DBEG-ensemble and DECIDL-DBEG are then designed to improve the power of imbalance learning. Experiments show that these two methods are comparable to the state-of-the-art methods, e.g., GSVM-RU and SMOTE-bagging. Furthermore, we investigate learning on imbalanced data from a new angle—active learning. By combining active learning with the DECIDL framework, we show that the newly designed Active-DECIDL method is very effective for imbalance learning, suggesting the DECIDL framework is very robust and flexible.Lastly, we apply the proposed learning methods to a real-world bioinformatics problem—protein methylation prediction. Extensive computational results show that the DECIDL method does perform very well for the imbalanced data mining task. Importantly, the experimental results have confirmed our new contributions on this particular data learning problem. Machine learning Classification Imbalanced data learning Diversified ensemble Active learning Artificial data generation Bioinformatics Protein methylation Computer Sciences
3	以資料採礦的方法探索影響台灣地區女性戶長的原因李孟謙, LEE, MENG CHIEN Unknown Date (has links) 「資料採礦」(Data Mining)為一種結合統計分析、資訊工程和各領域間專業知識的一種新興分析技術，例如：產業界的市場分析，金融界的財務分析，保險業的風險管理，生物科技界的疾病分析以及政府的人口統計，在各行各業使用資料採礦技術的人員日益增加。然而，正因資料採礦屬於新興發展的領域，仍有不少事項尚待開發，例如：不同型態的資料如何處理。本文即探討兩種不同型態的資料：資料量多、變數少以及資料量少、變數多兩種，以監督學習(Supervised Learning)和分類(Classification)的概念，分別對觀察值較多的2000年台灣地區戶口普查資料探討影響女性戶長的因素，而對變數較多的攝謢腺癌資料詮釋血清的病症類型，研究不同的類型資料可能的處理步驟。本文主要的結論為：1.當資料量多時，引入抽樣的概念，資料採礦可利用抽樣將資料量縮減，減少處理時間，並且抽樣資料和全部資料在分類錯誤率的差異頗為相近，因此抽樣為一種可行的處理方式。以研究女性戶長為例，資料量最少的東部資料為抽樣代表，在不失分類準確性的前提下，抽樣3%資料的分析結果與使用整體資料的結果相差不多，達到合乎經濟效應。2.當資料量少時，引入變數縮減的想法，使用敘述性統計量和不均度的17個指標統計量，能替代全部變數進行分析，運用羅吉斯迴歸方法，分類錯誤率的結果在可接受範圍內，並且解決在傳統分析上自由度不夠的問題。以研究攝護腺癌症為例，在不損失太多分類正確性的原則下，將血清透過質譜儀所反映的強度，透過變數縮減的技巧提高分析效率；另外，縮減變數後自由度充足，傳統的統計方法可運用在攝護腺癌的資料上，使分析的工具有較廣泛的選擇。女性戶長攝謢腺癌資料採礦男女平權分類監督學習 householder prostate cancer data mining equal right classification data learning
4	Design and assessment of a computer-assisted artificial intelligence system for predicting preterm labor in women attending regular check-ups. Emphasis in imbalance data learning technique Nieto del Amor, Félix 18 December 2023 (has links) Tesis por compendio / [ES] El parto prematuro, definido como el nacimiento antes de las 37 semanas de gestación, es una importante preocupación mundial con implicaciones para la salud de los recién nacidos y los costes económicos. Afecta aproximadamente al 11% de todos los nacimientos, lo que supone más de 15 millones de individuos en todo el mundo. Los métodos actuales para predecir el parto prematuro carecen de precisión, lo que conduce a un sobrediagnóstico y a una viabilidad limitada en entornos clínicos. La electrohisterografía (EHG) ha surgido como una alternativa prometedora al proporcionar información relevante sobre la electrofisiología uterina. Sin embargo, los sistemas de predicción anteriores basados en EHG no se han trasladado de forma efectiva a la práctica clínica, debido principalmente a los sesgos en el manejo de datos desbalanceados y a la necesidad de modelos de predicción robustos y generalizables. Esta tesis doctoral pretende desarrollar un sistema de predicción del parto prematuro basado en inteligencia artificial utilizando EHG y datos obstétricos de mujeres sometidas a controles prenatales regulares. Este sistema implica la extracción de características relevantes, la optimización del subespacio de características y la evaluación de estrategias para abordar el reto de los datos desbalanceados para una predicción robusta. El estudio valida la eficacia de las características temporales, espectrales y no lineales para distinguir entre casos de parto prematuro y a término. Las nuevas medidas de entropía, en concreto la dispersión y la entropía de burbuja, superan a las métricas de entropía tradicionales en la identificación del parto prematuro. Además, el estudio trata de maximizar la información complementaria al tiempo que minimiza la redundancia y las características de ruido para optimizar el subespacio de características para una predicción precisa del parto prematuro mediante un algoritmo genético. Además, se ha confirmado la fuga de información entre el conjunto de datos de entrenamiento y el de prueba al generar muestras sintéticas antes de la partición de datos, lo que da lugar a una capacidad de generalización sobreestimada del sistema predictor. Estos resultados subrayan la importancia de particionar y después remuestrear para garantizar la independencia de los datos entre las muestras de entrenamiento y de prueba. Se propone combinar el algoritmo genético y el remuestreo en la misma iteración para hacer frente al desequilibrio en el aprendizaje de los datos mediante el enfoque de particio'n-remuestreo, logrando un área bajo la curva ROC del 94% y una precisión media del 84%. Además, el modelo demuestra un F1-score y una sensibilidad de aproximadamente el 80%, superando a los estudios existentes que consideran el enfoque de remuestreo después de particionar. Esto revela el potencial de un sistema de predicción de parto prematuro basado en EHG, permitiendo estrategias orientadas al paciente para mejorar la prevención del parto prematuro, el bienestar materno-fetal y la gestión óptima de los recursos hospitalarios. En general, esta tesis doctoral proporciona a los clínicos herramientas valiosas para la toma de decisiones en escenarios de riesgo materno-fetal de parto prematuro. Permite a los clínicos diseñar estrategias orientadas al paciente para mejorar la prevención y el manejo del parto prematuro. La metodología propuesta es prometedora para el desarrollo de un sistema integrado de predicción del parto prematuro que pueda mejorar la planificación del embarazo, optimizar la asignación de recursos y reducir el riesgo de parto prematuro. / [CA] El part prematur, definit com el naixement abans de les 37 setmanes de gestacio', e's una important preocupacio' mundial amb implicacions per a la salut dels nounats i els costos econo¿mics. Afecta aproximadament a l'11% de tots els naixements, la qual cosa suposa me's de 15 milions d'individus a tot el mo'n. Els me¿todes actuals per a predir el part prematur manquen de precisio', la qual cosa condueix a un sobrediagno¿stic i a una viabilitat limitada en entorns cl¿'nics. La electrohisterografia (EHG) ha sorgit com una alternativa prometedora en proporcionar informacio' rellevant sobre l'electrofisiologia uterina. No obstant aixo¿, els sistemes de prediccio' anteriors basats en EHG no s'han traslladat de manera efectiva a la pra¿ctica cl¿'nica, degut principalment als biaixos en el maneig de dades desequilibrades i a la necessitat de models de prediccio' robustos i generalitzables. Aquesta tesi doctoral prete'n desenvolupar un sistema de prediccio' del part prematur basat en intel·lige¿ncia artificial utilitzant EHG i dades obste¿triques de dones sotmeses a controls prenatals regulars. Aquest sistema implica l'extraccio' de caracter¿'stiques rellevants, l'optimitzacio' del subespai de caracter¿'stiques i l'avaluacio' d'estrate¿gies per a abordar el repte de les dades desequilibrades per a una prediccio' robusta. L'estudi valguda l'efica¿cia de les caracter¿'stiques temporals, espectrals i no lineals per a distingir entre casos de part prematur i a terme. Les noves mesures d'entropia, en concret la dispersio' i l'entropia de bambolla, superen a les me¿triques d'entropia tradicionals en la identificacio' del part prematur. A me's, l'estudi tracta de maximitzar la informacio' complementa¿ria al mateix temps que minimitza la redunda¿ncia i les caracter¿'stiques de soroll per a optimitzar el subespai de caracter¿'stiques per a una prediccio' precisa del part prematur mitjan¿cant un algorisme gene¿tic. A me's, hem confirmat la fugida d'informacio' entre el conjunt de dades d'entrenament i el de prova en generar mostres sinte¿tiques abans de la particio' de dades, la qual cosa dona lloc a una capacitat de generalitzacio' sobreestimada del sistema predictor. Aquests resultats subratllen la importa¿ncia de particionar i despre's remostrejar per a garantir la independe¿ncia de les dades entre les mostres d'entrenament i de prova. Proposem combinar l'algorisme gene¿tic i el remostreig en la mateixa iteracio' per a fer front al desequilibri en l'aprenentatge de les dades mitjan¿cant l'enfocament de particio'-remostrege, aconseguint una a¿rea sota la corba ROC del 94% i una precisio' mitjana del 84%. A me's, el model demostra una puntuacio' F1 i una sensibilitat d'aproximadament el 80%, superant als estudis existents que consideren l'enfocament de remostreig despre's de particionar. Aixo¿ revela el potencial d'un sistema de prediccio' de part prematur basat en EHG, permetent estrate¿gies orientades al pacient per a millorar la prevencio' del part prematur, el benestar matern-fetal i la gestio' o¿ptima dels recursos hospitalaris. En general, aquesta tesi doctoral proporciona als cl¿'nics eines valuoses per a la presa de decisions en escenaris de risc matern-fetal de part prematur. Permet als cl¿'nics dissenyar estrate¿gies orientades al pacient per a millorar la prevencio' i el maneig del part prematur. La metodologia proposada e's prometedora per al desenvolupament d'un sistema integrat de prediccio' del part prematur que puga millorar la planificacio' de l'embara¿s, optimitzar l'assignacio' de recursos i millorar la qualitat de l'atencio'. / [EN] Preterm delivery, defined as birth before 37 weeks of gestation, is a significant global concern with implications for the health of newborns and economic costs. It affects approximately 11% of all births, amounting to more than 15 million individuals worldwide. Current methods for predicting preterm labor lack precision, leading to overdiagnosis and limited practicality in clinical settings. Electrohysterography (EHG) has emerged as a promising alternative by providing relevant information about uterine electrophysiology. However, previous prediction systems based on EHG have not effectively translated into clinical practice, primarily due to biases in handling imbalanced data and the need for robust and generalizable prediction models. This doctoral thesis aims to develop an artificial intelligence based preterm labor prediction system using EHG and obstetric data from women undergoing regular prenatal check-ups. This system entails extracting relevant features, optimizing the feature subspace, and evaluating strategies to address the imbalanced data challenge for robust prediction. The study validates the effectiveness of temporal, spectral, and non-linear features in distinguishing between preterm and term labor cases. Novel entropy measures, namely dispersion and bubble entropy, outperform traditional entropy metrics in identifying preterm labor. Additionally, the study seeks to maximize complementary information while minimizing redundancy and noise features to optimize the feature subspace for accurate preterm delivery prediction by a genetic algorithm. Furthermore, we have confirmed leakage information between train and test data set when generating synthetic samples before data partitioning giving rise to an overestimated generalization capability of the predictor system. These results emphasize the importance of using partitioning-resampling techniques for ensuring data independence between train and test samples. We propose to combine genetic algorithm and resampling method at the same iteration to deal with imbalanced data learning using partition-resampling pipeline, achieving an Area Under the ROC Curve of 94% and Average Precision of 84%. Moreover, the model demonstrates an F1-score and recall of approximately 80%, outperforming existing studies on partition-resampling pipeline. This finding reveals the potential of an EHG-based preterm birth prediction system, enabling patient-oriented strategies for enhanced preterm labor prevention, maternal-fetal well-being, and optimal hospital resource management. Overall, this doctoral thesis provides clinicians with valuable tools for decision-making in preterm labor maternal-fetal risk scenarios. It enables clinicians to design a patient-oriented strategies for enhanced preterm birth prevention and management. The proposed methodology holds promise for the development of an integrated preterm birth prediction system that can enhance pregnancy planning, optimize resource allocation, and ultimately improve the outcomes for both mother and baby. / Nieto Del Amor, F. (2023). Design and assessment of a computer-assisted artificial intelligence system for predicting preterm labor in women attending regular check-ups. Emphasis in imbalance data learning technique [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/200900 / Compendio Predicción de parto prematuro Electrohisterografía Desequilibrio de datos Algoritmo genético Métodos de remuestreo Electromiografía uterina Aprendizaje automático Preterm labor prediction Electrohysterography Imbalance data learning Genetic algorithm Resampling methods Uterine electromyography Machine learning TECNOLOGIA ELECTRONICA
5	Towards meaningful and data-efficient learning : exploring GAN losses, improving few-shot benchmarks, and multimodal video captioning Huang, Gabriel 09 1900 (has links) Ces dernières années, le domaine de l’apprentissage profond a connu des progrès énormes dans des applications allant de la génération d’images, détection d’objets, modélisation du langage à la réponse aux questions visuelles. Les approches classiques telles que l’apprentissage supervisé nécessitent de grandes quantités de données étiquetées et spécifiques à la tâches. Cependant, celles-ci sont parfois coûteuses, peu pratiques, ou trop longues à collecter. La modélisation efficace en données, qui comprend des techniques comme l’apprentissage few-shot (à partir de peu d’exemples) et l’apprentissage self-supervised (auto-supervisé), tentent de remédier au manque de données spécifiques à la tâche en exploitant de grandes quantités de données plus “générales”. Les progrès de l’apprentissage profond, et en particulier de l’apprentissage few-shot, s’appuient sur les benchmarks (suites d’évaluation), les métriques d’évaluation et les jeux de données, car ceux-ci sont utilisés pour tester et départager différentes méthodes sur des tâches précises, et identifier l’état de l’art. Cependant, du fait qu’il s’agit de versions idéalisées de la tâche à résoudre, les benchmarks sont rarement équivalents à la tâche originelle, et peuvent avoir plusieurs limitations qui entravent leur rôle de sélection des directions de recherche les plus prometteuses. De plus, la définition de métriques d’évaluation pertinentes peut être difficile, en particulier dans le cas de sorties structurées et en haute dimension, telles que des images, de l’audio, de la parole ou encore du texte. Cette thèse discute des limites et des perspectives des benchmarks existants, des fonctions de coût (training losses) et des métriques d’évaluation (evaluation metrics), en mettant l’accent sur la modélisation générative - les Réseaux Antagonistes Génératifs (GANs) en particulier - et la modélisation efficace des données, qui comprend l’apprentissage few-shot et self-supervised. La première contribution est une discussion de la tâche de modélisation générative, suivie d’une exploration des propriétés théoriques et empiriques des fonctions de coût des GANs. La deuxième contribution est une discussion sur la limitation des few-shot classification benchmarks, certains ne nécessitant pas de généralisation à de nouvelles sémantiques de classe pour être résolus, et la proposition d’une méthode de base pour les résoudre sans étiquettes en phase de testing. La troisième contribution est une revue sur les méthodes few-shot et self-supervised de détection d’objets , qui souligne les limites et directions de recherche prometteuses. Enfin, la quatrième contribution est une méthode efficace en données pour la description de vidéo qui exploite des jeux de données texte et vidéo non supervisés. / In recent years, the field of deep learning has seen tremendous progress for applications ranging from image generation, object detection, language modeling, to visual question answering. Classic approaches such as supervised learning require large amounts of task-specific and labeled data, which may be too expensive, time-consuming, or impractical to collect. Data-efficient methods, such as few-shot and self-supervised learning, attempt to deal with the limited availability of task-specific data by leveraging large amounts of general data. Progress in deep learning, and in particular, few-shot learning, is largely driven by the relevant benchmarks, evaluation metrics, and datasets. They are used to test and compare different methods on a given task, and determine the state-of-the-art. However, due to being idealized versions of the task to solve, benchmarks are rarely equivalent to the original task, and can have several limitations which hinder their role of identifying the most promising research directions. Moreover, defining meaningful evaluation metrics can be challenging, especially in the case of high-dimensional and structured outputs, such as images, audio, speech, or text. This thesis discusses the limitations and perspectives of existing benchmarks, training losses, and evaluation metrics, with a focus on generative modeling—Generative Adversarial Networks (GANs) in particular—and data-efficient modeling, which includes few-shot and self-supervised learning. The first contribution is a discussion of the generative modeling task, followed by an exploration of theoretical and empirical properties of the GAN loss. The second contribution is a discussion of a limitation of few-shot classification benchmarks, which is that they may not require class semantic generalization to be solved, and the proposal of a baseline method for solving them without test-time labels. The third contribution is a survey of few-shot and self-supervised object detection, which points out the limitations and promising future research for the field. Finally, the fourth contribution is a data-efficient method for video captioning, which leverages unsupervised text and video datasets, and explores several multimodal pretraining strategies. self-supervised learning few-shot classification few-shot object detection low-data learning object detection instance segmentation representation learning residual network visual transformer Faster R-CNN DETR parametric adversarial divergence generative adversarial network variational auto-encoder maximum-likelihood structured prediction optimal discriminator mutual information implicit generative model multimodal pretraining dense video captioning cross-attention YouCook2 HowTo-100M Youtube-8M Recipe-1M Pascal VOC MSCOCO LVIS mutual information neural estimation apprentissage auto-supervisé classification few-shot détection d'objets few-shot apprentissage efficace en données segmentation en instances apprentissage de représentation réseau résiduel transformer visual divergences antagonistes paramétriques auto-encodeur variationnel maximum de vraisemblance prédiction structurée discriminateur optimal information mutuelle modèle génératif implicite pré-apprentissage multi-modal description dense de vidéo attention croisée ResNet ViT GAN VAE MINE

1

Page generated in 0.0627 seconds