161

Development and Evaluation of Infilling Methods for Missing Hydrologic and Chemical Watershed Monitoring Data

Johnston, Carey Andrew 30 September 1999
Watershed monitoring programs rarely achieve perfect data-collection success rates because of a variety of field and laboratory factors. A major source of error in many stream-gaging records is lost or missing data caused by malfunctioning stream-side equipment; studies estimate that between 5 and 20 percent of stream-gaging data may be marked as missing for one reason or another. Methods that reconstruct or infill missing data produce larger data sets, which generally yield better estimates of the sampled parameter and permit practical use of the data in hydrologic or water-quality calculations. This study uses data from a watershed monitoring program operating in the Northern Virginia area to: (1) identify and summarize the major reasons missing data occur; (2) provide recommendations for reducing the occurrence of missing data; (3) describe methods for infilling missing chemical data; (4) develop and evaluate methods for infilling values to replace missing chemical data; and (5) recommend different infilling methods for various conditions. Different infilling methods for chemical data are evaluated across a variety of factors (e.g., amount of annual rainfall, whether the missing chemical parameter is strongly correlated with flow, amount of missing data) using Monte Carlo modeling. Using the results of the Monte Carlo modeling, a Decision Support System (DSS) is developed for easy application of the most appropriate infilling method. / Master of Science
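
The Monte Carlo evaluation described in this abstract, masking values that were actually observed, infilling them, and scoring the result over many random trials, can be sketched as follows. This is a minimal illustration rather than the thesis's DSS: the column names, the two infilling rules (temporal interpolation and regression on streamflow), and the RMSE metric are assumptions made for the example.

```python
# Minimal Monte Carlo comparison of two infilling rules on a daily record.
# Assumed setup: a pandas DataFrame with a complete 'flow' column and a
# chemistry column 'tss' that is strongly correlated with flow.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def evaluate_infilling(df, target="tss", predictor="flow",
                       frac_missing=0.10, n_trials=200):
    """Mask a fraction of `target`, infill it two ways, and report mean RMSE."""
    rmse = {"interpolate": [], "flow_regression": []}
    for _ in range(n_trials):
        work = df.copy()
        hidden = rng.choice(work.index, size=int(frac_missing * len(work)),
                            replace=False)
        truth = work.loc[hidden, target].to_numpy()
        work.loc[hidden, target] = np.nan

        # Rule 1: linear interpolation in time.
        filled = work[target].interpolate(limit_direction="both")
        rmse["interpolate"].append(
            np.sqrt(np.mean((filled.loc[hidden].to_numpy() - truth) ** 2)))

        # Rule 2: regression on streamflow, fit on the rows still observed.
        obs = work.dropna(subset=[target])
        model = LinearRegression().fit(obs[[predictor]], obs[target])
        pred = model.predict(work.loc[hidden, [predictor]])
        rmse["flow_regression"].append(np.sqrt(np.mean((pred - truth) ** 2)))
    return {k: float(np.mean(v)) for k, v in rmse.items()}
```

Repeating such trials across different rainfall regimes, flow correlations, and missing-data fractions is what lets a decision support system map each condition to the infilling rule with the lowest expected error.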
162

A Comparison of Techniques for Handling Missing Data in Longitudinal Studies

Bogdan, Alexander R 07 November 2016
Missing data are a common problem in virtually all epidemiological research, especially in longitudinal studies. In these settings, clinicians may collect biological samples to analyze changes in biomarkers, which often do not conform to parametric distributions and may be censored due to limits of detection. Using complete data from the BioCycle Study (2005-2007), which followed 259 premenopausal women over two menstrual cycles, we compared four techniques for handling missing biomarker data with non-Normal distributions. We imposed increasing degrees of missing data on two non-Normally distributed biomarkers under conditions of missing completely at random, missing at random, and missing not at random. Generalized estimating equations were used to obtain estimates from complete-case analysis, multiple imputation using joint modeling, multiple imputation using chained equations, and multiple imputation using chained equations with predictive mean matching on Day 2, Day 13, and Day 14 of a standardized 28-day menstrual cycle. Estimates were compared against those obtained from analysis of the completely observed biomarker data. All techniques performed comparably when applied to a Normally distributed biomarker. Multiple imputation using joint modeling and multiple imputation using chained equations produced similar estimates across all types and degrees of missingness for each biomarker. Multiple imputation using chained equations with predictive mean matching consistently deviated from both the complete-data estimates and the other missing-data techniques when applied to a biomarker with a bimodal distribution. When addressing missing biomarker data in longitudinal studies, special attention should be given to the underlying distribution of the missing variable. As biomarker distributions approach Normality, the amount of missing data that can be tolerated while still obtaining accurate estimates may also increase when data are missing at random. Future studies are needed to assess these techniques under more elaborate missingness mechanisms and to explore interactions between biomarkers for improved imputation models.
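
A compact way to reproduce the chained-equations part of this comparison is sketched below, using scikit-learn's IterativeImputer as a stand-in for MICE on a synthetic, skewed "biomarker". The variable names, sample size, and log-normal shape are assumptions for illustration only; they are not the BioCycle variables, and a full analysis would pool several imputations and fit GEE models as the abstract describes.

```python
# Compare complete-case analysis with chained-equations imputation on a
# skewed (non-Normal) outcome. Sketch only; a real analysis would generate
# multiple imputed datasets and pool the estimates.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
n = 259
age = rng.normal(30, 5, n)
bmi = rng.normal(25, 4, n)
biomarker = np.exp(0.05 * age + 0.02 * bmi + rng.normal(0, 0.5, n))  # skewed

df = pd.DataFrame({"age": age, "bmi": bmi, "biomarker": biomarker})

# Impose biomarker values missing at random, with probability depending on age.
p_miss = 0.6 / (1 + np.exp(-(age - 30) / 5))
df_mis = df.copy()
df_mis.loc[rng.random(n) < p_miss, "biomarker"] = np.nan

complete_case_mean = df_mis["biomarker"].mean()

# Chained-equations imputation; imputing on the log scale respects the skew.
df_log = df_mis.assign(biomarker=np.log(df_mis["biomarker"]))
imputed = IterativeImputer(sample_posterior=True, random_state=0).fit_transform(df_log)
mice_mean = np.exp(imputed[:, 2]).mean()

print(f"true mean          {df['biomarker'].mean():.3f}")
print(f"complete-case mean {complete_case_mean:.3f}")
print(f"imputed mean       {mice_mean:.3f}")
```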
163

Modeling Unbalanced Nested Repeated Measures Data In The Presence of Informative Drop-out with Application to Ambulatory Blood Pressure Monitoring Data

Ghulam, Enas M., Ph.D. 01 October 2019
No description available.
164

Domain Adaptation Applications to Complex High-dimensional Target Data

Stanojevic, Marija, 0000-0001-8227-6577 January 2023
In the last decade, machine learning models have grown in size and in the amount of data they use, which has led to improved performance on many tasks. Most notably, end-to-end deep learning and reinforcement learning models have developed rapidly, with new learning algorithms and architectures proposed frequently. Furthermore, while previous methods focused on supervised learning, in the last five years many models have been proposed that learn in semi-supervised or self-supervised ways and are then fine-tuned to a specific task or a different data domain. Adapting machine learning models learned on one type of data to similar but different data is called domain adaptation. This thesis discusses various challenges in the domain adaptation of machine learning models to specific tasks and real-world applications and proposes solutions for those challenges. Data in real-world applications have different properties than the clean machine-learning datasets commonly used for the experimental evaluation of proposed models. Learning appropriate representations from high-dimensional, complex data with internal dependencies is arduous due to the curse of dimensionality and spurious correlations, yet most real-world data have these properties and offer only a small number of labeled samples, since labeling is expensive and tedious. Additionally, accuracy drops drastically if models are applied to domain-specific datasets and unbalanced problems. Moreover, state-of-the-art models cannot handle missing data. In this thesis, I strive to create frameworks that can learn a good representation of high-dimensional small data with correlations between variables. The first chapter of this thesis describes the motivation, background, and research objectives, and gives an overview of contributions and publications. The background needed to understand this thesis is provided in the second chapter, and an introduction to domain adaptation is given in the third. The fourth chapter discusses domain adaptation with small target data. It describes an algorithm for semi-supervised learning over domain-specific short texts such as reviews or tweets; the proposed framework achieves up to a 12.6% improvement when only 5000 labeled examples are available. The fifth chapter explores the influence of unanticipated bias in fine-tuning data. It outlines how bias in news data influences the classification performance of domain-specific text, where the domain is U.S. politics, and shows that fine-tuning with domain-specific data is not always beneficial, especially if bias towards one label is present. The sixth chapter examines domain adaptation on datasets with high missing rates. It reviews a system created to learn from high-dimensional small data from psychological studies, which have up to 70% missingness; the proposed framework achieves 9.3% lower imputation error and 33% lower prediction error. The seventh chapter discusses the curse-of-dimensionality problem in domain adaptation. It presents a methodology for discovering research articles containing evolutionary timetrees; the system can search for, download, and filter research articles in which timetrees are imported, scanning 5 million articles in a few days. The proposed method also decreases the error of finding research papers by 21% compared to the baseline, which cannot work with high-dimensional data properly. The last, eighth chapter summarizes the findings of this thesis and outlines future prospects. / Computer and Information Science
165

The impact of missing data imputation on HCC survival prediction : Exploring the combination of missing data imputation with data-level methods such as clustering and oversampling

Dalla Torre, Kevin, Abdul Jalil, Walid January 2018
The area of data imputation, the process of replacing missing data with substituted values, has been covered quite extensively in recent years. The literature on the practical impact of data imputation, however, remains scarce. This thesis explores the impact of some state-of-the-art data imputation methods on HCC survival prediction and classification in combination with data-level methods such as oversampling. More specifically, it explores imputation methods for mixed-type datasets and their impact on a particular HCC dataset. Previous research has shown that the newer, more sophisticated imputation methods outperform simpler ones when evaluated with normalized root mean square error (NRMSE). Contrary to intuition, however, the results of this study show that when imputation is combined with other data-level methods such as clustering and oversampling, differences in imputation performance do not always affect classification in any meaningful way. This might be explained by the noise introduced when generating synthetic data points in the oversampling process. The results also show that one of the more sophisticated imputation methods, namely MICE, is highly dependent on prior assumptions about the underlying distributions of the dataset. When those assumptions are incorrect, the imputation method performs poorly and has a considerable negative impact on classification.
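
The pipeline evaluated here, impute, oversample, then classify, can be outlined as in the sketch below. It is a generic illustration: the NRMSE definition, scikit-learn's SimpleImputer and IterativeImputer, imbalanced-learn's SMOTE, and a random-forest classifier are assumptions standing in for the study's methods and the HCC dataset.

```python
# Sketch: does a better imputer (by NRMSE) still matter after oversampling?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(2)

def nrmse(truth, imputed, mask):
    """Normalized RMSE over the entries that were masked out."""
    err = truth[mask] - imputed[mask]
    return np.sqrt(np.mean(err ** 2)) / truth[mask].std()

# Imbalanced synthetic data with 20% of the cells missing completely at random.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.85],
                           random_state=0)
mask = rng.random(X.shape) < 0.2
X_mis = np.where(mask, np.nan, X)

for name, imputer in [("mean", SimpleImputer()),
                      ("iterative", IterativeImputer(random_state=0))]:
    X_imp = imputer.fit_transform(X_mis)
    Xtr, Xte, ytr, yte = train_test_split(X_imp, y, stratify=y, random_state=0)
    Xtr_bal, ytr_bal = SMOTE(random_state=0).fit_resample(Xtr, ytr)
    clf = RandomForestClassifier(random_state=0).fit(Xtr_bal, ytr_bal)
    print(name, "NRMSE:", round(nrmse(X, X_imp, mask), 3),
          "F1:", round(f1_score(yte, clf.predict(Xte)), 3))
```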
166

Effects of Full Information Maximum Likelihood, Expectation Maximization, Multiple Imputation, and Similar Response Pattern Imputation on Structural Equation Modeling with Incomplete and Multivariate Nonnormal Data

Li, Jian 22 October 2010
No description available.
167

Context-aware Learning from Partial Observations

Gligorijevic, Jelena January 2018
The Big Data revolution brought an increasing availability of data sets of unprecedented scale, enabling researchers in the machine learning and data mining communities to learn from such data at scale and to provide data-driven insights, decisions, and predictions. Along the way, however, they face numerous challenges, including dealing with missing observations while learning from such data and making predictions on previously unobserved or rare (“tail”) examples, which arise across many domains, including climate, medicine, social networks, consumer applications, and computational advertising. In this thesis, we address this important problem and propose tools for handling partially observed or completely unobserved data by exploiting information from their context. Here, we assume that the context is available in the form of a network or sequence structure, or as additional information attached to point-informative data examples. First, we propose two structured regression methods for dealing with missing values in partially observed temporal attributed graphs, based on the Gaussian Conditional Random Fields (GCRF) model, which draw power from the network/graph structure (context) of the unobserved instances. The Marginalized Gaussian Conditional Random Fields (m-GCRF) model is designed to handle missing response values (labels) at graph nodes, whereas the Deep Feature Learning GCRF handles missing values in explanatory variables while learning feature representations jointly with the complex interactions of nodes in a graph and with the overall GCRF objective. Next, we consider unsupervised and supervised, shallow and deep neural models for monetizing web search. We focus on two sponsored-search tasks: (i) query-to-ad matching, where we propose a novel shallow neural embedding model, worLd2vec, with improved use of local query context (location), and (ii) click-through-rate prediction for ads and queries, where the Deeply Supervised Semantic Match model is introduced to predict click-through rates for unobserved and tail queries while jointly learning the semantic embeddings of a query and an ad as well as their corresponding click-through rate. Finally, we propose a deep learning approach for ranking investigators by their expected enrollment performance on new clinical trials. It learns from both investigator- and trial-related heterogeneous (structured and free-text) data sources, matches investigators to new trials from partial observations, and supports recruitment of experienced investigators as well as new investigators with no previous experience in enrolling patients in clinical trials. Experimental evaluation of the proposed methods on a number of synthetic and diverse real-world data sets shows that they surpass their alternatives. / Computer and Information Science
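
The GCRF models mentioned in this abstract define a Gaussian density over all node labels that combines an unstructured predictor R(x) with a graph similarity S, so the posterior mean has a closed form. The sketch below shows that core computation on a toy graph with assumed alpha and beta; it is the plain GCRF, not the marginalized or deep-feature-learning variants developed in the thesis.

```python
# Gaussian Conditional Random Field:
#   P(y | x) ∝ exp(-alpha * sum_i (y_i - R_i)^2 - beta * sum_{edges (i,j)} S_ij (y_i - y_j)^2)
# This is a Gaussian whose mean solves (alpha*I + beta*L) y = alpha*R,
# where L = D - S is the graph Laplacian of the similarity matrix S.
import numpy as np

def gcrf_posterior_mean(R, S, alpha=1.0, beta=1.0):
    L = np.diag(S.sum(axis=1)) - S
    return np.linalg.solve(alpha * np.eye(len(R)) + beta * L, alpha * R)

# Toy example: 4 nodes in a chain; the unstructured predictor R disagrees with
# the graph, and the GCRF smooths its outputs toward linked neighbours.
R = np.array([1.0, 5.0, 1.2, 1.1])
S = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(gcrf_posterior_mean(R, S, alpha=1.0, beta=2.0))
```

When a node's label is unobserved, the posterior mean at that node serves as its prediction, which is the intuition behind the marginalized variant described above.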
168

Chemometric Approaches for Systems Biology

Folch Fortuny, Abel 23 January 2017
This Ph.D. thesis is devoted to studying, developing, and applying approaches commonly used in chemometrics to the emerging field of systems biology. Existing procedures and new methods are applied to answer research and industrial questions in different multidisciplinary teams. The methodologies developed in this document enrich the set of procedures employed within the omic sciences to understand biological organisms and improve processes in biotechnological industries, integrating biological knowledge at different levels and exploiting the software packages derived from the thesis. The dissertation is structured in four parts. The first block describes the framework on which the contributions presented here are based. The objectives of the two research projects related to this thesis are highlighted, and the specific topics addressed in this document via conference presentations and research articles are introduced. A comprehensive description of the omic sciences and their relationships within the systems biology paradigm is given in this part, together with a review of the multivariate methods most applied in chemometrics, on which the novel approaches proposed here are founded. The second part addresses problems of data understanding within metabolomics, fluxomics, proteomics, and genomics. Different alternatives are proposed in this block for understanding flux data under steady-state conditions. Some are based on applications of multivariate methods previously applied in other chemometrics areas; others are novel approaches based on a bilinear decomposition using elemental metabolic pathways, from which a GNU-licensed toolbox is made freely available to the scientific community. A framework for understanding metabolic data is also proposed for non-steady-state data, using the same bilinear decomposition proposed for steady-state data but modelling the dynamics of the experiments with novel two- and three-way data analysis procedures. The relationships between different omic levels are also assessed in this part by integrating different sources of information on plant viruses in data fusion models. Finally, an example of interaction between organisms, oranges and fungi, is studied via multivariate image analysis techniques, with future application in the food industry. The third block of this thesis is a thorough study of different missing data problems related to chemometrics, systems biology, and industrial bioprocesses. In the theoretical chapters of this part, new algorithms are proposed for obtaining multivariate exploratory and regression models in the presence of missing data; these also serve as preprocessing steps for any other methodology used by practitioners. Regarding applications, this block explores the reconstruction of networks in the omic sciences when missing and faulty measurements appear in databases, and how calibration models can be transferred between near-infrared instruments, avoiding costly and time-consuming full recalibrations in bioindustries and research laboratories. Finally, another software package, including a graphical user interface, is made freely available for missing data imputation. The last part discusses the relevance of this dissertation for research and biotechnology, including proposals deserving future research. / Folch Fortuny, A. (2016). Chemometric Approaches for Systems Biology [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/77148 / Premios Extraordinarios de tesis doctorales
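
One standard chemometric strategy for the missing-data problems discussed in this abstract is to fit a low-rank (PCA-like) model and let its reconstruction fill the gaps. A minimal iterative version is sketched below; the rank, tolerance, and mean-filled start are assumptions chosen for a self-contained example and do not reproduce the specific algorithms proposed in the thesis.

```python
# Iterative low-rank imputation: fill missing cells with column means, then
# alternate between fitting a rank-k SVD model and overwriting the missing
# cells with the model's reconstruction until the fill values stabilize.
import numpy as np

def lowrank_impute(X, rank=2, tol=1e-6, max_iter=500):
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)
    for _ in range(max_iter):
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        recon = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mean
        new = np.where(missing, recon, X)
        if np.max(np.abs(new - filled)) < tol:
            break
        filled = new
    return filled

# Usage: X is a samples-by-variables matrix with np.nan marking missing values.
```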
169

應用資料採礦技術於資料庫加值中的插補方法比較 / Imputation of value-added database in data mining

黃雅芳 Unknown Date
Data plays a vital role as a source of information for organizations, especially in the era of information and technology. A meaningful, qualitative, and representative database, if properly handled, could mean a promising breakthrough for an organization. From time to time, however, we may encounter a less-than-perfect database in which some of the data are missing. With an incomplete database, the results obtained may be biased or misleading. The purpose of this research is therefore to impute missing data in a value-added database and to build imputation models according to the type of missing data. If the missing data are continuous, a regression model and a back-propagation neural network (BPNN) are applied; if the missing data are categorical, logistic regression, a BPNN, and a decision tree are used. Simulation results show that for continuous missing data the regression model provides the best estimates, while for categorical missing data the C5.0 decision tree is the best choice. For rare data in the database, the BPNN provides the best estimates for continuous missing data, and the C5.0 decision tree again performs best for categorical missing data.
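
The type-dependent rule described in this abstract, a regression model for continuous gaps and a decision tree for categorical ones, can be sketched as follows. scikit-learn's DecisionTreeClassifier stands in for C5.0, which that library does not provide, and the example columns are assumed for illustration.

```python
# Impute a continuous column with linear regression and a categorical column
# with a decision tree, each trained on the rows where the target is observed.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier  # stand-in for C5.0

def impute_column(df, target, predictors):
    """Fill df[target] in place, choosing the model from the column's dtype."""
    observed = df[target].notna()
    if not observed.all():
        if pd.api.types.is_numeric_dtype(df[target]):
            model = LinearRegression()
        else:
            model = DecisionTreeClassifier(random_state=0)
        model.fit(df.loc[observed, predictors], df.loc[observed, target])
        df.loc[~observed, target] = model.predict(df.loc[~observed, predictors])
    return df

# Example with assumed columns: 'income' is continuous, 'segment' is categorical.
df = pd.DataFrame({"age": [25, 32, 47, 51, 38, 29],
                   "tenure": [1, 4, 10, 12, 6, 2],
                   "income": [30.0, 41.0, np.nan, 78.0, 55.0, np.nan],
                   "segment": ["A", "A", "B", np.nan, "B", "A"]})
impute_column(df, "income", ["age", "tenure"])
impute_column(df, "segment", ["age", "tenure"])
print(df)
```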
170

Bayesian Cluster Analysis : Some Extensions to Non-standard Situations

Franzén, Jessica January 2008
The Bayesian approach to cluster analysis is presented. We assume that all data stem from a finite mixture model, where each component corresponds to one cluster and is given by a multivariate normal distribution with unknown mean and variance. The method produces posterior distributions of all cluster parameters and proportions, as well as associated cluster probabilities for all objects. We extend this method in several directions to some common but non-standard situations. The first extension covers the case of a few deviant observations that do not belong to one of the normal clusters. An extra component/cluster is created for them, which has a larger variance or a different distribution, e.g., a uniform distribution over the whole range. The second extension is clustering of longitudinal data. All units are clustered separately at each time point, and the movements between time points are modeled by Markov transition matrices, so the clustering at one time point is affected by what happens at the neighbouring time points. The third extension handles datasets with missing data, e.g., item non-response. We impute the missing values iteratively in an extra step of the Gibbs sampler estimation algorithm. The Bayesian inference of mixture models has many advantages over the classical approach, but it is not without computational difficulties. A software package for Bayesian inference of mixture models, written in Matlab, is introduced. The programs in the package handle the basic case of clustering data assumed to arise from mixtures of multivariate normal distributions, as well as these non-standard situations.
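
The extra Gibbs step mentioned in this abstract, drawing the missing coordinates of an observation from their conditional distribution given the currently assigned cluster's mean and covariance, has a closed form for multivariate normals. A minimal version is sketched below; the cluster parameters would come from the other steps of the sampler and are simply assumed here.

```python
# One Gibbs-sampler imputation step: given the cluster currently assigned to
# an observation, draw its missing entries from the conditional normal
# distribution of the missing block given the observed block.
import numpy as np

def draw_missing(x, mu, sigma, rng):
    """x has np.nan at missing positions; mu, sigma are the cluster parameters."""
    m = np.isnan(x)
    if not m.any():
        return x.copy()
    o = ~m
    # Conditional mean and covariance of x[m] given x[o] under N(mu, sigma).
    so_inv = np.linalg.inv(sigma[np.ix_(o, o)])
    cond_mean = mu[m] + sigma[np.ix_(m, o)] @ so_inv @ (x[o] - mu[o])
    cond_cov = sigma[np.ix_(m, m)] - sigma[np.ix_(m, o)] @ so_inv @ sigma[np.ix_(o, m)]
    out = x.copy()
    out[m] = rng.multivariate_normal(cond_mean, cond_cov)
    return out

rng = np.random.default_rng(3)
mu = np.array([0.0, 2.0, -1.0])
sigma = np.array([[1.0, 0.6, 0.2],
                  [0.6, 1.5, 0.4],
                  [0.2, 0.4, 0.8]])
x = np.array([0.3, np.nan, np.nan])
print(draw_missing(x, mu, sigma, rng))
```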
