Global ETD Search

11	A Comparison for Longitudinal Data Missing Due to Truncation Liu, Rong 01 January 2006 (has links) Many longitudinal clinical studies suffer from patient dropout. Often the dropout is nonignorable and the missing-data mechanism needs to be incorporated in the analysis. The methods for handling missing data make various assumptions about the missing-data mechanism, and their utility in practice depends on whether these assumptions apply in a specific application. Ramakrishnan and Wang (2005) proposed a method (MDT) to handle nonignorable missing data, where values are missing because the observations exceed an unobserved threshold. Assuming that the observations arise from a truncated normal distribution, they suggested an EM algorithm to simplify the estimation. In this dissertation the EM algorithm is implemented for the MDT method when the data may include missing at random (MAR) cases. A data set in which the missing data occur due to clinical deterioration and/or improvement is considered for illustration. The missing data occur at both ends of the truncated normal distribution. A simulation study is conducted to compare its performance with that of other relevant methods. The factors chosen for the simulation study included the missing data mechanisms, the forms of the response functions, missingness at one or two time points, dropout rates, sample sizes, and different correlations with an AR(1) structure. It was found that the choice of method for dealing with the missing data is important, especially when a large proportion is missing. The MDT method seems to perform best when there is reason to believe that the assumption of a truncated normal distribution is appropriate. A multiple imputation (MI) procedure under the MDT method, which accommodates the uncertainty introduced by imputation, is also proposed. The proposed method combines the MDT method with Rubin's (1987) MI method. A procedure to implement the MI method is described.
missing data data sets patient dropout algorithm simulation MDT Biostatistics Physical Sciences and Mathematics Statistics and Probability
12	Pixel Oriented Visualization in XmdvTool Patro, Anilkumar G 07 September 2004 (has links) "Many approaches to the visualization of multivariate data have been proposed to date. Pixel oriented techniques map each attribute value of the data to a single colored pixel, theoretically yielding the display of the maximum possible information at a time. A large number of pixel layout methods have been proposed, each of which enables users to perform their visual exploration tasks to varying degrees. Pixel oriented techniques typically maintain the global view of large amounts of data while still preserving the perception of small regions of interest, which makes them particularly interesting for visualizing very large multidimensional data sets. Pixel based methods also provide feedback on the given query by presenting not only the data items fulfilling the query but also the data that approximately fulfill the query. The goal of this thesis was to extend XmdvTool, a public domain multivariate data visualization package, to incorporate pixel based techniques and to explore their strengths and weaknesses. The main challenge here was to seamlessly apply the interaction and distortion techniques used in other visualization methods within XmdvTool to pixel based methods and investigate the capabilities made possible by fusing the various multivariate visualization techniques." visualizing large data sets exploratory multivariate visualization Visualization Data processing Database management
13	An Exploration into Synthetic Data and Generative Aversarial Networks Unknown Date (has links) This thesis surveys the landscape of Data Augmentation for image datasets. Completing this survey inspired further study into a method of generative modeling known as Generative Adversarial Networks (GANs). A survey on GANs was conducted to understand recent developments and the problems related to training them. Following this survey, four experiments were proposed to test the application of GANs for data augmentation and to contribute to the quality improvement in GAN-generated data. Experimental results demonstrate the effectiveness of GAN-generated data as a pre-training metric. The other experiments discuss important characteristics of GAN models such as the refining of prior information, transferring generative models from large datasets to small data, and automating the design of Deep Neural Networks within the context of the GAN framework. This thesis will provide readers with a complete introduction to Data Augmentation and Generative Adversarial Networks, as well as insights into the future of these techniques. / Includes bibliography. / Thesis (M.S.)--Florida Atlantic University, 2019. / FAU Electronic Theses and Dissertations Collection Neural networks (Computer science) Computer vision Images Generative adversarial networks Data sets
14	On the phylogenetic position of Myzostomida : can 77 genes get it wrong? Bleidorn, Christoph, Podsiadlowski, Lars, Zhong, Min, Eeckhaut, Igor, Hartmann, Stefanie, Halanych, Kenneth M., Tiedemann, Ralph January 2009 (has links) Background: Phylogenomic analyses have recently become popular for addressing questions about deep metazoan phylogeny. Ribosomal proteins (RP) dominate many of these analyses or are, in some cases, the only genes included. Despite initial hopes, phylogenomic analyses including tens to hundreds of genes still fail to robustly place many bilaterian taxa. Results: Using the phylogenetic position of myzostomids as an example, we show that phylogenies derived from RP genes and mitochondrial genes produce incongruent results. Whereas the former support a position within a clade of platyzoan taxa, the mitochondrial data recover an annelid affinity, which is strongly supported by the gene order data and is congruent with morphology. Using hypothesis testing, our RP data significantly reject the annelid affinity, whereas a platyzoan relationship is significantly rejected by the mitochondrial data. Conclusion: We conclude (i) that reliance on a set of markers belonging to a single class of macromolecular complexes might bias the analysis, and (ii) that concatenation of all available data might introduce conflicting signal into phylogenetic analyses. We therefore strongly recommend testing for data incongruence in phylogenomic analyses. Furthermore, judging all available data, we consider the annelid affinity hypothesis more plausible than a possible platyzoan affinity for myzostomids, and suspect long branch attraction is influencing the RP data. However, this hypothesis needs further confirmation by future analyses. Cirriferum myzostomida Mitochondrial genomes Transfer-rna Data sets Sequence Life sciences
15	Integration and validation of mass spectrometry proteomics data sets Prince, John Theodore, 1976- 25 January 2011 (has links) Mass spectrometry (MS) has been a key player in biological investigation for some time and is the instrument of choice for high throughput proteomics. However, the generation of large, inherently rich, proteomics data sets has far outpaced our ability to utilize them to produce biological knowledge. The ultimate utility of MS proteomics is closely tied to our ability to interpret, integrate and validate this voluminous data. By way of introduction, I discuss the creation of the Open Proteomics Database, which aims to increase publicly available data and to encourage broader contribution from the statistical and bioinformatic communities. Next, I detail research efforts in the integration of mass spectrometry data sets to increase the number of quantifiable peptides. Comparing peptide quantities between experiments (or subsequent chromatographic fractions) in large numbers requires the chromatographic alignment of MS signals, a challenging problem. We use Dynamic Time Warping (DTW) and a bijective (one-to-one) interpolant to create a smooth warp function amenable to multiple alignment. We test a wide variety of alignment scenarios coupled with high confidence, overlapping peptide identifications to optimize and compare alignment parameters. We determine an optimal spectral similarity function, show the importance of penalizing gaps in the alignment path, and demonstrate the utility of our algorithm for multiple alignments. Then, we introduce a method to independently validate large scale proteomics data sets. We use known biases in sample constitution including amino acid content, transmembrane sequence content, and protein abundance to estimate peptide false identification rates (FIRs) in what we term sample bias validation (SBV). 
We use SBV to compare the false identification rate accuracy (FIRA) and recall capabilities of widely used techniques for error estimation in MS-based proteomics. Finally, we describe the open source package mspire (mass spectrometry proteomics in Ruby). Mspire offers unified interfaces for working with a variety of file formats across the analytical pipeline, much-needed converters between key formats, and tools for FIR determination. The package eases the burden of working with MS proteomics data, lowering the barrier to entry for developers and offering useful tools to analysts of MS proteomics data. / text Mass spectrometry Proteomics data sets Peptides Alignment Mspire Mass spectrometry proteomics in Ruby
16	Estudo sobre annealing de traços de fissão em apatitas de diferentes composições químicas e em faces sem orientação cristalográfica preferencial / Study on the fission-track annealing in apatites with different chemical compositions and randomly oriented crystallographic faces Moreira, Pedro Augusto Franco Pinheiro 29 February 2008 (has links) Advisors: Pedro Jose Iunes, Julio Cesar Hadler Neto / Thesis (doctorate) - Universidade Estadual de Campinas, Instituto de Fisica Gleb Wataghin / Summary: This thesis presents a general study of fission-track annealing in apatite, aimed mainly at practical applications of fission-track thermochronology. To that end, a data set was obtained that allows field measurements to be made on faces without preferred crystallographic orientation, because this allows a larger number of fossil tracks to be considered in the "field measurements". The data set reflects the broad range of chlorine concentrations found in natural apatites by using the extremes of chlorine concentration (0.01 and 5%), and an effort was made to include Brazilian apatites. Another distinctive feature of the data set is that track densities were determined together with the length measurements. To choose the heat treatments for building the data set, a methodology was developed based on a statistical algorithm applied to kinetic equations fitted to annealing data established before the data presented in this work. Each sample in this annealing data set was subjected to two different chemical etchings: (1) 20 s in 5 M HNO3 at 20° C; and (2) 45 s in 1.5 M HNO3 at 20° C.
Thus, the effects of these two etchings were compared in samples that had undergone different heat treatments. The results indicated that the lengths of tracks shortened by annealing are not influenced by differences in the concentration of the standard etchants. The optimal etching times were established by means of three etching curves. From the interpretation of these curves, a kinetic model of chemical etching was developed that describes the data presented in this thesis well and rests on the same principles as the kinetic annealing model developed by the UNICAMP Chronology Group with the collaboration of this author. This annealing model was fitted to the data of Carlson et al. (1999), allowing them to be compared with the data presented here. In the present work the results were obtained on faces without preferred crystallographic orientation (where both track-in-track and tracks in fractures were measured), whereas the results of Carlson et al. (1999) were obtained on prismatic faces (where only track-in-track was measured). The points obtained in this work showed no systematic position relative to the curves fitted to the data of Carlson et al. (1999), but they did show a relatively large dispersion about their respective fits. This dispersion was attributed to etching anisotropy, taking into account that on faces without preferred crystallographic orientation more tracks can be confused with defects than on prismatic faces.
In general, the results indicate that measurements made on faces without preferred crystallographic orientation (with track-in-track and tracks in fractures) can be used in field measurements without significantly altering the thermal histories, provided that the reduced lengths are greater than 0.65 / Abstract: In this thesis, fission-track annealing in apatite was studied in a general way, mainly with a view to practical applications of fission-track thermochronology. A data set that allows the use of randomly oriented grains was produced, because a greater number of fossil tracks could then be considered in the "field measurements". Apatites with a broad chlorine spectrum (0.01 and 5%) in their compositions were used, and Brazilian apatites were included. Density measurements were made together with length measurements in this data set. The heat treatments used in this data set were chosen through a methodology based on a statistical algorithm. This algorithm was applied to kinetic equations fitted to another annealing data set established before the one presented in this work. The data set was produced with two different chemical etchings for each sample: (1) 20 s in 5 M HNO3 at 20° C and (2) 45 s in 1.5 M HNO3 at 20° C. Thus, it was possible to compare the effects of these two etchings in samples that underwent different heat treatments. The results indicated that annealing length data sets are not influenced by differences in the standard chemical concentration of these etchings. The optimal etching times were established through three etching curves. From the interpretation of these curves, a chemical etching kinetic model was developed that describes the data presented here well. This model is based on the same principles as the annealing kinetic model elaborated by the Group of Chronology with the collaboration of the author of this thesis.
This annealing model, fitted to the data set of Carlson et al. (1999), allowed the data presented here to be compared with that data set. The results of this thesis were obtained in randomly oriented grains, in which track-in-track and track-in-cleavage were measured, whereas the results of Carlson et al. (1999) were obtained on prismatic faces, on which only track-in-track was measured. The comparison showed no systematic trend in the position of the points obtained in this work. However, their dispersion relative to their respective fits is relatively large. This dispersion has been attributed to etching anisotropy, taking into account that tracks in randomly oriented grains are more easily confused with defects than those in prismatic faces. In general, the results show that measurements made in randomly oriented grains (with track-in-track and track-in-cleavage) may be used as field measurements without changing the thermal histories significantly, provided that the reduced lengths are greater than 0.65 / Doctorate / Nuclear Physics / Doctor of Science Fission track Apatite Annealing Data sets
17	Solving Arabic Math Word Problems via Deep Learning Alghamdi, Reem A. 14 November 2021 (has links) This thesis studies how to solve Arabic Math Word Problems (MWPs) automatically with deep learning models. An MWP is a text description of a mathematical problem, which is solved by deriving a math equation and computing the answer. Due to their strong learning capacity, deep learning based models can learn from the given problem description and generate the correct math equation for solving the problem. Effective models have been developed for solving MWPs in English and Chinese. However, Arabic MWPs are rarely studied. To initiate the study of Arabic MWPs, this thesis contributes the first large-scale dataset for Arabic MWPs, which contains 6,000 samples. Each sample is composed of an Arabic MWP description and the corresponding equation to solve this MWP. Arabic MWP solvers are then built with deep learning models and verified on this dataset for their effectiveness. In addition, a transfer learning model is built so that the high-resource Chinese MWP solver can improve the performance of the low-resource Arabic MWP solver. This work is the first to use deep learning methods to solve Arabic MWPs and the first to use transfer learning to solve MWPs across different languages. The solver enhanced by transfer learning has an accuracy of 74.15%, which is 3% higher than the baseline that does not use transfer learning. In addition, the accuracy is more than 7% higher than the baseline for templates represented by only a few samples. Furthermore, the model can generate new sequences that were not seen during training with an accuracy of 27% (11% higher than the baseline). math word problems deep learning transfer learning natural language processing data sets low-resource
18	An Algorithm for Generalized Principal Curves with Adaptive Topology in Complex Data Sets Balzuweit, Gerd, Der, Ralf, Herrmann, Michael, Welk, Martin 12 July 2019 (has links) Generalized principal curves are capable of representing complex data structures as they may have branching points or may consist of disconnected parts. For their construction using an unsupervised learning algorithm the templates need to be structurally adaptive. The present algorithm meets this goal by a combination of a competitive Hebbian learning scheme and a self-organizing map algorithm. Whereas the Hebbian scheme captures the main topological features of the data, in the map the neighborhood widths are automatically adjusted in order to suppress the noisy dimensions. It is noteworthy that the procedure which is natural in prestructured Kohonen nets could be carried over to a neural gas algorithm which does not use an initial connectivity. The principal curve is then given by an averaging procedure over the critical fluctuations of the map exploiting noise-induced phase transitions in the neural gas. algorithm, complex data sets info:eu-repo/classification/ddc/004 ddc:004
19	On the Structural Link Between Ontologies and Organised Data Sets Marinache, Alicia January 2016 (has links) The proposed work focuses on articulating a mathematical framework to capture the structure of an ontology and relate it to organised data sets. In the discussed framework, the ontology structure captures the mereological relationships between concepts. It also uses other relationships relevant to the considered domain of application. The organised dataset component of the framework is represented using diagonal-free cylindric algebra. The proposed framework, called the domain-information structure, enables us to link concepts to data sets through a number of typed data operators. The new framework enhances concurrent reasoning on data for knowledge generation, which is essential for handling big data. We illustrate the advantage of the obtained framework by using it in generating new knowledge from an ontology and a given data set. / Thesis / Master of Applied Science (MASc)
20	Boosting for Learning From Imbalanced, Multiclass Data Sets Abouelenien, Mohamed 12 1900 (has links) In many real-world applications, it is common to have an uneven number of examples among multiple classes. The data imbalance, however, usually complicates the learning process, especially for the minority classes, and results in deteriorated performance. Boosting methods were proposed to handle the imbalance problem. These methods need long training times and require diversity among the classifiers of the ensemble to achieve improved performance. Additionally, extending the boosting method to handle multi-class data sets is not straightforward. Examples of applications that suffer from imbalanced multi-class data can be found in face recognition, where tens of classes exist, and in capsule endoscopy, which suffers massive imbalance between the classes. This dissertation introduces RegBoost, a new boosting framework to address imbalanced, multi-class problems. This method applies a weighted stratified sampling technique and incorporates a regularization term that accommodates multi-class data sets and automatically determines the error bound of each base classifier. The regularization parameter penalizes the classifier when it misclassifies instances that were correctly classified in the previous iteration. The parameter additionally reduces the bias towards majority classes. Experiments are conducted using 12 diverse data sets with moderate to high imbalance ratios. The results demonstrate superior performance of the proposed method compared to several state-of-the-art algorithms for imbalanced, multi-class classification problems. More importantly, the sensitivity improvement of the minority classes using RegBoost is accompanied by an improvement in the overall accuracy for all classes. With unpredictability regularization, a diverse group of classifiers is created and the maximum accuracy improvement reaches above 24%.
Using stratified undersampling, RegBoost exhibits the best efficiency. The reduction in computational cost is significant, reaching above 50%. As the volume of training data increases, the efficiency gain of the proposed method becomes more significant. Boosting multi-class classifications stratified sampling regularization parameter imbalanced data sets

Search results