11 |
Application of a spatially referenced water quality model to predict E. coli flux in two Texas river basins. Deepti, 15 May 2009 (has links)
Water quality models are applied to assess the various processes affecting contaminant concentrations in a watershed. SPAtially Referenced Regression On Watershed attributes (SPARROW) is a nonlinear regression-based approach to predicting the fate and transport of contaminants in river basins. In this research, SPARROW was applied to the Guadalupe and San Antonio River Basins of Texas to assess E. coli contamination. Because SPARROW relies on measured contaminant concentrations collected at monitoring stations, the effect of the location and selection of the monitoring stations was analyzed. The results of the SPARROW application were studied in detail to evaluate the contributions from the statistically significant sources, and were verified against the 2000 Clean Water Act Section 303(d) list. Further, a methodology for maintaining monitoring records in the highly contaminated areas of the watersheds was explored using a genetic algorithm. The importance of the available scale and detail of the explanatory variables (sources, land-to-water delivery, and reservoir/stream attenuation factors) in predicting water quality processes was also analyzed, and the effect of uncertainty in the monitored records on the SPARROW application was discussed. The combined application of SPARROW and the genetic algorithm was explored to design a monitoring network for the study area. The results of this study show that the SPARROW model can be used successfully to predict pathogen contamination of rivers and to design monitoring networks for the basins.
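To make the general model form concrete, the following sketch fits a toy SPARROW-style nonlinear regression (source terms scaled by land-to-water delivery and first-order in-stream attenuation) to synthetic data; the variables, coefficients, and network size are invented for illustration and are not the Guadalupe/San Antonio inputs.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
n = 120   # hypothetical monitored reaches (not the actual Texas network)

septic = rng.uniform(0, 5, n)      # source variable 1 (made up)
livestock = rng.uniform(0, 8, n)   # source variable 2 (made up)
rainfall = rng.uniform(0, 1, n)    # land-to-water delivery variable
travel = rng.uniform(0, 3, n)      # in-stream travel time

def predicted_load(theta):
    b1, b2, a, k = theta
    source = b1 * septic + b2 * livestock   # contaminant mass from each source
    delivery = np.exp(a * rainfall)         # land-to-water delivery scaling
    decay = np.exp(-k * travel)             # first-order in-stream attenuation
    return source * delivery * decay

true_theta = np.array([2.0, 1.5, 0.8, 0.4])
observed = predicted_load(true_theta) * rng.lognormal(0.0, 0.2, n)  # noisy "monitoring" loads

fit = least_squares(
    lambda th: np.log(predicted_load(th)) - np.log(observed),  # log-space residuals
    x0=[1.0, 1.0, 0.1, 0.1],
    bounds=([1e-3, 1e-3, -5.0, 0.0], [20.0, 20.0, 5.0, 5.0]),
)
print("estimated coefficients:", np.round(fit.x, 2))
```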
|
12 |
Cross-Validation for Model Selection in Model-Based Clustering. O'Reilly, Rachel, 04 September 2012 (has links)
Clustering is a technique used to partition unlabelled data into meaningful groups. This thesis will focus on the area of clustering called model-based clustering, where it is assumed that data arise from a finite number of subpopulations, each of which follows a known statistical distribution. The number of groups and the shape of each group are unknown in advance, and thus one of the most challenging aspects of clustering is selecting these features.
Cross-validation is a model selection technique which is often used in regression and classification, because it tends to choose models that predict well, and are not over-fit to the data. However, it has rarely been applied in a clustering framework. Herein, cross-validation is applied to select the number of groups and covariance structure within a family of Gaussian mixture models. Results are presented for both real and simulated data. / Ontario Graduate Scholarship Program
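As a rough sketch of the idea, held-out log-likelihood from k-fold cross-validation can be used to pick both the number of components and the covariance structure of a Gaussian mixture; scikit-learn's four covariance types below stand in for the richer model family considered in the thesis, and the iris data is just a convenient stand-in.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

X = load_iris().data
cv = KFold(n_splits=5, shuffle=True, random_state=0)

best = None
for g in range(1, 6):                                  # candidate number of groups
    for cov in ("full", "tied", "diag", "spherical"):  # candidate covariance structures
        scores = []
        for train_idx, test_idx in cv.split(X):
            gmm = GaussianMixture(n_components=g, covariance_type=cov,
                                  random_state=0).fit(X[train_idx])
            scores.append(gmm.score(X[test_idx]))      # held-out average log-likelihood
        mean_score = np.mean(scores)
        if best is None or mean_score > best[0]:
            best = (mean_score, g, cov)

print("selected (score, n_groups, covariance):", best)
```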
|
13 |
Using phylogenetics and model selection to investigate the evolution of RNA genes in genomic alignments. Allen, James, January 2013 (has links)
The diversity and range of the biological functions of non-coding RNA molecules (ncRNA) have only recently been realised, and phylogenetic analysis of the RNA genes that define these molecules can provide important insights into the evolutionary pressures acting on RNA genes, and can lead to a better understanding of the structure and function of ncRNA. An appropriate dataset is fundamental to any evolutionary analysis, and because existing RNA alignments are unsuitable, I describe a software pipeline to derive RNA gene datasets from genomic alignments. RNA gene prediction software has not previously been evaluated on such sets of known RNA genes, and I find that two popular methods fail to predict the genes in approximately half of the alignments. In addition, high numbers of predictions are made in flanking regions that lack RNA genes, and these results provide motivation for subsequent phylogenetic analyses, because a better understanding of RNA gene evolution should lead to improved methods of prediction. I analyse the RNA gene alignments with a range of evolutionary models of substitution and examine which models best describe the changes evident in the alignments. The best models are expected to provide more accurate trees, and their properties can also shed light on the evolutionary processes that occur in RNA genes. Comparing DNA and RNA substitution models is non-trivial, however, because they describe changes between two different types of state, so I present a proof that allows models with different state spaces to be compared in a statistically valid manner. I find that a large proportion of RNA genes are well described by a single RNA model that includes parameters describing both nucleotides and RNA structure, highlighting the multiple levels of constraint that act on the genes. The choice of model affects the inference of a phylogenetic tree, suggesting that model selection, with RNA models, should be standard practice for analysis of RNA genes.
|
14 |
Problems in generalized linear model selection and predictive evaluation for binary outcomes. Ten Eyck, Patrick, 15 December 2015 (has links)
This manuscript consists of three papers which formulate novel generalized linear model methodologies.
In Chapter 1, we introduce a variant of the traditional concordance statistic that is associated with logistic regression. This adjusted c-statistic, as we call it, utilizes the differences in predicted probabilities as weights for each event/non-event observation pair. We highlight an extensive comparison of the adjusted and traditional c-statistics using simulations and apply these measures in a modeling application.
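The sketch below shows one plausible reading of such a weighted concordance measure, with the absolute difference in predicted probabilities weighting each event/non-event pair; the exact definition used in the manuscript may differ.

```python
import numpy as np

def adjusted_c_statistic(y, p):
    """Weighted concordance: each event/non-event pair is weighted by the
    absolute difference of its predicted probabilities (one plausible reading)."""
    y, p = np.asarray(y), np.asarray(p)
    diff = p[y == 1][:, None] - p[y == 0][None, :]   # all event/non-event pairs
    weights = np.abs(diff)
    if weights.sum() == 0:
        return 0.5
    concordant = (diff > 0).astype(float)            # event ranked above non-event
    return float((concordant * weights).sum() / weights.sum())

# toy comparison against the ordinary c-statistic
y = np.array([1, 1, 0, 0, 1, 0])
p = np.array([0.90, 0.60, 0.40, 0.20, 0.30, 0.35])
ordinary_c = (p[y == 1][:, None] > p[y == 0][None, :]).mean()
print(f"ordinary c: {ordinary_c:.3f}  adjusted c: {adjusted_c_statistic(y, p):.3f}")
```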
In Chapter 2, we feature the development and investigation of three model selection criteria based on cross-validatory c-statistics: Model Misspecification Prediction Error, Fitting Sample Prediction Error, and Sum of Prediction Errors. We examine the properties of the corresponding selection criteria based on the cross-validatory analogues of the traditional and adjusted c-statistics via simulation and illustrate these criteria in a modeling application.
In Chapter 3, we propose and investigate an alternate approach to pseudo-likelihood model selection in the generalized linear mixed model framework. After outlining the problem with the pseudo-likelihood model selection criteria found using the natural approach to generalized linear mixed modeling, we feature an alternate approach, implemented using a SAS macro, that obtains and applies the pseudo-data from the full model for fitting all candidate models. We justify the propriety of the resulting pseudo-likelihood selection criteria using simulations and implement this new method in a modeling application.
|
15 |
On the Model Selection in a Frailty Setting. Lundell, Jill F., 01 May 1998 (has links)
When analyzing data in a survival setting, whether of people or objects, one of the assumptions made is that the population is homogeneous. This is not true in reality, and certain adjustments can be made in the model to account for heterogeneity. Frailty is one method of dealing with some of this heterogeneity. Frailty cannot be measured directly, and hence it can be very difficult to determine which frailty model is appropriate for the data of interest. This thesis investigates the effectiveness of three model selection methods at determining which frailty distribution best describes a given set of data. The model selection methods used are the Bayes factor, neural networks, and classification trees. Results favored classification trees. Very poor results were observed with neural networks.
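As an illustration of the tree-based approach, the hedged sketch below simulates clustered survival times under two candidate frailty distributions, summarizes each simulated dataset with a few crude statistics, and trains a classification tree to choose between the candidates; the distributions, features, and settings are assumptions rather than the thesis's actual design.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def simulate(frailty, n_groups=50, group_size=5):
    """Clustered survival times with a shared group-level frailty (toy setup)."""
    if frailty == "gamma":
        w = rng.gamma(shape=2.0, scale=0.5, size=n_groups)
    else:  # "lognormal"
        w = rng.lognormal(mean=0.0, sigma=0.7, size=n_groups)
    # exponential event times whose hazard is scaled by the group's frailty
    times = rng.exponential(scale=1.0 / np.repeat(w, group_size))
    return times.reshape(n_groups, group_size)

def features(t):
    """Crude summaries of between-group and overall variability."""
    group_means = t.mean(axis=1)
    return [group_means.std() / group_means.mean(),
            t.std() / t.mean(),
            np.median(t)]

X, y = [], []
for label in ("gamma", "lognormal"):
    for _ in range(200):               # simulated training datasets per candidate
        X.append(features(simulate(label)))
        y.append(label)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
print("predicted frailty family:", tree.predict([features(simulate("gamma"))])[0])
```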
|
16 |
Decompressing the Mental Number Line. Young, Christopher John, 28 September 2009 (has links)
No description available.
|
17 |
Outlier Detection in Gaussian Mixture Models. Clark, Katharine, January 2020 (has links)
Unsupervised classification is a problem often plagued by outliers, yet there is a paucity of work on handling them in this setting. Mixtures of Gaussian distributions are a popular choice in model-based clustering. A single outlier can affect parameter estimation and, as such, must be accounted for; this issue is further complicated by the presence of multiple outliers. Predicting the proportion of outliers correctly is paramount, as it minimizes misclassification error. It is proved that, for a finite Gaussian mixture model, the log-likelihoods of the subset models are distributed according to a mixture of beta-type distributions. This relationship is leveraged in two ways. First, an algorithm is proposed that predicts the proportion of outliers by measuring the adherence of a set of subset log-likelihoods to a beta-type mixture reference distribution. This algorithm removes the least likely points, which are deemed outliers, until model assumptions are met. Second, a hypothesis test is developed which, at a chosen significance level, can test whether a dataset contains a single outlier. / Thesis / Master of Science (MSc)
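A stripped-down caricature of the trimming step is sketched below: a Gaussian mixture is refit repeatedly and the lowest-density point removed each time. The trim count is fixed here for simplicity; the thesis's contribution, the beta-type reference distribution that determines how many points to remove, is not reproduced.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(6, 1, (100, 2)),
               rng.uniform(-10, 15, (5, 2))])   # 5 scattered outliers

n_trim = 5   # in the thesis this count is inferred from the beta-type reference
X_work = X.copy()
removed = []
for _ in range(n_trim):
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X_work)
    dens = gmm.score_samples(X_work)            # per-point log-density
    worst = int(np.argmin(dens))                # least likely point = candidate outlier
    removed.append(X_work[worst])
    X_work = np.delete(X_work, worst, axis=0)

print(f"trimmed {len(removed)} candidate outliers; {X_work.shape[0]} points remain")
```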
|
18 |
Detection of Latent Heteroscedasticity and Group-Based Regression Effects in Linear Models via Bayesian Model Selection. Metzger, Thomas Anthony, 22 August 2019 (has links)
Standard linear modeling approaches make potentially simplistic assumptions regarding the structure of categorical effects that may obfuscate more complex relationships governing data. For example, recent work focused on the two-way unreplicated layout has shown that hidden groupings among the levels of one categorical predictor frequently interact with the ungrouped factor. We extend the notion of a "latent grouping factor" to linear models in general. The proposed work allows researchers to determine whether an apparent grouping of the levels of a categorical predictor reveals a plausible hidden structure given the observed data. Specifically, we offer Bayesian model selection-based approaches to reveal latent group-based heteroscedasticity, regression effects, and/or interactions. Failure to account for such structures can produce misleading conclusions. Since the presence of latent group structures is frequently unknown a priori to the researcher, we use fractional Bayes factor methods and mixture g-priors to overcome the lack of prior information. We provide an R package, slgf, that implements our methodology and demonstrate its usage in practice. / Doctor of Philosophy / Statistical models are a powerful tool for describing a broad range of phenomena in our world. However, many common statistical models may make assumptions that are overly simplistic and fail to account for key trends and patterns in data. Specifically, we search for hidden structures formed by partitioning a dataset into two groups. These two groups may have distinct variability, statistical effects, or other hidden effects that are missed by conventional approaches. We illustrate the ability of our method to detect these patterns through a variety of disciplines and data layouts, and provide software for researchers to implement this approach in practice.
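To give a flavour of the model comparison, the sketch below contrasts a homoscedastic linear model with one allowing group-specific error variances on simulated data, using BIC as a crude stand-in for the fractional Bayes factors and mixture g-priors developed in the dissertation; the slgf package's actual interface is not used.

```python
import numpy as np

rng = np.random.default_rng(2)

# toy two-group data: same mean structure, but group B is noisier
n = 40
group = np.array([0] * n + [1] * n)          # hypothesized latent grouping
x = rng.uniform(0, 10, 2 * n)
y = 1.0 + 0.5 * x + rng.normal(0, np.where(group == 0, 0.5, 2.0))

# common regression fit (intercept + slope)
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# model 1: homoscedastic errors (2 regression params + 1 variance)
sigma2 = resid.var()
loglik_homo = -0.5 * resid.size * (np.log(2 * np.pi * sigma2) + 1)
bic_homo = -2 * loglik_homo + 3 * np.log(resid.size)

# model 2: group-specific error variances (2 regression params + 2 variances)
loglik_het = 0.0
for g in (0, 1):
    r = resid[group == g]
    loglik_het += -0.5 * r.size * (np.log(2 * np.pi * r.var()) + 1)
bic_het = -2 * loglik_het + 4 * np.log(resid.size)

print(f"BIC homoscedastic: {bic_homo:.1f}  BIC group variances: {bic_het:.1f}")
```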
|
19 |
Unsupervised Signal Deconvolution for Multiscale Characterization of Tissue Heterogeneity. Wang, Niya, 29 June 2015 (has links)
Characterizing complex tissues requires precise identification of distinctive cell types, cell-specific signatures, and subpopulation proportions. Tissue heterogeneity, arising from multiple cell types, is a major confounding factor in studying individual subpopulations and repopulation dynamics, and it cannot be resolved directly by most global molecular and genomic profiling methods. While signal deconvolution has widespread applications in many real-world problems, there are significant limitations associated with existing methods, mainly unrealistic assumptions and heuristics, leading to inaccurate or incorrect results. In this study, we formulate the signal deconvolution task as a blind source separation problem and develop novel unsupervised deconvolution methods within the Convex Analysis of Mixtures (CAM) framework for characterizing multi-scale tissue heterogeneity. We also exploratorily test the application of the Significant Intercellular Genomic Heterogeneity (SIGH) method.
Unlike existing deconvolution methods, CAM can identify tissue-specific markers directly from mixed signals, a critical task, without relying on any prior knowledge. Fundamental to the success of our approach is a geometric exploitation of tissue-specific markers and signal non-negativity. Using a well-grounded mathematical framework, we have proved new theorems showing that the scatter simplex of mixed signals is a rotated and compressed version of the scatter simplex of pure signals and that the resident markers at the vertices of the scatter simplex are the tissue-specific markers. The algorithm works by geometrically locating the vertices of the scatter simplex of measured signals and their resident markers. The minimum description length (MDL) criterion is applied to determine the number of tissue populations in the sample. Based on the CAM principle, we integrated nonnegative independent component analysis (nICA) and convex matrix factorization (CMF) methods, developed the CAM-nICA/CMF algorithms, and applied them to multiple gene expression, methylation, and protein datasets, achieving very promising results validated by the ground truth or gene enrichment analysis. We integrated CAM with compartment modeling (CM) and developed the multi-tissue compartment modeling (MTCM) algorithm, which we tested on real DCE-MRI data derived from mouse models with consistent and plausible results. We also developed an open-source R-Java software package that implements various CAM-based algorithms, including an R package approved by Bioconductor specifically for tumor-stroma deconvolution.
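The geometric idea can be illustrated with a toy example: mix a few pure tissue signatures that include known single-tissue markers, normalize each gene's expression vector across samples, and recover the markers as vertices of the resulting scatter simplex. The sketch below does this with made-up data; it is not the CAM algorithm itself, which additionally handles noise, clusters the gene vectors, and selects the number of tissues via MDL.

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(1)
n_samples, n_tissues, n_genes = 3, 3, 200

# made-up tissue signatures; the first three genes are pure single-tissue markers
S = rng.uniform(1, 10, size=(n_tissues, n_genes))
S[:, :3] = 50.0 * np.eye(n_tissues)

A = rng.dirichlet(np.ones(n_tissues), size=n_samples)   # per-sample mixing proportions
X = A @ S                                               # mixed expression (samples x genes)

# normalize each gene's expression vector across samples; genes then lie in a
# simplex whose vertices are occupied by the single-tissue markers
P = X / X.sum(axis=0, keepdims=True)

hull = ConvexHull(P[:2].T)   # points lie on a 2-simplex, so two coordinates suffice
print("candidate marker genes (simplex vertices):", sorted(hull.vertices))
```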
While intercellular heterogeneity is often manifested by multiple clones with distinct sequences, systematic efforts to characterize intercellular genomic heterogeneity must effectively distinguish significant genuine clonal sequences from probabilistic fake derivatives. Based on the preliminary studies originally targeting immune T-cells, we tested and applied the SIGH algorithm to characterize intercellular heterogeneity directly from mixed sequencing reads. SIGH works by exploiting the statistical differences in both the sequencing error rates at different nucleobases and the read counts of fake sequences in relation to genuine clones of variable abundance. / Ph. D.
|
20 |
A model generalization study in localizing indoor cows with cow localization (COLO) dataset. Das, Mautushi, 10 July 2024 (has links)
Precision livestock farming increasingly relies on advanced object localization techniques to monitor livestock health and optimize resource management. In recent years, computer vision-based methods have been widely used for animal localization. However, certain challenges still make the task difficult, such as the scarcity of data for model fine-tuning and the inability to generalize models effectively. To address these challenges, we introduce COLO (COw LOcalization), a publicly available dataset comprising localization data for Jersey and Holstein cows under various lighting conditions and camera angles. We evaluate the performance and generalization capabilities of YOLOv8 and YOLOv9 model variants using this dataset.
Our analysis assesses model robustness across different lighting and viewpoint configurations and explores the trade-off between model complexity, defined by the number of learnable parameters, and performance. Our findings indicate that camera viewpoint angle is the most critical factor for model training, surpassing the influence of lighting conditions. Higher model complexity does not necessarily guarantee better results; rather, performance is contingent on specific data and task requirements. For our dataset, medium complexity models generally outperformed both simpler and more complex models.
Additionally, we evaluate the performance of fine-tuned models across various pretrained weight initializations. The results demonstrate that as the number of training samples increases, the advantage of weight initialization diminishes. This suggests that for large datasets, it may not be necessary to invest extra effort in fine-tuning models with custom weight initialization.
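A minimal sketch of such a fine-tuning comparison with the ultralytics API is shown below, contrasting a pretrained initialization with training from scratch; the dataset configuration file, epoch count, and image size are placeholders, not the actual COLO setup.

```python
from ultralytics import YOLO

# hypothetical dataset config pointing at COLO-style train/val splits
DATA = "colo.yaml"   # paths and class names (e.g. "cow") -- assumed file

for weights in ("yolov8m.pt", "yolov8m.yaml"):   # pretrained vs. from-scratch init
    model = YOLO(weights)
    model.train(data=DATA, epochs=50, imgsz=640)  # placeholder training settings
    metrics = model.val()
    print(weights, metrics.box.map50)             # mAP@0.5 on the validation split
```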
In summary, our study provides comprehensive insights for animal and dairy scientists choosing the optimal model for cow localization, considering factors such as lighting, camera angles, model parameters, dataset size, and weight initialization criteria. These findings contribute to the field of precision livestock farming by enhancing the accuracy and efficiency of cow localization technology. The COLO dataset, introduced in this study, serves as a valuable resource for the research community, enabling further advancements in object detection models for precision livestock farming. / Master of Science / Cow localization is important for many reasons. Farmers want to monitor cows to understand their behavior, count cows in a scene, and track activities such as eating and grazing. Popular technologies such as GPS or other tracking devices must be worn by the cows in the form of collars, ear tags, etc. This requires manually fitting a device to each cow, which is labor-intensive and costly since each cow needs its own device.
In contrast, computer vision-based methods need only one camera to effectively track and monitor cows. We can use deep learning models and a camera to detect cows in a scene. This method is cost-effective and does not require strict maintenance.
However, this approach still has challenges. Deep learning models need a large amount of data to train, and there is a lack of annotated data in our community. Data collection and preparation for model training require human labor and technical skills. Additionally, to make the model robust, it needs to be adjusted effectively, a process called model generalization.
Our work addresses these challenges with two main contributions. First, we introduce a new dataset called COLO (COw LOcalization). This dataset consists of over 1,000 annotated images of Holstein and Jersey cows. Anyone can use this data to train their models. Second, we demonstrate how to generalize models. This model generalization method is not only applicable for cow localization but can also be adapted for other purposes whenever deep learning models are used.
In numbers, we found that the YOLOv8m model is the optimal model for cow localization using our dataset. Additionally, we discovered that camera angle is a crucial factor for model generalization. This means that where we place the camera on the farm is important for getting accurate predictions. We found that top angles (placing the camera above) provide better accuracy.
|