1 |
Contrasting sequence groups by emerging sequences / Deng, Kang / 11 1900 (has links)
Group comparison is a fundamental task in many scientific endeavours and is also the basis of any classifier; comparing groups of sequence data is no exception. To contrast sequence groups, we define Emerging Sequences (ESs) as subsequences that are frequent in sequences of one group and less frequent in another, and that thus distinguish sequences of different classes.
Distinguishing sequence classes by ESs poses two challenges: extracting ESs efficiently is not trivial, and only exact matches of subsequences are considered. In our work we address these problems with a suffix tree-based framework and a sliding-window matching mechanism. A classification model based on ESs is also proposed.
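The contrast at the heart of ESs can be sketched in a few lines of Python. This toy version treats candidate patterns as contiguous substrings rather than gapped subsequences, and uses plain support thresholds (`min_sup` and `max_sup` are illustrative parameters, not the thesis's); the suffix-tree framework and sliding-window matching described in the abstract are what make the real extraction efficient and tolerant of inexact matches.

```python
def support(sub, sequences):
    """Fraction of sequences containing sub as a substring."""
    return sum(sub in s for s in sequences) / len(sequences)

def emerging_sequences(pos, neg, min_sup=0.5, max_sup=0.25, max_len=3):
    """Naive ES extraction: enumerate substrings of the positive group
    and keep those frequent in pos but rare in neg."""
    candidates = {s[i:i + k] for s in pos
                  for k in range(1, max_len + 1)
                  for i in range(len(s) - k + 1)}
    return sorted(c for c in candidates
                  if support(c, pos) >= min_sup and support(c, neg) <= max_sup)
```

For example, with `pos = ["abab", "abba", "abc"]` and `neg = ["bbb", "bcb", "cbb"]`, the patterns `"a"`, `"ab"`, and `"ba"` emerge as discriminative, while `"b"` is filtered out because it is frequent in both groups.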
Evaluated against several other learning algorithms on two datasets, our classification model based on similar ESs outperforms the baseline approaches. With the ESs' high discriminative power, the proposed model achieves satisfactory F-measures when classifying sequences.
|
2 |
Contrasting sequence groups by emerging sequences / Deng, Kang / Unknown Date
No description available.
|
3 |
Multi-Regional Analysis of Contact Maps for Protein Structure Prediction / Ahmed, Hazem Radwan A. / 24 April 2009 (has links)
1D protein sequences, 2D contact maps and 3D structures are three different
representational levels of detail for proteins. Predicting protein 3D
structures from their 1D sequences remains one of the complex challenges of
bioinformatics. The "Divide and Conquer" principle is applied in our
research to handle this challenge, by dividing it into two separate yet
dependent subproblems, using a Case-Based Reasoning (CBR) approach. Firstly,
2D contact maps are predicted from their 1D protein sequences; secondly, 3D
protein structures are then predicted from their predicted 2D contact maps.
We focus on the problem of identifying common substructural patterns of
protein contact maps, which could potentially be used as building blocks for
a bottom-up approach to protein structure prediction. We further
demonstrate how the identification of these patterns can be improved by
combining protein sequence and structural information. We assess the consistency and
the efficiency of identifying common substructural patterns by conducting
statistical analyses on several subsets of the experimental results with
different sequence and structural information. / Thesis (Master, Computing) -- Queen's University, 2009-04-23 22:01:04.528
|
4 |
Exploring Frameworks for Rapid Visualization of Viral Proteins Common for a Given Host / Subramaniam, Rajesh / January 2019 (has links)
Viruses are unique organisms that lack protein machinery necessary for their propagation (such as a polymerase) yet possess other proteins that facilitate it (such as host-cell anchoring proteins). This study explores seven different frameworks to assist rapid visualization of proteins that are common to viruses residing in a given host. The proposed frameworks rely only on protein sequence information. The sequence similarity-based framework with an associated profile hidden Markov model was found to be a better tool for visualizing proteins common to a given host than the other proposed frameworks, which are based only on amino acid composition or other amino acid properties. The lack of known profile hidden Markov models for many protein structures limits the utility of the proposed sequence similarity-based framework. The study concludes with an attempt to extrapolate the framework to predict viruses that may pose potential human health risks.
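A profile hidden Markov model itself is beyond a short sketch, but the underlying idea of scoring a sequence against a position-specific profile can be illustrated as follows. This is a crude log-odds profile with pseudocounts against a uniform background, a much simpler stand-in for the profile HMMs the study relies on; the function names and the pseudocount value are illustrative assumptions, not from the thesis.

```python
import math

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def profile_log_odds(alignment, pseudocount=1.0):
    """Per-column log-odds scores from an ungapped alignment,
    relative to a uniform background distribution."""
    bg = 1.0 / len(AMINO)
    profile = []
    for col in zip(*alignment):  # iterate over alignment columns
        total = len(col) + pseudocount * len(AMINO)
        profile.append({a: math.log(((col.count(a) + pseudocount) / total) / bg)
                        for a in AMINO})
    return profile

def profile_score(profile, seq):
    """Sum of per-position log-odds for a sequence of matching length."""
    return sum(col[ch] for col, ch in zip(profile, seq))
```

A sequence resembling the alignment scores higher than an unrelated one, which is the basic signal a similarity-based framework exploits; real profile HMMs add insert and delete states on top of this.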
|
5 |
Selection of antigens for antibody-based proteomics / Berglund, Lisa / January 2008 (has links)
The human genome is predicted to contain ~20,500 protein-coding genes. The encoded proteins are the key players in the body, but the functions and localizations of most proteins are still unknown. Antibody-based proteomics has great potential for exploration of the protein complement of the human genome, but antibodies exist for only a very limited set of proteins. The Human Proteome Resource (HPR) project was launched in August 2003, with the aim of generating high-quality specific antibodies towards the human proteome, and of using these antibodies for large-scale protein profiling in human tissues and cells. The goal of the work presented in this thesis was to evaluate whether antigens can be selected, in a high-throughput manner, to enable generation of specific antibodies towards one protein from every human gene. A computationally intensive analysis of potential epitopes in the human proteome was performed and showed that it should be possible to find unique epitopes for most human proteins. The result from this analysis was implemented in a new web-based visualization tool for antigen selection. Predicted protein features important for antigen selection, such as transmembrane regions and signal peptides, are also displayed in the tool. The antigens used in HPR are named protein epitope signature tags (PrESTs). A genome-wide analysis combining different protein features revealed that it should be possible to select unique PrESTs, 50 amino acids in length, for ~80% of the human protein-coding genes. The PrESTs are transferred from the computer to the laboratory by design of PrEST-specific PCR primers. A study of the success rate in PCR cloning of the selected fragments demonstrated the importance of controlled GC-content in the primers for specific amplification. The PrEST protein is produced in bacteria and used for immunization and subsequent affinity purification of the resulting sera to generate mono-specific antibodies.
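The "controlled GC-content" criterion mentioned for primer design is easy to state in code. This small helper is illustrative only, not part of the HPR pipeline, and the 40-60% window is a common rule of thumb rather than a value from the thesis.

```python
def gc_content(primer):
    """Fraction of G and C bases in a primer sequence."""
    p = primer.upper()
    return (p.count("G") + p.count("C")) / len(p)

def within_gc_window(primer, low=0.4, high=0.6):
    """Keep only candidate primers inside a typical GC window
    (bounds are illustrative, not from the HPR study)."""
    return low <= gc_content(primer) <= high
```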
The antibodies are tested for specificity and approved antibodies are used for tissue profiling in normal and cancer tissues. A large-scale analysis of the success rates for different PrESTs in the experimental pipeline of the HPR project showed that the total success rate from PrEST selection to an approved antibody is 31%, and that this rate is dependent on PrEST length. A second PrEST on a target protein is somewhat less likely to succeed in the HPR pipeline if the first PrEST is unsuccessful, but the analysis shows that it is valuable to select several PrESTs for each protein, to enable generation of at least two antibodies, which can be used to validate each other. / QC 20100705
|
6 |
MICROBIAL GLYCOSIDE HYDROLASE MEDIATED MODIFICATION OF HOST CELL SURFACE GLYCANS / Pasupathi, Aarthi / January 2023 (has links)
All cells and extracellular matrices of prokaryotes and eukaryotes are made up of glycans, the carbohydrate macromolecules that play a predominant role in cell-to-cell interaction, protection, stabilization, and barrier functions. Glycans are also central to human microbiome-host interactions, where bacterial glycans are recognized by innate immune signaling pathways and host mucins are a major nutrient source for various gut bacteria. Many microorganisms encode glycoside hydrolases (GHs) to utilize the available host cell surface glycans as a nutrient source and to modulate host protein function. GHs are divided into families with conserved linkage specificity within each family, yet individual family members can be specific for dramatically divergent macromolecular substrates. In general, within a given GH family very few members have been biochemically characterized, and the substrate specificity is poorly understood. GH genes are abundant in the human gut microbiome, and culture-enriched metagenomics has identified more than 10,000 distinct bacterial GH genes in an individual. The focus of this thesis is the endo-β-N-acetylglucosaminidases (ENGases) encoded by the GH18 and GH85 families. Bioinformatic analysis shows that the predicted proteins within each of these GH families fall into separate clusters in the family's Sequence Similarity Network (SSN). The hypothesis of this project is that human microbiome-encoded ENGases from the same GH family differ in their substrate specificities and that, within the SSN of the same GH family, enzymes with similar substrate specificity may fall in the same cluster. In this work, I established conditions for overexpression of GH18 and GH85 proteins and investigated the activity of these enzymes on various substrates. / Thesis / Master of Science (MSc) / All the cell surfaces of animals, plants, and microbes are coated with sugars, also known as glycans.
These sugars on the cell surface act as a barrier and protect cells from the external environment. Glycans on the cells of both microbes and humans are essential for basic interactions between them. Many bacteria produce enzymes such as glycoside hydrolases to obtain nutrients from dietary sugars and to alter the sugars on host proteins. There are various families of these enzymes, and they act on specific sugars and cleavage sites. The substrate specificities and characterization of these enzymes from most bacteria found in the human microbiome have not been studied in detail. My work focuses on developing standard enzyme assays for determining substrate specificities. This tool can be used to reshape glycans and understand their role in cell processes.
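The Sequence Similarity Networks used in this work can be sketched with a toy clustering: sequences are nodes, edges join pairs whose similarity exceeds a cutoff, and clusters are the connected components. In this sketch, difflib's match ratio stands in for the alignment-based scores (e.g. BLAST bit scores) that real SSNs are built on, and the threshold is an illustrative assumption.

```python
from difflib import SequenceMatcher

def ssn_clusters(seqs, threshold=0.7):
    """Toy Sequence Similarity Network: connect pairs above the similarity
    cutoff, then return connected components via union-find."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if SequenceMatcher(None, seqs[i], seqs[j]).ratio() >= threshold:
                parent[find(i)] = find(j)  # union the two components

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

With four short sequences, two A-rich and two G-rich, the network splits into two clusters, mirroring how ENGases of differing specificity are hypothesized to separate within a family's SSN.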
|
7 |
High Performance and Scalable Matching and Assembly of Biological Sequences / Abu Doleh, Anas / 21 December 2016
No description available.
|
8 |
Multiple hypothesis testing and multiple outlier identification methods / Yin, Yaling / 13 April 2010
Traditional multiple hypothesis testing procedures, such as that of Benjamini and Hochberg, fix an error rate and determine the corresponding rejection region. In 2002 Storey proposed a fixed rejection region procedure and showed numerically that it can gain more power than the fixed error rate procedure of Benjamini and Hochberg while controlling the same false discovery rate (FDR). In this thesis it is proved that when the number of alternatives is small compared to the total number of hypotheses, Storey's method can be less powerful than that of Benjamini and Hochberg. Moreover, the two procedures are compared by setting them to produce the same FDR. The difference in power between Storey's procedure and that of Benjamini and Hochberg is near zero when the distance between the null and alternative distributions is large, but Benjamini and Hochberg's procedure becomes more powerful as the distance decreases. It is shown that modifying the Benjamini and Hochberg procedure to incorporate an estimate of the proportion of true null hypotheses, as proposed by Black, gives a procedure with superior power.
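The Benjamini-Hochberg step-up procedure compared here is short enough to state in code. This is the standard procedure (reject the k smallest p-values, where k is the largest i with p_(i) <= (i/m)*alpha), not the thesis's own implementation.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: reject the k smallest p-values, where k is the
    largest i such that p_(i) <= (i / m) * alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * alpha
    k = int(np.nonzero(below)[0].max()) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject
```

Storey's approach works in the opposite direction: it fixes a rejection region and estimates the FDR achieved within it, which is why the thesis compares the two by calibrating them to the same realized FDR.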
Multiple hypothesis testing can also be applied to regression diagnostics. In this thesis, a Bayesian method is proposed to test multiple hypotheses, of which the i-th null and alternative hypotheses are that the i-th observation is not an outlier versus that it is, for i=1,...,m. In the proposed Bayesian model, it is assumed that outliers have a mean shift, where the proportion of outliers and the mean shift respectively follow a Beta prior distribution and a normal prior distribution. It is proved in the thesis that for the proposed model, when there exists more than one outlier, the marginal distributions of the deletion residual of the i-th observation under both null and alternative hypotheses are doubly noncentral t distributions. The outlyingness of the i-th observation is measured by the marginal posterior probability that the i-th observation is an outlier given its deletion residual. An importance sampling method is proposed to calculate this probability. This method requires the computation of the density of the doubly noncentral F distribution, which is approximated using Patnaik's approximation. An algorithm is proposed in this thesis to examine the accuracy of Patnaik's approximation. The comparison of this algorithm's output with Patnaik's approximation shows that the latter can save massive computation time without losing much accuracy.
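The deletion residuals that the posterior probability conditions on are the externally studentized residuals of ordinary least squares. The sketch below uses the standard leave-one-out identity for the residual variance; it illustrates the quantity the thesis builds on, not the thesis's Bayesian computation itself.

```python
import numpy as np

def deletion_residuals(X, y):
    """Externally studentized (deletion) residuals:
    t_i = e_i / (s_(i) * sqrt(1 - h_ii)), with s_(i) the residual
    standard deviation of the fit that leaves observation i out."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta                                  # ordinary residuals
    h = np.diag(X @ np.linalg.pinv(X.T @ X) @ X.T)    # leverages h_ii
    s2 = e @ e / (n - p)
    # leave-one-out identity: (n-p-1) s_(i)^2 = (n-p) s^2 - e_i^2 / (1 - h_ii)
    s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
    return e / np.sqrt(s2_del * (1 - h))
```

On data that is nearly linear except for one shifted observation, that observation's deletion residual dominates, which is exactly the signal the Bayesian procedure converts into a posterior outlier probability.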
The proposed Bayesian multiple outlier identification procedure is applied to some simulated data sets. Various simulation and prior parameters are used to study the sensitivity of the posteriors to the priors. The area under the ROC curve (AUC) is calculated for each combination of parameters. A factorial design analysis on AUC is carried out by choosing various simulation and prior parameters as factors. The resulting AUC values are high for various selected parameters, indicating that the proposed method can identify the majority of outliers within tolerable errors. The results of the factorial design show that the priors do not have much effect on the marginal posterior probability as long as the sample size is not too small.
In this thesis, the proposed Bayesian procedure is also applied to a real data set obtained by Kanduc et al. in 2008. The proteomes of thirty viruses examined by Kanduc et al. are found to share a high number of pentapeptide overlaps with the human proteome. In a linear regression analysis of the level of viral overlap with the human proteome against the length of the viral proteome, Kanduc et al. report that among the thirty viruses, human T-lymphotropic virus 1, Rubella virus, and hepatitis C virus present relatively higher levels of overlap with the human proteome than predicted. The results obtained using the proposed procedure indicate that the four viruses with extremely large sizes (Human herpesvirus 4, Human herpesvirus 6, Variola virus, and Human herpesvirus 5) are more likely to be the outliers than the three reported viruses. The results with the four extreme viruses deleted confirm the claim of Kanduc et al.
|
10 |
Feature extraction and similarity-based analysis for proteome and genome databases / Ozturk, Ozgur / 20 September 2007 (has links)
No description available.
|