  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
51

Checking Metadata Usage for Enterprise Applications

Zhang, Yaxuan 20 May 2021 (has links)
It is becoming increasingly common for developers to build enterprise applications on the Spring framework or other Java frameworks. While developers enjoy the convenient implementations these web frameworks provide, they must pay attention to configuration deployment with metadata usage (i.e., Java annotations and XML deployment descriptors). Different formats of metadata can correspond to each other, and metadata usually exist in multiple files, so maintaining such metadata is challenging and time-consuming. Current compilers and research tools rarely inspect the XML files, let alone the correspondence between Java annotations and XML files. To help developers ensure the quality of metadata, this work presents a domain-specific language, RSL, and its engine, MeEditor. RSL facilitates pattern definition for correct metadata usage; MeEditor takes in the specified rules and checks Java projects for any rule violations. Developers define rules with RSL describing the metadata usage, then run the RSL script with MeEditor. Nine rules were extracted from the Spring specification and written in RSL. To evaluate MeEditor, we mined 180 plus 500 open-source projects from GitHub and conducted the evaluation in two steps. First, we evaluated the effectiveness of MeEditor by constructing a known ground-truth data set; on this data set, MeEditor identified metadata misuse with 94% precision, 94% recall, and 94% accuracy. Second, we evaluated the usefulness of MeEditor by applying it to real-world projects (500 projects in total). For the latest version of these 500 projects, MeEditor achieved 79% precision according to our manual inspection. We then applied MeEditor to the version histories of rule-adopted projects, i.e., projects that adopt the rule and are identified as correct in their latest version. MeEditor identified 23 bugs, which were later fixed by developers. / Master of Science
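The kind of consistency rule MeEditor checks can be illustrated with a small sketch. The abstract does not show RSL itself, so the rule, the regex, and all names below are hypothetical: the sketch flags a class that is declared both as a Spring XML bean and as an annotated component, one plausible example of conflicting metadata in two formats.

```python
import re
import xml.etree.ElementTree as ET

def beans_in_xml(xml_text):
    """Collect bean ids declared in a Spring-style XML deployment descriptor."""
    root = ET.fromstring(xml_text)
    return {el.get("id") for el in root.iter("bean") if el.get("id")}

def components_in_java(java_text):
    """Collect class names annotated @Component (a crude regex sketch,
    not a real Java parser)."""
    pattern = re.compile(r"@Component\s*(?:\([^)]*\))?\s*(?:public\s+)?class\s+(\w+)")
    return set(pattern.findall(java_text))

def check_rule(xml_text, java_text):
    """Hypothetical rule: a class must not be defined both as an XML bean
    and as an annotated component (duplicate definitions)."""
    xml_ids = {b.lower() for b in beans_in_xml(xml_text)}
    return [c for c in components_in_java(java_text) if c.lower() in xml_ids]
```

A real checker would of course parse Java with a proper front end and evaluate rules compiled from RSL; the point here is only the cross-file, cross-format comparison.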
52

GeneSieve: A Probe Selection Strategy for cDNA Microarrays

Shukla, Maulik 14 September 2004 (has links)
The DNA microarray is a powerful tool to study expression levels of thousands of genes simultaneously. Often, cDNA libraries representing expressed genes of an organism are available, along with expressed sequence tags (ESTs). ESTs are widely used as the probes for microarrays. Designing custom microarrays, rich in genes relevant to the experimental objectives, requires selection of probes based on their sequence. We have designed a probe selection method, called GeneSieve, to select EST probes for custom microarrays. To assign annotations to the ESTs, we cluster them into contigs using PHRAP. The larger contig sequences are then used for similarity search against known proteins in model organism such as Arabidopsis thaliana. We have designed three different methods to assign annotations to the contigs: bidirectional hits (BH), bidirectional best hits (BBH), and unidirectional best hits (UBH). We apply these methods to pine and potato EST sets. Results show that the UBH method assigns unambiguous annotations to a large fraction of contigs in an organism. Hence, we use UBH to assign annotations to ESTs in GeneSieve. To select a single EST from a contig, GeneSieve assigns a quality score to each EST based on its protein homology (PH), cross hybridization (CH), and relative length (RL). We use this quality score to rank ESTs according to seven different measures: length, 3' proximity, 5' proximity, protein homology, cross hybridization, relative length, and overall quality score. Results for pine and potato EST sets indicate that EST probes selected by quality score are relatively long and give better values for protein homology and cross hybridization. Results of the GeneSieve protocol are stored in a database and linked with sequence databases and known functional category schemes such as MIPS and GO. The database is made available via a web interface. 
A biologist can thus select a large number of EST probes based on annotations or functional categories quickly and easily. / Master of Science
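The abstract names the three components of the quality score (protein homology, cross hybridization, relative length) but not the formula, so the weighted combination below is an assumed illustration of how such a composite score could rank the ESTs of one contig:

```python
def quality_score(ph, ch, rl, weights=(1.0, 1.0, 1.0)):
    """Combine protein homology (ph), cross hybridization (ch, lower is
    better) and relative length (rl) into one score. All inputs are assumed
    normalized to [0, 1]; the weights are illustrative, not GeneSieve's."""
    w_ph, w_ch, w_rl = weights
    return w_ph * ph - w_ch * ch + w_rl * rl

def select_probe(contig_ests):
    """Pick the highest-scoring EST from one contig.
    Each EST is a tuple (name, ph, ch, rl)."""
    return max(contig_ests, key=lambda e: quality_score(e[1], e[2], e[3]))
```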
53

Studies of the genome and regulatory processes of Vibrio parahaemolyticus

Ingalls, Saylem Marquis 10 January 2011 (has links)
Vibrio parahaemolyticus is considered to be an emerging, yet understudied, human pathogen. The V. parahaemolyticus BB22OP genome was sequenced to allow for a comparative analysis between the genome of BB22OP and another previously sequenced, pathogenic strain of V. parahaemolyticus, RIMD2210633. V. parahaemolyticus BB22OP is interesting because it exhibits a spontaneous phenotypic switch in colony morphology due to the loss of a functional OpaR; this also influences virulence. OpaR is the major quorum-sensing regulator in V. parahaemolyticus, homologous to LuxR from V. harveyi. When opaR is removed from the RIMD2210633 genome, the same phenotypic switch is not seen, indicating a difference between the quorum-sensing systems in these two strains. Understanding the regulatory variation in these two strains has the potential to provide key insights into the control of pathogenesis in this organism. Initially, the BB22OP genome sequencing results aligned into 125 contigs. The genome has now been assembled into two distinct chromosomes with only two gaps remaining to be filled. These gaps are located in the integron region, which is difficult to assemble due to its structure. The integron is a series of gene cassettes separated by inverted repeats that facilitate the recombination events that build the integron. The integron region is further evidence of genetic differences between the two strains: the integron in the RIMD2210633 strain comprises 69 gene cassettes, while the BB22OP integron contains at least 86. There are 313 genes novel to the BB22OP genome, which could account for the phenotypic differences seen in these two strains. Additionally, five of the 313 genes are predicted to be transcriptional regulators, indicating the potential for differential gene regulation. Further comparative analysis will likely reveal more phenotypic divergence between the physiology of RIMD2210633 and BB22OP. 
Additionally, the CsrA regulatory network was explored in RIMD2210633. CsrA was first characterized in E. coli as a global regulator of carbon storage and metabolism. RIMD2210633 contains a CsrA homolog and was predicted to contain four CsrA-regulating sRNAs (CsrB1-3 and CsrC), and this work confirmed that these sRNAs regulate CsrA in the same manner as in E. coli. CsrA and the same CsrA-regulating sRNAs were found in the BB22OP genome as well. Since CsrA is known to regulate glycogen production, a qualitative iodine-staining plate assay and a quantitative glycogen assay were used to indirectly measure CsrA activity in the presence and absence of individual regulatory sRNAs. The RIMD2210633 CsrA, CsrB1, CsrB2, CsrB3 and CsrC were shown to have the predicted physiological role in recombinant E. coli, with higher glycogen levels observed when CsrA was active and lower levels when each of the sRNAs was overexpressed. CsrA is also known to regulate biofilm production and virulence factors. In an attempt to develop a screening method for potential CsrA targets, a transcriptional/translational fusion system was developed. Transcriptional and translational fusions to β-galactosidase were created for PdksA, PglgC1 and PtoxR from RIMD2210633. CsrA or CsrB2 was overexpressed in recombinant E. coli containing each of the fusion constructs to observe gene expression from these promoters at low and high CsrA activity levels. Surprisingly, changing the activity levels of CsrA impacted both transcriptional and translational levels, making the results of the assay difficult to interpret. Collectively, these efforts have enhanced our understanding of V. parahaemolyticus. In particular, the sequencing of BB22OP has allowed for a comparative analysis between the BB22OP and RIMD2210633 strains. These strains have remarkably conserved genomes despite the phenotypic differences they exhibit. 
It appears there is variation in the quorum-sensing systems of these two strains. Further analysis will reveal how the quorum-sensing regulons differ and how this impacts the virulence of these two pathogenic V. parahaemolyticus strains. / Master of Science
54

Bounded Expectation of Label Assignment: Dataset Annotation by Supervised Splitting with Bias-Reduction Techniques

Herbst, Alyssa Kathryn 20 January 2020 (has links)
Annotating large unlabeled datasets can be a major bottleneck for machine learning applications. We introduce a scheme for inferring labels of unlabeled data at a fraction of the cost of labeling the entire dataset. We refer to the scheme as Bounded Expectation of Label Assignment (BELA). BELA greedily queries an oracle (or human labeler) and partitions a dataset to find data subsets that have mostly the same label. BELA can then infer labels by majority vote of the known labels in each subset. BELA makes the decision to split or label from a subset by maximizing a lower bound on the expected number of correctly labeled examples. BELA improves upon existing hierarchical labeling schemes by using supervised models to partition the data, therefore avoiding reliance on unsupervised clustering methods that may not accurately group data by label. We design BELA with strategies to avoid bias that could be introduced through this adaptive partitioning. We evaluate BELA on labeling of four datasets and find that it outperforms existing strategies for adaptive labeling. / Master of Science / Most machine learning classifiers require data with both features and labels. The features of the data may be the pixel values of an image, the words in a text sample, the audio of a voice clip, and more. The labels of a dataset define the data: they place each data point into one of several categories, such as determining whether an image is of a cat or a dog, or adding subtitles to YouTube videos. The labeling of a dataset can be expensive, and usually requires a human annotator. Human-labeled data can be even more expensive when it requires an expert labeler, as in the labeling of medical images, or when labeling is particularly time-consuming. We introduce a scheme for labeling data that aims to lessen the cost of human-labeled data by labeling a subset of an entire dataset and making an educated guess on the labels of the remaining unlabeled data. The labeled data generated by our approach may then be used to train a classifier, an algorithm that maps the features of data to a guessed label. This is based on the intuition that data with similar features will also have similar labels. Our approach uses a game-like process of, at any point, choosing between one of two possible actions: we may either label a new data point, thus learning more about the dataset, or we may split the dataset into multiple subsets of data. We eventually guess the labels of the unlabeled data by assigning each unlabeled data point the majority label of the data subset that it belongs to. The novelty in our approach is that we use supervised classifiers, i.e., splitting techniques that use both the features and the labels of data, to split a dataset into new subsets. We use bias-reduction techniques that enable us to use supervised splitting.
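BELA's final inference step — a majority vote within each subset — can be sketched as follows. The bound-maximizing split-or-label decision and the supervised splitting are omitted, and the data layout (a list of subsets holding `(features, label-or-None)` pairs) is an assumption for illustration:

```python
from collections import Counter

def infer_labels(subsets):
    """Assign every unlabeled point (label is None) the majority label of
    its subset; labeled points keep their known label."""
    inferred = []
    for subset in subsets:
        known = [lbl for _, lbl in subset if lbl is not None]
        majority = Counter(known).most_common(1)[0][0] if known else None
        inferred.append([(x, lbl if lbl is not None else majority)
                         for x, lbl in subset])
    return inferred
```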
55

Deciphering Demotic Digitally

Korte, Jannik, Maderna-Sieben, Claudia, Wespi, Fabian 20 April 2016 (has links) (PDF)
In starting the Demotic Palaeographical Database Project, we intend to build an online database that pays special attention to the actual appearance of Demotic papyri and texts, down to the level of the individual sign. Our idea is to analyse a papyrus with respect to its visual nature, so that each Demotic sign can be compared to other representations of the same sign in other texts and its occurrences studied in different words. Words shall be analysed not only in their textual context but also by their orthography, and it should be possible to study even the papyrus itself by means of its material features. The Demotic Palaeographical Database Project therefore aims to create a modern, online-accessible Demotic palaeography, glossary of word spellings and corpus of manuscripts, which will not only be a convenient tool for Egyptologists and researchers interested in the Demotic writing system or artefacts inscribed with Demotic script, but will also serve the conservation of cultural heritage. In our paper, we present our conceptual ideas and the preliminary version of the database in order to demonstrate its functionalities and possibilities.
56

Numbers, winds and stars

Palladino, Chiara 17 March 2017 (has links) (PDF)
No description available.
57

Bioinformatic analysis of the apple genome and epigenome

Daccord, Nicolas 27 November 2018 (has links)
Apple is one of the most consumed fruits in the world. Using the latest sequencing (PacBio) and optical mapping (BioNano) technologies, we have generated a high-quality de novo assembly of the apple (Malus domestica Borkh.) genome. We performed a gene annotation as well as a transposable element annotation to allow this assembly to be used as a reference genome. The high contiguity of the assembly allowed us to exhaustively detect the transposable elements, which represented over half the assembly, providing an unprecedented opportunity to investigate the uncharacterized regions of a tree genome. We also found that the apple genome is entirely duplicated, as shown by the synteny links between chromosomes. Using Whole Genome Bisulfite Sequencing (WGBS) and the previously generated assembly, we produced genome-wide DNA methylation maps and showed a general correlation between DNA methylation next to genes and gene expression. Moreover, we identified several Differentially Methylated Regions (DMRs) between apple fruit and leaf methylomes, associated with candidate genes that could be involved in agronomically relevant traits such as apple fruit development. Finally, we developed a complete and easy-to-use pipeline that handles the full treatment of WGBS data, from read mapping to DMR computation, and can cope with datasets having a low number of biological replicates.
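The DMR computation at the end of such a pipeline can be sketched, under assumptions, as a sliding-window comparison of mean methylation levels between the two tissues. The window size and difference threshold below are illustrative placeholders, not the pipeline's actual parameters, and real DMR callers also apply statistical tests across replicates:

```python
def call_dmrs(fruit, leaf, min_diff=0.4, window=5):
    """Slide a window over per-cytosine methylation levels (values in [0, 1],
    same coordinates in both lists) and report windows whose mean methylation
    differs by at least min_diff. Returns (start, end, fruit_minus_leaf)."""
    dmrs = []
    for start in range(0, len(fruit) - window + 1):
        f = sum(fruit[start:start + window]) / window
        l = sum(leaf[start:start + window]) / window
        if abs(f - l) >= min_diff:
            dmrs.append((start, start + window, f - l))
    return dmrs
```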
58

Functional annotation of divergent genomes: application to the Leishmania parasite

Ghouila, Amel 16 December 2013 (has links)
The determination of protein domain composition provides strong clues for protein function prediction. One of the most widely used domain schemes is the Pfam database, in which each family is represented by a multiple sequence alignment and a profile Hidden Markov Model (profile HMM). When analyzing a new sequence, each Pfam HMM is used to compute a score measuring the similarity between the sequence and the domain. However, applied to divergent proteins, this strategy may miss several domains. This is the case for all eukaryotic pathogens, where no Pfam domains are detected in half or even more of their proteins. The main objective of this thesis is to develop methods to improve the sensitivity of Pfam domain detection in divergent proteins. We first adapted the recently proposed CODD method to the whole set of pathogens in EuPathDB. A public database named EuPathDomains (http://www.atgc-montpellier.fr/EuPathDomains/) gathers known and new domains detected by CODD, along with the associated confidence measurements and GO annotations. We then proposed other methods to further improve domain detection in these organisms. The 'CODD_exclusive' algorithm integrates domain exclusion information to prune false positive domains that conflict with other domains of the protein. We also suggested the use of association rules to determine correlations between domains, and used this information in the certification process. In the last part of this thesis, we focused on the use of profile/profile methods to predict protein domains in a whole genome. Combined with the co-occurrence information, this approach achieved high sensitivity and accuracy in predicting domains.
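The co-occurrence certification idea behind CODD can be sketched as follows: a weak (below-threshold) domain hit is accepted when it is known to co-occur with a confidently detected domain of the same protein. The scoring, thresholds, and association-rule mining are omitted, and the Pfam accessions below are only examples:

```python
def certify(strong_hits, weak_hits, cooccurrence):
    """Certify weak domain hits that co-occur with a strong hit.
    cooccurrence maps a domain accession to the set of domains it is
    known to pair with; strong hits are always kept."""
    certified = set(strong_hits)
    for dom in weak_hits:
        if any(dom in cooccurrence.get(s, set()) for s in strong_hits):
            certified.add(dom)
    return certified
```

A domain-exclusion variant (as in CODD_exclusive) would work the other way around, discarding a candidate that conflicts with an already-certified domain.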
59

Models and software tools for the semantic annotation of educational documents

Mille, Dominique 26 October 2005 (has links) (PDF)
This thesis focuses on annotations produced by learners on electronic documents. Annotations are useful both for recording a comprehension process and for easily retrieving information. We observe that electronic annotation is little practiced because on-screen reading and annotating are uncomfortable. Moreover, annotations carry an implicit semantics that is lost when they are reused, for example the link between a color and the purpose of the annotation. In this context, our objective is to propose effective formalisms and tools for the electronic annotation of educational resources by learners. Effectiveness means that the tools must be adapted at both the software and hardware levels, that they anticipate reuse so as to avoid systematically printing documents, and that they offer the advantages of computerized processing. It also means that annotations must be preserved in their entirety: their semantics must therefore be made explicit, both in a formal representation and in an annotation tool. More precisely, our work comprises a proposal for a formal representation of annotation, which we implement and test in ecological experiments. As a result, we produce a specification for an effective annotation tool based on the paper and pencil-case metaphors: readers keep their paper habits while reading and creating annotations, and benefit from the advantages of computerized processing for valuation, search, and sharing.
60

Semi-automatic Semantic Video Annotation Tool

Aydinlilar, Merve 01 December 2011 (has links) (PDF)
Semantic annotation of video content is necessary for the indexing and retrieval tasks of video management systems. Currently, it is not possible to extract all high-level semantic information from video data automatically. Video annotation tools assist users in generating annotations to represent video data; the generated annotations can also be used for testing and evaluating content-based retrieval systems. In this study, a semi-automatic semantic video annotation tool is presented. Generated annotations are in MPEG-7 metadata format to ensure interoperability. With the help of image processing and pattern recognition solutions, the annotation process is partly automated and annotation time is reduced. Annotations can be made for spatio-temporal decompositions of video data. Extraction of low-level visual descriptions is included to obtain complete descriptions.
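A minimal, MPEG-7-flavoured annotation for one temporal video segment might be generated as below. The real MPEG-7 schema is namespaced and far richer than this, so the element names here are only a structural sketch of the idea (a segment with a time anchor and a free-text label):

```python
import xml.etree.ElementTree as ET

def segment_annotation(start, duration, label):
    """Build a minimal description of one temporal video segment:
    a start time point, a duration, and a free-text annotation."""
    seg = ET.Element("VideoSegment")
    time = ET.SubElement(seg, "MediaTime")
    ET.SubElement(time, "MediaTimePoint").text = start
    ET.SubElement(time, "MediaDuration").text = duration
    ET.SubElement(seg, "FreeTextAnnotation").text = label
    return ET.tostring(seg, encoding="unicode")

# e.g. segment_annotation("T00:00:05", "PT10S", "goal scored")
```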
