Spelling suggestions: "subject:"biolological databases"" "subject:"bybiological databases""
1 |
Διερεύνηση της βάσης βιολογικών δεδομένων COGENT για την πρόσθεση πληροφοριών βιβλιογραφικής ύλης και πληροφοριών νουκλεοτιδικής αλληλουχίας (DNA)Χριστοπούλου, Δέσποινα 09 October 2009 (has links)
Σήμερα υπάρχει ελεύθερη πρόσβαση μέσω του internet σε εκατοντάδες δημόσιες βάσεις βιολογικών δεδομένων. Παραταύτα, η προσπάθεια του να εκμεταλλευτεί κάποιος τα αποθηκευμένα δεδομένα ανομοιογενών βάσεων δεδομένων, καταλήγει να αποτελεί μια διαδικασία ιδιαίτερα δύσκολη και χρονοβόρα λόγω ποικίλων αιτιάσεων. Στις αιτίες αυτές συμπεριλαμβάνονται ο χαοτικός όγκος των βιολογικών δεδομένων, ο ολοένα αυξανόμενος αριθμός βιολογικών βάσεων δεδομένων, η υπεραφθονία τύπων και μορφών δεδομένων (format), η ποικιλομορφία βιοπληροφορικών τεχνικών πρόσβασης στα δεδομένα και βέβαια η διαφορετικότητα των βάσεων βιολογικών δεδομένων.
Χάρη στις διεθνείς προσπάθειες ολοκλήρωσης αλληλουχιών (sequencing), οι ομάδες γονιδιακών δεδομένων έχουν αυξηθεί γεωμετρικά την τελευταία δεκαετία. Το έτος 2003 για παράδειγμα, η βάση βιολογικών δεδομένων Genbank διπλασιάστηκε σε μέγεθος μέσα σε 15 μήνες. Με τόσο γρήγορη ανάπτυξη, τα γενωμικά δεδομένα και οι συνδεόμενες με αυτά δομές έχουν αποκτήσει τεράστιο μέγεθος για να χωρέσουν στην κεντρική μνήμη ενός υπολογιστή. Το σημαντικότερο πρόβλημα που ανακύπτει έγκειται στο ότι μεγάλο μέρος της πληροφορίας που αναζητείται μέσα στο τεράστιο και ολοένα αυξανόμενο σε μέγεθος ορυχείο των δεδομένων εν τέλει χάνεται.
Η ανάγκη κατασκευής των κατάλληλων εργαλείων εξ’ όρυξης της ζητούμενης πληροφορίας από το ορυχείο αυτό είναι μονόδρομος.
Η παρούσα διπλωματική εργασία επικεντρώνεται στην διεύρυνση μιας υπάρχουσας βάσης βιολογικών δεδομένων ολοκληρωμένων γονιδιωμάτων, της COGENT. Η COGENT αναπτύχθηκε το 2003 από την Ομάδα Υπολογιστικής Γενωμικής (Computational Genomics Group – CGG), στο Ευρωπαϊκό Ινστιτούτο Βιοπληροφορικής (European Bioinformatics Institute – EBI), και τελικός τεχνικός στόχος της διπλωματικής εργασίας αποτελεί η προσθήκη βιβλιογραφικών δεδομένων καθώς και νουκλεοτιδικών πληροφοριών αλληλουχίας (DNA) στην βάση COGENT. / Today, hundreds of public biological databases are accessible via the Internet
However taking advantage of data stored in heterogeneous biological databases can be
a difficult, time consuming task for a multitude of reasons. These reasons include the
vast volume of biological data, the growing number of biological databases, the rapid
rate in the growth of data, the overabundance of data types and formats, the wide
Variety of bioinformatics data access techniques, and database heterogeneity.
Thanks to international sequencing efforts, genome data sets have been
growing exponentially in the past few years. The GenBank database, for example, has
doubled every 15 months. With such a rapid growth, genome data and the associated
access structures have become too large to fit in the main memory of a computer,
leading to a large number of disk accesses (and therefore, slow response times) for
homology searches and other queries. Much of the important information in this
enormous and exponentially growing gold mine will be wasted if we do not develop
proper tools to access and mine them efficiently.
The focus of this thesis was to extend an existing biological database for the
complete tracking of genomes, the COGENT database, which the Computational
Genomics Group at the European Bioinformatics Institute in Cambridge produced in
2003, so that it can incorporate literature and DNA sequence information.
|
2 |
Genômica translacional: integrando dados clínicos e biomoleculares / Translational genomics: integrating clinical and biomolecular dataMiyoshi, Newton Shydeo Brandão 06 February 2013 (has links)
A utilização do conhecimento científico para promoção da saúde humana é o principal objetivo da ciência translacional. Para que isto seja possível, faz-se necessário o desenvolvimento de métodos computacionais capazes de lidar com o grande volume e com a heterogeneidade da informação gerada no caminho entre a bancada e a prática clínica. Uma barreira computacional a ser vencida é o gerenciamento e a integração dos dados clínicos, sócio-demográficos e biológicos. Neste esforço, as ontologias desempenham um papel essencial, por serem um poderoso artefato para representação do conhecimento. Ferramentas para gerenciamento e armazenamento de dados clínicos na área da ciência translacional que têm sido desenvolvidas, via de regra falham por não permitir a representação de dados biológicos ou por não oferecer uma integração com as ferramentas de bioinformática. Na área da genômica existem diversos modelos de bancos de dados biológicos (tais como AceDB e Ensembl), os quais servem de base para a construção de ferramentas computacionais para análise genômica de uma forma independente do organismo de estudo. Chado é um modelo de banco de dados biológicos orientado a ontologias, que tem ganhado popularidade devido a sua robustez e flexibilidade, enquanto plataforma genérica para dados biomoleculares. Porém, tanto Chado quanto os outros modelos de banco de dados biológicos não estão preparados para representar a informação clínica de pacientes. Este projeto de mestrado propõe a implementação e validação prática de um framework para integração de dados, com o objetivo de auxiliar a pesquisa translacional integrando dados biomoleculares provenientes das diferentes tecnologias omics com dados clínicos e sócio-demográficos de pacientes. A instanciação deste framework resultou em uma ferramenta denominada IPTrans (Integrative Platform for Translational Research), que tem o Chado como modelo de dados genômicos e uma ontologia como referência. Chado foi estendido para permitir a representação da informação clínica por meio de um novo Módulo Clínico, que utiliza a estrutura de dados entidade-atributo-valor. Foi desenvolvido um pipeline para migração de dados de fontes heterogêneas de informação para o banco de dados integrado. O framework foi validado com dados clínicos provenientes de um Hospital Escola e de um banco de dados biomoleculares para pesquisa de pacientes com câncer de cabeça e pescoço, assim como informações de experimentos de microarray realizados para estes pacientes. Os principais requisitos almejados para o framework foram flexibilidade, robustez e generalidade. A validação realizada mostrou que o sistema proposto satisfaz as premissas, levando à integração necessária para a realização de análises e comparações dos dados. / The use of scientific knowledge to promote human health is the main goal of translational science. To make this possible, it is necessary to develop computational methods capable of dealing with the large volume and heterogeneity of information generated on the road between bench and clinical practice. A computational barrier to be overcome is the management and integration of clinical, biological and socio-demographics data. In this effort, ontologies play a crucial role, being a powerful artifact for knowledge representation. Tools for managing and storing clinical data in the area of translational science that have been developed, usually fail due to the lack on representing biological data or not offering integration with bioinformatics tools. In the field of genomics there are many different biological databases (such as AceDB and Ensembl), which are the basis for the construction of computational tools for genomic analysis in an organism independent way. Chado is a ontology-oriented biological database model which has gained popularity due to its robustness and flexibility, as a generic platform for biomolecular data. However, both Chado as other models of biological databases are not prepared to represent the clinical information of patients. This project consists in the proposal, implementation and validation of a practical framework for data integration, aiming to help translational research integrating data coming from different omics technologies with clinical and socio-demographic characteristics of patients. The instantiation of the designed framework resulted in a computational tool called IPTrans (Integrative Platform for Translational Research), which has Chado as template for genomic data and uses an ontology reference. Chado was extended to allow the representation of clinical information through a new Clinical Module, which uses the data structure entity-attribute-value. We developed a pipeline for migrating data from heterogeneous sources of information for the integrated database. The framework was validated with clinical data from a School Hospital and a database for biomolecular research of patients with head and neck cancer. The main requirements were targeted for the framework flexibility, robustness and generality. The validation showed that the proposed system satisfies the assumptions leading to integration required for the analysis and comparisons of data.
|
3 |
Ανάπτυξη ολοκληρωμένου συστήματος εξόρυξης και οπτικοποίησης γνώσης από βιολογικά δεδομέναΓκαντούνα, Βασιλική 25 January 2012 (has links)
Στα τέλη του 20ου αιώνα, οι παράλληλες εξελίξεις και η ανάπτυξη καινοτόμων μεθόδων και εργαλείων σε διαφορετικές ερευνητικές περιοχές είχε ως αποτέλεσμα την εμφάνιση των λεγόμενων "αναδυόμενων τεχνολογιών" (emerging technologies). Σε αυτό το πλαίσιο λοιπόν, των αναδυόμενων τεχνολογιών, εμφανίστηκε στο προσκήνιο η επιστήμη της Βιοπληροφορικής (Bioinformatics) η οποία αποτελεί την τομή των επιστημών της βιολογίας και της πληροφορικής. Η ραγδαία ανάπτυξη της τεχνολογίας έχει οδηγήσει στην εκρηκτική αύξηση του ρυθμού παραγωγής βιολογικών δεδομένων, γεγονός που καθιστά επιτακτική την ανάγκη της αποδοτικής και αποτελεσματικής διαχείρισης τους. Για την κάλυψη αυτής ακριβώς της ανάγκης δημιουργήθηκαν οι βιολογικές βάσεις δεδομένων που έχουν σήμερα εξαιρετική δυναμική και περιθώρια εφαρμογών.
Οι βασικοί τομείς έρευνας στο πλαίσιο των βιολογικών βάσεων δεδομένων μπορούν να ταξινομηθούν σε τρεις μεγάλες κατηγορίες. Η πρώτη κατηγορία αφορά στην όσο το δυνατόν πιο αποδοτική οργάνωση των βιολογικών δεδομένων ώστε να είναι δυνατή η αποτελεσματική αποθήκευση τους. Αυτός ακριβώς είναι και ο λόγος δημιουργίας των βιολογικών βάσεων δεδομένων. Η δεύτερη κατηγορία αφορά στην ανάπτυξη εργαλείων και μεθόδων που επιτρέπουν την ανάλυση και την επεξεργασία των βιολογικών δεδομένων έτσι ώστε να διευκολυνθεί η διαδικασία ανακάλυψης γνώσης από αυτά. Σε αυτή την κατηγορία, σημαντικό ρόλο παίζουν οι τεχνικές εξόρυξης γνώσης οι οποίες εφαρμόζονται πάνω σε μεγάλες συλλογές βιολογικών δεδομένων και συνήθως οδηγούν στην ανακάλυψη νέων σχέσεων και προτύπων που κρύβονται ανάμεσα στα δεδομένα. Τέλος, η τρίτη κατηγορία αφορά στην ανάπτυξη εργαλείων που διευκολύνουν την διαδικασία της βιολογικής ερμηνείας των αποτελεσμάτων της εξόρυξης. Εδώ, ουσιαστικό ρόλο κατέχουν οι τεχνικές οπτικοποίησης της παραγόμενης γνώσης για την όσο το δυνατόν πιο κατανοητή παρουσίαση των συμπερασμάτων στον άνθρωπο ο οποίος στην συνέχεια θα επιλέξει ποια από αυτά είναι πραγματικά χρήσιμα.
Η δημιουργία ενός ολοκληρωμένου συστήματος που θα αποτελεί τον απότοκο της τεχνολογικής σύζευξης των τεχνικών των τριών παραπάνω κατηγοριών σε συνδυασμό με την ανάγκη αξιοποίησης μιας μέχρι πρότινος ανεκμετάλλευτης μεγάλης συλλογής βιολογικών δεδομένων αποτέλεσαν το κίνητρο για την εκπόνηση της παρούσας διπλωματικής εργασίας.
Στόχος της εργασίας είναι η ανάπτυξη ενός ολοκληρωμένου συστήματος το οποίο χρησιμοποιώντας την τεχνολογία Microsoft PivotViewer θα απεικονίζει την παραπάνω συλλογή δεδομένων προσφέροντας ένα υψηλό επίπεδο αναπαράστασης και θα καταγράφει τις συχνότητες εμφάνισης των μεταλλάξεων και άλλων γενετικών παραλλαγών ανά πληθυσμιακές ομάδες σε παγκόσμια κλίμακα. Το σύστημα αυτό θα μπορεί να λειτουργήσει ως ένα σύγχρονο εκπαιδευτικό και διαγνωστικό εργαλείο για την πληθυσμιακή μελέτη της παθογένειας και της θεραπείας ασθενειών που οφείλονται σε κάποια γενετική διαταραχή.
Ο χρήστης διαμέσου ενός εύχρηστου και φιλικού περιβάλλοντος διεπαφής θα μπορεί να εστιάσει από μια μεγάλη συλλογή δεδομένων σε ένα εξειδικευμένο υποσύνολό της που ενδεχομένως σχετίζεται με μία συγκεκριμένη ασθένεια, μία συγκεκριμένη μελέτη ή έναν συγκεκριμένο πληθυσμό παρατηρώντας έτσι τα δεδομένα αυτά από μια διαφορετική οπτική γωνία που ενδεχομένως να τον βοηθήσει να ανακαλύψει νέα πρότυπα και σχέσεις ανάμεσα τους αξιόλογης βιολογικής σημασίας. / In the late 20th century, parallel advances and the development of innovative methods and tools in different research areas resulted in the appearance of the so-called "emerging technologies". In the framework of emerging technologies, the science of Bioinformatics came to the fore which is the intersection of the sciences of biology and informatics. The rapid growth of technology has led to the explosive increase in the rate of production of biological data, which dictates the need for efficient and effective data management. Biological databases have been created to satisfy exactly this need and they have extremely dynamic and potential applications today.
The main research areas in biological databases can be classified into three broad categories. The first category concerns the better organization of the biological data so as to enable efficient storage. This is the reason for the development of the biological databases. The second category concerns the development of tools and methods that allow analysis and processing of biological data to facilitate the process of discovering knowledge from them. In this category, data mining techniques play an important role. They are applied over large collections of biological data and often lead to the discovery of new relationships and patterns that lie between the data. Finally, the third category involves the development of tools that facilitate the process of understanding and visualizing the biological meaning of the data mining results. Here, the visualization techniques have an essential role in presenting the data mining results in a meaningful way to the scientists who will eventually decide which of these results are really useful and reliable.
The development of an integrated system which will be the result of the technological coupling of the three above categories in conjunction with the need of utilization a previously unexploited large collection of biological data was the motivation for the elaboration of this thesis.
This work aims to develop an integrated system which represents the above collection providing a high level visualization and records the frequencies of causative genetic variations worldwide by utilizing the Microsoft PivotViewer technology. This system can serve as a modern educational and diagnostic tool for the population-based study of the pathogenesis and treatment of diseases caused by a genetic disorder.
The user through a user-friendly interface can zoom in from the massive amounts of data to particular disease-specific, study-specific, or population-specific data so that he can begin observing the data from a different perspective that may enable him to discover new patterns and relationships between them of remarkable biological importance.
|
4 |
Genômica translacional: integrando dados clínicos e biomoleculares / Translational genomics: integrating clinical and biomolecular dataNewton Shydeo Brandão Miyoshi 06 February 2013 (has links)
A utilização do conhecimento científico para promoção da saúde humana é o principal objetivo da ciência translacional. Para que isto seja possível, faz-se necessário o desenvolvimento de métodos computacionais capazes de lidar com o grande volume e com a heterogeneidade da informação gerada no caminho entre a bancada e a prática clínica. Uma barreira computacional a ser vencida é o gerenciamento e a integração dos dados clínicos, sócio-demográficos e biológicos. Neste esforço, as ontologias desempenham um papel essencial, por serem um poderoso artefato para representação do conhecimento. Ferramentas para gerenciamento e armazenamento de dados clínicos na área da ciência translacional que têm sido desenvolvidas, via de regra falham por não permitir a representação de dados biológicos ou por não oferecer uma integração com as ferramentas de bioinformática. Na área da genômica existem diversos modelos de bancos de dados biológicos (tais como AceDB e Ensembl), os quais servem de base para a construção de ferramentas computacionais para análise genômica de uma forma independente do organismo de estudo. Chado é um modelo de banco de dados biológicos orientado a ontologias, que tem ganhado popularidade devido a sua robustez e flexibilidade, enquanto plataforma genérica para dados biomoleculares. Porém, tanto Chado quanto os outros modelos de banco de dados biológicos não estão preparados para representar a informação clínica de pacientes. Este projeto de mestrado propõe a implementação e validação prática de um framework para integração de dados, com o objetivo de auxiliar a pesquisa translacional integrando dados biomoleculares provenientes das diferentes tecnologias omics com dados clínicos e sócio-demográficos de pacientes. A instanciação deste framework resultou em uma ferramenta denominada IPTrans (Integrative Platform for Translational Research), que tem o Chado como modelo de dados genômicos e uma ontologia como referência. Chado foi estendido para permitir a representação da informação clínica por meio de um novo Módulo Clínico, que utiliza a estrutura de dados entidade-atributo-valor. Foi desenvolvido um pipeline para migração de dados de fontes heterogêneas de informação para o banco de dados integrado. O framework foi validado com dados clínicos provenientes de um Hospital Escola e de um banco de dados biomoleculares para pesquisa de pacientes com câncer de cabeça e pescoço, assim como informações de experimentos de microarray realizados para estes pacientes. Os principais requisitos almejados para o framework foram flexibilidade, robustez e generalidade. A validação realizada mostrou que o sistema proposto satisfaz as premissas, levando à integração necessária para a realização de análises e comparações dos dados. / The use of scientific knowledge to promote human health is the main goal of translational science. To make this possible, it is necessary to develop computational methods capable of dealing with the large volume and heterogeneity of information generated on the road between bench and clinical practice. A computational barrier to be overcome is the management and integration of clinical, biological and socio-demographics data. In this effort, ontologies play a crucial role, being a powerful artifact for knowledge representation. Tools for managing and storing clinical data in the area of translational science that have been developed, usually fail due to the lack on representing biological data or not offering integration with bioinformatics tools. In the field of genomics there are many different biological databases (such as AceDB and Ensembl), which are the basis for the construction of computational tools for genomic analysis in an organism independent way. Chado is a ontology-oriented biological database model which has gained popularity due to its robustness and flexibility, as a generic platform for biomolecular data. However, both Chado as other models of biological databases are not prepared to represent the clinical information of patients. This project consists in the proposal, implementation and validation of a practical framework for data integration, aiming to help translational research integrating data coming from different omics technologies with clinical and socio-demographic characteristics of patients. The instantiation of the designed framework resulted in a computational tool called IPTrans (Integrative Platform for Translational Research), which has Chado as template for genomic data and uses an ontology reference. Chado was extended to allow the representation of clinical information through a new Clinical Module, which uses the data structure entity-attribute-value. We developed a pipeline for migrating data from heterogeneous sources of information for the integrated database. The framework was validated with clinical data from a School Hospital and a database for biomolecular research of patients with head and neck cancer. The main requirements were targeted for the framework flexibility, robustness and generality. The validation showed that the proposed system satisfies the assumptions leading to integration required for the analysis and comparisons of data.
|
5 |
Evaluation of Annotation Performances between Automated and Curated Databases of <i>E.COLI</i> Using the Correlation CoefficientMarpuri, ReddySalilaja 01 August 2009 (has links)
This project compared the performance of the correlation coefficient to show similarities in annotations between a predictive automated bacterial annotation database and the curated EcoCyc database. EcoCyc is a conservative multidimensional annotation system that is exclusively based on experimentally validated findings by over 15,000 publications. The automated annotation system, used in the comparison was BASys. It is often used as a first pass annotation tool that tries to add as many annotations as possible by drawing upon over 30 information sources. Gene ontology served as one basis of comparison between these databases because of the limited common terms in the ontology annotations. Translation libraries were used to extend the number of BASys terms that could be compared to the gene ontology terms in EcoCyc. Additional, non-ontology terms and metadata in BASys were compared to EcoCyc terms after parsing them into root words. The different term sources were quantitatively compared by using the correlation coefficient as the evaluation metric. The direct gene ontology comparison gave the lowest correlation coefficient. The addition of gene ontology terms to BASys by using translation tables of metadata greatly increased the correlation coefficient, which was comparable to the parsed word comparison. The combination of enhanced gene ontology and parsed word methods gave the highest correlation coefficient of 0.16.
The controlled vocabulary system of gene ontology was not sufficient to compare two annotated databases. The addition of gene ontology terms from translation libraries greatly increased the performance of these comparisons. In general, as the number of comparison terms increased the correlation coefficient increased. Future comparisons should include the enhanced gene ontology dataset in order to monitor the organization pertaining to formal nomenclature and the datasets generated from Word parsing can be used to monitor the degree of additional terms might be incorporated with translation libraries.
|
6 |
Εξόρυξη γνώσης από ιατροβιολογικά δεδομένα / Biomedical data miningΚαλλά, Μαρία-Παυλίνα 28 February 2013 (has links)
Πίσω από όλα αυτά τα δεδομένα που υπάρχουν
κρύβεται ένας τεράστιος θησαυρός γνώσεων τον οποίο δεν μπορούμε να αντιληφθούμε καθώς η μορφή των πληροφοριών δεν μας το επιτρέπει. Έτσι αναπτύχθηκαν μέθοδοι και τεχνικές που μας βοηθούν να βρούμε την κρυμμένη
γνώση και να την αξιοποιήσουμε προς όφελος κυρίως του κοινού και η πιο γνωστή
μέθοδος, με την οποία θα ασχοληθούμε και εμείς είναι η Εξόρυξη Γνώσης.
Στην εργασία που ακολουθεί θα μιλήσουμε για την χρήση των μεθόδων Εξόρυξης Γνώσης (όπως λέγονται) σε βιοϊατρικά δεδομένα.
Στην αρχή θα κάνουμε αναφορά στην Μοριακή Βιολογία και στην Βιοπληροφορική. Ακολούθως θα δουμε την Ανακάλυψη γνώσης από βάσεις δεδομένων. Θα δούμε αναλυτικά την Εξόρυξη γνώσης και πιο πολύ τις μεθόδους κατηγοριοποίησης. Τέλος θα εφαρμόσουμε τους αλγορίθμους σε ιατροβιολογικά δεδομένα και θα δούμε τα συμπεράσματα που προκύπτουν αλλά και μελλοντικές επεκτάσεις. / Behind all these data
there is hidden a huge treasure of knowledge which we can not understand . Thus developed methods and techniques that help us find the hidden
knowledge and to utilize it for the benefit of the public.
The most famous method, which we will study, is Data Mining.
In the work that follows we will discuss the use of data mining methods (as they are called) in biomedical data.
In the beginning, we will report information about Molecular Biology and Bioinformatics. Then. we will see the knowledge discovery in databases. We will see in detail the Data Mining and the classification methods. Finally we implement the algorithms in biomedical data and see the conclusions and future extensions.
|
7 |
Σχεδιασμός & ανάπτυξη μιας μετα-βάσης δεδομένων για το δίκτυο πρωτεϊνικών αλληλεπιδράσεων στον άνθρωποΓιουτλάκης, Άρης 26 July 2013 (has links)
Η αποσαφήνιση της σχέσης του γονοτύπου με το φαινότυπο ενός οργανισμού είναι μια από τις μεγαλύτερες προκλήσεις των επιστημών ζωής σήμερα. Για την επίτευξη του στόχου αυτού, η κατανόηση της δομής και της ρύθμισης του δικτύου πρωτεϊνικών αλληλεπιδράσεων (ΔΠΑ) είναι ένα από τα καθοριστικά στάδια αυτής της συσχέτισης. Πρώτο βήμα προς την κατεύθυνση αυτή αποτελεί η λεπτομερής και ακριβής ανακατασκευή του ΔΠΑ. Πειραματικά αποτελέσματα που υποστηρίζουν πρωτεϊνικές αλληλεπιδράσεις δημοσιεύονται στη βιβλιογραφία, από όπου η γνώση αυτή εξορύσσεται είτε μέσω άμεσης καταγραφής από ερευνητές είτε μέσω υπολογιστικών αλγορίθμων ανάλυσης κειμένου, και αποθηκεύεται σε πρωτογενείς βάσεις δεδομένων πρωτεϊνικών αλληλεπιδράσεων (ΒΔΠΑ). Για το ΔΠΑ στον άνθρωπο, υπάρχουν αρκετές ΒΔΠΑ, οι οποίες λόγω διαφορετικών στόχων, τρόπων εξόρυξης γνώσης από τη βιβλιογραφία και διαφορετικής διαχείρισης της βάσης, παρουσιάζουν μικρή επικάλυψη, περιγράφουν τα δεδομένα τους με ασύμβατο μεταξύ τους τρόπο και ορολογία, και ορίζουν τις πρωτεϊνικές αλληλεπιδράσεις μέσω διαφορετικών επιπέδων αναφοράς της γονιδιακής πληροφορίας.
Για την ενοποίηση δεδομένων πρωτεϊνικών αλληλεπιδράσεων από διάφορες πρωτογενείς βάσεις έχουν αναπτυχθεί μετα-βάσεις, οι οποίες προσπαθούν να ξεπεράσουν τα προβλήματα που προκύπτουν από την ετερογένεια των ΒΔΠΑ. Και στην περίπτωση των μεταβάσεων, όμως, ανακύπτουν προβλήματα, που αφορούν: α) στο ότι το δίκτυο ορίζεται με βάση τις πρωτεϊνικές αλληλεπιδράσεις και όχι τις πρωτεΐνες-κόμβους του ΔΠΑ, β) στον πλεονασμό κωδικών ταυτοποίησης των πρωτεϊνών στα διάφορα επίπεδα αναφοράς της γονιδιακής πληροφορίας, γ) στην ετερογένεια του τρόπου κανονικοποίησης των κωδικών ταυτοποίησης πρωτεϊνών, δ) στην υστέρηση της ανανέωσής τους σε σχέση με τις πρωτογενείς βάσεις και ε) στην επιλογή των δεδομένων που καταγράφονται από τις ΒΔΠΑ.
Ο σκοπός αυτής της εργασίας είναι ο σχεδιασμός και η ανάπτυξη μιας μετα-βάσης δεδομένων για το δίκτυο πρωτεϊνικών αλληλεπιδράσεων στον άνθρωπο, PICKLE, που να προσφέρει επαρκείς λύσεις στα προβλήματα αυτά. Η μεγάλη διαφορά σε σχέση με τις υπάρχουσες μετα-βάσεις είναι ο ορισμός του ΔΠΑ με βάση το αξιολογημένο πλήρες ανθρώπινο πρωτεϊνωμα (Reviewed complete Human Proteome), όπως αυτό ορίζεται από τη βάση δεδομένων γνώσης πρωτεϊνικής πληροφορίας UniProt ΚΒ. Για τις πρωτεΐνες αυτές αναζητήθηκε η σχετική πληροφορία αλληλεπιδράσεων στις πέντε κύριες δημόσιες βάσεις πρωτεϊνικών αλληλεπιδράσεων στον άνθρωπο, DIP, HPRD, IntAct, MINT και BioGRID. Τα προβλήματα του πλεονασμού και της κανονικοποίησης λύθηκαν μέσω της ανάπτυξης μίας κατάλληλης γονιδιακής οντολογίας, η οποία μας επέτρεψε να συνδέσουμε το πλήρες ανθρώπινο πρωτεϊνωμα με τα υπόλοιπα επίπεδα αναφοράς της γενετικής πληροφορίας, δρώντας παράλληλα ως ένας ευέλικτος και ακριβής μηχανισμός κανονικοποίησης. Για τη γρήγορη ανανέωση των δεδομένων της μετα-βάσης, αναπτύχθηκε μια αυτοματοποιημένη διαδικασία σύνδεσης και ενημέρωσής της από τις PPIDBs. Η πρώτη έκδοση της PICKLE κατέγραψε 83720 αλληλεπιδράσεις για 12418 UNIPROT IDs από το σύνολο των 20225 του πλήρους ανθρώπινου πρωτεϊνωματος, που υποστηρίζονται από 27.590 δημοσιεύσεις. Η PICKLE θα εμπλουτιστεί με ένα φιλικό προς το χρήστη γραφικό περιβάλλον και θα συνδεθεί με εργαλεία ανάλυσης δικτύων και ομικών δεδομένων, για να αποτελέσει πολύτιμο εργαλείο σε βιοϊατρικές μελέτες και εφαρμογές. / The elucidation of the underlying relationship between an organism’s genotype and its expressed phenotype is currently one the greatest challenges faced by life sciences and biology in general. In order to achieve that, the better understanding of the inner structure and regulation mechanisms of the protein-protein interaction (PPI) networks is of great importance. The first step towards that goal is the detailed and accurate reconstruction of the PPI network itself. The scientific literature is constantly being updated with new experimental results supporting PPI evidence, which in turn are fed into primary PPI databases (PPIDB) by the use of either curators or text mining algorithms. Currently there is a large number of PPIDB referring to the human PPIs. Since many of them have different goals, literature curation methods, and database administration strategies, it is not surprising that they also exhibit a limited PPI overlap and incompatible terminology for PPI intera\-ctors, i.e. use of arbitrary levels of genetic organization.
A number of meta-databases have been developed in order to achieve integrated overviews of PPI networks while circumventing the problems inherent in the field of primary PPI databases. Unfortunately, meta-databases have a number of issues of their own, such as: a) top-down network definition based on protein interactions instead of interactors, b) protein identifier redundancy in all levels of reference, c) the use of {\it ad hoc} normalization methods, d) infrequent updating and d) insufficient information stored.
The major goal of this thesis is the design and implementation of PICKLE (Protein Interaction Knowledge Base), a meta-database for the human PPI network created specifically to tackle the aforementioned problems. PICKLE’s novelty stems from its unique approach to PPI network definition, following a bottom-up reconstruction method based on UniProt’s reviewed complete human proteome (RCHP) definition. Five primary PPIDB (DIP, HPRD, IntAct, ΜΙΝΤ and BioGRID) were mined for interactions explicitly constrained by UniProt’s proteome definition. Furthermore, in order to tackle the issues of redundancy and inadequate normalization, a specific ontology was designed which allowed linking of the RCHP set with all the other levels of genetic organization while also serving as an agile yet accurate normaliza\-tion mechanism. In order to address the issue of updating, an autonomous means of data collection and integration was developed. PICKLE’s maiden release recorded 83720 direct PPIs involving 12418 UniProt IDs (out of 20225) supported by a total of 27590 publications. PICKLE, an evolving valuable bioinformatics for biomedical research and red biotechnology applications tool will soon be updated with a user-friendly interface and upgraded by linking it with network analysis software and various omics datasets.
|
8 |
Síntesis de zeolitas mediante agentes directores de estructura usando procesamiento de datos masivos (Big Data)León Rubio, Santiago 06 November 2023 (has links)
Tesis por compendio / [ES] Las zeolitas son un material de aluminiosilicatos cristalinos microporosos extensivamente utilizados como catalizadores y tamices moleculares involucrados en procesos de separación. La mayoría de estos materiales son sintéticos, obtenidos en laboratorio mediante un proceso hidrotermal y barajando gran cantidad de variables como: relación sílice/agua, temperatura, tiempo, agitación y composición química. Cuando en la síntesis se introducen ciertas moléculas orgánicas, llamadas agentes directores de estructura, es más fácil entender y seleccionar moléculas específicas para dirigir la síntesis hacia una zeolita en particular. La situación ideal sería que cada agente director de estructura condujese la síntesis a una única zeolita, lo cual es poco probable que suceda, ya que, otros términos energéticos también juegan un papel importante, en particular el flúor y el aluminio.
En esta tesis doctoral será acometido el estudio de estos tres factores: agente director de estructura, flúor y aluminio, además de su papel en la síntesis de zeolitas desde un enfoque químico-computacional. Proponiendo agentes directores de estructura más precisos para la síntesis de zeolitas, siendo sintetizadas de manera alternativa y/o más sostenible. / [CA] Les zeolites són un material d'aluminosilicats cristal·lins microporosos extensivament utilitzats com a catalitzadors i tamisos moleculars involucrats en processos de separació. La majoria d'aquests materials són sintètics, obtinguts en laboratori mitjançant un procés hidrotermal el qual presenta una gran quantitat de variables com: relació silici/aigua, temperatura, temps; agitació i composició química. Quan a la síntesi s'introdueixen unes certes molècules orgàniques, anomenades agents directors d'estructura, és més fàcil entendre i seleccionar molècules específiques per dirigir la síntesi cap a una zeolita en particular. La situació ideal seria que cada agent director d'estructura conduïra la síntesi a una única zeolita, la qual cosa és poc probable que succeïsca, ja que, altres termes energètics també juguen un paper important, en particular el fluor i l'alumini. En aquesta tesi doctoral es portarà a terme l'estudi d'aquests tres factors: agent director d'estructura, fluor i alumini, a més del seu paper en la síntesi de zeolites des d'un enfocament químic-computacional. Proposant agents directors d'estructura més precisos per a la síntesi de zeolites, sent sintetitzades de manera alternativa i/o més sostenible. / [EN] Zeolites are a microporous crystalline aluminosilicate material extensively used as catalysts and molecular sieves involved in separation processes. Most of these materials are synthetic, obtained in the laboratory by means of a hydrothermal process, and by shuffling a large number of variables such as silica/water ratio, temperature, time, agitation, and chemical composition. When certain organic molecules, called structure-directing agents, are introduced in the synthesis, it is easier to understand and select specific molecules to direct the synthesis towards a particular zeolite. The ideal situation would be for each structure-directing agent to drive the synthesis to a single zeolite, which is unlikely to happen, since, other energetic terms also play an important role, in particular fluorine and aluminum.
In this doctoral thesis, the study of these three factors: structure directing agent, fluorine, and aluminum, and their role in zeolite synthesis will be undertaken from a chemical-computational approach. Proposing more precise structure-directing agents for the synthesis of zeolites, being synthesized in an alternative and/or more sustainable way. / León Rubio, S. (2023). Síntesis de zeolitas mediante agentes directores de estructura usando procesamiento de datos masivos (Big Data) [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/199280 / Compendio
|
9 |
Desenvolvimento de uma plataforma de bioinformática integrada aplicada a identificação molecular de microrganismos patogênicosSarmento, Felipe José de Queiroz 27 February 2013 (has links)
Submitted by Leonardo Cavalcante (leo.ocavalcante@gmail.com) on 2018-07-17T18:21:26Z
No. of bitstreams: 1
Arquivototal.pdf: 16322215 bytes, checksum: c172a5636f12cf8195f2382f1c23de59 (MD5) / Made available in DSpace on 2018-07-17T18:21:26Z (GMT). No. of bitstreams: 1
Arquivototal.pdf: 16322215 bytes, checksum: c172a5636f12cf8195f2382f1c23de59 (MD5)
Previous issue date: 2013-02-27 / Conselho Nacional de Pesquisa e Desenvolvimento Científico e Tecnológico - CNPq / Various researches in molecular epidemiology, molecular diagnosis and evolutionary genetics related
to pathogens are compared to managing large amounts of data derived from institutions
such as, hospitals or laboratories. Although there already are some proposals to connect molecular
information to the diagnosis of pathogens, none of them uses high performance bioinformatics
tools which are embedded in a system and linked to a patient’s electronic record. The MolEpi
tool has been developed as a system of data and information management addressed to public
health, incorporating clinical and epidemiological information about patients, as well as molecular
data of 16S rRNA sequences of pathogenic bacteria. In order to confirm which species of these
bacteria were identified, biological samples (urine, secretions and purulent wounds, tracheal aspirate
and blood) and subsequently incubation and growth of colonies in culture, and PCR was
used followed by sequencing and analysis of the conserved coding region for 16S ribosomal RNA
(rDNA). Such strategy enabled fast bacterial identification, regardless of prior knowledge of the
species of microorganism under study. Moreover MolEpi is a system interconnected to repositories
of specific sequences as Genbank (NCBI), RDP-II (Ribosomal Database Project - MSU) and
GreenGene (LBL). In this way, once the sequences of clinical isolates are confirmed and validated,
they can be used as reference in the identification of other unknown microorganisms. Thus, a
local database was established, representing the profile of pathogens found in the hospital unity
of study and which should be object of public health surveillance. In order to develop MolEpi,
we used the Java programming language and the PostgreSQL8.3 object-relational database. It
was also developed BACSearch, which has the following programs to handle the analysis of 16S
rDNA sequences, we used the framework BioJava; to multiple alignment, ClustalW2, MAFFT and
MUSCLE, and for editing of multiple alignment and phylogenetic analysis, the JalView2.4.0 was
used. The system was validated with 200 clinical specimens isolated and identified from sites of
nosocomial infection. The DNA sequences produced from these samples were subjected to BLAST
by using the developed tool, which identified Pseudomonas aeruginosa, Acinetobacter baumannii,
Klebsiella pneumoniae and Morganella morganii as the main pathogens involved. Data on resistance
patterns of the species were obtained in microbiology laboratory, and incorporated into the
database. The application of MolEpi tool to the Health System can provide prompt and accurate
diagnosis, connected to relevant network information which can be intended for health professionals. / A maioria das pesquisas em epidemiologia molecular, diagnóstico molecular e genética evolutiva
são confrontadas com o gerenciamento de grandes volumes de dados. Além disso, os dados utilizados
em estudos de doenças patogênicas são complexos e geralmente derivam de instituições
tais como hospitais ou laboratórios. Embora já existam propostas que conecte informações moleculares
ao diagnóstico de patogenias, nenhuma delas utilizam ferramentas de bioinformática de
alto desempenho incorporadas a um sistema e vinculada a um prontuário eletrônico do paciente.
MolEpi foi desenvolvido como um sistema de gerenciamento de dados e informações dimensionado
a saúde pública, incorporando informações clínicas e epidemiológicas sobre pacientes e dados
moleculares de sequências do gene rRNA 16S de bactérias patogênicas. Para identificação destas
bactérias foram utilizadas amostras biológicas (urina, secreções e purulentas de feridas, aspirado
traqueal e sangue) e PCR seguida de sequenciamento e análise da região conservada codificadora
de RNA ribossômico (rDNA) 16S. Este estratégia permite uma identificação bacteriana rápida,
independente de conhecimento prévio da espécie de microrganismo em estudo. O MolEpi é um sistema
facilmente atualizável com as sequências específicas de bancos como Genbank(NCBI), RDP-II
(Ribosomal Database Project - MSU) e GreenGene (LBL). A partir da confirmação e validação
das sequências dos isolados clínicos, estas podem ser utilizadas como referência na identificação
de outros microrganismos desconhecidos. Neste sentido, foi estabelecido um banco de dados local,
representativo do perfil de patógenos encontrados na unidade hospitalar de estudo e objeto
de vigilância epidemiológica. Para o desenvolvimento do MolEpi, utilizamos a linguagem Java e
banco de dados PostgreSQL8.3. Foi desenvolvido também o BACSearch, que possui os seguintes
programas: para o processamento de sequências de rDNA 16S utilizamos os frameworks BioJava;
para alinhamento múltiplo foi implementado o ClustalW2, MAFFT e o MUSCLE e para edição do
alinhamento múltiplo e análise filogenética foi utilizado JalView R⃝2.4.0b2. O sistema foi validado
com 200 espécimes clínicos identificadas e isoladas de sítios de infecção hospitalar. As sequências
de DNA produzidas a partir destas amostras foram submetidas ao BLAST, utilizando a ferramenta
desenvolvida, identificando Pseudomonas aeruginosa, Acinetobacter baumannii, Klebsiela
pneumonie e Staphylococcus aureus como os principais patógenos correspondentes. Os dados sobre
o padrão de resistência das espécies foram obtidos em laboratório de microbiologia e incorporados
ao banco de dados. A aplicação do MolEpi ao Sistema Único de Saúde poderá fornecer diagnósticos
mais rápidos, precisos, e interligados a uma rede de informações relevantes para o profissional de
saúde.
|
10 |
Gerenciamento de anotações de biosseqüências utilizando associações entre ontologias e esquemas XMLTeixeira, Marcus Vinícius Carneiro 26 May 2008 (has links)
Made available in DSpace on 2016-06-02T19:05:31Z (GMT). No. of bitstreams: 1
2080.pdf: 1369419 bytes, checksum: 4100f6c7c0400bc50f4f2f9a28621613 (MD5)
Previous issue date: 2008-05-26 / Universidade Federal de Sao Carlos / Bioinformatics aims at providing computational tools to the development of genome researches. Among those tools are the annotations systems and the Database Management Systems (DBMS) that, associated to ontologies, allow the formalization of both domain conceptual and the data scheme. The data yielded by genome researches are often textual and with no regular structures and also requires scheme evolution. Due to these aspects, semi-structured DBMS might offer great potential to manipulate those data. Thus, this work presents architecture for biosequence annotation based on XML databases. Considering this architecture, a special attention was given to the database design and also to the manual annotation task performed by researchers. Hence, this architecture presents an interface that uses an ontology-driven model for XML schemas modeling and generation, and also a manual annotation interface prototype that uses molecular biology domain ontologies, such as Gene Ontology and Sequence Ontology. These interfaces were proven by Bioinformatics and Database experienced users, who answered questionnaires to evaluate them. The answers presented good assessments to issues like utility and speeding up the database design. The proposed architecture aims at extending and improving the Bio-TIM, an annotation system developed by the Database Group from the Computer Science Department of the Federal University from São Carlos (UFSCar). / A Bioinformática é uma área da ciência que visa suprir pesquisas de genomas com ferramentas computacionais que permitam o seu desenvolvimento tecnológico. Dentre essas ferramentas estão os ambientes de anotação e os Sistemas
Gerenciadores de Bancos de Dados (SGBDs) que, associados a ontologias, permitem a formalização de conceitos do domínio e também dos esquemas de dados. Os dados produzidos em projetos genoma são geralmente textuais e sem uma estrutura de tipo regular, além de requerer evolução de esquemas. Por suas características, SGBDs semi-estruturados oferecem enorme potencial para tratar tais dados. Assim, este
trabalho propõe uma arquitetura para um ambiente de anotação de biosseqüências baseada na persistência dos dados anotados em bancos de dados XML. Neste trabalho, priorizou-se o projeto de bancos de dados e também o apoio à anotação manual realizada por pesquisadores. Assim, foi desenvolvida uma interface que utiliza ontologias para guiar a modelagem de dados e a geração de esquemas XML. Adicionalmente, um protótipo de interface de anotação manual foi desenvolvido, o qual faz uso de ontologias do domínio de biologia molecular, como a Gene Ontology e a Sequence Ontology. Essas interfaces foram testadas por usuários com experiências nas áreas de Bioinformática e Banco de Dados, os quais responderam a questionários para avaliá-las. O resultado apresentou qualificações muito boas em
diversos quesitos avaliados, como exemplo agilidade e utilidade das ferramentas. A arquitetura proposta visa estender e aperfeiçoar o ambiente de anotação Bio-TIM,
desenvolvido pelo grupo de Banco de Dados do Departamento de Computação da Universidade Federal de São Carlos (UFSCar).
|
Page generated in 0.112 seconds