911

A study of the influence of choice of record fields on retrieval performance

Kim, Heesop January 2002 (has links)
No description available.
912

Integrating Fuzzy Decisioning Models With Relational Database Constructs

Durham, Erin-Elizabeth A 18 December 2014 (has links)
Human learning and classification are nebulous areas in computer science. Classic decisioning problems can be solved given enough time and computational power, but discrete algorithms cannot easily solve fuzzy problems. Fuzzy decisioning can resolve more real-world fuzzy problems, but existing algorithms are often slow, cumbersome and unable to give responses within a reasonable timeframe to anything other than predetermined, small-dataset problems. We have developed a database-integrated, highly scalable solution for training and using fuzzy decision models on large datasets. The Fuzzy Decision Tree algorithm integrates Quinlan's ID3 decision-tree algorithm with fuzzy set theory and fuzzy logic. In existing research, when applied to the microRNA prediction problem, the Fuzzy Decision Tree outperformed other machine learning algorithms including Random Forest, C4.5, SVM and k-NN. In this research, we propose that large-dataset fuzzy decisions can be resolved significantly more effectively via the Fuzzy Decision Tree algorithm when a relational database, rather than traditional storage objects, is used as the storage unit for the fuzzy ID3 objects. Furthermore, it is demonstrated that pre-processing certain pieces of the decisioning within the database layer can lead to much swifter membership determinations, especially on Big Data datasets. The proposed algorithm uses concepts inherent to databases (separate schemas, indexing, partitioning, pipe-and-filter transformations, data preprocessing, and materialized and regular views) to present a model with the potential to learn from itself. Further, this work presents a general application model for re-architecting Big Data applications to present decisioned results efficiently: it lowers the volume of data handled by the application itself and significantly decreases response wait times, while retaining the flexibility and permanence of a standard relational SQL database and improving user satisfaction. We experimentally demonstrate the effectiveness of our approach.
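
The core idea above, precomputing fuzzy memberships inside the database so that fuzzy-ID3 split statistics reduce to SQL aggregation, can be illustrated with a minimal sketch. This is an assumed simplification, not the thesis implementation: the triangular membership functions, the table layout and the LOW/HIGH fuzzy sets are illustrative choices.

    # Minimal sketch: materialize fuzzy memberships in SQLite once, then compute
    # the per-fuzzy-set class cardinalities (inputs to a fuzzy information gain)
    # with plain SQL aggregates. Table/column names are illustrative assumptions.
    import math
    import sqlite3

    def triangular(x, a, b, c):
        """Triangular membership function peaking at b over the interval [a, c]."""
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    def fuzzy_entropy(cardinalities):
        """Entropy over fuzzy class cardinalities (sums of memberships per class)."""
        total = sum(cardinalities.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in cardinalities.values() if c > 0)

    rows = [(2.1, 1), (3.5, 0), (4.8, 1), (6.0, 0), (7.2, 1)]  # (feature value, class label)

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE memberships (low REAL, high REAL, label INTEGER)")
    for value, label in rows:
        # Preprocessing step: store memberships for fuzzy sets LOW and HIGH once,
        # so tree-building queries only aggregate and never recompute them.
        con.execute("INSERT INTO memberships VALUES (?, ?, ?)",
                    (triangular(value, 0, 2, 5), triangular(value, 3, 7, 10), label))

    for fuzzy_set in ("low", "high"):
        counts = dict(con.execute(
            f"SELECT label, SUM({fuzzy_set}) FROM memberships GROUP BY label").fetchall())
        print(fuzzy_set, "fuzzy entropy =", round(fuzzy_entropy(counts), 3))
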
913

Second-tier Cache Management to Support DBMS Workloads

Li, Xuhui 16 September 2011 (has links)
Enterprise Database Management Systems (DBMS) often run on computers with dedicated storage systems. Their data access requests need to go through two tiers of cache, i.e., a database bufferpool and a storage server cache, before reaching the storage media, e.g., disk platters. A tremendous amount of work has been done to improve the performance of the first-tier cache, i.e., the database bufferpool. However, the amount of work focusing on second-tier cache management to support DBMS workloads is comparably small. In this thesis we propose several novel techniques for managing second-tier caches to boost DBMS performance in terms of query throughput and query response time. The main purpose of second-tier cache management is to reduce the I/O latency endured by database query executions. This goal can be achieved by minimizing the number of reads and writes issued from second-tier caches to storage devices. The first part of our research focuses on reducing the number of read I/Os issued by second-tier caches. We observe that DBMSs issue I/O requests for various reasons. The rationales behind these I/O requests provide useful information to second-tier caches because they can be used to estimate the temporal locality of the data blocks being requested. A second-tier cache can exploit this information when making replacement decisions. In this thesis we propose a technique to pass this information from DBMSs to second-tier caches and to use it in guiding cache replacements. The second part of this thesis focuses on reducing the number of writes issued by second-tier caches. Our work here is twofold. First, we observe that although there are second-tier caches within computer systems, today's DBMSs cannot take full advantage of them. For example, most commercial DBMSs use forced writes to propagate bufferpool updates to permanent storage for data durability reasons. We notice that enforcing such a practice is more conservative than necessary. Some of the writes can be issued as unforced requests and can be cached in the second-tier cache without immediate synchronization. This gives the second-tier cache opportunities to cache and consolidate multiple writes into one request. Unfortunately, the current POSIX-compliant file system interfaces provided by mainstream operating systems (e.g., Unix and Windows) are not flexible enough to support such dynamic synchronization. We propose to extend such interfaces to let DBMSs take advantage of unforced writes whenever possible. Second, we observe that existing cache replacement algorithms are designed solely to maximize read cache hits (i.e., to minimize read I/Os). The purpose is to minimize the read latency, which is on the critical path of query executions. We argue that minimizing read requests is not the only objective of cache replacement. When I/O bandwidth becomes a bottleneck, the objective should be to minimize the total number of I/Os, including both reads and writes, to achieve the best performance. We propose to associate a new type of replacement cost, i.e., the total number of I/Os caused by the replacement, with each cache page, and we also present a partial characterization of an optimal algorithm which minimizes the total number of I/Os generated by caches. Based on this knowledge, we extend several existing replacement algorithms, which are write-oblivious (they focus only on reducing reads), to be write-aware, and observe promising performance gains in the evaluations.
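
As a rough illustration of the write-aware replacement idea, the sketch below charges a dirty page an extra write-back I/O when choosing an eviction victim, so the victim minimizes estimated total I/Os rather than read misses alone. It is a simplified toy under assumed costs, not the algorithms or cost model developed in the thesis.

    # Toy write-aware cache: the eviction victim is the page with the lowest
    # estimated total I/O cost (future re-read plus write-back if dirty).
    import time

    class WriteAwareCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.pages = {}  # page_id -> {"dirty": bool, "last_access": float}

        def _eviction_cost(self, page_id):
            meta = self.pages[page_id]
            refetch_cost = 1.0                               # read I/O if requested again
            writeback_cost = 1.0 if meta["dirty"] else 0.0   # extra write I/O for dirty pages
            age = time.monotonic() - meta["last_access"]     # tie-breaker: colder pages cost less
            return refetch_cost + writeback_cost - 0.001 * age

        def access(self, page_id, dirty=False):
            if page_id not in self.pages and len(self.pages) >= self.capacity:
                victim = min(self.pages, key=self._eviction_cost)  # cheapest page to evict
                del self.pages[victim]
            entry = self.pages.setdefault(page_id, {"dirty": False, "last_access": 0.0})
            entry["dirty"] = entry["dirty"] or dirty
            entry["last_access"] = time.monotonic()

    cache = WriteAwareCache(capacity=2)
    for pid, dirty in [(1, False), (2, True), (3, False)]:
        cache.access(pid, dirty)
    print(sorted(cache.pages))  # the clean page was evicted; the dirty page is retained
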
914

Query Optimization for On-Demand Information Extraction Tasks over Text Databases

Farid, Mina H. 12 March 2012 (has links)
Many modern applications involve analyzing large amounts of data that come from unstructured text documents. In its original format, this data contains information that, if extracted, can give more insight and help in the decision-making process. The ability to answer structured SQL queries over unstructured data allows for more complex data analysis. Querying unstructured data can be accomplished with the help of information extraction (IE) techniques. The traditional way is to use the Extract-Transform-Load (ETL) approach, which performs all possible extractions over the document corpus and stores the extracted relational results in a data warehouse. Then, the extracted data is queried. The ETL approach produces results that are out of date and causes an explosion in the number of possible relations and attributes to extract. Therefore, new approaches to perform extraction on-the-fly were developed; however, previous efforts relied on specialized extraction operators or particular IE algorithms, which limited the optimization opportunities of such queries. In this work, we propose an on-line approach that integrates the engine of the database management system with IE systems using a new type of view called extraction views. Queries on text documents are evaluated using these extraction views, which get populated at query-time with newly extracted data. Our approach enables the optimizer to apply all well-defined optimization techniques. The optimizer selects the best execution plan using a defined cost model that considers a user-defined balance between the cost and quality of extraction, and we explain the trade-off between the two factors. The main contribution is the ability to run on-demand information extraction to consider the latest changes in the data, while avoiding unnecessary extraction from irrelevant text documents.
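
A toy sketch of the extraction-view idea follows. The document set, the keyword filter and the regex "extractor" are illustrative stand-ins rather than the system described in the thesis; the point is only that extraction runs at query time and only over documents that can contribute to the answer.

    # Lazy "extraction view": extract at query time, only from relevant documents,
    # and reuse previously materialized extractions.
    import re
    import sqlite3

    docs = {
        1: "Contact Alice at alice@example.com about the merger.",
        2: "Quarterly revenue grew 12% year over year.",
        3: "Bob (bob@example.com) will send the contract.",
    }

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE email_view (doc_id INTEGER, email TEXT)")
    extracted_docs = set()  # doc_ids whose extractions are already materialized

    def query_emails(keyword):
        """Answer 'emails in documents mentioning <keyword>', extracting lazily."""
        relevant = [doc_id for doc_id, text in docs.items() if keyword in text]  # crude filter
        if not relevant:
            return []
        for doc_id in relevant:
            if doc_id not in extracted_docs:           # extract only once, only when needed
                for email in re.findall(r"[\w.]+@[\w.]+", docs[doc_id]):  # stand-in IE operator
                    con.execute("INSERT INTO email_view VALUES (?, ?)", (doc_id, email))
                extracted_docs.add(doc_id)
        marks = ",".join("?" * len(relevant))
        return con.execute(
            f"SELECT doc_id, email FROM email_view WHERE doc_id IN ({marks})", relevant
        ).fetchall()

    print(query_emails("contract"))  # only doc 3 is extracted; docs 1 and 2 are untouched
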
915

Nonribosomal Peptide Identification with Tandem Mass Spectrometry by Searching Structural Database

Yang, Lian 19 April 2012 (has links)
Nonribosomal peptides (NRP) are highlighted in pharmacological studies as novel NRPs are often promising substances for new drug development. To effectively discover novel NRPs from microbial fermentations, a crucial step is to identify known NRPs in an early stage and exclude them from further investigation. This so-called dereplication step ensures the scarce resource is only spent on the novel NRPs in the following up experiments. Tandem mass spectrometry has been routinely used for NRP dereplication. However, few bioinformatics tools have been developed to computationally identify NRP compounds from mass spectra, while manual identification is currently the roadblock hindering the throughput of novel NRP discovery. In this thesis, we review the nature of nonribosomal peptides and investigate the challenges in computationally solving the identification problem. After that, iSNAP software is proposed as an automated and high throughput solution for tandem mass spectrometry based NRP identification. The algorithm has been evolved from the traditional database search approach for identifying sequential peptides, to one that is competent at handling complicated NRP structures. It is designed to be capable of identifying mixtures of NRP compounds from LC-MS/MS of complex extract, and also finding structural analogs which differ from an identified known NRP compound with one monomer. Combined with an in-house NRP structural database of 1107 compounds, iSNAP is tested to be an effective tool for mass spectrometry based NRP identification. The software is available as a web service at http://monod.uwaterloo.ca/isnap for the research community.
916

Investigating the Process of Developing a KDD Model for the Classification of Cases with Cardiovascular Disease Based on a Canadian Database

Liu, Chenyu January 2012 (has links)
Medicine and health are information-intensive domains in which data volumes are constantly increasing. In order to make full use of the data, the technique of Knowledge Discovery in Databases (KDD) has been developed as a comprehensive pathway to discover valid and unsuspected patterns and trends that are both understandable and useful to data analysts. The present study aimed to investigate the entire KDD process of developing a classification model for cardiovascular disease (CVD) from a Canadian dataset for the first time. The research data source was the Canadian Heart Health Database, which contains 265 easily collected variables and 23,129 instances from ten Canadian provinces. Many practical issues involved in different steps of the integrated process were addressed, and possible solutions were suggested based on the experimental results. Five specific learning schemes representing five distinct KDD approaches were employed, as they had never been compared with one another. In addition, two improvement approaches, cost-sensitive learning and ensemble learning, were also examined. The performance of the developed models was measured in several respects. The dataset was prepared through data cleaning and missing value imputation. Three pairs of experiments demonstrated that dataset balancing and outlier removal exerted a positive influence on the classifier, but variable normalization was not helpful. Three combinations of subset generation method and evaluation function were tested in the variable subset selection phase; the combination of Best-First search and Correlation-based Feature Selection showed comparable performance and was retained for its other benefits. Among the five learning schemes investigated, the C4.5 decision tree achieved the best performance on the classification of CVD, followed by Multilayer Feed-forward Network, K-Nearest Neighbor, Logistic Regression, and Naïve Bayes. Cost-sensitive learning, exemplified by the MetaCost algorithm, failed to outperform the single C4.5 decision tree when the cost matrix was varied from 5:1 to 1:7. In contrast, the models developed through ensemble modeling, especially the AdaBoost M1 algorithm, outperformed the other models. Although the model with the best performance might be suitable for CVD screening in the general Canadian population, it is not ready for use in practice. I propose some criteria to improve the further evaluation of the model. Finally, I describe some of the limitations of the study and propose potential solutions to address them throughout the KDD process. Such possibilities should be explored in further research.
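
The single-tree-versus-ensemble comparison at the heart of the modeling step can be sketched as below. This is a generic illustration with scikit-learn: CART (DecisionTreeClassifier) and the default AdaBoostClassifier stand in for C4.5 and AdaBoost M1, and a synthetic imbalanced dataset stands in for the Canadian Heart Health Database, which is not bundled here.

    # Compare a single decision tree against a boosted ensemble on an imbalanced
    # binary classification task, using cross-validated ROC AUC.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for the cleaned, imputed CVD dataset (80/20 class balance).
    X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                               weights=[0.8, 0.2], random_state=0)

    models = {
        "single decision tree": DecisionTreeClassifier(max_depth=6, random_state=0),
        "AdaBoost ensemble": AdaBoostClassifier(n_estimators=100, random_state=0),
    }
    for name, model in models.items():
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{name}: mean ROC AUC = {auc:.3f}")  # compare single model vs. ensemble
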
917

Utilizing the Canadian Long-Term Pavement Performance (C-LTPP) Database for Asphalt Dynamic Modulus Prediction

Korczak, Richard January 2013 (has links)
In 2007, the Mechanistic-Empirical Pavement Design Guide (MEPDG) was successfully approved as the new American Association of State Highway and Transportation Officials (AASHTO) pavement design standard (Von Quintus et al., 2007). Calibration and validation of the MEPDG are currently in progress in several provinces across Canada. The MEPDG will be used as the standard pavement design methodology for the foreseeable future (Tighe, 2013). This new pavement design process requires several parameters specific to local conditions of the design location. In order to perform an accurate analysis, a database of parameters, including those specific to local materials, climate and traffic, is required to calibrate the models in the MEPDG. In 1989, the Canadian Strategic Highway Research Program (C-SHRP) launched a national full-scale field experiment known as the Canadian Long-Term Pavement Performance (C-LTPP) program. Between 1989 and 1992, a total of 24 test sites were constructed across all ten provinces. Each test site contained multiple monitored sections, for a total of 65 sections. Each of these sites received rehabilitation treatments of various thicknesses of asphalt overlays. The C-LTPP program attempted to design and build the test sections across Canada so as to cover the widest range of experimental factors such as traffic loading, environmental region, and subgrade type. With planned strategic pavement data collection cycles, it would then be possible to compare results obtained at different test sites (i.e. across traffic levels, environmental zones, soil types) across the country. The United States Long-Term Pavement Performance (US-LTPP) database is serving as a critical tool in implementing the new design guide. The MEPDG was delivered with the prediction models calibrated to average national conditions. For the guide to be an effective resource for individual agencies, the national models need to be evaluated against local and regional performance. The results of these evaluations are being used to determine if local calibration is required. It is expected that provincial agencies across Canada will use both C-LTPP and US-LTPP test sites for these evaluations. In addition, C-LTPP and US-LTPP sites provide typical values for many of the MEPDG inputs (C-SHRP, 2000). The scope of this thesis is to examine the existing data in the C-LTPP database and assess its relevance to Canadian MEPDG calibration. Specifically, the thesis examines the dynamic modulus parameter (|E*|) and how it can be computed using existing C-LTPP data and an Artificial Neural Network (ANN) model developed under a Federal Highway Administration (FHWA) study (FHWA, 2011). The dynamic modulus is an essential property that defines the stiffness characteristics of a Hot Mix Asphalt (HMA) mixture as a function of both its temperature and rate of loading. |E*| is also a primary material property input required for a Level 1 analysis in the MEPDG. In order to perform a Level 1 MEPDG analysis, detailed local material, environmental and traffic parameters are required for the pavement section being analyzed. Additionally, |E*| can be used in various pavement response models based on viscoelasticity. The dynamic modulus values predicted using both Level 2 and Level 3 viscosity-based ANN models in the ANNACAP software showed a good correlation to the measured dynamic modulus values for two C-LTPP test sections and supplementary Ontario mixes. These findings support previous research findings made during the development of the ANN models. The viscosity-based prediction model requires the least amount of data to run a prediction. A Level 2 analysis requires mix volumetric data as well as viscosity testing, and a Level 3 analysis requires only the PG grade of the binder used in the HMA. The ANN models can be used as an alternative to the MEPDG default predictions (Level 3 analysis) and to develop the master curves and determine the parameters needed for a Level 1 MEPDG analysis. In summary, both the Level 2 and Level 3 viscosity-based model results demonstrated strong correlations with measured values, indicating that either would be a suitable alternative to dynamic modulus laboratory testing. The new MEPDG design methodology is the future of pavement design and research in North America. Current MEPDG analysis practices across the country use default inputs for the dynamic modulus. However, dynamic modulus laboratory characterization of asphalt mixes across Canada is time consuming and not very cost-effective. This thesis has shown that Level 2 and Level 3 viscosity-based ANN predictions can be used to perform a Level 1 MEPDG analysis. Further development and use of ANN models in dynamic modulus prediction has the potential to provide many benefits.
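
For readers unfamiliar with this kind of model, the sketch below fits a small feed-forward network mapping mix and loading inputs to log dynamic modulus. Everything in it is a placeholder: the input variables, the synthetic response surface and the network size are assumptions for illustration, not the FHWA ANNACAP models or C-LTPP data.

    # Generic ANN regression stand-in: predict log|E*| from mix volumetrics,
    # binder viscosity, temperature and loading frequency (all synthetic).
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    n = 500
    # Placeholder inputs: [air voids %, effective binder %, log viscosity, temp C, log frequency]
    X = rng.uniform([3, 9, 2.0, -10, -1], [8, 14, 4.5, 40, 2], size=(n, 5))
    # Placeholder response: a made-up monotone relation standing in for measured log|E*|.
    log_E = 4.5 + 0.4 * X[:, 2] - 0.03 * X[:, 3] + 0.3 * X[:, 4] - 0.05 * X[:, 0]
    y = log_E + rng.normal(0, 0.05, n)

    model = make_pipeline(StandardScaler(),
                          MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=2000,
                                       random_state=0))
    model.fit(X, y)
    print("R^2 on training data:", round(model.score(X, y), 3))
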
918

Elasca: Workload-Aware Elastic Scalability for Partition Based Database Systems

Rafiq, Taha January 2013 (has links)
Providing the ability to increase or decrease allocated resources on demand as the transactional load varies is essential for database management systems (DBMS) deployed on today's computing platforms, such as the cloud. The need to maintain consistency of the database, at very large scales, while providing high performance and reliability makes elasticity particularly challenging. In this thesis, we exploit data partitioning as a way to provide elastic DBMS scalability. We assert that the flexibility provided by a partitioned, shared-nothing parallel DBMS can be used to implement elasticity. Our idea is to start with a small number of servers that manage all the partitions, and to elastically scale out by dynamically adding new servers and redistributing database partitions among these servers as the load varies. Implementing this approach requires (a) efficient mechanisms for addition/removal of servers and migration of partitions, and (b) policies to efficiently determine the optimal placement of partitions on the given servers as well as plans for partition migration. This thesis presents Elasca, a system that implements both these features in an existing shared-nothing DBMS (namely VoltDB) to provide automatic elastic scalability. Elasca consists of a mechanism for enabling elastic scalability, and a workload-aware optimizer for determining optimal partition placement and migration plans. Our optimizer minimizes the computing resources required and balances load effectively without compromising system performance, even in the presence of variations in the intensity and skew of the load. The results of our experiments show that Elasca is able to achieve performance close to that of a fully provisioned system while saving 35% of resources on average. Furthermore, Elasca's workload-aware optimizer performs up to 79% less data movement than a greedy approach to resource minimization, and also balances load much more effectively.
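
To give a feel for the placement problem, here is a deliberately naive sketch in the spirit of the greedy baseline mentioned above; it is not Elasca's workload-aware optimizer, and the partition loads and per-server capacity are made-up numbers. Partitions are assigned to the least-loaded server, and a new server is added only when a load cap would otherwise be exceeded.

    # Greedy partition placement with on-demand scale-out (illustrative only).
    partition_load = {"p1": 40, "p2": 25, "p3": 25, "p4": 10, "p5": 35, "p6": 15}
    SERVER_CAPACITY = 60  # assumed per-server load budget (arbitrary units)

    servers = [{"name": "s1", "partitions": [], "load": 0}]

    for part, load in sorted(partition_load.items(), key=lambda kv: -kv[1]):
        target = min(servers, key=lambda s: s["load"])       # least-loaded active server
        if target["load"] + load > SERVER_CAPACITY:           # scale out on demand
            target = {"name": f"s{len(servers) + 1}", "partitions": [], "load": 0}
            servers.append(target)
        target["partitions"].append(part)
        target["load"] += load

    for s in servers:
        print(s["name"], s["partitions"], "load =", s["load"])
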
919

Pilot Study for Quantifying LEED Energy & Atmosphere Operational Savings in Healthcare Facilities

Daniels, Patrick Rudolph August 2012 (has links)
Owner groups and Facility Managers of health care facilities interested in reducing operation and maintenance (O&M) expenses for new facilities have often been placed in the difficult position of making cost-benefit assessments without a complete understanding of the cumulative impact of building systems selection on their internal rate of return. This is particularly true when owners are evaluating the initial cost and operational benefit (if any) of obtaining various levels of "Leadership in Energy and Environmental Design" (LEED) certification for their buildings. Heating, Ventilation and Air Conditioning, and Lighting (HVAC&L) loads comprise 51% of the total energy demand in the typical outpatient facility; however, in order to estimate the likelihood of achieving a particular LEED rating for a new building, a "Whole Building Energy Simulation" is necessary to evaluate HVAC&L system performance. The convention of requiring a design upon which to base an analysis presents owner-operators attempting to perform a Lifecycle Cost Analysis (LCCA) early in the concept phase with two unique problems: how to estimate energy use without an actual "design" to model, and how to estimate a system's first cost without knowing its performance requirements. This study outlines a process by which estimates based on existing energy metrics from the Department of Energy (DOE), the Commercial Building Energy Consumption Survey (CBECS), and Energy Star can be made early in the developer's pro forma phase, without the need for a building design. Furthermore, preliminary business decisions aimed at determining the likelihood of obtaining a particular LEED rating, and at specifying the corresponding building systems, can be made without the cost of employing an Architect and Engineer (A&E) team or the time necessary to develop a design. This paper concludes that regional factors can dramatically affect a building's required level of energy performance, and that the highest-performing HVAC&L system, irrespective of cost, will not always provide the best return on investment. Accordingly, the national averages utilized to establish LEED EA1 thresholds do not reflect the cost particularities owners may encounter when developing in various climate zones, and therefore may be less relevant to lifecycle considerations than previously believed.
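
As a rough illustration of the kind of pre-design screening this process enables, the sketch below turns a benchmark energy use intensity into target EUIs and indicative annual savings for a few candidate performance levels. Every number in it is a hypothetical placeholder rather than an actual CBECS, Energy Star or LEED threshold value.

    # Back-of-envelope screening: benchmark EUI -> target EUIs -> rough annual savings.
    floor_area_sqft = 100_000            # planned outpatient facility size (assumed)
    benchmark_eui = 180.0                # assumed survey median, kBtu/sqft/yr
    blended_energy_cost = 0.03           # assumed $/kBtu

    for pct_better in (10, 20, 30):      # candidate performance targets vs. benchmark
        target_eui = benchmark_eui * (1 - pct_better / 100)
        annual_savings = floor_area_sqft * (benchmark_eui - target_eui) * blended_energy_cost
        print(f"{pct_better}% better than benchmark -> EUI {target_eui:.0f}, "
              f"~${annual_savings:,.0f}/yr saved")
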
920

Structural and functional studies of cyclotides

Wang, Conan Unknown Date (has links)
The broad aim of this thesis is to generate fundamental knowledge about the structure and function of cyclotides, which are a topologically unique family of proteins. A long-term goal is to use the fundamental knowledge to assist in the development of drugs based on the stable cyclotide framework. Cyclotides are small proteins that are characterised by a cyclic cystine knot (CCK) motif, which is defined as a circular backbone combined with a cystine knot core. So far cyclotides have been found in plants of the Violaceae (violet) and Rubiaceae (coffee) plant families, and are believed to have a defence-related function. From an application perspective, the CCK framework has potential as a drug scaffold, being an ultra-stable alternative to linear peptide models. The reasons why cyclotides show promise as a drug template are three-fold – they have naturally high sequence diversity, suggesting that their framework can accommodate a range of epitopes; they are remarkably stable under various chemical, enzymatic and thermal conditions, which means that they have increased bioavailability; and they have a diverse range of bioactivities, supporting the notion that they can be used in a number of therapeutic applications. These three reasons are intimately linked to three core knowledge domains of cyclotide research, namely cyclotide sequences, structures and interactions. Thus, fundamental research into these three domains, as investigated in this thesis, is important as it may assist in the development of drugs based on the CCK scaffold. Chapter 1 of this thesis provides the background information to define the molecules studied and to highlight their importance. Chapter 2 describes the main experimental techniques that were used in this thesis, including nuclear magnetic resonance spectroscopy and mass spectrometry. The development of the CCK technology may benefit from a thorough understanding of the natural diversity of cyclotide sequences and the significance of this diversity on activity. Chapter 3 reports on the discovery of cyclotides in Viola yedoensis, a Chinese violet that is interesting because it is widely used in Traditional Chinese Medicine to treat a number of illnesses including swelling and hepatitis. In this study, a total of eight cyclotides was characterised, including five novel sequences. Based on anti-HIV and haemolytic assays, a strong relationship between surface hydrophobicity and activity was established. The stability of cyclotides, which underpins their potential as a drug scaffold, is examined at a structural level in Chapter 4. The solution structure of varv F, a cyclotide from the European field pansy, Viola arvensis, was solved and compared to the crystal structure of the same peptide, confirming the core structural features of cyclotides responsible for their stability, including the topology of the cystine knot, which has previously attracted some debate. From a comparison of biophysical measurements of a representative group of five cyclotides, a conserved network of hydrogen bonds, which also stabilises the cyclotide framework, was defined. A subset of hydrogen bonds involving the highly conserved Glu in loop 1 of cyclotides was examined in more detail by solving the structure of kalata B12, the only naturally occurring cyclotide with an Asp instead of a Glu in loop 1. By comparison with the prototypical cyclotide kalata B1 and an Ala mutant E7A-kalata B1, it was shown that the highly conserved Glu is important for both stability and activity. 
Chapter 5 reports on studies that add to our understanding of the mechanism of action of cyclotides, which is believed to involve membrane interactions. Spin-label experiments were performed for two cyclotides, kalata B2 and cycloviolacin O2, which are representative cyclotides from the two cyclotide sub-families, Möbius and bracelet, respectively. This study showed that different cyclotides have different but very specific binding modes at the membrane surface. Currently, it is believed that for Möbius cyclotides at least (e.g. kalata B1 and kalata B2), self-association may lead to the formation of membrane pores. Oligomerisation of cyclotides was also studied in this chapter using NMR relaxation. A computer program, NMRdyn, was developed to extract microdynamic and self-association parameters from NMR relaxation data. This program was used to analyse 13C relaxation data on kalata B1, providing clues about the tetramer structure of kalata B1. Although the three areas of cyclotide research examined in this thesis – sequence, structure and interactions – are reported in separate sections, the areas are not independent of each other. For example, the mechanism of action of cyclotides, which is reported in Chapter 3, requires an understanding of cyclotide structures, which is reported in Chapter 4. Chapter 6 describes a database, CyBase, which integrates sequence/structure/activity data on cyclotides so that relationships between the three areas can be examined. The database also provides tools to assist in discovery and engineering of cyclic proteins. In summary, several key areas that are fundamental to our understanding of cyclotides have been investigated in this thesis, ranging from cyclotide sequence diversity to their mechanism of action. The work described in this thesis represents a significant advance in our current understanding of cyclotides by providing, for example, explanations to their observed structural stability and how they work through interactions with other biomolecules. The information presented in this thesis is potentially useful in facilitating the long-term goal of developing peptide therapeutics based on the stable cyclotide framework.
