1

Extracting Quantitative Information from Nonnumeric Marketing Data: An Augmented Latent Semantic Analysis Approach

Arroniz, Inigo 01 January 2007 (has links)
Despite the widespread availability and importance of nonnumeric data, marketers do not have the tools to extract information from large amounts of nonnumeric data. This dissertation attempts to fill this void: I developed a scalable methodology that is capable of extracting information from extremely large volumes of nonnumeric data. The proposed methodology integrates concepts from information retrieval and content analysis to analyze textual information. This approach avoids a pervasive difficulty of traditional content analysis, namely the classification of terms into predetermined categories, by creating a linear composite of all terms in the document and, then, weighting the terms according to their inferred meaning. In the proposed approach, meaning is inferred by the collocation of the term across all the texts in the corpus. It is assumed that there is a lower dimensional space of concepts that underlies word usage. The semantics of each word are inferred by identifying its various contexts in a document and across documents (i.e., in the corpus). After the semantic similarity space is inferred from the corpus, the words in each document are weighted to obtain their representation on the lower dimensional semantic similarity space, effectively mapping the terms to the concept space and ultimately creating a score that measures the concept of interest. I propose an empirical application of the outlined methodology. For this empirical illustration, I revisit an important marketing problem, the effect of movie critics on the performance of the movies. In the extant literature, researchers have used an overall numerical rating of the review to capture the content of the movie reviews. I contend that valuable information present in the textual materials remains uncovered. I use the proposed methodology to extract this information from the nonnumeric text contained in a movie review. 
The proposed setting is particularly attractive for validating the methodology because it allows a simple test of the text-derived metrics: they can be compared to the numeric ratings provided by the reviewers. I empirically show the application of this methodology and of traditional computer-aided content analytic methods to study an important marketing topic, the effect of movie critics on movie performance. In the empirical application of the proposed methodology, I use two datasets that combined contain more than 9,000 movie reviews nested in more than 250 movies. I restudy this marketing problem by obtaining information directly from the reviews instead of following the usual practice of using an overall rating or a classification of the review as either positive or negative. I find that the direct content and structure of the review add a significant amount of explanatory power as a determinant of movie performance, even in the presence of actual reviewer overall ratings (stars) and other controls. This effect is robust across distinct operationalizations of both the review content and the movie performance metrics. In fact, my findings suggest that as we move from sales to profitability to financial return measures, the role of the content of the review, and therefore the critic's role, becomes increasingly important.
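The pipeline described above, inferring a low-dimensional concept space from term co-occurrence and then projecting documents into it, can be sketched with a truncated SVD of a term-document matrix. The toy corpus, the seed term, and the two-component cut-off below are illustrative assumptions, not the dissertation's actual data or settings:

```python
# Minimal latent-semantic-analysis sketch on a hypothetical toy corpus:
# documents are mapped into a low-dimensional "concept" space via a
# truncated SVD of the term-document matrix, then scored by cosine
# similarity against a seed term standing in for the concept of interest.
import numpy as np

docs = [
    "great acting great plot",
    "terrible acting weak plot",
    "great movie great cast",
]
vocab = sorted({w for d in docs for w in d.split()})

# term-document count matrix (terms x documents)
X = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# truncated SVD: keep k latent concepts
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_concepts = (np.diag(s[:k]) @ Vt[:k]).T   # documents in concept space
term_concepts = U[:, :k]                     # terms in concept space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# score each document against the concept direction of the term "great"
seed = term_concepts[vocab.index("great")]
scores = [cosine(d, seed) for d in doc_concepts]
```

The two documents containing "great" end up closer to the seed direction than the negative review, which is the kind of concept score the abstract refers to.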
2

Utformning av mjukvarusensorer för avloppsvatten med multivariata analysmetoder / Design of soft sensors for wastewater with multivariate analysis

Abrahamsson, Sandra January 2013 (has links)
Every study of a real process or system is based on measured data. In the past, the amount of data available for such investigations was very limited, but with today's technology measurement data are far more accessible: instead of a few, often disconnected measurements of a single variable, there are now many, nearly continuous measurements of a larger number of variables, which considerably changes the possibilities to understand and describe processes. Multivariate analysis is often used when large data sets with many variables are evaluated. In this project, the multivariate analysis methods PCA (principal component analysis) and PLS (partial least squares projection to latent structures) were applied to wastewater data collected at Hammarby Sjöstadsverk. Treatment plants today face ever stricter demands from society to reduce their environmental impact. With better process knowledge, among other things, the systems can be monitored and controlled so that resource consumption decreases without degrading treatment performance. Some variables are easy to measure directly in the water, while others require extensive laboratory analysis. Parameters in the latter category that are important for treatment performance include the wastewater's content of phosphorus and nitrogen, which demand resources such as chemicals for phosphorus precipitation and energy for aeration of the biological treatment step. The concentrations of these substances in the incoming water vary over the day and are difficult to monitor. The purpose of this study was to investigate whether it is possible to obtain information about the hard-to-measure variables in the wastewater from easily measured variables, by using multivariate analysis methods to build models of the variables. The models are often called soft sensors, since they are not physical sensors. Measurements of the wastewater in Line 1 were made at several points in the process during March 11-15, 2013.
Several multivariate models were then created to try to explain the hard-to-measure variables. The results show that information about the variables can be obtained with PLS models built on more easily available data. The models worked best for explaining incoming nitrogen, but further validation should be carried out to firmly establish their accuracy. / Studies of real processes are based on measured data. In the past, the amount of available data was very limited. However, with modern technology, the information that can be obtained from measurements is far more accessible, which considerably alters the possibility to understand and describe processes. Multivariate analysis is often used when large datasets containing many variables are evaluated. In this thesis, the multivariate analysis methods PCA (principal component analysis) and PLS (partial least squares projection to latent structures) have been applied to wastewater data collected at Hammarby Sjöstadsverk WWTP (wastewater treatment plant). Wastewater treatment plants are required to monitor and control their systems in order to reduce their environmental impact. With improved knowledge of the processes involved, the impact can be significantly decreased without affecting the plant efficiency. Several variables are easy to measure directly in the water, while others require extensive laboratory analysis. Some of the parameters in the latter category are the contents of phosphorus and nitrogen in the water, both of which are important for the wastewater treatment results. The concentrations of these substances in the inlet water vary during the day and are difficult to monitor properly. The purpose of this study was to investigate whether it is possible, from the more easily measured variables, to obtain information on those which require more extensive analysis.
This was done by using multivariate analysis to create models attempting to explain the variation in these variables. The models are commonly referred to as soft sensors, since they do not actually make use of any physical sensors to measure the relevant variable. Data were collected during the period of March 11 to March 15, 2013 in the wastewater at different stages of the treatment process, and a number of multivariate models were created. The results show that it is possible to obtain information about the variables with PLS models based on easy-to-measure variables. The best model was the one explaining the concentration of nitrogen in the inlet water.
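A soft sensor of the kind described, a PLS model predicting a hard-to-measure quantity from easily measured signals, can be sketched as follows. The four synthetic input signals, the simulated "inlet nitrogen" response, and the two-component NIPALS PLS1 model are illustrative assumptions, not the study's actual data or model:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical easy-to-measure inputs (e.g. flow, conductivity, turbidity, temp)
n = 200
X = rng.normal(size=(n, 4))
# hard-to-measure target (e.g. inlet nitrogen), driven by two of the inputs
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.normal(size=n)

def pls1_fit(X, y, n_comp):
    """NIPALS PLS1: returns regression coefficients mapping X -> y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    W, P, Q = [], [], []
    for _ in range(n_comp):
        w = Xc.T @ yc                 # weight: covariance direction with y
        w /= np.linalg.norm(w)
        t = Xc @ w                    # score
        p = Xc.T @ t / (t @ t)        # X loading
        q = (yc @ t) / (t @ t)        # y loading
        Xc = Xc - np.outer(t, p)      # deflate X
        yc = yc - q * t               # deflate y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    B = W @ np.linalg.solve(P.T @ W, Q)   # coefficients in original X space
    return B, X.mean(axis=0), y.mean()

B, x_mean, y_mean = pls1_fit(X, y, n_comp=2)
y_hat = (X - x_mean) @ B + y_mean
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
```

With two latent components the model recovers the linear relationship almost exactly, which is the soft-sensor principle: the lab-analysed variable is predicted from online measurements.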
3

Novel variable influence on projection (VIP) methods in OPLS, O2PLS, and OnPLS models for single- and multi-block variable selection : VIPOPLS, VIPO2PLS, and MB-VIOP methods

Galindo-Prieto, Beatriz January 2017 (has links)
Multivariate and multiblock data analysis comprises useful methodologies for analyzing large data sets in chemistry, biology, psychology, economics, sensory science, and industrial processes; among these methodologies, partial least squares (PLS) and orthogonal projections to latent structures (OPLS®) have become popular. With increasingly computerized instrumentation, a data set can consist of thousands of input variables containing latent information valuable for research and industrial purposes. When many data sets (blocks) are analyzed simultaneously, the number of variables and the underlying connections between them grow considerably; at that point, reducing the number of variables while keeping high interpretability becomes a much-needed strategy. The main direction of research in this thesis is the development of a variable selection method, based on variable influence on projection (VIP), to improve the interpretability of OnPLS models in multiblock data analysis. This new method is called multiblock variable influence on orthogonal projections (MB-VIOP), and its novelty lies in the fact that it is the first multiblock variable selection method for OnPLS models. Several milestones had to be reached to create MB-VIOP. The first milestone was the development of a single-block variable selection method able to handle orthogonal latent variables in OPLS models, i.e. VIP for OPLS (denoted VIPOPLS or OPLS-VIP in Paper I), which proved to increase the interpretability of PLS and OPLS models and was afterwards successfully extended to multivariate time series analysis (MTSA) aimed at process control (Paper II). The second milestone was the development of the first multiblock VIP approach for enhancement of O2PLS® models, i.e. VIPO2PLS for two-block multivariate data analysis (Paper III).
Finally, the third milestone, and the main goal of this thesis, was the development of the MB-VIOP algorithm for improving OnPLS model interpretability when a large number of data sets are analyzed simultaneously (Paper IV). The results of this thesis and its enclosed papers show that the VIPOPLS, VIPO2PLS, and MB-VIOP methods successfully identify the variables most relevant for model interpretation in PLS, OPLS, O2PLS, and OnPLS models. In addition, predictability, robustness, dimensionality reduction, and other variable selection goals can potentially be improved or achieved by using these methods.
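As a rough sketch of the single-block starting point, VIP scores for a plain PLS1 model can be computed from the component weights and the share of y-variance each component explains. The synthetic data, the two-component model, and the choice of which variables carry information are all hypothetical; this is classical PLS-VIP, not the thesis's orthogonal-component extensions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 6
X = rng.normal(size=(n, p))
# only variables 0 and 3 carry information about y; the rest are noise
y = 1.5 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=n)

# NIPALS PLS1, keeping weights and scores for the VIP computation
Xc = X - X.mean(axis=0)
yc = y - y.mean()
A = 2
W, T, Q = [], [], []
for _ in range(A):
    w = Xc.T @ yc
    w /= np.linalg.norm(w)
    t = Xc @ w
    pl = Xc.T @ t / (t @ t)
    q = (yc @ t) / (t @ t)
    Xc -= np.outer(t, pl)
    yc = yc - q * t
    W.append(w); T.append(t); Q.append(q)

# VIP_j = sqrt( p * sum_a SSY_a * w_aj^2 / sum_a SSY_a ), weights unit-norm
ssy = np.array([q ** 2 * (t @ t) for q, t in zip(Q, T)])  # y-variance per comp.
W = np.array(W)                                           # A x p
vip = np.sqrt(p * (ssy @ W ** 2) / ssy.sum())
```

Variables with VIP above 1 are conventionally kept; here the two informative variables stand out while the noise variables fall below the threshold.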
4

Latent variable based computational methods for applications in life sciences : Analysis and integration of omics data sets

Bylesjö, Max January 2008 (has links)
With the increasing availability of high-throughput systems for parallel monitoring of multiple variables, e.g. levels of large numbers of transcripts in functional genomics experiments, massive amounts of data are being collected even from single experiments. Extracting useful information from such systems is a non-trivial task that requires powerful computational methods to identify common trends and to help detect the underlying biological patterns. This thesis deals with the general computational problems of classifying and integrating high-dimensional empirical data using a latent variable based modeling approach. The underlying principle of this approach is that a complex system can be characterized by a few independent components that characterize the systematic properties of the system. Such a strategy is well suited for handling noisy, multivariate data sets with strong multicollinearity structures, such as those typically encountered in many biological and chemical applications. The main foci of the studies this thesis is based upon are applications and extensions of the orthogonal projections to latent structures (OPLS) method in life science contexts. OPLS is a latent variable based regression method that separately describes systematic sources of variation that are related and unrelated to the modeling aim (for instance, classifying two different categories of samples). This separation of sources of variation can be used to pre-process data, but also has distinct advantages for model interpretation, as exemplified throughout the work. For classification cases, a probabilistic framework for OPLS has been developed that allows the incorporation of both variance and covariance into classification decisions. This can be seen as a unification of two historical classification paradigms based on either variance or covariance. 
In addition, a non-linear reformulation of the OPLS algorithm is outlined, which is useful for particularly complex regression or classification tasks. The general trend in functional genomics studies in the post-genomics era is to perform increasingly comprehensive characterizations of organisms in order to study the associations between their molecular and cellular components in greater detail. Frequently, abundances of all transcripts, proteins and metabolites are measured simultaneously in an organism at a current state or over time. In this work, a generalization of OPLS is described for the analysis of multiple data sets. It is shown that this method can be used to integrate data in functional genomics experiments by separating the systematic variation that is common to all data sets considered from sources of variation that are specific to each data set. / Funktionsgenomik är ett forskningsområde med det slutgiltiga målet att karakterisera alla gener i ett genom hos en organism. Detta inkluderar studier av hur DNA transkriberas till mRNA, hur det sedan translateras till proteiner och hur dessa proteiner interagerar och påverkar organismens biokemiska processer. Den traditionella ansatsen har varit att studera funktionen, regleringen och translateringen av en gen i taget. Ny teknik inom fältet har dock möjliggjort studier av hur tusentals transkript, proteiner och små molekyler uppträder gemensamt i en organism vid ett givet tillfälle eller över tid. Konkret innebär detta även att stora mängder data genereras även från små, isolerade experiment. Att hitta globala trender och att utvinna användbar information från liknande data-mängder är ett icke-trivialt beräkningsmässigt problem som kräver avancerade och tolkningsbara matematiska modeller. Denna avhandling beskriver utvecklingen och tillämpningen av olika beräkningsmässiga metoder för att klassificera och integrera stora mängder empiriskt (uppmätt) data. 
Gemensamt för alla metoder är att de baseras på latenta variabler: variabler som inte uppmätts direkt utan som beräknats från andra, observerade variabler. Detta koncept är väl anpassat till studier av komplexa system som kan beskrivas av ett fåtal, oberoende faktorer som karakteriserar de huvudsakliga egenskaperna hos systemet, vilket är kännetecknande för många kemiska och biologiska system. Metoderna som beskrivs i avhandlingen är generella men i huvudsak utvecklade för och tillämpade på data från biologiska experiment. I avhandlingen demonstreras hur dessa metoder kan användas för att hitta komplexa samband mellan uppmätt data och andra faktorer av intresse, utan att förlora de egenskaper hos metoden som är kritiska för att tolka resultaten. Metoderna tillämpas för att hitta gemensamma och unika egenskaper hos regleringen av transkript och hur dessa påverkas av och påverkar små molekyler i trädet poppel. Utöver detta beskrivs ett större experiment i poppel där relationen mellan nivåer av transkript, proteiner och små molekyler undersöks med de utvecklade metoderna.
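The core OPLS idea, splitting the variation in X into a part predictive of y and a part orthogonal to it, can be sketched for a single orthogonal component. The data below are synthetic, and this is the basic one-component construction rather than the probabilistic or multi-set extensions the thesis develops:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.1 * rng.normal(size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# predictive weight, score and loading
w = Xc.T @ yc
w /= np.linalg.norm(w)
t = Xc @ w
pv = Xc.T @ t / (t @ t)

# orthogonal weight: the part of the loading not aligned with w
w_o = pv - (w @ pv) * w
w_o /= np.linalg.norm(w_o)
t_o = Xc @ w_o                      # orthogonal score, uncorrelated with y
p_o = Xc.T @ t_o / (t_o @ t_o)

# remove the y-orthogonal structured variation from X
X_filt = Xc - np.outer(t_o, p_o)
```

By construction the orthogonal score carries no covariance with y, which is exactly the separation of "related" and "unrelated" systematic variation the abstract describes.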
5

FRAMEWORK FOR SUSTAINABILITY METRIC OF THE BUILT ENVIRONMENT

Marjaba, Ghassan January 2020 (has links)
Sustainability of the built environment is one of the most significant challenges facing the construction industry, and presents significant opportunities to effect change. The absence of quantifiable and holistic sustainability measures for the built environment has hindered their application. To address this, a sustainability performance metric (SPM) framework was conceptually formulated by employing sustainability objectives and function statements a priori to identify the correlated sustainability indicators that need to be captured equally with respect to the environment, the economy, and society. Projection to Latent Structures (PLS), a latent variable method, was adopted to mathematically formulate the metric. Detached single-family housing was used to demonstrate the application of the SPM. Datasets were generated using the Athena Impact Estimator, EnergyPlus, Building Information Modelling (BIM), and socioeconomic input/output models, among others. Results revealed that a holistic metric such as the SPM is necessary to obtain a sustainable design, where qualitative or univariate considerations may result in the contrary. A building envelope coefficient of performance (BECOP) metric based on an idealized system was also developed to measure the energy efficiency of the building envelope. Results revealed the inefficiencies in current building envelope construction technologies and the missed opportunities for saving energy. Furthermore, a decision-making tool, formulated using the PLS utilities, was shown to be effective and necessary for the early stages of design for energy efficiency. / Thesis / Doctor of Science (PhD) / Sustainability of the built environment is a significant challenge facing the industry, and presents opportunities to effect change. The absence of holistic sustainability measures has hindered their application.
To address this, a sustainability performance metric (SPM) framework was formulated by employing sustainability objectives and function statements a priori to identify the indicators that need to be captured. Projection to Latent Structures was adopted to mathematically formulate the metric. A housing prototype was used to demonstrate the application of the SPM utilizing a bespoke dataset. Results revealed that a holistic metric, such as the SPM, is necessary for achieving sustainable designs. A building envelope coefficient of performance metric was also developed to measure the energy efficiency of the building envelope. Results revealed the inefficiencies in current building envelope technologies and identified missed opportunities. Furthermore, a decision-making tool was formulated and shown to be effective and necessary for design for energy efficiency.
6

Explorative Multivariate Data Analysis of the Klinthagen Limestone Quarry Data / Utforskande multivariat analys av Klinthagentäktens projekteringsdata

Bergfors, Linus January 2010 (has links)
Today's quarry planning at Klinthagen is rough, which provides an opportunity to introduce new, exciting methods to improve the quarry gain and efficiency. Nordkalk AB, active at Klinthagen, wishes to start a new quarry at a nearby location. To exploit future quarries in an efficient manner and ensure production quality, multivariate statistics may help gather important information. In this thesis, the possibilities of the multivariate statistical approaches of Principal Component Analysis (PCA) and Partial Least Squares (PLS) regression were evaluated on the Klinthagen bore data. PCA data were spatially interpolated by kriging, which was also evaluated and compared to IDW interpolation. Principal component analysis supplied an overview of the variables' relations, but also visualised the problems involved in linking geophysical data to geochemical data and the inaccuracy introduced by lacking data quality. The PLS regression further emphasised the geochemical-geophysical problems, but also showed good precision when applied to strictly geochemical data. Spatial interpolation by kriging did not result in significantly better approximations than the less complex control interpolation by IDW. To improve the information content of the data when modelled by PCA, a more discrete sampling method would be advisable. The data quality may cause trouble, though with today's sampling technique this was considered to be of less consequence. Faced with a single geophysical component to be predicted from chemical variables, further geophysical data are needed to complement the existing data to achieve satisfying PLS models. The stratified rock composition caused trouble when spatially interpolated. Further investigations should be performed to develop more suitable interpolation techniques.
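Of the two interpolation schemes compared, IDW (inverse distance weighting) is the simpler baseline: each unsampled point is estimated as a distance-weighted average of nearby samples. A minimal sketch follows; the bore-hole coordinates and the "CaO content" values are invented for illustration:

```python
import numpy as np

def idw(points, values, query, power=2.0, eps=1e-12):
    """Inverse-distance-weighted estimate at `query` from scattered samples."""
    d = np.linalg.norm(points - query, axis=1)
    if np.any(d < eps):                 # query coincides with a sample point
        return values[np.argmin(d)]
    w = 1.0 / d ** power
    return float(w @ values / w.sum())

# hypothetical bore-hole samples: (x, y) position and a measured CaO content
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
vals = np.array([50.0, 52.0, 48.0, 51.0])

est = idw(pts, vals, np.array([0.5, 0.5]))
```

Kriging differs in that the weights come from a fitted variogram model of spatial correlation rather than raw distance, which is also why stratified (anisotropic) rock is hard for both schemes.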
8

Multivariate design of molecular docking experiments : An investigation of protein-ligand interactions

Andersson, David January 2010 (has links)
To make informed decisions in the research of new drug molecules (ligands), it is crucial to have access to information regarding the chemical interaction between the drug and its biological target (protein). Computer-based methods have an established role in drug research today and, by using methods such as molecular docking, it is possible to investigate the way in which ligands and proteins interact. Despite the acceleration in computer power experienced in the last decades, many problems persist in modelling these complicated interactions. The main objective of this thesis was to investigate and improve molecular modelling methods aimed at estimating protein-ligand binding. To do so, we utilised chemometric tools, e.g. design of experiments (DoE) and principal component analysis (PCA), in the field of molecular modelling. More specifically, molecular docking was investigated as a tool for reproduction of ligand poses in protein 3D structures and for virtual screening. Adjustable parameters in two docking software packages were varied using DoE, and parameter settings were identified that led to improved results. In an additional study, we explored the nature of ligand-binding cavities in proteins, since they are important factors in protein-ligand interactions, especially in the prediction of the function of newly found proteins. We developed a strategy, comprising a new set of descriptors and PCA, to map proteins based on the physicochemical properties of their cavities. Finally, we applied our developed strategies to design a set of glycopeptides which were used to study autoimmune arthritis. A combination of docking and statistical molecular design, synthesis and biological evaluation led to new binders for two different class II MHC proteins and recognition by a panel of T-cell hybridomas. New and interesting SAR conclusions could be drawn, and the results will serve as a basis for selection of peptides to include in in vivo studies.
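The cavity-mapping step, PCA on a table of physicochemical descriptors, can be sketched as follows. The descriptor names and the data are hypothetical stand-ins; the point is how scores (a low-dimensional "map" of the proteins) and loadings (descriptor contributions) fall out of a singular value decomposition:

```python
import numpy as np

rng = np.random.default_rng(3)
# hypothetical cavity descriptors: volume, depth, polarity, charge, hydrophobicity
n, p = 50, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)   # a strongly correlated pair

# PCA via SVD of the mean-centered descriptor matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                # sample coordinates: the "map" of the proteins
loadings = Vt.T               # descriptor contributions per component
explained = s ** 2 / np.sum(s ** 2)   # fraction of variance per component
```

The correlated descriptor pair collapses onto the first component, illustrating why PCA is a natural way to summarise redundant cavity properties before comparing proteins.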
9

Multivariate Synergies in Pharmaceutical Roll Compaction : The quality influence of raw materials and process parameters by design of experiments

Souihi, Nabil January 2014 (has links)
Roll compaction is a continuous process commonly used in the pharmaceutical industry for dry granulation of moisture- and heat-sensitive powder blends. It is intended to increase bulk density and improve flowability. Roll compaction is a complex process that depends on many factors, such as feed powder properties, processing conditions and system layout, and some of the variability in the process remains unexplained. Accordingly, modeling tools are needed to understand the properties of, and the interrelations between, raw materials, process parameters and the quality of the product. It is important to look at the whole manufacturing chain from raw materials to tablet properties. The main objective of this thesis was to investigate the impact of raw materials, process parameters and system design variations on the quality of intermediate and final roll compaction products, as well as their interrelations. To do so, we conducted a series of systematic experimental studies and utilized chemometric tools, such as design of experiments and latent variable models (i.e. PCA, OPLS and O2PLS), as well as mechanistic models based on the rolling theory of granular solids developed by Johanson (1965). More specifically, we developed a modeling approach to elucidate the influence of different brittle filler qualities of mannitol and dicalcium phosphate and their physical properties (i.e. flowability, particle size and compactability) on intermediate and final product quality. This approach allows the possibility of introducing new fillers without additional experiments, provided that they are within the previously mapped design space. Additionally, this approach is generic and could be extended beyond fillers. Furthermore, in contrast to many other materials, the results revealed that some qualities of the investigated fillers demonstrated improved compactability following roll compaction.
In one study, we identified the design space for a roll compaction process using a risk-based approach. The influence of process parameters (i.e. roll force, roll speed, roll gap and milling screen size) on different ribbon, granule and tablet properties was evaluated. In another study, we demonstrated the significant added value of the combination of near-infrared chemical imaging, texture analysis and multivariate methods in the quality assessment of the intermediate and final roll compaction products. Finally, we have also studied the roll compaction of an intermediate drug load formulation at different scales and using roll compactors with different feed screw mechanisms (i.e. horizontal and vertical). The horizontal feed screw roll compactor was also equipped with an instrumented roll technology allowing the measurement of normal stress on ribbon. Ribbon porosity was primarily found to be a function of normal stress, exhibiting a quadratic relationship. A similar quadratic relationship was also observed between roll force and ribbon porosity of the vertically fed roll compactor. A combination of design of experiments, latent variable and mechanistic models led to a better understanding of the critical process parameters and showed that scale up/transfer between equipment is feasible.
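The design-of-experiments screening described above can be sketched as a two-level full-factorial plan over the four roll compaction factors, plus a center point for checking curvature. The factor ranges below are invented placeholders, not the study's actual settings:

```python
# A 2^4 full-factorial design over hypothetical roll compaction factors,
# plus one center point. Each run is a dict of factor name -> setting.
from itertools import product

factors = {
    "roll_force_kN_cm": (2.0, 8.0),   # (low, high) - illustrative ranges only
    "roll_speed_rpm":   (2.0, 10.0),
    "roll_gap_mm":      (1.5, 3.5),
    "screen_size_mm":   (0.8, 1.6),
}

# 16 corner runs covering every low/high combination
runs = [dict(zip(factors, levels))
        for levels in product(*[(lo, hi) for lo, hi in factors.values()])]

# one center point (midpoint of every range) to detect non-linear effects
runs.append({name: (lo + hi) / 2 for name, (lo, hi) in factors.items()})
```

Fitting a linear model with interactions to the responses measured at these 17 runs (ribbon porosity, granule size, tablet strength, etc.) is what lets main effects and interactions of the process parameters be separated.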
10

A multivariate approach to characterization of drug-like molecules, proteins and the interactions between them

Lindström, Anton January 2008 (has links)
A disease can often be traced to a cascade reaction between proteins, co-factors and substrates, and this cascade reaction frequently becomes the target when the disease is treated with drugs. Computer-based tools are commonly used to design new drug molecules, and such design benefits greatly from a known target protein, above all from its three-dimensional (3D) structure. When the 3D structure is known, so-called structure-based, computer-aided molecular design can be performed, with the 3D geometry (particularly of the binding site) guiding the design of a new molecule. Many factors determine the interaction between a molecule and the binding site, for example the physico-chemical properties of the molecule and the binding site, the flexibility of the molecule and the target protein, and the surrounding solvent. For structure-based molecular design to work well, two important steps must be performed: i) 3D fitting of molecules to the binding site of a target protein (so-called docking), and ii) prediction of the molecules' affinity for the binding site. The main aims of the work in this thesis were as follows: i) to create models for predicting the affinity between a molecule and the binding site of a target protein; ii) to refine the molecule-protein geometry created by 3D fitting of a molecule to the binding site (docking); iii) to characterize proteins, above all their secondary structure; and iv) to assess the effect of different mathematical descriptions of the solvent on the refinement of the 3D molecule-protein geometry created by docking and on the prediction of molecules' affinity for protein binding sites. An overarching aim was to apply chemometric methods for modeling and data analysis to the points above. To summarize, this thesis presents methods and results useful for structure-based molecular design.
The reported results show that it is possible to create chemometric models for predicting molecules' affinity for a protein binding site, and that these performed as well as other common methods. Moreover, chemometric models could be created to describe how the settings of various parameters in docking software affected the 3D molecule-protein geometry the software produced. Chemometric models could also be used to deepen the understanding of descriptors of protein secondary structure. Refinement of the molecule-protein geometry created by docking gave similar, non-significant results regardless of which mathematical solvent model was used, except in a few cases (six of 30). However, using a refined geometry proved valuable for predicting molecules' affinity for a protein binding site; the improvement was marked when a Poisson-Boltzmann description of the solvent was used: compared with the predictions made by a docking program, the correlation between computed and measured affinity improved by 0.7 (R2). / A disease is often associated with a cascade reaction pathway involving proteins, co-factors and substrates. Hence, to treat the disease, elements of this pathway are often targeted using a therapeutic agent, a drug. Designing new drug molecules for use as therapeutic agents involves the application of methods collectively known as computer-aided molecular design, CAMD. When the three-dimensional (3D) geometry of a macromolecular target (usually a protein) is known, structure-based CAMD is undertaken, and structural information of the target guides the design of new molecules and their interactions with the binding sites in targeted proteins.
Many factors influence the interactions between the designed molecules and the binding sites of the target proteins, such as the physico-chemical properties of the molecule and the binding site, the flexibility of the protein and the ligand, and the surrounding solvent. In order for structure-based CAMD to be successful, two important aspects must be considered that take the abovementioned factors into account. These are; i) 3D fitting of molecules to the binding site of the target protein (like fitting pieces of a jigsaw puzzle), and ii) predicting the affinity of molecules to the protein binding site. The main objectives of the work underlying this thesis were: to create models for predicting the affinity between a molecule and a protein binding site; to refine the geometry of the molecule-protein complex derived by or in 3D fitting (also known as docking); to characterize the proteins and their secondary structure; and to evaluate the effects of different generalized-Born (GB) and Poisson-Boltzmann (PB) implicit solvent models on the refinement of the molecule-protein complex geometry created in the docking and the prediction of the molecule-to-protein binding site affinity. A further objective was to apply chemometric methodologies for modeling and data analysis to all of the above. To summarize, this thesis presents methodologies and results applicable to structure-based CAMD. Results show that predictive chemometric models for molecule-to-protein binding site affinity could be created that yield comparable results to similar, commonly used methods. In addition, chemometric models could be created to model the effects of software settings on the molecule-protein complex geometry using software for molecule-to-binding site docking. Furthermore, the use of chemometric models provided a more profound understanding of protein secondary structure descriptors. 
Refining the geometry of molecule-protein complexes created through molecule-to-binding site docking gave similar results for all investigated implicit solvent models, but the geometry was significantly improved in only a few examined cases (six of 30). However, using the geometry-refined molecule-protein complexes was highly valuable for the prediction of molecule-to-binding site affinity. Indeed, using the PB solvent model it yielded improvements of 0.7 in correlation coefficients (R2) for binding affinity parameters of a set of Factor Xa protein drug molecules, relative to those obtained using the fitting software.
