471

Anomaly Detection With Machine Learning In Astronomical Images

Etsebeth, Verlon January 2020 (has links)
Masters of Science / Observations that push the boundaries have historically fuelled scientific breakthroughs, and these observations frequently involve phenomena that were previously unseen and unidentified. Data sets have grown in size and quality as modern technology advances at a record pace, and finding these elusive phenomena within such large data sets becomes a tougher challenge with each advancement. Fortunately, machine learning techniques have proven extremely valuable for detecting outliers within data sets. Astronomaly is a framework that uses machine learning for anomaly detection in astronomy and incorporates active learning to provide target-specific results. It is used here to evaluate whether machine learning techniques are suitable for detecting anomalies within the optical astronomical data obtained from the Dark Energy Camera Legacy Survey (DECaLS). Using the isolation forest algorithm, Astronomaly is applied to subsets of the DECaLS data set. The pre-processing stage of Astronomaly had to be significantly extended to handle real survey data from DECaLS, with the changes resulting in up to 10% more sources having their features extracted successfully. Of the top 500 sources returned, 292 were ordinary sources, 86 were artefacts or masked sources, and 122 were interesting anomalous sources. Active learning, which makes it easier to surface target-specific sources, further increases the number of interesting sources returned by almost 40%, with 273 ordinary sources, 56 artefacts and 171 interesting anomalous sources returned. Among the anomalies discovered are merger events already identified in known catalogues and several candidate merger events not yet reported in the literature. The results indicate that machine learning, in combination with active learning, can be effective in detecting anomalies in real data sets. The extensions integrated into Astronomaly pave the way for its application to future surveys like the Vera C. Rubin Observatory Legacy Survey of Space and Time.
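As a sketch of the core ranking step described here (not Astronomaly's actual implementation), the snippet below scores a hypothetical per-source feature matrix with scikit-learn's isolation forest and ranks the most anomalous sources first; the feature values and dimensions are invented stand-ins for real DECaLS image features.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical feature matrix: one row per source, columns are
# image-derived features extracted in Astronomaly's pre-processing stage.
rng = np.random.default_rng(42)
features = rng.normal(size=(5000, 20))  # stand-in for real survey features

forest = IsolationForest(n_estimators=200, random_state=42)
forest.fit(features)

# score_samples returns higher values for more "normal" points,
# so negate it to rank the most anomalous sources first.
anomaly_scores = -forest.score_samples(features)
top_500 = np.argsort(anomaly_scores)[::-1][:500]
print("Indices of the 500 most anomalous sources:", top_500[:10], "...")
```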
472

Fault detection of planetary gearboxes in BLDC-motors using vibration and acoustic noise analysis

Ahnesjö, Henrik January 2020 (has links)
This thesis aims to use vibration and acoustic noise analysis to help ensure good quality on a production line for a certain motor type. Noise from the gearbox is sometimes present, and it is currently detected by a human listening to the motor. This kind of fault detection is subjective and prone to human error, so an automatic test that passes or fails the produced Brushless Direct Current (BLDC) motors is desired. Two measurement setups were used: one based on an accelerometer for vibration measurements, and one based on a microphone for acoustic sound measurements. Acquisition and analysis of the measurements were implemented using the data acquisition device compactDAQ NI 9171 and the graphical programming software NI LabVIEW. Two methods, power spectrum analysis and machine learning, were used to analyze the vibration and acoustic signals and identify faults in the gearbox. The first method, based on the Fast Fourier Transform (FFT), was applied to the recorded sound from the BLDC motor with its integrated planetary gearbox to identify peaks in the sound signals. The source of the acoustic noise was a faulty planet gear in which a tooth flank had an indentation; the resulting noise could be measured and analyzed and serves as an indication of gear faults. The second method was based on the BLDC motors' vibration characteristics and uses supervised machine learning to separate healthy motors from faulty ones. Support Vector Machine (SVM) is the suggested machine learning algorithm, using 23 different features. The best-performing model was a Coarse Gaussian SVM, with an overall accuracy of 92.25% on the validation data.
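A minimal sketch of the two-method pipeline, assuming synthetic signals in place of the real recordings: Welch power-spectrum features feed an RBF-kernel SVM, scikit-learn's closest analogue to MATLAB's "Coarse Gaussian SVM" (which uses a large kernel scale, i.e. a small gamma). The sampling rate, tone frequencies, and feature set are invented for illustration and the feature count differs from the thesis's 23.

```python
import numpy as np
from scipy.signal import welch
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
fs = 25600                                    # assumed sampling rate (Hz)

def spectral_features(x):
    """A few illustrative features from the Welch power spectrum."""
    freqs, psd = welch(x, fs=fs, nperseg=1024)
    return [psd.max(), freqs[psd.argmax()], psd.mean(), psd.std()]

# Stand-ins for recorded signals: label 1 motors carry an extra
# sideband tone mimicking the indented tooth flank.
X, y = [], []
for label in (0, 1):
    for _ in range(100):
        t = np.arange(0, 0.5, 1 / fs)
        x = np.sin(2 * np.pi * 1200 * t) + 0.1 * rng.normal(size=t.size)
        if label:
            x += 0.3 * np.sin(2 * np.pi * 1250 * t)
        X.append(spectral_features(x))
        y.append(label)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
# RBF SVM with a small gamma, approximating a "coarse" Gaussian kernel.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=0.1))
model.fit(X_train, y_train)
print(f"Validation accuracy: {model.score(X_val, y_val):.2%}")
```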
473

Silent speech recognition in EEG-based brain computer interface

Ghane, Parisa January 2015 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / A Brain Computer Interface (BCI) is a hardware and software system that establishes direct communication between the human brain and the environment. In a BCI system, brain messages pass through wires and external computers instead of the normal pathway of nerves and muscles. The general workflow in all BCIs is to measure brain activity, process it, and convert it into an output readable by a computer. The measurement of electrical activity in different parts of the brain is called electroencephalography (EEG). There are many sensor technologies with different numbers of electrodes for recording brain activity along the scalp; each electrode captures a weighted sum of the activity of all neurons in the area around it. To establish a BCI system, a set of electrodes must be placed on the scalp, along with a tool to send the signals to a computer, in order to train a system that can find the important information, extract it from the raw signal, and use it to recognize the user's intention. Finally, a control signal is generated according to the application. This thesis describes the step-by-step training and testing of a BCI system that could be used by a person who has lost the ability to speak through an accident or surgery but still has healthy brain tissue. The goal is to establish an algorithm that recognizes different vowels from EEG signals. It uses a bandpass filter to remove noise and artifacts from the signals, periodograms for feature extraction, and a Support Vector Machine (SVM) for classification.
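The processing chain the abstract describes (bandpass filtering, periodogram features, SVM classification) could be sketched as below; the sampling rate, band edges, trial data, and vowel labels are all hypothetical placeholders, not the thesis's actual parameters.

```python
import numpy as np
from scipy.signal import butter, filtfilt, periodogram
from sklearn.svm import SVC

fs = 256                                # assumed EEG sampling rate (Hz)

def bandpass(x, low=1.0, high=40.0, order=4):
    """Butterworth bandpass to suppress drift and high-frequency artifacts."""
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, x)

def features(x):
    """Periodogram power in a fixed set of frequency bins."""
    freqs, pxx = periodogram(bandpass(x), fs=fs)
    return pxx[(freqs >= 1) & (freqs <= 40)]

# Stand-in EEG trials: rows are one-second recordings, labels are vowels.
rng = np.random.default_rng(1)
trials = rng.normal(size=(60, fs))
labels = rng.choice(["a", "e", "i"], size=60)   # hypothetical vowel labels

X = np.array([features(trial) for trial in trials])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:5]))
```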
474

PCA based dimensionality reduction of MRI images for training support vector machine to aid diagnosis of bipolar disorder

Chen, Beichen, Chen, Amy Jinxin January 2019 (has links)
This study investigates how dimensionality reduction of neuroimaging data prior to training support vector machines (SVMs) affects the classification accuracy of bipolar disorder, using principal component analysis (PCA) for the dimensionality reduction. An open-source data set of 19 bipolar and 31 control structural magnetic resonance imaging (sMRI) samples was used, part of the UCLA Consortium for Neuropsychiatric Phenomics LA5c Study funded by the NIH Roadmap Initiative, which aims to foster breakthroughs in the development of novel treatments for neuropsychiatric disorders. The images underwent smoothing, feature extraction and PCA before being used as input to train SVMs. Three-fold cross-validation was used to tune a number of hyperparameters for linear, radial and polynomial kernels, and experiments investigated the performance of SVM models trained using 1 to 29 principal components (PCs). Several PC sets reached 100% accuracy in the final evaluation, the smallest being the first two principal components. The cumulative variance explained by the PCs used showed no correlation with model performance. The choice of kernel and hyperparameters is of utmost importance, as the performance obtained can vary greatly. The results support previous findings that SVMs can be useful in aiding the diagnosis of bipolar disorder, and that PCA as a dimensionality reduction method in combination with SVMs may be appropriate for classifying neuroimaging data for illnesses not limited to bipolar disorder. Due to the limitation of a small sample size, the results call for future research using larger collaborative data sets to validate the accuracies obtained.
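A compact sketch of this pipeline under stated assumptions (random stand-in voxel features, invented grid values): scaling, PCA, and an SVM chained in a scikit-learn pipeline and tuned with 3-fold cross-validation over the number of components and the kernel.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for flattened sMRI feature vectors: 50 subjects
# (19 bipolar, 31 controls), each with a few thousand voxel features.
rng = np.random.default_rng(7)
X = rng.normal(size=(50, 2000))
y = np.array([1] * 19 + [0] * 31)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("svm", SVC()),
])
param_grid = {
    "pca__n_components": [2, 5, 10, 20],
    "svm__kernel": ["linear", "rbf", "poly"],
    "svm__C": [0.1, 1, 10],
}
# 3-fold CV over PC counts, kernels, and C, mirroring the study design.
search = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)
print(search.best_params_, f"CV accuracy: {search.best_score_:.2%}")
```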
475

Reusage classification of damaged Paper Cores using Supervised Machine Learning

Elofsson, Max, Larsson, Victor January 2023 (has links)
This paper describes a project exploring the possibility of assessing paper core reusability by measuring chuck damage with a 3D sensor and using machine learning to classify reusability. The paper cores are part of a rolling/unrolling system at a paper mill, where a chuck is used to slow and eventually stop the revolving paper core; this creates damage that at a certain point is too severe for reuse. The 3D sensor used is a TriSpector1008 from SICK, based on active triangulation through laser line projection and optical sensing. A number of paper cores with damage of varying severity, labeled approved or unapproved for further use, were provided. Supervised learning in the form of K-NN, Support Vector Machine, Decision Trees and Random Forest was used to binary-classify the dataset based on readings from the sensor. Features were extracted from these readings based on the spatial and frequency domains of each reading in an experimental way. Classification of reusability was previously done through thresholding on internal features in the sensor software; the goal of the project is to unify the decision-making protocol/system, with economic, environmental and waste-management benefits. K-NN was found to be best suited in our case. Features based on the standard deviation of the calculated depth obtained from the readings performed best, leading to a zero false positive rate and a recall score of 99.14%, outperforming the compared threshold system.
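As an illustrative sketch (with invented feature values and class separation), the classification step could look like the following: a K-NN model trained on per-core features such as the standard deviation of measured depth, evaluated on false positive rate and recall.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, recall_score

# Stand-in features: per-core standard deviation of measured depth
# (the best-performing feature family) plus one frequency-domain feature.
rng = np.random.default_rng(3)
approved = np.column_stack([rng.normal(0.2, 0.05, 150), rng.normal(1.0, 0.2, 150)])
rejected = np.column_stack([rng.normal(0.8, 0.15, 150), rng.normal(1.5, 0.3, 150)])
X = np.vstack([approved, rejected])
y = np.array([1] * 150 + [0] * 150)     # 1 = approved for reuse

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
pred = knn.predict(X_te)
# A false positive here means approving a core that should be rejected.
tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print(f"False positive rate: {fp / (fp + tn):.2%}, "
      f"recall: {recall_score(y_te, pred):.2%}")
```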
476

Evaluation of machine learning methods for anomaly detection in combined heat and power plant

Carls, Fredrik January 2019 (has links)
In the hope of increasing the detection rate of faults in combined heat and power plant boilers, and thus lowering unplanned maintenance, three machine learning models are constructed and evaluated. The algorithms (k-Nearest Neighbor, One-Class Support Vector Machine, and Auto-encoder) have a proven track record in anomaly detection research but are relatively unexplored for industrial applications such as this one, due to the difficulty of collecting non-artificial labeled data in the field. The baseline versions of the k-Nearest Neighbor and Auto-encoder performed very similarly; nevertheless, the Auto-encoder was slightly better and reached an area under the precision-recall curve (AUPRC) of 0.966 and 0.615 on the training and test periods, respectively. No sufficiently good results were reached with the One-Class Support Vector Machine. The Auto-encoder was then made more sophisticated to see how much performance could be increased, raising the AUPRC to 0.987 and 0.801 on the training and test periods, respectively. Additionally, the model was able to detect and generate one alarm for each incident period that occurred during the test period. The conclusion is that ML can successfully be utilized to detect faults at an earlier stage and potentially circumvent otherwise costly unplanned maintenance. Nevertheless, there is still a lot of room for improvement in the model and in the collection of the data.
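A minimal autoencoder-style detector could be sketched as follows, assuming synthetic stand-in sensor data: a small bottleneck network is trained to reconstruct normal operation, reconstruction error serves as the anomaly score, and AUPRC evaluates the ranking. An MLP trained input-to-input is used as a rough stand-in here; the thesis's actual architecture is not specified.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import average_precision_score

# Stand-in sensor readings: rows are time steps, columns are boiler signals.
rng = np.random.default_rng(5)
normal = rng.normal(size=(2000, 10))
faulty = rng.normal(loc=2.0, size=(50, 10))       # injected anomalies
X_test = np.vstack([rng.normal(size=(500, 10)), faulty])
y_test = np.array([0] * 500 + [1] * 50)

scaler = StandardScaler().fit(normal)
# A small bottleneck network trained to reconstruct normal operation only.
ae = MLPRegressor(hidden_layer_sizes=(8, 3, 8), max_iter=500, random_state=5)
ae.fit(scaler.transform(normal), scaler.transform(normal))

# Reconstruction error is the anomaly score; AUPRC evaluates the ranking.
Xs = scaler.transform(X_test)
errors = np.mean((ae.predict(Xs) - Xs) ** 2, axis=1)
print(f"AUPRC: {average_precision_score(y_test, errors):.3f}")
```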
477

Loan Default Prediction using Supervised Machine Learning Algorithms

Granström, Daria, Abrahamsson, Johan January 2019 (has links)
It is essential for a bank to estimate the credit risk it carries and the magnitude of its exposure in case of non-performing customers. Estimation of this kind of risk has been done by statistical methods for decades, and given recent developments in the field of machine learning, there has been interest in investigating whether machine learning techniques can quantify the risk better. The aim of this thesis is to examine which method from a chosen set of machine learning techniques exhibits the best performance in default prediction with regard to chosen model evaluation parameters. The investigated techniques were Logistic Regression, Random Forest, Decision Tree, AdaBoost, XGBoost, Artificial Neural Network and Support Vector Machine. An oversampling technique called SMOTE was implemented to treat the imbalance between classes of the response variable. The results showed that XGBoost without SMOTE obtained the best result with respect to the chosen model evaluation metric.
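The SMOTE-versus-no-SMOTE comparison could be sketched like this, assuming an invented imbalanced data set and the imbalanced-learn and xgboost packages; the evaluation metric used here (F1) is a placeholder for whichever metric the thesis chose.

```python
from imblearn.over_sampling import SMOTE        # assumes imbalanced-learn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier               # assumes xgboost is installed

# Stand-in loan data: roughly 5% default rate to mimic the class imbalance.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95], random_state=11)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=11)

# Variant 1: oversample the minority (default) class with SMOTE.
X_res, y_res = SMOTE(random_state=11).fit_resample(X_tr, y_tr)
smote_model = XGBClassifier(eval_metric="logloss").fit(X_res, y_res)

# Variant 2: plain XGBoost, which the thesis found performed best.
plain_model = XGBClassifier(eval_metric="logloss").fit(X_tr, y_tr)

for name, model in [("SMOTE + XGBoost", smote_model), ("XGBoost", plain_model)]:
    print(name, f"F1: {f1_score(y_te, model.predict(X_te)):.3f}")
```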
478

Using AI to improve the effectiveness of turbine performance data

Shreyas Sudarshan Supe (17552379) 06 December 2023 (has links)
For turbocharged engine simulation analysis, manufacturer-provided data are typically used to predict the mass flow and efficiency of the turbine. To create a turbine map, physical tests are performed in labs at various turbine speeds and expansion ratios. These tests can be very expensive and time-consuming, and current testing methods have limitations that introduce errors into the turbine map. As such, only a modest set of data can be generated, all of which has to be interpolated and extrapolated to create a smooth surface that can then be used for simulation analysis.

The current method used by the manufacturer is a physics-informed polynomial regression model that depends on the Blade Speed Ratio (BSR) in the polynomial function to model efficiency and MFP. This method is memory-consuming and provides lower-than-desired accuracy; it is decades old and must be updated with state-of-the-art machine learning models to remain competitive. Currently, CTT faces up to +/-2% error in most turbine maps for efficiency and MFP, and the aim is to decrease the error to 0.5% when interpolating data points in the available region. The current model also extrapolates data to regions where experimental data cannot be measured; this extrapolation cannot be validated by physical tests and can only be evaluated using CFD analysis.

The thesis focuses on investigating different AI techniques to increase the accuracy of the model for interpolation and on evaluating the models for extrapolation. The data, made available by CTT, consisted of various turbine parameters including ER, turbine speed, efficiency, and MFP, which were considered significant in turbine modeling. The AI models developed used these four parameters, with ER and turbine speed as predictors and efficiency and MFP as responses. Multiple supervised ML models (SVM, GPR, LMANN, BRANN, and GBPNN) were developed and evaluated. Of these five, BRANN performed best, achieving an error of 0.5% across multiple turbines for efficiency and MFP. The same model was used to demonstrate extrapolation, where it gave unreliable predictions; adding data points to the training set at the far end of the testing regions greatly improved the overall shape of the map.

An additional contribution presented here is to predict a complete expansion ratio line and evaluate it against CTT test data points, where the model performed with an accuracy of over 95%. Since physical testing in a lab is expensive and time-consuming, another goal of the project was to reduce the number of data points needed for ANN model training. Strategically reducing the data points is of utmost importance, as some data points play a major role in ANN training and can greatly affect the model's overall accuracy. Up to 50% of the training data points were removed, and it was found that BRANN was still able to predict a satisfactory turbine map with 20% of the overall data points removed at various regions.
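A rough sketch of the map-fitting idea, under loud assumptions: the toy surface below stands in for real turbine data, and an L2-regularized MLP is used as a crude stand-in for Bayesian regularization (MATLAB's trainbr), since the thesis's exact BRANN implementation is not given.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Stand-in turbine map data: expansion ratio (ER) and speed as predictors,
# efficiency and mass flow parameter (MFP) as the two responses.
rng = np.random.default_rng(9)
er = rng.uniform(1.1, 3.5, size=400)
speed = rng.uniform(30e3, 120e3, size=400)
eff = 0.72 - 0.08 * (er - 2.2) ** 2 + 5e-7 * (speed - 75e3)  # toy surface
mfp = np.sqrt(er - 1.0)
X = np.column_stack([er, speed])
Y = np.column_stack([eff, mfp])

# L2-regularized MLP as a rough stand-in for Bayesian regularization;
# alpha plays the role of the weight penalty.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(20, 20), alpha=1e-3,
                 max_iter=5000, random_state=9),
)
model.fit(X, Y)
pred_eff, pred_mfp = model.predict([[2.0, 80e3]])[0]
print(f"Predicted efficiency: {pred_eff:.3f}, MFP: {pred_mfp:.3f}")
```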
479

Detection and Classification of Sparse Traffic Noise Events

Golshani, Kevin, Ekberg, Elias January 2023 (has links)
Noise pollution is a major health hazard for people living in urban areas, and its effects on humans are a growing field of research. One of the major contributors to urban noise pollution is the noise generated by traffic. Noise simulations can be run to build noise maps used in noise management action plans, but to test their accuracy, real measurements need to be taken, in this case noise measurements adjacent to a road. The aim of this project is to test machine learning based methods in order to develop a robust way of detecting and classifying vehicle noise in sparse traffic conditions. The primary focus is to detect traffic noise events; the secondary focus is to classify what kind of vehicle is producing the noise. The data used in this project comes from sensors installed on a testbed at a street in southern Stockholm: a microphone continuously measuring the local noise environment, a radar that detects each passing vehicle, and a camera that also detects vehicles by capturing their license plates. Only sparse traffic noise is considered in this thesis; as such, the audio recordings used are those where the radar detected only one vehicle in a 40-second window, which makes the gathered data weakly labeled. The resulting detection method is a two-step process: first, the unsupervised learning method k-means is used to generate strong labels; second, a supervised learning method, random forest or support vector machine, uses the strong labels to classify audio features. The detection system for sparse traffic noise achieved satisfactory results. However, the unsupervised vehicle classification method produced inadequate results, and the clustering could not differentiate vehicle classes based on the noise data.
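The two-step process could be sketched as follows, with invented frame-level audio features standing in for the real recordings: k-means separates event frames from background frames to produce strong labels, and a random forest is then trained on those labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in audio features: one row per short frame of a weakly labeled
# 40-second recording (e.g. mel-band energies).
rng = np.random.default_rng(13)
background = rng.normal(0.0, 1.0, size=(1500, 16))
vehicle_pass = rng.normal(2.5, 1.0, size=(300, 16))
frames = np.vstack([background, vehicle_pass])

# Step 1: k-means turns weak labels into strong per-frame labels by
# separating "event" frames from "background" frames.
km = KMeans(n_clusters=2, n_init=10, random_state=13).fit(frames)
# Assume the smaller cluster corresponds to the sparse noise events.
event_cluster = np.argmin(np.bincount(km.labels_))
strong_labels = (km.labels_ == event_cluster).astype(int)

# Step 2: a supervised classifier learns to detect events from features.
X_tr, X_te, y_tr, y_te = train_test_split(frames, strong_labels,
                                          stratify=strong_labels,
                                          random_state=13)
rf = RandomForestClassifier(n_estimators=200, random_state=13).fit(X_tr, y_tr)
print(f"Frame-level detection accuracy: {rf.score(X_te, y_te):.2%}")
```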
480

Improved in silico methods for target deconvolution in phenotypic screens

Mervin, Lewis January 2018 (has links)
Target-based screening projects for bioactive (orphan) compounds have in many cases been shown to be insufficiently predictive of in vivo efficacy, leading to attrition in clinical trials. Partly for this reason, phenotypic screening has undergone a renaissance in both academia and the pharmaceutical industry. One key shortcoming of this paradigm shift is that the protein targets modulated must be elucidated afterwards, which is often a costly and time-consuming procedure. In this work, we have explored both improved methods and real-world case studies of how computational methods can help in the target elucidation of phenotypic screens. One limitation of previous methods has been the ability to assess the applicability domain of the models, that is, when the assumptions made by a model are fulfilled and which input chemicals are reliably appropriate for the models. Hence, a major focus of this work was to explore methods for calibrating machine learning algorithms using Platt scaling, isotonic regression scaling and Venn-Abers predictors, since the probabilities from well-calibrated classifiers can be interpreted at a confidence level and predictions specified at an acceptable error rate. Additionally, many current protocols only offer probabilities for affinity, so another key area for development was to expand the target prediction models with functional prediction (activation or inhibition). This extra level of annotation is important since the activation or inhibition of a target may positively or negatively impact the phenotypic response in a biological system. Furthermore, many existing methods do not utilize the wealth of bioactivity information held for orthologue species, so we also focused on an in-depth analysis of orthologue bioactivity data and its relevance and applicability towards expanding compound and target bioactivity space for predictive studies. The realized protocol was trained on 13,918,879 compound-target pairs, comprises 1,651 targets, and has been made available for public use on GitHub. The methodology was then applied to aid the target deconvolution of AstraZeneca phenotypic readouts, in particular to rationalize cytotoxicity and cytostaticity in the High-Throughput Screening (HTS) collection. Results from this work highlighted which targets are frequently linked to the cytotoxicity and cytostaticity of chemical structures, and provided insight into which compounds to select or remove from the collection for future screening projects. Overall, this project has furthered the field of in silico target deconvolution by improving the performance and applicability of current protocols and by rationalizing cytotoxicity, which has been shown to influence attrition in clinical trials.
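For the calibration component, a sketch using scikit-learn's CalibratedClassifierCV is given below: method="sigmoid" corresponds to Platt scaling and method="isotonic" to isotonic regression scaling (Venn-Abers predictors are not in scikit-learn and would need a separate implementation). The data set and base classifier are invented stand-ins for the thesis's bioactivity data and models.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Stand-in bioactivity data: fingerprint-like features, active/inactive labels.
X, y = make_classification(n_samples=4000, n_features=100, random_state=17)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=17)

base = RandomForestClassifier(n_estimators=100, random_state=17)
# "sigmoid" is Platt scaling; "isotonic" is isotonic regression scaling.
# A lower Brier score indicates better-calibrated probabilities.
for method in ("sigmoid", "isotonic"):
    calibrated = CalibratedClassifierCV(base, method=method, cv=3)
    calibrated.fit(X_tr, y_tr)
    probs = calibrated.predict_proba(X_te)[:, 1]
    print(method, f"Brier score: {brier_score_loss(y_te, probs):.4f}")
```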
