141

High-Order Inference, Ranking, and Regularization Path for Structured SVM / Inférence d'ordre supérieur, Classement, et Chemin de Régularisation pour les SVM Structurés

Dokania, Puneet Kumar 30 May 2016 (has links)
This thesis develops novel methods to enable the use of structured prediction in computer vision and medical imaging. Specifically, our contributions are fourfold. First, we propose a new family of high-order potentials that encourage parsimony in the labeling, and enable its use by designing an accurate graph-cuts-based algorithm to minimize the corresponding energy function. Second, we show how the average precision SVM formulation can be extended to incorporate high-order information for ranking. Third, we propose a novel regularization path algorithm for structured SVM. Fourth, we show how the weakly supervised framework of latent SVM can be employed to learn the parameters for the challenging deformable registration problem.

In more detail, the first part of the thesis investigates the high-order inference problem. Specifically, we present a novel family of discrete energy minimization problems, which we call parsimonious labeling. It is a natural generalization of the well-known metric labeling problems to high-order potentials. In addition, we propose a generalization of the Pn-Potts model, which we call the hierarchical Pn-Potts model. Finally, we propose parallelizable move-making algorithms with very strong multiplicative bounds for the optimization of the hierarchical Pn-Potts model and parsimonious labeling.

The second part of the thesis investigates the ranking problem while using high-order information. Specifically, we introduce two alternative frameworks to incorporate high-order information into ranking tasks. The first framework, which we call high-order binary SVM (HOB-SVM), optimizes a convex upper bound on the weighted 0-1 loss while incorporating high-order information using a joint feature map. The rank list for HOB-SVM is obtained by sorting samples by max-marginal-based scores. The second framework, which we call high-order AP-SVM (HOAP-SVM), takes its inspiration from AP-SVM and HOB-SVM (our first framework). Like AP-SVM, it optimizes an upper bound on average precision. However, unlike AP-SVM and like HOB-SVM, it can also encode high-order information. The main disadvantage of HOAP-SVM is that estimating its parameters requires solving a difference-of-convex program. We show how a local optimum of the HOAP-SVM learning problem can be computed efficiently by the concave-convex procedure. Using standard datasets, we empirically demonstrate that HOAP-SVM outperforms the baselines by effectively utilizing high-order information while optimizing the correct loss function.

In the third part of the thesis, we propose a new algorithm, SSVM-RP, to obtain an epsilon-optimal regularization path for the structured SVM. We also propose intuitive variants of the Block-Coordinate Frank-Wolfe algorithm (BCFW) for faster optimization of SSVM-RP. In addition, we propose a principled approach to optimizing the SSVM with additional box constraints using BCFW and its variants. Finally, we propose a regularization path algorithm for the SSVM with additional positivity/negativity constraints.

In the fourth and last part of the thesis (Appendix), we propose a novel weakly supervised discriminative algorithm for learning context-specific registration metrics as a linear combination of conventional metrics. Conventional metrics can cope only partially, depending on the clinical context, with tissue anatomical properties. In this work we seek to determine anatomy- and tissue-specific metrics as a context-specific aggregation (linear combination) of known metrics. We propose a weakly supervised learning algorithm for estimating these parameters conditionally on the data's semantic classes, using a weakly annotated training dataset. We show the efficacy of our approach on three highly challenging datasets in the field of medical imaging, which vary in terms of anatomical structures and image modalities.
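The ranking objective above centers on average precision: AP-SVM and HOAP-SVM optimize an upper bound on it, and HOB-SVM produces a rank list by sorting samples by score. As a small, self-contained illustration (not the authors' implementation), average precision of a rank list is the mean of the precision values at each relevant item:

```python
def average_precision(ranked_labels):
    """AP of a rank list: ranked_labels[i] is 1 if the i-th ranked
    sample is relevant (positive), else 0."""
    hits, ap = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            ap += hits / rank  # precision at this relevant item
    return ap / hits if hits else 0.0

def rank_by_score(scores, labels):
    """Sort labels by descending classifier score, as HOB-SVM does with
    max-marginal-based scores (the scores here are illustrative)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [labels[i] for i in order]
```

For example, `average_precision(rank_by_score([0.9, 0.2, 0.7], [1, 0, 1]))` ranks both positives first and yields an AP of 1.0; swapping one positive below a negative lowers it, which is exactly the loss behaviour these models bound.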
142

Multi Criteria Mapping Based on SVM and Clustering Methods

Diddikadi, Abhishek 09 November 2015 (has links)
There are many ways to automate the application process, for example with commercial software used in large organizations to scan bills and forms, but such applications only handle static frames or formats. In our application, we try to automate non-static frames, since the study certificates we receive come from different countries and universities. Every university has its own certificate format, so we develop a new application that can work for all frames or formats. As we observe, many applicants come from the same university, which has a common certificate format; with this kind of tool we can analyze such certificates in a simple way within very little time. To make the process more accurate, we implement SVM and clustering methods. With these methods we can accurately map the courses in a certificate either to the ASE study path or to an exclude list. A grade calculation is performed for courses mapped to the ASE list by separating the data for labs and courses. At the end, we award points, including points for ASE-related courses, work experience, specialization certificates, and German language skills. Finally, these points are provided to the chair to select applicants for the ASE master course.
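The core mapping step described above can be sketched in a few lines: match an extracted course title to the closest known ASE course by token overlap, and send it to the exclude list when nothing is similar enough. This toy uses bag-of-words cosine similarity in place of the SVM/clustering pipeline, and the course names and threshold are illustrative assumptions:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in a if t in b)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def map_course(title, ase_courses, threshold=0.5):
    """Map a certificate course title to the closest ASE course,
    or return None (exclude list) if nothing is similar enough."""
    bag = Counter(title.lower().split())
    best = max(ase_courses, key=lambda c: cosine(bag, Counter(c.lower().split())))
    return best if cosine(bag, Counter(best.lower().split())) >= threshold else None
```

A course like "machine learning basics" would map to a hypothetical ASE entry "machine learning", while an unrelated title falls below the threshold and lands on the exclude list.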
143

Intent classification through conversational interfaces : Classification within a small domain

Lekic, Sasa, Liu, Kasper January 2019 (has links)
Natural language processing and machine learning are subjects under intense study nowadays. These fields keep spreading and are more interrelated than ever before. A case in point is text classification, an application of machine learning (ML) within natural language processing (NLP). Although these subjects have evolved over recent years, they still have problems to be considered. Some concern the computing power these techniques require, others the amount of training data they need. The research problem addressed in this thesis is the lack of knowledge about whether machine learning techniques such as Word2Vec, Bidirectional Encoder Representations from Transformers (BERT), and a support vector machine (SVM) classifier can be used for text classification given only a small training set. Furthermore, it is not known whether these techniques can be run on regular laptops. To address this, the main purpose of this thesis was to develop two separate conversational interfaces utilizing text classification techniques. These interfaces, given user input, can recognise the intent behind it, viz. classify the input sentence within a small set of pre-defined categories. First, a conversational interface utilizing Word2Vec and an SVM classifier was developed. Second, an interface utilizing BERT and an SVM classifier was developed. The goal of the thesis was to determine whether a small dataset can be used for intent classification and with what accuracy, and whether this can be done on regular laptops. The research reported in this thesis followed a standard applied research method. The main purpose was achieved and the two conversational interfaces were developed.
Regarding the conversational interface utilizing a pre-trained Word2Vec model and an SVM classifier, the main results showed that it can be used for intent classification with an accuracy of 60%, and that it can be run on regular computers. Concerning the conversational interface utilizing BERT and an SVM classifier, the results showed that this interface cannot be trained and run on regular laptops: the training ran for over 24 hours and then crashed. The results showed that it is possible to make a conversational interface able to classify intents given only a small training set. However, due to the small training set, and consequently low accuracy, this conversational interface is not a suitable option for important tasks, but can be used for some non-critical classification tasks.
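The intent classification task described above can be made concrete with a toy sketch: embed each training sentence as a bag of tokens, aggregate per intent, and assign a new sentence to the intent whose vocabulary it overlaps most. This stands in for the Word2Vec/BERT embedding plus SVM classifier of the thesis; the intents and sentences are invented for illustration:

```python
from collections import Counter

def train_intents(examples):
    """examples: {intent: [sentence, ...]} -> per-intent token counts
    (a crude stand-in for averaged Word2Vec embeddings)."""
    return {intent: Counter(w for s in sents for w in s.lower().split())
            for intent, sents in examples.items()}

def classify(sentence, centroids):
    """Assign the intent with the highest token-overlap score
    (a nearest-prototype rule instead of an SVM decision function)."""
    words = set(sentence.lower().split())
    return max(centroids, key=lambda i: sum(centroids[i][w] for w in words))
```

Even this crude scheme shows why a small, well-separated set of intents is learnable from few examples, and why accuracy degrades as intents share vocabulary.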
144

Automatic Classification of Conditions for Grants in Appropriation Directions of Government Agencies

Wallerö, Emma January 2022 (has links)
This study explores the possibilities of classifying language as governing or not. The basic premise is to examine how detecting and quantifying governing conditions in thousands of financial grants in appropriation directions can be performed automatically, and to create a data set for machine learning on this text classification task. In this study, automatic classification is performed along with an annotation process for extracting and labelling data. The classification task mainly aims to divide conditions into those that govern the conduct of the specific agency and those that do not. The data consist of text from the chapter of the appropriation directions concerning financial grants. The text is split into sentences, keeping only sentences longer than 15 words. An iterative annotation process is then performed to obtain labelled conditions, involving three expert annotators for the final data set and layman annotations for initial experiments. Given the data extracted from the annotation process, SVM, BiLSTM, and KB-BERT classifiers are trained and evaluated. All models are evaluated using no context information, with bullet points as an exception, where a preceding, generally descriptive sentence is included. Apart from this default input representation, two alternatives are evaluated: including the preceding sentence along with the target sentence, and adding the specific agency to the target sentence. The final inter-annotator agreement was not optimal, with Cohen's Kappa scores interpretable as moderate agreement. By using a majority vote for the test set, the effect of the non-optimal agreement was somewhat mitigated for that specific set.
The best-performing model across all input representations was KB-BERT using no context information, with an F1-score of 0.81 and an accuracy of 0.89 on the test set. All models performed better on sentences classified as governing, which might be partly due to the final annotated data sets being skewed. Possible future studies include further iterative annotation, working towards a clear and as objective as possible definition of a governing condition, and exploring data augmentation to counteract the uneven class distribution in the final data sets.
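The inter-annotator agreement discussed above is typically quantified with Cohen's Kappa, which corrects observed agreement for the agreement expected by chance. A minimal two-annotator implementation (illustrative, not the study's code):

```python
def cohens_kappa(ann_a, ann_b):
    """Cohen's Kappa for two annotators' label sequences:
    (p_observed - p_expected) / (1 - p_expected)."""
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    p_observed = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    labels = set(ann_a) | set(ann_b)
    # chance agreement from each annotator's marginal label frequencies
    p_expected = sum((ann_a.count(l) / n) * (ann_b.count(l) / n) for l in labels)
    if p_expected == 1.0:
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)
```

On a common (though debated) scale, scores of roughly 0.41 to 0.60 are read as "moderate agreement", which is the range the study reports.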
145

6G RF Waveform with AI for Human Presence Detection in Indoor Environments

Stratigi, Eirini January 2022 (has links)
Wireless communication equipment is widely available, and the number of transmitters and receivers keeps increasing. In addition to communications, wireless nodes can be used for sensing. This project focuses on human presence detection in indoor environments using measurements such as channel state information (CSI) that can be extracted from radio receivers and labeled using a camera and AI computer vision techniques (the YOLO framework). Our goal is to determine whether a room is empty or contains one or two people by utilizing machine learning algorithms. We have selected SVM (support vector machines) and CNN (convolutional neural networks). These methods are evaluated in different scenarios: different locations, bandwidths of 20, 40, and 120 MHz, carrier frequencies of 2.4 and 5 GHz, high/low SNR values, and different antenna configurations (MIMO, SIMO, SISO). Both methods perform very well for classification; in particular, the CNN performs better than the SVM at low SNR. We found that some of the measurements appeared to be outliers, and the clustering algorithm DBSCAN was used to identify them. Last but not least, we explore whether radio can complement computer vision in presence detection, since radio waves may propagate through walls and opaque obstacles.
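The physical intuition behind CSI-based presence detection is that people moving in a room perturb the radio channel, raising the temporal variance of CSI amplitudes. The thesis trains SVM and CNN classifiers on such features; as a self-contained toy (window length, threshold, and data all invented for illustration), even a variance threshold separates an idealized empty room from an occupied one:

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def presence_score(csi_window):
    """Temporal variance of CSI amplitudes over a window; motion in the
    room perturbs the channel and raises this score."""
    return variance(csi_window)

def detect_presence(csi_window, threshold=0.05):
    """Toy detector: flag the room as occupied above the threshold.
    A real system feeds richer CSI features to an SVM or CNN."""
    return presence_score(csi_window) > threshold
```

In practice the classifiers learn far subtler patterns (per-subcarrier structure, antenna diversity), which is why they remain usable at low SNR where a single threshold fails.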
146

Supervised Failure Diagnosis of Clustered Logs from Microservice Tests / Övervakad feldiagnos av klustrade loggar från tester på mikrotjänster

Strömdahl, Amanda January 2023 (has links)
Pinpointing the source of a software failure based on log files can be a time-consuming process. Automated log analysis tools are meant to streamline such processes and can be used for tasks like failure diagnosis. This thesis evaluates three supervised models for failure diagnosis of clustered log data. The goal of the thesis is to compare the performance of the models on industry data, as a way to investigate whether the chosen ML techniques are suitable in the context of automated log analysis. A Random Forest, an SVM, and an MLP are trained on a dataset of 194 failed executions of tests on microservices, each of which resulted in a large collection of logs. The models are tuned with random search and compared in terms of precision, recall, F1-score, hold-out accuracy, and 5-fold cross-validation accuracy. The hold-out accuracy is calculated as the mean over 50 hold-out data splits, and the cross-validation accuracy is computed separately from a single set of folds. The results show that the Random Forest scores highest in mean hold-out accuracy (90%), compared to the SVM (86%) and the Neural Network (85%). The mean cross-validation accuracy is highest for the SVM (95%), closely followed by the Random Forest (94%), and lastly the Neural Network (85%). The precision, recall, and F1-scores are stable and consistent with the hold-out results, although the precision results are slightly higher than the other two measures. According to this evaluation, the Random Forest has the overall highest performance on the dataset, considering the hold-out and cross-validation accuracies together with the fact that it has the lowest complexity, and thus the shortest training time, of the considered solutions. All in all, the results of the thesis demonstrate that supervised learning is a promising approach to automating log analysis.
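The two evaluation protocols compared above differ only in how the data is partitioned: repeated hold-out averages accuracy over many random train/test splits, while k-fold cross-validation uses each sample exactly once as test data. Illustrative split helpers (not the thesis code; fold assignment kept deterministic):

```python
import random

def kfold_indices(n, k):
    """Partition indices 0..n-1 into k near-equal folds. Each index
    appears in exactly one fold, so every sample is tested once."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def holdout_split(n, test_frac=0.2, seed=0):
    """One random hold-out split; repeating with different seeds and
    averaging test accuracy gives the 'mean hold-out' protocol."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * (1 - test_frac))
    return idx[:cut], idx[cut:]
```

With only 194 executions, the choice of protocol visibly moves the numbers, which is one plausible reason the SVM and Random Forest swap places between the two accuracy columns.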
147

A Multivariate Data Stream Anomaly Detection Framework

Jin, Jiakun January 2016 (has links)
High-speed stream anomaly detection is an important technology used in many industry applications, such as monitoring system health, detecting financial fraud, and monitoring customers' unusual behavior. In those scenarios multivariate data arrives at high speed and needs to be processed in real time. Since solutions for high-speed multivariate stream anomaly detection are still under development, the objective of this thesis is to introduce a framework for testing different anomaly detection algorithms. Multivariate anomaly detection usually includes two major steps: point anomaly detection and stream anomaly detection. Point anomaly detection transforms multivariate feature data into an anomaly score based on the recent stream of data. Stream anomaly detectors then detect stream anomalies based on the recent anomaly scores generated by the point anomaly detector. This thesis presents a flexible framework that allows the easy integration and evaluation of different data sources and point and stream anomaly detection algorithms. To demonstrate the capabilities of the framework, we consider different scenarios with generators of artificial data, real industry data sets, and time series data; the point anomaly detectors PYISC, SVM, and LOF; and the stream anomaly detectors DDM, CUSUM, and FCWM. The evaluation results show that, for point anomaly detection, PYISC and LOF perform well when the distributions of the features are known, while SVM performs well even when they are not. For the stream anomaly detectors, DDM can produce false detections, and CUSUM can fail when the stream anomalies increase slowly, while FCWM performs best, with a very low probability of failure.
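Of the stream detectors named above, CUSUM is the easiest to sketch: it accumulates deviations of the anomaly score above a target mean and raises an alarm once the cumulative sum crosses a threshold (parameters here are illustrative, not the thesis settings). The noted weakness — failure on slowly increasing anomalies — shows up directly: deviations that stay below the drift term never accumulate.

```python
class Cusum:
    """One-sided CUSUM over a stream of point-anomaly scores."""
    def __init__(self, target=0.0, drift=0.5, threshold=3.0):
        self.target, self.drift, self.threshold = target, drift, threshold
        self.g = 0.0  # cumulative sum of excess deviations
    def update(self, score):
        # accumulate only the part of the score exceeding target + drift
        self.g = max(0.0, self.g + score - self.target - self.drift)
        return self.g > self.threshold
```

Feeding normal scores near the target keeps `g` at zero; a sudden jump trips the alarm within a few samples, while scores creeping up by less than the drift per step never do.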
148

Machine Learning Based Failure Detection in Data Centers

Piran Nanekaran, Negin January 2020 (has links)
This work proposes a new approach to fast detection of abnormal behaviour of cooling, IT, and power distribution systems in micro data centers based on machine learning techniques. Conventional protection of micro data centers focuses on monitoring individual parameters, such as temperature at different locations; when these parameters reach certain high values, an alarm is triggered. This research employs machine learning techniques to extract the normal and abnormal behaviour of the cooling and IT systems. The developed data acquisition system, together with unsupervised learning methods, quickly learns the physical dynamics of normal operation and can detect deviations from such behaviour. This provides an efficient way not only to produce a health index for the micro data center, but also to build a rich label-logging system that is used for the supervised learning methods. The effectiveness of the proposed detection technique is evaluated on a micro data center placed at the Computing Infrastructure Research Center (CIRC) in McMaster Innovation Park (MIP), McMaster University. / Thesis / Master of Science (MSc)
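One simple form of the idea above — learning normal behaviour and flagging deviations, rather than fixed per-parameter alarm thresholds — is a running mean/variance model per sensor with a k-sigma rule. This is a sketch under that assumption (Welford's online algorithm), not the system described in the thesis:

```python
import math

class NormalBehaviour:
    """Learn the mean/std of a signal during normal operation (Welford's
    online update), then flag readings more than k std devs away."""
    def __init__(self, k=3.0):
        self.n, self.mean, self.m2, self.k = 0, 0.0, 0.0, k
    def learn(self, x):
        # online update of mean and sum of squared deviations
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)
    def is_abnormal(self, x):
        if self.n < 2:
            return False  # not enough data to judge
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) > self.k * std
```

Unlike a fixed high-temperature alarm, the learned threshold adapts to each sensor's normal operating range, which is what lets such a model detect subtle deviations early.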
149

Automatic Text Classification of Research Grant Applications / Automatisk textklassificering av forskningsbidragsansökningar

Lindqvist, Robin January 2024 (has links)
This study aims to construct a state-of-the-art classifier model and compare it against a large language model. A variation of SVM called LinearSVC was utilised, and the BERT model bert-base-uncased was used. The data, provided by the Swedish Research Council, consisted of research grant applications. The research grant applications were divided into two groups, which were further divided into several subgroups. The subgroups represent research fields such as computer science and applied physics. Significant class imbalances were present, with some classes having only a tenth of the applications of the largest class. To address these imbalances, a new dataset was created using data that had been randomly oversampled. The models were trained and tested on their ability to correctly assign a subgroup to a research grant application. Results indicate that the BERT model outperformed the SVM model on the original dataset, but not on the balanced dataset. Furthermore, the BERT model's performance decreased when transitioning from the original to the balanced dataset, due to overfitting or randomness.
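The random oversampling mentioned above can be sketched in a few lines: minority-class samples are repeated until every class matches the majority count. (Real random oversampling draws with replacement; this sketch duplicates cyclically to stay deterministic, and the data is invented for illustration.)

```python
from collections import defaultdict
from itertools import cycle, islice

def oversample(X, y):
    """Duplicate minority-class samples until all classes reach the
    size of the largest class."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    target = max(len(samples) for samples in by_class.values())
    X_out, y_out = [], []
    for label, samples in by_class.items():
        for xi in islice(cycle(samples), target):
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out
```

Because the duplicated samples carry no new information, a high-capacity model like BERT can memorize them, which is consistent with the overfitting the study suspects on the balanced dataset.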
150

A Semi-Supervised Predictive Model to Link Regulatory Regions to Their Target Genes

Hafez, Dina Mohamed January 2015 (has links)
Next-generation sequencing technologies have provided us with a wealth of data profiling a diverse range of biological processes. In an effort to better understand the process of gene regulation, two predictive machine learning models specifically tailored for analyzing gene transcription and polyadenylation are presented.

Transcriptional enhancers are specific DNA sequences that act as "information integration hubs" to confer regulatory requirements on a given cell. These non-coding DNA sequences can regulate genes from long distances, or across chromosomes, and their relationships with their target genes are not limited to one-to-one. With thousands of putative enhancers and fewer than 14,000 protein-coding genes, detecting enhancer-gene pairs becomes a very complex machine learning and data analysis challenge.

In order to predict these specific sequences and link them to the genes they regulate, we developed McEnhancer. Using DNase I sensitivity data and annotated in-situ hybridization gene expression clusters, McEnhancer builds interpolated Markov models to learn the enriched sequence content of known enhancer-gene pairs and predicts unknown interactions in a semi-supervised learning algorithm. Classification of predicted relationships was 73-98% accurate for gene sets with varying levels of initial known examples. Predicted interactions showed a great overlap when compared to Hi-C-identified interactions. Enrichment of known functionally related TF binding motifs and enhancer-associated histone modification marks, along with the corresponding developmental time point, was highly evident.

On the other hand, pre-mRNA cleavage and polyadenylation is an essential step for 3'-end maturation and the subsequent stability and degradation of mRNAs. This process is highly controlled by cis-regulatory elements surrounding the cleavage site (polyA site), which are frequently constrained by sequence content and position. More than 50% of human transcripts have multiple functional polyA sites, and the specific use of alternative polyA sites (APA) results in isoforms with variable 3'-UTRs, thus potentially affecting gene regulation. Elucidating the regulatory mechanisms underlying differential polyA preferences in multiple cell types has been hindered by the lack of appropriate tests for determining APAs with significant differences across multiple libraries.

We specified a linear-effects regression model to identify tissue-specific biases indicating regulated APA; the significance of differences between tissue types was assessed by an appropriately designed permutation test. This combination allowed us to identify highly specific subsets of APA events in the individual tissue types. Predictive kernel-based SVM models successfully classified constitutive polyA sites against a biologically relevant background (auROC = 99.6%), as well as tissue-specific regulated sets from each other. The main cis-regulatory elements described for polyadenylation were found to be a strong, and highly informative, hallmark of constitutive sites only. Tissue-specific regulated sites were found to contain other regulatory motifs, with the canonical PAS signal being nearly absent at brain-specific sites. We applied this model to data for SRp20, an RNA-binding protein that might be involved in oncogene activation, and obtained interesting insights.

Together, these two models contribute to the understanding of enhancers and the key role they play in regulating tissue-specific expression patterns during development, and provide a better understanding of the diversity of post-transcriptional gene regulation in multiple tissue types. / Dissertation
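The kernel-based SVM models above classify polyA sites by sequence content. A common feature representation for such sequence classifiers is a vector of overlapping k-mer counts, sketched here (illustrative, not the study's actual feature set or kernel):

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Overlapping k-mer counts of a DNA sequence, a standard feature
    vector for sequence-content classifiers such as kernel SVMs."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

With k = 6, a strong cis-regulatory element like the canonical PAS hexamer AATAAA appears directly as a feature, so its near absence at brain-specific sites would show up as a near-zero count in those feature vectors.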
