141 |
Multi Criteria Mapping Based on SVM and Clustering Methods. Diddikadi, Abhishek. 09 November 2015 (has links)
There are many ways to automate the application process, for example with the commercial software used in large organizations to scan bills and forms, but such tools only handle static frames or formats. Our application attempts to automate non-static frames, since the study certificates we receive come from different countries and universities, and every university has its own certificate format. We therefore develop a new application that works across all of these frames and formats. Since many applicants come from the same university, and thus share a common certificate format, a tool of this kind can analyze such certificates simply and in very little time. To make the process more accurate, we apply SVM and clustering methods, which let us accurately map the courses on a certificate either to the ASE study path or to an exclude list. For courses mapped to the ASE list, a grade calculation is performed that separates the lab and course components. Finally, points are awarded for ASE-related courses, work experience, specialization certificates and German language skills, and these points are provided to the chair to select applicants for the ASE master's program.
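The course-to-study-path mapping described above can be framed as text classification. The following is an illustrative sketch only (not the thesis implementation): a linear SVM over TF-IDF features of course titles, where the course titles and the "SE"/"Math" study-path labels are invented for the example.

```python
# Illustrative sketch: mapping course titles from certificates onto
# hypothetical study-path categories with a linear SVM over TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data: course titles labelled with study-path areas.
courses = [
    "Introduction to Programming", "Data Structures and Algorithms",
    "Software Testing", "Requirements Engineering",
    "Linear Algebra", "Probability and Statistics",
]
labels = ["SE", "SE", "SE", "SE", "Math", "Math"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(courses, labels)

print(model.predict(["Software Quality Assurance"]))
```

With real certificates, the labels would come from the ASE study-path and exclude lists rather than being hand-written as here.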
|
142 |
Intent classification through conversational interfaces : Classification within a small domain. Lekic, Sasa; Liu, Kasper. January 2019 (has links)
Natural language processing and machine learning are subjects undergoing intense study nowadays. These fields are continually spreading, and are more interrelated than ever before. A case in point is text classification, which is an instance of machine learning (ML) application in natural language processing (NLP). Although these subjects have evolved over recent years, they still have some problems that have to be considered. Some are related to the computing power that techniques from these subjects require, whereas others concern how much training data they require. The research problem addressed in this thesis regards the lack of knowledge on whether machine learning techniques such as Word2Vec, Bidirectional Encoder Representations from Transformers (BERT) and a Support Vector Machine (SVM) classifier can be used for text classification when provided with only a small training set. Furthermore, it is not known whether these techniques can be run on regular laptops. To solve the research problem, the main purpose of this thesis was to develop two separate conversational interfaces utilizing text classification techniques. These interfaces, provided with user input, can recognise the intent behind it, viz. classify the input sentence within a small set of pre-defined categories. Firstly, a conversational interface utilizing Word2Vec and an SVM classifier was developed. Secondly, an interface utilizing BERT and an SVM classifier was developed. The goal of the thesis was to determine whether a small dataset can be used for intent classification, with what accuracy, and whether this can be done on regular laptops. The research reported in this thesis followed a standard applied research method. The main purpose was achieved and the two conversational interfaces were developed.
Regarding the conversational interface utilizing the pre-trained Word2Vec dataset and an SVM classifier, the main results showed that it can be used for intent classification with an accuracy of 60%, and that it can be run on regular computers. Concerning the conversational interface utilizing BERT and an SVM classifier, the results showed that this interface cannot be trained and run on regular laptops: the training ran for over 24 hours and then crashed. The results showed that it is possible to build a conversational interface that can classify intents given only a small training set. However, due to the small training set, and the consequently low accuracy, this conversational interface is not a suitable option for important tasks, but it can be used for some non-critical classification tasks.
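The first interface's approach, sentence vectors built from Word2Vec embeddings and classified with an SVM, can be sketched as follows. This is a toy illustration: the pretrained embeddings are replaced by random stand-in vectors, and the intents and example sentences are invented.

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in word vectors (the thesis used pretrained Word2Vec embeddings).
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=8) for w in
         "book a table cancel my reservation what time do you open".split()}

def sent_vec(sentence):
    """Average the vectors of known words; zero vector if none are known."""
    vecs = [vocab[w] for w in sentence.lower().split() if w in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(8)

X = np.array([sent_vec(s) for s in
              ["book a table", "book my table", "cancel my reservation",
               "cancel reservation", "what time do you open", "what time open"]])
y = ["booking", "booking", "cancel", "cancel", "hours", "hours"]

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([sent_vec("cancel my table reservation")]))
```

The small training set mirrors the thesis setting: with only a handful of sentences per intent, accuracy is limited, which matches the reported 60%.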
|
143 |
Automatic Classification of Conditions for Grants in Appropriation Directions of Government Agencies. Wallerö, Emma. January 2022 (has links)
This study explores the possibilities of classifying language as governing or not. The basic premise is to examine how detecting and quantifying governing conditions in thousands of financial grants in appropriation directions can be performed automatically, and to create a data set for machine learning on this text classification task. In this study, automatic classification is performed along with an annotation process for extracting and labelling data. Automatic classification can be performed using a variety of data, methods and tasks. The classification task aims mainly to divide conditions into those that govern the conduct of the specific agency and those that do not. The data consist of text from the chapter of the appropriation directions concerning financial grants. The text is split into sentences, keeping only sentences longer than 15 words. An iterative annotation process is then performed to obtain labelled conditions, involving three expert annotators for the final data set and layman annotations for initial experiments. Given the data extracted from the annotation process, SVM, BiLSTM and KB-BERT classifiers are trained and evaluated. All models are evaluated without context information, with bullet points as an exception, for which a preceding, generally descriptive sentence is included. Apart from this default input representation, including the preceding sentence along with the target sentence, as well as adding the specific agency to the target sentence, are evaluated as alternative representation types. The final inter-annotator agreement was not optimal, with Cohen's Kappa scores that can be interpreted as moderate agreement. Using a majority vote for the test set somewhat mitigated the non-optimal agreement for that specific set.
The best performing model across all input representation types was KB-BERT without context information, with an F1-score of 0.81 and an accuracy of 0.89 on the test set. All models performed better on sentences classed as governing, which might partially be due to the final annotated data sets being skewed. Possible future studies include further iterative annotation, working towards a clear and maximally objective definition of a governing condition, and exploring data augmentation to counteract the uneven class distribution in the final data sets.
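The inter-annotator agreement referred to above is conventionally measured with Cohen's Kappa. A minimal self-contained sketch for two annotators follows; the "gov"/"not" labels are invented for the example.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items:
    observed agreement corrected for chance agreement."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented labels: "gov" = governing condition, "not" = non-governing.
ann1 = ["gov", "gov", "not", "gov", "not", "not", "gov", "not"]
ann2 = ["gov", "not", "not", "gov", "not", "gov", "gov", "not"]
print(round(cohens_kappa(ann1, ann2), 3))
```

A value of 0.5, as produced by these toy labels, falls in the range commonly read as moderate agreement, which is the situation the abstract describes.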
|
144 |
6G RF Waveform with AI for Human Presence Detection in Indoor Environments. Stratigi, Eirini. January 2022 (has links)
Wireless communication equipment is widely available, and the number of transmitters and receivers keeps increasing. In addition to communications, wireless nodes can be used for sensing. This project focuses on human presence detection in indoor environments using measurements such as CSI, which can be extracted from radio receivers and labelled using a camera and AI computer vision techniques (the YOLO framework). Our goal is to determine whether a room is empty or holds one or two people by utilizing machine learning algorithms. We have selected SVM (Support Vector Machines) and CNN (Convolutional Neural Networks). These methods are evaluated in different scenarios: different locations, bandwidths of 20, 40 and 120 MHz, carrier frequencies of 2.4 and 5 GHz, high/low SNR values, and different antenna configurations (MIMO, SIMO, SISO). Both methods perform very well for classification, and the CNN in particular performs better than the SVM at low SNR. We found that some of the measurements appeared to be outliers, and the clustering algorithm DBSCAN was used to identify them. Last but not least, we explore whether radio can complement computer vision in presence detection, since radio waves may propagate through walls and opaque obstacles.
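The outlier-identification step with DBSCAN can be sketched as follows. The four-dimensional vectors stand in for per-measurement CSI features, and the `eps`/`min_samples` values are arbitrary choices for the toy data, not parameters from the thesis.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Toy stand-in for CSI feature vectors: a dense cluster of normal
# measurements plus a few far-away faulty ones.
normal = rng.normal(0.0, 0.2, size=(60, 4))
outliers = rng.normal(4.0, 0.2, size=(3, 4))
X = np.vstack([normal, outliers])

# DBSCAN labels low-density points as noise (label -1).
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(int((labels == -1).sum()))  # number of points flagged as noise
```

Because the three outliers form a group smaller than `min_samples` and lie far from the main cluster, DBSCAN marks them as noise, which is how suspect measurements can be filtered before training the classifiers.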
|
145 |
Supervised Failure Diagnosis of Clustered Logs from Microservice Tests / Övervakad feldiagnos av klustrade loggar från tester på mikrotjänster. Strömdahl, Amanda. January 2023 (has links)
Pinpointing the source of a software failure based on log files can be a time-consuming process. Automated log analysis tools are meant to streamline such processes, and can be used for tasks like failure diagnosis. This thesis evaluates three supervised models for failure diagnosis of clustered log data. The goal of the thesis is to compare the performance of the models on industry data, as a way to investigate whether the chosen ML techniques are suitable in the context of automated log analysis. A Random Forest, an SVM and an MLP are generated from a dataset of 194 failed executions of tests on microservices, each of which resulted in a large collection of logs. The models are tuned with random search and compared in terms of precision, recall, F1-score, hold-out accuracy and 5-fold cross-validation accuracy. The hold-out accuracy is calculated as a mean over 50 hold-out data splits, and the cross-validation accuracy is computed separately from a single set of folds. The results show that the Random Forest scores highest in terms of mean hold-out accuracy (90%), compared to the SVM (86%) and the neural network (85%). The mean cross-validation accuracy is highest for the SVM (95%), closely followed by the Random Forest (94%), and lastly the neural network (85%). The precision, recall and F1-scores are stable and consistent with the hold-out results, although the precision results are slightly higher than the other two measures. According to this evaluation, the Random Forest has the overall highest performance on the dataset, considering the hold-out and cross-validation accuracies as well as the fact that it has the lowest complexity, and thus the shortest training time, of the considered solutions. All in all, the results of the thesis demonstrate that supervised learning is a promising approach to automating log analysis.
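The evaluation protocol above (repeated hold-out splits plus k-fold cross-validation over the three model types) can be sketched with scikit-learn on synthetic data. This is a schematic, not the thesis pipeline: 10 splits are used instead of 50 to keep the sketch fast, and the hyperparameters are defaults rather than the random-search results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the clustered-log feature matrix.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "svm": SVC(),
    "mlp": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
}

for name, model in models.items():
    holdout = []
    for seed in range(10):  # the thesis averaged over 50 hold-out splits
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3,
                                              random_state=seed)
        holdout.append(model.fit(Xtr, ytr).score(Xte, yte))
    cv = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: hold-out={np.mean(holdout):.2f}, 5-fold CV={cv:.2f}")
```

Averaging over many hold-out splits reduces the variance of the accuracy estimate, which is why the thesis reports a mean over 50 splits alongside a single cross-validation run.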
|
146 |
A Multivariate Data Stream Anomaly Detection Framework. Jin, Jiakun. January 2016 (has links)
High-speed stream anomaly detection is an important technology used in many industry applications, such as monitoring system health, detecting financial fraud, and monitoring customers' unusual behavior. In those scenarios multivariate data arrives at high speed and needs to be processed in real time. Since solutions for high-speed multivariate stream anomaly detection are still under development, the objective of this thesis is to introduce a framework for testing different anomaly detection algorithms. Multivariate anomaly detection usually includes two major steps: point anomaly detection and stream anomaly detection. Point anomaly detection transforms multivariate feature data into an anomaly score according to the recent stream of data. Stream anomaly detectors then detect stream anomalies based on the recent anomaly scores generated by the point anomaly detector. This thesis presents a flexible framework that allows the easy integration and evaluation of different data sources and point and stream anomaly detection algorithms. To demonstrate the capabilities of the framework, we consider different scenarios with generators of artificial data, real industry data sets and time series data; the point anomaly detectors PYISC, SVM and LOF; and the stream anomaly detectors DDM, CUSUM and FCWM. The evaluation results show that, among the point anomaly detectors, PYISC and LOF perform well when the distributions of the features are known, while SVM performs well even when they are not. Among the stream anomaly detectors, DDM has some probability of false detections, and CUSUM has some probability of failing when the stream anomalies increase slowly, while FCWM performs best, with a very low probability of failing.
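The two-step design above, point anomaly scores feeding a stream detector, can be illustrated with a minimal one-sided CUSUM over a stream of anomaly scores. The drift and threshold values are arbitrary choices for the sketch, not parameters from the thesis.

```python
def cusum(scores, drift=0.1, threshold=3.0):
    """One-sided CUSUM: return the index at which the accumulated
    positive deviation of the anomaly score exceeds the threshold,
    or None if no stream anomaly is detected."""
    s = 0.0
    for i, x in enumerate(scores):
        s = max(0.0, s + x - drift)  # accumulate deviation above the drift
        if s > threshold:
            return i
    return None

# Point-anomaly scores: a quiet stream, then a sustained upward shift.
stream = [0.05] * 50 + [0.9] * 20
print(cusum(stream))
```

A sustained shift accumulates quickly and is flagged a few samples after it begins; a shift that grows very slowly relative to the drift term may never cross the threshold, which is the failure mode the abstract attributes to CUSUM.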
|
147 |
Machine Learning Based Failure Detection in Data Centers. Piran Nanekaran, Negin. January 2020 (has links)
This work proposes a new approach to fast detection of abnormal behaviour of cooling, IT, and power distribution systems in micro data centers, based on machine learning techniques. Conventional protection of micro data centers focuses on monitoring individual parameters, such as the temperature at different locations, and triggering an alarm when these parameters reach certain high values. This research employs machine learning techniques to extract the normal and abnormal behaviour of the cooling and IT systems. The developed data acquisition system, together with unsupervised learning methods, quickly learns the physical dynamics of normal operation and can detect deviations from such behaviour. This provides an efficient way of producing not only a health index for the micro data center, but also a rich label logging system to be used by the supervised learning methods. The effectiveness of the proposed detection technique is evaluated on a micro data center placed at the Computing Infrastructure Research Center (CIRC) in McMaster Innovation Park (MIP), McMaster University. / Thesis / Master of Science (MSc)
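The idea of learning normal operation in an unsupervised way and scoring deviations as a health index can be sketched with an Isolation Forest. This is a stand-in method chosen for the sketch (the abstract does not name a specific algorithm), and the two-feature sensor data is invented.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
# Toy stand-in for sensor readings in normal operation,
# e.g. (temperature in C, fan speed in RPM).
normal = rng.normal([25.0, 2000.0], [1.0, 50.0], size=(300, 2))

# Fit only on normal behaviour; no failure labels are needed.
detector = IsolationForest(random_state=0).fit(normal)

# Health index: higher (less negative) scores mean more normal behaviour.
ok = detector.score_samples([[25.5, 1990.0]])[0]
hot = detector.score_samples([[60.0, 3500.0]])[0]
print(ok > hot)
```

Readings that fall well outside the learned operating envelope receive markedly lower scores, so thresholding the score yields an alarm without hand-picking per-sensor limits.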
|
148 |
[en] HYBRID SYSTEM FOR RULE EXTRACTION APPLIED TO DIAGNOSIS OF POWER TRANSFORMERS / [pt] SISTEMA HÍBRIDO DE EXTRAÇÃO DE REGRAS APLICADO A DIAGNÓSTICO DE TRANSFORMADORES. CINTIA DE FARIA FERREIRA CARRARO. 28 November 2012 (has links)
[en] This work aims to develop a classifier model based on fuzzy inference
rules, which are extracted from support vector machines (SVMs) and optimized
by a genetic algorithm. The classifier built aims to diagnose power transformers.
The SVMs are learning systems based on statistical learning theory and have
provided good generalization performance on real data sets. SVMs, like artificial
neural networks (NNs), generate a black-box model, that is, a model that does not
explain the process by which its output is obtained. However, for some
applications, the knowledge about how the classification was obtained is as
important as the classification itself. Some proposed methods to reduce or
eliminate this limitation have already been developed, although they are
restricted to the extraction of symbolic rules, i.e. rules containing functions or ranges in
their antecedents. Nevertheless, the interpretability of symbolic rules is still
reduced. In order to increase the interpretability of the rules, the FREx_SVM
model was developed. In this model the fuzzy rules are extracted from trained
SVMs. The FREx_SVM model can be applied to classification problems with n
classes, not being restricted to binary classifications. However, despite the good
performance of the FREx_SVM model in extracting linguistic rules, the
classification performance of the fuzzy classification system obtained is still lower
than the SVM, since the partitions (fuzzy sets) of the input variables are predefined
at the beginning of the process, and are fixed during the rule extraction
process. The goal of this dissertation is, therefore, to extend the FREx_SVM
model, so as to enable the automatic adjustment of the membership functions of
the input variables through genetic algorithms. To assess the performance of the
extended model, case studies were carried out on two databases: the Iris benchmark
and frequency response analysis data. Frequency response analysis is a non-invasive
and non-destructive technique, because it preserves the characteristics
of the equipment. However, the diagnosis is carried out by visual comparison and
requires the assistance of an expert. Often, this diagnosis is subjective and
inconclusive. The automatic adjustment of the membership functions associated
with the input variables reduced the error by up to 13.38 per cent compared to the
configuration without this optimization. In some cases, the classification
performance with membership function optimization even exceeds that
obtained by the SVM itself.
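The kind of fuzzy classifier whose membership functions a genetic algorithm would tune can be sketched minimally with triangular membership functions. The partitions, rules and class names below are toy placeholders, not the FREx_SVM rules extracted in the dissertation.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical fuzzy partitions for one input variable; a genetic
# algorithm would adjust the (a, b, c) triples to reduce classification error.
low = (0.0, 0.0, 0.5)
high = (0.5, 1.0, 1.0)

def classify(x):
    # Two toy rules: IF x is low THEN "normal"; IF x is high THEN "faulty".
    activations = {"normal": tri(x, *low), "faulty": tri(x, *high)}
    return max(activations, key=activations.get)

print(classify(0.2), classify(0.9))
```

Because the class decision depends directly on the membership values, shifting the feet and peaks of the triangles changes the decision boundary, which is exactly the degree of freedom the genetic algorithm exploits.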
|
149 |
Automatic Text Classification of Research Grant Applications / Automatisk textklassificering av forskningsbidragsansökningar. Lindqvist, Robin. January 2024 (has links)
This study aims to construct a state-of-the-art classifier model and compare it against a large language model. A variation of SVM called LinearSVC was utilised, and for the BERT model bert-base-uncased was used. The data, provided by the Swedish Research Council, consisted of research grant applications. The research grant applications were divided into two groups, which were further divided into several subgroups. The subgroups represented research fields such as computer science and applied physics. Significant class imbalances were present, with some classes having only a tenth of the applications of the largest class. To address these imbalances, a new dataset was created using data that had been randomly oversampled. The models were trained and tested on their ability to correctly assign a subgroup to a research grant application. Results indicate that the BERT model outperformed the SVM model on the original dataset, but not on the balanced dataset. Furthermore, the BERT model's performance decreased when transitioning from the original to the balanced dataset, due to either overfitting or randomness.
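The random oversampling used to balance the classes can be sketched together with a LinearSVC text classifier. The documents and the "cs"/"physics" labels are invented stand-ins for the grant applications, and the oversampling is written out by hand for clarity.

```python
import random
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

random.seed(0)
# Hypothetical imbalanced corpus: many CS applications, few physics ones.
docs = [("neural networks for text", "cs")] * 8 + \
       [("quantum spin lattice dynamics", "physics")] * 2

# Random oversampling: duplicate minority-class examples until all
# classes match the size of the largest class.
counts = Counter(label for _, label in docs)
target = max(counts.values())
balanced = list(docs)
for label, n in counts.items():
    pool = [d for d in docs if d[1] == label]
    balanced += random.choices(pool, k=target - n)

X, y = zip(*balanced)
clf = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(X, y)
print(Counter(y))
```

Note that oversampling only duplicates minority examples; it adds no new information, which is one reason a model such as BERT can overfit the balanced set, as the abstract suggests.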
|
150 |
A Semi-Supervised Predictive Model to Link Regulatory Regions to Their Target Genes. Hafez, Dina Mohamed. January 2015 (has links)
Next generation sequencing technologies have provided us with a wealth of data profiling a diverse range of biological processes. In an effort to better understand the process of gene regulation, two predictive machine learning models specifically tailored for analyzing gene transcription and polyadenylation are presented.

Transcriptional enhancers are specific DNA sequences that act as "information integration hubs" to confer regulatory requirements on a given cell. These non-coding DNA sequences can regulate genes from long distances, or across chromosomes, and their relationships with their target genes are not limited to one-to-one. With thousands of putative enhancers and fewer than 14,000 protein-coding genes, detecting enhancer-gene pairs becomes a very complex machine learning and data analysis challenge.

In order to predict these specific sequences and link them to the genes they regulate, we developed McEnhancer. Using DNaseI sensitivity data and annotated in-situ hybridization gene expression clusters, McEnhancer builds interpolated Markov models to learn the enriched sequence content of known enhancer-gene pairs and predicts unknown interactions in a semi-supervised learning algorithm. Classification of predicted relationships was 73-98% accurate for gene sets with varying levels of initially known examples. Predicted interactions showed great overlap with Hi-C-identified interactions. Enrichment of known functionally related TF binding motifs and enhancer-associated histone modification marks, along with the corresponding developmental time point, was highly evident.

On the other hand, pre-mRNA cleavage and polyadenylation is an essential step for 3'-end maturation and the subsequent stability and degradation of mRNAs. This process is highly controlled by cis-regulatory elements surrounding the cleavage site (polyA site), which are frequently constrained by sequence content and position. More than 50% of human transcripts have multiple functional polyA sites, and the specific use of alternative polyA sites (APA) results in isoforms with variable 3'-UTRs, thus potentially affecting gene regulation. Elucidating the regulatory mechanisms underlying differential polyA preferences in multiple cell types has been hindered by the lack of appropriate tests for determining APAs with significant differences across multiple libraries.

We specified a linear effects regression model to identify tissue-specific biases indicating regulated APA; the significance of differences between tissue types was assessed by an appropriately designed permutation test. This combination allowed us to identify highly specific subsets of APA events in the individual tissue types. Predictive kernel-based SVM models successfully classified constitutive polyA sites from a biologically relevant background (auROC = 99.6%), as well as tissue-specific regulated sets from each other. The main cis-regulatory elements described for polyadenylation were found to be a strong, and highly informative, hallmark of constitutive sites only. Tissue-specific regulated sites were found to contain other regulatory motifs, with the canonical PAS signal being nearly absent at brain-specific sites. We applied this model to data for SRp20, an RNA-binding protein that might be involved in oncogene activation, and obtained interesting insights.

Together, these two models contribute to the understanding of enhancers and the key role they play in regulating tissue-specific expression patterns during development, and provide a better understanding of the diversity of post-transcriptional gene regulation in multiple tissue types. / Dissertation
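The sequence-based SVM classification of polyA sites can be illustrated with a toy k-mer count representation and a linear SVM. The DNA sequences below are fabricated and far shorter than real flanking regions; only the canonical AATAAA signal in the positive examples reflects the biology described above.

```python
from itertools import product
from sklearn.svm import LinearSVC

def kmer_counts(seq, k=3):
    """Non-overlapping counts of each DNA k-mer in a sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return [seq.count(m) for m in kmers]

# Fabricated sequences: positives carry the canonical AATAAA PAS signal.
pos = ["GCAATAAAGC", "TTAATAAACG", "AATAAATTGC"]
neg = ["GCGCGCGCGC", "TTTCCCGGGA", "CGTACGTACG"]
X = [kmer_counts(s) for s in pos + neg]
y = [1, 1, 1, 0, 0, 0]

clf = LinearSVC().fit(X, y)
print(clf.predict([kmer_counts("CCAATAAACC")]))
```

Real models of this kind use much richer sequence features and positional constraints, but the principle is the same: signal-bearing k-mers receive positive weights, so sequences containing the PAS hallmark score as constitutive-like sites.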
|