About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

A Village-level Economic Evaluation of the Southwest Poverty Reduction Project

Mo, Xiugen 30 April 2011
This research evaluates the post-program treatment effects of the Southwest Poverty Reduction Project (SWPRP), a large-scale ($463.55 million) rural development project jointly funded by the World Bank and the Chinese Government from 1995 to 2001. The SWPRP aimed at reducing poverty and raising living standards for the absolute poor in southwest China. The treatment effects are measured by the changes in 21 indicators at the village level. The dataset for this research includes 327 project villages and 3887 non-project villages in Guangxi Zhuang Autonomous Region. Rigorous econometric methods are employed to remove selection bias. A probit model is estimated to investigate the selection rule for the project villages. In addition to the control function approach, different methods of propensity score matching, such as nearest neighbor, caliper or radius, and kernel-based matching, are used to estimate the treatment effects, including the average treatment effect, the average treatment effect on the treated, and the average treatment effect on the untreated. The evidence from the treatment effect estimations shows that the SWPRP achieved its overall objective but not necessarily all of its specific objectives. The evidence supports significant impacts of the project investments on farming, off-farm employment, and infrastructure, while there is no strong evidence of significant impacts on primary education and rural healthcare services. The poverty rate in the project villages was reduced by about 3.0-3.3 percent and net income increased by about 24-26 yuan. Further investigation of the specific treatment effects on individual villages reveals that the treatment effects vary with the land resources of the villages. Lastly, the project was successful in targeting the poorer villages but not necessarily the poorest. This research also reveals some findings of practical relevance for social program design. The approach of integrated policies proves to be effective in large-scale poverty reduction. However, designers should be aware that households may trade off one activity against another to maximize their utility rather than simply follow the whole package of integrated activities. In addition, the minimization of the operational costs of the project agents should not come at the expense of the effectiveness of the project.
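To make the matching strategy concrete, the sketch below shows one way to estimate the average treatment effect on the treated with a probit selection model and nearest-neighbour propensity score matching. The data frame, column names and covariate list are hypothetical illustrations, not taken from the thesis.

    # Minimal sketch of propensity-score matching for the ATT, assuming a pandas
    # DataFrame `villages` with a binary "treated" column, an "outcome" column
    # (e.g. change in net income), and village-level covariates.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    def att_nearest_neighbour(villages: pd.DataFrame, covariates: list) -> float:
        # Step 1: probit model for the probability of being selected as a project village.
        X = sm.add_constant(villages[covariates])
        probit = sm.Probit(villages["treated"], X).fit(disp=0)
        pscore = pd.Series(probit.predict(X), index=villages.index)

        treated = villages[villages["treated"] == 1]
        control = villages[villages["treated"] == 0]

        # Step 2: one-to-one nearest-neighbour matching on the propensity score.
        diffs = []
        for i in treated.index:
            j = (pscore[control.index] - pscore[i]).abs().idxmin()
            diffs.append(treated.loc[i, "outcome"] - control.loc[j, "outcome"])

        # Step 3: the ATT is the mean outcome difference over matched pairs.
        return float(np.mean(diffs))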
2

Monitoring and diagnosis of process systems using kernel-based learning methods

Jemwa, Gorden Takawadiyi December 2007
Thesis (PhD (Process Engineering))--University of Stellenbosch, 2007. / Dissertation presented for the degree of Doctor of Philosophy in Engineering at the University of Stellenbosch. / ENGLISH ABSTRACT: The development of advanced methods of process monitoring, diagnosis, and control has been identified as a major 21st-century challenge in control systems research and application. This is particularly the case for chemical and metallurgical operations, owing to the lack of expressive fundamental models as well as the nonlinear nature of most process systems, which makes established linearization methods unsuitable. As a result, efforts have been directed at the search for alternative approaches that do not require fundamental or analytical models. Data-based methods provide a very promising alternative in this regard, given the huge volumes of data being collected in modern process operations as well as advances in both theoretical and practical aspects of extracting information from observations. In this thesis, the use of kernel-based learning methods in fault detection and diagnosis of complex processes is considered. Kernel-based machine learning methods are a robust family of algorithms founded on insights from statistical learning theory. Instead of estimating a decision function by minimizing the training error, as other learning algorithms do, kernel methods use a criterion called large margin maximization to estimate a linear learning rule on data embedded in a suitable feature space. The embedding is implicitly defined by the choice of a kernel function and corresponds to inducing a nonlinear learning rule in the original measurement space. Large margin maximization corresponds to developing an algorithm with theoretical guarantees on how well it will perform on unseen data. In the first contribution, the characterization of time series data from process plants is investigated. Whereas complex processes are difficult to model from first principles, they can be identified using historic process time series data and a suitable model structure. However, prior to fitting such a model, it is important to establish whether the time series data justify the selected model structure. Singular spectrum analysis (SSA) has been used for time series identification. A nonlinear extension of SSA is proposed for classification of time series. Using benchmark systems, the proposed extension is shown to perform better than linear SSA. Moreover, the method is shown to be useful for filtering noise in time series data and, therefore, has potential applications in other tasks such as data rectification and gross error detection. Multivariate statistical process monitoring methods are well-established techniques for efficient information extraction from multivariate data. Such information is usually compact and amenable to graphical representation in two- or three-dimensional plots. For process monitoring purposes, control limits are also plotted on these charts. These control limits are usually based on a hypothesized analytical distribution, typically the Gaussian (normal) distribution. A robust approach for estimating confidence bounds using the reference data is proposed. The method is based on one-class classification methods. The usefulness of using data to define a confidence bound in reducing fault detection errors is illustrated using plant data. The use of both linear and nonlinear supervised feature extraction is also investigated.
The advantages of supervised feature extraction using kernel methods are highlighted via illustrative case studies. A general strategy for fault detection and diagnosis is proposed that integrates feature extraction methods, fault identification, and different methods for estimating confidence bounds. For kernel-based approaches, the general framework allows for interpretation of the results in the input space instead of the feature space. An important step in process monitoring is identifying the variable responsible for a fault. Although not all faults that can occur at a plant can be known beforehand, it is possible to use knowledge of previous faults or simulations to anticipate their recurrence. A framework for fault diagnosis using one-class support vector machine (SVM) classification is proposed. Compared to other previously studied techniques, the one-class SVM approach is shown to have generally better robustness and performance characteristics. Most methods for process monitoring make little use of data collected under normal operating conditions, whereas most quality issues in process plants are known to occur when the process is in control. In the final contribution, a methodology for continuous optimization of process performance is proposed that combines support vector learning with decision trees. The methodology is based on a continuous search for quality improvements by challenging the normal operating condition regions established via statistical control. Simulated and plant data are used to illustrate the approach.
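The one-class SVM idea described above can be illustrated with a short sketch: a boundary is fitted around normal operating data only, and new observations falling outside it are flagged as potential faults. The data and parameter values below are placeholders rather than anything from the thesis.

    # Sketch of one-class SVM fault detection: fit a boundary around normal
    # operating data and flag new samples that fall outside it.
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X_normal = rng.normal(size=(500, 4))          # historical in-control measurements
    X_new = rng.normal(loc=2.0, size=(20, 4))     # incoming samples, here shifted (faulty)

    scaler = StandardScaler().fit(X_normal)
    detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
    detector.fit(scaler.transform(X_normal))

    # +1 = inside the normal-operating region, -1 = flagged as a potential fault.
    labels = detector.predict(scaler.transform(X_new))
    print(f"{np.sum(labels == -1)} of {len(labels)} new samples flagged as faulty")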
3

Efficient Exact Tests in Linear Mixed Models for Longitudinal Microbiome Studies

Zhai, Jing January 2016
The microbiome plays an important role in human health, and the analysis of associations between the microbiome and clinical outcomes has become an active direction in biostatistics research. Testing the microbiome effect on clinical phenotypes directly using operational taxonomic unit (OTU) abundance data is a challenging problem due to the high dimensionality, non-normality, and phylogenetic structure of the data. Most studies focus only on describing the changes in microbial populations that occur in patients who have a specific clinical condition. Instead, a statistical strategy utilizing distance-based or similarity-based non-parametric testing, in which a distance or similarity measure is defined between any two microbiome samples, has been developed to assess the association between microbiome composition and outcomes of interest. Despite these improvements, such tests are still not easily interpretable and cannot adjust for potential covariates. A novel approach, the kernel-based semi-parametric regression framework, is applied to evaluate the association while controlling for covariates. The framework utilizes a kernel function, which is a measure of similarity between samples' microbiome compositions and characterizes the relationship between the microbiome and the outcome of interest. This kernel-based regression model, however, cannot be applied in longitudinal studies since it cannot model the correlation between repeated measurements. We propose microbiome association exact tests (MAETs) in a linear mixed model that can deal with longitudinal microbiome data. MAETs can test not only the effect of the overall microbiome but also the effect of specific clusters of OTUs while controlling for others, by introducing additional random effects in the model. Current methods for testing multiple variance components are based on either asymptotic distributions or the parametric bootstrap, which require a large sample size or a high computational cost. The exact (restricted) likelihood ratio test ((R)LRT), a computationally efficient and powerful testing methodology, was derived by Crainiceanu. Since the exact (R)LRT can only be used to test one variance component, we propose an approach that combines the recent development of the exact (R)LRT with a strategy for simplifying a linear mixed model with multiple variance components to the single-component case. Monte Carlo simulation studies show correctly controlled type I error and superior power in testing associations between the microbiome and outcomes in longitudinal studies. Finally, the MAETs were applied to longitudinal pulmonary microbiome datasets to demonstrate that microbiome composition is associated with lung function and immunological outcomes. We also identified two genera of interest, Prevotella and Veillonella, which are associated with forced vital capacity.
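A minimal sketch of the kernel idea behind such tests is given below: a similarity (kernel) matrix is built from OTU relative abundances and enters the mixed model as the covariance of a random effect, whose variance component is then tested. The simulated data, the linear kernel and the crude score-type statistic are illustrative assumptions, not the MAET procedure itself.

    # Build a microbiome similarity kernel from an OTU count table and use it
    # as the covariance of a random effect in y = X*beta + h + eps, h ~ N(0, tau*K).
    import numpy as np

    rng = np.random.default_rng(1)
    otu_counts = rng.poisson(lam=5.0, size=(30, 200))       # 30 samples x 200 OTUs

    # Convert counts to relative abundances so samples are comparable.
    rel_abund = otu_counts / otu_counts.sum(axis=1, keepdims=True)

    # Simple linear kernel: K[i, j] is the similarity of samples i and j.
    K = rel_abund @ rel_abund.T

    y = rng.normal(size=30)                                  # outcome of interest
    X = np.column_stack([np.ones(30), rng.normal(size=30)])  # intercept + one covariate

    # Crude score-type statistic for H0: tau = 0, Q = r' K r with r the residuals
    # from the fixed-effects-only fit; its null distribution is a mixture of
    # chi-squared variables, which the exact (R)LRT handles more rigorously.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta_hat
    Q = float(r @ K @ r)
    print(f"score-type statistic Q = {Q:.3f}")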
4

Density Based Data Clustering

Albarakati, Rayan 01 March 2015
Data clustering is a data analysis technique that groups data based on a measure of similarity. When data are well clustered, the similarities between objects in the same group are high, while the similarities between objects in different groups are low. Data clustering is widely applied in a variety of areas such as bioinformatics, image segmentation, and market research. This project conducted an in-depth study of data clustering with a focus on density-based clustering methods. The recent clustering-by-fast-search-and-find-of-density-peaks (CFSFDP) algorithm is based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities. This method has been examined, experimented with, and improved. Three density-estimation methods (KNN-based, Gaussian-kernel-based, and iterative Gaussian-kernel-based) are applied in this project to improve CFSFDP density-based clustering. The methods are applied to four benchmark datasets, and the results are analyzed and compared.
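For reference, the sketch below computes the two quantities that CFSFDP assigns to every point: a local density (here with a Gaussian kernel, one of the variants examined in the project) and the distance to the nearest point of higher density; points scoring high on both are candidate cluster centres. The data and cutoff distance are arbitrary illustrations.

    # Core quantities of CFSFDP: rho_i (Gaussian-kernel local density) and
    # delta_i (distance to the nearest point with higher density).
    import numpy as np
    from scipy.spatial.distance import cdist

    def density_peaks_scores(X: np.ndarray, dc: float):
        d = cdist(X, X)                                   # pairwise distances
        rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0    # subtract self-contribution

        n = len(X)
        delta = np.zeros(n)
        order = np.argsort(-rho)                          # indices from highest to lowest density
        delta[order[0]] = d[order[0]].max()               # convention for the densest point
        for rank in range(1, n):
            i = order[rank]
            higher = order[:rank]                         # all points denser than i
            delta[i] = d[i, higher].min()
        return rho, delta

    X = np.random.default_rng(2).normal(size=(100, 2))
    rho, delta = density_peaks_scores(X, dc=0.5)
    # Candidate cluster centres: e.g. the points with the largest rho * delta.
    centres = np.argsort(-(rho * delta))[:3]
    print(centres)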
5

Nonlinear fault detection and diagnosis using Kernel based techniques applied to a pilot distillation column

Phillpotts, David Nicholas Charles 15 January 2008
Fault detection and diagnosis is an important problem in process engineering. In this dissertation, the use of multivariate techniques for fault detection and diagnosis is explored in the context of statistical process control. Principal component analysis (PCA) and its extension, kernel principal component analysis, are proposed to extract features from process data. Kernel-based methods have the ability to model nonlinear processes by forming higher-dimensional representations of the data. Discriminant methods can extend feature extraction methods by increasing the separation between different faults, which is shown to aid fault diagnosis. Linear and kernel discriminant analysis are proposed as fault diagnosis methods. Data from a pilot-scale distillation column were used to explore the performance of the techniques. The models were trained with normal and faulty operating data and tested with unseen and/or novel fault data. All the techniques demonstrated at least some fault detection and diagnosis ability. Linear PCA was particularly successful, mainly due to the ease of training and the ability to relate the scores back to the input data. The attributes of these multivariate statistical techniques were compared to the goals of statistical process control and the desirable attributes of fault detection and diagnosis systems. / Dissertation (MEng (Control Engineering))--University of Pretoria, 2008. / Chemical Engineering / MEng / Unrestricted
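A compact sketch of the linear-PCA monitoring scheme described above follows: a PCA model is fitted on normal operating data and new samples are screened with Hotelling's T-squared and the squared prediction error. The simulated data and the percentile-based limits are simplifying assumptions; the dissertation's actual control limits may be derived differently.

    # PCA-based monitoring: Hotelling's T^2 in the model subspace and the
    # squared prediction error (SPE/Q) in the residual subspace.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    def t2_and_spe(pca: PCA, X_scaled: np.ndarray):
        scores = pca.transform(X_scaled)
        t2 = np.sum(scores ** 2 / pca.explained_variance_, axis=1)   # Hotelling's T^2
        residual = X_scaled - pca.inverse_transform(scores)          # part not captured by the model
        spe = np.sum(residual ** 2, axis=1)                          # squared prediction error
        return t2, spe

    rng = np.random.default_rng(3)
    X_train = rng.normal(size=(300, 10))              # stand-in for normal operating data

    scaler = StandardScaler().fit(X_train)
    pca = PCA(n_components=3).fit(scaler.transform(X_train))

    t2_ref, spe_ref = t2_and_spe(pca, scaler.transform(X_train))
    t2_lim, spe_lim = np.percentile(t2_ref, 99), np.percentile(spe_ref, 99)

    X_new = rng.normal(loc=0.5, size=(20, 10))        # stand-in for possibly faulty data
    t2_new, spe_new = t2_and_spe(pca, scaler.transform(X_new))
    faults = (t2_new > t2_lim) | (spe_new > spe_lim)
    print(f"{faults.sum()} of {len(faults)} new samples exceed a control limit")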
6

Studies on Kernel-Based System Identification / カーネルに基づくシステム同定に関する研究

Fujimoto, Yusuke 26 March 2018
Kyoto University / 0048 / New system, doctoral course / Doctor of Informatics / Degree No. Kō 21214 / Informatics Doctorate No. 667 / 新制||情||115 (University Library) / Department of Systems Science, Graduate School of Informatics, Kyoto University / (Chief examiner) Professor 杉江 俊治, Professor 太田 快人, Professor 大塚 敏之 / Qualified under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DGAM
7

Latent variable based computational methods for applications in life sciences : Analysis and integration of omics data sets

Bylesjö, Max January 2008
With the increasing availability of high-throughput systems for parallel monitoring of multiple variables, e.g. levels of large numbers of transcripts in functional genomics experiments, massive amounts of data are being collected even from single experiments. Extracting useful information from such systems is a non-trivial task that requires powerful computational methods to identify common trends and to help detect the underlying biological patterns. This thesis deals with the general computational problems of classifying and integrating high-dimensional empirical data using a latent variable based modeling approach. The underlying principle of this approach is that a complex system can be characterized by a few independent components that capture the systematic properties of the system. Such a strategy is well suited for handling noisy, multivariate data sets with strong multicollinearity structures, such as those typically encountered in many biological and chemical applications. The main foci of the studies this thesis is based upon are applications and extensions of the orthogonal projections to latent structures (OPLS) method in life science contexts. OPLS is a latent variable based regression method that separately describes systematic sources of variation that are related and unrelated to the modeling aim (for instance, classifying two different categories of samples). This separation of sources of variation can be used to pre-process data, but also has distinct advantages for model interpretation, as exemplified throughout the work. For classification cases, a probabilistic framework for OPLS has been developed that allows the incorporation of both variance and covariance into classification decisions. This can be seen as a unification of two historical classification paradigms based on either variance or covariance. In addition, a non-linear reformulation of the OPLS algorithm is outlined, which is useful for particularly complex regression or classification tasks. The general trend in functional genomics studies in the post-genomics era is to perform increasingly comprehensive characterizations of organisms in order to study the associations between their molecular and cellular components in greater detail. Frequently, abundances of all transcripts, proteins and metabolites are measured simultaneously in an organism at a current state or over time. In this work, a generalization of OPLS is described for the analysis of multiple data sets. It is shown that this method can be used to integrate data in functional genomics experiments by separating the systematic variation that is common to all data sets considered from sources of variation that are specific to each data set. / Functional genomics is a research field whose ultimate goal is to characterize all genes in the genome of an organism. This includes studies of how DNA is transcribed into mRNA, how the mRNA is then translated into proteins, and how these proteins interact and influence the organism's biochemical processes. The traditional approach has been to study the function, regulation and translation of one gene at a time. New technology in the field has, however, made it possible to study how thousands of transcripts, proteins and small molecules behave jointly in an organism at a given time point or over time. In practice, this also means that large amounts of data are generated even from small, isolated experiments.
Finding global trends and extracting useful information from such data sets is a non-trivial computational problem that requires advanced yet interpretable mathematical models. This thesis describes the development and application of computational methods for classifying and integrating large amounts of empirical (measured) data. Common to all of the methods is that they are based on latent variables: variables that are not measured directly but are computed from other, observed variables. This concept is well suited to studies of complex systems that can be described by a few independent factors characterizing the main properties of the system, which is typical of many chemical and biological systems. The methods described in the thesis are general but have mainly been developed for, and applied to, data from biological experiments. The thesis demonstrates how these methods can be used to find complex relationships between measured data and other factors of interest, without losing the properties of the method that are critical for interpreting the results. The methods are applied to find common and unique properties of the regulation of transcripts and how these are affected by, and affect, small molecules in the poplar tree. In addition, a larger experiment in poplar is described in which the relationship between levels of transcripts, proteins and small molecules is investigated with the developed methods.
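The separation of predictive and orthogonal variation that OPLS performs can be illustrated with a bare-bones sketch for a single response: one orthogonal component is split off before the predictive fit. This is a simplified illustration of the general idea, not the algorithm as implemented in the thesis.

    # One round of orthogonal signal removal in the spirit of OPLS for a single
    # response y: variation in X uncorrelated with y is split off into an
    # "orthogonal" component before the predictive fit.
    import numpy as np

    def opls_one_orthogonal_component(X: np.ndarray, y: np.ndarray):
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()

        w = Xc.T @ yc
        w /= np.linalg.norm(w)                 # predictive weight vector
        t = Xc @ w
        p = Xc.T @ t / (t @ t)                 # loading for the predictive score

        w_orth = p - (w @ p) * w               # part of the loading orthogonal to w
        w_orth /= np.linalg.norm(w_orth)
        t_orth = Xc @ w_orth                   # y-orthogonal score
        p_orth = Xc.T @ t_orth / (t_orth @ t_orth)

        X_filtered = Xc - np.outer(t_orth, p_orth)   # X with orthogonal variation removed
        return X_filtered, t, t_orth

    rng = np.random.default_rng(4)
    X = rng.normal(size=(50, 20))
    y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=50)
    X_filt, t_pred, t_orth = opls_one_orthogonal_component(X, y)
    print(abs(np.corrcoef(t_orth, y - y.mean())[0, 1]))   # close to zero by construction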
8

INVERSE-DISTANCE INTERPOLATION BASED SET-POINT GENERATION METHODS FOR CLOSED-LOOP COMBUSTION CONTROL OF A CIDI ENGINE

Maringanti, Rajaram Seshu 15 December 2009
No description available.
9

Designing Reactive Power Control Rules for Smart Inverters using Machine Learning

Garg, Aditie 14 June 2018
Due to the increasing penetration of solar power generation, distribution grids are facing a number of challenges. Frequent reverse active power flows can result in rapid fluctuations in voltage magnitudes. However, under the revised IEEE 1547 standard, smart inverters can actively control their reactive power injection to minimize voltage deviations and power losses in the grid. Reactive power control and globally optimal inverter coordination in real time are computationally and communication-wise demanding, whereas local Volt-VAR or Watt-VAR control rules are subpar for enhanced grid services. This thesis uses machine learning tools and poses reactive power control as a kernel-based regression task to learn policies and evaluate the reactive power injections in real time. This novel approach performs inverter coordination through non-linear control policies centrally designed by the operator on a slower timescale, using anticipated scenarios for load and generation. In real time, the inverters feed locally and/or globally collected grid data into the customized control rules. The developed models are highly adjustable to the available computation and communication resources. The developed control scheme is tested on the IEEE 123-bus system and is seen to efficiently minimize losses and regulate voltage within the permissible limits. / Master of Science / The increasing integration of solar photovoltaic (PV) systems poses both opportunities and technical challenges for the electrical distribution grid. Although PV systems provide more power to the grid, they can also lead to problems in the operation of the grid, such as overvoltages and voltage fluctuations. These variations can lead to overheating and burning of electrical devices and to equipment malfunction. Since solar generation is highly dependent on weather and geographical location, its output is uncertain. This uncertainty in solar irradiance cannot be handled by the existing voltage control devices, as they would need to operate more frequently than usual, which can cause recurring maintenance needs. Thus, to make solar PV more flexible and grid-friendly, smart inverters are being developed. Smart inverters have advanced sensing, communication, and control capabilities, which can be utilized for voltage control. The research discusses how inverters can be used to improve the grid profile by providing reactive power support to reduce power losses and maintain voltages within their limits for safer operation.
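A rough sketch of posing the control rule as a kernel regression is given below: a nonlinear map from local measurements to reactive power set-points is fitted offline on anticipated scenarios and then evaluated locally in real time. The features, targets and hyperparameters are invented for illustration and are not those of the thesis.

    # Learn a reactive power rule as a kernel regression: fit a nonlinear map
    # from local measurements to set-points computed offline for anticipated
    # scenarios, then evaluate the learned rule on local data in real time.
    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    rng = np.random.default_rng(5)
    scenarios = rng.uniform(size=(1000, 2))               # columns: [local PV output, local voltage]
    q_optimal = 0.3 * np.sin(3 * scenarios[:, 0]) - 0.2 * scenarios[:, 1]  # stand-in for optimal VAR set-points

    policy = KernelRidge(kernel="rbf", alpha=1e-3, gamma=10.0)
    policy.fit(scenarios, q_optimal)                      # designed centrally, on a slow timescale

    # In real time, each inverter evaluates the learned rule on its local data.
    local_measurement = np.array([[0.8, 1.02]])
    q_setpoint = policy.predict(local_measurement)
    print(f"reactive power set-point: {q_setpoint[0]:.3f} (per unit, illustrative)")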
10

Machine learning for epigenetics : algorithms for next generation sequencing data

Mayo, Thomas Richard January 2018
The advent of Next Generation Sequencing (NGS), a little over a decade ago, has led to a vast and rapid increase in the generation of genomic data. The drastically reduced cost has in turn enabled powerful modifications that can be used to investigate not just genetic, but epigenetic, phenomena. Epigenetics refers to the study of mechanisms affecting gene expression other than the genetic code itself and thus, at the transcription level, incorporates DNA methylation, transcription factor binding and histone modifications, amongst others. This thesis outlines and tackles two major challenges in the computational analysis of such data using techniques from machine learning. Firstly, I address the problem of testing for differential methylation between groups of bisulfite sequencing data sets. DNA methylation plays an important role in genomic imprinting, X-chromosome inactivation and the repression of repetitive elements, as well as being implicated in numerous diseases, such as cancer. Bisulfite sequencing provides single-nucleotide-resolution methylation data at the whole-genome scale, but a sensitive analysis of such data is difficult. I propose a solution that uses a powerful kernel-based machine learning technique, the Maximum Mean Discrepancy, to leverage well-characterised spatial correlations in DNA methylation, and adapt the method for this particular use. I use this tailored method to analyse a novel data set from a study of ageing in three different tissues in the mouse. This study motivates further modifications to the method and highlights the utility of the underlying measure as an exploratory tool for methylation analysis. Secondly, I address the problem of predictive and explanatory modelling of chromatin immunoprecipitation sequencing (ChIP-Seq) data. ChIP-Seq is typically used to assay the binding of a protein of interest, such as a transcription factor or histone, to the DNA, and as such is one of the most widely used sequencing assays. While peak callers are a powerful tool for identifying binding sites in sparse and clean ChIP-Seq profiles, broader signals defy analysis in this framework. Instead, generative models that explain the data in terms of the underlying sequence can help uncover mechanisms that predict binding, or the lack thereof. I explore current problems with ChIP-Seq analysis, such as zero-inflation and the use of the control experiment, known as the input. I then devise a method for representing k-mers that enables the use of longer DNA sub-sequences within a flexible model development framework, such as generalised linear models, without heavy programming requirements. Finally, I use these insights to develop an appropriate Bayesian generative model that predicts ChIP-Seq count data in terms of the underlying DNA sequence, incorporating DNA methylation information where available, and fitting the model with the Expectation-Maximization algorithm. The model is tested on simulated data and on real data pertaining to the histone mark H3K27me3. This thesis therefore straddles the fields of bioinformatics and machine learning. Bioinformatics is both plagued and blessed by the plethora of different techniques available for gathering data and their continual innovations. Each technique presents a unique challenge, and hence out-of-the-box machine learning techniques have had little success in solving biological problems.
While I have focused on NGS data, the methods developed in this thesis are likely to be applicable to future technologies, such as Third Generation Sequencing methods, and the lessons learned in their adaptation will be informative for the next wave of computational challenges.
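The basic statistic underlying the kernel two-sample approach mentioned above, the squared Maximum Mean Discrepancy, can be sketched in a few lines; the version below uses an RBF kernel and the standard unbiased estimator, without the methylation-specific tailoring developed in the thesis.

    # Unbiased squared Maximum Mean Discrepancy between two groups of samples
    # with an RBF kernel; the simulated groups stand in for methylation features.
    import numpy as np
    from scipy.spatial.distance import cdist

    def mmd2_unbiased(X: np.ndarray, Y: np.ndarray, sigma: float) -> float:
        k = lambda A, B: np.exp(-cdist(A, B, "sqeuclidean") / (2 * sigma ** 2))
        Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
        m, n = len(X), len(Y)
        # Exclude diagonal terms for the unbiased estimate.
        term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
        term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
        return term_x + term_y - 2.0 * Kxy.mean()

    rng = np.random.default_rng(6)
    group_a = rng.normal(loc=0.0, size=(40, 5))   # e.g. methylation features, group A
    group_b = rng.normal(loc=0.5, size=(40, 5))   # group B, shifted
    print(f"unbiased MMD^2 estimate: {mmd2_unbiased(group_a, group_b, sigma=1.0):.4f}")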
