Return to search

Statistical methods & algorithms for autonomous immunoglobulin repertoire analysis

Investigating the immunoglobulin repertoire is a means of understanding the adaptive immune response to infectious disease or vaccine challenge. The data examined are typically generated using high-throughput sequencing on samples of immunoglobulin variable-region genes present in blood or tissue collected from human or animal subjects. The analysis of these large, diverse collections provides a means of gaining insight into the specific molecular mechanisms involved in generating and maintaining a protective immune response. It involves the characterization of distinct clonal populations, specifically through the inference of founding alleles for germline gene segment recombination, as well as the lineage of accumulated mutations acquired during the development of each clone.
Germline gene segment inference is currently performed by aligning immunoglobulin sequencing reads against an external reference database and assigning each read to the entry that provides the best score according to the metric used. The problem with this approach is that allelic diversity is greater than can be usefully accommodated in a static database. The absence of the alleles used from the database often leads to the misclassification of single-nucleotide polymorphisms as somatic mutations acquired during affinity maturation. This trend is especially evident with the rhesus macaque, but also affects the comparatively well-catalogued human databases, whose collections are biased towards samples from individuals of European descent.
Our project presents novel statistical methods for immunoglobulin repertoire analysis which allow for the de novo inference of germline gene segment libraries directly from next-generation sequencing data, without the need for external reference databases. These methods follow a Bayesian paradigm, which uses an information-theoretic modelling approach to iteratively improve upon internal candidate gene segment libraries. Both candidate libraries and trial analyses given those libraries are incorporated as components of the machine learning evaluation procedure, allowing for the simultaneous optimization of model accuracy and simplicity. Finally, the proposed methods are evaluated using synthetic data designed to mimic known mechanisms for repertoire generation, with pre-designated parameters. We also apply these methods to known biological sources with unknown repertoire generation parameters, and conclude with a discussion on how this method can be used to identify potential novel alleles.

Identiferoai:union.ndltd.org:bu.edu/oai:open.bu.edu:2144/41875
Date13 January 2021
CreatorsNorwood, Katherine Frances
ContributorsKepler, Thomas B.
Source SetsBoston University
Languageen_US
Detected LanguageEnglish
TypeThesis/Dissertation
RightsAttribution-NonCommercial-ShareAlike 4.0 International, http://creativecommons.org/licenses/by-nc-sa/4.0/

Page generated in 0.0022 seconds