Over the past several years, new DNA sequencing technologies have led to a great
increase in the quantity of biological sequence data that can be generated. A typical data set
may contain millions or even billions of short reads (sequences of a few hundred base pairs)
that are to some degree redundant: the data fall naturally into clusters of sequences
that are highly similar to each other. To reduce the time required for analysis
of the data, it is therefore of interest to compute representatives of these clusters,
based on some definition of similarity.
In this thesis we examine two clustering software packages, USEARCH and DNACLUST,
that seek to perform this clustering task efficiently. We provide an overview of the techniques used by these two packages; we compare and evaluate them from both a methodological and an experimental perspective, and draw conclusions about their effectiveness and utility. / Thesis / Master of Science (MSc)
Identifier | oai:union.ndltd.org:mcmaster.ca/oai:macsphere.mcmaster.ca:11375/18471 |
Date | 11 1900 |
Creators | Shafqat, Raazia |
Contributors | Smyth, W. F., Computer Science |
Source Sets | McMaster University |
Language | English |
Type | Thesis |