String matching algorithms are the algorithms used to search through different types of text in search of a certain pattern. Many of these algorithms achieve their impressive performance by analysing the pattern and saving that information. That information is then continuously used during the searching phase to know what parts of the text can be skipped. One such algorithm is the Berry-Ravindran. The Berry-Ravindran checks the two characters past the current try for a match and sees if those characters exist in the pattern. This thesis compares the Berry-Ravindran algorithm to new versions of itself that check three and four characters instead of two, along with the Boyer-Moore algorithm. Checking more characters improves the amount of the text that can be skipped by reducing the number of attempts needed but exponentially increases the pre-processing time. The improved performance in attempts does not necessarily mean a faster run-time because of the increased pre-processing time. The variable impacting the pre-processing time the biggest is the size of the alphabet that the text uses. This is researched by testing these algorithms with patterns ranging from 4 to 100 characters long on two different data sets. Protein data which has an alphabet size of 27 and DNA data which has an alphabet size of 4.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:umu-201038 |
Date | January 2022 |
Creators | Riekkola, Jesper |
Publisher | Umeå universitet, Institutionen för datavetenskap |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Relation | UMNAD ; 1365 |
Page generated in 0.0028 seconds