boyer moore search Algorithm
The Boyer-Moore search algorithm is an efficient string matching technique, used to search for the occurrence of a pattern within a given text. This algorithm, developed by Robert S. Boyer and J Strother Moore in 1977, has become one of the standard algorithms for string searching due to its impressive performance in practice. It works by pre-processing the pattern and using this information to shift the pattern more intelligently, skipping over portions of the text that cannot contain the pattern. This results in fewer character comparisons and faster search times compared to other traditional string matching algorithms, such as the naïve approach or the Knuth-Morris-Pratt algorithm.
The Boyer-Moore algorithm's efficiency comes from two main heuristics: the bad character rule and the good suffix rule. The bad character rule allows the algorithm to shift the pattern by using information about the occurrence of a mismatched character in the pattern itself. This is done by pre-processing the pattern and creating a lookup table that shows how far the pattern can be shifted when a mismatch occurs. The good suffix rule, on the other hand, deals with the situation when a partial match has been found. It uses the information about the matched portion of the pattern to determine how far the pattern can be shifted to align with a potential match in the text. By combining these two heuristics, the Boyer-Moore algorithm can achieve better performance in terms of the number of character comparisons and overall search time, especially on large texts or when searching for relatively rare patterns.
"""
The algorithm finds the pattern in given text using following rule.
The bad-character rule considers the mismatched character in Text.
The next occurrence of that character to the left in Pattern is found,
If the mismatched character occurs to the left in Pattern,
a shift is proposed that aligns text block and pattern.
If the mismatched character does not occur to the left in Pattern,
a shift is proposed that moves the entirety of Pattern past
the point of mismatch in the text.
If there no mismatch then the pattern matches with text block.
Time Complexity : O(n/m)
n=length of main string
m=length of pattern string
"""
class BoyerMooreSearch:
def __init__(self, text, pattern):
self.text, self.pattern = text, pattern
self.textLen, self.patLen = len(text), len(pattern)
def match_in_pattern(self, char):
""" finds the index of char in pattern in reverse order
Parameters :
char (chr): character to be searched
Returns :
i (int): index of char from last in pattern
-1 (int): if char is not found in pattern
"""
for i in range(self.patLen - 1, -1, -1):
if char == self.pattern[i]:
return i
return -1
def mismatch_in_text(self, currentPos):
""" finds the index of mis-matched character in text when compared with pattern from last
Parameters :
currentPos (int): current index position of text
Returns :
i (int): index of mismatched char from last in text
-1 (int): if there is no mismatch between pattern and text block
"""
for i in range(self.patLen - 1, -1, -1):
if self.pattern[i] != self.text[currentPos + i]:
return currentPos + i
return -1
def bad_character_heuristic(self):
# searches pattern in text and returns index positions
positions = []
for i in range(self.textLen - self.patLen + 1):
mismatch_index = self.mismatch_in_text(i)
if mismatch_index == -1:
positions.append(i)
else:
match_index = self.match_in_pattern(self.text[mismatch_index])
i = (
mismatch_index - match_index
) # shifting index lgtm [py/multiple-definition]
return positions
text = "ABAABA"
pattern = "AB"
bms = BoyerMooreSearch(text, pattern)
positions = bms.bad_character_heuristic()
if len(positions) == 0:
print("No match found")
else:
print("Pattern found in following positions: ")
print(positions)