Search (Pattern Matching)

The page contains the documentation for the string-to-string search algorithms implemented in the package.

Rabin-Karp Search Algorithm

class string2string.search.RabinKarpSearch(hash_function: ~string2string.misc.hash_functions.HashFunction = <string2string.misc.hash_functions.PolynomialRollingHash object>)[source]

This class contains the Rabin-Karp search algorithm.

__init__(hash_function: ~string2string.misc.hash_functions.HashFunction = <string2string.misc.hash_functions.PolynomialRollingHash object>) None[source]

This function initializes the Rabin-Karp search algorithm class, which uses a hash function to search for a pattern in a text. [RK1987]

Parameters:

hash_function (HashFunction) – The hash function to use.

Returns:

None

Raises:

AssertionError – If the inputs are invalid.

[RK1987]

Karp, R.M. and Rabin, M.O., 1987. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2), pp.249-260.

itialize_pattern_hash(pattern: str) None[source]

This function initializes the pattern hash value.

Parameters:

pattern (str) – The pattern to search for.

Returns:

None

Raises:

AssertionError – If the inputs are invalid.

search(pattern: str, text: str) int[source]

This function searches for the pattern in the text.

Parameters:
  • pattern (str) – The pattern to search for.

  • text (str) – The text to search in.

Returns:

The index of the pattern in the text (or -1 if the pattern is not found).

Return type:

int

Raises:

AssertionError – If the inputs are invalid.

Knuth-Morris-Pratt (KMP) Search Algorithm

class string2string.search.KMPSearch[source]

This class contains the KMP search algorithm.

__init__() None[source]

This function initializes the Knuth-Morris-Pratt (KMP) search algorithm class. [KMP1977]

Parameters:

None

Returns:

None

Note

  • The current version of the KMP algorithm utilizes an auxiliary list called the lps_array, which stands for “longest proper prefix which is also a suffix”. The lps_array is a list of integers where lps_array[i] represents the length of the longest proper prefix of the pattern that is also a suffix of the pattern[:i+1].

  • By precomputing the lps_array, the KMP algorithm avoids unnecessary character comparisons while searching for the pattern in the text. The algorithm scans the text from left to right and compares characters in the pattern with characters in the text. When a mismatch occurs, the algorithm uses the values in the lps_array to determine the next character in the pattern to compare with the text.

  • An alternative implementation of the KMP algorithm exists, which uses a finite state automaton (FSA) instead of the lps_array, but this is not implemented in this version of the package.

[KMP1977]

Knuth, D.E., Morris, J.H. and Pratt, V.R., 1977. Fast pattern matching in strings. SIAM journal on computing, 6(2), pp.323-350.

initialize_lps() None[source]

This function initializes the pongest proper prefix suffix (lps) array, which contains the length of the longest proper prefix that is also a suffix of the pattern.

IOW: For each index i in the lps array, lps[i] is the length of the longest proper prefix that is also a suffix of the pattern[:i + 1]. In other words, if k = lps[i], then pattern[:k] is equal to pattern[i - k + 1:i + 1] (with the condition that pattern[:k+1] is not equal to pattern[i - k:i + 1]). The lps array is used in the Knuth-Morris-Pratt (KMP) algorithm to avoid unnecessary comparisons when searching for a pattern in a text.

Parameters:

pattern (str) – The pattern to search for.

Returns:

None

search(pattern: str, text: str) int[source]

This function searches for the pattern in the text.

Parameters:
  • pattern (str) – The pattern to search for.

  • text (str) – The text to search in.

Returns:

The index of the pattern in the text (or -1 if the pattern is not found)

Return type:

int

Raises:

AssertionError – If the text is not a string.

Note

  • This is the main function of the KMP search algorithm class.

Boyer-Moore Search Algorithm

class string2string.search.BoyerMooreSearch[source]

This class contains the Boyer-Moore search algorithm.

__init__() None[source]

This function initializes the Boyer-Moore search algorithm class. [BM1977]

The Bayer-Moore search algorithm is a string searching algorithm that uses a heuristic to skip over large sections of the search string, resulting in faster search times than traditional algorithms such as brute-force or Knuth-Morris-Pratt. It is particularly useful for searching for patterns in large amounts of text.

[BM1977]

Boyer, RS and Moore, JS. “A fast string searching algorithm.” Communications of the ACM 20.10 (1977): 762-772.

A Correct Preprocessing Algorithm for Boyer–Moore String-Searching

https://www.cs.jhu.edu/~langmea/resources/lecture_notes/strings_matching_boyer_moore.pdf

aux_get_matching_substring_length(j: int) int[source]

This auxilary function is used to compute the length of the longess suffix of the patterm that matches a substring of the pattern that ends at the index j.

It is used in the “substring match” case of the good suffix rule. More specifically, it is used to find when the suffix of the pattern does not match the text at all. Hence, we find the longest suffix of the pattern that matches a substring of the pattern that ends at the index j.

Parameters:

j (int) – The end index of the substring.

Returns:

The length of the longess suffix of the patterm that matches a substring of the pattern that ends at the index j.

Return type:

int

aux_get_suffix_prefix_length(i: int) int[source]

This auxiliary function is used to compute the length of the longest suffix of pattern[i:] that matches a “prefix” of the pattern.

Parameters:

i (int) – The index of the suffix.

Returns:

The length of the longest suffix of pattern[i:] that matches a “prefix” of the pattern.

Return type:

int

create_skip_bc() None[source]

This function creates the “bad character” skip table. (It is used in the preprocessing step of the Boyer-Moore search algorithm.)

Parameters:

None

Returns:

None

create_skip_gs() None[source]

This function creates the “good suffix” skip table. (It is used in the preprocessing step of the Boyer-Moore search algorithm.)

Parameters:

None

Returns:

None

search(pattern: str, text: str) int[source]

This function searches for the pattern in the text using the Boyer-Moore algorithm.

Parameters:
  • pattern (str) – The pattern to search for.

  • text (str) – The text to search in.

Returns:

The index of the pattern in the text (or -1 if the pattern is not found)

Return type:

int

Raises:

AssertionError – If the text or the pattern is not a string.