Search (Pattern Matching)¶
The page contains the documentation for the string-to-string search algorithms implemented in the package.
Naive (Brute Force) Search¶
Rabin-Karp Search Algorithm¶
- class string2string.search.RabinKarpSearch(hash_function: ~string2string.misc.hash_functions.HashFunction = <string2string.misc.hash_functions.PolynomialRollingHash object>)[source]¶
This class contains the Rabin-Karp search algorithm.
- __init__(hash_function: ~string2string.misc.hash_functions.HashFunction = <string2string.misc.hash_functions.PolynomialRollingHash object>) None[source]¶
This function initializes the Rabin-Karp search algorithm class, which uses a hash function to search for a pattern in a text. [RK1987]
- Parameters:
hash_function (HashFunction) – The hash function to use.
- Returns:
None
- Raises:
AssertionError – If the inputs are invalid.
[RK1987]Karp, R.M. and Rabin, M.O., 1987. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2), pp.249-260.
- itialize_pattern_hash(pattern: str) None[source]¶
This function initializes the pattern hash value.
- Parameters:
pattern (str) – The pattern to search for.
- Returns:
None
- Raises:
AssertionError – If the inputs are invalid.
- search(pattern: str, text: str) int[source]¶
This function searches for the pattern in the text.
- Parameters:
pattern (str) – The pattern to search for.
text (str) – The text to search in.
- Returns:
The index of the pattern in the text (or -1 if the pattern is not found).
- Return type:
int
- Raises:
AssertionError – If the inputs are invalid.
Knuth-Morris-Pratt (KMP) Search Algorithm¶
- class string2string.search.KMPSearch[source]¶
This class contains the KMP search algorithm.
- __init__() None[source]¶
This function initializes the Knuth-Morris-Pratt (KMP) search algorithm class. [KMP1977]
- Parameters:
None –
- Returns:
None
Note
The current version of the KMP algorithm utilizes an auxiliary list called the lps_array, which stands for “longest proper prefix which is also a suffix”. The lps_array is a list of integers where lps_array[i] represents the length of the longest proper prefix of the pattern that is also a suffix of the pattern[:i+1].
By precomputing the lps_array, the KMP algorithm avoids unnecessary character comparisons while searching for the pattern in the text. The algorithm scans the text from left to right and compares characters in the pattern with characters in the text. When a mismatch occurs, the algorithm uses the values in the lps_array to determine the next character in the pattern to compare with the text.
An alternative implementation of the KMP algorithm exists, which uses a finite state automaton (FSA) instead of the lps_array, but this is not implemented in this version of the package.
[KMP1977]Knuth, D.E., Morris, J.H. and Pratt, V.R., 1977. Fast pattern matching in strings. SIAM journal on computing, 6(2), pp.323-350.
- initialize_lps() None[source]¶
This function initializes the pongest proper prefix suffix (lps) array, which contains the length of the longest proper prefix that is also a suffix of the pattern.
IOW: For each index i in the lps array, lps[i] is the length of the longest proper prefix that is also a suffix of the pattern[:i + 1]. In other words, if k = lps[i], then pattern[:k] is equal to pattern[i - k + 1:i + 1] (with the condition that pattern[:k+1] is not equal to pattern[i - k:i + 1]). The lps array is used in the Knuth-Morris-Pratt (KMP) algorithm to avoid unnecessary comparisons when searching for a pattern in a text.
- Parameters:
pattern (str) – The pattern to search for.
- Returns:
None
- search(pattern: str, text: str) int[source]¶
This function searches for the pattern in the text.
- Parameters:
pattern (str) – The pattern to search for.
text (str) – The text to search in.
- Returns:
The index of the pattern in the text (or -1 if the pattern is not found)
- Return type:
int
- Raises:
AssertionError – If the text is not a string.
Note
This is the main function of the KMP search algorithm class.
Boyer-Moore Search Algorithm¶
- class string2string.search.BoyerMooreSearch[source]¶
This class contains the Boyer-Moore search algorithm.
- __init__() None[source]¶
This function initializes the Boyer-Moore search algorithm class. [BM1977]
The Bayer-Moore search algorithm is a string searching algorithm that uses a heuristic to skip over large sections of the search string, resulting in faster search times than traditional algorithms such as brute-force or Knuth-Morris-Pratt. It is particularly useful for searching for patterns in large amounts of text.
[BM1977]Boyer, RS and Moore, JS. “A fast string searching algorithm.” Communications of the ACM 20.10 (1977): 762-772.
A Correct Preprocessing Algorithm for Boyer–Moore String-Searching
https://www.cs.jhu.edu/~langmea/resources/lecture_notes/strings_matching_boyer_moore.pdf
- aux_get_matching_substring_length(j: int) int[source]¶
This auxilary function is used to compute the length of the longess suffix of the patterm that matches a substring of the pattern that ends at the index j.
It is used in the “substring match” case of the good suffix rule. More specifically, it is used to find when the suffix of the pattern does not match the text at all. Hence, we find the longest suffix of the pattern that matches a substring of the pattern that ends at the index j.
- Parameters:
j (int) – The end index of the substring.
- Returns:
The length of the longess suffix of the patterm that matches a substring of the pattern that ends at the index j.
- Return type:
int
- aux_get_suffix_prefix_length(i: int) int[source]¶
This auxiliary function is used to compute the length of the longest suffix of pattern[i:] that matches a “prefix” of the pattern.
- Parameters:
i (int) – The index of the suffix.
- Returns:
The length of the longest suffix of pattern[i:] that matches a “prefix” of the pattern.
- Return type:
int
- create_skip_bc() None[source]¶
This function creates the “bad character” skip table. (It is used in the preprocessing step of the Boyer-Moore search algorithm.)
- Parameters:
None –
- Returns:
None
- create_skip_gs() None[source]¶
This function creates the “good suffix” skip table. (It is used in the preprocessing step of the Boyer-Moore search algorithm.)
- Parameters:
None –
- Returns:
None
- search(pattern: str, text: str) int[source]¶
This function searches for the pattern in the text using the Boyer-Moore algorithm.
- Parameters:
pattern (str) – The pattern to search for.
text (str) – The text to search in.
- Returns:
The index of the pattern in the text (or -1 if the pattern is not found)
- Return type:
int
- Raises:
AssertionError – If the text or the pattern is not a string.
Faiss Semantic Search¶
- class string2string.search.FaissSearch(model_name_or_path: str = 'facebook/bart-large', tokenizer_name_or_path: str = 'facebook/bart-large', device: str = 'cpu')[source]¶
- __init__(model_name_or_path: str = 'facebook/bart-large', tokenizer_name_or_path: str = 'facebook/bart-large', device: str = 'cpu') None[source]¶
This function initializes the wrapper for the FAISS library, which is used to perform semantic search.
Attention
If you use this class, please make sure to cite the following paper:
@article{johnson2019billion, title={Billion-scale similarity search with {GPUs}}, author={Johnson, Jeff and Douze, Matthijs and J{\'e}gou, Herv{\'e}}, journal={IEEE Transactions on Big Data}, volume={7}, number={3}, pages={535--547}, year={2019}, publisher={IEEE} }
- The code is based on the following GitHub repository:
- Parameters:
model_name_or_path (str, optional) – The name or path of the model to use. Defaults to ‘facebook/bart-large’.
tokenizer_name_or_path (str, optional) – The name or path of the tokenizer to use. Defaults to ‘facebook/bart-large’.
device (str, optional) – The device to use. Defaults to ‘cpu’.
- Returns:
None
- add_faiss_index(column_name: str = 'embeddings', metric_type: int | None = None, batch_size: int = 8, **kwargs) None[source]¶
This function adds a FAISS index to the dataset.
- Parameters:
column_name (str, optional) – The name of the column containing the embeddings. Defaults to ‘embeddings’.
index_type (str, optional) – The index type to use. Defaults to ‘Flat’.
metric_type (str, optional) – The metric type to use. Defaults to ‘L2’.
- Returns:
None
- Raises:
ValueError – If the dataset is not initialized.
- get_embeddings(text: str | List[str], embedding_type: str = 'last_hidden_state', batch_size: int = 8, num_workers: int = 4) Tensor[source]¶
This function returns the embeddings of the input text.
- Parameters:
text (Union[str, List[str]]) – The input text.
embedding_type (str, optional) – The type of embedding to use. Defaults to ‘last_hidden_state’.
batch_size (int, optional) – The batch size to use. Defaults to 8.
num_workers (int, optional) – The number of workers to use. Defaults to 4.
- Returns:
The embeddings.
- Return type:
torch.Tensor
- Raises:
ValueError – If the embedding type is invalid.
This function returns the last hidden state (e.g., [CLS] token’s) of the input embeddings.
- Parameters:
embeddings (torch.Tensor) – The input embeddings.
- Returns:
The last hidden state.
- Return type:
torch.Tensor
- get_mean_pooling(embeddings: Tensor) Tensor[source]¶
This function returns the mean pooling of the input embeddings.
- Parameters:
embeddings (torch.Tensor) – The input embeddings.
- Returns:
The mean pooling.
- Return type:
torch.Tensor
- initialize_corpus(corpus: Dict[str, List[str]] | DataFrame | Dataset, section: str = 'text', index_column_name: str = 'embeddings', embedding_type: str = 'last_hidden_state', batch_size: int | None = None, num_workers: int | None = None, save_path: str | None = None) Dataset[source]¶
This function initializes a dataset using a dictionary or pandas DataFrame or HuggingFace Datasets object.
- Parameters:
dataset_dict (Dict[str, List[str]]) – The dataset dictionary.
section (str) – The section of the dataset to use whose embeddings will be used for semantic search (e.g., ‘text’, ‘title’, etc.) (default: ‘text’).
index_column_name (str) – The name of the column containing the embeddings (default: ‘embeddings’)
embedding_type (str) – The type of embedding to use (default: ‘last_hidden_state’).
batch_size (int, optional) – The batch size to use (default: 8).
max_length (int, optional) – The maximum length of the input sequences.
num_workers (int, optional) – The number of workers to use.
save_path (Optional[str], optional) – The path to save the dataset (default: None).
- Returns:
The dataset object (HuggingFace Datasets).
- Return type:
Dataset
- Raises:
ValueError – If the dataset is not a dictionary or pandas DataFrame or HuggingFace Datasets object.
- load_dataset_from_json(json_path: str) Dataset[source]¶
This function loads a dataset from a JSON file.
- Parameters:
json_path (str) – The path to the JSON file.
- Returns:
The dataset.
- Return type:
Dataset
- load_faiss_index(index_name: str, file_path: str, device: str = 'cpu') None[source]¶
- This function loads the FAISS index from the specified file path.
This is a wrapper function for the load_faiss_index function in the Dataset class.
- Parameters:
index_name (str) – The name of the FAISS index (e.g., “embeddings”)
file_path (str) – The file path to load the FAISS index from.
device (str, optional) – The device to use (“cpu” or “cuda”) (default: “cpu”).
- Returns:
None
- Raises:
ValueError – If the dataset is not initialized.
- save_faiss_index(index_name: str, file_path: str) None[source]¶
- This function saves the FAISS index to the specified file path.
This is a wrapper function for the save_faiss_index function in the Dataset class.
- Parameters:
index_name (str) – The name of the FAISS index (e.g., “embeddings”)
file_path (str) – The file path to save the FAISS index.
- Returns:
None
- Raises:
ValueError – If the dataset is not initialized.
- search(query: str, k: int = 1, index_column_name: str = 'embeddings') DataFrame[source]¶
This function searches for the most similar elements in the dataset, given a query.
- Parameters:
query (str) – The query.
k (int, optional) – The number of elements to return (default: 1).
index_column_name (str, optional) – The name of the column containing the embeddings (default: ‘embeddings’)
- Returns:
The most similar elements in the dataset (text, score, etc.), sorted by score.
- Return type:
pd.DataFrame
- Remarks:
The returned elements are dictionaries containing the text and the score.