Distance¶
The page contains the documentation for the string distance algorithms implemented in the package.
Levenshtein Edit Distance¶
- class string2string.distance.LevenshteinEditDistance(match_weight: float = 0.0, insert_weight: float = 1.0, delete_weight: float = 1.0, substitute_weight: float = 1.0)[source]¶
- __init__(match_weight: float = 0.0, insert_weight: float = 1.0, delete_weight: float = 1.0, substitute_weight: float = 1.0) None[source]¶
This class initializes the Levenshtein edit distance algorithm. Levenshtein edit distance represents the minimum number of edit distance operations (insertion, deletion, and substitution) required to convert one string to another.
The Levenshtein edit distance (with unit cost for each edit distance operation) is given by the following recurrence relation:
\begin{align} d[i, j] := \min( & d[i-1, j-1] + \texttt{mismatch}(i, j), \\ & d[i-1, j] + 1, \\ & d[i, j-1] + 1), \end{align}where \(\texttt{mismatch}(i, j)\) is 1 if the i-th element in str1 is not equal to the j-th element in str2, and 0 otherwise.
- Parameters:
match_weight (float) – The weight of a match (default: 0.0).
insert_weight (float) – The weight of an insertion (default: 1.0).
delete_weight (float) – The weight of a deletion (default: 1.0).
substitute_weight (float) – The weight of a substitution (default: 1.0).
- Raises:
AssertionError – If any of the weights are negative.
- compute(str1: str | List[str], str2: str | List[str], method: str = 'dynamic-programming') float[source]¶
This function computes the Levenshtein edit distance between two strings (or lists of strings), using the method specified by the user.
- Parameters:
str1 (str or list of str) – The first string (or list of strings).
str2 (str or list of str) – The second string (or list of strings).
method (str) – The method to use to compute the Levenshtein edit distance (default: “dynamic-programming”).
- Returns:
The Levenshtein edit distance between the two strings.
Note
- The method can be one of the following:
“recursive”: This method computes the Levenshtein edit distance using recursion.
“recursive-memoization”: This method computes the Levenshtein edit distance using recursion with memoization.
“dynamic-programming”: This method computes the Levenshtein edit distance using dynamic programming (Wagner-Fischer algorithm).
By default, the method is “dynamic-programming”.
- compute_dynamic_programming(str1: str | List[str], str2: str | List[str]) float[source]¶
This function computes the Levenshtein edit distance between two strings (or lists of strings) using dynamic programming (Wagner-Fischer algorithm).
- Parameters:
str1 (str or list of str) – The first string (or list of strings).
str2 (str or list of str) – The second string (or list of strings).
- Returns:
The Levenshtein edit distance between the two strings.
Note
The solution presented here utilizes dynamic programming principles to compute the Levenshtein edit distance between two strings.
This solution is also known as the Wagner-Fischer algorithm. [WF1974]
The time complexity of this dynamic-programming-based solution is \(\mathcal{O}(nm)\), and the space complexity is \(\mathcal{O}(nm)\), where n and m are the lengths of the two strings, respectively.
However, by using only two rows of the distance matrix at a time, the space complexity of the dynamic programming solution can be reduced to \(\mathcal{O}(min(n, m))\).
The time complexity cannot be made strongly subquadratic time unless SETH is false. [BI2015]
Finally, we note that this solution can be extended to cases where each edit distance operation has a non-unit cost.
[WF1974]Wagner, R.A. and Fischer, M.J., 1974. The string-to-string correction problem. Journal of the ACM (JACM), 21(1), pp.168-173.
[BI2015]Backurs, A. and Indyk, P., 2015, June. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In Proceedings of the forty-seventh annual ACM symposium on Theory of computing (pp. 51-58).
- compute_memoization_helper(str1: str | List[str], str2: str | List[str], memoization: Dict[Tuple[str, str], float]) float[source]¶
This is a helper function that computes the Levenshtein edit distance between two strings (or lists of strings) using memoization.
- Parameters:
str1 (str or list of str) – The first string (or list of strings).
str2 (str or list of str) – The second string (or list of strings).
memoization (dict) – The memoization dictionary.
- Returns:
The Levenshtein edit distance between the two strings.
Note
The solution presented here utilizes memoization to compute the Levenshtein edit distance between two strings.
One can also use the
functools.lru_cache()(@lru_cache()) decorator to memoize the function calls. However, for the sake of educational purposes, we have implemented memoization using a dictionary.The time complexity of this function is quadratic, that is \(\mathcal{O}(nm)\), where m and n are the lengths of the two strings.
- compute_recursive(str1: str | List[str], str2: str | List[str]) float[source]¶
This function computes the Levenshtein edit distance between two strings (or lists of strings) using recursion.
- Parameters:
str1 (str or list of str) – The first string (or list of strings).
str2 (str or list of str) – The second string (or list of strings).
- Returns:
The Levenshtein edit distance between the two strings.
Note
The solution presented here utilizes recursion to compute the Levenshtein edit distance between two strings. It has an exponential time complexity and is not recommended for pairs of strings with a large length.
The time complexity of this function is \(O(3^{m+n})\), where \(m\) and \(n\) are the lengths of the two strings.
- compute_recursive_memoization(str1: str | List[str], str2: str | List[str]) float[source]¶
This function computes the Levenshtein edit distance between two strings (or lists of strings) using memoization.
- Parameters:
str1 (str or list of str) – The first string (or list of strings).
str2 (str or list of str) – The second string (or list of strings).
- Returns:
The Levenshtein edit distance between the two strings.
Note
The solution presented here utilizes memoization to compute the Levenshtein edit distance between two strings.
The time complexity of this function is \(\mathcal{O}(m n)\), where \(m\) and \(n\) are the lengths of the two strings.
Hamming Distance¶
- class string2string.distance.HammingDistance(match_weight: float = 0.0, substitute_weight: float = 1.0)[source]¶
- __init__(match_weight: float = 0.0, substitute_weight: float = 1.0) None[source]¶
This function initializes the class variables of the Hamming distance.
The Hamming distance is the number of positions at which the corresponding symbols are different. [H1950]
- Parameters:
match_weight (float) – The weight of a match (default: 0.0).
substitute_weight (float) – The weight of a substitution (default: 1.0).
- Raises:
AssertionError – If the substite weight is negative.
Note
The Hamming distance has a time complexity of \(\mathcal{O}(n)\), where :math: n the length of the two strings.
[H1950]Hamming, R.W., 1968. Error detecting and error correcting codes. Bell System Technical Journal, 29(2), pp.147-160.
- compute(str1: str | List[str], str2: str | List[str]) float[source]¶
This function computes the Hamming distance between two strings (or lists of strings).
- Parameters:
str1 (str or list of str) – The first string (or list of strings).
str2 (str or list of str) – The second string (or list of strings).
- Returns:
The Hamming distance between the two strings.
- Raises:
ValueError – If the two strings (or lists of strings) have different lengths.
Damerau-Levenshtein Distance¶
- class string2string.distance.DamerauLevenshteinDistance(match_weight: float = 0.0, insert_weight: float = 1.0, delete_weight: float = 1.0, substitute_weight: float = 1.0, adjacent_transpose_weight: float = 1.0)[source]¶
- __init__(match_weight: float = 0.0, insert_weight: float = 1.0, delete_weight: float = 1.0, substitute_weight: float = 1.0, adjacent_transpose_weight: float = 1.0) None[source]¶
This function initializes the class variables of the Damerau-Levenshtein distance.
The Damerau-Levenshtein distance is the minimum number of insertions, deletions, substitutions, and transpositions required to transform one string into the other. [D1964]
- Parameters:
match_weight (float) – The weight of a match (default: 0.0).
insert_weight (float) – The weight of an insertion (default: 1.0).
delete_weight (float) – The weight of a deletion (default: 1.0).
substitute_weight (float) – The weight of a substitution (default: 1.0).
adjacent_transpose_weight (float) – The weight of an adjacent transposition (default: 1.0).
- Raises:
AssertionError – If the insert, delete, substite, or adjacent transpose weights are negative.
[D1964]Damerau, F.J., 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3), pp.171-176.
- compute(str1: str | List[str], str2: str | List[str]) float[source]¶
This function computes the Damerau-Levenshtein edit distance between two strings (or lists of strings).
- Parameters:
str1 (str or list of str) – The first string (or list of strings).
str2 (str or list of str) – The second string (or list of strings).
- Returns:
The Damerau-Levenshtein distance between the two strings.
Note
The Damerau-Levenshtein distance is a variant of the Levenshtein distance that allows for adjacent transpositions.
The dynamic programming solution to the Damerau-Levenshtein distance has a time complexity of \(\mathcal{O}(nm)\), where n and m are the lengths of the two strings.
Jaccard Index¶
- class string2string.distance.JaccardIndex[source]¶
- __init__() None[source]¶
This function initializes the class variables of the Jaccard index.
The Jaccard index is equal to 1.0 minus the Jaccard similarity coefficient. It is equal to 0.0 if and only if the two sets are equal. [J1938]
[J1938]Jaccard, P., 1912. The Distribution of the Flora in the Alpine Zone. New Phytologist, 11(2), pp.37-50.
- compute(str1: str | List[str], str2: str | List[str]) float[source]¶
This function computes the Jaccard index between two strings (or lists of strings).
- Parameters:
str1 (str or list of str) – The first string (or list of strings).
str2 (str or list of str) – The second string (or list of strings).
- Returns:
The Jaccard index between the two strings.