Alignment¶
The page contains the documentation for the string-to-string alignment algorithms implemented in the package.
Needleman-Wunsch Algorithm¶
- class string2string.alignment.NeedlemanWunsch(match_weight: float = 1.0, mismatch_weight: float = -1.0, gap_weight: float = -1.0, gap_char: str = '-', match_dict: dict | None = None)[source]¶
- __init__(match_weight: float = 1.0, mismatch_weight: float = -1.0, gap_weight: float = -1.0, gap_char: str = '-', match_dict: dict | None = None) None[source]¶
This function initializes the Needleman-Wunsch algorithm, which is used to get the global alignment of sequences (e.g., strings or lists of strings) such as DNA sequences.
The algorithm is described in the following paper: [Needleman1970]
- Parameters:
match_weight (float) – The weight of a match (default: 1.).
mismatch_weight (float) – The weight of a mismatch (default: -1.).
gap_weight (float) – The weight of a gap (default: -1.).
match_dict (dict) – The match dictionary (default: None).
gap_char (str) – The gap character (default: “-“).
[Needleman1970]Needleman, S.B. and Wunsch, C.D., 1970. A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology, 48(3), pp.443-453.
- backtrack(score_matrix: ndarray, str1: str | List[str], str2: str | List[str]) Tuple[str | List[str], str | List[str]][source]¶
This function is an auxilary function, used by the get_alignment() function, that backtracks the score matrix to get the aligned strings.
- Parameters:
score_matrix (np.ndarray) – The score matrix.
str1 – The first string (or list of strings).
str2 – The second string (or list of strings).
- Returns:
The aligned strings (or list of strings). The aligned strings are padded with spaces to make them the same length.
Note
The score matrix is assumed to be a 2D numpy array.
There might be multiple optimal alignments. This function returns one of the optimal alignments.
The backtracking step has a time complexity of \(O(m + n)\), where \(n\) and \(m\) are the lengths of the strings str1 and str2, respectively.
- get_alignment(str1: str | List[str], str2: str | List[str], return_score_matrix: bool = False) Tuple[str | List[str], str | List[str], ndarray | None][source]¶
This is the main function in the NeedlemanWunsch class that gets the alignment of two strings (or list of strings) by using the Needleman-Wunsch algorithm.
- Parameters:
str1 – The first string (or list of strings).
str2 – The second string (or list of strings).
return_score_matrix (bool) – Whether to return the score matrix. (Default: False)
- Returns:
The aligned strings (or list of strings). The aligned strings are padded with spaces to make them the same length. If return_score_matrix is True, the score matrix is also returned.
Note
There might be multiple optimal alignments. This function returns one of the optimal alignments.
The time complexity of this function is \(O(nm)\), where \(n\) and \(m\) are the lengths of the strings str1 and str2, respectively.
The space complexity of this function is \(O(nm)\).
The ” | ” character is used to separate the elements in the aligned strings.
Hirschberg’s Algorithm¶
- class string2string.alignment.Hirschberg(match_weight: int | float = 2, mismatch_weight: int | float = -1, gap_weight: int | float = -2, gap_char: str = '-', match_dict: dict | None = None)[source]¶
- __init__(match_weight: int | float = 2, mismatch_weight: int | float = -1, gap_weight: int | float = -2, gap_char: str = '-', match_dict: dict | None = None) None[source]¶
This function initializes the parameters of the Hirschberg algorithm [H1975], a space-efficient solution to the global alignment problem. It inherits from the NeedlemanWunsch class.
- Parameters:
match_weight (int or float) – The weight of a match (default: 2).
mismatch_weight (int or float) – The weight of a mismatch (default: -1).
gap_weight (int or float) – The weight of a gap (default: -2).
gap_char (str) – The character used to represent a gap (default: ‘-‘).
match_dict (dict) – The dictionary that maps the characters to their match weights (default: None).
Note
The default values are the same as the ones used in the Needleman-Wunsch algorithm.
The time complexity of Hirschberg’s algorithm is \(O(nm)\), where \(n\) and \(m\) are the lengths of the strings str1 and str2, respectively.
The space complexity of Hirschberg’s algorithm is, on the other hand, \(O(min(n, m))\).
We benefited from the following resources to implement this class: [K2015], [W2012], [M2010], [K2002].
[H1975]Hirschberg, Daniel S. “A linear space algorithm for computing maximal common subsequences.” Communications of the ACM 18.6 (1975): 341-343.
[K2015]Kellis, Manolis. Computational Biology: Genomes, Networks, Evolution (MIT Course 6.047/6.878) — https://ocw.mit.edu/ans7870/6/6.047/f15/MIT6_047F15_Compiled.pdf (Accessed on 02-16-2023) (Section 2.5.8; Linear Space Alignment, pg. 38-39).
[W2012]Wayne, Kevin. Lecture Slides for Algorithm Design - Dynamic Programming II (https://www.cs.princeton.edu/~wayne/kleinberg-tardos/pdf/06DynamicProgrammingII.pdf) (Accessed on 02-16-2023).
[M2010]Moura, Lucia. Algorithms in Bioinformatics: Lectures 3-5 - Sequence Similarity. Fall 2010. University of Ottawa. (https://www.site.uottawa.ca/~lucia/courses/2010/comp5511/lectures/03-05.pdf) (Accessed on 02-16-2023).
[K2002]Kingsford, Carl. Lecture 7: Dynamic Programming. 2002. Carnegie Mellon University. (https://www.cs.cmu.edu/~ckingsf/class/02-714/Lec07-linspace.pdf) (Accessed on 02-16-2023).
- get_alignment(str1: str | List[str], str2: str | List[str]) Tuple[str | List[str], str | List[str]][source]¶
This function gets the alignment of two strings (or list of strings) by using the Hirschberg algorithm.
- Parameters:
str1 – The first string (or list of strings).
str2 – The second string (or list of strings).
- Returns:
The aligned strings as a tuple of two strings (or list of strings).
Note
As a notable improvement of the Needleman-Wunsch algorithm, Hirschberg’s algorithm combines both divide-and-conquer and dynamic programming principles. This algorithm provides a space-efficient solution, with a time complexity of \(O(nm)\), where \(n\) and \(m\) are the lengths of the strings str1 and str2, respectively. Its space complexity, however, is \(O(min(n, m))\).
To further improve the algorithm, one may limit the number of insertions and deletions in the alignment. This strategy can notably reduce the time complexity from \(O(mn)\) to \(O((m + n) * k)\), where k is the maximum number of insertions or deletions allowed. By constraining the alignment around the diagonal in the score matrix with (2 * k + 1) cells, the new version of the algorithm can be called the k-banded Hirschberg algorithm. If k is arbitrarily small, this modification can lead to a significant improvement in the time complexity.
The k-banded Hirschberg algorithm, with its time complexity of \(O((m + n) * k)\), is a powerful strategy that balances space and time requirements in sequence alignment.
- get_alignment_helper(str1: str | List[str], str2: str | List[str]) Tuple[str | List[str], str | List[str]][source]¶
This is a helper function that is called by the get_alignment() function. This function gets the alignment of two strings (or list of strings) by using the Hirschberg algorithm.
- Parameters:
str1 – The first string (or list of strings).
str2 – The second string (or list of strings).
- Returns:
The aligned strings as a tuple of two strings (or list of strings).
Note
We assume that the length of str1 is greater than or equal to the length of str2.
Smith-Waterman Algorithm¶
- class string2string.alignment.SmithWaterman(match_weight: int | float = 1, mismatch_weight: int | float = -1, gap_weight: int | float = -1, gap_char: str = '-', match_dict: dict | None = None)[source]¶
- __init__(match_weight: int | float = 1, mismatch_weight: int | float = -1, gap_weight: int | float = -1, gap_char: str = '-', match_dict: dict | None = None) None[source]¶
This function initializes the class variables of the Smith-Waterman algorithm, used for local alignment of sequences (e.g., strings or lists of strings) such as DNA sequences.
- Parameters:
match_weight (int or float) – The weight of a match (default: 1).
mismatch_weight (int or float) – The weight of a mismatch (default: -1).
gap_weight (int or float) – The weight of a gap (default: -1).
gap_char (str) – The character used to represent a gap (default: ‘-‘).
match_dict (dict) – The dictionary that maps the characters to their match weights (default: None).
Note
The default values are the same as the Needleman-Wunsch algorithm.
The Smith-Waterman algorithm can be thought of a variant of the Needleman-Wunsch algorithm, where the max function is replaced by the max function with 0 as the default value and the first row and column of the score matrix are initialized to 0.
\begin{align} \texttt{max score} &= \texttt{max}(\texttt{match score}, \texttt{delete score}, \texttt{insert score}, 0) \end{align}We only need to override the get_alignment and backtrack functions of the NeedlemanWunsch class.
The space and time complexities of the Smith-Waterman algorithm are the same as the Needleman-Wunsch algorithm, that is, \(O(mn)\) and \(O(mn)\), respectively, where \(n\) and \(m\) are the lengths of the two strings (or lists of strings).
- backtrack(score_matrix: ndarray, str1: str | List[str], str2: str | List[str]) Tuple[str | List[str], str | List[str]][source]¶
This function overrides the backtrack function of the NeedlemanWunsch class to get an optimal local alignment between two strings (or list of strings).
- Parameters:
score_matrix (numpy.ndarray) – The score matrix.
str1 (str or list of str) – The first string (or list of strings).
str2 (str or list of str) – The second string (or list of strings).
- Returns:
The aligned substrings as a tuple of two strings (or list of strings).
Note
The backtrack function used in this function is different from the backtrack function used in the Needleman-Wunsch algorithm. Here we start from the position with the highest score in the score matrix and trace back to the first position that has a score of zero. This is because the highest-scoring subsequence may not necessarily span the entire length of the sequences being aligned.
On the other hand, the backtrack function used in the Needleman-Wunsch algorithm traces back through the entire score matrix, starting from the bottom-right corner, to determine the optimal alignment path. This is because the algorithm seeks to find the global alignment of two sequences, which means aligning them from the beginning to the end.
- get_alignment(str1: str | List[str], str2: str | List[str], return_score_matrix: bool = False) Tuple[str | List[str], str | List[str]][source]¶
This function overrides the get_alignment function of the NeedlemanWunsch class to get the alignment of two strings (or list of strings) by using the Smith-Waterman algorithm.
- Parameters:
str1 (str or list of str) – The first string (or list of strings).
str2 (str or list of str) – The second string (or list of strings).
return_score_matrix (bool) – Whether to return the score matrix (default: False)
- Returns:
The aligned strings as a tuple of two strings (or list of strings). If return_score_matrix is True, the score matrix is also returned.
Note
The Smith-Waterman algorithm is a dynamic programming algorithm that finds the optimal local alignment between two strings (or list of strings).
This function is similar to the get_alignment function in the NeedlemanWunsch class, with two differences. First, the first row and column of the score matrix are initialized to 0. Second, the max function used in the dynamic programming solution is replaced with a max function that defaults to 0, i.e., max_score = max(match_score, delete_score, insert_score, 0).
Despite these differences, the time and space complexity of the Smith-Waterman algorithm remain the same as that of the Needleman-Wunsch algorithm. It should be noted that most of the code in this function is identical to that in the get_alignment function of the NeedlemanWunsch class.
Dynamic Time Warping (DTW)¶
- class string2string.alignment.DTW[source]¶
-
- get_alignment_path(sequence1: str | List[str] | int | List[int] | float | List[float] | ndarray, sequence2: str | List[str] | int | List[int] | float | List[float] | ndarray, distance='absolute_difference', p_value: int | None = None) List[Tuple[int, int]][source]¶
This function gets the alignment indices of two sequences (or list of sequences) by using the DTW algorithm.
- Parameters:
sequence1 – The first sequence.
sequence2 – The second sequence.
distance (str) – The distance function to be used (currently only ‘absolute_difference’ and ‘square_difference’ are supported).
- Returns:
The path of the alignment as a list of tuples of two integers.
- Raises:
TypeError – If the input sequences are not of the same type.
ValueError – If the distance function is not supported.
Note
The DTW algorithm is a dynamic programming algorithm that finds the optimal alignment between two sequences (or list of sequences).
The time complexity of the DTW algorithm is \(O(nm)\), where \(n\) and \(m\) are the lengths of the two sequences, respectively.
Longest Common Subsequence¶
- class string2string.alignment.LongestCommonSubsequence(list_of_list_separator: str = ' ## ')[source]¶
- __init__(list_of_list_separator: str = ' ## ') None[source]¶
This function initializes the Longest Common Subsequence (LCSubsequenceuence) class, which inherits from the StringAlignment class.
Longest common subsequence (LCSubsequence) of two strings is a subsequence of maximal length that appears in both of them.
The following recurrence relation can be used to solve the LCSubsequence problem:
\begin{align} L[i,j] = \begin{cases} 0 &\text{ if } i=0 \text{ or } j=0\\ L[i-1,j-1]+1 &\text{ if } i,j>0 \text{ and } str1[i]=str2[j]\\ \max(L[i-1,j],L[i,j-1]) &\text{ if } i,j>0 \text{ and } str1[i]\neq str2[j]\\ \end{cases} \end{align}where \(L[i,j]\) denotes the length of the LCSubsequence of the prefixes str1[0:i] and str2[0:j]. The solution to the problem is then given by \(L[n,m]\), assuming that str1 and str2 have lengths n and m, respectively.
A dynamic programming solution exists for this problem with a quadratic (i.e., \(\mathcal{O}(nm)\)) space and time complexity.
If the vocabulary is fixed, LCSubsequence admits a “Four-Russians speedup,” which reduces its overall time complexity to subquadratic \(\mathcal{O}(n^2/\log n)\), but this algorithm is not yet implemented in this package.
- Parameters:
list_of_list_separator (str) – Separator to use when the inputs are lists of strings.
- compute(str1: str | List[str], str2: str | List[str], returnCandidates: bool = False) Tuple[float, List[str] | List[List[str]]][source]¶
This function computes the longest common subsequence between two strings (or lists of strings).
- Parameters:
str1 (str or list of str) – The first string (or list of strings) to compare.
str2 (str or list of str) – The second string (or list of strings).
returnCandidates (bool) – Whether to return the candidates for the longest common subsequence (default: False).
- Returns:
If returnCandidates is False, then the length of the longest common subsequence between the two strings.
If returnCandidates is True, then the set of all candidates for the longest common subsequence is also returned.
Note
Similar to that of the Levenshtein edit distance problem, the dynamic programming solution for the longest common subsequence problem can be further optimized by using the last row of the two-dimensional array L to compute the length of the LCSubsequence. The key idea behind this optimization is that the last row of the array L only depends on the values of the previous row. Therefore, we can store only two rows of the array L at a time and compute the LCSubsequence using these two rows.
This optimization reduces the space complexity of the algorithm to \(\mathcal{O}(m)\), where \(m\) is the length of the shorter input string. This optimization is particularly useful when one of the input strings is much shorter than the other, as it can significantly reduce the amount of memory required to solve the problem.
Longest Common Substring¶
- class string2string.alignment.LongestCommonSubstring(list_of_list_separator: str = ' ## ')[source]¶
- __init__(list_of_list_separator: str = ' ## ') None[source]¶
This function initializes the LongestCommonSubstring (LCSubstring) class.
Longest Common Substring (LCSubstring) of two strings is the longest substring that appears in both of them.
The following recurrence relation can be used to solve the LCSubstring problem:
\begin{align} L[i,j] = \begin{cases} 0 &\text{ if } i=0 \text{ or } j=0\\ L[i-1,j-1]+1 &\text{ if } i,j>0 \text{ and } str1[i]=str2[j]\\ 0 &\text{ if } i,j>0 \text{ and } str1[i]\neq str2[j]\\ \end{cases} \end{align}where \(L[i,j]\) denotes the length of the LCSubstring that ends at indices i and j in str1 and str2, respectively. The solution to the problem is then given by the maximum value of \(L[i,j]\), assuming that str1 and str2 have lengths n and m, respectively.
A dynamic programming solution exists for this problem with a quadratic (i.e., \(\mathcal{O}(nm)\)) space and time complexity.
- Parameters:
list_of_list_separator (str) – Separator to use when the inputs are lists of strings.
- Returns:
None
- compute(str1: str | List[str], str2: str | List[str], returnCandidates: bool = False) Tuple[float, List[str] | List[List[str]]][source]¶
This function computes the longest common substring between two strings (or lists of strings).
- Parameters:
str1 (str or list of str) – The first string (or list of strings).
str2 (str or list of str) – The second string (or list of strings).
returnCandidates (bool) – A boolean flag indicating whether to return the longest common substring as a list of lists.
- Returns:
If returnCandidates is False, then the length of the longest common substring between the two strings. If returnCandidates is True, then the set of longest common substrings between the two strings (or lists of strings) is also returned.
Note
There exists a linear-time solution to LCSubstring problem that uses generalized suffix trees.
As with the longest common subsequence problem, the longest common substring is not unique. It is possible to have multiple substrings with the same maximum length that appear in both strings.
Similar to the dynamic programming solution for the longest common subsequence problem, the last row of the matrix can also be used to optimize the computation of the longest common substring.
It’s important to note that the longest common substring is different from the longest common subsequence. The longest common substring is a contiguous sequence of symbols that appears in both strings, while the longest common subsequence is a sequence of characters that may not be contiguous.
The longest common substring is a measure of similarity between two strings and is used in various fields, including computational biology, where it is used to compare DNA sequences. Similarly, it is also used in plagirism detection and other applications.
String Alignment Parent Class¶
- class string2string.alignment.StringAlignment(match_weight: int = 1.0, mismatch_weight: int = -1.0, gap_weight: int = -1, gap_char: str = '-', match_dict: dict | None = None)[source]¶
This class is the parent class for all alignment algorithms implemented in this module.
- __init__(match_weight: int = 1.0, mismatch_weight: int = -1.0, gap_weight: int = -1, gap_char: str = '-', match_dict: dict | None = None) None[source]¶
This function initializes the StringAlignment class.
- Parameters:
match_weight (int) – The weight for a match (default: 1).
mismatch_weight (int) – The weight for a mismatch (default: -1).
gap_weight (int) – The weight for a gap (default: -1).
gap_char (str) – The character for a gap (default: “-“).
match_dict (dict) – The match dictionary (default: None).
- Returns:
None
Note
The match_dict represents a dictionary of the match weights for each pair of characters. For example, if the match_dict is {“A”: {“A”: 1, “T”: -1}, “T”: {“A”: -1, “T”: 1}}, then the match weight for “A” and “A” is 1, the match weight for “A” and “T” is -1, the match weight for “T” and “A” is -1, and the match weight for “T” and “T” is 1. The match_dict is particularly useful when we wish to align (or match) non-identical characters. For example, if we wish to align “A” and “T”, we can set the match_dict to {“A”: {“T”: 1}}. This will ensure that the match weight for “A” and “T” is 1, and the match weight for “A” and “A” and “T” and “T” is 0.
- add_space_to_shorter(str1: str, str2: str) Tuple[str, str][source]¶
This function adds gaps to the shorter string to make the two strings the same length.
- Parameters:
str1 (str) – The first string.
str2 (str) – The second string.
- Returns:
The two strings with the same length.
- bool_match(c1: str | List[str], c2: str | List[str]) bool[source]¶
The function returns whether two characters match, according to the match dictionary (if it exists).
- Parameters:
c1 (str or list of str) – The first character or string.
c2 (str or list of str) – The second character or string.
- Returns:
Whether the two characters match (True or False)
- compute_multiple_pairs(pairs: List[Tuple[str | List[str], str | List[str]]], num_workers: int = 1, method: str = 'multiprocessing', **kwargs) List[Tuple[float, List[str] | List[List[str]]]][source]¶
This “meta” function computes the alignment score of multiple pairs of strings (or lists of strings) in parallel.
- Parameters:
pairs (list of tuples) – A list of tuples, where each tuple contains two strings (or lists of strings).
num_workers (int) – The number of workers to use for multiprocessing.
method (str) – The method to use for parallelization. Options are “multiprocessing” and “joblib”.
**kwargs – Additional keyword arguments to pass to the compute function.
- Returns:
A list of tuples, where each tuple contains the alignment score and the alignment of the two strings (or lists of strings), based on the compute function.
Note
This function uses multiprocessing, either via the multiprocessing module or via joblib.
The multiprocessing module is used by default, but joblib can be used instead by setting method=”joblib”.
We found that joblib is empirically faster than multiprocessing for this particular problem.
- get_alignment_score(str1: str | List[str], str2: str | List[str]) float[source]¶
This function returns the alignment score of two strings (or list of strings).
- Parameters:
str1 (str or list of str) – The first string (or list of strings).
str2 (str or list of str) – The second string (or list of strings).
- Returns:
The alignment score of the two strings (or list of strings).
- get_alignment_strings_and_indices(str1: str, str2: str, separator: str = ' | ') Tuple[Tuple[List[int], List[int]], List[str], List[str]][source]¶
This function returns the indices of the aligned characters, and the two strings separated by the separator.
- Parameters:
str1 (str) – The first string, separated by the separator.
str2 (str) – The second string, separated by the separator.
separator (str) – The separator.
- Returns:
The indices of the aligned characters, and the two strings separated by the separator.
- get_gap_weight(c: str | List[str]) float[source]¶
This function returns the gap weight of a character or string.
- Parameters:
c (str or list of str) – The character or string.
- Returns:
The gap weight of the character or string.
- get_match_weight(c1: str | List[str], c2: str | List[str]) float[source]¶
This function returns the match weight of two characters.
- Parameters:
c1 (str or list of str) – The first character or string.
c2 (str or list of str) – The second character or string.
- Returns:
The match weight of the two characters or strings.
- get_score(c1: str | List[str], c2: str | List[str]) float[source]¶
This function returns the score of a character or string pair.
- Parameters:
c1 (str or list of str) – The first character or string.
c2 (str or list of str) – The second character or string.
- Returns:
The score of the character or string pair.