Similarity¶

This page documents the string-to-string similarity algorithms implemented in the package.

Cosine Similarity Measure¶

class string2string.similarity.CosineSimilarity[source]¶

__init__() → None[source]¶

This function initializes the CosineSimilarity class.

compute(x1: Tensor | ndarray, x2: Tensor | ndarray, dim: int = 0, eps: float = 1e-08) → Tensor | ndarray[source]¶

Computes the cosine similarity between two tensors (or numpy arrays) along a given dimension.

For two (non-zero) vectors, \(x_1\) and \(x_2\), the cosine similarity is defined as follows:

\begin{align} \texttt{cosine-similarity}(x_1, x_2) & = \cos(\theta) \\ & = \frac{x_1 \cdot x_2}{||x_1|| \ ||x_2||} \\ & = \frac{\sum_{i=1}^n x_{1i} x_{2i}}{\sqrt{\sum_{i=1}^n x_{1i}^2} \sqrt{\sum_{i=1}^n x_{2i}^2}} \end{align}
where \(\theta\) denotes the angle between the vectors, \(\cdot\) the dot product, and \(||\cdot||\) the norm operator.
In practice, the cosine similarity is computed as follows:

\begin{align} \texttt{cosine-similarity}(x_1, x_2) & = \frac{x_1 \cdot x_2}{\max(||x_1|| ||x_2||, \epsilon)} \end{align}
where \(\epsilon\) is a small value to avoid division by zero.

Parameters:

x1 (Union[Tensor, np.ndarray]) – First tensor (or numpy array).
x2 (Union[Tensor, np.ndarray]) – Second tensor (or numpy array).
dim (int) – Dimension to compute cosine similarity (default: 0).
eps (float) – Epsilon value (to avoid division by zero).

Returns:

Cosine similarity between two tensors (or numpy arrays) along a given dimension.

Return type:

Union[Tensor, np.ndarray]

Raises:

TypeError – If x1 and x2 are not of the same type (either tensor or numpy array).
TypeError – If x1 and x2 are not tensors or numpy arrays.
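
As an illustration of the epsilon-clamped formula above, here is a minimal sketch in plain Python. It works on lists of floats rather than tensors or numpy arrays, and the function name cosine_similarity is hypothetical, not part of the package.

```python
import math
from typing import Sequence


def cosine_similarity(x1: Sequence[float], x2: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine similarity of two equal-length vectors, with the denominator clamped at eps."""
    if len(x1) != len(x2):
        raise ValueError("x1 and x2 must have the same length")
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    # max(||x1|| ||x2||, eps) keeps the division defined when either vector is zero.
    return dot / max(norm1 * norm2, eps)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
print(cosine_similarity([0.0, 0.0], [1.0, 2.0]))  # zero vector: no ZeroDivisionError
```

With the clamp in place, a zero vector yields a similarity of 0 instead of raising a division error.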

BERTScore¶

class string2string.similarity.BERTScore(model_name_or_path: str | None = None, lang: str | None = None, num_layers: int | None = None, all_layers: bool = False, use_fast_tokenizer: bool = False, device: str = 'cpu', baseline_path: str | None = None)[source]¶

This class implements the BERTScore algorithm.

__init__(model_name_or_path: str | None = None, lang: str | None = None, num_layers: int | None = None, all_layers: bool = False, use_fast_tokenizer: bool = False, device: str = 'cpu', baseline_path: str | None = None) → None[source]¶

This function initializes the BERTScore class, which computes the BERTScore between two texts.

Parameters:

model_name_or_path (str) – BERT model type to use (e.g., bert-base-uncased).
lang (str) – Language of the texts (e.g., en).
num_layers (int) – Number of layers to use.
all_layers (bool) – Whether to use all layers.
use_fast_tokenizer (bool) – Whether to use the fast tokenizer.
device (str) – Device to use (e.g., cpu or cuda).
baseline_path (str) – Path to the baseline file.

Returns:

None

Raises:

ValueError – If model_name_or_path and lang are both None.

Attention

If you use this class, please make sure to cite the following paper:

@inproceedings{bertscore2020,
    title={BERTScore: Evaluating Text Generation with BERT},
    author={Tianyi Zhang* and Varsha Kishore* and Felix Wu* and Kilian Q. Weinberger and Yoav Artzi},
    booktitle={International Conference on Learning Representations},
    year={2020},
    url={https://openreview.net/forum?id=SkeHuCVFDr}
}

Note

If model_name_or_path is not specified, the default model for the language is used.
If num_layers is not specified, the default number of layers is used.
If device is not specified, the GPU is used if available; otherwise, the CPU is used.
If baseline_path is not specified, the default baseline file is used.

compute(source_sentences: List[str], target_sentences: List[str] | List[List[str]], batch_size: int = 4, idf: bool = False, nthreads: int = 4, return_hash: bool = False, rescale_with_baseline: bool = False, verbose: bool = False) → dict | str | None[source]¶

This function scores the source sentences based on their similarity to the target sentences using BERTScore.

Parameters:

source_sentences (list of str) – Candidate sentences.
target_sentences (list of str or list of list of str) – Reference sentences.
batch_size (int) – Batch size used when computing BERTScore.
idf (bool or dict) – Whether to use IDF weighting; a precomputed idf_dict can also be passed.
nthreads (int) – Number of threads.
return_hash (bool) – Whether to return the hash code of the settings.
rescale_with_baseline (bool) – Whether to rescale BERTScore with the precomputed baseline.
verbose (bool) – Whether to print intermediate status updates.

Returns:

A dictionary containing the precision, recall, and F1 scores, plus the hash code if return_hash is True. The precision, recall, and F1 scores are tensors of shape (len(source_sentences),).

Return type:

(Dict[str, Tensor], Optional[str])

Raises:

ValueError – If the number of source sentences and target sentences do not match.

BARTScore¶

class string2string.similarity.BARTScore(model_name_or_path='facebook/bart-large-cnn', tokenizer_name_or_path: str | None = None, device: str = 'cpu', max_length=1024)[source]¶

This class implements the BARTScore algorithm.

__init__(model_name_or_path='facebook/bart-large-cnn', tokenizer_name_or_path: str | None = None, device: str = 'cpu', max_length=1024) → None[source]¶

This function initializes the BARTScore class, which computes the BARTScore between two pieces of text.

Parameters:

model_name_or_path (str) – The name or path of the model. Defaults to ‘facebook/bart-large-cnn’.
tokenizer_name_or_path (str) – The name or path of the tokenizer. Defaults to None.
device (str) – The device to use. Defaults to ‘cpu’.
max_length (int) – The maximum length of the input. Defaults to 1024.

Returns:

None

Raises:

ValueError – If the device is not ‘cpu’ or ‘cuda’.

Attention

If you use this class, please make sure to cite the following paper:

@inproceedings{bartscore2021,
    author = {Yuan, Weizhe and Neubig, Graham and Liu, Pengfei},
    booktitle = {Advances in Neural Information Processing Systems},
    editor = {M. Ranzato and A. Beygelzimer and Y. Dauphin and P.S. Liang and J. Wortman Vaughan},
    pages = {27263--27277},
    publisher = {Curran Associates, Inc.},
    title = {BARTScore: Evaluating Generated Text as Text Generation},
    url = {https://proceedings.neurips.cc/paper/2021/file/e4d2b6e6fdeca3e60e0f1a62fee3d9dd-Paper.pdf},
    volume = {34},
    year = {2021}
}

Note

The default model is the BART-large-cnn model.
If the tokenizer name or path is not specified, then the model name or path will be used.
If the device is ‘cuda’, then the model will be loaded onto the GPU.
If device is not specified, use the GPU if available, otherwise use the CPU.

compute(source_sentences: List[str], target_sentences: List[str] | List[List[str]], batch_size: int = 4, agg: str = 'mean') → Dict[str, List[float]][source]¶

This function scores the target sentences against the source sentences using BARTScore.

Parameters:

source_sentences (List[str]) – The source sentences.
target_sentences (Union[List[str], List[List[str]]]) – The target sentences.
batch_size (int) – The batch size to use (default: 4).
agg (str) – The aggregation method. Defaults to ‘mean’; used only when target_sentences is a list of lists.

Returns:

The BARTScore for each example.

Return type:

Dict[str, List[float]]

Raises:

ValueError – If the number of source sentences and target sentences do not match.

compute_multi_ref_score(source_sentences: List[str], target_sentences: List[List[str]], batch_size: int = 4, agg: str = 'mean') → Dict[str, List[float]][source]¶

Score a batch of examples with multiple references.

Parameters:

source_sentences (List[str]) – The source sentences.
target_sentences (List[List[str]]) – The target sentences.
agg (str) – The aggregation method. Can be “mean” or “max”.
batch_size (int) – The batch size.

Returns:

The BARTScore for each example.

Return type:

Dict[str, List[float]]

Raises:

ValueError – If the number of source sentences and target sentences do not match.
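
The role of agg can be sketched as follows: each source sentence has one score per reference, and the aggregation collapses those scores to a single value. This is only an illustration in plain Python. The helper aggregate_scores and the score values are made up, and raising ValueError for an unknown method is an assumption rather than documented behavior.

```python
from statistics import mean
from typing import List


def aggregate_scores(per_reference_scores: List[List[float]], agg: str = "mean") -> List[float]:
    """Collapse the per-reference scores of each example into one score (hypothetical helper)."""
    if agg == "mean":
        return [mean(scores) for scores in per_reference_scores]
    if agg == "max":
        return [max(scores) for scores in per_reference_scores]
    # Assumption: unsupported methods are rejected; the documentation does not say.
    raise ValueError(f"Unsupported aggregation method: {agg!r}")


# Made-up scores: two examples, with three and two references respectively.
scores = [[-2.0, -1.0, -3.0], [-0.5, -1.5]]
print(aggregate_scores(scores, agg="mean"))
print(aggregate_scores(scores, agg="max"))
```

Under this reading, "max" rewards a source sentence for matching its best reference, while "mean" averages over all of its references.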

load(weights_path=None) → None[source]¶

This function loads the model weights from a specified path.

Parameters:

weights_path (str) – The path to the weights.

Returns:

None

Longest Common Substring Similarity¶

class string2string.similarity.LCSubstringSimilarity[source]¶

This class contains the Longest Common Substring similarity metric.

This class inherits from the LongestCommonSubstring class.

__init__()[source]¶

This function initializes the LongestCommonSubstring (LCSubstring) class.

Longest Common Substring (LCSubstring) of two strings is the longest substring that appears in both of them.

The following recurrence relation can be used to solve the LCSubstring problem:

\begin{align} L[i,j] = \begin{cases} 0 &\text{ if } i=0 \text{ or } j=0\\ L[i-1,j-1]+1 &\text{ if } i,j>0 \text{ and } str1[i]=str2[j]\\ 0 &\text{ if } i,j>0 \text{ and } str1[i]\neq str2[j]\\ \end{cases} \end{align}

where \(L[i,j]\) denotes the length of the LCSubstring that ends at indices i and j in str1 and str2, respectively. The solution to the problem is then given by the maximum value of \(L[i,j]\) over all \(0 \le i \le n\) and \(0 \le j \le m\), where n and m are the lengths of str1 and str2, respectively.

A dynamic programming solution exists for this problem with a quadratic (i.e., \(\mathcal{O}(nm)\)) space and time complexity.

Parameters:

list_of_list_separator (str) – Separator to use when the inputs are lists of strings.

Returns:

None
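
The LCSubstring recurrence above can be sketched in plain Python as follows. This is a minimal illustration, not the package's implementation: the function name lcsubstring_length is hypothetical, and the code uses 0-based string indexing, so \(L[i,j]\) compares str1[i-1] with str2[j-1].

```python
from typing import List


def lcsubstring_length(str1: str, str2: str) -> int:
    """Length of the longest common substring, using the O(nm) table L."""
    n, m = len(str1), len(str2)
    # L[i][j] = length of the common substring ending at str1[i-1] and str2[j-1].
    # Row 0 and column 0 stay 0 (the base case of the recurrence).
    L: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if str1[i - 1] == str2[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
                best = max(best, L[i][j])
            # On a mismatch L[i][j] stays 0, as in the third case of the recurrence.
    return best


print(lcsubstring_length("xabcdy", "zabcdw"))  # common substring "abcd"
```

The answer is the maximum over the whole table rather than the last cell, because a common substring can end anywhere in either string.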

compute(str1: str | List[str], str2: str | List[str], denominator: str = 'max') → float[source]¶

Returns the LCS-similarity between two strings.

Parameters:

str1 (Union[str, List[str]]) – The first string or list of strings.
str2 (Union[str, List[str]]) – The second string or list of strings.
denominator (str) – The denominator to use. Options are ‘max’ and ‘sum’. Default is ‘max’.

Returns:

The similarity between the two strings.

Return type:

float

Raises:

ValueError – If the denominator is invalid.

Longest Common Subsequence Similarity¶

class string2string.similarity.LCSubsequenceSimilarity[source]¶

This class contains the Longest Common Subsequence similarity metric.

This class inherits from the LongestCommonSubsequence class.

__init__()[source]¶

This function initializes the Longest Common Subsequence (LCSubsequence) class, which inherits from the StringAlignment class.

Longest common subsequence (LCSubsequence) of two strings is a subsequence of maximal length that appears in both of them.

The following recurrence relation can be used to solve the LCSubsequence problem:

\begin{align} L[i,j] = \begin{cases} 0 &\text{ if } i=0 \text{ or } j=0\\ L[i-1,j-1]+1 &\text{ if } i,j>0 \text{ and } str1[i]=str2[j]\\ \max(L[i-1,j],L[i,j-1]) &\text{ if } i,j>0 \text{ and } str1[i]\neq str2[j]\\ \end{cases} \end{align}

where \(L[i,j]\) denotes the length of the LCSubsequence of the prefixes str1[0:i] and str2[0:j]. The solution to the problem is then given by \(L[n,m]\), assuming that str1 and str2 have lengths n and m, respectively.

A dynamic programming solution exists for this problem with a quadratic (i.e., \(\mathcal{O}(nm)\)) space and time complexity.

If the vocabulary is fixed, LCSubsequence admits a “Four-Russians speedup,” which reduces its overall time complexity to subquadratic \(\mathcal{O}(n^2/\log n)\), but this algorithm is not yet implemented in this package.

Parameters:

list_of_list_separator (str) – Separator to use when the inputs are lists of strings.
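
The LCSubsequence recurrence above can be sketched in plain Python as follows. This is a minimal illustration of the quadratic dynamic program, not the package's implementation: the function name lcsubsequence_length is hypothetical, and the code uses 0-based string indexing, so \(L[i,j]\) compares str1[i-1] with str2[j-1].

```python
from typing import List


def lcsubsequence_length(str1: str, str2: str) -> int:
    """Length of the longest common subsequence, using the O(nm) table L."""
    n, m = len(str1), len(str2)
    # L[i][j] = LCSubsequence length of the prefixes str1[0:i] and str2[0:j].
    L: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if str1[i - 1] == str2[j - 1]:
                # Matching characters extend the subsequence of the shorter prefixes.
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                # Otherwise drop one character from either string and keep the better result.
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]


print(lcsubsequence_length("AGGTAB", "GXTXAYB"))  # common subsequence "GTAB"
```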

compute(str1: str | List[str], str2: str | List[str], denominator: str = 'max') → float[source]¶

Returns the LCS-similarity between two strings.

Parameters:

str1 (Union[str, List[str]]) – The first string or list of strings.
str2 (Union[str, List[str]]) – The second string or list of strings.
denominator (str) – The denominator to use. Options are ‘max’ and ‘sum’. Default is ‘max’.

Returns:

The similarity between the two strings.

Return type:

float

Raises:

ValueError – If the denominator is invalid.
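
The documentation above does not spell out how the denominator option normalizes the LCS length. The sketch below shows one plausible reading, under the assumption that ‘max’ divides by the length of the longer input and ‘sum’ uses twice the LCS length over the sum of both lengths, so that identical strings score 1 either way. The helper normalize_lcs is hypothetical and should be checked against the package source.

```python
def normalize_lcs(lcs_length: int, len1: int, len2: int, denominator: str = "max") -> float:
    """Turn an LCS length into a similarity in [0, 1] (assumed normalization, see lead-in).

    Precondition: at least one of len1, len2 is positive.
    """
    if denominator == "max":
        # Assumption: divide by the length of the longer input.
        return lcs_length / max(len1, len2)
    if denominator == "sum":
        # Assumption: Dice-style normalization over the combined length.
        return 2.0 * lcs_length / (len1 + len2)
    raise ValueError(f"Invalid denominator: {denominator!r}")


# "AGGTAB" (length 6) and "GXTXAYB" (length 7) share the subsequence "GTAB" (length 4).
print(normalize_lcs(4, 6, 7, denominator="max"))
print(normalize_lcs(4, 6, 7, denominator="sum"))
```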

Jaro Similarity¶

class string2string.similarity.JaroSimilarity[source]¶

This class contains the Jaro similarity metric.

__init__()[source]¶

compute(str1: str | List[str], str2: str | List[str]) → float[source]¶

This function returns the Jaro similarity between two strings.

Parameters:

str1 (Union[str, List[str]]) – The first string or list of strings.
str2 (Union[str, List[str]]) – The second string or list of strings.

Returns:

The Jaro similarity between the two strings.

Return type:

float
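
For reference, the standard Jaro definition can be sketched in plain Python as shown below. With m matching characters and t transpositions (half the number of matched characters that appear in a different order), the similarity is (m/|s1| + m/|s2| + (m - t)/m) / 3, and characters only count as matching if they lie within a window of floor(max(|s1|, |s2|) / 2) - 1 positions. The function name jaro_similarity is hypothetical, the treatment of two empty strings as fully similar is a convention chosen here, and the package's implementation may differ in such details.

```python
def jaro_similarity(str1: str, str2: str) -> float:
    """Jaro similarity following the standard definition (illustrative sketch)."""
    len1, len2 = len(str1), len(str2)
    if len1 == 0 and len2 == 0:
        return 1.0  # Convention chosen for this sketch.
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(str1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and str2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Compare the matched characters in order; each out-of-order pair is half a transposition.
    seq1 = [c for c, flag in zip(str1, matched1) if flag]
    seq2 = [c for c, flag in zip(str2, matched2) if flag]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


print(round(jaro_similarity("MARTHA", "MARHTA"), 4))
print(round(jaro_similarity("CRATE", "TRACE"), 4))
```

In the first example all six characters match and one transposition (T/H) is counted, giving 17/18. In the second, only R, A, and E fall inside the matching window, giving 11/15.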