Metrics

This page documents the string metrics provided by the library.

Exact Match

class string2string.metrics.ExactMatch

__init__() → None

compute(predictions: List[str], references: List[List[str]], lowercase: bool = True) → Dict[str, float]

This function returns the exact match score between the predictions and the references.

Parameters:

predictions (List[str]) – The list of predictions.
references (List[List[str]]) – The list of references.
lowercase (bool) – Whether to lowercase the text before comparison. Default is True.

Returns:

The exact match score.

Return type:

Dict[str, float]

Raises:

AssertionError – If the number of predictions does not match the number of references.
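
Example (illustrative sketch)

The following pure-Python sketch shows one plausible reading of the computation described above. It does not call the library. It assumes a prediction counts as correct when it equals at least one of its references, and the function name exact_match_sketch and the "score" key are hypothetical names chosen for illustration.

```python
from typing import Dict, List


def exact_match_sketch(
    predictions: List[str],
    references: List[List[str]],
    lowercase: bool = True,
) -> Dict[str, float]:
    """Illustrative only: fraction of predictions equal to any of their references."""
    # Mirror the documented precondition: one reference list per prediction.
    assert len(predictions) == len(references), "Length mismatch."

    num_correct = 0
    for prediction, reference_list in zip(predictions, references):
        if lowercase:
            prediction = prediction.lower()
            reference_list = [ref.lower() for ref in reference_list]
        # Assumption: matching any single reference is enough.
        if prediction in reference_list:
            num_correct += 1

    # The "score" key is a hypothetical name chosen for this sketch.
    score = num_correct / len(predictions) if predictions else 0.0
    return {"score": score}


result = exact_match_sketch(["Hello", "bye"], [["hello", "hi"], ["goodbye"]])
print(result)
```

With lowercase=True, "Hello" matches the reference "hello", while "bye" does not match "goodbye", so one of the two predictions counts as correct.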

sacreBLEU (sBLEU)

class string2string.metrics.sacreBLEU

This class contains the sacreBLEU metric.

__init__() → None

Initializes the sacreBLEU class.

compute(predictions: List[str], references: List[List[str]], smooth_method: str = 'exp', smooth_value: float | None = None, lowercase: bool = False, tokenizer_name: str | None = 'none', use_effective_order: bool = False, return_only: List[str] = ['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len'])

Returns the BLEU score between a list of predictions and list of list of references.

Parameters:

predictions (List[str]) – The predictions.
references (List[List[str]]) – The references (or ground truth strings).
smooth_method (str) – The smoothing method. Default is “exp”. Other options are “floor”, “add-k” and “none”.
smooth_value (Optional[float]) – The smoothing value for floor and add-k smoothing. Default is None.
lowercase (bool) – Whether to lowercase the text. Default is False.
tokenizer_name (str) – The tokenizer name. Default is “none”. Other options are “zh”, “13a”, “intl”, “char”, “ja-mecab”, “ko-mecab”, “spm”, “flores101” and “flores200”.
use_effective_order (bool) – Whether to use the effective order. Default is False.
return_only (Optional[List[str]]) – The list of BLEU score components to return. Default is [‘score’, ‘counts’, ‘totals’, ‘precisions’, ‘bp’, ‘sys_len’, ‘ref_len’].

Returns:

A dictionary containing the BLEU score components requested in return_only.

Return type:

Dict[str, float]

Raises:

ValueError – If the number of predictions does not match the number of references.
ValueError – If the tokenizer name is invalid.
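
Example (illustrative sketch)

To make the component names in return_only more concrete, the sketch below computes them for a single prediction in plain Python. It is not the sacreBLEU implementation: it assumes whitespace tokenization, applies no smoothing, and reports the score on a 0 to 1 scale. The function names ngram_counts and bleu_components_sketch are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_components_sketch(
    prediction: str, references: List[str], max_order: int = 4
) -> Dict[str, object]:
    """Illustrative only: unsmoothed BLEU components for one prediction."""
    pred = prediction.split()  # whitespace tokenization, an assumption of this sketch
    refs = [ref.split() for ref in references]

    counts, totals = [], []
    for n in range(1, max_order + 1):
        pred_ngrams = ngram_counts(pred, n)
        # An n-gram is credited at most as often as it occurs in any single reference.
        max_ref_ngrams: Counter = Counter()
        for ref in refs:
            max_ref_ngrams |= ngram_counts(ref, n)  # element-wise maximum
        clipped = pred_ngrams & max_ref_ngrams  # element-wise minimum
        counts.append(sum(clipped.values()))
        totals.append(sum(pred_ngrams.values()))

    precisions = [c / t if t > 0 else 0.0 for c, t in zip(counts, totals)]

    sys_len = len(pred)
    # Reference length closest to the prediction length (ties go to the shorter one).
    ref_len = min((len(ref) for ref in refs), key=lambda length: (abs(length - sys_len), length))

    # Brevity penalty: penalizes predictions shorter than the reference.
    if sys_len == 0:
        bp = 0.0
    elif sys_len >= ref_len:
        bp = 1.0
    else:
        bp = math.exp(1.0 - ref_len / sys_len)

    # Geometric mean of the precisions; without smoothing, any zero precision gives 0.
    if all(p > 0 for p in precisions):
        score = bp * math.exp(sum(math.log(p) for p in precisions) / max_order)
    else:
        score = 0.0

    return {
        "score": score,
        "counts": counts,
        "totals": totals,
        "precisions": precisions,
        "bp": bp,
        "sys_len": sys_len,
        "ref_len": ref_len,
    }


print(bleu_components_sketch("the cat sat on the mat", ["the cat sat on the mat"]))
```

The smoothing methods listed above exist to avoid the hard zero that this unsmoothed sketch produces whenever a higher-order n-gram has no match.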

ROUGE

class string2string.metrics.ROUGE(tokenizer: Tokenizer | None = None)

This class is a wrapper for the ROUGE metric from Google Research’s rouge_score package.

__init__(tokenizer: Tokenizer | None = None) → None

This function initializes the ROUGE class, which is a wrapper for the ROUGE metric from Google Research’s rouge_score package.

Parameters:

tokenizer (Tokenizer) – The tokenizer to use. Default is None.

Returns:

None

compute(predictions: List[str], references: List[List[str]], rouge_types: str | List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], use_stemmer: bool = False, interval_name: str = 'mid', score_type: str = 'fmeasure') → Dict[str, float]

This function returns the ROUGE score between a list of predictions and list of list of references.

Parameters:

predictions (List[str]) – The predictions.
references (List[List[str]]) – The references (or ground truth strings).
rouge_types (Union[str, List[str]]) – The ROUGE types to use. Default is [“rouge1”, “rouge2”, “rougeL”, “rougeLsum”].
use_stemmer (bool) – Whether to use a stemmer. Default is False.
interval_name (str) – The interval name. Default is “mid”.
score_type (str) – The score type. Default is “fmeasure”.

Returns:

A dictionary mapping each requested ROUGE type to its score (between 0 and 1).

Return type:

Dict[str, float]

Raises:

ValueError – If the number of predictions does not match the number of references.
ValueError – If the interval name, score type or ROUGE type is invalid.
ValueError – If the prediction or reference is invalid.

Note

The ROUGE score is computed using the ROUGE metric from Google Research’s rouge_score package.
By default, BootstrapAggregator is used to aggregate the scores.
By default, the interval name is “mid” and the score type is “fmeasure”.
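
Example (illustrative sketch)

As a rough illustration of what the “fmeasure” score type measures for “rouge1”, the sketch below computes unigram-overlap precision, recall, and F-measure for one prediction and one reference. It is not the rouge_score implementation: it assumes lowercased whitespace tokens, uses no stemmer, and performs no bootstrap aggregation. The function name rouge1_sketch is hypothetical.

```python
from collections import Counter
from typing import Dict


def rouge1_sketch(prediction: str, reference: str) -> Dict[str, float]:
    """Illustrative only: unigram-overlap precision, recall, and F-measure."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # Each shared unigram is counted at most as often as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())

    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    if precision + recall > 0:
        fmeasure = 2 * precision * recall / (precision + recall)
    else:
        fmeasure = 0.0

    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


print(rouge1_sketch("the cat sat", "the cat sat on the mat"))
```

Here every predicted unigram appears in the reference, so precision is 1.0, while only half of the reference unigrams are covered, so recall is 0.5.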

BERTScore

class string2string.similarity.BERTScore(model_name_or_path: str | None = None, lang: str | None = None, num_layers: int | None = None, all_layers: bool = False, use_fast_tokenizer: bool = False, device: str = 'cpu', baseline_path: str | None = None)

This class implements the BERTScore algorithm.

__init__(model_name_or_path: str | None = None, lang: str | None = None, num_layers: int | None = None, all_layers: bool = False, use_fast_tokenizer: bool = False, device: str = 'cpu', baseline_path: str | None = None) → None

This function initializes the BERTScore class, which computes the BERTScore between two texts.

Parameters:

model_name_or_path (str) – BERT model type to use (e.g., bert-base-uncased).
lang (str) – Language of the texts (e.g., en).
num_layers (int) – Number of layers to use.
all_layers (bool) – Whether to use all layers.
use_fast_tokenizer (bool) – Whether to use the fast tokenizer.
device (str) – Device to use (e.g., cpu or cuda).
baseline_path (str) – Path to the baseline file.

Returns:

None

Raises:

ValueError – If model_name_or_path and lang are both None.

Attention

If you use this class, please make sure to cite the following paper:

@inproceedings{bertscore2020,
    title={BERTScore: Evaluating Text Generation with BERT},
    author={Tianyi Zhang* and Varsha Kishore* and Felix Wu* and Kilian Q. Weinberger and Yoav Artzi},
    booktitle={International Conference on Learning Representations},
    year={2020},
    url={https://openreview.net/forum?id=SkeHuCVFDr}
}

Note

If model_name_or_path is not specified, use the default model for the language.
If num_layers is not specified, use the default number of layers.
If device is not specified, use the default shown in the signature (‘cpu’).
If baseline_path is not specified, use the default baseline file.

compute(source_sentences: List[str], target_sentences: List[str] | List[List[str]], batch_size: int = 4, idf: bool = False, nthreads: int = 4, return_hash: bool = False, rescale_with_baseline: bool = False, verbose: bool = False) → dict | str | None

This function scores the source sentences based on their similarity to the target sentences using BERTScore.

Parameters:

source_sentences (list of str) – The candidate sentences.
target_sentences (list of str or list of list of str) – The reference sentences.
batch_size (int) – The batch size used for BERTScore processing.
idf (bool or dict) – Whether to use IDF weighting; can also be a precomputed idf_dict.
nthreads (int) – The number of threads.
return_hash (bool) – Whether to return the hash code of the setting.
rescale_with_baseline (bool) – Whether to rescale BERTScore with the pre-computed baseline.
verbose (bool) – Whether to print intermediate status updates.

Returns:

A dictionary containing the precision, recall, and F1 score, and the hashcode (if return_hash is True). The precision, recall, and F1 score are tensors with one value per source sentence, that is, of shape (len(source_sentences),).

Return type:

(Dict[str, Tensor], Optional[str])

Raises:

ValueError – If the number of source sentences and target sentences do not match.
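
Example (illustrative sketch)

The precision, recall, and F1 values come from greedy matching between token embeddings, as described in the BERTScore paper. The sketch below shows that matching step on hand-written toy vectors. It does not load a model or call the library, it omits IDF weighting and baseline rescaling, and the names cosine and greedy_match_sketch as well as the vectors are made up for illustration.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_sketch(candidate: List[Vector], reference: List[Vector]) -> Dict[str, float]:
    """Illustrative only: greedy-matching precision, recall, and F1."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy two-dimensional "embeddings", made up for this example.
candidate_vectors = [[1.0, 0.0], [0.0, 1.0]]
reference_vectors = [[1.0, 0.0]]
print(greedy_match_sketch(candidate_vectors, reference_vectors))
```

In this toy case the single reference vector is matched perfectly, giving full recall, while the second candidate vector has no similar counterpart, which lowers precision.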

BARTScore

class string2string.similarity.BARTScore(model_name_or_path='facebook/bart-large-cnn', tokenizer_name_or_path: str | None = None, device: str = 'cpu', max_length=1024)

This class implements the BARTScore algorithm.

__init__(model_name_or_path='facebook/bart-large-cnn', tokenizer_name_or_path: str | None = None, device: str = 'cpu', max_length=1024) → None

This function initializes the BARTScore class, which computes the BARTScore between two pieces of text.

Parameters:

model_name_or_path (str) – The name or path of the model. Defaults to ‘facebook/bart-large-cnn’.
tokenizer_name_or_path (str) – The name or path of the tokenizer. Defaults to None.
device (str) – The device to use. Defaults to ‘cpu’.
max_length (int) – The maximum length of the input. Defaults to 1024.

Returns:

None

Raises:

ValueError – If the device is not ‘cpu’ or ‘cuda’.

Attention

If you use this class, please make sure to cite the following paper:

@inproceedings{bartscore2021,
    author = {Yuan, Weizhe and Neubig, Graham and Liu, Pengfei},
    booktitle = {Advances in Neural Information Processing Systems},
    editor = {M. Ranzato and A. Beygelzimer and Y. Dauphin and P.S. Liang and J. Wortman Vaughan},
    pages = {27263--27277},
    publisher = {Curran Associates, Inc.},
    title = {BARTScore: Evaluating Generated Text as Text Generation},
    url = {https://proceedings.neurips.cc/paper/2021/file/e4d2b6e6fdeca3e60e0f1a62fee3d9dd-Paper.pdf},
    volume = {34},
    year = {2021}
}

Note

The default model is the BART-large-cnn model.
If the tokenizer name or path is not specified, then the model name or path will be used.
If the device is ‘cuda’, then the model will be loaded onto the GPU.
If the device is not specified, the default ‘cpu’ is used.

compute(source_sentences: List[str], target_sentences: List[str] | List[List[str]], batch_size: int = 4, agg: str = 'mean') → Dict[str, List[float]]

This function scores the target sentences against the source sentences using BARTScore.

Parameters:

source_sentences (List[str]) – The source sentences.
target_sentences (Union[List[str], List[List[str]]]) – The target sentences.
batch_size (int) – The batch size to use. Defaults to 4.
agg (str) – The aggregation method. Defaults to ‘mean’; used only when target_sentences is a list of lists.

Returns:

The BARTScore for each example.

Return type:

Dict[str, List[float]]

Raises:

ValueError – If the number of source sentences and target sentences do not match.

compute_multi_ref_score(source_sentences: List[str], target_sentences: List[List[str]], batch_size: int = 4, agg: str = 'mean') → Dict[str, List[float]]

Score a batch of examples with multiple references.

Parameters:

source_sentences (List[str]) – The source sentences.
target_sentences (List[List[str]]) – The target sentences.
agg (str) – The aggregation method. Can be “mean” or “max”.
batch_size (int) – The batch size.

Returns:

The BARTScore for each example.

Return type:

Dict[str, List[float]]

Raises:

ValueError – If the number of source sentences and target sentences do not match.
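
Example (illustrative sketch)

The agg parameter decides how the scores obtained against several references are combined into one score per example. The sketch below shows the two documented options on made-up numbers. It does not run a BART model or call the library, and the function name aggregate_sketch and the input layout (one list of per-reference scores per example) are assumptions for illustration.

```python
from typing import List


def aggregate_sketch(per_reference_scores: List[List[float]], agg: str = "mean") -> List[float]:
    """Illustrative only: collapse per-reference scores into one score per example."""
    if agg == "mean":
        # Average over all references of each example.
        return [sum(scores) / len(scores) for scores in per_reference_scores]
    if agg == "max":
        # Keep the score of the best-matching reference of each example.
        return [max(scores) for scores in per_reference_scores]
    raise ValueError(f"Unknown aggregation method: {agg!r}")


# Made-up scores: two references for the first example, one for the second.
toy_scores = [[-1.0, -3.0], [-2.0]]
print(aggregate_sketch(toy_scores, agg="mean"))
print(aggregate_sketch(toy_scores, agg="max"))
```

Using “max” rewards an output that matches any one reference well, whereas “mean” favors outputs that are reasonably close to all references.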

load(weights_path=None) → None

This function loads the model weights from a specified path.

Parameters:

weights_path (str) – The path to the weights.

Returns:

None