Metrics

This page documents the string metrics provided by the library.

Exact Match

class string2string.metrics.ExactMatch[source]
__init__() None[source]

compute(predictions: List[str], references: List[List[str]], lowercase: bool = True) Dict[str, float][source]

This function returns the exact match score between the predictions and the references.

Parameters:
  • predictions (List[str]) – The list of predictions.

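  • references (List[List[str]]) – The list of references.

  • lowercase (bool) – Whether to lowercase the predictions and references before comparing them. Default is True.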

Returns:

The exact match score.

Return type:

Dict[str, float]

Raises:

AssertionError – If the number of predictions does not match the number of references.
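
The behaviour described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it assumes that a prediction counts as a match when it equals any one of its references, and the function name exact_match and the "score" key are hypothetical.

```python
from typing import Dict, List


def exact_match(
    predictions: List[str],
    references: List[List[str]],
    lowercase: bool = True,
) -> Dict[str, float]:
    """Fraction of predictions that equal at least one of their references."""
    # Mirrors the documented AssertionError on a length mismatch.
    assert len(predictions) == len(references), (
        "The number of predictions must match the number of references."
    )
    if not predictions:
        return {"score": 0.0}

    hits = 0
    for prediction, refs in zip(predictions, references):
        if lowercase:
            prediction = prediction.lower()
            refs = [ref.lower() for ref in refs]
        # Assumption: matching any single reference is enough.
        hits += prediction in refs
    return {"score": hits / len(predictions)}


result = exact_match(
    ["Hello world", "goodbye"],
    [["hello world", "hi world"], ["farewell"]],
)
case_sensitive = exact_match(
    ["Hello world", "goodbye"],
    [["hello world", "hi world"], ["farewell"]],
    lowercase=False,
)
```

Note that references holds one inner list per prediction, so both outer lists must have the same length. With the library installed, the corresponding call would be ExactMatch().compute(predictions, references).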

sacreBLEU (sBLEU)

class string2string.metrics.sacreBLEU[source]

This class contains the sacreBLEU metric.

__init__() None[source]

Initializes the sacreBLEU class.

compute(predictions: List[str], references: List[List[str]], smooth_method: str = 'exp', smooth_value: float | None = None, lowercase: bool = False, tokenizer_name: str | None = 'none', use_effective_order: bool = False, return_only: List[str] = ['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len'])[source]

Returns the BLEU score between a list of predictions and a list of reference lists (one list of references per prediction).

Parameters:
  • predictions (List[str]) – The predictions.

  • references (List[List[str]]) – The references (or ground truth strings).

  • smooth_method (str) – The smoothing method. Default is “exp”. Other options are “floor”, “add-k” and “none”.

  • smooth_value (Optional[float]) – The smoothing value for floor and add-k smoothing. Default is None.

  • lowercase (bool) – Whether to lowercase the text. Default is False.

  • tokenizer_name (str) – The tokenizer name. Default is “none”. Other options are “zh”, “13a”, “intl”, “char”, “ja-mecab”, “ko-mecab”, “spm”, “flores101” and “flores200”.

  • use_effective_order (bool) – Whether to use the effective order. Default is False.

  • return_only (Optional[List[str]]) – The list of BLEU score components to return. Default is [‘score’, ‘counts’, ‘totals’, ‘precisions’, ‘bp’, ‘sys_len’, ‘ref_len’].

Returns:

A dictionary containing the BLEU score components requested in return_only.

Return type:

Dict[str, float]

Raises:
  • ValueError – If the number of predictions does not match the number of references.

  • ValueError – If the tokenizer name is invalid.
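
To make the components named in return_only more concrete, here is a minimal sketch of textbook corpus-level BLEU in plain Python. It is not sacreBLEU's code: it assumes whitespace tokenization, no smoothing, and fractions in [0, 1] for the precisions and the score; the function names ngram_counts and bleu_components are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_components(
    predictions: List[str],
    references: List[List[str]],
    max_order: int = 4,
) -> Dict[str, object]:
    if len(predictions) != len(references):
        raise ValueError("Each prediction needs its own list of references.")

    counts = [0] * max_order  # clipped n-gram matches per order
    totals = [0] * max_order  # n-grams in the predictions per order
    sys_len = 0
    ref_len = 0
    for prediction, refs in zip(predictions, references):
        hyp = prediction.split()
        ref_tokens = [ref.split() for ref in refs]
        sys_len += len(hyp)
        # Closest reference length; ties go to the shorter reference.
        ref_len += min(
            (len(ref) for ref in ref_tokens),
            key=lambda length: (abs(length - len(hyp)), length),
        )
        for n in range(1, max_order + 1):
            hyp_counts = ngram_counts(hyp, n)
            max_ref_counts = Counter()
            for ref in ref_tokens:
                max_ref_counts |= ngram_counts(ref, n)  # per-n-gram maximum
            counts[n - 1] += sum((hyp_counts & max_ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    precisions = [c / t if t else 0.0 for c, t in zip(counts, totals)]
    if sys_len == 0:
        bp = 0.0
    elif sys_len >= ref_len:
        bp = 1.0
    else:
        bp = math.exp(1.0 - ref_len / sys_len)  # brevity penalty
    if min(precisions) > 0:
        score = bp * math.exp(sum(math.log(p) for p in precisions) / max_order)
    else:
        score = 0.0  # without smoothing, one zero precision zeroes the score
    return {
        "score": score, "counts": counts, "totals": totals,
        "precisions": precisions, "bp": bp,
        "sys_len": sys_len, "ref_len": ref_len,
    }


perfect = bleu_components(
    ["the cat sat on the mat"],
    [["the cat sat on the mat", "a cat was on the mat"]],
)
short = bleu_components(["the cat"], [["the cat sat"]])
```

The second call shows why the smooth_method parameter matters: a two-token prediction has no 3-grams or 4-grams, so the unsmoothed score collapses to zero even though every unigram and bigram matches.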

ROUGE

class string2string.metrics.ROUGE(tokenizer: Tokenizer | None = None)[source]

This class is a wrapper for the ROUGE metric from Google Research’s rouge_score package.

__init__(tokenizer: Tokenizer | None = None) None[source]

This function initializes the ROUGE class, which is a wrapper for the ROUGE metric from Google Research’s rouge_score package.

Parameters:

tokenizer (Tokenizer) – The tokenizer to use. Default is None.

Returns:

None

compute(predictions: List[str], references: List[List[str]], rouge_types: str | List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], use_stemmer: bool = False, interval_name: str = 'mid', score_type: str = 'fmeasure') Dict[str, float][source]

This function returns the ROUGE score between a list of predictions and a list of reference lists (one list of references per prediction).

Parameters:
  • predictions (List[str]) – The predictions.

  • references (List[List[str]]) – The references (or ground truth strings).

  • rouge_types (Union[str, List[str]]) – The ROUGE types to use. Default is [“rouge1”, “rouge2”, “rougeL”, “rougeLsum”].

  • use_stemmer (bool) – Whether to use a stemmer. Default is False.

  • interval_name (str) – The interval name. Default is “mid”.

  • score_type (str) – The score type. Default is “fmeasure”.

Returns:

The ROUGE score (between 0 and 1).

Return type:

Dict[str, float]

Raises:
  • ValueError – If the number of predictions does not match the number of references.

  • ValueError – If the interval name, score type or ROUGE type is invalid.

  • ValueError – If the prediction or reference is invalid.

Note

  • The ROUGE score is computed using the ROUGE metric from Google Research’s rouge_score package.

  • By default, BootstrapAggregator is used to aggregate the scores.

  • By default, the interval name is “mid” and the score type is “fmeasure”.
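
As an illustration of what the score_type values select, the following sketch computes ROUGE-N precision, recall, and F-measure for a single prediction and reference. It is a simplified stand-in, not the rouge_score implementation: it assumes lowercased whitespace tokens, applies no stemming, and performs no bootstrap aggregation. The function name rouge_n is hypothetical.

```python
from collections import Counter
from typing import Dict


def rouge_n(prediction: str, reference: str, n: int = 1) -> Dict[str, float]:
    """N-gram overlap between one prediction and one reference."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred_counts = ngrams(prediction)
    ref_counts = ngrams(reference)
    # Each n-gram counts at most as often as it occurs in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    pred_total = sum(pred_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall > 0:
        fmeasure = 2 * precision * recall / (precision + recall)
    else:
        fmeasure = 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


scores = rouge_n("the cat sat", "the cat sat on the mat")
bigram_scores = rouge_n("the cat sat", "the cat sat on the mat", n=2)
```

Here every unigram of the prediction appears in the reference, so precision is 1.0, while only half of the reference is covered, so recall is 0.5. The default score type "fmeasure" balances the two.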

BERTScore

class string2string.similarity.BERTScore(model_name_or_path: str | None = None, lang: str | None = None, num_layers: int | None = None, all_layers: bool = False, use_fast_tokenizer: bool = False, device: str = 'cpu', baseline_path: str | None = None)[source]

This class implements the BERTScore algorithm.

__init__(model_name_or_path: str | None = None, lang: str | None = None, num_layers: int | None = None, all_layers: bool = False, use_fast_tokenizer: bool = False, device: str = 'cpu', baseline_path: str | None = None) None[source]

This function initializes the BERTScore class, which computes the BERTScore between two texts.

Parameters:
  • model_name_or_path (str) – BERT model type to use (e.g., bert-base-uncased).

  • lang (str) – Language of the texts (e.g., en).

  • num_layers (int) – Number of layers to use.

  • all_layers (bool) – Whether to use all layers.

  • use_fast_tokenizer (bool) – Whether to use the fast tokenizer.

  • device (str) – Device to use (e.g., cpu or cuda).

  • baseline_path (str) – Path to the baseline file.

Returns:

None

Raises:

ValueError – If model_name_or_path and lang are both None.

Attention

If you use this class, please make sure to cite the following paper:

@inproceedings{bertscore2020,
    title={BERTScore: Evaluating Text Generation with BERT},
    author={Tianyi Zhang* and Varsha Kishore* and Felix Wu* and Kilian Q. Weinberger and Yoav Artzi},
    booktitle={International Conference on Learning Representations},
    year={2020},
    url={https://openreview.net/forum?id=SkeHuCVFDr}
}

Note

  • If model_name_or_path is not specified, use the default model for the language.

  • If num_layers is not specified, use the default number of layers.

  • If device is not specified, use the GPU if available, otherwise use the CPU.

  • If baseline_path is not specified, use the default baseline file.

compute(source_sentences: List[str], target_sentences: List[str] | List[List[str]], batch_size: int = 4, idf: bool = False, nthreads: int = 4, return_hash: bool = False, rescale_with_baseline: bool = False, verbose: bool = False) dict | str | None[source]

This function scores the source sentences based on their similarity to the target sentences using BERTScore.

Parameters:
  • source_sentences (list of str) – Candidate sentences.

  • target_sentences (list of str or list of list of str) – Reference sentences.

  • batch_size (int) – Batch size for BERTScore processing.

  • idf (bool or dict) – Whether to use IDF weighting; can also be a precomputed idf_dict.

  • nthreads (int) – Number of threads.

  • return_hash (bool) – Whether to return the hash code of the setting.

  • rescale_with_baseline (bool) – Whether to rescale BERTScore with the precomputed baseline.

  • verbose (bool) – Whether to print intermediate status updates.

Returns:

A dictionary containing the precision, recall, and F1 score, plus the hash code if return_hash is True. The precision, recall, and F1 score are tensors of shape (len(source_sentences),).

Return type:

(Dict[str, Tensor], Optional[str])

Raises:

ValueError – If the number of source sentences and target sentences do not match.
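
The precision, recall, and F1 values come from greedy matching of token embeddings, as described in the BERTScore paper cited above. The sketch below shows that matching step on hand-written toy vectors instead of BERT embeddings, without IDF weighting or baseline rescaling. The names cosine and greedy_match_scores are hypothetical and not part of the library.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(
    candidate: List[Vector], reference: List[Vector]
) -> Dict[str, float]:
    """Greedy-matching precision, recall, and F1 over token vectors."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(
        max(cosine(c, r) for r in reference) for c in candidate
    ) / len(candidate)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate) for r in reference
    ) / len(reference)
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: one candidate token, two reference tokens.
toy = greedy_match_scores(
    candidate=[[1.0, 0.0]],
    reference=[[1.0, 0.0], [0.0, 1.0]],
)
```

The single candidate token is fully covered by the reference, so precision is 1.0, but the second reference token has no counterpart in the candidate, which halves the recall.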

BARTScore

class string2string.similarity.BARTScore(model_name_or_path='facebook/bart-large-cnn', tokenizer_name_or_path: str | None = None, device: str = 'cpu', max_length=1024)[source]

This class implements the BARTScore algorithm.

__init__(model_name_or_path='facebook/bart-large-cnn', tokenizer_name_or_path: str | None = None, device: str = 'cpu', max_length=1024) None[source]

This function initializes the BARTScore class, which computes the BARTScore between two pieces of text.

Parameters:
  • model_name_or_path (str) – The name or path of the model. Defaults to ‘facebook/bart-large-cnn’.

  • tokenizer_name_or_path (str) – The name or path of the tokenizer. Defaults to None.

  • device (str) – The device to use. Defaults to ‘cpu’.

  • max_length (int) – The maximum length of the input. Defaults to 1024.

Returns:

None

Raises:

ValueError – If the device is not ‘cpu’ or ‘cuda’.

Attention

If you use this class, please make sure to cite the following paper:

@inproceedings{bartscore2021,
    author = {Yuan, Weizhe and Neubig, Graham and Liu, Pengfei},
    booktitle = {Advances in Neural Information Processing Systems},
    editor = {M. Ranzato and A. Beygelzimer and Y. Dauphin and P.S. Liang and J. Wortman Vaughan},
    pages = {27263--27277},
    publisher = {Curran Associates, Inc.},
    title = {BARTScore: Evaluating Generated Text as Text Generation},
    url = {https://proceedings.neurips.cc/paper/2021/file/e4d2b6e6fdeca3e60e0f1a62fee3d9dd-Paper.pdf},
    volume = {34},
    year = {2021}
}

Note

  • The default model is the BART-large-cnn model.

  • If the tokenizer name or path is not specified, then the model name or path will be used.

  • If the device is ‘cuda’, then the model will be loaded onto the GPU.

  • If device is not specified, use the GPU if available, otherwise use the CPU.

compute(source_sentences: List[str], target_sentences: List[str] | List[List[str]], batch_size: int = 4, agg: str = 'mean') Dict[str, List[float]][source]

This function scores the target sentences against the source sentences using BARTScore.

Parameters:
  • source_sentences (List[str]) – The source sentences.

  • target_sentences (Union[List[str], List[List[str]]]) – The target sentences.

  • batch_size (int) – The batch size to use. Defaults to 4.

  • agg (str) – The aggregation method. Defaults to ‘mean’; used only when target_sentences is a list of lists.

Returns:

The BARTScore for each example.

Return type:

Dict[str, List[float]]

Raises:

ValueError – If the number of source sentences and target sentences do not match.

compute_multi_ref_score(source_sentences: List[str], target_sentences: List[List[str]], batch_size: int = 4, agg: str = 'mean') Dict[str, List[float]][source]

Score a batch of examples with multiple references.

Parameters:
  • source_sentences (List[str]) – The source sentences.

  • target_sentences (List[List[str]]) – The target sentences.

  • agg (str) – The aggregation method. Can be “mean” or “max”.

  • batch_size (int) – The batch size.

Returns:

The BARTScore for each example.

Return type:

Dict[str, List[float]]

Raises:

ValueError – If the number of source sentences and target sentences do not match.
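
The effect of the agg parameter can be sketched as follows. This illustration assumes that each example already has one score per reference (the values below are made-up log-likelihood-style numbers, not model output) and reduces them to a single score per example. The function name aggregate_reference_scores is hypothetical.

```python
from typing import List


def aggregate_reference_scores(
    per_reference_scores: List[List[float]], agg: str = "mean"
) -> List[float]:
    """Reduce the per-reference scores of each example to one score."""
    if agg == "mean":
        return [sum(scores) / len(scores) for scores in per_reference_scores]
    if agg == "max":
        return [max(scores) for scores in per_reference_scores]
    raise ValueError(f"Unknown aggregation method: {agg!r}")


# Two examples: the first has two references, the second has one.
example_scores = [[-1.0, -3.0], [-2.0]]
mean_scores = aggregate_reference_scores(example_scores, agg="mean")
max_scores = aggregate_reference_scores(example_scores, agg="max")
```

With "max", an example is credited with its best-matching reference, while "mean" rewards agreement with all of them.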

load(weights_path=None) None[source]

This function loads the model weights from a specified path.

Parameters:

weights_path (str) – The path to the weights.

Returns:

None