Word and Sentence Embeddings

This page documents the GloVe and fastText word embedding classes, as well as the language model embedding class.

GloVe Word Embeddings

class string2string.misc.word_embeddings.GloVeEmbeddings(model: str = 'glove.6B.200D', dim: int = 50, force_download: bool = False, dir=None, tokenizer: Tokenizer | None = None)

This class implements the GloVe word embeddings.

__call__(tokens: List[str] | str) → Tensor

This function returns the embeddings of the given tokens.

Parameters:

tokens (Union[List[str], str]) – The tokens to embed.

Returns:

The embeddings of the given tokens.

Return type:

Tensor

__init__(model: str = 'glove.6B.200D', dim: int = 50, force_download: bool = False, dir=None, tokenizer: Tokenizer | None = None) → None

This function initializes the GloVe embeddings class.

Parameters:
  • model (str) – The model to use. Default is ‘glove.6B.200D’. (Options are: ‘glove.6B.200D’, ‘glove.twitter.27B’, ‘glove.42B.300d’, ‘glove.840B.300d’.)

  • dim (int) – The dimension of the embeddings. Default is 50.

  • force_download (bool) – Whether to force download the model. Default is False.

  • dir (str) – The directory to save or load the model. Default is None.

  • tokenizer (Tokenizer) – The tokenizer to use. Default is None.

Returns:

None

Raises:

ValueError – If the model is not in MODEL_OPTIONS [‘glove.6B.200D’, ‘glove.twitter.27B’, ‘glove.42B.300d’, ‘glove.840B.300d’].
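
The check behind this ValueError can be sketched as follows. This is a minimal illustration, not the library's code: the list mirrors the options named above, and `check_glove_model` is a hypothetical helper introduced only for this example.

```python
# Option names copied from the documentation above. Defining the list here is
# an assumption made for illustration; it is not imported from string2string.
MODEL_OPTIONS = [
    "glove.6B.200D",
    "glove.twitter.27B",
    "glove.42B.300d",
    "glove.840B.300d",
]


def check_glove_model(model: str) -> str:
    """Return `model` unchanged if it is a documented option, else raise ValueError."""
    if model not in MODEL_OPTIONS:
        raise ValueError(
            f"Unknown GloVe model {model!r}; expected one of {MODEL_OPTIONS}"
        )
    return model
```

Validating the name before any download starts means a typo fails immediately instead of after a long transfer.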

Attention

If you use this class, please make sure to cite the following paper:

@inproceedings{pennington2014glove,
    title={Glove: Global vectors for word representation},
    author={Pennington, Jeffrey and Socher, Richard and Manning, Christopher D},
    booktitle={Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)},
    pages={1532--1543},
    year={2014}
}

Note

  • If dir is None, the model will be saved in the torch hub directory.

  • If the model is not downloaded, it will be downloaded automatically.

get_embedding(tokens: List[str] | str) → Tensor

This function returns the embeddings of the given tokens.

Parameters:

tokens (Union[List[str], str]) – The tokens to embed.

Returns:

The embeddings of the given tokens.

Return type:

Tensor
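
A minimal usage sketch for the class above, assuming the string2string package is installed and the GloVe vectors can be downloaded. The helper name `embed_with_glove` and its `embedder` parameter are hypothetical; only the constructor defaults and `get_embedding` come from the signatures documented here.

```python
from typing import List, Optional, Union


def embed_with_glove(tokens: Union[List[str], str], embedder: Optional[object] = None):
    """Return GloVe embeddings for `tokens` via the documented get_embedding method."""
    if embedder is None:
        # Imported lazily: constructing the class may download the vector files.
        from string2string.misc.word_embeddings import GloVeEmbeddings

        # No arguments, so the documented defaults apply
        # (model='glove.6B.200D', dim=50, force_download=False, dir=None).
        embedder = GloVeEmbeddings()
    return embedder.get_embedding(tokens)


# Example call (needs the package and network access on first use):
#     vectors = embed_with_glove(["hello", "world"])
```

Passing an already constructed embedder avoids reloading the vectors on every call.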

fastText Word Embeddings

class string2string.misc.word_embeddings.FastTextEmbeddings(model: str = 'cc.en.300.bin', force_download: bool = True, dir: str | None = None)

This class implements the fastText word embeddings.

__call__(tokens: List[str] | str) → Tensor

This function returns the embeddings of the given tokens.

Parameters:

tokens (Union[List[str], str]) – The tokens to embed.

Returns:

The embeddings of the given tokens.

Return type:

Tensor

__init__(model: str = 'cc.en.300.bin', force_download: bool = True, dir: str | None = None) → None

This function initializes the FastTextEmbeddings class.

Parameters:
  • model (str) –

    The model to use. Some of the available models are:

    • ’cc.en.300.bin’: The English model trained on Common Crawl (300 dimensions)

    • ’cc.hi.300.bin’: The Hindi model trained on Common Crawl (300 dimensions)

    • ’cc.fr.300.bin’: The French model trained on Common Crawl (300 dimensions)

    • ’cc.yi.300.bin’: The Yiddish model trained on Common Crawl (300 dimensions)

    • ’wiki.en’: The English model trained on Wikipedia (300 dimensions)

    • ’wiki.simple’: The Simple English model trained on Wikipedia (300 dimensions)

    • ’wiki.ar’: The Arabic model trained on Wikipedia (300 dimensions)

    • ’wiki.bg’: The Bulgarian model trained on Wikipedia (300 dimensions)

    • ’wiki.ca’: The Catalan model trained on Wikipedia (300 dimensions)

    • ’wiki.zh’: The Chinese model trained on Wikipedia (300 dimensions)

    • ’wiki.sw’: The Swahili model trained on Wikipedia (300 dimensions)

    • ’wiki.fr’: The French model trained on Wikipedia (300 dimensions)

    • ’wiki.de’: The German model trained on Wikipedia (300 dimensions)

    • ’wiki.es’: The Spanish model trained on Wikipedia (300 dimensions)

    • ’wiki.it’: The Italian model trained on Wikipedia (300 dimensions)

    • ’wiki.pt’: The Portuguese model trained on Wikipedia (300 dimensions)

    • ’wiki.ru’: The Russian model trained on Wikipedia (300 dimensions)

    • ’wiki.tr’: The Turkish model trained on Wikipedia (300 dimensions)

    • ’wiki.uk’: The Ukrainian model trained on Wikipedia (300 dimensions)

    • ’wiki.vi’: The Vietnamese model trained on Wikipedia (300 dimensions)

    • ’wiki.id’: The Indonesian model trained on Wikipedia (300 dimensions)

    • ’wiki.ja’: The Japanese model trained on Wikipedia (300 dimensions)

  • force_download (bool) – Whether to force the download of the model. Default: True.

  • dir (str) – The directory to save and load the model.

Returns:

None

Raises:

ValueError – If the given model is not available.
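
The model names listed above follow two visible patterns: `cc.<lang>.300.bin` for Common Crawl models and `wiki.<lang>` for Wikipedia models. The sketch below splits a name into its corpus and language code. `parse_fasttext_model_name` is a hypothetical helper for illustration, not part of string2string, and it only handles these two patterns.

```python
from typing import Tuple


def parse_fasttext_model_name(model: str) -> Tuple[str, str]:
    """Split a name such as 'cc.hi.300.bin' or 'wiki.simple' into (corpus, language)."""
    parts = model.split(".")
    # Common Crawl pattern: cc.<lang>.300.bin
    if len(parts) == 4 and parts[0] == "cc" and parts[2] == "300" and parts[3] == "bin":
        return "Common Crawl", parts[1]
    # Wikipedia pattern: wiki.<lang>
    if len(parts) == 2 and parts[0] == "wiki":
        return "Wikipedia", parts[1]
    raise ValueError(f"Unrecognized fastText model name: {model!r}")
```

A name that matches one of these patterns is not guaranteed to be available; the class itself decides which models it accepts.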

Attention

If you make use of this code, please cite the following papers (depending on the model you use):

@inproceedings{mikolov2018advances,
    title={Advances in Pre-Training Distributed Word Representations},
    author={Mikolov, Tomas and Grave, Edouard and Bojanowski, Piotr and Puhrsch, Christian and Joulin, Armand},
    booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
    year={2018}
}
@article{bojanowski2017enriching,
    title={Enriching Word Vectors with Subword Information},
    author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
    journal={Transactions of the Association for Computational Linguistics},
    volume={5},
    year={2017},
    issn={2307-387X},
    pages={135--146}
}
@article{joulin2016fasttext,
    title={FastText.zip: Compressing text classification models},
    author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
    journal={arXiv preprint arXiv:1612.03651},
    year={2016}
}

get_embedding(tokens: List[str] | str) → Tensor

This function returns the embeddings of the given tokens.

Parameters:

tokens (Union[List[str], str]) – The tokens to embed.

Returns:

The embeddings of the given tokens.

Return type:

Tensor

Language Model Embeddings

class string2string.misc.model_embeddings.ModelEmbeddings(model_name_or_path: str = 'facebook/bart-large', tokenizer_name_or_path: str | None = None, device: str = 'cpu')

This class is an abstract class for neural word embeddings.

__init__(model_name_or_path: str = 'facebook/bart-large', tokenizer_name_or_path: str | None = None, device: str = 'cpu') → None

Constructor.

Parameters:
  • model_name_or_path (str) – The name or path of the model to use (default: ‘facebook/bart-large’).

  • tokenizer_name_or_path (str) – The name or path of the tokenizer to use (if None, the model name or path is used).

  • device (str) – The device to use (default: ‘cpu’).

Returns:

None

Raises:

ValueError – If the model name or path is invalid.

get_embeddings(text: str | List[str], embedding_type: str = 'last_hidden_state') → Tensor

Returns the embeddings of the input text.

Parameters:
  • text (Union[str, List[str]]) – The input text.

  • embedding_type (str, optional) – The type of embedding to use. Defaults to ‘last_hidden_state’.

Returns:

The embeddings.

Return type:

torch.Tensor

Raises:

ValueError – If the embedding type is invalid.
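
A minimal usage sketch for `get_embeddings`, assuming the string2string package is installed and the model weights can be loaded. The helper name `embed_text` and its `embedder` parameter are hypothetical; the constructor arguments and the `embedding_type` default are taken from the signatures documented here.

```python
from typing import List, Optional, Union


def embed_text(
    text: Union[str, List[str]],
    embedder: Optional[object] = None,
    embedding_type: str = "last_hidden_state",
):
    """Return language model embeddings for `text` via the documented get_embeddings method."""
    if embedder is None:
        # Imported lazily: loading 'facebook/bart-large' is slow and may download weights.
        from string2string.misc.model_embeddings import ModelEmbeddings

        embedder = ModelEmbeddings(model_name_or_path="facebook/bart-large", device="cpu")
    # An unsupported embedding_type is documented to raise ValueError.
    return embedder.get_embeddings(text, embedding_type=embedding_type)


# Example call (needs the package and the model weights):
#     embeddings = embed_text(["The cat sat.", "A dog barked."])
```

Constructing the embedder once and passing it in keeps the model in memory across calls.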

get_last_hidden_state(embeddings: Tensor) → Tensor

Returns the last hidden state (e.g., the [CLS] token’s representation) of the input embeddings.

Parameters:

embeddings (torch.Tensor) – The input embeddings.

Returns:

The last hidden state.

Return type:

torch.Tensor

get_mean_pooling(embeddings: Tensor) → Tensor

Returns the mean pooling of the input embeddings.

Parameters:

embeddings (torch.Tensor) – The input embeddings.

Returns:

The mean pooling.

Return type:

torch.Tensor
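
Mean pooling collapses a sequence of token vectors into one vector by averaging each dimension over the tokens. The sketch below shows the arithmetic on plain Python lists; it assumes pooling runs over the token axis and does not reproduce the library's tensor-based implementation. `mean_pool` is a hypothetical name.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average a list of equal-length token vectors, dimension by dimension."""
    if not token_vectors:
        raise ValueError("Need at least one token vector")
    n_tokens = len(token_vectors)
    # zip(*...) groups the values of each dimension across all tokens.
    return [sum(values) / n_tokens for values in zip(*token_vectors)]
```

For example, pooling the two token vectors [1.0, 2.0] and [3.0, 4.0] gives [2.0, 3.0].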