Word and Sentence Embeddings
This page documents the GloVe and fastText word embeddings, as well as the language model embeddings.
GloVe Word Embeddings
- class string2string.misc.word_embeddings.GloVeEmbeddings(model: str = 'glove.6B.200D', dim: int = 50, force_download: bool = False, dir=None, tokenizer: Tokenizer | None = None)
This class implements the GloVe word embeddings.
- __call__(tokens: List[str] | str) → Tensor
This function returns the embeddings of the given tokens.
- Parameters:
tokens (Union[List[str], str]) – The tokens to embed.
- Returns:
The embeddings of the given tokens.
- Return type:
Tensor
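The call contract above (a single string or a list of tokens in, one vector per token out) can be illustrated with a toy lookup table. This is only a sketch and not the library's implementation: the vocabulary, the three-dimensional vectors, the whitespace split for plain strings, and the zero vector for unknown tokens are all assumptions made for illustration, and nested lists stand in for the returned Tensor.

```python
from typing import Dict, List, Union

# Hypothetical toy vocabulary with made-up 3-dimensional vectors.
TOY_VECTORS: Dict[str, List[float]] = {
    "hello": [0.1, 0.2, 0.3],
    "world": [0.4, 0.5, 0.6],
}

# Assumption: tokens missing from the vocabulary map to a zero vector.
UNKNOWN: List[float] = [0.0, 0.0, 0.0]


def embed(tokens: Union[List[str], str]) -> List[List[float]]:
    """Return one vector per token.

    A plain string is split on whitespace first (an assumption; the real
    class accepts a tokenizer argument for this step).
    """
    if isinstance(tokens, str):
        tokens = tokens.split()
    return [TOY_VECTORS.get(token, UNKNOWN) for token in tokens]


if __name__ == "__main__":
    print(embed("hello world"))
    print(embed(["hello", "unseen"]))
```

Both input forms yield a result with one row per token, which is the shape to expect from the embedding classes on this page.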
- __init__(model: str = 'glove.6B.200D', dim: int = 50, force_download: bool = False, dir=None, tokenizer: Tokenizer | None = None) → None
This function initializes the GloVe embeddings class.
- Parameters:
model (str) – The model to use. Default is ‘glove.6B.200D’. (Options are: ‘glove.6B.200D’, ‘glove.twitter.27B’, ‘glove.42B.300d’, ‘glove.840B.300d’.)
dim (int) – The dimension of the embeddings. Default is 50.
force_download (bool) – Whether to force download the model. Default is False.
dir (str) – The directory to save or load the model. Default is None.
tokenizer (Tokenizer) – The tokenizer to use. Default is None.
- Returns:
None
- Raises:
ValueError – If the model is not in the MODEL_OPTIONS [‘glove.6B.200D’, ‘glove.twitter.27B’, ‘glove.42B.300d’, ‘glove.840B.300d’].
Attention
If you use this class, please make sure to cite the following paper:
@inproceedings{pennington2014glove, title={Glove: Global vectors for word representation}, author={Pennington, Jeffrey and Socher, Richard and Manning, Christopher D}, booktitle={Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)}, pages={1532--1543}, year={2014} }
Note
If dir is None, the model will be saved in the torch hub directory.
If the model is not downloaded, it will be downloaded automatically.
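The model check and the directory fallback described above can be sketched as follows. This is a hedged illustration rather than the class's actual code: check_model and resolve_dir are hypothetical helper names, and the fallback path reproduces PyTorch's default hub location (TORCH_HOME, else XDG_CACHE_HOME/torch, else ~/.cache/torch, plus a hub subdirectory) using only the standard library. With PyTorch installed, torch.hub.get_dir() returns the same location unless it has been overridden.

```python
import os
from typing import Optional

# Model names as listed in the parameter description above.
MODEL_OPTIONS = (
    "glove.6B.200D",
    "glove.twitter.27B",
    "glove.42B.300d",
    "glove.840B.300d",
)


def check_model(model: str) -> str:
    """Hypothetical helper: reject model names that are not in MODEL_OPTIONS."""
    if model not in MODEL_OPTIONS:
        raise ValueError(f"Unknown model {model!r}; expected one of {MODEL_OPTIONS}")
    return model


def resolve_dir(dir: Optional[str] = None) -> str:
    """Hypothetical helper: use dir if given, else the default torch hub directory."""
    if dir is not None:
        return dir
    cache_home = os.getenv("XDG_CACHE_HOME", "~/.cache")
    torch_home = os.getenv("TORCH_HOME", os.path.join(cache_home, "torch"))
    return os.path.join(os.path.expanduser(torch_home), "hub")


if __name__ == "__main__":
    print(check_model("glove.6B.200D"))
    print(resolve_dir())
```

Passing an explicit dir is useful when the default cache location is not writable or when several projects should share one copy of the vectors.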
fastText Word Embeddings
- class string2string.misc.word_embeddings.FastTextEmbeddings(model: str = 'cc.en.300.bin', force_download: bool = True, dir: str | None = None)
This class implements the FastText embeddings.
- __call__(tokens: List[str] | str) → Tensor
This function returns the embeddings of the given tokens.
- Parameters:
tokens (Union[List[str], str]) – The tokens to embed.
- Returns:
The embeddings of the given tokens.
- Return type:
Tensor
- __init__(model: str = 'cc.en.300.bin', force_download: bool = True, dir: str | None = None) → None
This function initializes the FastTextEmbeddings class.
- Parameters:
model (str) –
The model to use. Some of the available models are:
’cc.en.300.bin’: The English model trained on Common Crawl (300 dimensions)
’cc.hi.300.bin’: The Hindi model trained on Common Crawl (300 dimensions)
’cc.fr.300.bin’: The French model trained on Common Crawl (300 dimensions)
’cc.yi.300.bin’: The Yiddish model trained on Common Crawl (300 dimensions)
…
’wiki.en’: The English model trained on Wikipedia (300 dimensions)
’wiki.simple’: The Simple English model trained on Wikipedia (300 dimensions)
’wiki.ar’: The Arabic model trained on Wikipedia (300 dimensions)
’wiki.bg’: The Bulgarian model trained on Wikipedia (300 dimensions)
’wiki.ca’: The Catalan model trained on Wikipedia (300 dimensions)
’wiki.zh’: The Chinese model trained on Wikipedia (300 dimensions)
’wiki.sw’: The Swahili model trained on Wikipedia (300 dimensions)
’wiki.fr’: The French model trained on Wikipedia (300 dimensions)
’wiki.de’: The German model trained on Wikipedia (300 dimensions)
’wiki.es’: The Spanish model trained on Wikipedia (300 dimensions)
’wiki.it’: The Italian model trained on Wikipedia (300 dimensions)
’wiki.pt’: The Portuguese model trained on Wikipedia (300 dimensions)
’wiki.ru’: The Russian model trained on Wikipedia (300 dimensions)
’wiki.tr’: The Turkish model trained on Wikipedia (300 dimensions)
’wiki.uk’: The Ukrainian model trained on Wikipedia (300 dimensions)
’wiki.vi’: The Vietnamese model trained on Wikipedia (300 dimensions)
’wiki.id’: The Indonesian model trained on Wikipedia (300 dimensions)
’wiki.ja’: The Japanese model trained on Wikipedia (300 dimensions)
…
force_download (bool) – Whether to force the download of the model. Default is True.
dir (str) – The directory to save and load the model. Default is None.
- Returns:
None
- Raises:
ValueError – If the given model is not available.
Attention
If you make use of this code, please cite the following papers (depending on the model you use):
@inproceedings{mikolov2018advances, title={Advances in Pre-Training Distributed Word Representations}, author={Mikolov, Tomas and Grave, Edouard and Bojanowski, Piotr and Puhrsch, Christian and Joulin, Armand}, booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)}, year={2018} }
@article{bojanowski2017enriching, title={Enriching Word Vectors with Subword Information}, author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas}, journal={Transactions of the Association for Computational Linguistics}, volume={5}, year={2017}, issn={2307-387X}, pages={135--146} }
@article{joulin2016fasttext, title={FastText.zip: Compressing text classification models}, author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas}, journal={arXiv preprint arXiv:1612.03651}, year={2016} }
Note
The models are downloaded from https://fasttext.cc/docs/en/english-vectors.html.
The models are saved in the torch hub directory, if no directory is specified.
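The interaction between a cached model file and force_download can be sketched as below. The helper name needs_download is hypothetical, and the rule it encodes (download when forced or when no local copy exists) is an assumption about the intended behaviour, not a statement of what the class does internally.

```python
import os


def needs_download(model_path: str, force_download: bool) -> bool:
    """Hypothetical helper: decide whether the model file must be fetched.

    Assumed rule: download when forced, or when nothing exists at model_path.
    """
    return force_download or not os.path.exists(model_path)


if __name__ == "__main__":
    # os.curdir always exists, so it stands in for an already cached model here.
    print(needs_download(os.curdir, force_download=False))
    print(needs_download(os.curdir, force_download=True))
```

Since the signature above defaults force_download to True, passing force_download=False is worth considering when the model file is already present, as the pretrained fastText binaries are large.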
Language Model Embeddings
- class string2string.misc.model_embeddings.ModelEmbeddings(model_name_or_path: str = 'facebook/bart-large', tokenizer_name_or_path: str | None = None, device: str = 'cpu')
This class is an abstract class for neural word embeddings.
- __init__(model_name_or_path: str = 'facebook/bart-large', tokenizer_name_or_path: str | None = None, device: str = 'cpu') → None
Constructor.
- Parameters:
model_name_or_path (str) – The name or path of the model to use (default: ‘facebook/bart-large’).
tokenizer_name_or_path (str, optional) – The name or path of the tokenizer to use (if None, the model name or path is used).
device (str) – The device to use (default: ‘cpu’).
- Returns:
None
- Raises:
ValueError – If the model name or path is invalid.
- get_embeddings(text: str | List[str], embedding_type: str = 'last_hidden_state') → Tensor
Returns the embeddings of the input text.
- Parameters:
text (Union[str, List[str]]) – The input text.
embedding_type (str, optional) – The type of embedding to use. Defaults to ‘last_hidden_state’.
- Returns:
The embeddings.
- Return type:
torch.Tensor
- Raises:
ValueError – If the embedding type is invalid.
Returns the last hidden state (e.g., [CLS] token’s) of the input embeddings.
- Parameters:
embeddings (torch.Tensor) – The input embeddings.
- Returns:
The last hidden state.
- Return type:
torch.Tensor
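One way to read the "[CLS] token" remark above is that a single vector per input is taken from the first token position of the model's final-layer output. The sketch below shows that selection on nested lists standing in for a tensor of shape (batch, tokens, hidden). The function name first_token_vectors and the choice of position 0 are assumptions for illustration, not the method's confirmed behaviour.

```python
from typing import List

Vector = List[float]


def first_token_vectors(hidden_states: List[List[Vector]]) -> List[Vector]:
    """Pick the first token's vector from each sequence.

    hidden_states[b][t] is the final-layer vector of token t in sequence b.
    """
    return [sequence[0] for sequence in hidden_states]


if __name__ == "__main__":
    # Two sequences, two tokens each, hidden size 2 (made-up numbers).
    batch = [
        [[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0], [7.0, 8.0]],
    ]
    print(first_token_vectors(batch))
```

With a real torch.Tensor the equivalent selection is the slice embeddings[:, 0, :], which reduces the token dimension and leaves one row per input text.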