Miscellaneous

This page documents the miscellaneous classes and functions used in the library.

Tokenizer

class string2string.misc.Tokenizer(word_delimiter: str = ' ')[source]

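A simple tokenizer that converts a string into a list of tokens and a list of tokens back into a string, using a configurable word delimiter.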

__init__(word_delimiter: str = ' ')[source]

Initializes the Tokenizer class.

Parameters:

word_delimiter (str) – The word delimiter. Default is ‘ ’ (a single space).

detokenize(tokens: List[str]) → str[source]

Joins a list of tokens back into a single string.

Parameters:

tokens (List[str]) – The tokens.

Returns:

The string.

Return type:

str

tokenize(text: str) → List[str][source]

Splits a string into a list of tokens.

Parameters:

text (str) – The text to tokenize.

Returns:

The tokens.

Return type:

List[str]
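
The round trip between tokenize and detokenize can be sketched as follows. This is a minimal stand-in, assuming the tokenizer simply splits on word_delimiter and joins with it again; the class name DelimiterTokenizerSketch is hypothetical and is not the library's implementation.

```python
from typing import List


class DelimiterTokenizerSketch:
    """Hypothetical stand-in for illustration; not the library's code."""

    def __init__(self, word_delimiter: str = " ") -> None:
        self.word_delimiter = word_delimiter

    def tokenize(self, text: str) -> List[str]:
        # Assumption: tokens are the substrings between delimiters.
        return text.split(self.word_delimiter)

    def detokenize(self, tokens: List[str]) -> str:
        # Assumption: detokenizing is the inverse join on the same delimiter.
        return self.word_delimiter.join(tokens)


tokenizer = DelimiterTokenizerSketch(word_delimiter=" ")
tokens = tokenizer.tokenize("the quick brown fox")
restored = tokenizer.detokenize(tokens)
print(tokens)
print(restored)
```

Under this assumption, detokenize(tokenize(text)) returns the original text whenever the same delimiter is used for both steps.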

Polynomial Rolling Hash Function

class string2string.misc.PolynomialRollingHash(base: int = 10, modulus: int = 101)[source]

This class implements the polynomial rolling hash function.

__init__(base: int = 10, modulus: int = 101) → None[source]

Initializes the polynomial rolling hash function.

Parameters:
  • base (int) – The base to use. Default is 10.

  • modulus (int) – The modulus to use. Default is 101.

Returns:

None

Note

  • 65537 is a common choice of modulus because it is a Fermat prime (2^16 + 1). Note that the default modulus in the signature above is 101.

compute(str1: str) → int[source]

Returns the hash value of a string.

Parameters:

str1 (str) – The string.

Returns:

The hash value of the string.

Return type:

int

reset() → None[source]

Resets the hash value.

Parameters:

None

Returns:

None

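update(old_char: str, new_char: str, window_size: int) → int[source]

Updates the hash value when the window slides by one character: old_char leaves the window and new_char enters it.

Parameters:
  • old_char (str) – The old character (the one leaving the window).

  • new_char (str) – The new character (the one entering the window).

  • window_size (int) – The size of the window.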

Returns:

The updated hash value.

Return type:

int
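
To make the relationship between compute and update concrete, here is a minimal sketch of a polynomial rolling hash. It is an illustration under stated assumptions, not the library's code: the class name RollingHashSketch is hypothetical, characters are mapped to integers with ord, and compute always starts from a hash of zero.

```python
class RollingHashSketch:
    """Hypothetical stand-in for illustration; not the library's code."""

    def __init__(self, base: int = 10, modulus: int = 101) -> None:
        self.base = base
        self.modulus = modulus
        self.current_hash = 0

    def reset(self) -> None:
        self.current_hash = 0

    def compute(self, text: str) -> int:
        # Horner's rule: hash = (c0*b^(n-1) + c1*b^(n-2) + ... + c(n-1)) mod m.
        self.reset()
        for ch in text:
            self.current_hash = (self.current_hash * self.base + ord(ch)) % self.modulus
        return self.current_hash

    def update(self, old_char: str, new_char: str, window_size: int) -> int:
        # Remove the contribution of the leading character (weight b^(n-1)),
        # shift the remaining characters by one position, then append the new one.
        high_weight = pow(self.base, window_size - 1, self.modulus)
        without_old = (self.current_hash - ord(old_char) * high_weight) % self.modulus
        self.current_hash = (without_old * self.base + ord(new_char)) % self.modulus
        return self.current_hash


hasher = RollingHashSketch()
hasher.compute("abc")                    # hash of the first window
rolled = hasher.update("a", "d", 3)      # slide the window from "abc" to "bcd"
direct = RollingHashSketch().compute("bcd")
print(rolled == direct)
```

The point of update is that sliding the window costs a constant number of arithmetic operations, instead of rehashing all window_size characters.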

Plot Pairwise Alignment

class string2string.misc.plotting_functions.plot_pairwise_alignment(seq1_pieces: str | List[str | int | float] | ndarray, seq2_pieces: str | List[str | int | float] | ndarray, alignment: List[Tuple[int, int]] = [], str2colordict: dict | None = None, padding_factor: float = 1.4, linewidth: float = 1.5, border_to_box: float = 0.2, title: str = 'Pairwise Alignment', seq1_name: str = 'Seq 1', seq2_name: str = 'Seq 2', show: bool = True, save: bool = False, save_path: str = 'pairwise_alignment.png', save_dpi: int = 300, save_bbox_inches: str = 'tight')[source]

This function plots the alignment between two sequences of characters, strings, integers, or floats, each given as a string, a list, or a numpy array. The alignment itself is passed as a list of index tuples that specifies which pieces of the two sequences are aligned.

Parameters:
  • seq1_pieces (Union[str, List[Union[str, int, float]], np.ndarray]) – The pieces of the first sequence (a string, a list, or a numpy array).

  • seq2_pieces (Union[str, List[Union[str, int, float]], np.ndarray]) – The pieces of the second sequence (a string, a list, or a numpy array).

  • alignment (List[Tuple[int, int]]) – The pairwise alignment between the two sequences (default is []).

  • str2colordict (Optional[dict]) – A dictionary of colors for each character/string in the union of the two sequences (default is None).

  • padding_factor (float, optional) – The factor to use for the padding (default is 1.4).

  • linewidth (float, optional) – The linewidth to use for the alignment (default is 1.5).

  • border_to_box (float, optional) – The gap between the border and the box (default is 0.2).

  • title (str, optional) – The title of the plot (default is ‘Pairwise Alignment’).

  • seq1_name (str, optional) – The name of the first sequence (default is ‘Seq 1’).

  • seq2_name (str, optional) – The name of the second sequence (default is ‘Seq 2’).

  • show (bool, optional) – Whether to show the plot (default is True).

  • save (bool, optional) – Whether to save the plot (default is False).

  • save_path (str, optional) – The path to save the plot (default is ‘pairwise_alignment.png’).

  • save_dpi (int, optional) – The dpi to use for the plot (default is 300).

  • save_bbox_inches (str, optional) – The bbox_inches to use for the plot (default is ‘tight’).

Returns:

None

Note

The pairwise alignment is a list of tuples of the form (i, j), where i is the index of a piece in the first sequence and j is the index of the piece it is aligned with in the second sequence.
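
The (i, j) convention described in the note can be illustrated with a small sketch that resolves an alignment into the pieces it connects. The helper aligned_pairs and the sample sequences are hypothetical and only serve to show the expected shape of the alignment argument; the helper does not call the plotting function.

```python
from typing import List, Sequence, Tuple, Union

Piece = Union[str, int, float]


def aligned_pairs(
    seq1_pieces: Sequence[Piece],
    seq2_pieces: Sequence[Piece],
    alignment: List[Tuple[int, int]],
) -> List[Tuple[Piece, Piece]]:
    """Resolve (i, j) index pairs into the pieces they connect."""
    pairs = []
    for i, j in alignment:
        # Each index must point inside its own sequence.
        if not (0 <= i < len(seq1_pieces) and 0 <= j < len(seq2_pieces)):
            raise IndexError(f"alignment pair ({i}, {j}) is out of range")
        pairs.append((seq1_pieces[i], seq2_pieces[j]))
    return pairs


seq1_pieces = ["the", "cat", "sat"]
seq2_pieces = ["a", "cat", "quietly", "sat"]
# "quietly" (index 2 in the second sequence) has no partner, so it is skipped.
alignment = [(0, 0), (1, 1), (2, 3)]

pairs = aligned_pairs(seq1_pieces, seq2_pieces, alignment)
print(pairs)
```

Values shaped like seq1_pieces, seq2_pieces, and alignment here are what the signature above takes as its first three arguments.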

Plot Heatmap

class string2string.misc.plotting_functions.plot_heatmap(data: List[List[str | int | float]] | ndarray, title: str = 'Heatmap', x_label: str = 'X', y_label: str = 'Y', x_ticks: List[str] | None = None, y_ticks: List[str] | None = None, colorbar_kwargs: dict | None = None, color_threshold: float | None = None, textcolors=('black', 'white'), valfmt='{x:.1f}', legend: bool = False, show: bool = True, save: bool = False, save_path: str = 'heatmap.png', save_dpi: int = 300, save_bbox_inches: str = 'tight', **kwargs)[source]

This function creates a heatmap from a 2D array of data, given as a list of lists or a numpy array (for example, a confusion matrix or a correlation matrix). The resulting plot shows the data as a color-coded grid.

Parameters:
  • data (Union[List[List[Union[str, int, float]]], np.ndarray]) – The data to plot.

  • title (str, optional) – The title of the plot (default: ‘Heatmap’).

  • x_label (str, optional) – The label of the x-axis (default: ‘X’).

  • y_label (str, optional) – The label of the y-axis (default: ‘Y’).

  • x_ticks (List[str], optional) – The ticks of the x-axis (default: None).

  • y_ticks (List[str], optional) – The ticks of the y-axis (default: None).

  • colorbar_kwargs (dict, optional) – The keyword arguments for the colorbar (default: None).

  • color_threshold (float, optional) – The threshold to use for the color (default: None).

  • textcolors (tuple, optional) – The colors to use for the text (default: (“black”, “white”)).

  • valfmt (str, optional) – The format to use for the values (default: “{x:.1f}”).

  • legend (bool, optional) – Whether to show the legend (default: False).

  • show (bool, optional) – Whether to show the plot (default: True).

  • save (bool, optional) – Whether to save the plot (default: False).

  • save_path (str, optional) – The path to save the plot (default: ‘heatmap.png’).

  • save_dpi (int, optional) – The dpi to use for the plot (default: 300).

  • save_bbox_inches (str, optional) – The bbox_inches to use for the plot (default: ‘tight’).

  • **kwargs – The keyword arguments for the heatmap.
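
The parameters color_threshold, textcolors, and valfmt are described only briefly above. The sketch below shows one plausible reading of how they interact, modeled on the common annotated-heatmap pattern: each cell's label is produced by valfmt, and its text color is the second entry of textcolors when the normalized value exceeds the threshold. This behavior is an assumption, not something the documentation confirms; the helper annotate_cells is hypothetical and handles numeric data only.

```python
from typing import List, Optional, Sequence, Tuple


def annotate_cells(
    data: Sequence[Sequence[float]],
    color_threshold: Optional[float] = None,
    textcolors: Tuple[str, str] = ("black", "white"),
    valfmt: str = "{x:.1f}",
) -> List[List[Tuple[str, str]]]:
    """Return a (label, text color) pair for every cell of a numeric grid."""
    flat = [value for row in data for value in row]
    lowest, highest = min(flat), max(flat)
    span = (highest - lowest) or 1.0  # avoid division by zero for constant data

    # Assumption: without an explicit threshold, use the middle of the range.
    if color_threshold is None:
        threshold = 0.5
    else:
        threshold = (color_threshold - lowest) / span

    annotations = []
    for row in data:
        annotated_row = []
        for value in row:
            normalized = (value - lowest) / span
            # Cells above the threshold get the second text color.
            color = textcolors[int(normalized > threshold)]
            annotated_row.append((valfmt.format(x=value), color))
        annotations.append(annotated_row)
    return annotations


grid = [[1.0, 9.0], [4.0, 6.0]]
cells = annotate_cells(grid)
print(cells)
```

Switching text color by cell value keeps the labels readable on both light and dark cells, which is the usual reason for a pair of text colors.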

Generate 2D-Scatter Plot of Embeddings with Plotly

class string2string.misc.plotting_functions.plot_corpus_embeds_with_plotly(corpus_embeddings: List[List[int | float]] | ndarray | Tensor, corpus_labels: List[str], corpus_hover_texts: List[str], corpus_scatter_kwargs: dict | None = {}, layoot_dict: dict | None = None, query_embeddings: List[List[int | float]] | ndarray | None = None, query_labels: List[str] | None = None, query_hover_texts: List[str] | None = None, query_modes: List[str] | str | None = 'markers', query_marker_dict: dict | None = None, show_plot: bool = True, save_path: str | None = None)[source]

This function generates a 2D scatter plot with plotly from a corpus of embeddings and their corresponding labels. Each point is drawn with a color and shape determined by its label. Query embeddings and their labels can optionally be supplied as well; they are plotted separately, with a distinct color and shape.

Parameters:
  • corpus_embeddings – A list of lists, a numpy array, or a torch tensor of corpus embeddings (e.g. sentence embeddings).

  • corpus_labels – A list of labels for the corpus embeddings.

  • corpus_hover_texts – A list of hover texts for the corpus embeddings.

  • corpus_scatter_kwargs – A dictionary of keyword arguments for the corpus scatter plot (e.g. marker size, marker color, etc.) (default: {}).

  • layoot_dict – A dictionary of keyword arguments for the layout of the plot (e.g. title, x-axis title, y-axis title, etc.) (default: None).

  • query_embeddings – A list of lists or a numpy array of query embeddings (e.g. sentence embeddings) (default: None).

  • query_labels – A list of labels for the query embeddings (default: None).

  • query_hover_texts – A list of hover texts for the query embeddings (default: None).

  • query_modes – A mode, or a list of modes, for the query embeddings (default: ‘markers’).

  • query_marker_dict – A dictionary of keyword arguments for the query scatter plot (e.g. marker size, marker color, etc.) (default: None).

  • show_plot – Whether to show the plot (default: True).

  • save_path – The path at which to save the plot (e.g., ‘corpus_embeddings.html’) (default: None).

Returns:

A plotly figure object.

Return type:

go.Figure

Note

Please refer to the Hands-on Tutorial on Semantic Search with HUPD Patent Data for a good demonstration of how to use this function.
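
As a rough illustration of the "one color and shape per label" idea described above, the sketch below groups 2D points by label into plotly-style trace dictionaries. It rests on several assumptions: the embeddings are already two-dimensional, one trace per label is a reasonable way to get per-label styling, and the helper build_label_traces is hypothetical. It does not reproduce how the library function is implemented.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def build_label_traces(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[str],
    hover_texts: Sequence[str],
    mode: str = "markers",
) -> List[Dict]:
    """Group 2D points by label into scatter-trace dictionaries."""
    grouped = defaultdict(lambda: {"x": [], "y": [], "text": []})
    for (x, y), label, hover in zip(embeddings, labels, hover_texts):
        grouped[label]["x"].append(x)
        grouped[label]["y"].append(y)
        grouped[label]["text"].append(hover)

    # One trace per label, in order of first appearance.
    return [
        {"type": "scatter", "mode": mode, "name": label, **points}
        for label, points in grouped.items()
    ]


corpus_embeddings = [[0.1, 0.2], [0.9, 0.8], [0.15, 0.25]]
corpus_labels = ["physics", "biology", "physics"]
corpus_hover_texts = ["doc A", "doc B", "doc C"]

traces = build_label_traces(corpus_embeddings, corpus_labels, corpus_hover_texts)
print([trace["name"] for trace in traces])
```

Assuming plotly is installed, a list of trace dictionaries like this can be passed to plotly.graph_objects.Figure(data=traces) to obtain a figure object.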