Miscellaneous

This page documents the miscellaneous classes and functions used in the library.

Tokenizer

class string2string.misc.Tokenizer(word_delimiter: str = ' ')

This class implements a simple tokenizer that converts between a string and a list of tokens, using a configurable word delimiter.

__init__(word_delimiter: str = ' ')

Initializes the Tokenizer class.

Parameters: word_delimiter (str) – The word delimiter. Default is ' ' (a single space).

detokenize(tokens: List[str]) → str

Returns the string from a list of tokens.

Parameters: tokens (List[str]) – The tokens.
Returns: The string.
Return type: str

tokenize(text: str) → List[str]

Returns the tokens from a string.

Parameters: text (str) – The text to tokenize.
Returns: The tokens.
Return type: List[str]
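
The round trip between tokenize and detokenize can be illustrated with a minimal sketch. SketchTokenizer below is a hypothetical stand-in, not the library class; it assumes that tokenizing splits the text on word_delimiter and that detokenizing joins the tokens with the same delimiter.

```python
from typing import List


class SketchTokenizer:
    """Hypothetical stand-in for illustration; not string2string.misc.Tokenizer."""

    def __init__(self, word_delimiter: str = " ") -> None:
        self.word_delimiter = word_delimiter

    def tokenize(self, text: str) -> List[str]:
        # Assumption: tokens are the pieces between occurrences of the delimiter.
        return text.split(self.word_delimiter)

    def detokenize(self, tokens: List[str]) -> str:
        # Assumption: detokenizing joins the tokens with the same delimiter.
        return self.word_delimiter.join(tokens)


tokenizer = SketchTokenizer(word_delimiter=" ")
tokens = tokenizer.tokenize("the quick brown fox")
round_trip = tokenizer.detokenize(tokens)
```

Under these assumptions, detokenizing the output of tokenize gives back the original text.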

Polynomial Rolling Hash Function

class string2string.misc.PolynomialRollingHash(base: int = 10, modulus: int = 101)

This class implements a polynomial rolling hash function.

__init__(base: int = 10, modulus: int = 101) → None

Initializes the polynomial rolling hash function.

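Parameters:

base (int) – The base to use. Default is 10.
modulus (int) – The modulus to use. Default is 101.

Returns:

None

Note

The defaults listed above follow the class signature (base 10, modulus 101). Another possible modulus is 65537, which is a Fermat prime (2^16 + 1).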

compute(str1: str) → int

Returns the hash value of a string.

Parameters: str1 (str) – The string.
Returns: The hash value of the string.
Return type: int

reset() → None

Resets the hash value.

Parameters: None
Returns: None

update(old_char: str, new_char: str, window_size: int) → int

Updates the hash value of a string.

Parameters:

old_char (str) – The old character.
new_char (str) – The new character.
window_size (int) – The size of the window.

Returns:

The updated hash value of the string.

Return type:

int
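
The idea behind compute and update can be sketched in a few lines. This is a standalone illustration of the general polynomial rolling hash technique, not the library's implementation: the helper names poly_hash and roll are hypothetical, characters are assumed to be mapped to integers with ord, and the base and modulus are taken from the class signature.

```python
def poly_hash(text: str, base: int = 10, modulus: int = 101) -> int:
    """Hash text as a polynomial in base, reduced mod modulus (Horner's rule)."""
    value = 0
    for ch in text:
        value = (value * base + ord(ch)) % modulus
    return value


def roll(value: int, old_char: str, new_char: str, window_size: int,
         base: int = 10, modulus: int = 101) -> int:
    """Slide the window one step: drop old_char on the left, add new_char on the right."""
    # Weight of the leftmost character in a window of length window_size.
    high_order = pow(base, window_size - 1, modulus)
    value = (value - ord(old_char) * high_order) % modulus
    return (value * base + ord(new_char)) % modulus


text = "abracadabra"
window_size = 3

# Hash the first window directly, then roll across the rest of the text.
rolled = [poly_hash(text[:window_size])]
for start in range(1, len(text) - window_size + 1):
    rolled.append(
        roll(rolled[-1], text[start - 1], text[start + window_size - 1], window_size)
    )

# Recomputing every window from scratch gives the same hash values.
direct = [
    poly_hash(text[start:start + window_size])
    for start in range(len(text) - window_size + 1)
]
```

Each rolling step costs constant time regardless of the window size, which is why the update takes only the outgoing character, the incoming character, and the window size.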

Plot Pairwise Alignment

string2string.misc.plotting_functions.plot_pairwise_alignment(seq1_pieces: str | List[str | int | float] | ndarray, seq2_pieces: str | List[str | int | float] | ndarray, alignment: List[Tuple[int, int]] = [], str2colordict: dict | None = None, padding_factor: float = 1.4, linewidth: float = 1.5, border_to_box: float = 0.2, title: str = 'Pairwise Alignment', seq1_name: str = 'Seq 1', seq2_name: str = 'Seq 2', show: bool = True, save: bool = False, save_path: str = 'pairwise_alignment.png', save_dpi: int = 300, save_bbox_inches: str = 'tight')

This function plots the alignment between two sequences, each given as a string, a list of characters, strings, integers, or floats, or a numpy array. The alignment itself is passed as a list of tuples that pairs positions in the first sequence with positions in the second.

Parameters:

seq1_pieces (Union[str, List[Union[str, int, float]], np.ndarray]) – The pieces of the first string or list of strings.
seq2_pieces (Union[str, List[Union[str, int, float]], np.ndarray]) – The pieces of the second string or list of strings.
alignment (List[Tuple[int, int]]) – The pairwise alignment between the two strings.
str2colordict (dict, optional) – A dictionary of colors for each character/string in the union of the two strings (default is None).
padding_factor (float, optional) – The factor to use for the padding (default is 1.4).
linewidth (float, optional) – The linewidth to use for the alignment (default is 1.5).
border_to_box (float, optional) – The gap between the border and the box (default is 0.2).
title (str, optional) – The title of the plot (default is ‘Pairwise Alignment’).
seq1_name (str, optional) – The name of the first sequence (default is ‘Seq 1’).
seq2_name (str, optional) – The name of the second sequence (default is ‘Seq 2’).
show (bool, optional) – Whether to show the plot (default is True).
save (bool, optional) – Whether to save the plot (default is False).
save_path (str, optional) – The path to save the plot (default is ‘pairwise_alignment.png’).
save_dpi (int, optional) – The dpi to use for the plot (default is 300).
save_bbox_inches (str, optional) – The bbox_inches to use for the plot (default is ‘tight’).

Returns:

None

Note

The pairwise alignment is a list of tuples of the form (i, j) where i is the index of the character in the first string and j is the index of the character in the second string.
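
The (i, j) format can be illustrated with a small sketch that builds such a list by hand. The helper matching_positions is hypothetical and is not part of the library; it simply pairs each index with itself wherever the two sequences hold equal pieces, which is one easy way to obtain a valid list of index tuples.

```python
from typing import List, Sequence, Tuple


def matching_positions(seq1: Sequence, seq2: Sequence) -> List[Tuple[int, int]]:
    """Hypothetical helper: pair index i with itself wherever the pieces are equal."""
    return [(i, i) for i, (a, b) in enumerate(zip(seq1, seq2)) if a == b]


seq1_pieces = list("kitten")
seq2_pieces = list("sitting")

# Each tuple (i, j) links seq1_pieces[i] to seq2_pieces[j].
alignment = matching_positions(seq1_pieces, seq2_pieces)
```

A list built this way has the shape expected by the alignment parameter, alongside seq1_pieces and seq2_pieces.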

Plot Heatmap

string2string.misc.plotting_functions.plot_heatmap(data: List[List[str | int | float]] | ndarray, title: str = 'Heatmap', x_label: str = 'X', y_label: str = 'Y', x_ticks: List[str] | None = None, y_ticks: List[str] | None = None, colorbar_kwargs: dict | None = None, color_threshold: float | None = None, textcolors=('black', 'white'), valfmt='{x:.1f}', legend: bool = False, show: bool = True, save: bool = False, save_path: str = 'heatmap.png', save_dpi: int = 300, save_bbox_inches: str = 'tight', **kwargs)

This function creates a heatmap visualization based on a given 2D array of data. The input array can represent a variety of data structures, such as a confusion matrix or a correlation matrix, and can be represented as a list of lists or a numpy array. The resulting plot will visually represent the data in the input array using a color-coded grid.

Parameters:

data (Union[List[List[Union[str, int, float]]], np.ndarray]) – The data to plot.
title (str, optional) – The title of the plot (default: ‘Heatmap’).
x_label (str, optional) – The label of the x-axis (default: ‘X’).
y_label (str, optional) – The label of the y-axis (default: ‘Y’).
x_ticks (List[str], optional) – The ticks of the x-axis (default: None).
y_ticks (List[str], optional) – The ticks of the y-axis (default: None).
colorbar_kwargs (dict, optional) – The keyword arguments for the colorbar (default: None).
color_threshold (float, optional) – The threshold to use for the color (default: None).
textcolors (tuple, optional) – The colors to use for the text (default: (“black”, “white”)).
valfmt (str, optional) – The format to use for the values (default: “{x:.1f}”).
legend (bool, optional) – Whether to show the legend (default: False).
show (bool, optional) – Whether to show the plot (default: True).
save (bool, optional) – Whether to save the plot (default: False).
save_path (str, optional) – The path to save the plot (default: ‘heatmap.png’).
save_dpi (int, optional) – The dpi to use for the plot (default: 300).
save_bbox_inches (str, optional) – The bbox_inches to use for the plot (default: ‘tight’).
**kwargs – The keyword arguments for the heatmap.
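
As an illustration of the kind of input described above, the following sketch builds a small confusion matrix as a list of lists. The label names and predictions are made up for the example, and the plotting call itself is left out; the sketch only prepares values that would fit the data, x_ticks, and y_ticks parameters.

```python
from typing import Dict, List

# Made-up example data: true and predicted labels for five items.
labels: List[str] = ["cat", "dog"]
y_true: List[str] = ["cat", "cat", "dog", "dog", "dog"]
y_pred: List[str] = ["cat", "dog", "dog", "dog", "cat"]

# Map each label to its row/column position in the matrix.
index: Dict[str, int] = {label: k for k, label in enumerate(labels)}

# data[r][c] counts the items whose true label is labels[r] and predicted label is labels[c].
data: List[List[int]] = [[0] * len(labels) for _ in labels]
for true_label, predicted_label in zip(y_true, y_pred):
    data[index[true_label]][index[predicted_label]] += 1

# The same label list can serve as tick labels for both axes.
x_ticks = labels
y_ticks = labels
```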

Generate 2D-Scatter Plot of Embeddings with Plotly

string2string.misc.plotting_functions.plot_corpus_embeds_with_plotly(corpus_embeddings: List[List[int | float]] | ndarray | Tensor, corpus_labels: List[str], corpus_hover_texts: List[str], corpus_scatter_kwargs: dict | None = {}, layoot_dict: dict | None = None, query_embeddings: List[List[int | float]] | ndarray | None = None, query_labels: List[str] | None = None, query_hover_texts: List[str] | None = None, query_modes: List[str] | str | None = 'markers', query_marker_dict: dict | None = None, show_plot: bool = True, save_path: str | None = None)

This function generates a 2D scatter plot with plotly from a corpus of embeddings and their corresponding labels. Each point is drawn with a color and shape determined by its label. Optionally, query embeddings and their labels can also be supplied; these are plotted separately on the same scatter plot with a distinct color and shape.

Parameters:

corpus_embeddings – A list of lists or a numpy array or a torch tensor of corpus embeddings (e.g. sentence embeddings).
corpus_labels – A list of labels for the corpus embeddings.
corpus_hover_texts – A list of hover texts for the corpus embeddings.
corpus_scatter_kwargs – A dictionary of keyword arguments for the corpus scatter plot (e.g. marker size, marker color, etc.) (default: {}).
layoot_dict – A dictionary of keyword arguments for the layout of the plot (e.g. title, x-axis title, y-axis title, etc.) (default: None).
query_embeddings – A list of lists or a numpy array of query embeddings (e.g. sentence embeddings) (default: None).
query_labels – A list of labels for the query embeddings (default: None).
query_hover_texts – A list of hover texts for the query embeddings (default: None).
query_modes – A mode, or a list of modes, for the query embeddings (default: ‘markers’).
query_marker_dict – A dictionary of keyword arguments for the query scatter plot (e.g. marker size, marker color, etc.) (default: None).
show_plot – Whether to show the plot (default: True).
save_path – A string of the path to save the plot (e.g., ‘corpus_embeddings.html’) (default: None).

Returns:

A plotly figure object.

Return type:

go.Figure

Note

Please refer to the Hands-on Tutorial on Semantic Search with HUPD Patent Data for a good demonstration of how to use this function.