API Reference

Discrete spaces

class gpmap.space.DiscreteSpace(adjacency_matrix, y=None, state_labels=None)

Class to define an arbitrary discrete space characterized by the connectivity between different states and optionally by a scalar value (e.g. fitness or energy) associated with each state.

Parameters:
adjacency_matrix: scipy.sparse.csr_matrix of shape (n_states, n_states)

Sparse matrix representing the adjacency relationships between states. The (i, j) entry contains a 1 if states i and j are connected, and 0 otherwise.

y: array-like of shape (n_states,), optional

Function value associated with each state.

state_labels: array-like of shape (n_states,), optional

Labels for the states in the discrete space.

Attributes:
n_states: int

Number of states in the discrete space.

state_labels: array-like of shape (n_states,)

Labels for the states in the discrete space.

state_idxs: pd.Series of shape (n_states,)

A pandas Series mapping state labels to their corresponding indices. The index of the Series is state_labels, allowing quick lookup of indices for a given set of state labels.

is_regular: bool

Whether the space is regular, that is, whether every state has the same number of neighbors.
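
Under the usual graph-theoretic meaning of "regular", the check reduces to comparing the row sums of the adjacency matrix. The following is a minimal sketch of that idea, not the library's implementation: it uses plain nested lists instead of a scipy CSR matrix, and the helper name `is_regular` is hypothetical.

```python
# Dense 0/1 adjacency matrix of a 4-cycle: 0-1-2-3-0 (illustrative data).
adjacency = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]

# A path 0-1-2: the end states have one neighbor, the middle state has two.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]


def is_regular(adj):
    """Return True if every state has the same number of neighbors."""
    degrees = {sum(row) for row in adj}
    return len(degrees) == 1


print(is_regular(adjacency))  # True
print(is_regular(path))       # False
```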

Methods

get_edges_df()

Generate a DataFrame representing the edges of the adjacency graph.

get_neighbor_pairs()

Retrieve pairs of indices representing connected states in the DiscreteSpace.

get_neighbors(states[, max_distance])

Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.

get_state_idxs(states)

Return the indices of the provided state labels.

get_edges_df()

Generate a DataFrame representing the edges of the adjacency graph.

This method retrieves pairs of neighboring nodes from the adjacency matrix and constructs a DataFrame where each row represents an edge between two nodes.

Returns:
edges_df: pd.DataFrame

A DataFrame with two columns:

  • 'i': The source node of the edge.

  • 'j': The target node of the edge.

get_neighbor_pairs()

Retrieve pairs of indices representing connected states in the DiscreteSpace.

Returns:
tuple of np.ndarray

Two arrays of indices, where the first array contains the source indices and the second array contains the target indices of the connections.
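
As an illustration of what a pair of source and target index arrays looks like, the sketch below reads them off a small dense 0/1 matrix. The helper `neighbor_pairs` is hypothetical, and whether the library reports each undirected edge in both directions, as this sketch does, is an assumption.

```python
def neighbor_pairs(adj):
    """Return (sources, targets) index lists for every nonzero entry of adj."""
    sources, targets = [], []
    for i, row in enumerate(adj):
        for j, value in enumerate(row):
            if value:
                sources.append(i)
                targets.append(j)
    return sources, targets


# Path graph 0-1-2 as a dense 0/1 matrix.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
i, j = neighbor_pairs(adj)
edges = list(zip(i, j))
print(edges)  # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

Zipping the two lists gives one row per edge, which is the layout described for the 'i' and 'j' columns of get_edges_df().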

get_neighbors(states, max_distance=1)

Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.

Parameters:
states: array-like of shape (state_number,)

A list or numpy array of state labels from which to find neighbors.

max_distance: int, optional, default=1

The maximum distance within which neighbors of the provided states will be included.

Returns:
neighbor_states: np.array

An array containing the state labels of all unique neighbors within the specified distance from the input states.
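
One way to picture a neighborhood with a maximum distance is a breadth-first search that stops expanding once it reaches that distance. This is a sketch under assumptions: the function `neighbors_within` and the dictionary-based adjacency list are hypothetical, and the choice to include the input states themselves in the result may not match the library.

```python
from collections import deque


def neighbors_within(adj_list, start_states, max_distance=1):
    """Breadth-first search up to max_distance steps from any start state."""
    distance = {s: 0 for s in start_states}
    queue = deque(start_states)
    while queue:
        state = queue.popleft()
        if distance[state] == max_distance:
            continue  # do not expand beyond the requested radius
        for nxt in adj_list[state]:
            if nxt not in distance:
                distance[nxt] = distance[state] + 1
                queue.append(nxt)
    return sorted(distance)


# Path graph A-B-C-D with string labels.
adj_list = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(neighbors_within(adj_list, ["A"], max_distance=1))  # ['A', 'B']
print(neighbors_within(adj_list, ["A"], max_distance=2))  # ['A', 'B', 'C']
```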

get_state_idxs(states)

Return the indices of the provided state labels.

class gpmap.space.ProductSpace(elementary_graphs, y=None, state_labels=None)

General class for constructing spaces as Cartesian products of smaller subspaces, each characterized by a set of elementary graphs.

Parameters:
elementary_graphs: list of scipy.sparse.csr_matrix

List of adjacency matrices (in CSR format) representing the elementary graphs that define the subspaces.

y: array-like of shape (n,), optional

Array containing the phenotypic values associated with each combination of states in the resulting space. If y is None, no phenotypic values will be stored.

state_labels: list, optional

List of labels associated with each possible state in the product space. If state_labels is None, numeric labels will be assigned by default.
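
In a Cartesian product of graphs, two product states are adjacent when they differ in exactly one coordinate and the two differing values are adjacent in that coordinate's elementary graph. The sketch below builds such an adjacency matrix with plain lists; `product_adjacency` is a hypothetical helper, and the state ordering it produces is an assumption rather than the ordering ProductSpace uses.

```python
from itertools import product


def product_adjacency(elementary_graphs):
    """Adjacency of the Cartesian product of graphs given as dense 0/1 matrices."""
    sizes = [len(g) for g in elementary_graphs]
    states = list(product(*[range(n) for n in sizes]))
    index = {s: k for k, s in enumerate(states)}
    adj = [[0] * len(states) for _ in states]
    for s in states:
        for dim, graph in enumerate(elementary_graphs):
            # Change only coordinate `dim`, following an edge of that graph.
            for t, connected in enumerate(graph[s[dim]]):
                if connected:
                    neighbor = s[:dim] + (t,) + s[dim + 1:]
                    adj[index[s]][index[neighbor]] = 1
    return states, adj


edge = [[0, 1], [1, 0]]  # two states joined by one edge
states, adj = product_adjacency([edge, edge])
print(states)                     # [(0, 0), (0, 1), (1, 0), (1, 1)]
print([sum(row) for row in adj])  # [2, 2, 2, 2]
```

The product of two single edges is a square: each corner has two neighbors and the diagonals are not connected.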

Attributes:
is_regular

Whether the space is regular, that is, whether every state has the same number of neighbors.

n_edges

Methods

get_edges_df()

Generate a DataFrame representing the edges of the adjacency graph.

get_neighbor_pairs()

Retrieve pairs of indices representing connected states in the DiscreteSpace.

get_neighbors(states[, max_distance])

Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.

get_state_idxs(states)

Return the indices of the provided state labels.

calc_adjacency_matrix

calc_states

format_list_ends

format_values

get_y

init_space

set_dim_sizes

set_y

write_csv

write_edges

class gpmap.space.GridSpace(length, y=None, ndim=2)

N-dimensional grid discrete space.

A discrete space formed by the Cartesian product of one-dimensional spaces of n ordered states, each represented by a path graph in which consecutive states are connected.

Parameters:
length: int or array-like

The number of states across each dimension of the grid. If an integer is provided, all dimensions of the grid will have the same length. If an array-like of lengths is provided, they will be used to form a grid with the specified dimensions, and the ndim argument will be ignored.

ndim: int

The number of dimensions in the grid when a single length value is provided.

y: array-like of shape (length ** ndim,) or None

Phenotypic values associated with each possible state.
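
The description of length and ndim can be sketched as follows, assuming that grid states are tuples of integer coordinates and that neighbors differ by one step along a single dimension. Both helpers are hypothetical and only illustrate the geometry, not GridSpace itself.

```python
from itertools import product


def grid_lengths(length, ndim=2):
    """An int is repeated ndim times; a sequence of lengths is used as given."""
    return [length] * ndim if isinstance(length, int) else list(length)


def grid_neighbors(state, lengths):
    """Neighbors of a grid state: move one step along exactly one dimension."""
    result = []
    for dim, size in enumerate(lengths):
        for step in (-1, 1):
            pos = state[dim] + step
            if 0 <= pos < size:
                result.append(state[:dim] + (pos,) + state[dim + 1:])
    return result


lengths = grid_lengths(3, ndim=2)
states = list(product(*[range(n) for n in lengths]))
print(len(states))                      # 9
print(grid_neighbors((0, 0), lengths))  # [(1, 0), (0, 1)]
print(grid_neighbors((1, 1), lengths))  # [(0, 1), (2, 1), (1, 0), (1, 2)]
```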

Methods

set_peaks

class gpmap.space.SequenceSpace(X=None, y=None, seq_length=None, n_alleles=None, alphabet_type='dna', alphabet=None, stop_y=None)

Space of all possible sequences of certain length.

Class for creating a sequence space, whose states are sequences. Two states are connected in the discrete space if they differ at a single position in the sequence. It can be created in two different ways:

  1. From a set of sequences and function values (X, y).

  2. By specifying the properties of the sequence space (alphabet, sequence length, number of alleles per site, and type of alphabet).
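
The connectivity rule can be illustrated without the library by enumerating a small complete space and linking sequences at Hamming distance 1. This is only a sketch: the variable names are illustrative and the genotype ordering is an assumption.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


alphabet = "ACGT"
seq_length = 2
genotypes = ["".join(p) for p in product(alphabet, repeat=seq_length)]

# Two genotypes are adjacent when they differ at exactly one position.
neighbors = {g: [h for h in genotypes if hamming(g, h) == 1] for g in genotypes}

print(len(genotypes))   # 16
print(neighbors["AA"])  # ['AC', 'AG', 'AT', 'CA', 'GA', 'TA']
```

Every genotype here has seq_length * (n_alleles - 1) = 6 neighbors.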

Parameters:
X: array-like of shape (n_genotypes,), optional

Sequences to use as state labels of the discrete sequence space.

y: array-like of shape (n_genotypes,), optional

Quantitative phenotype or fitness associated with each genotype.

seq_length: int, optional

Length of the sequences in the sequence space. If not provided, it will be inferred from alphabet or n_alleles.

n_alleles: list of int, optional

List containing the number of alleles present at each site in the sequence space. This can only be specified for alphabet_type='custom'.

alphabet_type: str, default='dna'

Type of sequence. Options are {'dna', 'rna', 'protein', 'custom'}.

alphabet: list of lists, optional

A list where each element is itself a list containing the different alleles allowed at each site. The number and type of alleles can vary across sites.

stop_y: float, optional

Value of the function assigned to protein sequences with an in-frame stop codon. If provided, the protein alphabet will be extended to include * for stop codons.

Attributes:
n_genotypes: int

Number of states in the complete sequence space.

genotypes: array-like of shape (n_genotypes,)

Genotype labels in the sequence space.

adjacency_matrix: scipy.sparse.csr_matrix of shape (n_genotypes, n_genotypes)

Sparse matrix representing the adjacency relationships between genotypes. The (i, j) entry contains a 1 if genotypes i and j differ by a single mutation, and 0 otherwise.

y: array-like of shape (n_genotypes,), optional

Quantitative phenotype or fitness associated with each genotype.

is_regular: bool

Whether the space is regular, that is, whether every state has the same number of neighbors.

Methods

get_single_mutant_matrix(sequence[, center])

Calculate the effects of single point mutations from a focal sequence.

remove_codon_incompatible_transitions([...])

Recalculate the adjacency matrix to allow only codon-compatible transitions in a protein sequence space.

to_nucleotide_space([codon_table, alphabet_type])

Convert a protein sequence space into a nucleotide sequence space.

get_single_mutant_matrix(sequence, center=False)

Calculate the effects of single point mutations from a focal sequence.

Parameters:
sequence: str

The sequence from which to compute all single point mutant effects.

center: bool, optional, default=False

If True, the results will be centered by position, ensuring that the mean of allelic effects at each position is 0. If False, the focal sequence will have a value of 0, and the results will represent mutational effects relative to it.

Returns:
output: pd.DataFrame of shape (seq_length, total_alleles)

A DataFrame containing the mutational or allelic effects for each allele across all sequence positions.
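
The two settings of center can be illustrated on a toy landscape. The sketch below assumes function values stored in a dictionary keyed by sequence and returns nested lists rather than a DataFrame; `single_mutant_effects` is a hypothetical stand-in, not the library method.

```python
def single_mutant_effects(y, sequence, alphabet, center=False):
    """Rows are positions, columns are alleles; y maps sequence -> value."""
    rows = []
    for pos in range(len(sequence)):
        row = []
        for allele in alphabet:
            mutant = sequence[:pos] + allele + sequence[pos + 1:]
            # Effect relative to the focal sequence (0 for the focal allele).
            row.append(y[mutant] - y[sequence])
        if center:
            # Shift each position so its allelic effects average to 0.
            mean = sum(row) / len(row)
            row = [value - mean for value in row]
        rows.append(row)
    return rows


y = {"AA": 1.0, "AB": 2.0, "BA": 4.0, "BB": 3.0}
print(single_mutant_effects(y, "AA", "AB"))
# [[0.0, 3.0], [0.0, 1.0]]
print(single_mutant_effects(y, "AA", "AB", center=True))
# [[-1.5, 1.5], [-0.5, 0.5]]
```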

remove_codon_incompatible_transitions(codon_table='Standard')

Recalculate the adjacency matrix to allow only codon-compatible transitions in a protein sequence space.

This method updates the adjacency matrix of the sequence space to ensure that transitions between states are compatible with the specified codon table. Only transitions that result in valid amino acid substitutions according to the codon table will be allowed.

Parameters:
codon_table: str or Bio.Data.CodonTable

The NCBI code for an existing genetic code or a custom CodonTable object used to translate nucleotide sequences into proteins.
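
To make "codon-compatible" concrete, the sketch below lists the amino acid pairs that are one nucleotide change apart under the standard genetic code. The code table is written out by hand instead of being taken from Bio.Data.CodonTable, the helper `codon_compatible_pairs` is hypothetical, and treating stop codons as ordinary states is an assumption.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_compatible_pairs(table):
    """Amino acid pairs connected by at least one single-nucleotide change."""
    pairs = set()
    for codon, aa in table.items():
        for pos in range(3):
            for base in BASES:
                other = table[codon[:pos] + base + codon[pos + 1:]]
                if other != aa:
                    pairs.add(frozenset((aa, other)))
    return pairs


pairs = codon_compatible_pairs(CODON_TABLE)
print(frozenset("KN") in pairs)  # True: AAA (K) and AAT (N) differ at one base
print(frozenset("MW") in pairs)  # False: ATG (M) to TGG (W) needs two changes
```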

to_nucleotide_space(codon_table='Standard', alphabet_type='dna')

Convert a protein sequence space into a nucleotide sequence space.

This method transforms a protein sequence space into a nucleotide sequence space using a specified codon table for translation. The resulting nucleotide space will have 4 alleles per site and 3 times the number of sites as the original protein space. It assumes that the function associated with each nucleotide sequence depends only on the protein sequence it encodes.

Parameters:
codon_table: str or Bio.Data.CodonTable

The NCBI code for an existing genetic code or a custom CodonTable object used to translate nucleotide sequences into proteins.

alphabet_type: str, optional, default='dna'

The type of nucleotide sequence to use in the resulting space. Must be one of {'dna', 'rna'}.

Returns:
SequenceSpace

A nucleotide sequence space with the specified properties.
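
The assumption that the function depends only on the encoded protein can be sketched for a one-residue protein, which maps to 3 nucleotide sites with 4 alleles each. The protein-level values and the value given to stop codons below are made up for illustration, and the hand-written standard code table stands in for a Bio.Data.CodonTable object.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate a nucleotide sequence whose length is a multiple of 3."""
    return "".join(
        CODON_TABLE[nucleotides[i:i + 3]] for i in range(0, len(nucleotides), 3)
    )


# Hypothetical function values for the 20 amino acids and for stop codons.
protein_y = {
    aa: float(rank) for rank, aa in enumerate(sorted(set(AMINO_ACIDS) - {"*"}))
}
stop_y = -1.0

# Each nucleotide sequence inherits the value of the protein it encodes.
nucleotide_y = {}
for codon in product(BASES, repeat=3):
    seq = "".join(codon)
    nucleotide_y[seq] = protein_y.get(translate(seq), stop_y)

print(len(nucleotide_y))                           # 64
print(nucleotide_y["AAA"] == nucleotide_y["AAG"])  # True, both encode K
```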

class gpmap.space.HammingBallSpace(X0, X=None, y=None, d=None, n_alleles=None, alphabet_type='dna', alphabet=None)

Discrete space representing a Hamming ball space around a target sequence, including all sequences within a specified maximum Hamming distance.

Parameters:
X0: str

The focal sequence around which the Hamming ball space is constructed.

X: array-like of shape (n_genotypes,), optional

Sequences to use as state labels for the discrete sequence space.

y: array-like of shape (n_genotypes,), optional

Quantitative phenotype or fitness values associated with each genotype.

d: int, optional

The maximum Hamming distance from the focal sequence to include in the space.

n_alleles: list of int, optional

A list specifying the number of alleles present at each site of the sequence. This can only be specified for alphabet_type='custom'.

alphabet_type: str, default='dna'

The type of sequence. Options are {'dna', 'rna', 'protein', 'custom'}.

alphabet: list of lists, optional

A list where each element is itself a list containing the different alleles allowed at each site. The number and type of alleles can vary across sites.
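
The set of sequences within distance d of a focal sequence can be enumerated by choosing which positions to mutate and which alleles to place there. The sketch assumes the same alphabet at every site; `hamming_ball` is a hypothetical helper and the ordering of the returned sequences is arbitrary.

```python
from itertools import combinations, product
from math import comb


def hamming_ball(x0, alphabet, d):
    """All sequences within Hamming distance d of x0."""
    ball = [x0]
    for k in range(1, d + 1):
        for positions in combinations(range(len(x0)), k):
            # At each chosen position, use any allele except the focal one.
            choices = [[a for a in alphabet if a != x0[p]] for p in positions]
            for alleles in product(*choices):
                seq = list(x0)
                for p, a in zip(positions, alleles):
                    seq[p] = a
                ball.append("".join(seq))
    return ball


ball = hamming_ball("AAAA", "ACGT", d=2)
# Closed form: sum over k of C(L, k) * (n_alleles - 1) ** k.
expected = sum(comb(4, k) * 3 ** k for k in range(3))
print(len(ball), expected)  # 67 67
```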

Attributes:
n_genotypes: int

The total number of states (genotypes) in the Hamming ball space.

genotypes: array-like of shape (n_genotypes,)

The genotype labels in the Hamming ball space.

adjacency_matrix: scipy.sparse.csr_matrix of shape (n_genotypes, n_genotypes)

A sparse matrix representing the adjacency relationships between genotypes. The (i, j) entry contains a 1 if genotypes i and j differ by a single mutation, and 0 otherwise.

y: array-like of shape (n_genotypes,), optional

Quantitative phenotype or fitness values associated with each genotype.

is_regular: bool

Whether the space is regular, that is, whether every state has the same number of neighbors.

Methods

get_edges_df()

Generate a DataFrame representing the edges of the adjacency graph.

get_neighbor_pairs()

Retrieve pairs of indices representing connected states in the DiscreteSpace.

get_neighbors(states[, max_distance])

Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.

get_state_idxs(states)

Return the indices of the provided state labels.

calc_adjacency_matrix

calc_graph

calc_max_min_path

calc_n_paths

format_list_ends

format_values

get_genotypes

get_y

init_space

set_alphabet_type

set_seq_length

set_y

write_csv

write_edges

Random walks

class gpmap.randwalk.WMWalk(space, log=None, Ns=None)

Class for Weak Mutation Random Walk on a SequenceSpace. This is a time-reversible continuous-time Markov Chain where the transition rates are determined by the differences in fitness between two states, scaled by the effective population size Ns.

The transition rate matrix Q(i, j) is defined as:

\[\begin{split}Q(i, j) = \begin{cases} M(i, j)\frac{S(i, j)}{1 - e^{-S(i, j)}} & \text{if $i$ and $j$ are neighbors}\\ -\sum_{k\neq i} Q(i, k) & \text{if } i=j \\ 0 & \text{otherwise}, \end{cases}\end{split}\]

where:

  • M(i, j) is the neutral mutation rate between states i and j.

  • S(i, j) is the scaled fitness difference between states i and j, typically defined as S(i, j) = Ns(f_j - f_i).
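
A small numerical sketch of this rate matrix follows. It uses the denominator 1 - e^{-S}, the form under which every off-diagonal rate is positive and detailed balance holds, takes M(i, j) = 1 for all neighbors, and works on a three-state path. The helper names are hypothetical and the dense lists stand in for the sparse matrices the class stores.

```python
import math


def wm_rate(s, m=1.0):
    """Rate m * s / (1 - exp(-s)), with the neutral limit m at s = 0."""
    if abs(s) < 1e-12:
        return m
    return m * s / (1.0 - math.exp(-s))


def rate_matrix(f, neighbors, Ns):
    """Dense rate matrix for fitness values f and per-state neighbor lists."""
    n = len(f)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            Q[i][j] = wm_rate(Ns * (f[j] - f[i]))
        Q[i][i] = -sum(Q[i])  # rows of a rate matrix sum to zero
    return Q


f = [0.0, 1.0, 3.0]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
Ns = 2.0
Q = rate_matrix(f, neighbors, Ns)

# With uniform M, stationary frequencies are proportional to exp(Ns * f).
weights = [math.exp(Ns * value) for value in f]
total = sum(weights)
pi = [w / total for w in weights]

# Time reversibility: pi_i * Q(i, j) equals pi_j * Q(j, i) for every edge.
print(all(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) < 1e-12
          for i in neighbors for j in neighbors[i]))  # True
```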

Methods

calc_neutral_mixing_rates(...)

Computes the neutral mixing rates for a SequenceSpace.

calc_rate_matrix([Ns, neutral_stat_freqs, ...])

Computes and stores the rate matrix for the random walk in the discrete space.

calc_visualization([Ns, mean_function, ...])

Calculates the state coordinates to use for visualization of the provided discrete space under a given time-reversible random walk.

write_tables(prefix[, write_edges, ...])

Write the output of the visualization to files with a common prefix.

calc_neutral_rate_matrix

calc_stationary_frequencies

calc_neutral_mixing_rates(site_exchange_rates, neutral_site_freqs)

Computes the neutral mixing rates for a SequenceSpace. If no GTR mutation model is specified, the neutral mixing rate is determined by the site with the fewest alleles. Otherwise, assuming site-independent mutations, the slowest neutral mixing rate is governed by the site with the smallest second eigenvalue in its site-specific rate matrix.

Parameters:
neutral_site_Qs: list of array-like of shape (n_alleles, n_alleles)

A list of site-specific rate matrices used to calculate the limiting mixing rate under neutrality. If not provided, uniform mutation rates are assumed.

neutral_site_freqs: list of array-like of shape (n_alleles,)

A list of vectors representing the stationary frequencies under neutrality for each site. These are used to compute the eigenvalues of the time-reversible site-specific neutral chain. By default, uniform frequencies are assumed across sites and alleles.

site_weights: array-like of shape (seq_length,)

A vector of relative weights for each site. These weights scale the individually normalized rate matrices to ensure the specified leaving rate. By default, all weights are equal.

Returns:
neutral_mixing_rate: float

The neutral mixing rate, defined as the smallest second-largest eigenvalue across all sites.

TODO: Re-implement functionality.

calc_rate_matrix(Ns=None, neutral_stat_freqs=None, neutral_exchange_rates=None)

Computes and stores the rate matrix for the random walk in the discrete space.

Parameters:
Ns: float, optional

Scaled effective population size for the evolutionary model. If not provided, the value of self.Ns will be used.

neutral_stat_freqs: array-like of shape (n_states,), optional

Genotype stationary frequencies at neutrality to define the time-reversible neutral dynamics. If not provided, the existing neutral_stat_freqs attribute will be used if available.

neutral_exchange_rates: scipy.sparse.csr.csr_matrix of shape (n_states, n_states), optional

Sparse matrix containing the neutral exchange rates for the entire sequence space. If not provided, uniform mutational dynamics are assumed.

Notes

  • The resulting rate matrix is stored in the rate_matrix attribute.

  • The method also calculates the symmetrized rate matrix as an intermediate step.

calc_visualization(Ns=None, mean_function=None, mean_function_perc=None, n_components=10, neutral_exchange_rates=None, neutral_stat_freqs=None, tol=1e-12)

Calculates the state coordinates to use for visualization of the provided discrete space under a given time-reversible random walk. The coordinates consist of the right eigenvectors of the associated rate matrix Q, re-scaled by the corresponding quantity so that the embedding is in units of square root of time.

Parameters:
Ns: float, optional

Scaled effective population size to use in the underlying evolutionary model. If not provided, it will be derived from mean_function or mean_function_perc.

mean_function: float, optional

Mean function at stationarity to derive the associated Ns. Either this or mean_function_perc must be provided if Ns is not specified.

mean_function_perc: float, optional

Percentile that the mean function at stationarity takes within the distribution of function values along sequence space. For example, if mean_function_perc=98, then the mean function at stationarity is set to be at the 98th percentile across all the function values. Either this or mean_function must be provided if Ns is not specified.

n_components: int, default=10

Number of eigenvectors or Diffusion axes to calculate.

neutral_stat_freqs: array-like of shape (n_states,), optional

Genotype stationary frequencies at neutrality to define the time-reversible neutral dynamics. If not provided, uniform stationary frequencies are assumed.

neutral_exchange_rates: scipy.sparse.csr.csr_matrix of shape (n_states, n_states), optional

Sparse matrix containing the neutral exchange rates for the whole sequence space. If not provided, uniform mutational dynamics are assumed.

tol: float, default=1e-12

Tolerance for the eigendecomposition solver. Lower values result in higher precision but may increase computation time.

Notes

  • The visualization coordinates are stored in self.nodes_df, which includes the scaled eigenvectors, function values, and stationary frequencies for each state.

  • Relaxation times and decay rates are stored in self.decay_rates_df.
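
Deriving Ns from a target mean function can be pictured as a one-dimensional root search. The sketch assumes uniform neutral stationary frequencies, so that the stationary frequency of each state is proportional to exp(Ns * f), and uses bisection because the stationary mean increases with Ns. The function names, the search bounds, and the choice of solver are hypothetical and may differ from what calc_visualization does internally.

```python
import math


def stationary_mean(f, Ns):
    """Mean of f under frequencies proportional to exp(Ns * f)."""
    shift = max(f)  # subtract the maximum for numerical stability
    weights = [math.exp(Ns * (value - shift)) for value in f]
    return sum(w * value for w, value in zip(weights, f)) / sum(weights)


def solve_Ns(f, mean_function, upper=1e3, tol=1e-10):
    """Bisection for the Ns whose stationary mean equals mean_function."""
    lower = 0.0
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if stationary_mean(f, mid) < mean_function:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)


f = [0.0, 0.5, 1.0, 1.5, 2.0]
Ns = solve_Ns(f, mean_function=1.8)
print(round(stationary_mean(f, Ns), 6))  # 1.8
```

The target must lie between the plain average of f (reached at Ns = 0) and its maximum (approached as Ns grows).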

write_tables(prefix, write_edges=False, nodes_format='parquet', edges_format='npz')

Write the output of the visualization to files with a common prefix. The output can include up to three different tables, depending on the options provided:

  • Nodes coordinates: Contains the coordinates for each state, along with the associated function values and stationary frequencies. Stored in either CSV format with the suffix “nodes.csv” or Parquet format with the suffix “nodes.pq”.

  • Decay rates: Contains the decay rates and relaxation times associated with each component or diffusion axis. Stored in CSV format with the suffix “decay_rates.csv”.

  • Edges: Contains the adjacency relationships between states. This is not stored by default unless write_edges=True. Since the edges remain unchanged for any visualization on the same SequenceSpace, they only need to be stored once. Stored in either CSV format or the more efficient NPZ format for sparse matrices.

Parameters:
prefix: str

Prefix for the filenames used to store the tables.

write_edges: bool, optional, default=False

Whether to write the adjacency relationships between states (edges) to a file.

nodes_format: {'parquet', 'csv'}, optional, default='parquet'

Format for storing the nodes information. Parquet is more efficient, but CSV can be used for smaller datasets or when plain text storage is preferred.

edges_format: {'npz', 'csv'}, optional, default='npz'

Format for storing the edges information. NPZ is more efficient, but CSV can be used for smaller datasets or when plain text storage is preferred.

Landscape Inference

class gpmap.inference.MinimumEpistasisInterpolator(n_alleles=None, seq_length=None, genotypes=None, alphabet_type='custom', P=2, a=None, cg_rtol=1e-16)

A class for performing Minimum Epistasis Interpolation (MEI) to infer complete genotype-phenotype maps from incomplete and noisy data. This model applies a prior that penalizes local epistatic coefficients of order P and infers the posterior distribution based on experimental data for a subset of sequences.

Parameters:
n_alleles: int, optional

The number of alleles per site. If not provided, it will be inferred from the data.

seq_length: int, optional

The length of the genotype sequences. If not provided, it will be inferred from the data.

genotypes: array-like, optional

A list or array of genotypes to be used in the interpolation.

alphabet_type: str, optional

The type of alphabet used for genotypes. Default is “custom”.

P: int, optional

The order of epistasis to consider. Default is 2.

a: float, optional

The regularization parameter. If not provided, it will be inferred during fitting.

cg_rtol: float, optional

The relative tolerance for the conjugate gradient solver. Default is 1e-16.
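
For P = 2 and two alleles per site, a local epistatic coefficient is the double difference f(00) - f(01) - f(10) + f(11) taken over a pair of sites with the remaining sites held fixed. The sketch below averages the squares of these coefficients over all pairs and backgrounds. It is meant only to illustrate the quantity being penalized; the function name and the dictionary-based landscape are hypothetical, and the library's handling of more alleles or higher P is not shown.

```python
from itertools import combinations, product


def mean_squared_epistasis(f, seq_length):
    """Average squared pairwise (P=2) local epistatic coefficient, biallelic case."""
    total, count = 0.0, 0
    for i, j in combinations(range(seq_length), 2):
        others = [k for k in range(seq_length) if k not in (i, j)]
        for background in product("01", repeat=len(others)):

            def genotype(a, b):
                # Alleles a and b at sites i and j, background elsewhere.
                seq = [None] * seq_length
                seq[i], seq[j] = a, b
                for k, allele in zip(others, background):
                    seq[k] = allele
                return "".join(seq)

            eps = (f[genotype("0", "0")] - f[genotype("0", "1")]
                   - f[genotype("1", "0")] + f[genotype("1", "1")])
            total += eps ** 2
            count += 1
    return total / count


genotypes = ["".join(p) for p in product("01", repeat=3)]
additive = {g: float(g.count("1")) for g in genotypes}
perturbed = {**additive, "111": 4.0}  # one genotype departs from additivity

print(mean_squared_epistasis(additive, 3))   # 0.0
print(mean_squared_epistasis(perturbed, 3))  # 0.5
```

A purely additive landscape has no local pairwise epistasis, which is why a prior penalizing these coefficients favors additive-looking interpolations where data are missing.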

Methods

fit(X, y[, y_var])

Fits the Minimum Epistasis Interpolation (MEI) model hyperparameter to the provided data.

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.

predict([X_pred, calc_variance])

Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.

sample_prior()

Generate a sample from the prior distribution.

simulate([X, y_var, p_missing, seed])

Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.

smooth

fit(X, y, y_var=None)

Fits the Minimum Epistasis Interpolation (MEI) model hyperparameter to the provided data.

This method infers the optimal regularization parameter a by computing the Minimum Epistasis Interpolation solution. It determines the value of a such that the expected average squared Pth epistatic coefficients match those of the MEI solution.

Parameters:
X: array-like of shape (n_obs,)

Array containing the genotypes for which observations are provided in y.

y: array-like of shape (n_obs,)

Array containing the observed phenotypes corresponding to the genotypes in X.

y_var: array-like of shape (n_obs,), optional

Array containing the empirical or experimental variance for the measurements in y. If not provided, it is assumed to be uniform or unknown.

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.

This method calculates the posterior mean, standard deviation, 95% credible intervals, and the posterior probability for each linear combination of genotypes defined in the contrast matrix.

Parameters:
contrast_matrix: pd.DataFrame of shape (n_genotypes, n_contrasts)

A DataFrame where each column represents a linear combination of genotypes (contrast) for which the posterior distribution is to be computed. The index should correspond to the genotypes.

Returns:
contrasts: pd.DataFrame of shape (n_contrasts, 5)

A DataFrame summarizing the posterior distribution for each contrast. The columns include:

  • estimate: Posterior mean for each contrast.

  • std: Posterior standard deviation for each contrast.

  • ci_95_lower: Lower bound of the 95% credible interval.

  • ci_95_upper: Upper bound of the 95% credible interval.

  • p(|x|>0): Posterior probability that the absolute value of the contrast is greater than 0.
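
For a Gaussian posterior with mean vector m and covariance matrix C, a contrast with coefficients c has posterior mean c·m and variance c·C·c. The sketch below computes these for a made-up two-genotype posterior. The function `contrast_summary` is hypothetical, the interval uses mean ± 2 standard deviations as an assumption about the multiplier, and the p(|x|>0) column is not reproduced.

```python
import math


def contrast_summary(contrast, mean, cov):
    """Posterior mean and standard deviation of sum_i contrast[i] * f[i]."""
    estimate = sum(c * m for c, m in zip(contrast, mean))
    variance = sum(
        ci * cov[i][j] * cj
        for i, ci in enumerate(contrast)
        for j, cj in enumerate(contrast)
    )
    std = math.sqrt(variance)
    return {
        "estimate": estimate,
        "std": std,
        "ci_95_lower": estimate - 2.0 * std,
        "ci_95_upper": estimate + 2.0 * std,
    }


# Hypothetical posterior over two genotypes, e.g. "AA" and "AC".
mean = [1.0, 1.5]
cov = [[0.04, 0.01], [0.01, 0.09]]
# Contrast for the effect of the second genotype relative to the first.
summary = contrast_summary([-1.0, 1.0], mean, cov)
print(summary["estimate"])       # 0.5
print(round(summary["std"], 4))  # 0.3317
```

In the library the coefficients would be one column of contrast_matrix, indexed by genotype.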

predict(X_pred=None, calc_variance=False)

Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.

Parameters:
X_pred: array-like of shape (n_genotypes,), optional

Array containing the genotypes for which the phenotype predictions are desired. If X_pred is None, predictions are computed for the entire sequence space.

calc_variance: bool, optional, default=False

If True, the posterior variances and standard deviations for each genotype are also computed and included in the output.

Returns:
pred: pd.DataFrame of shape (n_genotypes, n_columns)

A DataFrame containing the predicted phenotypes for each input genotype in the column f. If calc_variance=True, additional columns are included:

  • f_var: Posterior variance for each genotype.

  • f_std: Posterior standard deviation for each genotype.

  • ci_95_lower: Lower bound of the 95% credible interval.

  • ci_95_upper: Upper bound of the 95% credible interval.

The genotype labels are used as the row index.

Notes

  • The MAP estimate is computed using the posterior mean.

  • If calc_variance is enabled, the credible intervals are calculated as mean ± 2 * standard deviation.

Examples

Predict phenotypes for the entire genotype space:

>>> pred = model.predict()

Predict phenotypes for specific genotypes with variance:

>>> pred = model.predict(X_pred=["AAA", "AAC"], calc_variance=True)

sample_prior()

Generate a sample from the prior distribution.

This method samples from the prior distribution by drawing random values from a standard normal distribution and transforming them using the square root of the covariance matrix. The resulting sample represents a realization of the prior distribution over genotypes.

Returns:
f: numpy.ndarray

A 1D array of shape (n_genotypes,) representing a sample from the prior distribution. Each element corresponds to a genotype’s value drawn from the prior.
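
The transformation described above can be sketched with a Cholesky factor, which is one valid matrix square root: if K = L Lᵀ and z is standard normal, then L z has covariance K. The covariance matrix below is made up, both helpers are hypothetical, and the library may use a different factorization that exploits the structure of sequence space.

```python
import math
import random


def cholesky(K):
    """Lower-triangular L with L @ L.T == K for a positive definite K."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - partial)
            else:
                L[i][j] = (K[i][j] - partial) / L[j][j]
    return L


def sample_prior(K, rng):
    """Draw f = L z with z standard normal, so that Cov(f) = K."""
    L = cholesky(K)
    z = [rng.gauss(0.0, 1.0) for _ in K]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(K))]


# Hypothetical prior covariance over three genotypes.
K = [[1.0, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 1.0]]
f = sample_prior(K, random.Random(0))
print(len(f))  # 3
```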

simulate(X=None, y_var=0.0, p_missing=0, seed=None)

Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.

Parameters:
X: array-like, optional

Input sequences for which to generate the measurements y. If None, genotypes are randomly selected based on the missing probability p_missing. Default is None.

y_var: float or array-like, optional

Standard deviation of the experimental noise to be added to the variance components. If a float is provided, it is broadcast to match the shape of X. If an array is provided, its shape must match either the number of genotypes or the shape of X. Default is 0.0.

p_missing: float, optional

Probability (between 0 and 1) of randomly omitting genotypes in the simulated output data. Default is 0.

seed: float, optional

Random seed for reproducibility. Default is None.

Returns:
f: array-like

The true simulated measurements without experimental noise.

X: array-like

The input sequences used for the simulation.

y: array-like

The simulated measurements with experimental noise added.

y_var: array-like

The standard deviation of the experimental noise for each input sequence.

Raises:
ValueError

If the shape of y_var does not match the expected dimensions.

Examples

Simulate data with default parameters:

>>> f, X, y, y_var = gp.simulate()

Simulate data with custom noise and missing probability:

>>> f, X, y, y_var = gp.simulate(y_var=0.1, p_missing=0.2, seed=42)
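
The noise and omission steps can be sketched independently of the prior. The function below is a hypothetical stand-in that takes an already drawn function f_true, uses its own parameter name noise_sd for the noise level, and returns only the kept genotypes and their noisy measurements, so its signature and return value differ from the method documented above.

```python
import random


def simulate(f_true, noise_sd=0.0, p_missing=0.0, seed=None):
    """Add Gaussian noise to f_true and drop genotypes at random."""
    rng = random.Random(seed)
    # Keep each genotype independently with probability 1 - p_missing.
    kept = [g for g in f_true if rng.random() >= p_missing]
    y = {g: f_true[g] + rng.gauss(0.0, noise_sd) for g in kept}
    return kept, y


f_true = {"AA": 0.0, "AC": 1.0, "CA": 1.0, "CC": 3.0}
X, y = simulate(f_true, noise_sd=0.1, p_missing=0.5, seed=42)
X2, y2 = simulate(f_true, noise_sd=0.1, p_missing=0.5, seed=42)
print(X == X2 and y == y2)  # True: same seed, same draw
```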

class gpmap.inference.VCregression(n_alleles=None, seq_length=None, genotypes=None, alphabet_type='custom', lambdas=None, beta=0, cross_validation=False, nfolds=5, cv_loss_function='frobenius_norm', num_beta=20, min_log_beta=-2, max_log_beta=7, cg_rtol=1e-16, progress=True)

Variance Component regression model for sequence-function relationships.

This model enables the inference and prediction of a scalar function in sequence spaces under a Gaussian Process prior. The prior is parameterized by the contribution of different orders of interaction to the observed genetic variability of a continuous phenotype.

Parameters:
n_alleles: int, optional

The number of alleles per site. If not provided, it will be inferred from the data.

seq_length: int, optional

The length of the genotype sequences. If not provided, it will be inferred from the data.

genotypes: array-like, optional

A list or array of genotypes to be used in the interpolation.

alphabet_type: str, optional

The type of alphabet used for genotypes. Default is “custom”.

lambdas: array-like, optional

Variance components for each order of interaction. If not provided, they will be inferred during fitting.

beta: float, optional

The regularization parameter for the kernel alignment. Default is 0.

cross_validation: bool, optional

Whether to perform cross-validation to select the best penalization constant for regularized variance component inference. Default is False.

nfolds: int, optional

The number of folds for cross-validation. Default is 5.

cv_loss_function: str, optional

The loss function to use during cross-validation. Options are “frobenius_norm”, “logL”, or “r2”. Default is “frobenius_norm”.

num_beta: int, optional

The number of beta values to evaluate during cross-validation. Default is 20.

min_log_beta: float, optional

The minimum log10(beta) value for cross-validation. Default is -2.

max_log_beta: float, optional

The maximum log10(beta) value for cross-validation. Default is 7.

cg_rtol: float, optional

The relative tolerance for the conjugate gradient solver. Default is 1e-16.

progress: bool, optional

Whether to display progress bars during fitting. Default is True.
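
The three cross-validation parameters num_beta, min_log_beta and max_log_beta suggest a grid of candidate penalties spaced evenly in log10. The sketch below builds such a grid with the default values; the even spacing and the helper name `beta_grid` are assumptions, since the reference does not state how the candidates are placed between the bounds.

```python
def beta_grid(min_log_beta=-2, max_log_beta=7, num_beta=20):
    """num_beta values evenly spaced in log10 between the two bounds."""
    step = (max_log_beta - min_log_beta) / (num_beta - 1)
    return [10 ** (min_log_beta + k * step) for k in range(num_beta)]


betas = beta_grid()
print(len(betas))  # 20
```

With the defaults, the candidates range from about 0.01 to about 10,000,000.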

Methods

fit(X, y[, y_var])

Infers the Variance Components from the provided data.

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.

predict([X_pred, calc_variance])

Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.

sample_prior()

Generate a sample from the prior distribution.

simulate([X, y_var, p_missing, seed])

Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.

lambdas_to_variance

fit(X, y, y_var=None)

Infers the Variance Components from the provided data.

This method infers the variance components, which represent the relative contribution of different orders of interaction to the variability in the sequence-function relationships. Variance components are determined through kernel alignment with the empirical distance-covariance function.

After fitting, the optimal variance components (lambdas) are stored in the VCregression.lambdas attribute for use in predictions.

Parameters:
X: array-like of shape (n_obs,)

Array containing the genotypes for which observations are provided in y.

y: array-like of shape (n_obs,)

Array containing the observed phenotypes corresponding to the genotypes in X.

y_var: array-like of shape (n_obs,), optional

Array containing the empirical or experimental variance for the measurements in y. If not provided, it is assumed to be uniform or unknown.
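
An empirical distance-covariance function can be pictured as the average product of centered phenotypes over all pairs of observed sequences at each Hamming distance. The sketch below computes that quantity for a toy data set. The estimator, including the use of the sample mean for centering and the inclusion of self-pairs at distance 0, is an assumption and may differ from the one used during fitting.

```python
from collections import defaultdict
from itertools import combinations_with_replacement


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_covariance(X, y):
    """Mean product of centered phenotypes for pairs at each Hamming distance."""
    mean = sum(y) / len(y)
    centered = [value - mean for value in y]
    sums, counts = defaultdict(float), defaultdict(int)
    for i, j in combinations_with_replacement(range(len(X)), 2):
        d = hamming(X[i], X[j])
        sums[d] += centered[i] * centered[j]
        counts[d] += 1
    return {d: sums[d] / counts[d] for d in sorted(sums)}


X = ["AA", "AC", "CA", "CC"]
y = [0.0, 1.0, 1.0, 2.0]
print(distance_covariance(X, y))  # {0: 0.5, 1: 0.0, 2: -0.5}
```

In this additive toy example the covariance decreases linearly with distance, the pattern expected when only first-order effects contribute.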

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.

This method calculates the posterior mean, standard deviation, 95% credible intervals, and the posterior probability for each linear combination of genotypes defined in the contrast matrix.

Parameters:
contrast_matrix: pd.DataFrame of shape (n_genotypes, n_contrasts)

A DataFrame where each column represents a linear combination of genotypes (contrast) for which the posterior distribution is to be computed. The index should correspond to the genotypes.

Returns:
contrastspd.DataFrame of shape (n_contrasts, 5)

A DataFrame summarizing the posterior distribution for each contrast. The columns include:

  • estimate: Posterior mean for each contrast.

  • std: Posterior standard deviation for each contrast.

  • ci_95_lower: Lower bound of the 95% credible interval.

  • ci_95_upper: Upper bound of the 95% credible interval.

  • p(|x|>0): Posterior probability that the absolute value of the contrast is greater than 0.
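
To make the idea of a contrast concrete, the following minimal sketch computes the posterior mean and standard deviation of one linear combination from a posterior mean vector and covariance matrix, using only the Python standard library. The helper name summarize_contrast, the toy numbers, and the use of a normal quantile for the interval are assumptions made for illustration; this is not the gpmap implementation.

```python
from math import sqrt
from statistics import NormalDist


def summarize_contrast(weights, post_mean, post_cov, level=0.95):
    """Summarize the Gaussian posterior of sum_i weights[i] * f[i].

    Hypothetical helper: mean = c^T m and variance = c^T S c.
    """
    n = len(weights)
    estimate = sum(weights[i] * post_mean[i] for i in range(n))
    variance = sum(
        weights[i] * post_cov[i][j] * weights[j]
        for i in range(n)
        for j in range(n)
    )
    std = sqrt(variance)
    # Two-sided normal quantile (about 1.96 for a 95% interval).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return {
        "estimate": estimate,
        "std": std,
        "ci_95_lower": estimate - z * std,
        "ci_95_upper": estimate + z * std,
    }


# Toy posterior over three genotypes (made-up numbers).
post_mean = [1.0, 0.2, 0.5]
post_cov = [
    [0.04, 0.01, 0.00],
    [0.01, 0.09, 0.02],
    [0.00, 0.02, 0.16],
]
# Contrast: difference between the first and the second genotype.
summary = summarize_contrast([1.0, -1.0, 0.0], post_mean, post_cov)
```

Each column of contrast_matrix plays the role of the weights list here, with one weight per genotype.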

predict(X_pred=None, calc_variance=False)

Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.

Parameters:
X_predarray-like of shape (n_genotypes,), optional

Array containing the genotypes for which the phenotype predictions are desired. If X_pred is None, predictions are computed for the entire sequence space.

calc_variancebool, optional, default=False

If True, the posterior variances and standard deviations for each genotype are also computed and included in the output.

Returns:
predpd.DataFrame of shape (n_genotypes, n_columns)

A DataFrame containing the predicted phenotypes for each input genotype in the column f. The genotype labels are used as the row index. If calc_variance=True, additional columns are included:

  • f_var: Posterior variance for each genotype.

  • f_std: Posterior standard deviation for each genotype.

  • ci_95_lower: Lower bound of the 95% credible interval.

  • ci_95_upper: Upper bound of the 95% credible interval.

Notes

  • The MAP estimate is computed using the posterior mean.

  • If calc_variance is enabled, the credible intervals are calculated as mean ± 2 * standard deviation.

Examples

Predict phenotypes for the entire genotype space:

>>> pred = model.predict()

Predict phenotypes for specific genotypes with variance:

>>> pred = model.predict(X_pred=["AAA", "AAC"], calc_variance=True)
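
The interval rule stated in the Notes (mean ± 2 * standard deviation) can be sketched in plain Python as follows. The helper add_intervals and the list-of-dicts output are hypothetical stand-ins for the DataFrame described above, and the input numbers are made up.

```python
from math import sqrt


def add_intervals(f, f_var):
    """Return rows with f, f_var, f_std and mean +/- 2 * std bounds."""
    rows = []
    for mean, var in zip(f, f_var):
        std = sqrt(var)
        rows.append({
            "f": mean,
            "f_var": var,
            "f_std": std,
            "ci_95_lower": mean - 2 * std,
            "ci_95_upper": mean + 2 * std,
        })
    return rows


# Posterior means and variances for two genotypes (made-up numbers).
rows = add_intervals([1.0, -0.5], [0.25, 0.04])
```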

sample_prior()

Generate a sample from the prior distribution.

This method samples from the prior distribution by drawing random values from a standard normal distribution and transforming them using the square root of the covariance matrix. The resulting sample represents a realization of the prior distribution over genotypes.

Returns:
f: numpy.ndarray

A 1D array of shape (n_genotypes,) representing a sample from the prior distribution. Each element corresponds to a genotype’s value drawn from the prior.
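
The sampling step described above can be sketched without any dependencies. This version uses a Cholesky factor as the matrix square root, which is one valid choice among several; the library may use a different factorization. The function names and the toy covariance matrix are assumptions for illustration.

```python
import random
from math import sqrt


def cholesky(cov):
    """Lower-triangular L with L @ L.T == cov (cov must be positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def sample_gaussian_prior(cov, rng):
    """Draw f = L z with z ~ N(0, I), so that f ~ N(0, cov)."""
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in cov]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(cov))]


# Toy prior covariance over three genotypes (made-up numbers).
cov = [
    [1.0, 0.5, 0.25],
    [0.5, 1.0, 0.5],
    [0.25, 0.5, 1.0],
]
f = sample_gaussian_prior(cov, random.Random(0))
```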

simulate(X=None, y_var=0.0, p_missing=0, seed=None)

Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.

Parameters:
Xarray-like, optional

Input sequences for which to generate the measurements y. If None, genotypes are randomly selected based on the missing probability p_missing. Default is None.

y_varfloat or array-like, optional

Standard deviation of the experimental noise to be added to the variance components. If a float is provided, it is broadcast to match the shape of X. If an array is provided, its shape must match either the number of genotypes or the shape of X. Default is 0.0.

p_missingfloat, optional

Probability (between 0 and 1) of randomly omitting genotypes in the simulated output data. Default is 0.

seedfloat, optional

Random seed for reproducibility. Default is None.

Returns:
farray-like

The true simulated measurements without experimental noise.

Xarray-like

The input sequences used for the simulation.

yarray-like

The simulated measurements with experimental noise added.

y_vararray-like

The standard deviation of the experimental noise for each input sequence.

Raises
ValueError

If the shape of y_var does not match the expected dimensions.

Examples

Simulate data with default parameters:

>>> f, X, y, y_var = gp.simulate()

Simulate data with custom noise and missing probability:

>>> f, X, y, y_var = gp.simulate(y_var=0.1, p_missing=0.2, seed=42)
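
The two effects that simulate layers on top of the prior sample, Gaussian measurement noise and random omission of genotypes, can be illustrated with a small standard-library sketch. The helper simulate_measurements and its noise_sd argument are hypothetical names; the sketch takes the true values as given rather than drawing them from a prior.

```python
import random


def simulate_measurements(f, noise_sd=0.0, p_missing=0.0, seed=None):
    """Add Gaussian noise to true values and randomly drop observations.

    Hypothetical helper; returns (kept_indices, noisy_values).
    """
    rng = random.Random(seed)
    kept, y = [], []
    for i, value in enumerate(f):
        if rng.random() < p_missing:
            continue  # this genotype is omitted from the output
        kept.append(i)
        y.append(value + rng.gauss(0.0, noise_sd))
    return kept, y


f_true = [0.0, 1.0, 2.0, 3.0]
idx, y = simulate_measurements(f_true, noise_sd=0.1, p_missing=0.25, seed=42)
```

Fixing the seed makes both the omitted genotypes and the noise reproducible.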

class gpmap.inference.SeqDEFT(n_alleles=None, seq_length=None, alphabet_type='custom', genotypes=None, P=2, a=None, num_reg=20, nfolds=5, lambdas_P_inv=None, a_resolution=0.1, max_a_max=1000000000000.0, fac_max=0.1, fac_min=1e-06, optimization_opts={}, maxiter=10000, gtol=1e-06, ftol=1e-08)

Model for inference of a genotype-phenotype map from observations of sequences.

Sequence Density Estimation using Field Theory (SeqDEFT) model for inferring a complete sequence probability distribution under a Gaussian Process prior. The prior is parameterized by the variance of local epistatic coefficients of order P.

Parameters:
Pint

The order of local interaction coefficients penalized under the prior. For example, P=2 penalizes local pairwise interactions across all possible faces of the Hamming graph, while P=3 penalizes local 3-way interactions across all possible cubes.

afloat, optional, default=None

A parameter related to the inverse variance of the P-order epistatic coefficients being penalized. Larger values induce stronger penalization, approximating the Maximum-Entropy model of order P-1. If a=None, the optimal value of a is determined through cross-validation.

num_regint, optional, default=20

The number of a values to evaluate during the cross-validation procedure.

nfoldsint, optional, default=5

The number of folds to use in the cross-validation procedure.

lambdas_P_invarray-like, optional, default=None

The inverse of the variance components for the first P orders of interaction. If provided, these values are used to regularize the kernel basis.

a_resolutionfloat, optional, default=0.1

The resolution for determining the range of a values during cross-validation.

max_a_maxfloat, optional, default=1e12

The maximum value of a to consider during cross-validation.

fac_maxfloat, optional, default=0.1

A factor to determine the maximum value of a relative to the number of P-order faces in the Hamming graph.

fac_minfloat, optional, default=1e-6

A factor to determine the minimum value of a relative to the number of P-order faces in the Hamming graph.

optimization_optsdict, optional, default={}

A dictionary of options for the optimization procedure used to calculate the maximum entropy model.

maxiterint, optional, default=10000

The maximum number of iterations for the optimization procedure.

gtolfloat, optional, default=1e-6

The gradient tolerance for the optimization procedure.

ftolfloat, optional, default=1e-8

The function tolerance for the optimization procedure.

Methods

fit(X[, y, baseline_phi, baseline_X, ...])

Infers the SeqDEFT model hyperparameter a from the provided data.

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.

predict([X_pred, calc_variance])

Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.

sample_prior()

Generate a sample from the prior distribution.

simulate(N[, seed])

Simulates data under the specified a penalization for local P-epistatic coefficients.

fit(X, y=None, baseline_phi=None, baseline_X=None, positions=None, phylo_correction=False, adjust_freqs=False, allele_freqs=None)

Infers the SeqDEFT model hyperparameter a from the provided data.

This method determines the optimal regularization parameter a by grid search, selecting the value that maximizes the log-likelihood of held-out sequences under cross-validation.

Parameters:
Xarray-like of shape (n_obs,)

Array containing the observed sequences.

yarray-like of shape (n_obs,), optional

Array containing the weights for each observed sequence. By default, each sequence is assigned a weight of 1. These weights can be computed using phylogenetic correction.

baseline_Xarray-like of shape (n_genotypes,), optional

Array containing the sequences associated with baseline_phi.

baseline_phiarray-like of shape (n_genotypes,), optional

Array containing the baseline values (baseline_phi) to include in the model.

positionsarray-like of shape (n_pos,), optional

If provided, subsequences at these positions in the input sequences will be used as input.

phylo_correctionbool, optional, default=False

Whether to apply phylogenetic correction using the full-length sequences.

adjust_freqsbool, optional, default=False

Whether to adjust densities by the expected allele frequencies in the full-length sequences.

allele_freqsdict or codon_table, optional

Dictionary containing the expected allele frequencies for each allele in the set of possible sequences, or a codon table to generate expected amino acid frequencies. If None, these frequencies will be calculated from the full-length observed sequences.
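
The selection of a by held-out log-likelihood can be sketched generically. In the example below the density model is a stand-in (additive smoothing of observed counts) and the grid is log-spaced; both are assumptions chosen to keep the sketch short, not a description of how SeqDEFT builds its grid or estimates densities. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def log_grid(a_min, a_max, num):
    """Return `num` values spaced evenly on a log scale between the bounds."""
    step = (math.log(a_max) - math.log(a_min)) / (num - 1)
    return [math.exp(math.log(a_min) + k * step) for k in range(num)]


def heldout_loglik(train, test, states, a):
    """Log-likelihood of `test` under counts from `train` smoothed by `a`.

    Stand-in model: additive smoothing, NOT the SeqDEFT density estimate.
    """
    counts = Counter(train)
    total = len(train) + a * len(states)
    return sum(math.log((counts[s] + a) / total) for s in test)


def choose_a(seqs, states, grid, nfolds=5, seed=0):
    """Pick the grid value with the highest total held-out log-likelihood."""
    shuffled = list(seqs)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[k::nfolds] for k in range(nfolds)]
    scores = []
    for a in grid:
        score = 0.0
        for k in range(nfolds):
            train = [s for j, fold in enumerate(folds) if j != k for s in fold]
            score += heldout_loglik(train, folds[k], states, a)
        scores.append(score)
    best = max(range(len(grid)), key=scores.__getitem__)
    return grid[best], scores


states = ["AA", "AC", "CA", "CC"]
data = ["AA"] * 12 + ["AC"] * 5 + ["CA"] * 2 + ["CC"] * 1
grid = log_grid(1e-3, 1e3, 7)
best_a, scores = choose_a(data, states, grid)
```

As with a in SeqDEFT, larger values of the smoothing strength pull the estimate toward a simpler model, so the held-out score trades off fit against regularization.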

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.

This method calculates the posterior mean, standard deviation, 95% credible intervals, and the posterior probability for each linear combination of genotypes defined in the contrast matrix.

Parameters:
contrast_matrixpd.DataFrame of shape (n_genotypes, n_contrasts)

A DataFrame where each column represents a linear combination of genotypes (contrast) for which the posterior distribution is to be computed. The index should correspond to the genotypes.

Returns:
contrastspd.DataFrame of shape (n_contrasts, 5)

A DataFrame summarizing the posterior distribution for each contrast. The columns include:

  • estimate: Posterior mean for each contrast.

  • std: Posterior standard deviation for each contrast.

  • ci_95_lower: Lower bound of the 95% credible interval.

  • ci_95_upper: Upper bound of the 95% credible interval.

  • p(|x|>0): Posterior probability that the absolute value of the contrast is greater than 0.

predict(X_pred=None, calc_variance=False)

Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.

Parameters:
X_predarray-like of shape (n_genotypes,), optional

Array containing the genotypes for which the phenotype predictions are desired. If X_pred is None, predictions are computed for the entire sequence space.

calc_variancebool, optional, default=False

If True, the posterior variances and standard deviations for each genotype are also computed and included in the output.

Returns:
predpd.DataFrame of shape (n_genotypes, n_columns)

A DataFrame containing the predicted phenotypes for each input genotype in the column f. The genotype labels are used as the row index. If calc_variance=True, additional columns are included:

  • f_var: Posterior variance for each genotype.

  • f_std: Posterior standard deviation for each genotype.

  • ci_95_lower: Lower bound of the 95% credible interval.

  • ci_95_upper: Upper bound of the 95% credible interval.

If neither X_pred nor calc_variance is provided, the output DataFrame includes additional columns:

  • freq: Empirical frequencies of the genotypes.

  • Q_star: Estimated genotype probabilities.

Notes

  • The MAP estimate is computed using the posterior mean.

  • If calc_variance is enabled, the credible intervals are calculated as mean ± 2 * standard deviation.

  • When neither X_pred nor calc_variance is provided, the additional columns report the empirical frequencies and estimated probabilities of the genotypes.

Examples

Predict phenotypes for the entire genotype space:

>>> pred = model.predict()

Predict phenotypes for specific genotypes with variance:

>>> pred = model.predict(X_pred=["AAA", "AAC"], calc_variance=True)

sample_prior()

Generate a sample from the prior distribution.

This method samples from the prior distribution by drawing random values from a standard normal distribution and transforming them using the square root of the covariance matrix. The resulting sample represents a realization of the prior distribution over genotypes.

Returns:
f: numpy.ndarray

A 1D array of shape (n_genotypes,) representing a sample from the prior distribution. Each element corresponds to a genotype’s value drawn from the prior.

simulate(N, seed=None)

Simulates data under the specified a penalization for local P-epistatic coefficients.

Parameters:
Nint

Number of total sequences to sample.

seedint, optional (default=None)

Random seed to use for simulation.

Returns:
phiarray-like of shape (N,)

Vector containing the true phi values from which samples were generated.

Xarray-like of shape (N,)

Vector containing the sampled sequences from the probability distribution.
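
A minimal sketch of the sampling step follows. It assumes the usual field-theory convention that genotype probabilities are proportional to exp(-phi); the reference text above does not state this relationship, so treat it as an assumption. The helper sample_sequences and the toy phi values are hypothetical.

```python
import math
import random


def sample_sequences(genotypes, phi, N, seed=None):
    """Draw N sequences with probability proportional to exp(-phi)."""
    shift = min(phi)  # subtracting the minimum avoids overflow in exp
    weights = [math.exp(-(p - shift)) for p in phi]
    rng = random.Random(seed)
    return rng.choices(genotypes, weights=weights, k=N)


genotypes = ["AA", "AC", "CA", "CC"]
phi = [0.0, 1.0, 1.0, 3.0]  # made-up field values; lower phi = more probable
X = sample_sequences(genotypes, phi, N=100, seed=1)
```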

Sequence utils

gpmap.seq.guess_space_configuration(seqs, ensure_full_space=True, force_regular=False, force_regular_alleles=False)

Infer the sequence space configuration from a collection of sequences.

This function determines the sequence space configuration, allowing for different numbers of alleles per site while maintaining the order in which alleles appear in the sequences. It can also enforce constraints such as ensuring a full sequence space or a constant number of alleles per site.

Parameters:
seqsarray-like of shape (n_genotypes,)

A list or array containing the sequences from which the space configuration is to be inferred.

ensure_full_spacebool, optional, default=True

If True, ensures that the entire sequence space is represented by the provided sequences. This is useful for identifying missing genotypes before defining the space.

force_regularbool, optional, default=False

If True, ensures that all sites have the same number of alleles. New allele names will be added to sites with fewer alleles than the maximum across all sites.

force_regular_allelesbool, optional, default=False

If True, ensures that the same alleles are used across all sites, in addition to enforcing the same number of alleles per site.

Returns:
configdict

A dictionary with the inferred configuration of the sequence space. Keys include:

  • ‘length’: The length of the sequences.

  • ‘n_alleles’: A list containing the number of alleles per site.

  • ‘alphabet’: A list of lists, where each inner list contains the alleles for a specific site.

  • ‘alphabet_type’: The inferred type of alphabet (‘dna’, ‘rna’, ‘protein’, or ‘custom’).
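
The core of the inference, collecting the alleles seen at each site in order of first appearance, can be sketched in a few lines. The helper infer_space_configuration is hypothetical, and its is_full_space key is an illustrative addition that is not part of the dictionary documented above.

```python
def infer_space_configuration(seqs):
    """Collect the alleles seen at each site, in order of first appearance."""
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all sequences must have the same length")
    alphabet = [[] for _ in range(length)]
    for seq in seqs:
        for site, allele in enumerate(seq):
            if allele not in alphabet[site]:
                alphabet[site].append(allele)
    n_alleles = [len(site_alleles) for site_alleles in alphabet]
    n_expected = 1
    for n in n_alleles:
        n_expected *= n
    return {
        "length": length,
        "n_alleles": n_alleles,
        "alphabet": alphabet,
        # True when every combination of observed alleles is present.
        "is_full_space": len(set(seqs)) == n_expected,
    }


config = infer_space_configuration(["AG", "AT", "CG", "CT", "GG", "GT"])
```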

gpmap.seq.get_custom_codon_table(aa_mapping)

Constructs a Biopython CodonTable for translation using a custom genetic code.

Parameters:
aa_mappingpd.DataFrame

A pandas DataFrame with columns “Codon” and “Letter” representing the genetic code mapping. Stop codons should be denoted with “*”.

Returns:
codon_tableBio.Data.CodonTable.CodonTable

A Biopython CodonTable object that can be used for translating sequences with the specified custom genetic code.

gpmap.seq.get_one_hot_from_alleles(alphabet)

Generate a one-hot encoding CSR matrix for a complete combinatorial space.

This function uses a fast recursive method to construct the one-hot encoding matrix, avoiding redundant computations for common blocks in the full matrix.

Parameters:
alphabetlist of list

A list where each inner list contains the alleles for a specific site in the sequence space.

Returns:
scipy.sparse.csr_matrix

A CSR matrix of shape (n_genotypes, total_n_alleles), where n_genotypes is the total number of genotypes in the sequence space and total_n_alleles is the sum of alleles across all sites. The matrix contains the one-hot encoding of the full sequence space, with genotypes sorted lexicographically.
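
For intuition about the layout of the encoding, here is a brute-force dense version built with itertools.product. It is not the recursive sparse construction described above, only a small sketch of the same output layout; the helper name one_hot_full_space is an assumption.

```python
from itertools import product


def one_hot_full_space(alphabet):
    """Return (genotypes, rows) for every combination of per-site alleles."""
    # Column offset of each site's first allele in the encoded row.
    offsets = [0]
    for site_alleles in alphabet[:-1]:
        offsets.append(offsets[-1] + len(site_alleles))
    n_columns = sum(len(site_alleles) for site_alleles in alphabet)

    genotypes, rows = [], []
    # product() varies the last site fastest, so sorted per-site alleles
    # yield genotypes in lexicographic order.
    for combo in product(*(range(len(a)) for a in alphabet)):
        row = [0] * n_columns
        for site, allele_idx in enumerate(combo):
            row[offsets[site] + allele_idx] = 1
        genotypes.append("".join(alphabet[s][i] for s, i in enumerate(combo)))
        rows.append(row)
    return genotypes, rows


genotypes, rows = one_hot_full_space([["A", "C"], ["G", "T", "U"]])
```

Each row contains exactly one 1 per site, so the row sums equal the sequence length.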

gpmap.seq.get_alphabet(n_alleles=None, alphabet_type=None)

Generate an alphabet based on the specified number of alleles or alphabet type.

Parameters:
n_allelesint

The number of alleles per site. If alphabet_type is not specified, this determines the size of the custom alphabet.

alphabet_typestr, optional

The type of alphabet to use. Must be one of {None, ‘dna’, ‘rna’, ‘protein’, ‘custom’}. If None or ‘custom’, a custom alphabet is generated based on n_alleles.

Returns:
alphabetlist

A list containing the alleles in the desired alphabet. For custom alphabets, the alleles are represented as strings of numbers or characters.

gpmap.seq.generate_freq_reduced_code(seqs, n_alleles, counts=None, keep_allele_names=True, last_character='X')

Generate a mapping from each allele in the observed sequences to a reduced alphabet with at most n_alleles per site. The least frequent alleles are grouped into a single allele.

Parameters:
seqsarray-like of shape (n_genotypes,) or (n_obs,)

Observed sequences. If counts is None, each sequence is assumed to appear once. Otherwise, frequencies are calculated using the counts as the number of times a sequence appears in the data.

n_allelesint or array-like of shape (seq_length,)

Maximum number of alleles allowed per site. If an array is provided, each site will use the specified number of alleles. Otherwise, all sites will have the same maximum number of alleles.

countsNone or array-like of shape (n_genotypes,)

Number of times each sequence in seqs appears in the data. If not provided, each sequence is assumed to appear exactly once.

keep_allele_namesbool, optional

If True, allele names are preserved. Otherwise, they are replaced by new alleles taken from the alphabet. Default is True.

last_characterstr, optional

Character to use for pooled alleles when keep_allele_names is True. Default is “X”.

Returns:
codelist of dict of length seq_length

A list of dictionaries, where each dictionary maps the original alleles to the new reduced alphabet for each site.
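
One way to build such a per-site mapping is sketched below. It assumes that the n_alleles - 1 most frequent alleles keep their names and all remaining alleles are pooled into a single symbol; the exact tie-breaking and naming rules of the library may differ. The helper freq_reduced_code is hypothetical.

```python
from collections import Counter


def freq_reduced_code(seqs, n_alleles, counts=None, pooled="X"):
    """Map each site's alleles to at most n_alleles symbols.

    Hypothetical helper: the n_alleles - 1 most frequent alleles keep their
    names and all remaining alleles are pooled into `pooled`.
    """
    if counts is None:
        counts = [1] * len(seqs)
    code = []
    for site in range(len(seqs[0])):
        freqs = Counter()
        for seq, count in zip(seqs, counts):
            freqs[seq[site]] += count
        ranked = [allele for allele, _ in freqs.most_common()]
        if len(ranked) <= n_alleles:
            mapping = {allele: allele for allele in ranked}
        else:
            kept = ranked[: n_alleles - 1]
            mapping = {a: (a if a in kept else pooled) for a in ranked}
        code.append(mapping)
    return code


seqs = ["AA", "AC", "CA", "GA", "TA"]
code = freq_reduced_code(seqs, n_alleles=2, counts=[5, 3, 2, 1, 1])
# Apply the per-site mapping to recode every sequence.
recoded = ["".join(code[i][a] for i, a in enumerate(s)) for s in seqs]
```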

gpmap.seq.transcribe_seqs(seqs, code)

gpmap.seq.translate_seqs(seqs, codon_table='Standard')

gpmap.seq.msa_to_counts(X, y=None, positions=None, phylo_correction=False, max_dist=0.2)

Extracts unique sequences and their counts from a Multiple Sequence Alignment (MSA). Optionally, subsequences can be selected based on specific positions, and sequence identity re-weighting can be applied to account for sequence similarities across the full alignment.

Parameters:
Xarray-like of aligned sequences

Input sequences from which to extract unique sequences and counts.

yarray-like of weights, optional (default=None)

Pre-calculated weights associated with the input sequences. If not provided, weights are calculated based on sequence identity.

positionsarray-like of int, optional (default=None)

Subset of positions to extract subsequences from the MSA. If not provided, the full sequences are used.

phylo_correctionbool, optional (default=False)

If True, applies sequence identity re-weighting. Observations are weighted as 1 divided by the number of similar sequences in the MSA. Similar sequences are defined based on the max_dist parameter.

max_distfloat, optional (default=0.2)

Maximum sequence identity distance for considering sequences as similar during re-weighting. Only used if phylo_correction is True.

Returns:
Xnp.array of shape (n_unique_seqs,)

Unique subsequences at the specified positions in the MSA.

ynp.array of shape (n_unique_seqs,)

Counts or re-weighted counts for each unique subsequence in the MSA.
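
The re-weighting rule (each observation weighted as 1 divided by the number of similar sequences) can be sketched as follows. The sketch assumes that distance is the fraction of differing aligned positions, that a sequence counts as similar to itself, and that the threshold is inclusive; these details and the helper names are assumptions.

```python
def hamming_fraction(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def identity_weights(seqs, max_dist=0.2):
    """Weight each sequence by 1 / (number of sequences within max_dist).

    Every sequence counts as similar to itself, so weights lie in (0, 1].
    """
    weights = []
    for a in seqs:
        n_similar = sum(hamming_fraction(a, b) <= max_dist for b in seqs)
        weights.append(1.0 / n_similar)
    return weights


def weighted_counts(seqs, weights, positions=None):
    """Sum the weights of identical (sub)sequences."""
    totals = {}
    for seq, w in zip(seqs, weights):
        key = seq if positions is None else "".join(seq[p] for p in positions)
        totals[key] = totals.get(key, 0.0) + w
    return totals


msa = ["AAAAA", "AAAAC", "AAAAA", "CCGTT"]
w = identity_weights(msa, max_dist=0.2)
counts = weighted_counts(msa, w, positions=[0, 4])
```

Note that similarity is evaluated on the full-length sequences, while counts are accumulated on the selected subsequences.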

Genotypes handling

gpmap.utils.read_dataframe(fpath)

gpmap.utils.read_edges(fpath, log=None, return_df=True)

Reads the incidence matrix containing the adjacency information among genotypes from a sequence space.

Parameters:
fpathstr

File path containing the edges of a sequence space. The extension will be used to differentiate between csv, parquet, and the more efficient npz format.

logLogTrack, optional

Logger instance to log messages. Default is None.

return_dfbool, default=True

Whether to return a pandas DataFrame with the edges. If False, it will return a csr_matrix.

Returns:
edges_dfpd.DataFrame or csr_matrix

If return_df is True, returns a DataFrame with columns i and j containing the indices of the genotypes that are separated by a single mutation in a sequence space. If return_df is False, returns a csr_matrix representation of the edges.

gpmap.genotypes.select_genotypes(nodes_df, genotypes, edges=None, is_idx=False)

Selects the specified genotypes from nodes_df, along with the corresponding edges among the remaining genotypes if edges are provided.

Parameters:
nodes_df: pd.DataFrame of shape (n_genotypes, n_features)

DataFrame containing the genotypes from a full sequence space as the index. Typically, it includes at least the coordinates for visualization of each genotype, but it may also retain any other columns for later use.

genotypes: array-like of shape (n_genotypes,)

Array of genotypes to select from the input landscape. By default, it should contain genotype labels, or indexes if the is_idx option is set to True.

edges: pd.DataFrame of shape (n_edges, 2) or scipy.sparse.csr_matrix of shape (n_genotypes, n_genotypes), optional

DataFrame or csr_matrix representing the adjacency relationships among genotypes provided in nodes_df within the discrete space.

is_idx: bool, optional

Indicates whether the genotypes argument is an array of indexes instead of an array of genotype labels.

Returns:
output: (nodes_df, edges)

A tuple containing the filtered landscape with the selected genotypes and the adjacency relationships between them.
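
The joint filtering of nodes and edges can be illustrated with plain lists. The sketch keeps only edges whose two endpoints are both selected and re-indexes them to positions in the filtered list; whether the library re-indexes in the same way is an assumption, and select_with_edges is a hypothetical helper.

```python
def select_with_edges(labels, edges, selected):
    """Keep selected genotypes and the edges joining two kept genotypes.

    `edges` holds (i, j) index pairs into `labels`; returned edges are
    re-indexed to positions in the filtered label list.
    """
    wanted = set(selected)
    kept = [i for i, label in enumerate(labels) if label in wanted]
    new_index = {old: new for new, old in enumerate(kept)}
    kept_edges = [
        (new_index[i], new_index[j])
        for i, j in edges
        if i in new_index and j in new_index
    ]
    return [labels[i] for i in kept], kept_edges


labels = ["AA", "AC", "CA", "CC"]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
sub_labels, sub_edges = select_with_edges(labels, edges, ["AA", "AC", "CC"])
```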

gpmap.genotypes.get_genotypes_from_region(nodes_df, max_values={}, min_values={})

Filters and returns the genotype labels that satisfy the specified conditions based on maximum and minimum values for the columns in the input DataFrame.

Parameters:
nodes_dfpd.DataFrame

A DataFrame with genotypes as the index and various features as columns. Typically, it contains at least the coordinates for visualization, but it may also include other metadata.

max_valuesdict, optional

A dictionary where keys are column names and values are the maximum thresholds for filtering genotypes. Genotypes with values greater than these thresholds in the specified columns will be excluded.

min_valuesdict, optional

A dictionary where keys are column names and values are the minimum thresholds for filtering genotypes. Genotypes with values less than these thresholds in the specified columns will be excluded.

Returns:
genotypespd.Index

An index containing the labels of genotypes that meet the specified filtering criteria.
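
The filtering logic amounts to keeping genotypes whose values lie within every bound. A minimal sketch, using a dict of dicts as a stand-in for the DataFrame and a hypothetical helper name:

```python
def genotypes_in_region(nodes, max_values=None, min_values=None):
    """Return labels whose column values fall within the given bounds.

    `nodes` maps each genotype label to a dict of column values; this plain
    dict stands in for the DataFrame described above.
    """
    max_values = max_values or {}
    min_values = min_values or {}
    selected = []
    for label, row in nodes.items():
        below_max = all(row[col] <= bound for col, bound in max_values.items())
        above_min = all(row[col] >= bound for col, bound in min_values.items())
        if below_max and above_min:
            selected.append(label)
    return selected


# Made-up coordinates and function values for four genotypes.
nodes = {
    "AA": {"1": -1.0, "2": 0.5, "function": 2.0},
    "AC": {"1": 0.2, "2": 0.1, "function": 1.0},
    "CA": {"1": 0.8, "2": -0.3, "function": 0.4},
    "CC": {"1": 1.5, "2": 0.9, "function": 1.8},
}
region = genotypes_in_region(
    nodes, max_values={"1": 1.0}, min_values={"function": 1.0}
)
```

Values equal to a threshold are kept, matching the wording above that only values strictly greater than a maximum or strictly less than a minimum are excluded.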

gpmap.genotypes.marginalize_landscape_positions(nodes_df, keep_pos=None, skip_pos=None, return_edges=False)

Marginalizes specific positions in the sequences and averages numeric values across the remaining genetic backgrounds.

Parameters:
nodes_dfpd.DataFrame

DataFrame with sequence names as the index and at least one numeric column to calculate the average across the selected genetic backgrounds.

keep_posarray-like, optional

List of 0-indexed positions to preserve. The sequences will be averaged across all genetic backgrounds specified by the remaining positions. If not provided, skip_pos must be specified.

skip_posarray-like, optional

List of 0-indexed positions to marginalize out. The sequences will be averaged across these positions. If not provided, keep_pos must be specified.

return_edgesbool, optional, default=False

If True, returns an additional DataFrame containing the edges of the reduced sequence space for visualization.

Returns:
nodes_dfpd.DataFrame

DataFrame containing the average value of every numeric column in the input DataFrame, with the subsequences at the desired positions as the index.

edges_dfpd.DataFrame, optional

DataFrame containing the edges of the reduced sequence space. This is only returned if return_edges=True.
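
Averaging over genetic backgrounds reduces to grouping full sequences by their subsequence at the kept positions. A small sketch with a plain dict in place of the DataFrame (marginalize_positions is a hypothetical helper and the values are made up):

```python
from collections import defaultdict


def marginalize_positions(values, keep_pos):
    """Average values over all backgrounds sharing the kept positions.

    `values` maps full sequences to a number; `keep_pos` holds the
    0-indexed positions that define the reduced sequences.
    """
    groups = defaultdict(list)
    for seq, value in values.items():
        subseq = "".join(seq[p] for p in keep_pos)
        groups[subseq].append(value)
    return {subseq: sum(vals) / len(vals) for subseq, vals in groups.items()}


values = {"AA": 1.0, "AC": 3.0, "CA": 2.0, "CC": 6.0}
averaged = marginalize_positions(values, keep_pos=[0])
```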

Plotting

gpmap.plot.mpl.plot_relaxation_times(decay_df, axes=None, fpath=None, log_scale=False, neutral_time=None, kwargs={})

Plots the relaxation times associated with each of the components calculated using WMWalk.calc_visualization.

Parameters:
decay_dfpd.DataFrame of shape (n_components, 3)

pd.DataFrame containing the decay rates and the associated mean relaxation times for each of the calculated components

axesmatplotlib Axes object (None)

Axes on which to plot. If not provided, a new figure is created for this plot and saved to the path given by fpath.

fpathstr (None)

File path to store the plot. If fpath=None, axes argument must be provided for plotting.

log_scalebool (False)

Plot the relaxation times in log scale

neutral_timefloat (None)

If provided, an additional horizontal line is plotted at the relaxation time associated with the neutral process. This is useful when selecting the number of relevant dimensions to plot.

kwargsdict

Dictionary of additional keyword arguments passed to axes.plot and axes.scatter, e.g. color.

gpmap.plot.mpl.plot_edges(axes, nodes_df, edges_df, x='1', y='2', z=None, alpha=0.1, zorder=1, color='grey', cbar=True, cmap='binary', cbar_axes=None, cbar_orientation='vertical', cbar_label='', palette=None, legend=True, legend_loc=0, width=0.5, max_width=1, min_width=0.1, fontsize=None)

Plots the edges representing the connections between states that are connected in the discrete space under a particular embedding.

Parameters:
axesmatplotlib Axes

Axes in which to plot the edges.

nodes_dfpd.DataFrame of shape (n_genotypes, n_components + 2)

pd.DataFrame containing the coordinates in each of the n_components in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed.

edges_dfpd.DataFrame of shape (n_edges, 2)

pd.DataFrame containing the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indexes of the pairs of states that are connected.

xstr (‘1’)

Column in nodes_df to use for plotting the genotypes on the x-axis

ystr (‘2’)

Column in nodes_df to use for plotting the genotypes on the y-axis

zstr (None)

Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, then a 3D plot will be produced as long as the provided axes object allows it.

alphafloat (0.1)

Transparency of lines representing the edges

zorderint (1)

Order in which the edges will be rendered relative to other elements. Generally, we would want this to be smaller than the zorder used for plotting the nodes

colorstr (‘grey’)

Column name for the values according to which edges will be colored or the specific color to use for plotting the edges

cmapcolormap or str

Colormap to use for coloring the edges according to column color

widthfloat or str

Width of the lines representing the edges. If a float is provided, that width is used for every edge. If a str is provided, widths are scaled according to the corresponding column in edges_df.

max_widthfloat (1)

Maximum line width for the edges when width is given as a column name and widths are scaled.

min_widthfloat (0.1)

Minimum line width for the edges when width is given as a column name and widths are scaled.

Returns:
line_collectionLineCollection or Line3DCollection

gpmap.plot.mpl.plot_nodes(axes, nodes_df, x='1', y='2', z=None, alpha=1, zorder=2, sort_by=None, sort_ascending=False, color='function', cmap='viridis', cbar=True, cbar_axes=None, cbar_label='Function', cbar_orientation='vertical', vcenter=None, vmax=None, vmin=None, palette='Set1', size=2.5, max_size=40, min_size=1, lw=0, edgecolor='black', legend=True, legend_loc=0, rasterized=False)

Plots the nodes representing the states of the discrete space on the provided coordinates.

Parameters:
axesmatplotlib.axes.Axes

The matplotlib Axes object in which to plot the nodes or states.

nodes_dfpandas.DataFrame

DataFrame of shape (n_genotypes, n_components + 2) containing the coordinates in each of the n_components, along with additional columns such as “function” and “stationary_freq”. Additional columns are also allowed.

xstr, default=’1’

Column in nodes_df to use for plotting the genotypes on the x-axis.

ystr, default=’2’

Column in nodes_df to use for plotting the genotypes on the y-axis.

zstr, optional

Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, a 3D plot will be produced, provided the axes object supports it.

alphafloat, default=1

Transparency of markers representing the nodes.

zorderint, default=2

Order in which the nodes will be rendered relative to other elements. Typically, this should be greater than the zorder used for plotting edges.

colorstr, default=’function’

Column name in nodes_df for the values used to color the nodes, or a specific color to use for all nodes.

vcenterbool, optional, default=None

Whether to center the color scale around the 0 value.

vmaxfloat, optional

Maximum value for the colormap.

vminfloat, optional

Minimum value for the colormap.

cmapstr or matplotlib.colors.Colormap, default=’viridis’

Colormap to use for coloring the nodes based on the color column.

cbarbool, default=True

Whether to display the colorbar.

cbar_labelstr, optional

Label for the colorbar associated with the nodes’ color scale.

cbar_axesmatplotlib.axes.Axes, optional

Axes to plot the colorbar. If not provided, it will be automatically adjusted to the current Axes.

palettedict, optional

Dictionary mapping categories in the color column to specific colors, if the column represents categorical data.

sizefloat or str, default=2.5

Size of the markers for the nodes. If a float is provided, it will be used for all nodes. If a string is provided, node sizes will be scaled based on the corresponding column in nodes_df.

max_sizefloat, default=40

Maximum size for the nodes when scaled.

min_sizefloat, default=1

Minimum size for the nodes when scaled.

lwfloat, default=0

Line width of the edges around the markers representing the nodes.

edgecolorstr, default=’black’

Color of the edges around the markers representing the nodes.

legendbool, default=True

Whether to display a legend on the plot.

legend_locint or tuple, default=0

Location of the legend if coloring is based on a categorical variable.

rasterizedbool, default=False

Whether to rasterize the scatterplot when rendering the plot in vector format.
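
When size names a column, marker sizes are scaled into the range set by min_size and max_size. The sketch below shows one plausible way to do this, a linear min-max rescaling; the actual scaling used by plot_nodes is not specified above, so this is an assumption, and rescale is a hypothetical helper.

```python
def rescale(values, min_size=1.0, max_size=40.0):
    """Linearly map values onto the interval [min_size, max_size]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: fall back to the midpoint of the size range.
        return [(min_size + max_size) / 2] * len(values)
    scale = (max_size - min_size) / (hi - lo)
    return [min_size + (v - lo) * scale for v in values]


# Made-up stationary frequencies for four genotypes.
stationary_freq = [0.05, 0.10, 0.25, 0.60]
sizes = rescale(stationary_freq, min_size=1, max_size=40)
```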

gpmap.plot.mpl.plot_visualization(axes, nodes_df, edges_df=None, x='1', y='2', z=None, nodes_alpha=1, nodes_zorder=2, nodes_color='function', nodes_cmap='viridis', nodes_palette=None, nodes_vmin=None, nodes_vmax=None, nodes_vcenter=False, nodes_cbar=True, nodes_cbar_axes=None, nodes_cmap_label='Function', nodes_size=2.5, nodes_min_size=1, nodes_max_size=40, nodes_lw=0, nodes_edgecolor='black', edges_alpha=0.1, edges_zorder=1, edges_color='grey', edges_cmap='binary', edges_palete=None, edges_cbar=False, edges_cbar_axes=None, edges_width=0.5, edges_max_width=1, edges_min_width=0.1, sort_by=None, sort_ascending=True, center_spines=False, add_hist=False, inset_cbar=False, inset_pos=(0.7, 0.7), prev_nodes_df=None, rasterized=False)

Plots the nodes representing the states of the discrete space on the provided coordinates and the edges representing the connections between states if provided.

Parameters:
axesmatplotlib.axes.Axes

Matplotlib Axes object in which to plot the edges and nodes.

nodes_dfpd.DataFrame of shape (n_genotypes, n_variables)

DataFrame containing the coordinates in each of the n_components, in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed.

edges_dfpd.DataFrame of shape (n_edges, 2), optional

DataFrame containing the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indexes of the pairs of states that are connected. If not provided, only nodes will be plotted.

xstr, optional, default=’1’

Column in nodes_df to use for plotting the genotypes on the x-axis.

ystr, optional, default=’2’

Column in nodes_df to use for plotting the genotypes on the y-axis.

zstr, optional, default=None

Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, a 3D plot will be produced.

nodes_alphafloat, optional, default=1

Transparency of the markers representing the nodes.

nodes_zorderint, optional, default=2

Order in which the nodes will be rendered relative to other elements.

nodes_colorstr, optional, default=’function’

Column name in nodes_df for the values used to color the nodes, or a specific color to use for all nodes.

nodes_cmapstr, optional, default=’viridis’

Colormap to use for coloring the nodes based on the nodes_color column.

nodes_palettedict, optional, default=None

Dictionary mapping categories in the nodes_color column to specific colors, if the column represents categorical data.

nodes_vminfloat, optional, default=None

Minimum value for the colormap.

nodes_vmaxfloat, optional, default=None

Maximum value for the colormap.

nodes_vcenterbool, optional, default=False

Whether to center the color scale around the 0 value.

nodes_cbarbool, optional, default=True

Whether to display the colorbar for the nodes.

nodes_cbar_axesmatplotlib.axes.Axes, optional, default=None

Axes to plot the colorbar. If not provided, it will be automatically adjusted to the current Axes.

nodes_cmap_labelstr, optional, default=’Function’

Label for the colorbar associated with the nodes’ color scale.

nodes_sizefloat, optional, default=2.5

Size of the markers for the nodes.

nodes_min_sizefloat, optional, default=1

Minimum size for the nodes when scaled.

nodes_max_sizefloat, optional, default=40

Maximum size for the nodes when scaled.

nodes_lwfloat, optional, default=0

Line width of the edges around the markers representing the nodes.

nodes_edgecolorstr, optional, default=’black’

Color of the edges around the markers representing the nodes.

edges_alphafloat, optional, default=0.1

Transparency of the lines representing the edges.

edges_zorderint, optional, default=1

Order in which the edges will be rendered relative to other elements.

edges_colorstr, optional, default=’grey’

Column name in edges_df for the values used to color the edges, or a specific color to use for all edges.

edges_cmapstr, optional, default=’binary’

Colormap to use for coloring the edges based on the edges_color column.

edges_palettedict, optional, default=None

Dictionary mapping categories in the edges_color column to specific colors, if the column represents categorical data.

edges_cbarbool, optional, default=False

Whether to display the colorbar for the edges.

edges_cbar_axesmatplotlib.axes.Axes, optional, default=None

Axes to plot the colorbar for the edges. If not provided, it will be automatically adjusted to the current Axes.

edges_widthfloat, optional, default=0.5

Width of the lines representing the edges.

edges_max_widthfloat, optional, default=1

Maximum width for the edges when scaled.

edges_min_widthfloat, optional, default=0.1

Minimum width for the edges when scaled.

sort_bystr, optional, default=None

Column in nodes_df to use for sorting the nodes before plotting.

sort_ascendingbool, optional, default=False

Whether to sort the nodes in ascending order based on the sort_by column.

center_spinesbool, optional, default=False

Whether to center the spines of the plot at (0, 0).

add_histbool, optional, default=False

Whether to add a histogram inset showing the distribution of node colors.

inset_cbarbool, optional, default=False

Whether to add an inset colorbar for the nodes.

inset_postuple, optional, default=(0.7, 0.7)

Position of the inset colorbar or histogram as a fraction of the Axes.

prev_nodes_dfpd.DataFrame, optional, default=None

DataFrame containing the previous positions of the nodes. If provided, the current nodes will be aligned to minimize their distance from the previous positions.

rasterizedbool, optional, default=False

Whether to rasterize the plot for better performance with large datasets.
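
The layout expected for nodes_df and edges_df can be illustrated with a toy landscape. The sketch below uses only the standard library and builds plain records rather than DataFrames. The genotypes, coordinates, and function values are made up for illustration, and treating Hamming-distance-1 pairs as the connected states is an assumption. Wrapping the records in pd.DataFrame (with the genotypes as the index for the nodes) would give tables with the columns described above.

```python
from itertools import combinations, product

# Toy landscape: 2 sites, alleles "A" and "B" (illustrative values only).
genotypes = ["".join(g) for g in product("AB", repeat=2)]

# One record per genotype: coordinates "1" and "2" plus the two documented
# columns. The coordinates are arbitrary placeholders, not a real embedding.
nodes = [
    {
        "genotype": g,
        "1": float(g.count("B")),
        "2": float(g[0] == "B") - float(g[1] == "B"),
        "function": float(g.count("B")),
        "stationary_freq": 1.0 / len(genotypes),
    }
    for g in genotypes
]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# One record per connected pair, storing the positional indexes "i" and "j".
edges = [
    {"i": i, "j": j}
    for i, j in combinations(range(len(genotypes)), 2)
    if hamming(genotypes[i], genotypes[j]) == 1
]

print(genotypes)
print(edges)
```

Each edge refers to its endpoints by position in the node list, which is why the nodes and edges tables have to be built from the same genotype ordering.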

gpmap.plot.mpl.figure_Ns_grid(rw, x='1', y='2', pmin=0, pmax=0.8, ncol=4, nrow=3, show_edges=True, fpath=None, **kwargs)
gpmap.plot.mpl.figure_allele_grid(nodes_df, edges_df=None, allele_color='orange', background_color='lightgrey', positions=None, position_labels=None, colsize=3, rowsize=2.7, xpos_label=0.05, ypos_label=0.92, fmt='png', fpath=None, **kwargs)
gpmap.plot.ply.plot_visualization(nodes_df, edges_df=None, x='1', y='2', z=None, nodes_color='function', nodes_size=4, nodes_cmap='viridis', nodes_cmap_label='Function', edges_width=0.5, edges_color='#888', edges_alpha=0.2, text=None, fpath=None)

Makes an interactive plot of a fitness landscape using plotly, with genotypes as nodes and single point mutations as edges.

Parameters:
nodes_dfpd.DataFrame of shape (n_genotypes, n_components + 2)

DataFrame containing the coordinates in each of the n_components, in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed.

edges_dfpd.DataFrame of shape (n_edges, 2)

DataFrame containing the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indexes of the pairs of states that are connected.

xstr (‘1’)

Column in nodes_df to use for plotting the genotypes on the x-axis

ystr (‘2’)

Column in nodes_df to use for plotting the genotypes on the y-axis

zstr (None)

Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, a 3D plot will be produced.

nodes_colorstr (‘function’)

Column name for the values according to which states will be colored or the specific color to use for plotting the states

nodes_sizefloat or str (4)

Size of the markers representing the nodes. If a float is provided, every node is plotted with that size. If a str is provided, node sizes are scaled according to the corresponding column in nodes_df.

nodes_cmapcolormap or str

Colormap to use for coloring the nodes according to the nodes_color column.

nodes_cmap_labelstr

Label for colorbar

edges_widthfloat or str

Width of the lines representing the edges. If a float is provided, every edge is plotted with that width. If a str is provided, widths are scaled according to the corresponding column in edges_df.

edges_colorstr

Column name for the values according to which edges will be colored or the specific color to use for plotting the edges

edges_alphafloat (0.2)

Transparency of lines representing the edges

textarray-like of shape (nodes_df.shape[0]) (None)

Labels to show for each state when hovering over the markers representing them. If not provided, the index of nodes_df will be used.

fpathstr

File path in which to store the interactive plot as an html file
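
Because edges_df stores index pairs rather than coordinates, drawing the edges requires looking up both endpoints among the node coordinates. A common way to draw many segments as a single plotly line trace is to concatenate the endpoint coordinates with None separators. The sketch below shows that lookup with plain lists. It is an illustration under that assumption, not necessarily how plot_visualization builds its traces, and the helper name edge_segments is hypothetical.

```python
def edge_segments(xs, ys, pairs):
    """Flatten (i, j) index pairs into None-separated coordinate lists."""
    seg_x, seg_y = [], []
    for i, j in pairs:
        # Two endpoints per edge, then None so consecutive edges are not joined.
        seg_x += [xs[i], xs[j], None]
        seg_y += [ys[i], ys[j], None]
    return seg_x, seg_y


# Made-up node coordinates (columns "1" and "2") and edge index pairs.
xs = [0.0, 1.0, 1.0, 2.0]
ys = [0.0, -1.0, 1.0, 0.0]
pairs = [(0, 1), (0, 2), (1, 3), (2, 3)]

seg_x, seg_y = edge_segments(xs, ys, pairs)
print(seg_x[:6])
```

The resulting lists could be passed as x and y to a plotly.graph_objects.Scatter trace with mode='lines', which leaves a gap at each None by default.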

gpmap.plot.ds.plot_visualization(nodes_df, x='1', y='2', edges_df=None, nodes_color='function', nodes_cmap='viridis', nodes_size=5, nodes_vmin=None, nodes_vmax=None, linewidth=0, edgecolor='black', sort_by=None, sort_ascending=False, edges_width=0.5, edges_alpha=1, edges_color='grey', edges_cmap='grey', background_color='white', nodes_resolution=800, edges_resolution=1200, shade_nodes=True, shade_edges=True, square=True)
gpmap.plot.ds.figure_allele_grid(nodes_df, fpath, x='1', y='2', edges_df=None, positions=None, position_labels=None, edges_cmap='grey', background_color='white', nodes_resolution=800, edges_resolution=1200, sort_by=None, sort_ascending=False, fmt='png', figsize=None, square=True, **kwargs)
gpmap.plot.mpl.plot_SeqDEFT_summary(log_Ls, seq_density=None, err_bars='stderr', show_folds=False, legend_loc=1, normalize_logL=True)

Generates a 2-panel figure summarizing the SeqDEFT model results.

The first panel shows how the cross-validated likelihood changes with the hyperparameter a and highlights the value of a selected for model fitting. If sequence density data is provided, the second panel visualizes the relationship between observed sequence frequencies and estimated densities.

Parameters:
log_Lspd.DataFrame

A DataFrame of shape (num_a, 3) containing the columns:
- a: The hyperparameter values.
- logL: The log-likelihood values.
- fold: The cross-validation fold identifiers.

seq_densitypd.DataFrame, optional

A DataFrame of shape (n_genotypes, >= 2) with the following columns:
- frequency: Observed frequencies for each sequence.
- Q: Estimated densities for each sequence.
If not provided, only a single-panel figure with the cross-validated likelihood curve will be generated.

err_barsstr, default=’stderr’

Specifies the type of error bars to display:
- 'sd': Standard deviation across the different folds.
- 'stderr': Standard error of the mean.

show_foldsbool, default=False

Whether to display the out-of-sample log-likelihoods for the individual folds in the cross-validation procedure.

legend_locint, default=1

The location of the legend in the plot. Follows matplotlib’s legend location codes.

normalize_logLbool, default=True

If True, normalizes the log-likelihood values relative to the value at a = .

Returns:
figmatplotlib.figure.Figure

The resulting figure object containing the generated plots.
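
The curve and error bars in the first panel can be read as a per-a summary over the cross-validation folds. The sketch below computes the mean, standard deviation, and standard error of logL for each value of a from records in the documented (a, logL, fold) layout, using only the standard library. The numbers are invented, and the exact estimator used by plot_SeqDEFT_summary (for example, sample versus population standard deviation) is an assumption.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# Rows in the documented layout: one (a, logL, fold) record per fold and a.
log_Ls = [
    {"a": 1.0, "logL": -10.0, "fold": 0},
    {"a": 1.0, "logL": -12.0, "fold": 1},
    {"a": 1.0, "logL": -14.0, "fold": 2},
    {"a": 10.0, "logL": -8.0, "fold": 0},
    {"a": 10.0, "logL": -9.0, "fold": 1},
    {"a": 10.0, "logL": -10.0, "fold": 2},
]

# Group the out-of-sample log-likelihoods by hyperparameter value.
by_a = defaultdict(list)
for row in log_Ls:
    by_a[row["a"]].append(row["logL"])

summary = {}
for a, values in by_a.items():
    sd = stdev(values)  # sample standard deviation across folds (assumed)
    summary[a] = {
        "mean": mean(values),
        "sd": sd,
        "stderr": sd / sqrt(len(values)),
    }

# The value of a with the highest mean cross-validated log-likelihood.
best_a = max(summary, key=lambda a: summary[a]["mean"])
print(best_a, summary[best_a])
```

With err_bars='sd' the bars would correspond to the "sd" entry, and with err_bars='stderr' to the "stderr" entry, under the assumptions stated above.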

Datasets

gpmap.datasets.list_available_datasets()

Retrieve the names of all available built-in datasets.

This function scans the directory specified by LANDSCAPES_DIR and extracts the names of all files present, excluding their extensions. It returns these names as a list.

Returns:

list: A list of strings, where each string is the name of a built-in dataset.
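
The directory scan described above can be sketched with pathlib. This is a minimal illustration rather than the library's implementation: the value of LANDSCAPES_DIR is not given here, so the example runs against a temporary directory, and the file names and the helper name list_file_stems are hypothetical.

```python
import tempfile
from pathlib import Path


def list_file_stems(directory):
    """Return the sorted names of the files in `directory`, without extensions."""
    return sorted(p.stem for p in Path(directory).iterdir() if p.is_file())


# Demonstrate on a temporary directory holding two hypothetical dataset files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("toy_a.csv", "toy_b.csv"):
        (Path(tmp) / name).touch()
    names = list_file_stems(tmp)

print(names)
```

Note that Path.stem removes only the last suffix, so a file with a compound extension would keep part of it in the returned name.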

class gpmap.datasets.DataSet(dataset_name, data=None, landscape=None)

DataSet object for managing and manipulating various components related to a specific dataset. This includes the original data, reconstructed landscape, and visualization coordinates.

Parameters:
dataset_namestr

The name of the dataset to load from the built-in list. If data or landscape are provided, this will be the name assigned to the new dataset.

datapd.DataFrame, shape (n_obs, n_features), optional

A DataFrame containing the experimental data with genotypes as the index.

landscapepd.DataFrame, shape (n_genotypes, 1), optional

A DataFrame containing the complete combinatorial landscape used to build the remaining components of the dataset.
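
A landscape is complete and combinatorial when it contains every combination of the alleles observed at each site. The sketch below checks that property for a list of genotype strings, such as the index of a landscape DataFrame. Whether DataSet performs a check like this is not stated here, so the helper is_complete is hypothetical and serves only to illustrate what "complete" means.

```python
from itertools import product


def is_complete(genotypes):
    """Check that `genotypes` holds every combination of the per-site alleles."""
    genotypes = set(genotypes)
    # All genotypes must have the same length to define sites consistently.
    if len({len(g) for g in genotypes}) != 1:
        return False
    # Alleles observed at each site, e.g. [["A", "B"], ["A", "B"]].
    alleles = [sorted(set(site)) for site in zip(*genotypes)]
    expected = {"".join(combo) for combo in product(*alleles)}
    return genotypes == expected


print(is_complete(["AA", "AB", "BA", "BB"]))
print(is_complete(["AA", "AB", "BA"]))
```

The first call covers all four combinations of two biallelic sites, while the second is missing "BB".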

Attributes:
data
edges
landscape
nodes
relaxation_times

Methods

to_sequence_space()

Generate a SequenceSpace object from the dataset's landscape.

calc_visualization

plot

save

to_sequence_space()

Generate a SequenceSpace object from the dataset’s landscape.

This method constructs a SequenceSpace object using the genotypes and their corresponding values from the dataset’s landscape.

Returns:
SequenceSpace

A SequenceSpace object representing the dataset’s landscape.